# Data Cleaning

In [None]:
# As always
import pandas as pd
import numpy as np

### Building a dataframe by hand

In [None]:
some_data = [1, 2, 3, 4, 5] # list of numbers
some_more_data = ['a', 'b', 'c', 'd', 'e'] # list of letters
some_booleans = [True, False, True, True, True] # list of booleans
df = pd.DataFrame({'some_numbers':some_data, 'some_letters':some_more_data, 'some_bools':some_booleans})
df

We can check out the datatypes of each column using `df.info()`
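Here's a minimal sketch of that check. It rebuilds the same `df` so the cell stands alone. `df.dtypes` is shown next to `df.info()` because it returns the types as a Series rather than printing them.

```python
import pandas as pd

df = pd.DataFrame({
    'some_numbers': [1, 2, 3, 4, 5],
    'some_letters': ['a', 'b', 'c', 'd', 'e'],
    'some_bools': [True, False, True, True, True],
})

df.info()            # prints column names, non-null counts, and dtypes
dtypes = df.dtypes   # the same dtype information, returned as a Series
print(dtypes)
```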

### Add a new column

We can add a new column using the `df['column']=` syntax. It accepts a scalar *or* list-like object
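A quick sketch of both cases. The column names `a_scalar`, `a_list`, and `doubled` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'some_numbers': [1, 2, 3, 4, 5]})

df['a_scalar'] = 0                       # a scalar is repeated for every row
df['a_list'] = [10, 20, 30, 40, 50]      # a list must have one value per row
df['doubled'] = df['some_numbers'] * 2   # a Series built from an existing column
print(df)
```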

### Changing data types

In [None]:
classes  = ['Transfiguration', 'Charms', 'Potions', 'History of Magic', 'Defence Against the Dark Arts', 'Astronomy and Herbology']
ratings = ['5', 3.2, 1, 2, '4', '3.5'] # Note: Some of these are strings, floats, and ints
grades = pd.DataFrame({'courses':classes, 'ratings':ratings})
grades

Say we want to double the rating values
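One way to try it, sketched on the first three rows only. The column name `doubled` is an assumption, not necessarily what the original cell used.

```python
import pandas as pd

grades = pd.DataFrame({
    'courses': ['Transfiguration', 'Charms', 'Potions'],
    'ratings': ['5', 3.2, 1],   # a string, a float, and an int
})

# Multiply every rating by 2 and store the result in a new column
grades['doubled'] = grades['ratings'] * 2
print(grades)
```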

That's not what we expected – Let's drop that column using the `.drop(columns= )` method

Remember, `drop()` returns a dataframe, so we'll want to save it back onto the original
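A small sketch of that pattern, assuming the unwanted column is called `doubled` (a hypothetical name):

```python
import pandas as pd

grades = pd.DataFrame({
    'courses': ['Charms', 'Potions'],
    'ratings': ['5', 3.2],
    'doubled': ['55', 6.4],   # hypothetical name for the column we want gone
})

grades.drop(columns='doubled')            # returns a new dataframe; grades is unchanged
print(grades.columns.tolist())            # 'doubled' is still there

grades = grades.drop(columns='doubled')   # reassign to keep the change
print(grades.columns.tolist())
```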

Let's try scaling ratings again

Before, our `ratings` column had mixed datatypes: strings, floats, and ints. 
<br>In Python, multiplying a string by an integer repeats it, so we got `"55"` instead of `10` for the first entry

To change the datatype, we can convert the `ratings` series into float values using `.astype()`
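A minimal sketch of the conversion, followed by the scaling we wanted in the first place. The `doubled` column name is again just for illustration.

```python
import pandas as pd

grades = pd.DataFrame({
    'courses': ['Transfiguration', 'Charms', 'Potions', 'History of Magic',
                'Defence Against the Dark Arts', 'Astronomy and Herbology'],
    'ratings': ['5', 3.2, 1, 2, '4', '3.5'],
})

# Convert every rating to a float, then save it back onto the column
grades['ratings'] = grades['ratings'].astype('float')

# Now multiplication is numeric rather than string repetition
grades['doubled'] = grades['ratings'] * 2
print(grades)
```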

Check out what happens when we pass `'str'`, `'float'`, and `'int'` to `.astype()`

# Handling missing values

Missing data is a common issue, and an incredible headache in machine learning.

<br> Take, for example, this dataframe with several `NaN` and `None` values, both of which can be used to represent missing data

In [None]:
students = ['Bill Weasley', 'Charlie Weasley', 'Percy Weasley', 'Fred Weasley', np.nan, 'Ron Weasley', 'Ginny Weasley']
years = [np.nan, np.nan, 6, 4, 4, 2, 1]
interests = ['Dragons', 'Gringotts', 'Ministry of Magic', 'Jokes', 'Jokes', np.nan, np.nan]

weasleys = pd.DataFrame({'name':students,'year' :years, 'interests':interests})
weasleys

### Identifying missing values

We can use `.isna()` on a dataframe or series
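As a sketch: `.isna()` gives a True/False mask, and summing that mask with `axis=0` counts the nulls in each column. The variable names are just for illustration.

```python
import numpy as np
import pandas as pd

weasleys = pd.DataFrame({
    'name': ['Bill Weasley', 'Charlie Weasley', 'Percy Weasley', 'Fred Weasley',
             np.nan, 'Ron Weasley', 'Ginny Weasley'],
    'year': [np.nan, np.nan, 6, 4, 4, 2, 1],
    'interests': ['Dragons', 'Gringotts', 'Ministry of Magic', 'Jokes', 'Jokes',
                  np.nan, np.nan],
})

null_mask = weasleys.isna()                # True wherever a value is missing
nulls_per_column = null_mask.sum(axis=0)   # True counts as 1, so this counts nulls
print(nulls_per_column)
```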


Here, we've specified the axis along which we'd like to apply the sum operation. 
<br>To sum down the rows and get one value *per column*, we used `axis=0`.

- In general, **Rows: axis=0, Columns: axis=1**

If we wanted instead to see how many null values we have for every child, we'd swap the axis:

Remember, we want a breakdown *by row*, so we're counting the nulls *across* the columns of each row
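A sketch of the per-row version, using the same data:

```python
import numpy as np
import pandas as pd

weasleys = pd.DataFrame({
    'name': ['Bill Weasley', 'Charlie Weasley', 'Percy Weasley', 'Fred Weasley',
             np.nan, 'Ron Weasley', 'Ginny Weasley'],
    'year': [np.nan, np.nan, 6, 4, 4, 2, 1],
    'interests': ['Dragons', 'Gringotts', 'Ministry of Magic', 'Jokes', 'Jokes',
                  np.nan, np.nan],
})

# axis=1 sums across the columns, giving one null count per row
nulls_per_row = weasleys.isna().sum(axis=1)
print(nulls_per_row)
```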

### Strategy 1: Dropping rows with missing values

By far the easiest way of dealing with missing data is just dropping rows that have missing values.

By default, `dropna()` will drop rows that have a null value for *any* column
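A minimal sketch of the default behaviour. Only the rows with no nulls at all survive.

```python
import numpy as np
import pandas as pd

weasleys = pd.DataFrame({
    'name': ['Bill Weasley', 'Charlie Weasley', 'Percy Weasley', 'Fred Weasley',
             np.nan, 'Ron Weasley', 'Ginny Weasley'],
    'year': [np.nan, np.nan, 6, 4, 4, 2, 1],
    'interests': ['Dragons', 'Gringotts', 'Ministry of Magic', 'Jokes', 'Jokes',
                  np.nan, np.nan],
})

complete_rows = weasleys.dropna()   # returns a new dataframe; weasleys is unchanged
print(complete_rows)
```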

You might also need to drop rows only when a null is present in one particular column. 
<br>In this case, we can pass the `subset=` argument to `.dropna()` to select for current Hogwarts students
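Sketched below, assuming that a missing `year` is what marks someone as no longer at school:

```python
import numpy as np
import pandas as pd

weasleys = pd.DataFrame({
    'name': ['Bill Weasley', 'Charlie Weasley', 'Percy Weasley', 'Fred Weasley',
             np.nan, 'Ron Weasley', 'Ginny Weasley'],
    'year': [np.nan, np.nan, 6, 4, 4, 2, 1],
    'interests': ['Dragons', 'Gringotts', 'Ministry of Magic', 'Jokes', 'Jokes',
                  np.nan, np.nan],
})

# Only rows with a null in 'year' are dropped; nulls in other columns are kept
current_students = weasleys.dropna(subset=['year'])
print(current_students)
```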

### Strategy 2: Filling in missing values

We can fill in missing values in a variety of different ways. 
<br> We can use a specific value (like the mean), forward-fill, back-fill, or use a variety of more advanced imputation methods using ML

Replacing missing values with a specific value:

We can select a particular column to fill nulls with too
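A sketch of both ideas. The placeholder `'Unknown'` and the choice of the mean for `year` are illustrative assumptions, not a recommendation. Passing a dictionary to `fillna()` lets each column get a fill value of a matching type.

```python
import numpy as np
import pandas as pd

weasleys = pd.DataFrame({
    'name': ['Bill Weasley', 'Charlie Weasley', 'Percy Weasley', 'Fred Weasley',
             np.nan, 'Ron Weasley', 'Ginny Weasley'],
    'year': [np.nan, np.nan, 6, 4, 4, 2, 1],
    'interests': ['Dragons', 'Gringotts', 'Ministry of Magic', 'Jokes', 'Jokes',
                  np.nan, np.nan],
})

# Fill the text columns with a specific value, one entry per column
filled = weasleys.fillna({'name': 'Unknown', 'interests': 'Unknown'})

# Fill a single numeric column with its own mean
filled['year'] = filled['year'].fillna(filled['year'].mean())
print(filled)
```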

# Mapping Values

### Renaming Columns

We can rename columns to make them easier to refer to, more meaningful, or more concise

In [None]:
weasleys['house at hogwarts'] = 'Gryffindor'
weasleys

To do so, we'll use the `df.rename(columns=)` syntax, where `columns=` accepts a dictionary mapping old names to new names
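For instance, on a two-row stand-in for the dataframe. The new name `house` is just an example.

```python
import pandas as pd

weasleys = pd.DataFrame({
    'name': ['Ron Weasley', 'Ginny Weasley'],
    'house at hogwarts': ['Gryffindor', 'Gryffindor'],
})

# Keys are the existing column names, values are the new ones
weasleys = weasleys.rename(columns={'house at hogwarts': 'house'})
print(weasleys)
```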

### Replacing Values

In the same vein, we can use `.replace()` to map values

In [None]:
weasleys['hogwarts'] = [0,0,1,1,1,1,1]
weasleys
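A sketch of mapping those 0/1 flags to labels. It uses three rows only, and the labels `'graduated'` and `'current student'` are an assumption about what the flags mean.

```python
import pandas as pd

weasleys = pd.DataFrame({
    'name': ['Bill Weasley', 'Charlie Weasley', 'Percy Weasley'],
    'hogwarts': [0, 0, 1],
})

# Keys are the values to look for, values are what to put in their place
weasleys['hogwarts'] = weasleys['hogwarts'].replace({0: 'graduated', 1: 'current student'})
print(weasleys)
```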

**Try it out:** Use the starter code below to add the graduation year of each Weasley, and make sure to convert the values to integers (`int`). 
*Hint:* `df.column.astype()` would be useful here

In [None]:
grad_years = ['1989', 1991.0, 1994.0, 1996, '1996', 1998, '1999']


# String processing

Often, a dataset will contain string representations of data that could be really useful if you could find some way to extract it. 

<br> Let's start off with a dataframe

In [None]:
roster = ['Oliver Wood', 'Angelina Johnson', 'Katie Bell', 'Alicia Spinnet', 'Fred Weasley', 'George weasley', 'Harry Potter'] 
role = ['Keeper','chaser','Chaser','chaser','beater','beater','SeEKeR']

quidditch = pd.DataFrame({'player':roster, 'role':role})
quidditch

It'd be great if we could work with just the first names of everyone. 

With normal Python strings, this is pretty easy to do using the `.split()` method:
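For example, on a single name:

```python
player = 'Oliver Wood'

parts = player.split()   # with no argument, splits on whitespace
first_name = parts[0]    # the first piece is the first name
print(parts, first_name)
```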

Let's try using that to extract the first names from the column `player`

Looks like we got an error: We can't use `split()` on the series object directly.
<br><br> Instead, we have to "vectorize" it using `.str`  first
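A minimal sketch of the vectorized version. Chaining `.str[0]` afterwards is one way to pick the first piece out of each list.

```python
import pandas as pd

quidditch = pd.DataFrame({
    'player': ['Oliver Wood', 'Angelina Johnson', 'Katie Bell', 'Harry Potter'],
})

name_parts = quidditch['player'].str.split()   # a Series where each value is a list
first_names = name_parts.str[0]                # first item of each list
print(first_names)
```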

The `.str` part is a pretty nifty tool when we want to access special functions just for working with strings. We'll see a similar accessor come up for dealing with time.

Before we move on, check out the object type of the output using `type()`

### Lambda apply functions

Lambda apply functions are a pretty helpful tool for cleaning. Here's one quick example:

Quite literally, this reads: <br>

For every element `x` in the series `quidditch.player`, take the first element of `x` and save it to a new column `first_name`
<br> In other words, we are **apply**ing the *anonymous* function `x[0]` for every row in the series
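One plausible version of that cell, sketched under the assumption that the names are split into lists first and the lambda then takes element 0 of each list:

```python
import pandas as pd

quidditch = pd.DataFrame({
    'player': ['Oliver Wood', 'Angelina Johnson', 'Katie Bell', 'Harry Potter'],
})

# The lambda runs once per row; x is that row's list of name parts
quidditch['first_name'] = quidditch['player'].str.split().apply(lambda x: x[0])
print(quidditch)
```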

**Try it out!** Use lambda-apply to reverse the order of letters in the `role` column
<br>Hint: To reverse a string in Python, use `[::-1]`

### Changing capitalization to better process text

Let's look at how many players are in each role

`chaser` and `Chaser` should be the same role, but because of a mismatch in cases, they're being counted as separate values.

<br>An easy way to solve this is by converting all the text to a uniform case

In [7]:
name = 'CHo chaNg'
print(name)


CHo chaNg
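A sketch of the fix: plain strings have `.lower()` and `.title()`, and a column gets the same methods through `.str`. Lowercase is an arbitrary choice here, and any uniform case would do.

```python
import pandas as pd

name = 'CHo chaNg'
print(name.lower())   # all lowercase
print(name.title())   # first letter of each word capitalized

quidditch = pd.DataFrame({
    'role': ['Keeper', 'chaser', 'Chaser', 'chaser', 'beater', 'beater', 'SeEKeR'],
})

# Standardize the case first, then count
quidditch['role'] = quidditch['role'].str.lower()
role_counts = quidditch['role'].value_counts()
print(role_counts)
```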


**Try it out**: How many Weasleys are on the team?

Hint: (1) Either standardize the case and use the `.str.contains()` method <br> OR (2) Standardize the case, pull the last name, then use `.value_counts()`

# Date & Time processing

In [None]:
person = ['Harry', 'Hermione', 'Ron', 'Voldy']
birthdays = ['July 31st, 1980', '9-19-1979', '1980 Mar 1', '12//31// //1926']

bdays = pd.DataFrame({'person': person, 'birthday': birthdays})
bdays

Yikes! Let's see if we can clean up the time series data using `pd.to_datetime`

As you can see, `pd.to_datetime` is pretty powerful. It can read in quite a few time formats as strings, then convert them into a `Timestamp` series
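How much pandas will guess on its own depends on the version, and newer releases tend to be stricter about mixed formats in a single column. So this sketch plays it safe and passes an explicit `format=` for each string. It uses just two of the birthdays, and the format codes are the standard `strftime` ones.

```python
import pandas as pd

raw = ['9-19-1979', '1980 Mar 1']        # Hermione and Ron, as text
formats = ['%m-%d-%Y', '%Y %b %d']       # month-day-year, then year-monthname-day

# Parse each string with its own format
parsed = [pd.to_datetime(text, format=fmt) for text, fmt in zip(raw, formats)]

bdays = pd.DataFrame({'person': ['Hermione', 'Ron'], 'birthday': parsed})
print(bdays)
print(bdays.dtypes)
```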

In [None]:
type(bdays.birthday[0])

### Using pandas datetime objects

We can pull quite a lot just from a datetime timestamp using attributes

In [None]:
harry_bday = bdays.at[0, 'birthday'] # Taking the value for harry's bday
print(harry_bday) # the raw timestamp
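A few of those attributes, sketched on a `Timestamp` built directly so the cell runs on its own:

```python
import pandas as pd

harry_bday = pd.Timestamp('1980-07-31')

print(harry_bday.year)          # attributes: no parentheses
print(harry_bday.month)
print(harry_bday.day)
print(harry_bday.month_name())  # methods: called with parentheses
print(harry_bday.day_name())
```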

**Try it out**: Was Harry born in a leap year?

Hint: Use the `.is_leap_year` attribute

### Make new columns from these datetime attributes 

Let's use this to make new columns that reflect these attributes.

We'll use the `.dt` accessor object to snag the month for each row, just like we did with `.str`

In [None]:
bdays['month'] = bdays.birthday.dt.month_name()
bdays

**Try it out** with some other attributes

### Other uses for datetimes

Which people were born before the First Wizarding War (January 1970)?

To do this, we'll have to convert Jan 1970 into a datetime object to allow for comparison
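A sketch of that filter. The birthdays are written in ISO format here so the cell stands alone, and `cutoff` is just an illustrative name.

```python
import pandas as pd

bdays = pd.DataFrame({
    'person': ['Harry', 'Hermione', 'Ron', 'Voldy'],
    'birthday': pd.to_datetime(['1980-07-31', '1979-09-19', '1980-03-01', '1926-12-31']),
})

cutoff = pd.to_datetime('1970-01-01')             # Jan 1970 as a Timestamp
born_before = bdays[bdays['birthday'] < cutoff]   # keep rows where the comparison is True
print(born_before)
```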

We can also do some quick maths quite easily:

For example, how much older is Hermione than Ron?

Note: this returns a `Timedelta` object, not `Timestamp`. We can get similar attributes
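Sketched with the two birthdays written out directly:

```python
import pandas as pd

hermione_bday = pd.Timestamp('1979-09-19')
ron_bday = pd.Timestamp('1980-03-01')

# Subtracting two Timestamps gives a Timedelta
age_gap = ron_bday - hermione_bday
print(type(age_gap))
print(age_gap.days)   # whole days in the gap
```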

### How do we apply this?

When you get data that includes time as a variable, it'll be in one of many possible formats, and not always consistent throughout the whole dataset. 


`pd.to_datetime` makes the process of cleaning these incredibly easy!

Once cleaned, we can look at specific attributes such as month, day, and year **to gain insight we wouldn't otherwise have been able to access.**

There's a lot, lot more you can do with pandas datetimes - use business days, adjust for time zones - just about anything you'd imagine.

The docs for all of that are linked here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#overview

# Merging DataFrames

Merging sources of data is super important:
    
<br> Sometimes you have data from two different sources that you'd like to have in one data frame to analyze. We can do that with `.concat()` and `.merge()`

In [None]:
spells = ['Summoning Charm', 'Patronus Charm', 'Disarming Charm', 'Killing Curse', 'Cruciatus Curse', 'Impediment Jinx', 'Dark Mark']
incantations = ['Accio', 'Expecto Patronum', 'Expelliarmus', 'Avada Kedavra', 'Crucio', 'Impedimenta', 'Morsmordre']

incantations_df = pd.DataFrame({'Spell': spells, 'Incantation': incantations})
incantations_df

In [None]:
effects = ['Summons an Object', 'Spirit to Guard Against Dementors', 'Disarms an Opponent', 'Instantaneous Death', 'Excruciating Pain', 'Hinders Movement', 'Conjures Dark Mark']

effects_df = pd.DataFrame({'Spell': spells, 'Effect': effects})
effects_df

In [None]:
colors = ['None', 'Silver', 'Scarlet', 'Green', 'Red or None', 'Turquoise', 'Green']

colors_df = pd.DataFrame({'Spell': spells, 'Light Color': colors}).sample(frac=1).reset_index(drop=True)
colors_df

In [None]:
# Quick helper
from IPython.display import display_html
def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style=\"display:inline\"'),raw=True)

In [None]:
display_side_by_side(incantations_df,effects_df,colors_df)

Note that each one of these dataframes has a column in common, `Spell`.

The order of the values may not be the same, but we're still good to go

### `pd.merge()`

Instead of working with three distinct dataframes, let's combine them into one df

To do so, we can call `.merge()` on two data tables and specify the column on which to merge as `on=`

The column we're merging on is called a **join key**. It may have a different name in each table, in which case we can specify both names in the join using `left_on=` and `right_on=`.
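A minimal sketch on three spells. The rows of `colors_df` are deliberately in a different order, and `spell_name` is a made-up column name used to show the differently-named-key case.

```python
import pandas as pd

spells = ['Summoning Charm', 'Patronus Charm', 'Disarming Charm']
incantations_df = pd.DataFrame({'Spell': spells,
                                'Incantation': ['Accio', 'Expecto Patronum', 'Expelliarmus']})
effects_df = pd.DataFrame({'Spell': spells,
                           'Effect': ['Summons an Object',
                                      'Spirit to Guard Against Dementors',
                                      'Disarms an Opponent']})
colors_df = pd.DataFrame({'Spell': spells[::-1],
                          'Light Color': ['Scarlet', 'Silver', 'None']})

# Chain two merges on the shared key; row order in each table doesn't matter
merged = incantations_df.merge(effects_df, on='Spell').merge(colors_df, on='Spell')
print(merged)

# Same idea when the key has a different name in the right-hand table
renamed = colors_df.rename(columns={'Spell': 'spell_name'})
merged_alt = incantations_df.merge(renamed, left_on='Spell', right_on='spell_name')
```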

### Join logic

In [None]:
more_spells = ['Disarming Charm', 'Dark Mark', 'Imperius Curse', 'Sectumsempra', 'Levitation Charm']
more_incantations = ['Expelliarmus', 'Morsmordre', 'Imperio', 'Sectumsempra', 'Wingardium Leviosa']

more_incantations_df = pd.DataFrame({'Spell': more_spells, 'Incantation': more_incantations})
more_incantations_df

With the previous merges, we had the same number of observations in every dataframe.

<br>With some merges, not every row may align. Let's try to merge `more_incantations_df` with `effects_df`. Note how there are some spells in common, and some unique to each

In [None]:
display_side_by_side(more_incantations_df,effects_df)

We can do a few different merges now: 
1. If we want to retain **only** those in common, we use an `inner` join
2. If we want to keep **everything**, and keep placeholders for missing data, we use an `outer` join
3. If we want to keep just those in one table, and **lookup** values from another, we use a `left` join
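The three options above can be sketched with the `how=` argument of `.merge()`. The variable names `inner`, `outer`, and `left` are just for illustration.

```python
import pandas as pd

more_incantations_df = pd.DataFrame({
    'Spell': ['Disarming Charm', 'Dark Mark', 'Imperius Curse', 'Sectumsempra',
              'Levitation Charm'],
    'Incantation': ['Expelliarmus', 'Morsmordre', 'Imperio', 'Sectumsempra',
                    'Wingardium Leviosa'],
})
effects_df = pd.DataFrame({
    'Spell': ['Summoning Charm', 'Patronus Charm', 'Disarming Charm', 'Killing Curse',
              'Cruciatus Curse', 'Impediment Jinx', 'Dark Mark'],
    'Effect': ['Summons an Object', 'Spirit to Guard Against Dementors',
               'Disarms an Opponent', 'Instantaneous Death', 'Excruciating Pain',
               'Hinders Movement', 'Conjures Dark Mark'],
})

# 1. Only the spells present in both tables
inner = more_incantations_df.merge(effects_df, on='Spell', how='inner')
# 2. Every spell from either table, with NaN where a value is missing
outer = more_incantations_df.merge(effects_df, on='Spell', how='outer')
# 3. Every spell from the left table, looking up Effect where available
left = more_incantations_df.merge(effects_df, on='Spell', how='left')

print(len(inner), len(outer), len(left))
```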

### `pd.concat()`

Another *similar* function is `.concat()` 

It's a little different from `.merge()`, since we'll have to pass in a `list` of dataframes instead

In [None]:
df = pd.concat([incantations_df, effects_df, colors_df])
df

That didn't quite work as expected, because `concat()` stacked the dataframes above each other, instead of combining information for common rows.

Note that it **didn't combine rows** when `merge()` easily could have.

One example of when `concat()` is appropriate is when we want to add on more information to a dataframe, but the **rows are different** between the two

In [None]:
incantations_all = pd.concat([incantations_df, more_incantations_df]).drop_duplicates().reset_index(drop=True)
# Drop repeated rows first, then reset_index() so the index has no overlaps or gaps

incantations_all

`concat()` can also horizontally stack dataframes, using the `axis=1` argument. 

Here's a case where it might be useful:

In [None]:
more_info_df = pd.DataFrame({'Dark Magic': ['No', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes'], 
                             'Type': ['Charm', 'Charm', 'Charm', 'Curse', 'Curse', 'Jinx', 'Curse']})
more_info_df

In [None]:
df = pd.concat([incantations_df, more_info_df], axis=1)
df

Note the difference between `.concat(axis=1)` and `.merge()`. `.concat(axis=1)` lines rows up by their index, so we would use it when there isn't a shared column (a key) and the rows are already in matching order, and `.merge()` when there is a key.

# TLDR

Data cleaning is one of the most important parts of any data workflow. Pandas provides incredibly powerful tools, if you can wield them properly. Here's a quick recap of some helpful functions. Not sure what parameters they accept? Use the `?function` shortcut to quickly pull the documentation.

Checking Data Types:

- `df.info()`
- `df.column.astype()`

Mapping Values:

- `df.drop()`
- `df.rename()`
- `df.column.replace()`

Handling Missing Data (NaNs):

- `df.column.isna().sum()`
- `df.column.fillna()`

Working with Strings:

- `df.column.str.lower()`
- `df.column.str.split()`
- `df.column.str.contains()`

Working with DateTime:

- `pd.to_datetime(df.column)`
- `df.column.dt.month`
- `df[df.column < pd.to_datetime('some date')]`

Joining Data:

- `pd.merge()`
- `pd.concat()`