# Data Cleaning

In [1]:
# As always
import pandas as pd
import numpy as np

### Building a dataframe by hand

In [2]:
some_data = [1, 2, 3, 4, 5] # list of numbers
some_more_data = ['a', 'b', 'c', 'd', 'e'] # list of letters
some_booleans = [True, False, True, True, True] # list of booleans
df = pd.DataFrame({'some_numbers':some_data, 'some_letters':some_more_data, 'some_bools':some_booleans})
df

Unnamed: 0,some_numbers,some_letters,some_bools
0,1,a,True
1,2,b,False
2,3,c,True
3,4,d,True
4,5,e,True


We can check out the datatypes of each column using `df.info()`

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   some_numbers  5 non-null      int64 
 1   some_letters  5 non-null      object
 2   some_bools    5 non-null      bool  
dtypes: bool(1), int64(1), object(1)
memory usage: 213.0+ bytes


### Add a new column

We can add a new column using the `df['column']=` syntax. It accepts a scalar *or* list-like object

In [4]:
df['some_ones'] = 1.0
df['some_other_numbers'] = [11,12,13,14,15]

df

Unnamed: 0,some_numbers,some_letters,some_bools,some_ones,some_other_numbers
0,1,a,True,1.0,11
1,2,b,False,1.0,12
2,3,c,True,1.0,13
3,4,d,True,1.0,14
4,5,e,True,1.0,15


### Changing data types

In [5]:
classes  = ['Transfiguration', 'Charms', 'Potions', 'History of Magic', 'Defence Against the Dark Arts', 'Astronomy and Herbology']
ratings = ['5', 3.2, 1, 2, '4', '3.5'] # Note: Some of these are strings, floats, and ints
grades = pd.DataFrame({'courses':classes, 'ratings':ratings})
grades

Unnamed: 0,courses,ratings
0,Transfiguration,5.0
1,Charms,3.2
2,Potions,1.0
3,History of Magic,2.0
4,Defence Against the Dark Arts,4.0
5,Astronomy and Herbology,3.5


Say we want to double the rating values

In [6]:
grades['ratings_2x'] = grades['ratings']*2
grades

Unnamed: 0,courses,ratings,ratings_2x
0,Transfiguration,5.0,55
1,Charms,3.2,6.4
2,Potions,1.0,2
3,History of Magic,2.0,4
4,Defence Against the Dark Arts,4.0,44
5,Astronomy and Herbology,3.5,3.53.5


That's not what we expected – Let's drop that column using the `.drop(columns= )` method

Remember, `drop()` returns a dataframe, so we'll want to save it back onto the original

In [7]:
grades = grades.drop(columns=['ratings_2x'])
grades

Unnamed: 0,courses,ratings
0,Transfiguration,5.0
1,Charms,3.2
2,Potions,1.0
3,History of Magic,2.0
4,Defence Against the Dark Arts,4.0
5,Astronomy and Herbology,3.5


Let's try scaling ratings again

Before, our `ratings` column had mixed datatypes: strings, floats, and ints. 
<br>In Python, multiplying a string repeats it, so we got `"55"` instead of `10` for the first entry

In [8]:
# For example:
x = 'hello'
print(x * 2)

y = 2.71
print(y * 2)

hellohello
5.42


To change the datatype, we can convert the `ratings` series into float values using `.astype()`

In [9]:
grades['ratings'] = grades['ratings'].astype(float)

grades['ratings_2x'] = grades.ratings * 2
grades

Unnamed: 0,courses,ratings,ratings_2x
0,Transfiguration,5.0,10.0
1,Charms,3.2,6.4
2,Potions,1.0,2.0
3,History of Magic,2.0,4.0
4,Defence Against the Dark Arts,4.0,8.0
5,Astronomy and Herbology,3.5,7.0
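`.astype(float)` works here because every entry can be parsed as a number. As a hedged aside, when a column might contain entries that can't be parsed, `pd.to_numeric` with `errors='coerce'` is one alternative. The series `raw` and its `'n/a'` entry below are made up for illustration:

```python
import pandas as pd

# Hypothetical column: one entry ('n/a') cannot be parsed as a number
raw = pd.Series(['5', 3.2, 1, 'n/a'])

# errors='coerce' turns unparseable entries into NaN instead of raising an error
cleaned = pd.to_numeric(raw, errors='coerce')
print(cleaned)
```

The unparseable entry becomes `NaN`, which we can then handle like any other missing value.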


Check out what happens when we pass `'str'`, `'float'`, and `'int'` to `.astype()`

In [10]:
grades['ratings'] = grades['ratings'].astype('str')
grades['ratings_2x_str'] = grades.ratings * 2
grades

Unnamed: 0,courses,ratings,ratings_2x,ratings_2x_str
0,Transfiguration,5.0,10.0,5.05.0
1,Charms,3.2,6.4,3.23.2
2,Potions,1.0,2.0,1.01.0
3,History of Magic,2.0,4.0,2.02.0
4,Defence Against the Dark Arts,4.0,8.0,4.04.0
5,Astronomy and Herbology,3.5,7.0,3.53.5


In [11]:
grades['ratings'] = grades['ratings'].astype('float')
grades

Unnamed: 0,courses,ratings,ratings_2x,ratings_2x_str
0,Transfiguration,5.0,10.0,5.05.0
1,Charms,3.2,6.4,3.23.2
2,Potions,1.0,2.0,1.01.0
3,History of Magic,2.0,4.0,2.02.0
4,Defence Against the Dark Arts,4.0,8.0,4.04.0
5,Astronomy and Herbology,3.5,7.0,3.53.5


In [12]:
grades['ratings'] = grades['ratings'].astype('int')
grades

Unnamed: 0,courses,ratings,ratings_2x,ratings_2x_str
0,Transfiguration,5,10.0,5.05.0
1,Charms,3,6.4,3.23.2
2,Potions,1,2.0,1.01.0
3,History of Magic,2,4.0,2.02.0
4,Defence Against the Dark Arts,4,8.0,4.04.0
5,Astronomy and Herbology,3,7.0,3.53.5
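Notice in the output above that `3.2` and `3.5` both became `3`: converting floats to `int` drops the fractional part rather than rounding. If rounding is what we want, one option is to call `.round()` first. A small sketch on a made-up series:

```python
import pandas as pd

# Hypothetical ratings, just for illustration
ratings = pd.Series([5.0, 3.2, 3.5])

truncated = ratings.astype(int)        # drops the fractional part
rounded = ratings.round().astype(int)  # rounds to a whole number first

print(truncated.tolist())
print(rounded.tolist())
```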


# Handling missing values

Missing data is a common issue, and an incredible headache in machine learning.

<br> Take, for example, this dataframe with several `NaN` values. pandas treats both `NaN` and `None` as missing data

In [13]:
students = ['Bill Weasley', 'Charlie Weasley', 'Percy Weasley', 'Fred Weasley', np.nan, 'Ron Weasley', 'Ginny Weasley']
years = [np.nan, np.nan, 6, 4, 4, 2, 1]
interests = ['Dragons', 'Gringots', 'Ministry of Magic', 'Jokes', 'Jokes', np.nan, np.nan]

weasleys = pd.DataFrame({'name':students,'year' :years, 'interests':interests})
weasleys

Unnamed: 0,name,year,interests
0,Bill Weasley,,Dragons
1,Charlie Weasley,,Gringots
2,Percy Weasley,6.0,Ministry of Magic
3,Fred Weasley,4.0,Jokes
4,,4.0,Jokes
5,Ron Weasley,2.0,
6,Ginny Weasley,1.0,


### Identifying missing values

We can use `.isna()` on a dataframe or series

In [14]:
weasleys.isna()

Unnamed: 0,name,year,interests
0,False,True,False
1,False,True,False
2,False,False,False
3,False,False,False
4,True,False,False
5,False,False,True
6,False,False,True


In [15]:
weasleys.isna().sum(axis=0)

name         1
year         2
interests    2
dtype: int64


Here, we've specified the axis on which we'd like to apply the sum operation. 
<br>To sum *across* rows, to get one value *per column*, we used axis = 0.

- In general, **Rows: axis=0, Columns: axis=1**

If we wanted instead to see how many null values we have for every child, we'd swap the axis:

In [16]:
weasleys.isna().sum(axis=1)

0    1
1    1
2    0
3    0
4    1
5    1
6    1
dtype: int64

Remember, we want a breakdown *by row*, so we're counting the missing values *across* the columns of each row
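If the axis rule is hard to keep straight, a tiny frame makes it easy to check by hand. This is just a sketch; the dataframe `toy` and its column names are made up:

```python
import pandas as pd
import numpy as np

# Hypothetical 2-row frame with a few missing values
toy = pd.DataFrame({'a': [1, np.nan], 'b': [np.nan, np.nan], 'c': [3, 4]})

per_column = toy.isna().sum(axis=0)  # one count per column
per_row = toy.isna().sum(axis=1)     # one count per row

print(per_column)
print(per_row)
```

`per_column` has one entry for each of `a`, `b`, `c`, while `per_row` has one entry for each row index.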

### Strategy 1: Dropping rows with missing values

By far the easiest way of dealing with missing data is just dropping rows that have missing values.

By default, `dropna()` will drop rows that have a null value for *any* column

In [17]:
weasleys_dropped = weasleys.dropna()
weasleys_dropped

Unnamed: 0,name,year,interests
2,Percy Weasley,6.0,Ministry of Magic
3,Fred Weasley,4.0,Jokes


You might also want to drop a row only when a particular column is missing. 
<br>In this case, we can pass the `subset=` argument to `.dropna()` to select for current Hogwarts students

In [18]:
weasleys.dropna(subset=['year'])

Unnamed: 0,name,year,interests
2,Percy Weasley,6.0,Ministry of Magic
3,Fred Weasley,4.0,Jokes
4,,4.0,Jokes
5,Ron Weasley,2.0,
6,Ginny Weasley,1.0,
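`.dropna()` has a couple of other knobs besides `subset=`. As a rough sketch, `how='all'` only drops rows where every column is missing, and `thresh=` keeps rows with at least that many non-null values. The frame `toy` below is made up for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical frame: row 0 has 2 values, row 1 has 1 value, row 2 is empty
toy = pd.DataFrame({
    'name': ['Bill', np.nan, np.nan],
    'year': [np.nan, 4, np.nan],
    'interests': ['Dragons', np.nan, np.nan],
})

no_empty_rows = toy.dropna(how='all')    # drops only the fully empty row
mostly_complete = toy.dropna(thresh=2)   # keeps rows with at least 2 non-null values

print(no_empty_rows)
print(mostly_complete)
```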


### Strategy 2: Filling in missing values

We can fill in missing values in a variety of different ways. 
<br> We can use a specific value (like the mean), forward-fill, back-fill, or use a variety of more advanced imputation methods using ML

Replacing missing values with a specific value:

In [19]:
weasleys.fillna(0) # replace every missing value with the integer 0

Unnamed: 0,name,year,interests
0,Bill Weasley,0.0,Dragons
1,Charlie Weasley,0.0,Gringots
2,Percy Weasley,6.0,Ministry of Magic
3,Fred Weasley,4.0,Jokes
4,0,4.0,Jokes
5,Ron Weasley,2.0,0
6,Ginny Weasley,1.0,0


We can select a particular column to fill nulls with too

In [20]:
weasleys['name'] = weasleys['name'].fillna('George Weasley')
weasleys

Unnamed: 0,name,year,interests
0,Bill Weasley,,Dragons
1,Charlie Weasley,,Gringots
2,Percy Weasley,6.0,Ministry of Magic
3,Fred Weasley,4.0,Jokes
4,George Weasley,4.0,Jokes
5,Ron Weasley,2.0,
6,Ginny Weasley,1.0,
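The other fill strategies mentioned at the start of this section (mean, forward-fill, back-fill) can be sketched like this. The series `years` is a made-up stand-in, not the `weasleys` data:

```python
import pandas as pd
import numpy as np

# Hypothetical numeric column with gaps
years = pd.Series([np.nan, 6, np.nan, 2])

mean_filled = years.fillna(years.mean())  # mean of the known values (6 and 2)
forward_filled = years.ffill()            # carry the last seen value forward
back_filled = years.bfill()               # pull the next seen value backward

print(mean_filled.tolist())
print(forward_filled.tolist())
print(back_filled.tolist())
```

Note that forward-fill leaves the first entry missing, since there is nothing before it to carry forward.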


# Mapping Values

### Renaming Columns

We can rename columns to make them easier to refer to, more meaningful, or more concise

In [21]:
weasleys['house at hogwarts'] = 'Gryffindor'
weasleys

Unnamed: 0,name,year,interests,house at hogwarts
0,Bill Weasley,,Dragons,Gryffindor
1,Charlie Weasley,,Gringots,Gryffindor
2,Percy Weasley,6.0,Ministry of Magic,Gryffindor
3,Fred Weasley,4.0,Jokes,Gryffindor
4,George Weasley,4.0,Jokes,Gryffindor
5,Ron Weasley,2.0,,Gryffindor
6,Ginny Weasley,1.0,,Gryffindor


To do so, we'll use the `df.rename(columns=)` syntax, where `columns=` accepts a dictionary mapping old names to new names

In [22]:
weasleys = weasleys.rename(columns={'house at hogwarts':'house'})
weasleys

Unnamed: 0,name,year,interests,house
0,Bill Weasley,,Dragons,Gryffindor
1,Charlie Weasley,,Gringots,Gryffindor
2,Percy Weasley,6.0,Ministry of Magic,Gryffindor
3,Fred Weasley,4.0,Jokes,Gryffindor
4,George Weasley,4.0,Jokes,Gryffindor
5,Ron Weasley,2.0,,Gryffindor
6,Ginny Weasley,1.0,,Gryffindor


### Replacing Values

In the same vein, we can use `.replace()` to map values

In [23]:
weasleys['hogwarts'] = [0,0,1,1,1,1,1]
weasleys

Unnamed: 0,name,year,interests,house,hogwarts
0,Bill Weasley,,Dragons,Gryffindor,0
1,Charlie Weasley,,Gringots,Gryffindor,0
2,Percy Weasley,6.0,Ministry of Magic,Gryffindor,1
3,Fred Weasley,4.0,Jokes,Gryffindor,1
4,George Weasley,4.0,Jokes,Gryffindor,1
5,Ron Weasley,2.0,,Gryffindor,1
6,Ginny Weasley,1.0,,Gryffindor,1


In [24]:
weasleys['hogwarts'] = weasleys.hogwarts.replace({1:'Current Student',0:'Alumni'})
weasleys

Unnamed: 0,name,year,interests,house,hogwarts
0,Bill Weasley,,Dragons,Gryffindor,Alumni
1,Charlie Weasley,,Gringots,Gryffindor,Alumni
2,Percy Weasley,6.0,Ministry of Magic,Gryffindor,Current Student
3,Fred Weasley,4.0,Jokes,Gryffindor,Current Student
4,George Weasley,4.0,Jokes,Gryffindor,Current Student
5,Ron Weasley,2.0,,Gryffindor,Current Student
6,Ginny Weasley,1.0,,Gryffindor,Current Student
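A related method is `.map()`, which also accepts a dictionary but behaves differently for values that aren't in it. Here's a small sketch of the difference, using a made-up series `status` with an unexpected code `2`:

```python
import pandas as pd

# Hypothetical status codes; 2 has no label in the dictionary
status = pd.Series([0, 1, 2])
labels = {1: 'Current Student', 0: 'Alumni'}

replaced = status.replace(labels)  # values not in the dict are left untouched
mapped = status.map(labels)        # values not in the dict become NaN

print(replaced.tolist())
print(mapped.tolist())
```

`.replace()` is the safer choice when only some values need changing, while `.map()` makes unmapped values easy to spot with `.isna()`.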


**Try it out:** Use the starter code below to add the graduation year of each Weasley, and make sure to convert it to integers (int). 
*Hint:* `df.column.astype()` would be useful here

In [29]:
grad_years = ['1989', 1991.0, 1994.0, 1996, '1996', 1998, '1999']
weasleys['graduating_year'] = grad_years
weasleys['graduating_year'] = weasleys['graduating_year'].astype('int')
weasleys

Unnamed: 0,name,year,interests,house,hogwarts,graduating_year
0,Bill Weasley,,Dragons,Gryffindor,Alumni,1989
1,Charlie Weasley,,Gringots,Gryffindor,Alumni,1991
2,Percy Weasley,6.0,Ministry of Magic,Gryffindor,Current Student,1994
3,Fred Weasley,4.0,Jokes,Gryffindor,Current Student,1996
4,George Weasley,4.0,Jokes,Gryffindor,Current Student,1996
5,Ron Weasley,2.0,,Gryffindor,Current Student,1998
6,Ginny Weasley,1.0,,Gryffindor,Current Student,1999


## String processing

Often, a dataset will contain string representations of data that could be really useful if you could find some way to extract it. 

<br> Let's start off with a dataframe

In [26]:
roster = ['Oliver Wood', 'Angelina Johnson', 'Katie Bell', 'Alicia Spinnet', 'Fred Weasley', 'George weasley', 'Harry Potter'] 
role = ['Keeper','chaser','Chaser','chaser','beater','beater','SeEKeR']

quidditch = pd.DataFrame({'player':roster, 'role':role})
quidditch

Unnamed: 0,player,role
0,Oliver Wood,Keeper
1,Angelina Johnson,chaser
2,Katie Bell,Chaser
3,Alicia Spinnet,chaser
4,Fred Weasley,beater
5,George weasley,beater
6,Harry Potter,SeEKeR


It'd be great if we could work with just the first names of everyone. 

With normal Python strings, this is pretty easy to do using the `.split()` method:

In [27]:
name = 'Cedric Diggory'

print(name)
print(name.split())
print(name.split()[0])

Cedric Diggory
['Cedric', 'Diggory']
Cedric


Let's try using that to extract the first names from the column `player`

In [28]:
quidditch.player.split()

AttributeError: 'Series' object has no attribute 'split'

Looks like we got an error: We can't use `split()` on the series object directly.
<br><br> Instead, we have to "vectorize" it using `.str`  first

In [30]:
quidditch.player.str.split()

0         [Oliver, Wood]
1    [Angelina, Johnson]
2          [Katie, Bell]
3      [Alicia, Spinnet]
4        [Fred, Weasley]
5      [George, weasley]
6        [Harry, Potter]
Name: player, dtype: object

The `.str` part is a pretty nifty tool when we wanna access special functions just to work with strings. We'll see this come up with special functions for dealing with time.

Before we move on, check out the object type of the output using `type()`

In [31]:
type(quidditch.player.str.split()[4])

list

### Lambda apply functions

Lambda apply functions are a pretty helpful tool for cleaning, here's one quick example

In [32]:
quidditch['first_name'] = quidditch.player.str.split().apply(lambda x: x[0])
quidditch

Unnamed: 0,player,role,first_name
0,Oliver Wood,Keeper,Oliver
1,Angelina Johnson,chaser,Angelina
2,Katie Bell,Chaser,Katie
3,Alicia Spinnet,chaser,Alicia
4,Fred Weasley,beater,Fred
5,George weasley,beater,George
6,Harry Potter,SeEKeR,Harry


Quite literally, this reads: <br>

For every element `x` in the series `quidditch.player.str.split()`, take the first element of `x` and save it to a new column `first_name`
<br> In other words, we are **apply**ing the *anonymous* function `lambda x: x[0]` to every row in the series
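For this particular case there's also a shortcut: the `.str` accessor supports indexing, so we can grab the first item of each list without a lambda. A quick sketch on a made-up series `players`:

```python
import pandas as pd

# Hypothetical two-player series, just for illustration
players = pd.Series(['Oliver Wood', 'Katie Bell'])

via_apply = players.str.split().apply(lambda parts: parts[0])
via_str = players.str.split().str[0]  # .str indexing works on lists too

print(via_apply.tolist())
print(via_str.tolist())
```

Both give the same first names. Lambda-apply is still the more general tool, since the function can do anything we like with each element.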

**Try it out!** Use lambda-apply to reverse the order of letters in the `role` column
<br>Hint: To reverse a string in python, use `[::-1]`

In [33]:
quidditch['role_reversed'] = quidditch.role.apply(lambda x: x[::-1])
quidditch

Unnamed: 0,player,role,first_name,role_reversed
0,Oliver Wood,Keeper,Oliver,repeeK
1,Angelina Johnson,chaser,Angelina,resahc
2,Katie Bell,Chaser,Katie,resahC
3,Alicia Spinnet,chaser,Alicia,resahc
4,Fred Weasley,beater,Fred,retaeb
5,George weasley,beater,George,retaeb
6,Harry Potter,SeEKeR,Harry,ReKEeS


### Changing capitalization to better process text

Let's look at how many players are in each role

In [34]:
quidditch.role.value_counts()

chaser    2
beater    2
Chaser    1
SeEKeR    1
Keeper    1
Name: role, dtype: int64

`chaser` and `Chaser` should be the same role, but because of a mismatch in cases, they're being counted as separate values.

<br>An easy way to solve this is by converting all the text to a uniform case

In [35]:
name = 'CHo chaNg'
print(name)
print(name.lower())

CHo chaNg
cho chang


In [36]:
quidditch['role'] = quidditch['role'].str.capitalize()
quidditch

Unnamed: 0,player,role,first_name,role_reversed
0,Oliver Wood,Keeper,Oliver,repeeK
1,Angelina Johnson,Chaser,Angelina,resahc
2,Katie Bell,Chaser,Katie,resahC
3,Alicia Spinnet,Chaser,Alicia,resahc
4,Fred Weasley,Beater,Fred,retaeb
5,George weasley,Beater,George,retaeb
6,Harry Potter,Seeker,Harry,ReKEeS


In [37]:
quidditch.role.value_counts()

Chaser    3
Beater    2
Seeker    1
Keeper    1
Name: role, dtype: int64

**Try it out**: How many Weasleys are on the team?

Hint: (1) Either standardize the case and use the `.str.contains()` method <br> OR (2) Standardize the case, pull the last name, then use `.value_counts()`

In [38]:
quidditch.player.str.lower().str.contains('weasley').sum()

2
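The second approach from the hint (standardize the case, pull the last name, then count) might look something like this. The series `players` is a shortened, made-up version of the roster:

```python
import pandas as pd

# Hypothetical shortened roster with inconsistent capitalization
players = pd.Series(['Fred Weasley', 'George weasley', 'Harry Potter'])

# Lowercase everything, split on whitespace, and keep the last piece
last_names = players.str.lower().str.split().str[-1]
counts = last_names.value_counts()

print(counts)
```

This gives a count for every last name at once, not just the Weasleys.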

# Date & Time processing

In [39]:
person = ['Harry', 'Hermoine', 'Ron', 'Voldy']
birthdays = ['July 31st, 1980', '9-19-1979', '1980 Mar 1', '12//31// //1926']

bdays = pd.DataFrame({'person': person, 'birthday': birthdays})
bdays

Unnamed: 0,person,birthday
0,Harry,"July 31st, 1980"
1,Hermoine,9-19-1979
2,Ron,1980 Mar 1
3,Voldy,12//31// //1926


Yikes! Let's see if we can clean up the time series data using `pd.to_datetime`

In [40]:
bdays['birthday'] = pd.to_datetime(bdays['birthday'])
bdays

Unnamed: 0,person,birthday
0,Harry,1980-07-31
1,Hermoine,1979-09-19
2,Ron,1980-03-01
3,Voldy,1926-12-31


As you can see, `pd.to_datetime` is pretty powerful. It can read in quite a few time formats as strings, then convert them into a `Timestamp` series

In [41]:
type(bdays.birthday[0])

pandas._libs.tslibs.timestamps.Timestamp
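Automatic parsing won't always succeed, and depending on your pandas version, a column that mixes several formats may raise an error instead of parsing. When we know the format, one hedged option is to state it with `format=` and turn anything that doesn't fit into `NaT` (the datetime version of `NaN`) with `errors='coerce'`. The series `raw` below is made up:

```python
import pandas as pd

# Hypothetical column: two month-day-year dates and one junk entry
raw = pd.Series(['9-19-1979', '12-31-1926', 'not a date'])

# Entries that don't match the format become NaT instead of raising an error
parsed = pd.to_datetime(raw, format='%m-%d-%Y', errors='coerce')

print(parsed)
```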

### Using pandas datetime objects

We can pull quite a lot just from a datetime timestamp using attributes

In [42]:
harry_bday = bdays.at[0, 'birthday'] # Taking the value for harry's bday
print(harry_bday) # the raw timestamp

1980-07-31 00:00:00


In [43]:
harry_bday.month # The month, encoded as an int

7

**Try it out**: Was Harry born in a leap year?

Hint: Use the `.is_leap_year` attribute

### Make new columns from these datetime attributes 

Let's use this to make new columns that reflect these attributes.

We'll use the `.dt` accessor object to snag the month for each row, just like we did with `.str`

In [44]:
bdays['month'] = bdays.birthday.dt.month_name()
bdays

Unnamed: 0,person,birthday,month
0,Harry,1980-07-31,July
1,Hermoine,1979-09-19,September
2,Ron,1980-03-01,March
3,Voldy,1926-12-31,December


Try it out with some other attributes

In [45]:
bdays['is_leap'] = bdays.birthday.dt.is_leap_year
bdays['day'] = bdays.birthday.dt.dayofweek
bdays

Unnamed: 0,person,birthday,month,is_leap,day
0,Harry,1980-07-31,July,True,3
1,Hermoine,1979-09-19,September,False,2
2,Ron,1980-03-01,March,True,5
3,Voldy,1926-12-31,December,False,4


### Other uses for datetimes

Which people were born before the First Wizarding War (January 1970)?

To do this, we'll have to convert Jan 1970 into a datetime object to allow for comparison

In [46]:
bdays[bdays.birthday < pd.to_datetime('January 1970')]

Unnamed: 0,person,birthday,month,is_leap,day
3,Voldy,1926-12-31,December,False,4


We can also do some quick maths quite easily:

For example, how much older is Hermione than Ron?

In [47]:
diff = bdays.birthday[2] - bdays.birthday[1]

print(diff)
print(type(diff))

164 days 00:00:00
<class 'pandas._libs.tslibs.timedeltas.Timedelta'>


Note: this returns a `Timedelta` object, not `Timestamp`. We can get similar attributes

In [48]:
diff.days

164
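A `Timedelta` can be expressed in other units too. As a small sketch, here's the same gap rebuilt from two standalone timestamps (the names `start`, `end`, and `gap` are just for illustration):

```python
import pandas as pd

start = pd.Timestamp('1979-09-19')
end = pd.Timestamp('1980-03-01')
gap = end - start  # a Timedelta

print(gap.days)             # whole days
print(gap.total_seconds())  # the same span in seconds

# Dividing by another Timedelta gives a plain number in that unit
weeks = gap / pd.Timedelta(weeks=1)
print(weeks)
```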

### How do we apply this?

When you get data that includes time as a variable, it'll be in one of many possible formats, and not always consistent throughout the whole dataset. 


`pd.to_datetime` makes the process of cleaning these incredibly easy!

Once cleaned, we can look at specific attributes such as month, day, and year **to gain insight we wouldn't otherwise have been able to access.**

There's a lot, lot more you can do with pandas datetimes - use business days, adjust for time zones - just about anything you'd imagine.

The docs for all of that are linked here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#overview

# Merging DataFrames

Merging sources of data is super important:
    
<br> Sometimes you have data from two different sources that you'd like to have in one data frame to analyze. We can do that with `.concat()` and `.merge()`

In [49]:
spells = ['Summoning Charm', 'Patronus Charm', 'Disarming Charm', 'Killing Curse', 'Cruciatus Curse', 'Impediment Jinx', 'Dark Mark']
incantations = ['Accio', 'Excpecto Patronum', 'Expelliarmus', 'Avada Kedavra', 'Crucio', 'Impedimenta', 'Morsmordre']

incantations_df = pd.DataFrame({'Spell': spells, 'Incantation': incantations})
incantations_df

Unnamed: 0,Spell,Incantation
0,Summoning Charm,Accio
1,Patronus Charm,Excpecto Patronum
2,Disarming Charm,Expelliarmus
3,Killing Curse,Avada Kedavra
4,Cruciatus Curse,Crucio
5,Impediment Jinx,Impedimenta
6,Dark Mark,Morsmordre


In [50]:
effects = ['Summons an Object', 'Spirit to Guard Against Dementors', 'Disarms an Opponent', 'Instantaneous Death', 'Excruciating Pain', 'Hinders Movement', 'Conjures Dark Mark']

effects_df = pd.DataFrame({'Spell': spells, 'Effect': effects})
effects_df

Unnamed: 0,Spell,Effect
0,Summoning Charm,Summons an Object
1,Patronus Charm,Spirit to Guard Against Dementors
2,Disarming Charm,Disarms an Opponent
3,Killing Curse,Instantaneous Death
4,Cruciatus Curse,Excruciating Pain
5,Impediment Jinx,Hinders Movement
6,Dark Mark,Conjures Dark Mark


In [51]:
colors = ['None', 'Silver', 'Scarlet', 'Green', 'Red or None', 'Turquoise', 'Green']

colors_df = pd.DataFrame({'Spell': spells, 'Light Color': colors}).sample(frac=1).reset_index(drop=True)
colors_df

Unnamed: 0,Spell,Light Color
0,Summoning Charm,
1,Killing Curse,Green
2,Dark Mark,Green
3,Impediment Jinx,Turquoise
4,Patronus Charm,Silver
5,Disarming Charm,Scarlet
6,Cruciatus Curse,Red or None


In [52]:
# Quick helper
from IPython.display import display_html
def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style=\"display:inline\"'),raw=True)

In [53]:
display_side_by_side(incantations_df,effects_df,colors_df)

Unnamed: 0,Spell,Incantation
0,Summoning Charm,Accio
1,Patronus Charm,Excpecto Patronum
2,Disarming Charm,Expelliarmus
3,Killing Curse,Avada Kedavra
4,Cruciatus Curse,Crucio
5,Impediment Jinx,Impedimenta
6,Dark Mark,Morsmordre

Unnamed: 0,Spell,Effect
0,Summoning Charm,Summons an Object
1,Patronus Charm,Spirit to Guard Against Dementors
2,Disarming Charm,Disarms an Opponent
3,Killing Curse,Instantaneous Death
4,Cruciatus Curse,Excruciating Pain
5,Impediment Jinx,Hinders Movement
6,Dark Mark,Conjures Dark Mark

Unnamed: 0,Spell,Light Color
0,Summoning Charm,
1,Killing Curse,Green
2,Dark Mark,Green
3,Impediment Jinx,Turquoise
4,Patronus Charm,Silver
5,Disarming Charm,Scarlet
6,Cruciatus Curse,Red or None


Note that each one of these dataframes has a column in common, `Spell`.

The order of the values may not be the same, but we're still good to go

### `pd.merge()`

Instead of working with three distinct dataframes, let's combine them into one df

To do so, we can call `.merge()` on two data tables and specify the column on which to merge as `on=`

In [54]:
spells = pd.merge(incantations_df, effects_df, on='Spell')
spells

Unnamed: 0,Spell,Incantation,Effect
0,Summoning Charm,Accio,Summons an Object
1,Patronus Charm,Excpecto Patronum,Spirit to Guard Against Dementors
2,Disarming Charm,Expelliarmus,Disarms an Opponent
3,Killing Curse,Avada Kedavra,Instantaneous Death
4,Cruciatus Curse,Crucio,Excruciating Pain
5,Impediment Jinx,Impedimenta,Hinders Movement
6,Dark Mark,Morsmordre,Conjures Dark Mark


The column we're merging on is called a **join key**. If the key has a different name in each dataframe, we can specify both names in the join using `left_on=` and `right_on=` instead of `on=`.

In [55]:
spells = spells.merge(colors_df, on='Spell')
spells

Unnamed: 0,Spell,Incantation,Effect,Light Color
0,Summoning Charm,Accio,Summons an Object,
1,Patronus Charm,Excpecto Patronum,Spirit to Guard Against Dementors,Silver
2,Disarming Charm,Expelliarmus,Disarms an Opponent,Scarlet
3,Killing Curse,Avada Kedavra,Instantaneous Death,Green
4,Cruciatus Curse,Crucio,Excruciating Pain,Red or None
5,Impediment Jinx,Impedimenta,Hinders Movement,Turquoise
6,Dark Mark,Morsmordre,Conjures Dark Mark,Green
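When the join key is named differently in the two tables, `left_on=` and `right_on=` let us name each side separately. A minimal sketch, where the frames `left` and `right` and the column `spell_name` are made up for illustration:

```python
import pandas as pd

# Hypothetical tables whose key columns have different names
left = pd.DataFrame({'Spell': ['Summoning Charm', 'Dark Mark'],
                     'Incantation': ['Accio', 'Morsmordre']})
right = pd.DataFrame({'spell_name': ['Dark Mark', 'Summoning Charm'],
                      'Light Color': ['Green', 'None']})

merged = left.merge(right, left_on='Spell', right_on='spell_name')

# Both key columns are kept, so drop the redundant one
merged = merged.drop(columns=['spell_name'])
print(merged)
```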


### Join logic

In [56]:
more_spells = ['Disarming Charm', 'Dark Mark', 'Imperius Curse', 'Sectumsempra', 'Levitation Charm']
more_incantations = ['Expelliarmus', 'Morsmordre', 'Imperio', 'Sectumsempra', 'Wingardium Leviosa']

more_incantations_df = pd.DataFrame({'Spell': more_spells, 'Incantation': more_incantations})
more_incantations_df

Unnamed: 0,Spell,Incantation
0,Disarming Charm,Expelliarmus
1,Dark Mark,Morsmordre
2,Imperius Curse,Imperio
3,Sectumsempra,Sectumsempra
4,Levitation Charm,Wingardium Leviosa


With the previous merges, we had the same number of observations in every dataframe.

<br>With some merges, not every row may align. Let's try to merge `more_incantations_df` with `effects_df`. Note how there are some spells in common, and some unique to each

In [57]:
display_side_by_side(more_incantations_df,effects_df)

Unnamed: 0,Spell,Incantation
0,Disarming Charm,Expelliarmus
1,Dark Mark,Morsmordre
2,Imperius Curse,Imperio
3,Sectumsempra,Sectumsempra
4,Levitation Charm,Wingardium Leviosa

Unnamed: 0,Spell,Effect
0,Summoning Charm,Summons an Object
1,Patronus Charm,Spirit to Guard Against Dementors
2,Disarming Charm,Disarms an Opponent
3,Killing Curse,Instantaneous Death
4,Cruciatus Curse,Excruciating Pain
5,Impediment Jinx,Hinders Movement
6,Dark Mark,Conjures Dark Mark


We can do a few different merges now: 
1. If we want to retain **only** those in common, we use an `inner` join
2. If we want to keep **everything**, and keep placeholders for missing data, we use an `outer` join
3. If we want to keep just those in one table, and **lookup** values from another, we use a `left` join

In [58]:
pd.merge(effects_df, more_incantations_df, on='Spell', how='inner')

Unnamed: 0,Spell,Effect,Incantation
0,Disarming Charm,Disarms an Opponent,Expelliarmus
1,Dark Mark,Conjures Dark Mark,Morsmordre


In [59]:
pd.merge(effects_df, more_incantations_df, on='Spell', how='outer')

Unnamed: 0,Spell,Effect,Incantation
0,Summoning Charm,Summons an Object,
1,Patronus Charm,Spirit to Guard Against Dementors,
2,Disarming Charm,Disarms an Opponent,Expelliarmus
3,Killing Curse,Instantaneous Death,
4,Cruciatus Curse,Excruciating Pain,
5,Impediment Jinx,Hinders Movement,
6,Dark Mark,Conjures Dark Mark,Morsmordre
7,Imperius Curse,,Imperio
8,Sectumsempra,,Sectumsempra
9,Levitation Charm,,Wingardium Leviosa


In [60]:
pd.merge(effects_df, more_incantations_df, on='Spell', how='left')

Unnamed: 0,Spell,Effect,Incantation
0,Summoning Charm,Summons an Object,
1,Patronus Charm,Spirit to Guard Against Dementors,
2,Disarming Charm,Disarms an Opponent,Expelliarmus
3,Killing Curse,Instantaneous Death,
4,Cruciatus Curse,Excruciating Pain,
5,Impediment Jinx,Hinders Movement,
6,Dark Mark,Conjures Dark Mark,Morsmordre
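To see where each row of a join came from, `merge()` accepts `indicator=True`, which adds a `_merge` column marking each row as `left_only`, `right_only`, or `both`. A sketch on two small made-up tables (`effects` and `incants` are shortened stand-ins for the dataframes above):

```python
import pandas as pd

# Hypothetical shortened tables with one spell in common
effects = pd.DataFrame({'Spell': ['Summoning Charm', 'Dark Mark'],
                        'Effect': ['Summons an Object', 'Conjures Dark Mark']})
incants = pd.DataFrame({'Spell': ['Dark Mark', 'Imperius Curse'],
                        'Incantation': ['Morsmordre', 'Imperio']})

audit = pd.merge(effects, incants, on='Spell', how='outer', indicator=True)
print(audit)
```

This is handy for checking a merge before committing to an `inner` or `left` join.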


### `pd.concat()`

Another *similar* function is `.concat()` 

It's a little different from `.merge()`, since we'll have to pass in a `list` of dataframes instead

In [61]:
df = pd.concat([incantations_df, effects_df, colors_df])
df

Unnamed: 0,Spell,Incantation,Effect,Light Color
0,Summoning Charm,Accio,,
1,Patronus Charm,Excpecto Patronum,,
2,Disarming Charm,Expelliarmus,,
3,Killing Curse,Avada Kedavra,,
4,Cruciatus Curse,Crucio,,
5,Impediment Jinx,Impedimenta,,
6,Dark Mark,Morsmordre,,
0,Summoning Charm,,Summons an Object,
1,Patronus Charm,,Spirit to Guard Against Dementors,
2,Disarming Charm,,Disarms an Opponent,


That didn't quite work as expected, because `concat()` stacked the dataframes on top of each other, instead of combining information for common rows.

Note that it **didn't combine rows** when `merge()` easily could have.

One example of when `concat()` is appropriate is when we want to add more information to a dataframe, but the **rows are different** between the two

In [62]:
incantations_all = pd.concat([incantations_df, more_incantations_df]).reset_index(drop=True).drop_duplicates()
# The reset_index() allows us to prevent overlapping of the indices

incantations_all

Unnamed: 0,Spell,Incantation
0,Summoning Charm,Accio
1,Patronus Charm,Excpecto Patronum
2,Disarming Charm,Expelliarmus
3,Killing Curse,Avada Kedavra
4,Cruciatus Curse,Crucio
5,Impediment Jinx,Impedimenta
6,Dark Mark,Morsmordre
9,Imperius Curse,Imperio
10,Sectumsempra,Sectumsempra
11,Levitation Charm,Wingardium Leviosa
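Notice the index above jumps from 6 to 9, because the duplicates were dropped after the index was reset. One possible variation, sketched on made-up frames `first` and `second`, is to let `concat()` renumber the rows with `ignore_index=True` and reset once more after dropping duplicates:

```python
import pandas as pd

# Hypothetical frames that share one row
first = pd.DataFrame({'Spell': ['Summoning Charm', 'Dark Mark']})
second = pd.DataFrame({'Spell': ['Dark Mark', 'Imperius Curse']})

stacked = pd.concat([first, second], ignore_index=True)     # index runs 0..3
deduped = stacked.drop_duplicates().reset_index(drop=True)  # index runs 0..2, no gaps

print(deduped)
```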


`concat()` can also horizontally stack dataframes, using the `axis=1` argument. 

Here's a case where it might be useful:

In [63]:
more_info_df = pd.DataFrame({'Dark Magic': ['No', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes'], 
                             'Type': ['Charm', 'Charm', 'Charm', 'Curse', 'Curse', 'Jinx', 'Curse']})
more_info_df

Unnamed: 0,Dark Magic,Type
0,No,Charm
1,No,Charm
2,No,Charm
3,Yes,Curse
4,Yes,Curse
5,No,Jinx
6,Yes,Curse


In [64]:
df = pd.concat([incantations_df, more_info_df], axis=1)
df

Unnamed: 0,Spell,Incantation,Dark Magic,Type
0,Summoning Charm,Accio,No,Charm
1,Patronus Charm,Excpecto Patronum,No,Charm
2,Disarming Charm,Expelliarmus,No,Charm
3,Killing Curse,Avada Kedavra,Yes,Curse
4,Cruciatus Curse,Crucio,Yes,Curse
5,Impediment Jinx,Impedimenta,No,Jinx
6,Dark Mark,Morsmordre,Yes,Curse


Note the difference between `.concat(axis=1)` and `.merge()`. We would use `.concat()` when there isn't a duplicate column (a key), and `.merge()` when there is one.
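One thing to watch for: `concat(axis=1)` lines rows up by their index, not by any column's values. If the rows are in a different order (like our shuffled `colors_df`), the result can silently pair the wrong rows. A sketch with made-up frames `incants` and `shuffled`:

```python
import pandas as pd

# Hypothetical tables: same spells, different row order
incants = pd.DataFrame({'Spell': ['Summoning Charm', 'Dark Mark'],
                        'Incantation': ['Accio', 'Morsmordre']})
shuffled = pd.DataFrame({'Spell': ['Dark Mark', 'Summoning Charm'],
                         'Light Color': ['Green', 'None']})

side_by_side = pd.concat([incants, shuffled], axis=1)  # pairs rows by index
by_key = incants.merge(shuffled, on='Spell')           # pairs rows by Spell value

print(side_by_side)
print(by_key)
```

In `side_by_side`, the Summoning Charm row ends up next to the Dark Mark's color, while `by_key` matches them correctly.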

# TLDR

Data cleaning is one of the most important parts of any data workflow. Pandas provides incredibly powerful tools, if you can wield them properly. Here's a quick recap of some helpful functions. Not sure what parameters they accept? Use the `?function` shortcut to quickly pull the documentation.

Checking Data Types:

- `df.info()`
- `df.column.astype()`

Mapping Values:

- `df.drop()`
- `df.rename()`
- `df.column.replace()`

Handling Missing Data (NaNs):

- `df.column.isna().sum()`
- `df.column.fillna()`

Working with Strings:

- `df.column.str.lower()`
- `df.column.str.split()`
- `df.column.str.contains()`

Working with DateTime:

- `pd.to_datetime(df.column)`
- `df.column.dt.month` (or `df.column.dt.month_name()`)
- `df[df.column < pd.to_datetime('some date')]`

Joining Data:

- `pd.merge()`
- `pd.concat()`