<img src="https://courses.edx.org/asset-v1:ACCA+ML001+2T2021+type@asset+block@acca-logo.jpg" alt="ACCA logo" style="width: 400px;"/>

# Python for data analysis
## Part 4 - Merging datasets & handling missing data

* **Course:** __Machine learning with Python for finance professionals__ by ACCA
* **Instructor:** [Coefficient](https://coefficient.ai) / [@CoefficientData](https://twitter.com/CoefficientData)

---

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
# Read in the Dream Destination hotel data
orders = pd.read_excel("Hotel Industry - Orders Database - 2019.xlsx",
                       sheet_name="Order Database")
orders.head(3)

Unnamed: 0,Booking ID,Date of Booking,Year,Time,Customer ID,Gender,Age,Origin Country,State,Location,Destination Country,Destination City,No. Of People,Check-in date,No. Of Days,Check-Out Date,Rooms,Hotel Name,Hotel Rating
0,DDID57035,2019-01-01,2019,13:23:47,ID10297,Female,51,Indonesia,Tambora,Jakarta,Ireland,Tallaght,2,2019-03-24,1,2019-03-25,1,Blooming Bed And Breakfast,4.2
1,DDSG57036,2019-01-01,2019,16:14:22,SG10307,Male,46,Singapore,Central,Novena,Maldives,Viligili,4,2019-01-15,2,2019-01-17,2,Four Points,4.3
2,DDMY57037,2019-01-01,2019,09:49:48,MY10283,Female,25,Malaysia,Johor,Johor Bahru,Canada,North York,5,2019-01-16,9,2019-01-25,3,Hotel Joy Stick,3.8


<div class="alert alert-block alert-info" style="background-color: #BA001E; border: 0px; -moz-border-radius: 10px; -webkit-border-radius: 10px;">
<h2 style="color: white">
Merging DataFrames
</h2><br>
</div>

We will show you how to construct two dataframes:
- `cities`: this maps each country to its capital city. We construct this by passing a list of dictionaries to the `pd.DataFrame()` function.
- `continents`: this maps each country to its continent. We construct this by passing a list of lists to the `pd.DataFrame()` function.

In [3]:
# cities = list of dictionaries
cities_data = [
    {'country': 'Denmark', 'capital': 'Copenhagen'},
    {'country': 'France', 'capital': 'Paris'},
    {'country': 'China', 'capital': 'Beijing'},
    {'country': 'Colombia', 'capital': 'Bogotá'},
]
cities_data

[{'country': 'Denmark', 'capital': 'Copenhagen'},
 {'country': 'France', 'capital': 'Paris'},
 {'country': 'China', 'capital': 'Beijing'},
 {'country': 'Colombia', 'capital': 'Bogotá'}]

In [4]:
# Note how it figures out the column names automatically
cities = pd.DataFrame(cities_data)
cities

Unnamed: 0,country,capital
0,Denmark,Copenhagen
1,France,Paris
2,China,Beijing
3,Colombia,Bogotá


In [5]:
# continents = list of lists
continents_data = [
    ['Denmark', 'Europe'],
    ['France', 'Europe'],
    ['China', 'Asia'],
    ['Kenya', 'Africa'],
]

In [6]:
# We need to specify the column names this time
continents = pd.DataFrame(continents_data, columns=['country', 'continent'])
continents

Unnamed: 0,country,continent
0,Denmark,Europe
1,France,Europe
2,China,Asia
3,Kenya,Africa


We have two dataframes that we'd like to join. We need to decide _how_ to join though. There are several options.

<img src="https://courses.edx.org/asset-v1:ACCA+ML001+2T2021+type@asset+block@joins.png" alt="Join types" style="width: 600px;"/>

In the examples below, we will assume that we are joining on country, that `cities` is the "left dataframe" and `continents` is the "right dataframe".

| Join Type | Description | Result |
| -- | -- | -- |
| **Inner** | Keep only the rows whose country appears in both dataframes. | Denmark, France, and China only. |
| **Left** | Keep all the rows present in the "left" dataframe. If there isn't a match in the "right" dataframe, you will see missing values. | Denmark, France, China, Colombia (all the countries in the left dataframe). |
| **Right** | Keep all the rows present in the "right" dataframe. If there isn't a match in the "left" dataframe, you will see missing values. | Denmark, France, China, Kenya (all the countries in the right dataframe). |
| **Outer** | Keep all the rows present in **either** dataframe, with missing values present where a match cannot be found. | Denmark, France, China, Colombia _and_ Kenya. |

We will be using the [pandas `pd.merge()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html). More advanced examples are available on the docs page.
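
As a quick check of the table above, the sketch below runs all four join types in a loop and lists the countries each one keeps. It rebuilds the two dataframes so that it runs on its own; the name `join_results` is made up for this illustration.

```python
import pandas as pd

# Rebuild the two small dataframes so this block runs on its own
cities = pd.DataFrame([
    {'country': 'Denmark', 'capital': 'Copenhagen'},
    {'country': 'France', 'capital': 'Paris'},
    {'country': 'China', 'capital': 'Beijing'},
    {'country': 'Colombia', 'capital': 'Bogotá'},
])
continents = pd.DataFrame(
    [['Denmark', 'Europe'], ['France', 'Europe'], ['China', 'Asia'], ['Kenya', 'Africa']],
    columns=['country', 'continent'],
)

# Collect the countries kept by each join type
# (join_results is a hypothetical name used only in this sketch)
join_results = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left=cities, right=continents, on='country', how=how)
    join_results[how] = sorted(merged['country'])
    print(how, join_results[how])
```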

In [7]:
# Let's try an inner join
pd.merge(left=cities, right=continents, on='country', how='inner')

Unnamed: 0,country,capital,continent
0,Denmark,Copenhagen,Europe
1,France,Paris,Europe
2,China,Beijing,Asia


Does this match the table above? It only retains the countries present in both `cities` and `continents`. How about a left join?

In [8]:
# Let's try a left join
left = pd.merge(left=cities, right=continents, on='country', how='left')
left

Unnamed: 0,country,capital,continent
0,Denmark,Copenhagen,Europe
1,France,Paris,Europe
2,China,Beijing,Asia
3,Colombia,Bogotá,


What's happened here? We have Colombia in the `cities` dataframe, but not in the `continents` dataframe. It's still there (because we did a left join), but a placeholder value `NaN` has been inserted for Colombia's continent.
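
If you want pandas to tell you which rows failed to find a match, `pd.merge()` accepts an `indicator=True` argument that adds a `_merge` column. The sketch below uses two cut-down dataframes (`cities_demo` and `continents_demo` are names invented for this example) rather than the ones above.

```python
import pandas as pd

# Cut-down stand-ins for cities and continents (hypothetical names)
cities_demo = pd.DataFrame({'country': ['Denmark', 'Colombia'],
                            'capital': ['Copenhagen', 'Bogotá']})
continents_demo = pd.DataFrame({'country': ['Denmark', 'Kenya'],
                                'continent': ['Europe', 'Africa']})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
flagged = pd.merge(left=cities_demo, right=continents_demo,
                   on='country', how='left', indicator=True)

# Rows that exist only in the left dataframe are the ones with NaN on the right
unmatched = flagged[flagged['_merge'] == 'left_only']
print(unmatched)
```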

<div class="alert alert-block alert-info" style="background-color: #BA001E; border: 0px; -moz-border-radius: 10px; -webkit-border-radius: 10px;">
<h2 style="color: white">
Missing Data
</h2><br>
</div>

`NaN` (short for "Not a Number") is how NumPy and pandas represent a missing value in data. In code, you can write it as `np.nan`.

In [9]:
np.nan

nan

In [10]:
# We can check if a value is NaN with this function
pd.isna(np.nan)

True
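
Why use `pd.isna()` rather than a plain `==` comparison? A minimal sketch of the reason: `NaN` is a floating-point value that is never equal to anything, not even itself, so an equality test cannot find it. The variable names here are just for illustration.

```python
import numpy as np
import pandas as pd

value = np.nan

# NaN is a float, and by definition it is not equal to anything, itself included
is_float = isinstance(value, float)
equals_itself = (value == value)

# So an equality test cannot detect missing values, but pd.isna() can
found_by_isna = pd.isna(value)

print(is_float, equals_itself, found_by_isna)
```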

In [11]:
# It works on whole dataframes
pd.isna(left)

Unnamed: 0,country,capital,continent
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,True


In [12]:
# Or a single column
pd.isna(left.continent)

0    False
1    False
2    False
3     True
Name: continent, dtype: bool

In [13]:
# We can use this to highlight just the rows with missing data as follows
missing_continent = pd.isna(left.continent)
left[missing_continent]

Unnamed: 0,country,capital,continent
3,Colombia,Bogotá,


In [14]:
# Or to remove them (read the tilde (~) sign here as "not", i.e. it flips True to False and vice versa)
left = left[~missing_continent]
left

Unnamed: 0,country,capital,continent
0,Denmark,Copenhagen,Europe
1,France,Paris,Europe
2,China,Beijing,Asia
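
pandas also offers a shortcut for this pattern, `dropna()`. The sketch below shows that it gives the same result as the mask-and-tilde approach; it uses a stand-in dataframe called `demo` (an invented name) so that it does not overwrite `left`.

```python
import numpy as np
import pandas as pd

# A stand-in for the left-joined dataframe (named demo to avoid overwriting it)
demo = pd.DataFrame({
    'country': ['Denmark', 'France', 'China', 'Colombia'],
    'capital': ['Copenhagen', 'Paris', 'Beijing', 'Bogotá'],
    'continent': ['Europe', 'Europe', 'Asia', np.nan],
})

# Approach used above: build a boolean mask, then keep the rows where it is False
via_mask = demo[~pd.isna(demo.continent)]

# Shortcut: drop the rows that have a missing value in the listed columns
via_dropna = demo.dropna(subset=['continent'])

print(via_dropna)
```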


---

> ### 🚩 Exercises
> 1. Right join `cities` and `continents` on country and assign it to a variable called `right`.
> 2. Try running `right.fillna('-')`. What does this do? If in doubt, [check the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html).
> 3. Update the capital column in your `right` dataframe with `right.capital.fillna('-')`.

In [15]:
# 1. Right join `cities` and `continents` on country and assign it to a variable called `right`.

# ✏️ ENTER YOUR SOLUTION HERE

right = pd.merge(left=cities, right=continents, on='country', how='right')



In [16]:
# 2. Try running `right.fillna('-')`. What does this do?

# ✏️ ENTER YOUR SOLUTION HERE

right.fillna('-')


Unnamed: 0,country,capital,continent
0,Denmark,Copenhagen,Europe
1,France,Paris,Europe
2,China,Beijing,Asia
3,Kenya,-,Africa


In [17]:
# 3. Update the capital column in your `right` dataframe with `right.capital.fillna('-')`.

# ✏️ ENTER YOUR SOLUTION HERE

right['capital'] = right.capital.fillna('-')
right

Unnamed: 0,country,capital,continent
0,Denmark,Copenhagen,Europe
1,France,Paris,Europe
2,China,Beijing,Asia
3,Kenya,-,Africa


We will try a full outer join in the next section.

---

<div class="alert alert-block alert-info" style="background-color: #BA001E; border: 0px; -moz-border-radius: 10px; -webkit-border-radius: 10px;">
<h2 style="color: white">
loc vs iloc
</h2><br>
</div>

In [18]:
# Let's outer join cities and continents
df = pd.merge(left=cities, right=continents, on='country', how='outer')
df

Unnamed: 0,country,capital,continent
0,Denmark,Copenhagen,Europe
1,France,Paris,Europe
2,China,Beijing,Asia
3,Colombia,Bogotá,
4,Kenya,,Africa


In [19]:
# Notice the bold numbers on the left-hand side. These are the index labels, assigned when the dataframe is created.
# When the dataframe is sorted, they stay "stuck onto" the row they were initially assigned to.
df.sort_values('country')

Unnamed: 0,country,capital,continent
2,China,Beijing,Asia
3,Colombia,Bogotá,
0,Denmark,Copenhagen,Europe
1,France,Paris,Europe
4,Kenya,,Africa


### .iloc = "integer location"

In [20]:
# We can select rows by their "integer location", i.e. positional location from top.
# This returns the top row in df, i.e. Denmark, as a Series representing the row.
df.iloc[0]

country         Denmark
capital      Copenhagen
continent        Europe
Name: 0, dtype: object

In [21]:
# If we apply .iloc[0] to the sorted dataframe, it takes the top row still, i.e. China.
df.sort_values('country').iloc[0]

country        China
capital      Beijing
continent       Asia
Name: 2, dtype: object

In [22]:
# iloc treats dataframes just like Python lists, i.e. we can slice and use negative indexes.
df.iloc[:3]

Unnamed: 0,country,capital,continent
0,Denmark,Copenhagen,Europe
1,France,Paris,Europe
2,China,Beijing,Asia


In [23]:
df.sort_values('country').iloc[-1:]

Unnamed: 0,country,capital,continent
4,Kenya,,Africa


### .loc = "index location"

In [24]:
# .loc[0] returns the ROW WITH INDEX 0
df.loc[0]

country         Denmark
capital      Copenhagen
continent        Europe
Name: 0, dtype: object

In [25]:
# This means that, even if we sorted the dataframe, it still returns Denmark
df.sort_values('country').loc[0]

country         Denmark
capital      Copenhagen
continent        Europe
Name: 0, dtype: object

In [26]:
# This becomes especially useful with datetime indexes. Let's create one now.

# First, create a date column using the pd.date_range() function
df['date'] = pd.date_range(start='2021-01-01', periods=len(df), freq='D')
df

Unnamed: 0,country,capital,continent,date
0,Denmark,Copenhagen,Europe,2021-01-01
1,France,Paris,Europe,2021-01-02
2,China,Beijing,Asia,2021-01-03
3,Colombia,Bogotá,,2021-01-04
4,Kenya,,Africa,2021-01-05


In [27]:
# Then set this as the index.
df.set_index('date')

date,country,capital,continent
2021-01-01,Denmark,Copenhagen,Europe
2021-01-02,France,Paris,Europe
2021-01-03,China,Beijing,Asia
2021-01-04,Colombia,Bogotá,
2021-01-05,Kenya,,Africa


In [28]:
# Like most pandas methods, set_index() returns a new dataframe, so the change is not
# "applied" or "saved" unless you reassign the result back into the df variable.
df

Unnamed: 0,country,capital,continent,date
0,Denmark,Copenhagen,Europe,2021-01-01
1,France,Paris,Europe,2021-01-02
2,China,Beijing,Asia,2021-01-03
3,Colombia,Bogotá,,2021-01-04
4,Kenya,,Africa,2021-01-05
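
The same behaviour applies to other methods that return a new dataframe, such as `sort_values()`. A small sketch, using a throwaway dataframe named `demo` (an invented name):

```python
import pandas as pd

demo = pd.DataFrame({'country': ['Kenya', 'China', 'France']})

# Calling the method on its own returns a sorted dataframe but leaves demo untouched
demo.sort_values('country')
order_without_reassigning = list(demo['country'])

# Reassigning stores the sorted result
demo = demo.sort_values('country')
order_after_reassigning = list(demo['country'])

print(order_without_reassigning)
print(order_after_reassigning)
```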


In [29]:
# Let's reassign now
df = df.set_index('date')
df

date,country,capital,continent
2021-01-01,Denmark,Copenhagen,Europe
2021-01-02,France,Paris,Europe
2021-01-03,China,Beijing,Asia
2021-01-04,Colombia,Bogotá,
2021-01-05,Kenya,,Africa


In [30]:
# Now we can compare .loc vs .iloc properly!
df.sort_values('country').iloc[:2]

date,country,capital,continent
2021-01-03,China,Beijing,Asia
2021-01-04,Colombia,Bogotá,


In [31]:
# Because the index is now dates, we can use the date values in .loc
df.sort_values('country').loc['2021-01-05']

date,country,capital,continent
2021-01-05,Kenya,,Africa


### Summary
`.iloc` treats the dataframe like a list, while `.loc` treats it like a dictionary where the dataframe's index is the key and the row is the value. [This StackOverflow post](https://stackoverflow.com/questions/31593201/how-are-iloc-and-loc-different) is quite useful if you're feeling a bit confused on this topic.
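
To make the list-versus-dictionary analogy concrete, here is a minimal sketch with a dataframe whose index labels are deliberately not 0, 1, 2. The dataframe `demo` and its labels are invented for this example. It also shows one difference worth remembering: `.iloc` slices exclude the end position (like a list), while `.loc` slices include the end label.

```python
import pandas as pd

# A small dataframe whose index labels are deliberately not 0, 1, 2
demo = pd.DataFrame(
    {'country': ['Denmark', 'France', 'China']},
    index=[10, 20, 30],
)

by_position = demo.iloc[0]['country']  # first row, whatever its label is
by_label = demo.loc[20]['country']     # the row whose index label is 20

# .iloc[0:2] returns positions 0 and 1 only (end excluded)
iloc_slice = list(demo.iloc[0:2]['country'])

# .loc[10:30] returns labels 10, 20 and 30 (end included)
loc_slice = list(demo.loc[10:30]['country'])

print(by_position, by_label)
print(iloc_slice)
print(loc_slice)
```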

---

> ### 🚩 Exercises
> For each of the following examples, try to guess what they might return, then try out each example to test your understanding:
> - `df.sort_values('continent').iloc[-2]`
> - `df.sort_values('continent').loc[-2]`
> - `df.loc['2021-01-02':'2021-01-04']`

In [32]:
df.sort_values('continent').iloc[-2]

country      France
capital       Paris
continent    Europe
Name: 2021-01-02 00:00:00, dtype: object

In [33]:
df.sort_values('continent').loc[-2]

KeyError: -2

In [34]:
df.loc['2021-01-02':'2021-01-04']

date,country,capital,continent
2021-01-02,France,Paris,Europe
2021-01-03,China,Beijing,Asia
2021-01-04,Colombia,Bogotá,


---

### Updating a dataframe with .loc
`.loc` can also be used to insert a value into a specific cell in the dataframe. Let's fill in the missing values by hand. The syntax for this is:

```python
DATAFRAME.loc[ROW_FILTER, COLUMN] = NEW_VALUE
```

In [35]:
# We're going to update the capital of Kenya to read "Nairobi".
df

date,country,capital,continent
2021-01-01,Denmark,Copenhagen,Europe
2021-01-02,France,Paris,Europe
2021-01-03,China,Beijing,Asia
2021-01-04,Colombia,Bogotá,
2021-01-05,Kenya,,Africa


In [36]:
row_filter = (df.country == 'Kenya')
df.loc[row_filter, 'capital'] = 'Nairobi'
df

date,country,capital,continent
2021-01-01,Denmark,Copenhagen,Europe
2021-01-02,France,Paris,Europe
2021-01-03,China,Beijing,Asia
2021-01-04,Colombia,Bogotá,
2021-01-05,Kenya,Nairobi,Africa


In [37]:
df

date,country,capital,continent
2021-01-01,Denmark,Copenhagen,Europe
2021-01-02,France,Paris,Europe
2021-01-03,China,Beijing,Asia
2021-01-04,Colombia,Bogotá,
2021-01-05,Kenya,Nairobi,Africa


---

> ### 🚩 Exercise
> Using the same technique, update the continent for Colombia to read "South America".

In [38]:
# ✏️ ENTER YOUR SOLUTION HERE
row_filter = (df.country == 'Colombia')
df.loc[row_filter, 'continent'] = 'South America'
df



date,country,capital,continent
2021-01-01,Denmark,Copenhagen,Europe
2021-01-02,France,Paris,Europe
2021-01-03,China,Beijing,Asia
2021-01-04,Colombia,Bogotá,South America
2021-01-05,Kenya,Nairobi,Africa


### How to deal with `SettingWithCopyWarning`
From time to time, you get a `SettingWithCopyWarning` message when you try to change an existing dataframe, e.g. by updating or adding a new column. This often happens after filtering a dataframe (on either rows or columns).

Let's see this in action.

In [39]:
# Create a filtered dataframe
df_europe = df[df.continent == 'Europe']
df_europe

date,country,capital,continent
2021-01-01,Denmark,Copenhagen,Europe
2021-01-02,France,Paris,Europe


In [40]:
# Try adding a new column
df_europe['Population'] = 747933843
df_europe

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_europe['Population'] = 747933843


date,country,capital,continent,Population
2021-01-01,Denmark,Copenhagen,Europe,747933843
2021-01-02,France,Paris,Europe,747933843


In [41]:
df_europe

date,country,capital,continent,Population
2021-01-01,Denmark,Copenhagen,Europe,747933843
2021-01-02,France,Paris,Europe,747933843


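You can see that the new column has been added, but there are some dangers in ignoring this warning! Because `df_europe` was derived from `df` by filtering, pandas cannot always tell whether it is an independent copy or a view onto the original data, and it doesn't duplicate the data "in memory" unless it has to or you ask it to.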

Why not? Imagine you're working with a 1GB dataset containing 100 million rows. pandas can handle this, but if your machine only has 4GB of working memory (RAM) then you don't want to create copies of the dataframe unless absolutely necessary!

In the above case, and **in nearly all situations involving `SettingWithCopyWarning`, the fix is simply to create an explicit copy of your data when filtering**.

In [42]:
df_europe = df[df.continent == 'Europe'].copy()  # add .copy() on the end when you filter => no warning!
df_europe['Population'] = 747933843
df_europe

date,country,capital,continent,Population
2021-01-01,Denmark,Copenhagen,Europe,747933843
2021-01-02,France,Paris,Europe,747933843
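
A side effect of the explicit copy is that the filtered dataframe is fully independent of the original. The sketch below checks this on a small stand-in dataframe (`demo` and `demo_europe` are names made up for this example): the new column appears on the copy only.

```python
import pandas as pd

demo = pd.DataFrame({
    'country': ['Denmark', 'France', 'China'],
    'continent': ['Europe', 'Europe', 'Asia'],
})

# Filter and copy in one step, then modify the copy
demo_europe = demo[demo.continent == 'Europe'].copy()
demo_europe['Population'] = 747933843

# The original dataframe is unchanged; only the copy has the new column
print(list(demo.columns))
print(list(demo_europe.columns))
```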


---
<div class="alert alert-block alert-info">
    <b>Please proceed to the next part of the course when you are ready.</b>
</div>