# Data Cleaning

Dataset:
* [Rainfall](https://africaopendata.org/dataset/messy-data-for-data-cleaning-exercise)

First, I load the dataset with the `read_excel()` function. Reading `.xlsx` files requires the `openpyxl` library, which you can install with `pip install openpyxl`.

In [1]:
import pandas as pd

df = pd.read_excel('../../Datasets/rainfall.xlsx')
df.head()

Unnamed: 0.1,Unnamed: 0,Seasonal rainfall in Lake Victoria and Simiyu,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8
0,,,,,,,,,
1,,"Month, period",Lake Victoria,Simiyu,,,,,
2,,"Jan,2001-2019",3.176mm,2.90847,,,,,
3,,"Feb,2001-2019",3.477mm,1.8mm,,,,,
4,,"Mar,2001-2019",4.68705,2.98105,,,,,


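The dataset is not loaded correctly: the sheet starts with a title row and an empty row, so the real header (`Month, period`, `Lake Victoria`, `Simiyu`) ends up among the data and the column names are wrong. I read the dataset again, skipping the first 2 rows: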

In [59]:
df = pd.read_excel('source/rainfall.xlsx', skiprows=2)
df.head()

Unnamed: 0.1,Unnamed: 0,"Month, period",Lake Victoria,Simiyu,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8
0,,"Jan,2001-2019",3.176mm,2.90847,,,,,
1,,"Feb,2001-2019",3.477mm,1.8mm,,,,,
2,,"Mar,2001-2019",4.68705,2.98105,,,,,
3,,"Apr,2001-2019",7.00453,4.75358,,,,,
4,,"May,2001-2019",9.36279,4.07747,,,,,
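To see the effect of `skiprows` without the workbook at hand, here is a minimal sketch. It assumes an in-memory CSV as a stand-in for the Excel sheet (the sample text and the use of `read_csv` are illustrative assumptions; `read_excel` accepts a `skiprows` argument in the same way).

```python
import io
import pandas as pd

# Hypothetical stand-in for the sheet: a title row, an empty row, then the real header.
raw = (
    "Seasonal rainfall in Lake Victoria and Simiyu,,\n"
    ",,\n"
    "Month,Lake Victoria,Simiyu\n"
    "Jan,3.176,2.908\n"
    "Feb,3.477,1.8\n"
)

# Without skiprows, the title row is taken as the header.
naive = pd.read_csv(io.StringIO(raw))

# With skiprows=2, parsing starts at the third line, so the real header is used.
fixed = pd.read_csv(io.StringIO(raw), skiprows=2)

print(list(naive.columns))
print(list(fixed.columns))
```

The second call yields the three expected column names, while the first one produces a title plus `Unnamed` placeholders, which is the same symptom seen in the first load.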


The sheet also contains empty extra columns (the `Unnamed` ones). I can exclude them by keeping only columns B to D through the `usecols` parameter:

In [60]:
df = pd.read_excel('source/rainfall.xlsx', skiprows=2, usecols='B:D')
df.head(20)

Unnamed: 0,"Month, period",Lake Victoria,Simiyu
0,"Jan,2001-2019",3.176mm,2.90847
1,"Feb,2001-2019",3.477mm,1.8mm
2,"Mar,2001-2019",4.68705,2.98105
3,"Apr,2001-2019",7.00453,4.75358
4,"May,2001-2019",9.36279,4.07747
5,"Jun,2001-2019",3.43021,1.04695
6,"Jul,2001-2019",1.76442,0.195211
7,"Aug,2001-2019",2.81263,0.333632
8,"Sep,2001-2019",3.97889,1.20584
9,"Oct,2001-2019",5.31842,2.45474
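The `'B:D'` string is spreadsheet-letter notation, which `read_excel()` understands. As a rough illustration of the same idea, the sketch below selects columns by position from an in-memory CSV; the sample data and the switch to `read_csv` are assumptions made so the example runs without an Excel file.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the sheet layout: one empty leading column
# and two empty trailing columns around the three useful ones.
raw = (
    ',"Month, period",Lake Victoria,Simiyu,,\n'
    ',"Jan,2001-2019",3.176mm,2.90847,,\n'
    ',"Feb,2001-2019",3.477mm,1.8mm,,\n'
)

# Zero-based positions 1..3 correspond to spreadsheet columns B..D.
df_demo = pd.read_csv(io.StringIO(raw), usecols=[1, 2, 3])

print(df_demo.columns.tolist())
print(df_demo.shape)
```

Selecting columns at read time avoids loading the empty columns at all, instead of dropping them afterwards.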


Two rows are completely empty. I can drop them with the `dropna()` function. By default it removes every row that contains at least one missing value, which includes these two:

In [61]:
df.dropna(inplace=True, axis=0)
df.head(20)

Unnamed: 0,"Month, period",Lake Victoria,Simiyu
0,"Jan,2001-2019",3.176mm,2.90847
1,"Feb,2001-2019",3.477mm,1.8mm
2,"Mar,2001-2019",4.68705,2.98105
3,"Apr,2001-2019",7.00453,4.75358
4,"May,2001-2019",9.36279,4.07747
5,"Jun,2001-2019",3.43021,1.04695
6,"Jul,2001-2019",1.76442,0.195211
7,"Aug,2001-2019",2.81263,0.333632
8,"Sep,2001-2019",3.97889,1.20584
9,"Oct,2001-2019",5.31842,2.45474
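Because `dropna()` defaults to `how='any'`, it can remove more than the fully empty rows. The sketch below, on a small made-up DataFrame, contrasts the default with `how='all'`, which drops a row only when every value in it is missing.

```python
import numpy as np
import pandas as pd

# Made-up data: row 1 and row 3 are fully empty, row 2 is only partly empty.
demo = pd.DataFrame(
    {
        "Lake Victoria": [3.176, np.nan, 4.687, np.nan],
        "Simiyu": [2.908, np.nan, np.nan, np.nan],
    }
)

# how="any" (the default) drops every row with at least one missing value.
any_dropped = demo.dropna(axis=0, how="any")

# how="all" drops only the rows in which every value is missing.
all_dropped = demo.dropna(axis=0, how="all")

print(len(any_dropped), len(all_dropped))
```

If partly filled rows should be kept, `how='all'` is the safer choice; for this dataset the default is enough as long as no row is only partly missing.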


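The first column packs two fields, the month and the period, into a single comma-separated string. I split them with the `str.split()` method, passing `expand=True` so that each part is returned as a separate column: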

In [62]:
splitted_columns = df['Month, period'].str.split(',', expand=True)
splitted_columns

Unnamed: 0,0,1
0,Jan,2001-2019
1,Feb,2001-2019
2,Mar,2001-2019
3,Apr,2001-2019
4,May,2001-2019
5,Jun,2001-2019
6,Jul,2001-2019
7,Aug,2001-2019
8,Sep,2001-2019
9,Oct,2001-2019


In [63]:
df['Month'] = splitted_columns[0]
df['Period'] = splitted_columns[1]
df.head(15)

Unnamed: 0,"Month, period",Lake Victoria,Simiyu,Month,Period
0,"Jan,2001-2019",3.176mm,2.90847,Jan,2001-2019
1,"Feb,2001-2019",3.477mm,1.8mm,Feb,2001-2019
2,"Mar,2001-2019",4.68705,2.98105,Mar,2001-2019
3,"Apr,2001-2019",7.00453,4.75358,Apr,2001-2019
4,"May,2001-2019",9.36279,4.07747,May,2001-2019
5,"Jun,2001-2019",3.43021,1.04695,Jun,2001-2019
6,"Jul,2001-2019",1.76442,0.195211,Jul,2001-2019
7,"Aug,2001-2019",2.81263,0.333632,Aug,2001-2019
8,"Sep,2001-2019",3.97889,1.20584,Sep,2001-2019
9,"Oct,2001-2019",5.31842,2.45474,Oct,2001-2019
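The split and the two assignments can also be written in one step. This is a minimal sketch on made-up rows; the `n=1` argument is an addition not used above, which limits the split to the first comma.

```python
import pandas as pd

demo = pd.DataFrame({"Month, period": ["Jan,2001-2019", "Feb,2001-2019"]})

# n=1 splits on the first comma only, so exactly two columns come back
# even if a later value happened to contain another comma.
demo[["Month", "Period"]] = demo["Month, period"].str.split(",", n=1, expand=True)

print(demo[["Month", "Period"]])
```

Assigning to a list of column names keeps the intermediate DataFrame out of the namespace, at the cost of being slightly less explicit.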


I drop the original `Month, period` column, which is no longer needed:

In [64]:
df.drop('Month, period', axis=1, inplace=True)
df.head(15)

Unnamed: 0,Lake Victoria,Simiyu,Month,Period
0,3.176mm,2.90847,Jan,2001-2019
1,3.477mm,1.8mm,Feb,2001-2019
2,4.68705,2.98105,Mar,2001-2019
3,7.00453,4.75358,Apr,2001-2019
4,9.36279,4.07747,May,2001-2019
5,3.43021,1.04695,Jun,2001-2019
6,1.76442,0.195211,Jul,2001-2019
7,2.81263,0.333632,Aug,2001-2019
8,3.97889,1.20584,Sep,2001-2019
9,5.31842,2.45474,Oct,2001-2019


Some values in the rainfall columns carry the unit suffix `mm`, so I define a function that removes it from strings and leaves other values untouched.

In [None]:
def remove_mm(x):
    # Strip the 'mm' unit from strings; numeric values pass through unchanged.
    if isinstance(x, str):
        return x.replace('mm', '')
    return x

I apply the previous function to the columns `Lake Victoria` and `Simiyu`:

In [65]:
df['Lake Victoria'] = df['Lake Victoria'].apply(remove_mm)
df['Simiyu'] = df['Simiyu'].apply(remove_mm)
df.head(20)

Unnamed: 0,Lake Victoria,Simiyu,Month,Period
0,3.176,2.90847,Jan,2001-2019
1,3.477,1.8,Feb,2001-2019
2,4.68705,2.98105,Mar,2001-2019
3,7.00453,4.75358,Apr,2001-2019
4,9.36279,4.07747,May,2001-2019
5,3.43021,1.04695,Jun,2001-2019
6,1.76442,0.195211,Jul,2001-2019
7,2.81263,0.333632,Aug,2001-2019
8,3.97889,1.20584,Sep,2001-2019
9,5.31842,2.45474,Oct,2001-2019
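As an alternative to `apply()`, the suffix can be removed with pandas string methods. The sketch below assumes a small made-up Series that mixes strings and floats, as the rainfall columns do; it is one possible approach, not the one used in this notebook.

```python
import pandas as pd

# Mixed column: some values are strings with a unit, some are plain floats.
s = pd.Series(["3.176mm", "3.477mm", 4.68705], dtype=object)

# Cast to str first so the .str accessor also handles the float entries,
# then strip the literal suffix (regex=False treats "mm" as plain text).
cleaned = s.astype(str).str.replace("mm", "", regex=False)

print(cleaned.tolist())
```

The result still holds strings, so a numeric conversion is needed afterwards. Note that `astype(str)` would turn a missing value into the text `nan`, which is harmless here because rows with missing values were already dropped.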


Now I check the number of rows and columns:

In [None]:
df.shape

I inspect the type of each column:

In [66]:
df.dtypes

Lake Victoria    object
Simiyu           object
Month            object
Period           object
dtype: object

The `Lake Victoria` and `Simiyu` columns are stored as `object` but should be numeric, so I convert them to float with `pd.to_numeric()`:

In [68]:
df["Lake Victoria"] = pd.to_numeric(df["Lake Victoria"])
df["Simiyu"] = pd.to_numeric(df["Simiyu"])

In [69]:
df.dtypes

Lake Victoria    float64
Simiyu           float64
Month             object
Period            object
dtype: object
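By default `pd.to_numeric()` raises an error on the first value it cannot parse. If the data might still contain stray text, a hedged option is `errors='coerce'`, shown here on a made-up Series with one bad entry.

```python
import pandas as pd

# Made-up values: two parseable numbers and one leftover text entry.
s = pd.Series(["3.176", "1.8", "n/a"])

# errors="coerce" turns unparseable entries into NaN instead of raising.
numbers = pd.to_numeric(s, errors="coerce")

# Rows that failed to parse can then be inspected before deciding what to do.
bad_rows = s[numbers.isna()]

print(numbers.dtype)
print(bad_rows.tolist())
```

Coercing hides problems unless the resulting `NaN` values are checked, so it is worth looking at `bad_rows` before moving on.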

I compute summary statistics for all columns, numeric and non-numeric:

In [None]:
df.describe(include='all')

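Finally, I build a profiling report with `pandas_profiling` and save it as an HTML file. Recent releases of this package are published as `ydata-profiling`, in which case the import becomes `from ydata_profiling import ProfileReport`.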

In [None]:
from pandas_profiling import ProfileReport

profile = ProfileReport(df, title="rainfall")
profile.to_file("rainfall.html") 