# Data Cleaning and Preprocessing

Raw data often contains problems such as inconsistencies, missing values and duplicate entries. Tackling these issues makes the data easier to analyse. In this tutorial, we will explore the common techniques used in pandas to clean and preprocess data: handling missing values, detecting and removing duplicates, converting data types, working with categorical and string data, and performing basic data transformations.

## Identify and handle missing data

In [2]:
import pandas as pd
from pathlib import Path

data = pd.read_excel(Path.cwd() / "data/Financial Sample.xlsx", header=2)

# Print rows that contain null values
null_rows = data[data.isnull().any(axis=1)]

print(null_rows)

             Segment                   Country    Product Discount Band  \
0         Government                    Canada  Carretera           NaN   
1         Government                   Germany  Carretera           NaN   
2          Midmarket                    France  Carretera           NaN   
3          Midmarket                   Germany  Carretera           NaN   
4          Midmarket                    Mexico  Carretera           NaN   
5         Government                   Germany  Carretera           NaN   
6          Midmarket                   Germany    Montana           NaN   
7   Channel Partners                    Canada    Montana           NaN   
8         Government                    France    Montana           NaN   
9   Channel Partners                   Germany    Montana           NaN   
10         Midmarket                    Mexico    Montana           NaN   
11        Enterprise                    Canada    Montana           NaN   
12    Small Business     

In [5]:
# Print columns that contain null values
null_columns = data.columns[data.isnull().any()].tolist()

print(null_columns)

['Discount Band']
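
Beyond knowing which columns contain nulls, it often helps to know how many each one holds. The sketch below uses a small made-up DataFrame (not the Financial Sample file) and counts nulls per column with isnull().sum().

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real spreadsheet
df = pd.DataFrame({
    "Product": ["Carretera", "Montana", "Paseo"],
    "Discount Band": [np.nan, np.nan, "Low"],
})

# isnull() gives a True/False table; sum() counts the True values per column
null_counts = df.isnull().sum()

print(null_counts)
```

Columns with a count of 0 have no missing values, so the counts also show where cleaning effort is needed most.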


You can drop rows that contain null values using dropna().

In [14]:
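# Drop every row that contains at least one null value
data_no_nulls = data.dropna()

# Check that no rows with nulls remain (this should print an empty DataFrame)
null_rows = data_no_nulls[data_no_nulls.isnull().any(axis=1)]

print(null_rows)
print(data_no_nulls)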

Empty DataFrame
Columns: [Segment, Country, Product, Discount Band, Units Sold, Manufacturing Price, Sale Price, Gross Sales, Discounts,  Sales, COGS, Profit, Date, Month Number, Month Name, Year]
Index: []
              Segment                   Country   Product Discount Band  \
53         Government                    France     Paseo           Low   
54          Midmarket                    France     Paseo           Low   
55         Government                    France     Paseo           Low   
56         Government                    France      Velo           Low   
57         Government                    Canada       VTT           Low   
..                ...                       ...       ...           ...   
695    Small Business                    France  Amarilla          High   
696    Small Business                    Mexico  Amarilla          High   
697        Government                    Mexico   Montana          High   
698        Government                    Ca

It is often useful to provide a threshold for dropna() through the thresh parameter. This is the minimum number of non-null values a row must contain to be kept; rows with fewer are dropped. Often when reading in xlsx files, pandas will include notes that sit below the actual data, so dropna() is very useful for getting rid of these mostly empty rows.
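
As a minimal sketch of the thresh parameter, the example below uses a made-up DataFrame (the column values and the notes row are hypothetical). In dropna(), thresh is the minimum number of non-null values a row needs in order to be kept.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one complete row, one row with a single gap,
# and a trailing "notes" row like those found below spreadsheet tables
df = pd.DataFrame({
    "Segment": ["Government", "Enterprise", "Note: figures are provisional"],
    "Units Sold": [1618.5, np.nan, np.nan],
    "Profit": [16185.0, 100.0, np.nan],
})

# Plain dropna() removes any row with at least one null
strict = df.dropna()

# thresh=2 keeps rows that have at least 2 non-null values,
# so only the mostly empty notes row is removed
cleaned = df.dropna(thresh=2)

print(strict)
print(cleaned)
```

Choosing thresh relative to the number of columns lets you remove stray note rows while keeping data rows that have only a few gaps.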

There are scenarios where, instead of removing these entries completely, you wish to replace the null values with something else. You can do so using fillna(). Here the null values relate to discounts, so it is reasonable to replace null with 0. Note that fillna(0) replaces every null in the DataFrame, whichever column it is in.

In [18]:
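# Reload the data, then replace every null value with 0
data = pd.read_excel(Path.cwd() / "data/Financial Sample.xlsx", header=2)
data = data.fillna(0)

# Print the rows where the Discounts column is 0
print(data[data["Discounts"] == 0])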

             Segment                   Country    Product Discount Band  \
0         Government                    Canada  Carretera             0   
1         Government                   Germany  Carretera             0   
2          Midmarket                    France  Carretera             0   
3          Midmarket                   Germany  Carretera             0   
4          Midmarket                    Mexico  Carretera             0   
5         Government                   Germany  Carretera             0   
6          Midmarket                   Germany    Montana             0   
7   Channel Partners                    Canada    Montana             0   
8         Government                    France    Montana             0   
9   Channel Partners                   Germany    Montana             0   
10         Midmarket                    Mexico    Montana             0   
11        Enterprise                    Canada    Montana             0   
12    Small Business     
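
If different columns need different replacements, fillna() also accepts a dictionary mapping column names to fill values. The sketch below uses made-up data; the "None" label and the null in Discounts are assumptions for illustration, not values taken from the Financial Sample file.

```python
import numpy as np
import pandas as pd

# Hypothetical data with nulls in a text column and a numeric column
df = pd.DataFrame({
    "Discount Band": [np.nan, "Low", np.nan],
    "Discounts": [np.nan, 276.15, 0.0],
})

# Fill each column with a value that matches its type:
# a text label for the band, a number for the amount
filled = df.fillna({"Discount Band": "None", "Discounts": 0})

print(filled)
```

Filling per column avoids mixing numbers into a text column, which can cause surprises when you later group or sort by it.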

A common situation I have encountered is a table with columns for year and quarter where, because the data is in order, only the Q1 row has a year. In this scenario we want to do a forward fill using ffill().

In [19]:
data = {
    'Year': [2000, None, None, None, 2001, None, None, None, 2002, None, None, None, 2003, None, None],
    'Quarter': ['Q1', 'Q2', 'Q3', 'Q4', 'Q1', 'Q2', 'Q3', 'Q4', 'Q1', 'Q2', 'Q3', 'Q4', 'Q1', 'Q2', 'Q3']
}
df = pd.DataFrame(data)

print(df)

      Year Quarter
0   2000.0      Q1
1      NaN      Q2
2      NaN      Q3
3      NaN      Q4
4   2001.0      Q1
5      NaN      Q2
6      NaN      Q3
7      NaN      Q4
8   2002.0      Q1
9      NaN      Q2
10     NaN      Q3
11     NaN      Q4
12  2003.0      Q1
13     NaN      Q2
14     NaN      Q3


In [20]:
df_filled = df.ffill()

print(df_filled)

      Year Quarter
0   2000.0      Q1
1   2000.0      Q2
2   2000.0      Q3
3   2000.0      Q4
4   2001.0      Q1
5   2001.0      Q2
6   2001.0      Q3
7   2001.0      Q4
8   2002.0      Q1
9   2002.0      Q2
10  2002.0      Q3
11  2002.0      Q4
12  2003.0      Q1
13  2003.0      Q2
14  2003.0      Q3
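
The Year column prints as 2000.0 rather than 2000 because a column containing NaN is stored as floats. Once the gaps are filled, the column can be cast back to integers. This is a minimal sketch on a shortened, made-up version of the table above.

```python
import pandas as pd

# Shortened, hypothetical version of the year/quarter table
df = pd.DataFrame({
    "Year": [2000, None, None, None, 2001, None],
    "Quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2"],
})

# Forward fill the missing years, then convert the float column to integers
df_filled = df.ffill()
df_filled["Year"] = df_filled["Year"].astype(int)

print(df_filled)
```

The cast only works after filling, because astype(int) raises an error if the column still contains NaN.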


Similarly, there are other fill functions such as bfill(), which fills each gap with the next valid value instead of the previous one. Please see the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.bfill.html
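
As a rough illustration of bfill(), suppose the year were recorded only on the Q4 row instead of the Q1 row. This layout is a made-up variation on the example above. A backward fill then copies each year up into the preceding empty rows.

```python
import pandas as pd

# Hypothetical layout: the year appears only on the last quarter of each year
df = pd.DataFrame({
    "Year": [None, None, None, 2000, None, None, None, 2001],
    "Quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
})

# bfill() replaces each null with the next valid value below it
df_back = df.bfill()

print(df_back)
```

Whether ffill() or bfill() is appropriate depends on where the known value sits relative to the gaps, so it is worth checking the source layout before choosing.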

## Identify and handle duplicate data

## Identify and handle data types

## String Operations