# Level 5: Data Cleaning

Data in the real world is often messy. It can have missing values, duplicates, incorrect data types, and inconsistent formatting. Data cleaning is the process of fixing these issues to prepare the data for analysis. It's often the most time-consuming part of a data science project.

In [1]:
import pandas as pd
import numpy as np

Let's create a messy dataset to work with.

In [2]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Alice', 'Eva', 'Frank', 'Grace', 'Henry', 'Ivy'],
    'Age': [25, 30, np.nan, 40, 25, 22, 45, np.nan, 50, 22],
    'City': ['NY', 'LA', 'Chicago', 'NY', 'NY', 'LA', 'Chicago', 'Boston', 'LA', 'Boston'],
    'JoinDate': ['2022-01-10', '2021-11-20', '2022-03-15', '2020-07-30', '2022-01-10', '2023-05-25', '2019-08-01', '2022-09-12', np.nan, '2023-05-25'],
    'Salary': [' $70,000 ', '$80,000', '$65,000', '$90,000', ' $70,000 ', '$75,000', '$120,000', '$68,000', '$130,000', '75000'],
    'Department': ['HR', 'Engineering', 'Sales', 'Engineering', 'HR', 'Marketing', 'Sales', 'HR', 'Engineering', 'Marketing']
}
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,City,JoinDate,Salary,Department
0,Alice,25.0,NY,2022-01-10,"$70,000",HR
1,Bob,30.0,LA,2021-11-20,"$80,000",Engineering
2,Charlie,,Chicago,2022-03-15,"$65,000",Sales
3,David,40.0,NY,2020-07-30,"$90,000",Engineering
4,Alice,25.0,NY,2022-01-10,"$70,000",HR
5,Eva,22.0,LA,2023-05-25,"$75,000",Marketing
6,Frank,45.0,Chicago,2019-08-01,"$120,000",Sales
7,Grace,,Boston,2022-09-12,"$68,000",HR
8,Henry,50.0,LA,,"$130,000",Engineering
9,Ivy,22.0,Boston,2023-05-25,75000,Marketing


## 5.1 Handling Missing Data

### Detecting Missing Data

In [3]:
# Check for missing values
df.isna()

Unnamed: 0,Name,Age,City,JoinDate,Salary,Department
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,True,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,True,False,False,False,False
8,False,False,False,True,False,False
9,False,False,False,False,False,False


In [4]:
# Get a count of missing values per column
df.isna().sum()

Name          0
Age           2
City          0
JoinDate      1
Salary        0
Department    0
dtype: int64
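
Beyond counting, it often helps to look at the affected rows and at the share of missing values per column. The following is a minimal sketch on a small stand-in frame; `sample`, `rows_with_missing`, and `missing_share` are names chosen for this illustration, not part of the dataset above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data (a reduced version of the messy dataset)
sample = pd.DataFrame({
    'Name': ['Alice', 'Charlie', 'Henry'],
    'Age': [25, np.nan, 50],
    'JoinDate': ['2022-01-10', '2022-03-15', np.nan],
})

# True for every row that has at least one missing value
has_missing = sample.isna().any(axis=1)
rows_with_missing = sample[has_missing]

# The mean of a boolean mask is the fraction of missing values per column
missing_share = sample.isna().mean()

print(rows_with_missing)
print(missing_share)
```

The same two expressions should work unchanged on `df`.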

### Dropping Missing Data (`.dropna()`)

In [5]:
# Drop rows with any missing values
df.dropna()

Unnamed: 0,Name,Age,City,JoinDate,Salary,Department
0,Alice,25.0,NY,2022-01-10,"$70,000",HR
1,Bob,30.0,LA,2021-11-20,"$80,000",Engineering
3,David,40.0,NY,2020-07-30,"$90,000",Engineering
4,Alice,25.0,NY,2022-01-10,"$70,000",HR
5,Eva,22.0,LA,2023-05-25,"$75,000",Marketing
6,Frank,45.0,Chicago,2019-08-01,"$120,000",Sales
9,Ivy,22.0,Boston,2023-05-25,75000,Marketing


In [6]:
# Drop columns with any missing values
df.dropna(axis='columns')

Unnamed: 0,Name,City,Salary,Department
0,Alice,NY,"$70,000",HR
1,Bob,LA,"$80,000",Engineering
2,Charlie,Chicago,"$65,000",Sales
3,David,NY,"$90,000",Engineering
4,Alice,NY,"$70,000",HR
5,Eva,LA,"$75,000",Marketing
6,Frank,Chicago,"$120,000",Sales
7,Grace,Boston,"$68,000",HR
8,Henry,LA,"$130,000",Engineering
9,Ivy,Boston,75000,Marketing
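
Dropping every row or column that contains any missing value can discard a lot of data. As a hedged sketch, `.dropna()` also accepts `subset` (only look at certain columns) and `thresh` (keep rows with at least that many non-missing values). The frame below is hypothetical illustration data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data
sample = pd.DataFrame({
    'Name': ['Alice', 'Charlie', 'Henry', 'Zed'],
    'Age': [25, np.nan, 50, np.nan],
    'JoinDate': ['2022-01-10', '2022-03-15', np.nan, np.nan],
})

# Drop a row only if 'Age' is missing; other columns are ignored
age_required = sample.dropna(subset=['Age'])

# Keep rows that have at least 2 non-missing values
at_least_two = sample.dropna(thresh=2)

print(age_required)
print(at_least_two)
```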


### Filling Missing Data (`.fillna()`)

In [7]:
# Fill missing 'Age' with the mean age
mean_age = df['Age'].mean()
df['Age'].fillna(mean_age)

0    25.000
1    30.000
2    32.375
3    40.000
4    25.000
5    22.000
6    45.000
7    32.375
8    50.000
9    22.000
Name: Age, dtype: float64
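
Note that `.fillna()` returns a new object and leaves the original unchanged, so the result has to be assigned back to keep it. The sketch below also shows passing a dictionary to fill several columns at once, and uses the median instead of the mean because it is less sensitive to outliers. The data and the `'Unknown'` placeholder are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with one outlier in 'Age'
sample = pd.DataFrame({
    'Age': [25, 30, np.nan, 40, 100],
    'City': ['NY', None, 'LA', 'NY', 'LA'],
})

# One fill value per column; the result is a new DataFrame
filled = sample.fillna({
    'Age': sample['Age'].median(),
    'City': 'Unknown',
})

print(filled)
print(sample['Age'].isna().sum())  # the original still has its missing value
```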

In [8]:
# Forward-fill: propagate the last valid observation forward
df['JoinDate'].ffill()


0    2022-01-10
1    2021-11-20
2    2022-03-15
3    2020-07-30
4    2022-01-10
5    2023-05-25
6    2019-08-01
7    2022-09-12
8    2022-09-12
9    2023-05-25
Name: JoinDate, dtype: object

### Interpolation (`.interpolate()`)
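By default, `.interpolate()` fills each missing value linearly between its neighboring valid values. This mainly makes sense when the row order is meaningful, for example in a time series.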

In [9]:
df['Age'].interpolate()

0    25.0
1    30.0
2    35.0
3    40.0
4    25.0
5    22.0
6    45.0
7    47.5
8    50.0
9    22.0
Name: Age, dtype: float64
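
To make the arithmetic concrete: with the default `method='linear'`, the values are treated as equally spaced, so a gap is filled in equal steps between its two neighbors. That is why the output above shows 35.0 between 30.0 and 40.0. A minimal sketch with a two-value gap, on made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap of two missing values
s = pd.Series([10.0, np.nan, np.nan, 40.0])

# Linear interpolation: the step is (40 - 10) / 3 = 10 per position
filled = s.interpolate()

print(filled.tolist())
```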

## 5.2 Handling Duplicates

### Finding Duplicates (`.duplicated()`)

In [10]:
# Check for duplicate rows
df.duplicated()

0    False
1    False
2    False
3    False
4     True
5    False
6    False
7    False
8    False
9    False
dtype: bool
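
By default `.duplicated()` compares whole rows and flags only the second and later copies. As a hedged sketch on made-up data, `keep=False` flags every copy, and `subset` restricts the comparison to selected columns:

```python
import pandas as pd

# Hypothetical stand-in data
sample = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Alice'],
    'City': ['NY', 'LA', 'NY', 'Boston'],
})

# Flag all rows that have an exact copy, including the first occurrence
all_copies = sample.duplicated(keep=False)

# Flag repeats of 'Name' only, regardless of the other columns
same_name = sample.duplicated(subset=['Name'])

print(all_copies.tolist())
print(same_name.tolist())
```

`.drop_duplicates()` accepts the same `subset` and `keep` parameters.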

### Removing Duplicates (`.drop_duplicates()`)

In [11]:
df.drop_duplicates()

Unnamed: 0,Name,Age,City,JoinDate,Salary,Department
0,Alice,25.0,NY,2022-01-10,"$70,000",HR
1,Bob,30.0,LA,2021-11-20,"$80,000",Engineering
2,Charlie,,Chicago,2022-03-15,"$65,000",Sales
3,David,40.0,NY,2020-07-30,"$90,000",Engineering
5,Eva,22.0,LA,2023-05-25,"$75,000",Marketing
6,Frank,45.0,Chicago,2019-08-01,"$120,000",Sales
7,Grace,,Boston,2022-09-12,"$68,000",HR
8,Henry,50.0,LA,,"$130,000",Engineering
9,Ivy,22.0,Boston,2023-05-25,75000,Marketing


In [12]:
# Keep the last occurrence instead of the first
df.drop_duplicates(keep='last')

Unnamed: 0,Name,Age,City,JoinDate,Salary,Department
1,Bob,30.0,LA,2021-11-20,"$80,000",Engineering
2,Charlie,,Chicago,2022-03-15,"$65,000",Sales
3,David,40.0,NY,2020-07-30,"$90,000",Engineering
4,Alice,25.0,NY,2022-01-10,"$70,000",HR
5,Eva,22.0,LA,2023-05-25,"$75,000",Marketing
6,Frank,45.0,Chicago,2019-08-01,"$120,000",Sales
7,Grace,,Boston,2022-09-12,"$68,000",HR
8,Henry,50.0,LA,,"$130,000",Engineering
9,Ivy,22.0,Boston,2023-05-25,75000,Marketing


## 5.3 Data Type Conversion

### `.astype()`
Convert a column to a specific data type.

In [13]:
# The 'Age' column is float because NaN forces a float dtype. Fill the NaNs first
# (0 is only a placeholder here and would skew statistics), then convert to int.
df_cleaned = df.copy()
df_cleaned['Age'] = df_cleaned['Age'].fillna(0).astype(int)
df_cleaned.dtypes

Name          object
Age            int64
City          object
JoinDate      object
Salary        object
Department    object
dtype: object
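
If a placeholder such as 0 is not acceptable, one alternative is pandas' nullable integer dtype `'Int64'` (capital "I"), which stores integers alongside missing values. A plain `astype(int)` fails on NaN. This is a sketch on made-up values, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value (float64 because of NaN)
ages = pd.Series([25, 30, np.nan, 40])

# Nullable integer dtype: the missing entry becomes <NA> instead of a fake number
ages_nullable = ages.astype('Int64')

print(ages_nullable)
print(ages_nullable.dtype)
```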

### `pd.to_datetime()`
Convert a column of strings to datetime objects.

In [14]:
df_cleaned['JoinDate'] = pd.to_datetime(df_cleaned['JoinDate'])
df_cleaned.dtypes

Name                  object
Age                    int64
City                  object
JoinDate      datetime64[ns]
Salary                object
Department            object
dtype: object
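
Real date columns often contain entries that cannot be parsed. A minimal sketch, assuming ISO-style dates: passing an explicit `format` makes the expected layout unambiguous, and `errors='coerce'` turns unparseable entries into `NaT` instead of raising. The sample values are invented for illustration.

```python
import pandas as pd

# Hypothetical raw strings, including one invalid entry and one missing value
raw = pd.Series(['2022-01-10', 'not a date', None])

# Invalid or missing entries become NaT rather than raising an error
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

print(parsed)
```

After coercion, `parsed.isna()` shows which entries failed to parse.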

### `pd.to_numeric()`
Convert a column to a numeric type. The `errors` parameter controls what happens to values that cannot be parsed.

In [15]:
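# The 'Salary' column holds strings with '$', ',' and stray whitespace. Remove those before converting.
# regex=False treats '$' as a literal character rather than the regex end-of-string anchor.
salary_cleaned = df['Salary'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).str.strip()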
df_cleaned['Salary'] = pd.to_numeric(salary_cleaned)
df_cleaned.dtypes

Name                  object
Age                    int64
City                  object
JoinDate      datetime64[ns]
Salary                 int64
Department            object
dtype: object
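
The conversion above succeeds because every cleaned value is a valid number. As a sketch of the error handling on invented values: by default (`errors='raise'`) an unparseable string raises a `ValueError`, while `errors='coerce'` replaces it with `NaN`.

```python
import pandas as pd

# Hypothetical raw values with one entry that is not a number
raw = pd.Series(['70000', '80000.5', 'n/a'])

# Default behaviour: the invalid entry raises
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True

# Coerce: the invalid entry becomes NaN and the rest is converted
numbers = pd.to_numeric(raw, errors='coerce')

print(raised)
print(numbers)
```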

## 5.4 String Operations (`.str`)

The `.str` accessor provides a host of vectorized string methods.

In [16]:
df['Name'].str.lower()

0      alice
1        bob
2    charlie
3      david
4      alice
5        eva
6      frank
7      grace
8      henry
9        ivy
Name: Name, dtype: object

In [17]:
df['Salary'].str.strip()

0     $70,000
1     $80,000
2     $65,000
3     $90,000
4     $70,000
5     $75,000
6    $120,000
7     $68,000
8    $130,000
9       75000
Name: Salary, dtype: object

In [18]:
# Check which departments contain 'Eng'
df['Department'].str.contains('Eng')

0    False
1     True
2    False
3     True
4    False
5    False
6    False
7    False
8     True
9    False
Name: Department, dtype: bool

In [19]:
# Replace 'HR' with 'Human Resources'
df['Department'].str.replace('HR', 'Human Resources')

0    Human Resources
1        Engineering
2              Sales
3        Engineering
4    Human Resources
5          Marketing
6              Sales
7    Human Resources
8        Engineering
9          Marketing
Name: Department, dtype: object
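
`.str.replace()` can treat its pattern either as a literal string or as a regular expression, and the default has changed between pandas versions, so passing `regex` explicitly avoids surprises. A minimal sketch on made-up values: a character class removes several unwanted characters in one pass, while `regex=False` keeps a character like `.` literal.

```python
import pandas as pd

# Hypothetical salary strings
salaries = pd.Series([' $70,000 ', '$80,000', '75000'])

# Regex: remove every '$', ',' and whitespace character in one call
cleaned = salaries.str.replace(r'[$,\s]', '', regex=True)

# Literal: '.' matches only an actual dot, not "any character"
versions = pd.Series(['v1.0', 'v1x0'])
dashed = versions.str.replace('.', '-', regex=False)

print(cleaned.tolist())
print(dashed.tolist())
```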

### Splitting Columns

In [20]:
# Let's say we have a column 'FullName'
df_split = pd.DataFrame({'FullName': ['John Smith', 'Jane Doe']})
df_split[['FirstName', 'LastName']] = df_split['FullName'].str.split(' ', expand=True)
df_split

Unnamed: 0,FullName,FirstName,LastName
0,John Smith,John,Smith
1,Jane Doe,Jane,Doe
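
The two-column assignment above works because every name has exactly two parts. As a hedged sketch for less regular data, `n=1` limits the split to the first space, so there are never more than two resulting columns. The names and the `Rest` column are invented for this example.

```python
import pandas as pd

# Hypothetical names with two, three, and one part
people = pd.DataFrame({'FullName': ['John Smith', 'Mary Ann Lee', 'Cher']})

# Split at the first space only; names without a space get a missing second part
parts = people['FullName'].str.split(' ', n=1, expand=True)
people[['FirstName', 'Rest']] = parts

print(people)
```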
