# Data Cleaning in Jupyter Notebook for Data Engineers
In this Jupyter Notebook, we will practice data cleaning techniques on a sample dataset. We'll focus on handling missing data, either by removing it or replacing it with constructed values. To follow along, create a new Jupyter Notebook in your preferred environment and copy the provided code snippets.

We'll use the popular Python libraries Pandas and NumPy for this exercise. Make sure to install them if you haven't already:


In [1]:

%pip install pandas numpy




### 1. Importing Libraries

In [2]:

import pandas as pd
import numpy as np

### 2. Loading the Sample Dataset

We will create a sample dataset with some missing values for this exercise. You can replace this with your own dataset if you prefer.


In [3]:


data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
        'age': [24, np.nan, 22, 29, 25, np.nan, 30],
        'city': ['New York', 'Los Angeles', 'Chicago', np.nan, 'San Francisco', 'Seattle', 'Austin'],
        'income': [50000, 48000, np.nan, 53000, 45000, 55000, np.nan]}

df = pd.DataFrame(data)
df

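|   | name | age | city | income |
|---|---|---|---|---|
| 0 | Alice | 24.0 | New York | 50000.0 |
| 1 | Bob | NaN | Los Angeles | 48000.0 |
| 2 | Charlie | 22.0 | Chicago | NaN |
| 3 | David | 29.0 | NaN | 53000.0 |
| 4 | Eve | 25.0 | San Francisco | 45000.0 |
| 5 | Frank | NaN | Seattle | 55000.0 |
| 6 | Grace | 30.0 | Austin | NaN |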
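Before removing or filling anything, it helps to measure how much data is actually missing. The following is a minimal sketch of one way to do that with `isna()`; the variable names (`missing_counts`, `missing_share`, `rows_with_missing`) are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same sample data as in the cell above.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'age': [24, np.nan, 22, 29, 25, np.nan, 30],
    'city': ['New York', 'Los Angeles', 'Chicago', np.nan, 'San Francisco', 'Seattle', 'Austin'],
    'income': [50000, 48000, np.nan, 53000, 45000, 55000, np.nan],
})

# Number of missing values in each column.
missing_counts = df.isna().sum()

# Share of missing values in each column (0.0 to 1.0).
missing_share = df.isna().mean()

# Number of rows that contain at least one missing value.
rows_with_missing = int(df.isna().any(axis=1).sum())

print(missing_counts)
print(missing_share.round(2))
print(rows_with_missing)
```

These numbers give a baseline for judging how much information each of the exercises below throws away or invents.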


### 3. Removing Missing Data
#### Exercise 1: Remove rows with missing data
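Use the dropna() method to remove every row that contains at least one missing value. In this dataset only Alice and Eve have complete records, so five of the seven rows are dropped.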


In [4]:


df_cleaned = df.dropna()
df_cleaned

|   | name | age | city | income |
|---|---|---|---|---|
| 0 | Alice | 24.0 | New York | 50000.0 |
| 4 | Eve | 25.0 | San Francisco | 45000.0 |
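Dropping every incomplete row can be more aggressive than necessary. As a hedged sketch, `dropna()` also accepts `subset` (only look at certain columns) and `thresh` (minimum number of non-missing values a row must have). Which columns count as required is an assumption made here for illustration.

```python
import numpy as np
import pandas as pd

# Same sample data as in the cell above.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'age': [24, np.nan, 22, 29, 25, np.nan, 30],
    'city': ['New York', 'Los Angeles', 'Chicago', np.nan, 'San Francisco', 'Seattle', 'Austin'],
    'income': [50000, 48000, np.nan, 53000, 45000, 55000, np.nan],
})

# Assumption for illustration: only age and income are required fields.
# Rows missing either of them are dropped; a missing city is tolerated.
df_required = df.dropna(subset=['age', 'income'])

# Keep rows that have at least 3 non-missing values out of 4 columns.
df_mostly_complete = df.dropna(thresh=3)

print(df_required)
print(len(df_mostly_complete))
```

With `subset`, David is kept even though his city is missing, because the check only covers the two listed columns.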


#### Exercise 2: Remove columns with missing data
Use the dropna() method with axis=1 to remove every column that contains at least one missing value. Here age, city, and income each have gaps, so only name remains.

In [5]:

df_cleaned = df.dropna(axis=1)
df_cleaned

|   | name |
|---|---|
| 0 | Alice |
| 1 | Bob |
| 2 | Charlie |
| 3 | David |
| 4 | Eve |
| 5 | Frank |
| 6 | Grace |
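Since a single gap is enough to remove a column, a softer rule is often to drop a column only when too much of it is missing. The sketch below assumes a cutoff of 25%, stored in the hypothetical variable `max_missing_share`; the right cutoff depends on your data.

```python
import numpy as np
import pandas as pd

# Same sample data as in the cell above.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'age': [24, np.nan, 22, 29, 25, np.nan, 30],
    'city': ['New York', 'Los Angeles', 'Chicago', np.nan, 'San Francisco', 'Seattle', 'Austin'],
    'income': [50000, 48000, np.nan, 53000, 45000, 55000, np.nan],
})

# Hypothetical cutoff: drop a column if more than 25% of it is missing.
max_missing_share = 0.25

# Fraction of missing values per column.
missing_share = df.isna().mean()

# Keep only the columns at or below the cutoff.
cols_to_keep = missing_share[missing_share <= max_missing_share].index
df_trimmed = df[cols_to_keep]

print(df_trimmed)
```

Here age and income (2 of 7 values missing, about 29%) are dropped, while city (1 of 7, about 14%) is kept with its gap still in place.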


### 4. Replacing Missing Data with Constructed Values
#### Exercise 3: Replace missing values with a constant
Use the fillna() method to replace every missing value with a constant (e.g., 0). The same constant is applied to all columns, so the missing city also becomes 0.


In [6]:

df_filled = df.fillna(0)
df_filled

|   | name | age | city | income |
|---|---|---|---|---|
| 0 | Alice | 24.0 | New York | 50000.0 |
| 1 | Bob | 0.0 | Los Angeles | 48000.0 |
| 2 | Charlie | 22.0 | Chicago | 0.0 |
| 3 | David | 29.0 | 0 | 53000.0 |
| 4 | Eve | 25.0 | San Francisco | 45000.0 |
| 5 | Frank | 0.0 | Seattle | 55000.0 |
| 6 | Grace | 30.0 | Austin | 0.0 |
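A single constant rarely suits every column: a numeric 0 in a text column such as city mixes types. One possible alternative, sketched here, is to pass `fillna()` a dictionary that maps each column to its own constant. The placeholder values (`0` and `'Unknown'`) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Same sample data as in the cell above.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'age': [24, np.nan, 22, 29, 25, np.nan, 30],
    'city': ['New York', 'Los Angeles', 'Chicago', np.nan, 'San Francisco', 'Seattle', 'Austin'],
    'income': [50000, 48000, np.nan, 53000, 45000, 55000, np.nan],
})

# Illustrative placeholders: one constant per column.
fill_values = {'age': 0, 'city': 'Unknown', 'income': 0}

# Columns not listed in the dictionary are left untouched.
df_filled = df.fillna(value=fill_values)

print(df_filled)
```

Keep in mind that a placeholder such as 0 for age or income will distort later statistics, so it is usually worth filtering or flagging these rows before aggregating.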


#### Exercise 4: Replace missing values using forward fill
Use the ffill() method to fill each missing value with the previous value in the same column. Older tutorials write this as fillna(method='ffill'), but the method argument is deprecated since pandas 2.1.

In [7]:

df_filled = df.ffill()
df_filled

|   | name | age | city | income |
|---|---|---|---|---|
| 0 | Alice | 24.0 | New York | 50000.0 |
| 1 | Bob | 24.0 | Los Angeles | 48000.0 |
| 2 | Charlie | 22.0 | Chicago | 48000.0 |
| 3 | David | 29.0 | Chicago | 53000.0 |
| 4 | Eve | 25.0 | San Francisco | 45000.0 |
| 5 | Frank | 25.0 | Seattle | 55000.0 |
| 6 | Grace | 30.0 | Austin | 55000.0 |
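Forward fill copies a value across a gap of any length, which can stretch one observation over many rows. The sketch below uses a small hypothetical series of readings (not part of the sample dataset) to show how the `limit` parameter caps the number of consecutive values filled.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with a three-row gap.
readings = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0], name='reading')

# Unlimited forward fill: the value 1.0 is copied into all three gaps.
filled_all = readings.ffill()

# Limited forward fill: only the first missing value after 1.0 is filled.
filled_limited = readings.ffill(limit=1)

print(filled_all.tolist())
print(filled_limited.tolist())
```

Forward fill is most defensible when row order carries meaning, for example in time-ordered data. In the sample dataset the rows are unrelated people, so copying Alice's age to Bob is only a mechanical demonstration.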


#### Exercise 5: Replace missing values using backward fill
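Use the bfill() method to fill each missing value with the next value in the same column. A missing value in the last row has no later value to copy, so Grace's income stays missing. Older tutorials write this as fillna(method='bfill'), but the method argument is deprecated since pandas 2.1.


In [8]:

df_filled = df.bfill()
df_filled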

|   | name | age | city | income |
|---|---|---|---|---|
| 0 | Alice | 24.0 | New York | 50000.0 |
| 1 | Bob | 22.0 | Los Angeles | 48000.0 |
| 2 | Charlie | 22.0 | Chicago | 53000.0 |
| 3 | David | 29.0 | San Francisco | 53000.0 |
| 4 | Eve | 25.0 | San Francisco | 45000.0 |
| 5 | Frank | 30.0 | Seattle | 55000.0 |
| 6 | Grace | 30.0 | Austin | NaN |
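In the output above, the income in the last row is still missing because there is no later row to copy from. One common way to close such edge gaps, sketched here as an option rather than a recommendation, is to chain a backward fill with a forward fill.

```python
import numpy as np
import pandas as pd

# Same sample data as in the cell above.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'age': [24, np.nan, 22, 29, 25, np.nan, 30],
    'city': ['New York', 'Los Angeles', 'Chicago', np.nan, 'San Francisco', 'Seattle', 'Austin'],
    'income': [50000, 48000, np.nan, 53000, 45000, 55000, np.nan],
})

# Backward fill first, then forward fill whatever is still missing
# (here: the income in the last row, which has no later value).
df_filled = df.bfill().ffill()

print(df_filled)
print(int(df_filled.isna().sum().sum()))
```

The order of the two calls matters: `df.ffill().bfill()` would fill the same cells with different values, so pick the direction that matches how your data is ordered.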


#### Exercise 6: Replace missing values with the mean
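Use the fillna() method in combination with mean() to replace missing values with the mean of the corresponding column. Passing numeric_only=True restricts the calculation to age and income, so the missing city is left unfilled.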


In [10]:

df_filled = df.fillna(df.mean(numeric_only=True))
df_filled

|   | name | age | city | income |
|---|---|---|---|---|
| 0 | Alice | 24.0 | New York | 50000.0 |
| 1 | Bob | 26.0 | Los Angeles | 48000.0 |
| 2 | Charlie | 22.0 | Chicago | 50200.0 |
| 3 | David | 29.0 | NaN | 53000.0 |
| 4 | Eve | 25.0 | San Francisco | 45000.0 |
| 5 | Frank | 26.0 | Seattle | 55000.0 |
| 6 | Grace | 30.0 | Austin | 50200.0 |
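Once a gap is filled with the mean, the filled value looks like any other observation. A possible safeguard, sketched below, is to record which values were missing before imputing. The `_was_missing` column suffix is a naming assumption made for this example.

```python
import numpy as np
import pandas as pd

# Same sample data as in the cell above.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'age': [24, np.nan, 22, 29, 25, np.nan, 30],
    'city': ['New York', 'Los Angeles', 'Chicago', np.nan, 'San Francisco', 'Seattle', 'Austin'],
    'income': [50000, 48000, np.nan, 53000, 45000, 55000, np.nan],
})

numeric_cols = ['age', 'income']
df_flagged = df.copy()

# Record where values were missing before they are overwritten.
for col in numeric_cols:
    df_flagged[f'{col}_was_missing'] = df[col].isna()

# Fill each numeric column with its own mean (computed on the original data).
df_flagged[numeric_cols] = df_flagged[numeric_cols].fillna(df[numeric_cols].mean())

print(df_flagged)
```

Downstream consumers can then filter on the flag columns to separate observed values from imputed ones.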


#### Exercise 7: Replace missing values with the median
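Use the fillna() method in combination with median() to replace missing values with the median of the corresponding column. Passing numeric_only=True restricts the calculation to age and income, so the missing city is left unfilled.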


In [12]:

df_filled = df.fillna(df.median(numeric_only=True))
df_filled

|   | name | age | city | income |
|---|---|---|---|---|
| 0 | Alice | 24.0 | New York | 50000.0 |
| 1 | Bob | 25.0 | Los Angeles | 48000.0 |
| 2 | Charlie | 22.0 | Chicago | 50000.0 |
| 3 | David | 29.0 | NaN | 53000.0 |
| 4 | Eve | 25.0 | San Francisco | 45000.0 |
| 5 | Frank | 25.0 | Seattle | 55000.0 |
| 6 | Grace | 30.0 | Austin | 50000.0 |
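In the sample dataset the mean and median are close, so the choice between them barely matters. The sketch below uses a hypothetical income series with one extreme value (not taken from the notebook's data) to illustrate why the median is often preferred for skewed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes: one missing value and one extreme outlier.
incomes = pd.Series([45000, 48000, np.nan, 50000, 53000, 1_000_000], name='income')

# Both statistics skip the missing value by default.
mean_value = incomes.mean()      # pulled upward by the single outlier
median_value = incomes.median()  # stays close to the typical values

filled_with_mean = incomes.fillna(mean_value)
filled_with_median = incomes.fillna(median_value)

print(mean_value, median_value)
print(filled_with_median.tolist())
```

Filling with the mean here would assign the missing person an income of 239200, far above every typical value in the column, while the median fill of 50000 sits in the middle of them.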


#### Exercise 8: Replace missing values with the mode
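Use the fillna() method in combination with mode() to replace missing values with the most frequent value of the corresponding column. mode() returns one row per tied value, sorted in ascending order, and iloc[0] picks the first. Every value in this dataset occurs only once, so the fill values here (22.0, Austin, 45000.0) are simply the smallest value in each column rather than a meaningful mode.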


In [13]:

df_filled = df.fillna(df.mode().iloc[0])
df_filled

|   | name | age | city | income |
|---|---|---|---|---|
| 0 | Alice | 24.0 | New York | 50000.0 |
| 1 | Bob | 22.0 | Los Angeles | 48000.0 |
| 2 | Charlie | 22.0 | Chicago | 45000.0 |
| 3 | David | 29.0 | Austin | 53000.0 |
| 4 | Eve | 25.0 | San Francisco | 45000.0 |
| 5 | Frank | 22.0 | Seattle | 55000.0 |
| 6 | Grace | 30.0 | Austin | 45000.0 |
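Mode imputation is most useful for categorical columns where some value really does repeat. The sketch below defines a hypothetical helper, `fill_with_unique_mode`, that fills only when exactly one mode exists and otherwise leaves the gaps alone. Both the helper and the small `cities` series are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def fill_with_unique_mode(series):
    """Fill missing values with the mode, but only if exactly one mode exists."""
    modes = series.mode()  # missing values are ignored by default
    if len(modes) != 1:
        # A tie means there is no single most frequent value; leave gaps as-is.
        return series.copy()
    return series.fillna(modes.iloc[0])


# Ages from the sample dataset: every value occurs once, so five values tie.
ages = pd.Series([24, np.nan, 22, 29, 25, np.nan, 30], name='age')

# Hypothetical city column in which 'Austin' appears twice.
cities = pd.Series(['Austin', 'Chicago', 'Austin', np.nan, 'Seattle'], name='city')

ages_filled = fill_with_unique_mode(ages)      # unchanged because of the tie
cities_filled = fill_with_unique_mode(cities)  # gap filled with 'Austin'

print(len(ages.mode()))
print(cities_filled.tolist())
```

Refusing to fill on a tie is a design choice made for this sketch; another reasonable option would be to fall back to a different strategy, such as the median for numeric columns.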


### 5. Conclusion
In this Jupyter Notebook, we practiced data cleaning techniques such as removing and replacing missing data. We used Pandas and NumPy to perform these operations on a sample dataset. As a data engineer, it's important to master these techniques, as they form the foundation of the data preprocessing stage in any data project.

Moving forward, you can explore more advanced techniques for handling missing data, such as using machine learning algorithms for imputation or considering domain-specific knowledge to decide how to handle missing values. You can also practice data cleaning on larger, real-world datasets to become more proficient in handling various data issues.

Additionally, consider exploring other aspects of data preprocessing, such as data transformation, feature scaling, and encoding categorical variables. Building a strong foundation in data cleaning and preprocessing will help you tackle complex data engineering tasks and improve the quality of your data pipelines.