Let's imagine you have a dataset that contains information about customers of an e-commerce store. The dataset includes the following columns:

customer_id: Unique identifier for each customer
age: Age of the customer
gender: Gender of the customer
email: Email address of the customer
purchase_amount: The amount spent by the customer on purchases

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Create the DataFrame
data = {
    'customer_id': [1, 2, 3, 4, 5],
    'age': [25, np.nan, 30, 35, np.nan],
    'gender': ['M', 'F', np.nan, 'M', 'F'],
    'email': ['example1@example.com', 'example2@example.com', np.nan, 'example4@example.com', 'example5@example.com'],
    'purchase_amount': [100.0, 50.0, np.nan, 200.0, np.nan]
}

df = pd.DataFrame(data)
df

Unnamed: 0,customer_id,age,gender,email,purchase_amount
0,1,25.0,M,example1@example.com,100.0
1,2,,F,example2@example.com,50.0
2,3,30.0,,,
3,4,35.0,M,example4@example.com,200.0
4,5,,F,example5@example.com,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   customer_id      5 non-null      int64  
 1   age              3 non-null      float64
 2   gender           4 non-null      object 
 3   email            4 non-null      object 
 4   purchase_amount  3 non-null      float64
dtypes: float64(2), int64(1), object(2)
memory usage: 332.0+ bytes
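
The non-null counts above can also be turned into a direct count of missing values per column. The snippet below is a minimal sketch that rebuilds the same toy DataFrame so it runs on its own; the names `missing_counts` and `missing_share` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same toy data as above, rebuilt so this cell is self-contained.
df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'age': [25, np.nan, 30, 35, np.nan],
    'gender': ['M', 'F', np.nan, 'M', 'F'],
    'email': ['example1@example.com', 'example2@example.com', np.nan,
              'example4@example.com', 'example5@example.com'],
    'purchase_amount': [100.0, 50.0, np.nan, 200.0, np.nan],
})

# isna() marks each missing cell as True; summing counts them per column.
missing_counts = df.isna().sum()

# Taking the mean of the boolean mask gives the fraction missing per column.
missing_share = df.isna().mean()

print(missing_counts)
print(missing_share)
```

Looking at the share of missing values per column is one way to decide between dropping and imputing in the cases that follow.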


# Handling missing values

In [4]:
# Case 1: Drop rows with missing values
df_dropped = df.dropna()
df_dropped

Unnamed: 0,customer_id,age,gender,email,purchase_amount
0,1,25.0,M,example1@example.com,100.0
3,4,35.0,M,example4@example.com,200.0
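
Dropping every row with any missing value removed three of the five customers here. As a hedged alternative, `dropna` also accepts `subset` (only check certain columns) and `thresh` (keep rows with at least that many non-null values). The sketch below rebuilds the toy data; the names `df_has_email` and `df_mostly_complete` are illustrative.

```python
import numpy as np
import pandas as pd

# Same toy data as above, rebuilt so this cell is self-contained.
df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'age': [25, np.nan, 30, 35, np.nan],
    'gender': ['M', 'F', np.nan, 'M', 'F'],
    'email': ['example1@example.com', 'example2@example.com', np.nan,
              'example4@example.com', 'example5@example.com'],
    'purchase_amount': [100.0, 50.0, np.nan, 200.0, np.nan],
})

# Drop a row only if its email is missing; other gaps are tolerated.
df_has_email = df.dropna(subset=['email'])

# Keep rows that have at least 4 non-null values out of the 5 columns.
df_mostly_complete = df.dropna(thresh=4)

print(df_has_email)
print(df_mostly_complete)
```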


To drop the columns that contain missing values instead of the rows:

In [5]:
# Drop every column that contains at least one missing value
df_dropped_col = df.dropna(axis=1)
df_dropped_col


Unnamed: 0,customer_id
0,1
1,2
2,3
3,4
4,5


## Imputation

In [6]:
# Case 2: Fill missing values with specific values
df_filled = df.fillna({'age': 0, 'gender': 'Unknown','email':'examplehandler@example.com', 'purchase_amount': 0.0})
print("DataFrame after filling missing values with specific values:")
print(df_filled)
print()

DataFrame after filling missing values with specific values:
   customer_id   age   gender                       email  purchase_amount
0            1  25.0        M        example1@example.com            100.0
1            2   0.0        F        example2@example.com             50.0
2            3  30.0  Unknown  examplehandler@example.com              0.0
3            4  35.0        M        example4@example.com            200.0
4            5   0.0        F        example5@example.com              0.0



In [7]:
# Case 3: Fill missing numeric values with the column mean (rounded)
df_mean_filled = df.copy()
df_mean_filled['age'] = df_mean_filled['age'].fillna(df_mean_filled['age'].mean().round())
df_mean_filled['purchase_amount'] = df_mean_filled['purchase_amount'].fillna(df_mean_filled['purchase_amount'].mean().round())
print("DataFrame after filling missing values with mean:")
print(df_mean_filled)
print()

DataFrame after filling missing values with mean:
   customer_id   age gender                 email  purchase_amount
0            1  25.0      M  example1@example.com            100.0
1            2  30.0      F  example2@example.com             50.0
2            3  30.0    NaN                   NaN            117.0
3            4  35.0      M  example4@example.com            200.0
4            5  30.0      F  example5@example.com            117.0
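
The median is often a more robust fill value than the mean when a column is skewed by a few large values. The sketch below applies the same pattern with `median()`; it rebuilds only the numeric columns of the toy data, and `df_median_filled` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Numeric columns of the toy data, rebuilt so this cell is self-contained.
df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'age': [25, np.nan, 30, 35, np.nan],
    'purchase_amount': [100.0, 50.0, np.nan, 200.0, np.nan],
})

df_median_filled = df.copy()

# median() skips NaN by default, so it is computed from the observed values only.
for col in ['age', 'purchase_amount']:
    df_median_filled[col] = df_median_filled[col].fillna(df_median_filled[col].median())

print(df_median_filled)
```

With this data the age fill is still 30, but the purchase amount is filled with 100 rather than 117, because the single 200.0 purchase pulls the mean up and leaves the median unchanged.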



In [8]:
# Case 4: Forward fill missing values
# ffill() returns a new DataFrame, so no explicit copy is needed
df_ffill = df.ffill()
print("DataFrame after forward filling missing values:")
print(df_ffill)
print()

DataFrame after forward filling missing values:
   customer_id   age gender                 email  purchase_amount
0            1  25.0      M  example1@example.com            100.0
1            2  25.0      F  example2@example.com             50.0
2            3  30.0      F  example2@example.com             50.0
3            4  35.0      M  example4@example.com            200.0
4            5  35.0      F  example5@example.com            200.0



In [9]:
# Backward fill a single column (gender) with the next valid value
df_bfill_imp = df.copy()
df_bfill_imp['gender'] = df_bfill_imp['gender'].bfill()
df_bfill_imp

Unnamed: 0,customer_id,age,gender,email,purchase_amount
0,1,25.0,M,example1@example.com,100.0
1,2,,F,example2@example.com,50.0
2,3,30.0,M,,
3,4,35.0,M,example4@example.com,200.0
4,5,,F,example5@example.com,


In [10]:
# Case 5: Backward fill missing values (age column only)
df_bfill = df.copy()
# fillna(method='bfill') is deprecated in recent pandas; bfill() is the equivalent call
df_bfill['age'] = df_bfill['age'].bfill()
print("DataFrame after backward filling missing values:")
print(df_bfill)
print()

DataFrame after backward filling missing values:
   customer_id   age gender                 email  purchase_amount
0            1  25.0      M  example1@example.com            100.0
1            2  30.0      F  example2@example.com             50.0
2            3  30.0    NaN                   NaN              NaN
3            4  35.0      M  example4@example.com            200.0
4            5   NaN      F  example5@example.com              NaN



In [11]:
# Case 5 (continued): Backward fill every column
# bfill() returns a new DataFrame; values in the last row stay NaN
# because there is no later row to copy from
df_bfill = df.bfill()

print("DataFrame after backward filling missing values:")
print(df_bfill)
print()

DataFrame after backward filling missing values:
   customer_id   age gender                 email  purchase_amount
0            1  25.0      M  example1@example.com            100.0
1            2  30.0      F  example2@example.com             50.0
2            3  30.0      M  example4@example.com            200.0
3            4  35.0      M  example4@example.com            200.0
4            5   NaN      F  example5@example.com              NaN
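
Backward fill cannot fill a gap in the last row, since there is no later value to copy. One possible way to close those remaining gaps, shown here as a sketch rather than a recommendation, is to chain a forward fill after the backward fill. The name `df_bfill_ffill` is illustrative.

```python
import numpy as np
import pandas as pd

# Same toy data as above, rebuilt so this cell is self-contained.
df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'age': [25, np.nan, 30, 35, np.nan],
    'gender': ['M', 'F', np.nan, 'M', 'F'],
    'email': ['example1@example.com', 'example2@example.com', np.nan,
              'example4@example.com', 'example5@example.com'],
    'purchase_amount': [100.0, 50.0, np.nan, 200.0, np.nan],
})

# Backward fill first, then forward fill whatever is still missing at the end.
df_bfill_ffill = df.bfill().ffill()

print(df_bfill_ffill)
print(df_bfill_ffill.isna().sum())
```

Forward and backward fill copy values between neighbouring rows, so they make the most sense when row order is meaningful (for example, time-ordered records). For unordered customer records like these, the copied values are arbitrary.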

