# Handling Missing Data in Pandas
**`09-missing-data.ipynb`**

Real-world datasets often contain missing values. Pandas provides methods to detect (`isnull()`, `notnull()`), remove (`dropna()`), and fill (`fillna()`, `interpolate()`) them.

---



## Step 1: Import Libraries


In [1]:

import pandas as pd
import numpy as np


---



## Step 2: Create Sample DataFrame with Missing Values

In [2]:
data = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, np.nan, 28, 35, np.nan],
    "Salary": [50000, 60000, np.nan, 65000, 52000],
    "Department": ["HR", "IT", "Finance", np.nan, "HR"]
}

df = pd.DataFrame(data)
print(df)


      Name   Age   Salary Department
0    Alice  25.0  50000.0         HR
1      Bob   NaN  60000.0         IT
2  Charlie  28.0      NaN    Finance
3    David  35.0  65000.0        NaN
4      Eva   NaN  52000.0         HR
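
The `Age` and `Salary` columns print as floats (`25.0`, `50000.0`) because `np.nan` is a floating-point value, so an integer column that contains it is stored as `float64`. The sketch below shows one way to keep integer semantics, assuming pandas' nullable `Int64` extension dtype is acceptable for your workflow; the series name `ages` is only illustrative.

```python
import numpy as np
import pandas as pd

# Same Age values as in the sample DataFrame above
ages = pd.Series([25, np.nan, 28, 35, np.nan], name="Age")
print(ages.dtype)  # NaN forces a float dtype

# Convert to the nullable integer dtype; missing entries become <NA>
ages_nullable = ages.astype("Int64")
print(ages_nullable)

# Missing-value detection works the same way on the nullable dtype
print(ages_nullable.isna().sum())
```

The conversion only succeeds when every non-missing value is a whole number, which is the case here.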



---



## Step 3: Detect Missing Values

In [3]:
# Boolean mask for missing values
print(df.isnull())

# Count missing values per column
print(df.isnull().sum())

# Count non-missing values
print(df.notnull().sum())


    Name    Age  Salary  Department
0  False  False   False       False
1  False   True   False       False
2  False  False    True       False
3  False  False   False        True
4  False   True   False       False
Name          0
Age           2
Salary        1
Department    1
dtype: int64
Name          5
Age           3
Salary        4
Department    4
dtype: int64
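
Beyond per-column counts, it is often useful to see the share of missing values and the affected rows. The following is a minimal sketch that rebuilds the Step 2 data so it runs on its own; the variable names `missing_share`, `rows_with_gaps`, and `total_missing` are illustrative.

```python
import numpy as np
import pandas as pd

# Same sample data as in Step 2
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, np.nan, 28, 35, np.nan],
    "Salary": [50000, 60000, np.nan, 65000, 52000],
    "Department": ["HR", "IT", "Finance", np.nan, "HR"],
})

# Fraction of missing values per column (mean of the boolean mask)
missing_share = df.isna().mean()
print(missing_share)

# Rows that contain at least one missing value
rows_with_gaps = df[df.isna().any(axis=1)]
print(rows_with_gaps)

# Total number of missing cells in the whole DataFrame
total_missing = int(df.isna().sum().sum())
print(total_missing)
```

`isna()` is an alias of `isnull()`, so either name gives the same mask.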



---



## Step 4: Drop Missing Values

In [4]:
# Drop rows with any missing value
df_dropped = df.dropna()
print("Rows after dropping any missing value:\n", df_dropped)

# Drop rows where specific column is missing
df_drop_age = df.dropna(subset=['Age'])
print("Rows after dropping missing Age:\n", df_drop_age)

# Drop columns with missing values
df_drop_col = df.dropna(axis=1)
print("Columns after dropping missing values:\n", df_drop_col)

Rows after dropping any missing value:
     Name   Age   Salary Department
0  Alice  25.0  50000.0         HR
Rows after dropping missing Age:
       Name   Age   Salary Department
0    Alice  25.0  50000.0         HR
2  Charlie  28.0      NaN    Finance
3    David  35.0  65000.0        NaN
Columns after dropping missing values:
       Name
0    Alice
1      Bob
2  Charlie
3    David
4      Eva
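
Dropping every row with any gap can discard a lot of data (only one row survives above). As a hedged sketch, `dropna()` also accepts `how="all"` and `thresh`, which are less aggressive. The small DataFrame below is hypothetical and is not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 is entirely empty, row 1 has only a name
df_demo = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "David"],
    "Age": [25, np.nan, np.nan, 35],
    "Salary": [50000, np.nan, np.nan, 65000],
})

# Drop a row only if ALL of its values are missing
only_empty_removed = df_demo.dropna(how="all")
print(only_empty_removed)

# Keep rows that have at least 2 non-missing values
at_least_two = df_demo.dropna(thresh=2)
print(at_least_two)
```

`thresh` counts non-missing values, so a higher number means a stricter filter.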



---



## Step 5: Fill Missing Values



In [5]:
# Fill with constant value
df_filled = df.fillna(0)
print(df_filled)

# Fill with mean (numerical columns); assign the result back
# instead of calling inplace=True on a selected column
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
print(df)

# Forward-fill: propagate the last valid value downward
df['Department'] = df['Department'].ffill()
print(df)

      Name   Age   Salary Department
0    Alice  25.0  50000.0         HR
1      Bob   0.0  60000.0         IT
2  Charlie  28.0      0.0    Finance
3    David  35.0  65000.0          0
4      Eva   0.0  52000.0         HR
      Name        Age   Salary Department
0    Alice  25.000000  50000.0         HR
1      Bob  29.333333  60000.0         IT
2  Charlie  28.000000  56750.0    Finance
3    David  35.000000  65000.0        NaN
4      Eva  29.333333  52000.0         HR
      Name        Age   Salary Department
0    Alice  25.000000  50000.0         HR
1      Bob  29.333333  60000.0         IT
2  Charlie  28.000000  56750.0    Finance
3    David  35.000000  65000.0    Finance
4      Eva  29.333333  52000.0         HR


**Note:** Calling `fillna(..., inplace=True)` on a selected column, as in `df['Age'].fillna(df['Age'].mean(), inplace=True)`, raises a `FutureWarning` in pandas 2.x. The selected column behaves as a copy, so the in-place form stops updating the original DataFrame in pandas 3.0. Assign the result back with `df[col] = df[col].fillna(value)`, or use `df.fillna({col: value}, inplace=True)`. The `method='ffill'` argument of `fillna()` is likewise deprecated since pandas 2.1 in favor of `ffill()`.
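
The fills above can also be expressed in a single call that leaves the original DataFrame untouched. This is a minimal sketch, assuming the median is preferred for numeric columns and the most frequent value for the categorical one; `fill_values` and `df_filled` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same sample data as in Step 2
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, np.nan, 28, 35, np.nan],
    "Salary": [50000, 60000, np.nan, 65000, 52000],
    "Department": ["HR", "IT", "Finance", np.nan, "HR"],
})

# One fill value per column: median for numbers, mode for categories
fill_values = {
    "Age": df["Age"].median(),
    "Salary": df["Salary"].median(),
    "Department": df["Department"].mode().iloc[0],
}

# fillna with a dict returns a new DataFrame; df itself is not modified
df_filled = df.fillna(fill_values)
print(df_filled)

# Forward and backward fill as dedicated methods
dept_ffill = df["Department"].ffill()  # uses the previous valid value
dept_bfill = df["Department"].bfill()  # uses the next valid value
print(dept_ffill.tolist(), dept_bfill.tolist())
```

The median is less sensitive to outliers than the mean, which can matter for columns such as salaries.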


---



## Step 6: Interpolation



In [6]:

# Linear interpolation for numerical columns
data = {
    "Time": [1, 2, 3, 4, 5],
    "Value": [10, np.nan, 30, np.nan, 50]
}
df_interp = pd.DataFrame(data)
print("Before Interpolation:\n", df_interp)

df_interp['Value'] = df_interp['Value'].interpolate()
print("After Interpolation:\n", df_interp)

Before Interpolation:
    Time  Value
0     1   10.0
1     2    NaN
2     3   30.0
3     4    NaN
4     5   50.0
After Interpolation:
    Time  Value
0     1   10.0
1     2   20.0
2     3   30.0
3     4   40.0
4     5   50.0
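
The example above has valid values on both sides of every gap. Gaps at the start or end of a series behave differently, which the sketch below illustrates under the assumption that the default linear method is used; the series `s` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading and a trailing gap
s = pd.Series([np.nan, 10, np.nan, 30, np.nan])

# Default: fills forward only, so the leading NaN is left in place
default = s.interpolate()
print(default)

# Only fill gaps that are surrounded by valid values
inside = s.interpolate(limit_area="inside")
print(inside)

# Fill in both directions, including the leading gap
both = s.interpolate(limit_direction="both")
print(both)
```

Using `limit_area="inside"` avoids extending the last known value past the end of the observed data.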



---



## Step 7: Replace Specific Values

In [7]:
# Replace all NaN with a string
# (df has no NaN left after Step 5, so the output below is unchanged)
df.replace(np.nan, "Missing", inplace=True)
print(df)


      Name        Age   Salary Department
0    Alice  25.000000  50000.0         HR
1      Bob  29.333333  60000.0         IT
2  Charlie  28.000000  56750.0    Finance
3    David  35.000000  65000.0    Finance
4      Eva  29.333333  52000.0         HR
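
`replace()` is also useful in the opposite direction: turning placeholder values into `NaN` so that the detection and filling tools from the earlier steps can see them. The sketch below assumes a hypothetical dataset in which `"N/A"` and `-999` stand for missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with sentinel values instead of NaN
raw = pd.DataFrame({
    "Department": ["HR", "N/A", "Finance"],
    "Score": [88, -999, 92],
})

# Convert each column's sentinel to NaN; raw itself is not modified
cleaned = raw.assign(
    Department=raw["Department"].replace("N/A", np.nan),
    Score=raw["Score"].replace(-999, np.nan),
)
print(cleaned)

# The placeholders are now counted as missing and skipped by mean()
print(cleaned.isna().sum())
print(cleaned["Score"].mean())
```

Without this conversion, `-999` would be included in the mean and distort it.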



---


## Step 8: Summary

* **Detect missing values** using `isnull()` and `notnull()`.
* **Remove missing values** using `dropna()` (rows or columns).
* **Fill missing values** using `fillna()` with a constant, the mean, or the median, and `ffill()`/`bfill()` for forward/backward fill.
* **Interpolate** numerical data with `interpolate()` (linear by default).
* **Replace** specific values with `replace()`.
* Handling missing data is **critical** for data cleaning and preprocessing before analysis or modeling.

---