# 📊 Handling Missing Data in Pandas — Lecture

## 1️⃣ Introduction
In real-world datasets, missing values are very common. Typical causes include:
- Data collection errors
- Incomplete records
- Human input mistakes

**Why handle missing data?**
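- Ignoring missing values can lead to wrong results: statistics may be computed over fewer rows than expected, and many machine learning libraries raise errors or give unreliable models when the input contains `NaN`.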
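
To make the point above concrete, here is a small sketch (the `ages` Series is made up for illustration) showing how the same data gives different answers depending on how missing values are treated:

```python
import numpy as np
import pandas as pd

# Hypothetical data: three ages, one of them missing
ages = pd.Series([25, np.nan, 30])

# pandas skips NaN by default, so the mean is taken over 2 values, not 3
pandas_mean = ages.mean()

# Plain NumPy propagates NaN, so one missing value makes the result NaN
numpy_mean = np.mean(ages.to_numpy())

print(pandas_mean)  # 27.5
print(numpy_mean)   # nan
```

Neither result is "wrong" by itself, but both can surprise you if you did not know the data had gaps.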

## 2️⃣ Detecting Missing Data
### Theory
- `isnull()` → Returns `True` where values are missing.
- `notnull()` → Returns `True` where values are not missing.

In [None]:

import pandas as pd
import numpy as np

data = {
    'Name': ['Ali', 'Sara', 'Umar', 'Asim'],
    'Age': [25, np.nan, 30, np.nan],
    'City': ['Karachi', 'Lahore', np.nan, 'Islamabad']
}

df = pd.DataFrame(data)
print(df)

# Detect missing values
print(df.isnull())

# Count missing values per column
print(df.isnull().sum())


   Name   Age       City
0   Ali  25.0    Karachi
1  Sara   NaN     Lahore
2  Umar  30.0        NaN
3  Asim   NaN  Islamabad
    Name    Age   City
0  False  False  False
1  False   True  False
2  False  False   True
3  False   True  False
Name    0
Age     2
City    1
dtype: int64


In [None]:
df.isnull().sum()

Name    0
Age     2
City    1
dtype: int64


In [None]:
df

   Name   Age       City
0   Ali  25.0    Karachi
1  Sara   NaN     Lahore
2  Umar  30.0        NaN
3  Asim   NaN  Islamabad


In [None]:
df.isnull()

    Name    Age   City
0  False  False  False
1  False   True  False
2  False  False   True
3  False   True  False
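
The cells above only use `isnull()`. As a complement, the following sketch shows `notnull()` and a few common counting and filtering patterns on the same lecture DataFrame. The variable names are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Ali', 'Sara', 'Umar', 'Asim'],
    'Age': [25, np.nan, 30, np.nan],
    'City': ['Karachi', 'Lahore', np.nan, 'Islamabad'],
})

# notnull() is the element-wise opposite of isnull();
# summing it counts the values that are present in each column
present_per_column = df.notnull().sum()

# Boolean indexing: keep only the rows where Age is present
rows_with_age = df[df['Age'].notnull()]

# Total number of missing cells in the whole DataFrame
total_missing = int(df.isnull().sum().sum())

# Rows that contain at least one missing value
rows_with_any_missing = df[df.isnull().any(axis=1)]

print(present_per_column)
print(rows_with_age)
print(total_missing)
print(rows_with_any_missing)
```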


### ✅ Student Task 1
- Create a DataFrame with at least 3 columns and some missing values.
- Use `isnull()` and `notnull()` to detect and count missing values.

## 3️⃣ Handling Missing Data
### A. Removing Missing Data
#### Theory
- `dropna()` → Removes rows (the default) or columns (`axis=1`) containing missing values. It returns a new DataFrame and leaves the original unchanged.

In [None]:

# Drop rows with any missing value
cleaned_df = df.dropna()
print(cleaned_df)


  Name   Age     City
0  Ali  25.0  Karachi
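
Dropping every row with any missing value removed three of the four rows here. The sketch below shows a few less aggressive options of `dropna()` (`axis`, `subset`, `thresh`) on the same lecture DataFrame. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Ali', 'Sara', 'Umar', 'Asim'],
    'Age': [25, np.nan, 30, np.nan],
    'City': ['Karachi', 'Lahore', np.nan, 'Islamabad'],
})

# axis=1 drops columns (instead of rows) that contain any missing value
cols_dropped = df.dropna(axis=1)

# subset=... only looks at the listed columns when deciding what to drop
age_required = df.dropna(subset=['Age'])

# thresh=2 keeps rows that have at least 2 non-missing values
at_least_two = df.dropna(thresh=2)

print(cols_dropped)   # only the Name column is left
print(age_required)   # rows 0 and 2
print(at_least_two)   # all 4 rows, each has at least 2 values
```

Which option is appropriate depends on how much data you can afford to lose and which columns your analysis really needs.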


In [None]:
# dropna() returned a new DataFrame; the original df still has its missing values
df

   Name   Age       City
0   Ali  25.0    Karachi
1  Sara   NaN     Lahore
2  Umar  30.0        NaN
3  Asim   NaN  Islamabad


### ✅ Student Task 2
- Try `dropna()` on your DataFrame.
- Observe how many rows are removed.

### B. Filling Missing Data
#### Theory
- `fillna(value)` → Replaces missing values with `value`.
- Fill strategies:
  - With a constant (like `0` or `'Unknown'`)
  - With the mean, median, or mode value

In [None]:

# Fill missing Age with mean
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)

# Fill missing City with 'Unknown'
df['City'] = df['City'].fillna('Unknown')

print(df)


   Name   Age       City
0   Ali  25.0    Karachi
1  Sara  27.5     Lahore
2  Umar  30.0    Unknown
3  Asim  27.5  Islamabad
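
The example above uses the mean and a constant. As a sketch of the other strategies in the list, the code below fills with the median and the mode. The `people` DataFrame is hypothetical data invented for this illustration, with one unusually large age so that the median and the mean differ.

```python
import numpy as np
import pandas as pd

# Hypothetical data (not from the lecture): 95 is an outlier
people = pd.DataFrame({
    'Age': [25, np.nan, 30, np.nan, 95],
    'City': ['Karachi', 'Lahore', np.nan, 'Karachi', np.nan],
})

# The median is less sensitive to outliers than the mean (which is 50 here)
median_age = people['Age'].median()

# mode() returns a Series because there can be ties; take the first entry
most_common_city = people['City'].mode().iloc[0]

# Passing a dict to fillna() fills each column with its own value
filled = people.fillna({'Age': median_age, 'City': most_common_city})

print(median_age)        # 30.0
print(most_common_city)  # Karachi
print(filled)
```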


### ✅ Student Task 3
- Fill missing numeric values with the mean.
- Fill missing categorical values with `'Unknown'`.

## 4️⃣ Forward Fill and Backward Fill
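### Theory
- `ffill()` → Fills each missing value with the previous non-missing value in the column.
- `bfill()` → Fills each missing value with the next non-missing value in the column.
- The older spelling `fillna(method='ffill')` / `fillna(method='bfill')` is deprecated in recent pandas releases (2.1 and later) and emits a `FutureWarning`, so prefer `ffill()` and `bfill()`.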

In [None]:

df2 = pd.DataFrame({
    'A': [1, np.nan, np.nan, 4],
    'B': [np.nan, 2, 3, 4]
})

# Forward fill: B[0] stays NaN because there is no previous value
print(df2.ffill())
# Backward fill: every gap here has a later value to copy from
print(df2.bfill())




     A    B
0  1.0  NaN
1  1.0  2.0
2  1.0  3.0
3  4.0  4.0
     A    B
0  1.0  2.0
1  4.0  2.0
2  4.0  3.0
3  4.0  4.0



In [None]:
df2.bfill()

     A    B
0  1.0  2.0
1  4.0  2.0
2  4.0  3.0
3  4.0  4.0
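
Two variations are often useful in practice: limiting how far a value is carried, and combining both directions so that leading gaps are also filled. The following is a minimal sketch on the same `df2`. The names `limited` and `both` are illustrative.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'A': [1, np.nan, np.nan, 4],
    'B': [np.nan, 2, 3, 4],
})

# limit=1 carries a value forward over at most one consecutive gap,
# so the second NaN in column A is left as it is
limited = df2.ffill(limit=1)

# Forward fill first, then backward fill whatever is still missing
# (here, the leading NaN in column B)
both = df2.ffill().bfill()

print(limited)
print(both)
```

Forward and backward fill assume that neighbouring rows are related (for example, ordered by time), so they are usually a poor choice for unordered data.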


### ✅ Student Task 4
- Practice `ffill()` and `bfill()` on your DataFrame.

## ✅ Summary Table
| Function              | Usage                                      |
|-----------------------|--------------------------------------------|
| `isnull()`            | Detect missing values                      |
| `notnull()`           | Detect non-missing values                  |
| `dropna()`            | Remove rows or columns with missing data   |
| `fillna(value)`       | Replace missing data with a value          |
| `ffill()` / `bfill()` | Forward fill / backward fill               |

## ✅ Assignment for Students
### Given DataFrame

In [None]:

import pandas as pd
import numpy as np

patient_data = pd.DataFrame({
    'PatientID': [1, 2, 3, 4, 5],
    'Name': ['Ali', 'Sara', 'Umar', 'Asim', 'Areeba'],
    'Age': [25, np.nan, 40, np.nan, 35],
    'City': ['Karachi', 'Lahore', np.nan, 'Islamabad', np.nan],
    'BloodPressure': [120, np.nan, 130, np.nan, 110]
})

patient_data


   PatientID    Name   Age       City  BloodPressure
0          1     Ali  25.0    Karachi          120.0
1          2    Sara   NaN     Lahore            NaN
2          3    Umar  40.0        NaN          130.0
3          4    Asim   NaN  Islamabad            NaN
4          5  Areeba  35.0        NaN          110.0


### Assignment Tasks
1. Detect and count missing values in each column.
2. Fill missing `Age` values with the mean age.
3. Fill missing `City` values with `'Unknown'`.
4. Fill missing `BloodPressure` values using forward fill (`ffill()`).
5. Submit the cleaned DataFrame with all missing values handled.