# Part 1.1: Handling Missing Data

Missing data is a common problem in real-world datasets. This notebook covers the standard techniques for identifying and handling missing values.

In [6]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

data = {
    'age': [25, 30, np.nan, 40, 45],
    'salary': [50000, 60000, 70000, np.nan, 90000],
    'city': ['New York', 'Los Angeles', 'Chicago', 'New York', np.nan]
}
df = pd.DataFrame(data)
print("Original DataFrame with Missing Values:")
df

Original DataFrame with Missing Values:


    age   salary         city
0  25.0  50000.0     New York
1  30.0  60000.0  Los Angeles
2   NaN  70000.0      Chicago
3  40.0      NaN     New York
4  45.0  90000.0          NaN


### Detecting Missing Data
The first step is to identify which columns contain missing values and how many.

In [7]:
print("Missing values per column:")
print(df.isna().sum())

Missing values per column:
age       1
salary    1
city      1
dtype: int64
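Counts alone can be hard to judge on larger datasets, so it often helps to look at the share of missing values and at the affected rows as well. The following is a minimal sketch that rebuilds the same toy DataFrame so it runs on its own; the names `missing_fraction` and `rows_with_missing` are illustrative, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same toy data as the first cell, rebuilt so this block is self-contained.
df = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 45],
    'salary': [50000, 60000, 70000, np.nan, 90000],
    'city': ['New York', 'Los Angeles', 'Chicago', 'New York', np.nan],
})

# isna() yields booleans, so the column mean is the fraction of missing entries.
missing_fraction = df.isna().mean()
print("Fraction of missing values per column:")
print(missing_fraction)

# Select the rows that contain at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]
print("\nRows with at least one missing value:")
print(rows_with_missing)
```

Here each column is missing one value out of five, and three different rows are affected.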


### Strategy 1: Dropping Missing Data
The simplest approach is to remove rows or columns that contain missing data. This is only advisable when few values are missing: in this small example, dropping every incomplete row discards three of the five rows.

In [8]:
df_dropped = df.dropna()
print("DataFrame after dropping rows with any missing values:")
print(df_dropped)

DataFrame after dropping rows with any missing values:
    age   salary         city
0  25.0  50000.0     New York
1  30.0  60000.0  Los Angeles


### Strategy 2: Imputation
Imputation involves filling in the missing values. The method depends on the data type.

#### Numerical Imputation (Mean/Median)
For numerical columns like 'age' and 'salary', we can use the mean or the median. The median is generally preferred when the column contains outliers, because a few extreme values can pull the mean far away from a typical value.

In [9]:
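df_imputed = df.copy()  # work on a copy so the original df stays unchanged
num_imputer = SimpleImputer(strategy='median')
# fit_transform expects 2-D input, hence the double brackets
df_imputed[['age', 'salary']] = num_imputer.fit_transform(df_imputed[['age', 'salary']])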
print("DataFrame after numerical imputation:")
print(df_imputed)

DataFrame after numerical imputation:
    age   salary         city
0  25.0  50000.0     New York
1  30.0  60000.0  Los Angeles
2  35.0  70000.0      Chicago
3  40.0  65000.0     New York
4  45.0  90000.0          NaN
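To see why the median is the safer default when outliers are present, the sketch below imputes a single missing value with both strategies. The salary figures are hypothetical and chosen only to include one extreme value; they are not taken from the notebook's data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical salaries: three typical values, one extreme outlier, one missing.
salaries = np.array([[50000.0], [60000.0], [70000.0], [1000000.0], [np.nan]])

# Impute the missing entry with the mean and with the median of the observed values.
mean_filled = SimpleImputer(strategy='mean').fit_transform(salaries)
median_filled = SimpleImputer(strategy='median').fit_transform(salaries)

print("Mean-imputed value:  ", mean_filled[-1, 0])
print("Median-imputed value:", median_filled[-1, 0])
```

The mean-based fill is 295000, far above most observed salaries, while the median-based fill is 65000.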


#### Categorical Imputation (Most Frequent)
For categorical columns like 'city', we can fill missing values with the most frequent category (the mode).

In [10]:
cat_imputer = SimpleImputer(strategy='most_frequent')
df_imputed[['city']] = cat_imputer.fit_transform(df_imputed[['city']])
print("DataFrame after all imputation:")
print(df_imputed)
print("\nFinal check for missing values:")
print(df_imputed.isna().sum())

DataFrame after all imputation:
    age   salary         city
0  25.0  50000.0     New York
1  30.0  60000.0  Los Angeles
2  35.0  70000.0      Chicago
3  40.0  65000.0     New York
4  45.0  90000.0     New York

Final check for missing values:
age       0
salary    0
city      0
dtype: int64
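For quick exploration, the same fills can be reproduced with pandas alone. This is a minimal sketch of an equivalent approach, not a replacement for `SimpleImputer`, which has the advantage that a fitted imputer can be reused on new data. The name `df_filled` is illustrative.

```python
import numpy as np
import pandas as pd

# Same toy data as the first cell, rebuilt so this block is self-contained.
df = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 45],
    'salary': [50000, 60000, 70000, np.nan, 90000],
    'city': ['New York', 'Los Angeles', 'Chicago', 'New York', np.nan],
})

df_filled = df.copy()

# Numerical columns: fill with the column median (NaN is skipped by default).
for col in ['age', 'salary']:
    df_filled[col] = df_filled[col].fillna(df_filled[col].median())

# Categorical column: mode() returns a Series of the most frequent values,
# so take the first entry as the fill value.
df_filled['city'] = df_filled['city'].fillna(df_filled['city'].mode()[0])

print(df_filled)
```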


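More advanced techniques also exist, such as `KNNImputer`, which fills each missing value using the values of the most similar rows (its nearest neighbors). These can be more accurate than a single column-wide statistic when the features are related to each other, at the cost of extra computation.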
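A minimal sketch of `KNNImputer` on the numerical columns of the toy data is shown below. The choice of `n_neighbors=2` is an assumption made for this tiny example, not a recommended default. `KNNImputer` works on numerical data only, so the 'city' column is left out, and because distances are computed on the raw values, the large 'salary' numbers dominate the much smaller 'age' values here; in practice the features are usually scaled first.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Numerical columns of the toy data, rebuilt so this block is self-contained.
df_num = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 45],
    'salary': [50000, 60000, 70000, np.nan, 90000],
})

# n_neighbors=2 is an illustrative choice for a five-row dataset.
knn_imputer = KNNImputer(n_neighbors=2)

# Each missing value is replaced by the average of that feature
# over the nearest rows that have the feature observed.
df_knn = pd.DataFrame(
    knn_imputer.fit_transform(df_num),
    columns=df_num.columns,
    index=df_num.index,
)

print(df_knn)
```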