 # Handling Missing Values in Data (Numerical & Categorical Features)

## What are Missing Values?

 Missing values are empty or null entries in a dataset where a value is expected but not recorded. These gaps, often appearing as blank cells or placeholders like "NaN," "NA," or "NULL," can significantly affect the accuracy and reliability of data analysis and machine learning models.
## Common causes of missing data
Missing data is a common issue in data science, with a variety of potential causes:
- **Human error:** Mistakes during data collection or entry are a frequent cause of missing values.
- **Technical problems:** Equipment or sensor failures during data collection or errors during data transfer can lead to missing data.
- **Non-response:** In surveys, participants may refuse to answer certain questions due to their sensitive nature, such as income or personal health.
- **Study attrition:** In longitudinal studies, participants may drop out before the study is completed, resulting in missing data for later time points.
- **Missing by design:** Sometimes, a data collection strategy is intentionally designed to produce missing values, for example survey skip logic that hides questions that do not apply to a respondent.


## 🛠 1. Understanding Types of Missing Data

Before handling missing values, it is crucial to understand why they occur. There are three main types:

 🔹 A. **Missing Completely at Random (MCAR)**

- Missing values occur randomly without any pattern.

- Example: A survey respondent forgets to answer a question by accident.

- **Solution:** Dropping missing values (dropna()) is safe if the missing percentage is low.

🔹 B. **Missing at Random (MAR)**

- The probability that a value is missing depends on other observed variables, but not on the missing value itself.

- Example: Younger respondents are less likely to report their salaries in a survey. Missingness depends on age (which is recorded), not on the salary amount.

- **Solution:** Use fillna() with an appropriate imputation method (mean, median, or regression-based imputation).

🔹 C. **Missing Not at Random (MNAR)**

- Missing values depend on the missing variable itself.

- Example: People with low salaries are more likely to hide their income.

- **Solution:** Imputation is tricky. Try to understand the cause before deciding on a method.
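
To make the three mechanisms more concrete, here is a minimal simulation sketch. The synthetic population, the missingness probabilities, and all variable names are assumptions chosen for illustration. It compares the mean of the salaries that remain observed with the true mean under each mechanism.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic population: salary rises with age (illustrative numbers only)
age = rng.uniform(20, 60, n)
salary = 20_000 + 1_000 * age + rng.normal(0, 5_000, n)
true_mean = salary.mean()

# MCAR: every salary has the same 30% chance of being missing
mcar_mask = rng.random(n) < 0.30

# MAR: missingness depends only on an observed variable (age)
mar_mask = rng.random(n) < np.where(age < 35, 0.60, 0.10)

# MNAR: missingness depends on the unobserved salary itself
mnar_mask = rng.random(n) < np.where(salary < np.median(salary), 0.50, 0.10)

summary = pd.Series({
    'true mean': true_mean,
    'observed mean (MCAR)': salary[~mcar_mask].mean(),
    'observed mean (MAR)': salary[~mar_mask].mean(),
    'observed mean (MNAR)': salary[~mnar_mask].mean(),
})
print(summary.round(0))
```

In this setup the MCAR mean should stay close to the true mean, while the MAR and MNAR means drift upward because younger or lower-paid people drop out more often. Under MAR the shift can in principle be corrected by conditioning on age; under MNAR the observed data alone cannot reveal it.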

## 🛠 2. How to Detect Missing Values?
Before handling missing data, you need to check for it.

**✅ Check for Missing Data in Pandas**


In [64]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
# Sample DataFrame

data = {
    'Name': ['Alice', 'Bob', 'Charlie', None, 'David'],
    'Age': [25, None, 30, 22, None],
    'Salary': [50000, 60000, None, 70000, None],
    'Department': ['HR', 'Finance', 'IT', 'Marketing', None]
}


df = pd.DataFrame(data)

# Count missing values in each column
print(df.isnull().sum())

# Percentage of missing values
print("Percentage of Missing Values:\n", (df.isnull().sum() / len(df) * 100))


Name          1
Age           2
Salary        2
Department    1
dtype: int64
Percentage of Missing Values:
 Name          20.0
Age           40.0
Salary        40.0
Department    20.0
dtype: float64
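
Column totals do not show which rows are affected, and `isnull()` does not recognise placeholder strings. The sketch below uses a small made-up frame (the `'NA'` and `''` placeholders are assumptions for this example) to show one way to convert placeholders to NaN and count gaps per row.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the placeholder strings are assumptions for this example
raw = pd.DataFrame({
    'Age': [25, np.nan, 30, 22, np.nan],
    'Department': ['HR', 'NA', 'IT', '', None],
})

# Placeholders such as 'NA' or '' are ordinary strings, so isnull() does not flag them
placeholder_blind_count = raw['Department'].isnull().sum()
print(placeholder_blind_count)

# Convert the placeholders to NaN before counting
clean = raw.copy()
clean['Department'] = clean['Department'].replace(['NA', ''], np.nan)

# Missing values per row, and the rows that contain at least one gap
missing_per_row = clean.isnull().sum(axis=1)
rows_with_gaps = clean[clean.isnull().any(axis=1)]
print(missing_per_row.tolist())
print(rows_with_gaps)
```

When reading from a file, `pd.read_csv` also accepts an `na_values` argument for declaring such placeholders up front.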


# 🛠 3. Techniques to Handle Missing Numerical Data
Now, let's explore different methods to handle missing values.

  ## 🔹 **A. Drop Missing Values (dropna())**

**Best when:** Missing data is low (<5%) and randomly distributed (MCAR).

In [65]:
df_cleaned = df.dropna()  # Drops rows with missing values


✅ **Pros:** Simple; does not bias estimates when the data is MCAR.

❌ **Cons:** Reduces dataset size, may lose valuable data.

➡ **Alternative:** Drop columns if many values are missing.

In [66]:
df.drop(columns=['Department'], inplace=True)  # Drops the entire column
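
Rather than naming columns by hand, the column-dropping alternative can be driven by a missing-fraction cutoff. This is a minimal sketch on a made-up frame; the 50% cutoff and the column names are arbitrary assumptions, not a recommended rule.

```python
import numpy as np
import pandas as pd

# Illustrative frame; the 50% cutoff below is an arbitrary assumption
df_demo = pd.DataFrame({
    'Age': [25, np.nan, 30, 22, 41],
    'Salary': [50000, 60000, np.nan, 70000, 65000],
    'Bonus': [np.nan, np.nan, np.nan, 1000, np.nan],
})

max_missing_fraction = 0.5

# Fraction of missing values per column
missing_fraction = df_demo.isnull().mean()

# Keep only columns whose missing fraction is at or below the cutoff
kept = df_demo.loc[:, missing_fraction <= max_missing_fraction]
print(list(kept.columns))
```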


 ## 🔹 **B. Fill Missing Values (fillna)**
**Best when:** Missingness depends on other observed variables (MAR), or removing data is not an option.

1️⃣ **Fill with a Fixed Value**

- Replace missing values with a constant (e.g., 0, 'Unknown').

In [67]:
df_constant = df.fillna({'Name': 'Unknown', 'Age': 0})  # Returns a filled copy; df keeps its NaNs for the next steps


✅ Simple but not always accurate.

2️⃣ **Fill with Mean, Median, or Mode**

 **Mean and median suit numerical data; mode suits categorical data.**

- Mean: Use when the data is roughly symmetric and has no extreme outliers.

- Median: Use when the data is skewed or has outliers.

- Mode: Use for categorical data (e.g., filling missing city names).

In [68]:
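# Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
df['Age'] = df['Age'].fillna(df['Age'].mean())   # Mean imputation
df['Salary'] = df['Salary'].fillna(df['Salary'].median())  # Median imputation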


✅ **Pros:** Keeps data size, works well when only a small share of values is missing.

❌ **Cons:** Can introduce bias if missing data is not random.
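
One common way to soften this drawback is to record which values were imputed, so a downstream model can still see the missingness. Below is a sketch using scikit-learn's `SimpleImputer` with `add_indicator=True`; the data and the `*_was_missing` column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df_demo = pd.DataFrame({
    'Age': [25, np.nan, 30, 22, np.nan],
    'Salary': [50000, 60000, np.nan, 70000, 1_000_000],  # one extreme outlier
})

# Median is robust to the outlier; add_indicator appends one 0/1 column
# per feature that had missing values during fit
imputer = SimpleImputer(strategy='median', add_indicator=True)
filled = imputer.fit_transform(df_demo)

result = pd.DataFrame(
    filled,
    columns=['Age', 'Salary', 'Age_was_missing', 'Salary_was_missing'],
)
print(result)
```

Here the median salary (65000) is used instead of the mean, which the single outlier would pull far upward.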

3️⃣ **Forward Fill (ffill) & Backward Fill (bfill)**

 **Best for time-series data.**

- ffill (Forward Fill): Fill missing values with the previous value.

- bfill (Backward Fill): Fill missing values with the next value.

In [69]:
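# fillna(method=...) is deprecated in recent pandas; ffill() and bfill() are the direct equivalents
df = df.ffill()  # Forward fill: propagate the previous value
df = df.bfill()  # Backward fill: fill any remaining gaps with the next value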


✅ Works well for time-dependent data.
            
❌ Can create unrealistic values if trend changes.
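
The risk of unrealistic values grows with the length of the gap. As a sketch on a hypothetical daily series (the dates and readings are made up), `limit` caps how far a value is carried forward, and `interpolate` offers a linear alternative between known points.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sensor readings with gaps
idx = pd.date_range('2024-01-01', periods=7, freq='D')
temps = pd.Series([20.0, np.nan, np.nan, np.nan, 24.0, np.nan, 26.0], index=idx)

# Forward fill, but carry a value forward at most one step
ffilled = temps.ffill(limit=1)

# Linear interpolation between the surrounding observations
interpolated = temps.interpolate(method='linear')

print(pd.DataFrame({'raw': temps, 'ffill(limit=1)': ffilled, 'interpolate': interpolated}))
```

With `limit=1`, the second and third days of the three-day gap stay missing, which makes long outages visible instead of silently filled.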

## 🔹 **C. Advanced Imputation Methods**
**Best for large datasets with complex missing patterns.**

1️⃣ **K-Nearest Neighbors (KNN) Imputation**

- Finds the most similar (nearest) rows and fills missing values based on them.

In [70]:
from sklearn.impute import KNNImputer
import numpy as np

imputer = KNNImputer(n_neighbors=3)

df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])


✅ Often more accurate than mean/median imputation when features are correlated.

❌ Computationally expensive for large datasets.
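
Because KNN imputation is distance-based, features on large scales (such as salary) can dominate the neighbor search. A common precaution, sketched below on made-up data, is to standardize before imputing and map the result back afterwards. The column values here are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

df_demo = pd.DataFrame({
    'Age': [25, 27, 30, 45, 47, 50],
    'Experience': [1, 2, 4, 20, 22, 25],
    'Salary': [30000, np.nan, 34000, 80000, np.nan, 90000],
})

# StandardScaler ignores NaN when fitting and leaves them in place when transforming
scaler = StandardScaler()
scaled = scaler.fit_transform(df_demo)

# Impute in the scaled space so no single column dominates the distance
imputer = KNNImputer(n_neighbors=2)
imputed_scaled = imputer.fit_transform(scaled)

# Map back to the original units
df_knn = pd.DataFrame(scaler.inverse_transform(imputed_scaled), columns=df_demo.columns)
print(df_knn.round(0))
```

Each missing salary is filled with the average of its two nearest rows by age and experience, so the younger row receives a value near the low salaries and the older row a value near the high ones.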

2️⃣ **Regression Imputation**
- Predict missing values using regression models.

- Example: Predict missing salaries using Age and Experience.

In [71]:
import pandas as pd
import numpy as np

# Sample dataset
np.random.seed(42)
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55, 60],
    'Experience': [1, 3, 7, 10, 15, 20, 25, 30],
    'Salary': [25000, 30000, np.nan, 50000, np.nan, 70000, 80000, np.nan]
}

df = pd.DataFrame(data)
print("Original Data with Missing Values:\n", df)

# Split data into known and missing salary
train_data = df[df['Salary'].notnull()]
test_data = df[df['Salary'].isnull()]


from sklearn.linear_model import LinearRegression

# Features & Target
X_train = train_data[['Age', 'Experience']]
y_train = train_data['Salary']

# Train Linear Regression
reg = LinearRegression()
reg.fit(X_train, y_train)

# Predict missing salaries
X_test = test_data[['Age', 'Experience']]
predicted_salary = reg.predict(X_test)

# Fill missing values
df.loc[df['Salary'].isnull(), 'Salary'] = predicted_salary
print("\nData After Regression Imputation:\n", df)



Original Data with Missing Values:
    Age  Experience   Salary
0   25           1  25000.0
1   30           3  30000.0
2   35           7      NaN
3   40          10  50000.0
4   45          15      NaN
5   50          20  70000.0
6   55          25  80000.0
7   60          30      NaN

Data After Regression Imputation:
    Age  Experience        Salary
0   25           1  25000.000000
1   30           3  30000.000000
2   35           7  40798.611111
3   40          10  50000.000000
4   45          15  59479.166667
5   50          20  70000.000000
6   55          25  80000.000000
7   60          30  90729.166667


✅ Regression imputation uses relationships among variables, so it is often more accurate than mean/median imputation when the predictors are informative.

✅ Keeps data consistent with trends.

❌ Assumes a linear relationship (may not always be true).

❌ Can underestimate variability (imputed values look too "perfect").
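
One way to address the "too perfect" problem is stochastic regression imputation, which adds random noise with the spread of the model's residuals to each prediction. The following is a minimal sketch under that assumption, using made-up data and a single predictor; it is not the method used in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

df_demo = pd.DataFrame({
    'Experience': [1, 3, 7, 10, 15, 20, 25, 30],
    'Salary': [25000, 31000, np.nan, 48000, np.nan, 71000, 79000, np.nan],
})

observed = df_demo['Salary'].notnull()
X_obs = df_demo.loc[observed, ['Experience']]
y_obs = df_demo.loc[observed, 'Salary']

reg = LinearRegression().fit(X_obs, y_obs)

# Spread of the residuals around the fitted line
# (ddof=2 accounts for the slope and the intercept)
residual_std = np.std(y_obs - reg.predict(X_obs), ddof=2)

# Deterministic prediction plus random noise drawn from the residual spread
X_missing = df_demo.loc[~observed, ['Experience']]
point_pred = reg.predict(X_missing)
noisy_pred = point_pred + rng.normal(0, residual_std, size=len(point_pred))

df_stochastic = df_demo.copy()
df_stochastic.loc[~observed, 'Salary'] = noisy_pred
print(df_stochastic)
```

The imputed values no longer sit exactly on the regression line, which keeps the variance of the completed column closer to that of the observed data.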

 ## 🔹 **D. Multiple Imputation (MICE)**

**Best when multiple variables are missing in different patterns.**

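- MICE (Multiple Imputation by Chained Equations) models each feature with missing values as a function of the other features and cycles through them for several rounds. Full multiple imputation repeats this to create several completed datasets and pools the analysis results. scikit-learn's `IterativeImputer`, used below, is inspired by MICE but returns a single imputed dataset with its default settings.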

In [72]:
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # Needed for IterativeImputer
from sklearn.impute import IterativeImputer

# Sample Dataset with Missing Values
np.random.seed(42)
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55, 60],
    'Experience': [1, 3, 7, 10, 15, 20, 25, 30],
    'Salary': [25000, 30000, np.nan, 50000, np.nan, 70000, 80000, np.nan]
}
df = pd.DataFrame(data)
print("Original Data:\n", df)

# Apply MICE-style iterative imputation (default settings give one imputed dataset)
imputer = IterativeImputer(max_iter=10, random_state=42)
df_imputed = imputer.fit_transform(df)

# Convert back to DataFrame
df_imputed = pd.DataFrame(df_imputed, columns=df.columns)
print("\nAfter Multiple Imputation:\n", df_imputed)



Original Data:
    Age  Experience   Salary
0   25           1  25000.0
1   30           3  30000.0
2   35           7      NaN
3   40          10  50000.0
4   45          15      NaN
5   50          20  70000.0
6   55          25  80000.0
7   60          30      NaN

After Multiple Imputation:
     Age  Experience        Salary
0  25.0         1.0  25000.000000
1  30.0         3.0  30000.000000
2  35.0         7.0  40827.756774
3  40.0        10.0  50000.000000
4  45.0        15.0  59497.728638
5  50.0        20.0  70000.000000
6  55.0        25.0  80000.000000
7  60.0        30.0  90642.401287


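✅ Preserves relationships between variables.

✅ Can reflect uncertainty when run several times with `sample_posterior=True` and different seeds; the default settings used above return a single point estimate per missing value.

✅ Supports statistically valid inference when results from the multiple imputed datasets are pooled.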

❌ More computationally expensive than mean/median.

❌ Slightly complex to implement.
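
To get several imputations rather than one, `IterativeImputer` can be run repeatedly with `sample_posterior=True` and different random seeds. The sketch below assumes five imputations (an arbitrary choice) and summarizes the imputed salaries across runs.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df_demo = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45, 50, 55, 60],
    'Experience': [1, 3, 7, 10, 15, 20, 25, 30],
    'Salary': [25000, 30000, np.nan, 50000, np.nan, 70000, 80000, np.nan],
})

n_imputations = 5  # arbitrary choice for illustration
salary_draws = []

for seed in range(n_imputations):
    # sample_posterior=True draws each imputed value from the predictive
    # distribution instead of returning the point prediction
    imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=seed)
    completed = imputer.fit_transform(df_demo)
    salary_draws.append(completed[:, 2])

salary_draws = np.array(salary_draws)            # shape: (n_imputations, n_rows)
pooled_mean = salary_draws.mean(axis=0)          # average imputed value per row
between_std = salary_draws.std(axis=0, ddof=1)   # spread across imputations

print(pd.DataFrame({'pooled_salary': pooled_mean, 'std_across_imputations': between_std}))
```

Averaging the imputed values is only a quick summary. In a full multiple-imputation workflow the downstream analysis is run on each completed dataset and the resulting estimates are pooled, for example with Rubin's rules.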

# 🔍 Handling Missing Values in Categorical Data
Handling missing values in categorical variables (e.g., "City", "Gender", "Product Category") is different from numerical data. The goal is to retain meaningful patterns without introducing bias.

 ## 🛠 **1. Identify Missing Values in Categorical Data**

Before deciding on a method, check the number and percentage of missing values.

In [73]:
import pandas as pd

# Sample Data
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Gender': ['Female', 'Male', None, 'Male', None],
        'City': ['New York', None, 'Los Angeles', 'Chicago', 'Chicago']}

df = pd.DataFrame(data)

# Count missing values
print(df.isnull().sum())

# Percentage of missing values
print(df.isnull().sum() / len(df) * 100)



Name      0
Gender    2
City      1
dtype: int64
Name       0.0
Gender    40.0
City      20.0
dtype: float64


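📌 If the missing percentage is high (for example, above 20%), dropping rows discards too much data, so imputation or a dedicated "Unknown" category is usually the better choice.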

 # 🛠 **2. Methods to Handle Missing Categorical Data**

Different methods apply depending on the missing data percentage and its pattern.

 ## 🔹 **A. Drop Missing Values (Only if Missingness is Low)**

 **Best when:** Missing values are very few (<5% of total data) and random (MCAR).

In [74]:
df_dropped = df.dropna(subset=['Gender', 'City'])


✅ Simple and effective.
    
❌ Risk: Losing valuable data, and biased results if missingness is not completely random (MAR or MNAR).

 ## 🔹 B. **Fill Missing Values with Mode (Most Frequent Value)**

 **Best when:** A category is dominant and missing values are random.

In [75]:
# Fill missing values with Mode (Most Frequent Value)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df['City'] = df['City'].fillna(df['City'].mode()[0])


✅ Works well for low missingness and categorical data with one dominant category.
    
❌ Risk: If missing values are not random, mode imputation might introduce bias.

## 🔹 **C. Fill Missing Values with a New Category ("Unknown"/"Other")**

**Best when:** Missing values have meaning (e.g., survey responses not provided).

In [76]:
df['Gender'] = df['Gender'].fillna('Unknown')
df['City'] = df['City'].fillna('Other')


✅ Retains data integrity without assuming incorrect values.
    
❌ Risk: Can affect analysis if "Unknown" behaves differently from real categories.

 ## 🔹 **D. Fill Missing Values Using Predictive Modeling**

 **Best when:** Data is Missing at Random (MAR) (i.e., missingness is related to another variable).

- Use classification models like K-Nearest Neighbors (KNN), Logistic Regression, or Decision Trees.

1️⃣ **Use KNN Imputation for Categorical Data**

In [77]:
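import numpy as np
from sklearn.impute import KNNImputer

# Encode each categorical column as integer codes and keep missing entries as NaN.
# (astype(str) would turn None into the literal category 'None', leaving nothing to impute.)
cols = ['Gender', 'City']
categories = {c: df[c].astype('category').cat.categories for c in cols}
encoded = df[cols].apply(lambda s: s.astype('category').cat.codes.astype(float))
encoded = encoded.mask(encoded < 0)  # code -1 marks a missing value

# Apply KNN Imputer to both columns so neighbors are found from the other feature
imputer = KNNImputer(n_neighbors=3)
imputed = imputer.fit_transform(encoded)

# Round to the nearest valid code and convert back to labels.
# Note: averaging integer codes is a crude approximation for unordered categories.
for i, c in enumerate(cols):
    codes = np.rint(imputed[:, i]).astype(int)
    df[c] = np.asarray(categories[c])[codes]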


✅ More advanced and effective when missing values depend on other variables.
    
❌ Risk: Computationally expensive; needs sufficient data to work well.
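
A classifier is often a more natural fit than KNN on encoded labels, because it predicts a category directly. The sketch below uses a decision tree on hypothetical data (the `Department` labels and numeric features are made up) to fill a missing categorical column from complete numeric columns.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: 'Department' is partly missing, the other columns are complete
df_demo = pd.DataFrame({
    'Age': [25, 28, 45, 50, 26, 48, 30, 52],
    'Salary': [40000, 42000, 90000, 95000, 41000, 92000, 43000, 97000],
    'Department': ['Support', 'Support', 'Engineering', 'Engineering',
                   None, None, 'Support', 'Engineering'],
})

features = ['Age', 'Salary']
known = df_demo['Department'].notnull()

# Train on rows where the category is known
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(df_demo.loc[known, features], df_demo.loc[known, 'Department'])

# Predict the category for rows where it is missing
df_pred = df_demo.copy()
df_pred.loc[~known, 'Department'] = clf.predict(df_demo.loc[~known, features])
print(df_pred)
```

This only works when the predictor columns are themselves complete (or imputed first) and when they carry real information about the category.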

 ## 🔹 **E. Fill Missing Values Using Probabilistic Imputation**

**Best when:** Data follows a specific pattern but is not predictable with features.

- Assign missing values randomly based on existing category distribution.

In [78]:
import numpy as np

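# Fill each missing 'City' by sampling from the observed category frequencies
# (a single np.random.choice over unique() would give every gap the same, uniformly picked value)
probs = df['City'].value_counts(normalize=True)
missing = df['City'].isnull()
df.loc[missing, 'City'] = np.random.choice(probs.index, size=missing.sum(), p=probs.values)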


✅ Maintains distribution of existing data.

❌ Risk: Introduces randomness that may not reflect true values.

 ### **✨ Key Insights**

- Missing values are a common issue in real-world datasets.

- Choosing the right method depends on the type of data and business problem.

- Proper handling of missing values improves the quality of data and accuracy of machine learning models.
