# Course Title: Data Analytics & Statistics in Python
## Metropolia University of Applied Sciences
## Week 3: Handling Missing Data
### Date: 19.3.2025

<table style="width: 100%;">
  <tr>
    <td style="text-align: left; vertical-align: middle;">
      <ul style="list-style: none; padding-left: 0;">
        <li><strong>Instructor</strong>: Hamed Ahmadinia, Ph.D.</li>
        <li><strong>Email</strong>: hamed.ahmadinia@metropolia.fi</li>
        <li><strong>Web</strong>: www.ahmadinia.fi</li>
      </ul>
    </td>
  </tr>
</table>

## Handling Missing Data in Python
**This notebook follows our lesson on missing data and covers different techniques to handle missing values in datasets.**

We'll explore:
- **Deleting missing values** (Listwise & Pairwise Deletion)
- **Basic Imputation** (Mean, Median, Mode)
- **Advanced Imputation** (KNN, MICE)
- **Time-Series Handling** (Forward Fill, Backward Fill, Interpolation)
- **Machine Learning for Missing Data** (Random Forest, XGBoost)
- **Deep Learning Methods** (Autoencoders)

**Let's get started! 🚀**

In [1]:
# Import necessary libraries
import seaborn as sns  # Seaborn for data visualization
import pandas as pd  # Pandas for data manipulation and analysis
import numpy as np  # NumPy for numerical computations

# Imputation techniques from Scikit-learn
from sklearn.impute import KNNImputer  # KNN Imputer for filling missing values using K-Nearest Neighbors
from sklearn.ensemble import RandomForestRegressor  # Random Forest for machine learning-based imputation
from sklearn.model_selection import train_test_split  # Splitting data into training and testing sets

# TensorFlow and Keras for building deep learning models
from tensorflow import keras  # High-level API for building and training deep learning models
from tensorflow.keras import layers  # Layers module for defining neural network architecture

# Label Encoding from Scikit-learn
from sklearn.preprocessing import LabelEncoder  # Convert categorical variables into numerical labels for ML models

## Step 1: Load a Dataset
To demonstrate missing data handling techniques, we'll use the **Titanic dataset** from Seaborn.

This dataset contains information about passengers on the Titanic, including:
- `age` (some values missing)
- `deck`, the cabin deck letter (most values missing)
- `embarked` and `embark_town` (a few values missing)

Let's load the dataset and check for missing values.

In [None]:
# Load Titanic dataset
df = sns.load_dataset('titanic')

# Show the first few rows
df.head()

## Step 2: Detect Missing Values in Titanic Dataset
Let's check how much data is missing in each column.

In [None]:
# Check missing values
print(df.isnull().sum())
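
Raw counts are easier to judge as a share of all rows. The following is a minimal sketch of that calculation; the `toy` frame is made up for illustration and is not the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the calls
toy = pd.DataFrame({
    'age': [22.0, np.nan, 35.0, np.nan],
    'fare': [7.25, 71.28, 8.05, 53.10],
    'deck': [np.nan, 'C', np.nan, np.nan],
})

# isnull() gives True/False per cell; the mean of booleans is the missing fraction
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

The same expression can be applied to `df` to rank the Titanic columns by how much of each is missing.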

## Step 3: Handle Missing Data
We'll now apply different techniques to handle missing values in the Titanic dataset.

In [None]:
# Ensure column names are lowercase
df.columns = df.columns.str.lower()

# Mean Imputation for 'age'
df['age'] = df['age'].fillna(df['age'].mean())

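# Mode Imputation for 'embarked' and its spelled-out twin 'embark_town'
for col in ['embarked', 'embark_town']:
    df[col] = df[col].fillna(df[col].mode()[0])

# Drop 'deck' (cabin deck letter) if it exists: most of its values are missing
if 'deck' in df.columns:
    df = df.drop(columns=['deck'])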

# Check missing values again
print(df.isnull().sum())
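
The overview also lists listwise and pairwise deletion. As a rough sketch of the difference, assuming a small made-up frame (`toy` is hypothetical, not the Titanic data): listwise deletion drops any row with a gap, while pairwise deletion lets each statistic use every row that is complete for the columns it needs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the calls
toy = pd.DataFrame({
    'age':  [22.0, np.nan, 35.0, 54.0, 2.0],
    'fare': [7.25, 71.28, np.nan, 51.86, 21.08],
    'sibsp': [1, 1, 0, 0, 3],
})

# Listwise deletion: drop every row that has at least one missing value
listwise = toy.dropna()
print(len(toy), '->', len(listwise))

# Pairwise deletion: each correlation uses all rows that are complete
# for the two columns involved; DataFrame.corr() works this way by default
pairwise_corr = toy.corr()

# The same correlation after listwise deletion is based on fewer rows
listwise_corr = listwise.corr()
print(pairwise_corr.loc['age', 'sibsp'], listwise_corr.loc['age', 'sibsp'])
```

Here the `age`/`sibsp` correlation uses four rows under pairwise deletion but only three under listwise deletion, so the two estimates differ.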

## Step 4: Advanced Imputation (KNN)
We can use **KNN Imputation** to predict missing values based on other features.

In [None]:
# Load Titanic dataset
df = sns.load_dataset('titanic')

# Selecting only numerical features for KNN Imputation
numerical_features = ['age', 'fare', 'pclass', 'sibsp', 'parch']  # Include relevant numerical columns

# Initialize KNN Imputer (using k=5 neighbors)
knn_imputer = KNNImputer(n_neighbors=5)

# Apply KNN imputation to numerical columns
df[numerical_features] = knn_imputer.fit_transform(df[numerical_features])

# Check missing values
print(df.isnull().sum())

## Step 5: Advanced Imputation (KNN, MICE)
This cell repeats **K-Nearest Neighbors (KNN)** imputation with only `age` and `fare` as features, so a missing age is filled from passengers who paid similar fares.

In [None]:
# Load Titanic dataset
df = sns.load_dataset('titanic')

# Impute using only 'age' and 'fare' as KNN features
knn_imputer = KNNImputer(n_neighbors=5)
df[['age', 'fare']] = knn_imputer.fit_transform(df[['age', 'fare']])

# Check missing values
df.isnull().sum()
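
MICE is named in this section but not run in the cells above. One possible sketch uses scikit-learn's `IterativeImputer`, which is inspired by MICE but returns a single imputed dataset rather than multiple ones. The class is still marked experimental, so it needs the extra `enable_iterative_imputer` import. The `toy` data and the hidden rows are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (unlocks IterativeImputer)
from sklearn.impute import IterativeImputer

# Hypothetical toy data: 'fare' roughly tracks 'age', so each column informs the other
rng = np.random.default_rng(0)
true_age = rng.uniform(1, 70, size=50)
toy = pd.DataFrame({
    'age': true_age.copy(),
    'fare': 2 * true_age + rng.normal(0, 1, size=50),
})

# Hide every 10th age so there is something to impute
hidden = np.arange(0, 50, 10)
toy.loc[hidden, 'age'] = np.nan

# Each column with gaps is modeled from the other columns, repeated for up to max_iter rounds
mice = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(mice.fit_transform(toy), columns=toy.columns)
print(imputed.isnull().sum())
```

On the Titanic data the same imputer could be fitted on the numeric columns used for KNN in Step 4.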

## Step 6: Time-Series Methods (Forward Fill, Backward Fill, Interpolation)
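These methods are useful when working with time-dependent data, where neighboring rows are close in time. The Titanic rows have no time order, so the cell below only demonstrates the syntax.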

In [None]:
# Load Titanic dataset
df = sns.load_dataset('titanic')

# Forward Fill and Backward Fill
df['age'] = df['age'].ffill()
df['fare'] = df['fare'].bfill()

# Interpolation
df['age'] = df['age'].interpolate()

# Check missing values
df.isnull().sum()
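
On data with a real time index the three methods give visibly different results. A minimal sketch, assuming a made-up daily temperature series (the dates and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperature readings with two gaps
dates = pd.date_range('2025-03-01', periods=6, freq='D')
temps = pd.Series([10.0, np.nan, 14.0, np.nan, np.nan, 20.0], index=dates)

filled = pd.DataFrame({
    'raw': temps,
    'ffill': temps.ffill(),         # carry the last known value forward
    'bfill': temps.bfill(),         # pull the next known value backward
    'linear': temps.interpolate(),  # straight line between known points
})
print(filled)
```

Forward fill and backward fill copy an existing reading, while interpolation produces new in-between values, which usually suits smoothly changing measurements better.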

## Step 7: Machine Learning for Imputation (Predicting Missing Values with Random Forest)
Using Random Forest to predict and fill missing values in the 'age' column.

In [None]:
# Load Titanic dataset
df = sns.load_dataset('titanic')

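features = ['fare', 'parch', 'sibsp']  # Predictors used to estimate 'age'

# Drop rows where 'age' is missing for training
df_ml = df.dropna(subset=['age'])
X = df_ml[features]
y = df_ml['age']

# Train-test split: hold out 20% of the known ages to check prediction quality
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model (fixed seed so results are reproducible)
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Mean absolute error on the held-out ages, in years
mae = np.mean(np.abs(rf.predict(X_test) - y_test))
print(f'Held-out MAE: {mae:.2f} years')

# Predict missing values only if they exist
missing_age = df['age'].isnull()
if missing_age.any():
    df.loc[missing_age, 'age'] = rf.predict(df.loc[missing_age, features])
else:
    print('No missing values left in age.')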

## Step 8: Deep Learning for Imputation (Autoencoders to Fill Missing Data)
Using a neural network (Autoencoder) to learn patterns and fill missing values.

In [None]:
# Load Titanic dataset
df = sns.load_dataset('titanic')

# Convert categorical variables to numbers
# Note: astype(str) turns NaN into the literal label 'nan', so missing
# categories become their own class instead of staying missing
categorical_cols = ['sex', 'embarked', 'who', 'class', 'alive', 'deck', 'embark_town']
df_encoded = df.copy()

for col in categorical_cols:
    df_encoded[col] = LabelEncoder().fit_transform(df_encoded[col].astype(str))

# Remember which cells are still missing (only 'age' after the encoding above)
missing_mask = df_encoded.isnull().values

# Temporarily fill missing values with column mean and convert all values to float
df_filled = df_encoded.fillna(df_encoded.mean()).astype(np.float32).values  # Ensures TensorFlow compatibility

# Define Autoencoder model with Input layer
input_dim = df_filled.shape[1]
autoencoder = keras.Sequential([
    keras.Input(shape=(input_dim,)),  # Input layer
    layers.Dense(64, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(input_dim, activation='linear')
])

# Compile and Train
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(df_filled, df_filled, epochs=10, batch_size=16, verbose=1)

# Reconstruct the data, then keep observed values and use the
# reconstruction only where the original value was missing
reconstructed = autoencoder.predict(df_filled)
df_filled_autoencoder = pd.DataFrame(
    np.where(missing_mask, reconstructed, df_filled), columns=df_encoded.columns
)

# Check missing values
print(df_filled_autoencoder.isnull().sum())


## Step 9: Comparing Methods (Checking which Imputation Works Best)
The cell below applies mean, KNN, and Random Forest imputation to the same data so the results can be inspected side by side.

In [None]:
# Load Titanic dataset
df = sns.load_dataset('titanic')

# Select numerical columns
numeric_cols = df.select_dtypes(include=['number']).columns

# Mean Imputation
df_mean = df.copy()
df_mean[numeric_cols] = df_mean[numeric_cols].fillna(df_mean[numeric_cols].mean())

# KNN Imputation
knn_imputer = KNNImputer(n_neighbors=5)
df_knn = df.copy()
df_knn[numeric_cols] = knn_imputer.fit_transform(df_knn[numeric_cols])

# Debugging: Show missing values after KNN imputation
print("Missing values after KNN Imputation:")
print(df_knn.isnull().sum())

# ML-Based Imputation (Random Forest)
df_ml_imputed = df.copy()

# Train Random Forest only if 'age' has missing values
if df['age'].isnull().sum() > 0:
    # Prepare dataset for ML-based imputation
    df_train = df_ml_imputed.dropna(subset=['age'])  # Remove missing target values
    df_test = df_ml_imputed[df_ml_imputed['age'].isnull()]  # Rows where 'age' is missing
    
    X_train = df_train[['fare', 'parch', 'sibsp']]
    y_train = df_train['age']
    
    X_test = df_test[['fare', 'parch', 'sibsp']]
    
    # Train a Random Forest model
    rf = RandomForestRegressor(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    
    # Predict missing values in 'age'
    df_ml_imputed.loc[df_ml_imputed['age'].isnull(), 'age'] = rf.predict(X_test)
    
    print("ML-Based Imputation Done!")
else:
    print("No missing values in 'age'.")

# Display results
print("\nFirst 5 rows after ML Imputation:")
print(df_ml_imputed[['age', 'fare', 'parch', 'sibsp']].head())
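
Inspecting the filled frames does not say which method is closest to the truth, because the true values are unknown. One common way to get a number is to hide values that are known, impute them, and measure the error on the hidden cells. The sketch below does this on made-up data (`full`, `hidden`, and the 20% hiding rate are assumptions for illustration), comparing mean, median, and KNN imputation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical complete data: 'fare' depends on 'age', so neighbors are informative
rng = np.random.default_rng(42)
age = rng.uniform(1, 70, size=200)
full = pd.DataFrame({'age': age, 'fare': 3 * age + rng.normal(0, 5, size=200)})

# Hide 20% of the known ages so the true values are available for scoring
hidden = rng.choice(full.index, size=40, replace=False)
masked = full.copy()
masked.loc[hidden, 'age'] = np.nan

candidates = {
    'mean': masked.fillna(masked.mean()),
    'median': masked.fillna(masked.median()),
    'knn': pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(masked),
                        columns=masked.columns, index=masked.index),
}

# Mean absolute error on the hidden cells only (lower is better)
mae = {name: (filled.loc[hidden, 'age'] - full.loc[hidden, 'age']).abs().mean()
       for name, filled in candidates.items()}
print(pd.Series(mae).sort_values())
```

Because `fare` was built to track `age` here, KNN has a clear advantage; on real data the ranking depends on how strongly the other columns relate to the one being imputed.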

## Step 10: Summary & Best Practices
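✅ **For small amounts of missing data (<5%)**, simple imputation (mean, median, mode) is usually fine.

✅ **For 5-30% missing values**, try KNN or MICE.

✅ **For large amounts of missing data (>30%)**, consider ML or deep learning imputation, or drop the column when most of its values are missing, as in Step 3.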

🚀 **Always test different methods and compare the results!**