
In [1]:

import numpy as np
import pandas as pd

# Create a sample dataset
data = {
    'Age': [25, 27, 29, np.nan, 32, 35, np.nan, 40, 42, 45],
    'Salary': [50000, 54000, 58000, 60000, np.nan, 65000, 70000, 72000, 75000, np.nan],
    'Experience': [1, 2, 3, 4, 5, np.nan, 7, 8, 9, 10]
}

df = pd.DataFrame(data)
print("Original DataFrame with Missing Values:")
print(df)

Original DataFrame with Missing Values:
    Age   Salary  Experience
0  25.0  50000.0         1.0
1  27.0  54000.0         2.0
2  29.0  58000.0         3.0
3   NaN  60000.0         4.0
4  32.0      NaN         5.0
5  35.0  65000.0         NaN
6   NaN  70000.0         7.0
7  40.0  72000.0         8.0
8  42.0  75000.0         9.0
9  45.0      NaN        10.0


1. Simple Imputation (Univariate)

**Concept:** This is the most basic approach. Each missing value is replaced with a summary statistic of its own column (mean, median, or mode) or with a constant value. Every feature is treated independently.

Mean: Good for normally distributed numerical data.

Median: Better if the data has outliers.

Most Frequent (Mode): Used for categorical data.

In [2]:
from sklearn.impute import SimpleImputer

# Create a copy of the dataframe to keep the original safe
df_simple = df.copy()

# Initialize the imputer (strategy can be 'mean', 'median', 'most_frequent', 'constant')
imputer = SimpleImputer(strategy='mean')

# Fit and transform the data
# Note: output is a numpy array, so we convert it back to DataFrame
df_simple_imputed = pd.DataFrame(imputer.fit_transform(df_simple), columns=df_simple.columns)

print("\n--- Simple Imputation (Mean) ---")
print(df_simple_imputed)


--- Simple Imputation (Mean) ---
      Age   Salary  Experience
0  25.000  50000.0    1.000000
1  27.000  54000.0    2.000000
2  29.000  58000.0    3.000000
3  34.375  60000.0    4.000000
4  32.000  63000.0    5.000000
5  35.000  65000.0    5.444444
6  34.375  70000.0    7.000000
7  40.000  72000.0    8.000000
8  42.000  75000.0    9.000000
9  45.000  63000.0   10.000000
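The choice of strategy matters most when a column is skewed or categorical. The following is a minimal sketch on made-up values (the `income` and `city` arrays are illustrative and not part of the dataset above): a single extreme income pulls the mean far from the typical value while the median stays close to it, and `most_frequent` fills a string column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative column with one extreme value (500000) and one missing entry
income = np.array([[30000.0], [32000.0], [31000.0], [np.nan], [500000.0]])

# Mean is pulled upward by the outlier; median is not
mean_fill = SimpleImputer(strategy='mean').fit_transform(income)[3, 0]
median_fill = SimpleImputer(strategy='median').fit_transform(income)[3, 0]

# Illustrative categorical column stored as an object array
city = np.array([['Pune'], ['Delhi'], ['Pune'], [np.nan], ['Pune']], dtype=object)
city_fill = SimpleImputer(strategy='most_frequent').fit_transform(city)[3, 0]

print("mean fill:  ", mean_fill)
print("median fill:", median_fill)
print("mode fill:  ", city_fill)
```

Here the mean fill is 148250 while the median fill is 31500, which is much closer to the three ordinary incomes.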


2. K-Nearest Neighbors (KNN) Imputation

Concept: This method finds the k rows (neighbors) most similar to the row with the missing value, then averages those neighbors' values for that feature to fill the gap.

This is often more accurate than simple imputation because it uses the other features in a row to decide which rows are similar, instead of treating each column in isolation.

Key Parameter: n_neighbors (the number of neighbors to use).

In [5]:
from sklearn.impute import KNNImputer

df_knn = df.copy()

# Initialize KNN Imputer
# n_neighbors=3 means it looks at the 3 most similar rows
knn_imputer = KNNImputer(n_neighbors=3)

df_knn_imputed = pd.DataFrame(knn_imputer.fit_transform(df_knn), columns=df_knn.columns)

print("\n--- KNN Imputation ---")
print(df_knn_imputed)


--- KNN Imputation ---
         Age        Salary  Experience
0  25.000000  50000.000000    1.000000
1  27.000000  54000.000000    2.000000
2  29.000000  58000.000000    3.000000
3  35.333333  60000.000000    4.000000
4  32.000000  62666.666667    5.000000
5  35.000000  65000.000000    7.666667
6  39.000000  70000.000000    7.000000
7  40.000000  72000.000000    8.000000
8  42.000000  75000.000000    9.000000
9  45.000000  72333.333333   10.000000
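To see where a KNN-imputed value comes from, the Age filled in for row 3 can be reproduced by hand. The sketch below assumes the imputer's default settings (`metric='nan_euclidean'`, `weights='uniform'`) and uses `nan_euclidean_distances` from `sklearn.metrics.pairwise`; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

nan = np.nan
# Same values as the sample dataset: columns are Age, Salary, Experience
X = np.array([
    [25, 50000, 1],
    [27, 54000, 2],
    [29, 58000, 3],
    [nan, 60000, 4],
    [32, nan, 5],
    [35, 65000, nan],
    [nan, 70000, 7],
    [40, 72000, 8],
    [42, 75000, 9],
    [45, nan, 10],
])

target = 3  # row whose Age is missing

# Distance from the target row to every row, skipping missing coordinates
dists = nan_euclidean_distances(X[[target]], X)[0]

# Only rows that actually have an Age can donate a value
donors = [i for i in range(len(X)) if i != target and not np.isnan(X[i, 0])]

# Take the 3 closest donors and average their Age
nearest = sorted(donors, key=lambda i: dists[i])[:3]
manual_age = X[nearest, 0].mean()

print("nearest donor rows:", nearest)
print("imputed Age:", round(manual_age, 6))
```

The three donors are rows 4, 9 and 2 (ages 32, 45 and 29), whose average of about 35.33 matches the output above. Rows 4 and 9 rank as closest because their Salary is missing, so only the small Experience difference enters the distance, whereas Salary differences in the thousands dominate every other comparison. For that reason it is usually worth scaling the features before applying KNN imputation.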


3. Multivariate Imputation by Chained Equations (MICE)

Concept: Also known as Iterative Imputation. This method models each feature that has missing values as a function of the other features.

It first fills the missing values with a placeholder (e.g., the column mean).

It then treats each column with missing values in turn as the "target" and fits a regression model (such as BayesianRidge) on the other columns to predict the missing entries.

It repeats this cycle until the imputed values converge or a maximum number of iterations is reached.

In [4]:
from sklearn.experimental import enable_iterative_imputer  # Explicitly enable
from sklearn.impute import IterativeImputer

df_mice = df.copy()

# Initialize MICE Imputer
# random_state ensures reproducibility
mice_imputer = IterativeImputer(max_iter=10, random_state=0)

df_mice_imputed = pd.DataFrame(mice_imputer.fit_transform(df_mice), columns=df_mice.columns)

print("\n--- MICE (Iterative) Imputation ---")
print(df_mice_imputed)


--- MICE (Iterative) Imputation ---
         Age        Salary  Experience
0  25.000000  50000.000000     1.00000
1  27.000000  54000.000000     2.00000
2  29.000000  58000.000000     3.00000
3  31.300571  60000.000000     4.00000
4  32.000000  61563.560421     5.00000
5  35.000000  65000.000000     5.63457
6  38.400176  70000.000000     7.00000
7  40.000000  72000.000000     8.00000
8  42.000000  75000.000000     9.00000
9  45.000000  79054.549645    10.00000
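The chained-equations loop described in this section can also be written out by hand. The sketch below is a simplified illustration, not scikit-learn's implementation: it swaps in plain `LinearRegression` for the estimator, runs a fixed number of rounds instead of checking convergence, and the helper name `chained_impute` is hypothetical. Its numbers will therefore differ somewhat from the `IterativeImputer` output above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    'Age': [25, 27, 29, np.nan, 32, 35, np.nan, 40, 42, 45],
    'Salary': [50000, 54000, 58000, 60000, np.nan, 65000, 70000, 72000, 75000, np.nan],
    'Experience': [1, 2, 3, 4, 5, np.nan, 7, 8, 9, 10],
})

def chained_impute(data, n_rounds=10):
    """Hypothetical helper: simplified chained-equations imputation."""
    mask = data.isna()
    # Step 1: placeholder fill with each column's mean
    filled = data.fillna(data.mean())
    for _ in range(n_rounds):
        for col in data.columns:
            if not mask[col].any():
                continue
            others = filled.columns.drop(col)
            observed = ~mask[col]
            # Step 2: regress the target column on the other columns,
            # training only on rows where the target was actually observed
            model = LinearRegression()
            model.fit(filled.loc[observed, others], filled.loc[observed, col])
            # Overwrite only the originally missing entries with predictions
            filled.loc[mask[col], col] = model.predict(filled.loc[mask[col], others])
    # Step 3: the outer loop repeats the cycle a fixed number of times
    return filled

result = chained_impute(df)
print(result)
```

Observed values are never modified; only the cells that were missing in the input are updated on each round.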


4. Time-Series Imputation (Forward/Backward Fill)

Concept: If your data is a time series (ordered by time), filling with the mean is risky because it ignores trends. Instead, we propagate the last known value forward or the next valid value backward, or interpolate between the two.

FFill (Forward Fill): Takes the previous valid value and fills it forward.

BFill (Backward Fill): Takes the next valid value and fills it backward.



In [3]:
df_time = df.copy()

# Forward Fill
df_ffill = df_time.ffill()

# Backward Fill
df_bfill = df_time.bfill()

# Linear Interpolation (Connecting the dots)
# This assumes a linear relationship between time steps
df_interp = df_time.interpolate(method='linear')

print("\n--- Time Series: Linear Interpolation ---")
print(df_interp)


--- Time Series: Linear Interpolation ---
    Age   Salary  Experience
0  25.0  50000.0         1.0
1  27.0  54000.0         2.0
2  29.0  58000.0         3.0
3  30.5  60000.0         4.0
4  32.0  62500.0         5.0
5  35.0  65000.0         6.0
6  37.5  70000.0         7.0
7  40.0  72000.0         8.0
8  42.0  75000.0         9.0
9  45.0  75000.0        10.0
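Forward and backward fill behave differently at the edges of a series, which is easiest to see side by side. A minimal sketch on the Salary column of the same dataset (the `comparison` frame is just an illustrative way to line the results up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 27, 29, np.nan, 32, 35, np.nan, 40, 42, 45],
    'Salary': [50000, 54000, 58000, 60000, np.nan, 65000, 70000, 72000, 75000, np.nan],
    'Experience': [1, 2, 3, 4, 5, np.nan, 7, 8, 9, 10],
})

df_ffill = df.ffill()  # carry the previous valid value forward
df_bfill = df.bfill()  # carry the next valid value backward

# Line up the original Salary column with both fills
comparison = pd.DataFrame({
    'Salary': df['Salary'],
    'ffill': df_ffill['Salary'],
    'bfill': df_bfill['Salary'],
})
print(comparison)
```

In row 4 the forward fill uses 60000 and the backward fill uses 65000. In row 9 the forward fill uses 75000, but the backward fill leaves NaN because there is no later value to pull from. Chaining the two, as in `df.ffill().bfill()`, is a common way to cover gaps at both ends.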
