                                    ***** Assignment 5 *****

Question 1


1. What is the difference between MCAR (Missing Completely at Random), MAR (Missing at Random), and MNAR (Missing Not at Random) in the context of missing data analysis?

Missing Completely at Random (MCAR):

Definition: Data is considered MCAR if the probability of a data point being missing is completely independent of any observed or unobserved data. In other words, the missingness is entirely random and does not depend on the values of any variables in the dataset.

Implication: If data is MCAR, the observed cases are a random subset of the full dataset, so an analysis of the remaining data is not biased by the missing values. Dropping incomplete rows still reduces the sample size and therefore the precision of the estimates.

Example: A survey respondent accidentally skips a question due to a printing error on the survey form.

Missing at Random (MAR):

Definition: Data is considered MAR if the probability of a data point being missing depends only on observed data and, once those observed variables are taken into account, not on the missing value itself. In other words, the missingness can be fully explained by other observed variables in the dataset.

Implication: If data is MAR, the missing data can be predicted using the observed data, and appropriate imputation methods can be used to handle the missing values without introducing significant bias.

Example: In a medical study, older patients are less likely to report their income. The missingness of income data is related to the age of the patients, which is observed.

Missing Not at Random (MNAR):

Definition: Data is considered MNAR if the probability of a data point being missing is related to the value of the missing data itself. In other words, the missingness is directly related to the unobserved data.

Implication: If data is MNAR, the missing data cannot be predicted using the observed data alone, and the analysis may be biased if the missing data is not properly accounted for. Special techniques or assumptions are often required to handle MNAR data.

Example: In a survey about alcohol consumption, individuals who drink heavily may be less likely to report their drinking habits. The missingness of the data is related to the actual value of alcohol consumption.
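
The three mechanisms can be illustrated with a small simulation. This is a minimal sketch on synthetic data: the age/income relationship, the missingness probabilities, and all variable names are assumptions made for illustration, not part of the assignment data. It compares the mean of the reported incomes with the true mean, and also tries a simple correction that predicts income from age.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic population (assumed for illustration): income rises with age.
age = rng.normal(50, 10, n)
income = 30 + 0.5 * age + rng.normal(0, 5, n)
true_mean = income.mean()


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Boolean masks: True means the income value is missing.
masks = {
    "MCAR": rng.random(n) < 0.3,                         # constant probability
    "MAR": rng.random(n) < sigmoid((age - 50) / 5),      # depends on observed age
    "MNAR": rng.random(n) < sigmoid((income - 55) / 5),  # depends on income itself
}

complete_case_bias = {}
age_adjusted_bias = {}
for name, missing in masks.items():
    observed = ~missing
    # Complete-case estimate: mean of the incomes that were actually reported.
    complete_case_bias[name] = income[observed].mean() - true_mean
    # Age-adjusted estimate: fit income ~ age on observed rows, predict for everyone.
    slope, intercept = np.polyfit(age[observed], income[observed], 1)
    age_adjusted_bias[name] = (intercept + slope * age).mean() - true_mean
    print(f"{name}: complete-case bias {complete_case_bias[name]:+.2f}, "
          f"age-adjusted bias {age_adjusted_bias[name]:+.2f}")
```

Under these assumptions the MCAR bias should be close to zero, the MAR complete-case mean should be clearly too low but largely corrected once age is used, and the MNAR estimate should stay biased even after the age adjustment.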

Question 2


2. Why is imputation necessary in data analysis, and what happens if imputation is used in machine learning models?

Imputation is the process of replacing missing data with substituted values. It is necessary in data analysis for several reasons:

Why Imputation is Necessary:

* Completeness: 
Many statistical and machine learning algorithms require a complete dataset to function correctly. Missing values can cause these algorithms to fail or produce biased results.

* Bias Reduction:
Proper imputation methods can reduce the bias introduced by missing data. Without imputation, the analysis might be based on a non-representative subset of the data, leading to incorrect conclusions.

* Preserving Data: 
Imputation allows for the retention of all available data. Simply removing rows or columns with missing values can lead to a significant loss of information, especially if the missing data is substantial.

* Improving Model Performance: 
Imputation can improve the performance of machine learning models by providing a more complete and representative dataset for training.
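
The "Preserving Data" point can be made concrete with a quick sketch. The dataset here is synthetic (1000 rows, 5 columns, roughly 10% of cells removed at random), and the names `df_demo`, `rows_kept_by_dropna` and `imputed` are hypothetical, chosen only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(1)

# Synthetic data: 1000 rows, 5 numeric columns.
df_demo = pd.DataFrame(rng.normal(size=(1000, 5)), columns=list("abcde"))

# Blank out roughly 10% of the cells, independently per cell.
df_demo = df_demo.mask(rng.random(df_demo.shape) < 0.10)

# Listwise deletion keeps only rows with no missing value in any column.
rows_kept_by_dropna = len(df_demo.dropna())

# Median imputation keeps every row.
imputed = SimpleImputer(strategy="median").fit_transform(df_demo)

print("rows in original data:", len(df_demo))
print("rows left after dropna():", rows_kept_by_dropna)
print("rows left after imputation:", imputed.shape[0])
```

With 10% of cells missing in each of 5 columns, only about 0.9^5 ≈ 59% of rows are expected to be complete, so dropping rows discards far more data than the per-cell missing rate suggests.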

What Happens if Imputation is Used in Machine Learning Models:

* Improved Accuracy:
Imputation can lead to improved model accuracy by providing a more complete dataset. This allows the model to learn from all available information, rather than just a subset.

* Reduced Bias:
Proper imputation methods can reduce the bias introduced by missing data, leading to more reliable and generalizable models.

* Handling Missing Data During Prediction:
The imputer should be fitted on the training data and then applied unchanged during prediction, so that missing values in new, unseen data are filled with the same statistics the model was trained on.

* Potential Risks:

    - Incorrect Imputation: If the imputation method is not appropriate for the data, it can introduce bias and lead to incorrect conclusions. For example, mean imputation on a strongly skewed column fills the gaps with a value that is not typical of the observations, and on any column it weakens the correlations with other variables.

    - Overfitting: Imputation can sometimes lead to overfitting, especially if the imputed values are not representative of the true underlying data distribution.

    - Loss of Variability: Simple imputation methods like mean or median imputation can reduce the variability in the data, potentially leading to less robust models.
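
Two of the points above, consistent handling at prediction time and loss of variability, can be checked with a short sketch. The data is synthetic and the variable names are assumptions for illustration; the imputer is fitted on the training split only and then reused on the test split.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic single feature with roughly 30% of values removed at random.
X = rng.normal(loc=100, scale=15, size=(500, 1))
X_missing = X.copy()
X_missing[rng.random(500) < 0.3, 0] = np.nan

X_train, X_test = train_test_split(X_missing, test_size=0.2, random_state=0)

# Fit on the training split only, then reuse the learned mean on the test split.
imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)

# Compare the spread of the observed values with the spread after imputation.
std_observed = np.nanstd(X_train[:, 0])
std_imputed = X_train_imp[:, 0].std()

print("mean learned from training data:", imputer.statistics_[0])
print("std of observed training values:", std_observed)
print("std after mean imputation:", std_imputed)
```

Because every filled value sits exactly at the mean, the standard deviation after imputation is always smaller than that of the observed values, which is the loss of variability described above.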

Question 3


3. How can KNN imputation be implemented in Python for handling missing data, and how does it compare with other imputation methods? Write Python code to demonstrate KNN imputation using a suitable example.


K-Nearest Neighbors (KNN) imputation is a method that uses the values of the nearest neighbors to impute missing data. It is more sophisticated than simple imputation methods like mean or median imputation and can capture the structure of the data better.

* Implementation of KNN Imputation in Python:
    
    > To implement KNN imputation in Python, we can use the KNNImputer class from the sklearn.impute module.

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer

# Create a sample dataset with missing values
data = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [25, 30, 35, np.nan, 28],
    "Score": [85, 90, np.nan, 88, 95]
}

df = pd.DataFrame(data)

# Display the original dataset
print("Original Dataset with missing values:")
print(df)

# KNN Imputation
knn_imputer = KNNImputer(n_neighbors=2)
df_knn_imputed = df.copy()
df_knn_imputed[['Age', 'Score']] = knn_imputer.fit_transform(df[['Age', 'Score']])

# Display the dataset after KNN imputation
print("\nDataset with missing values imputed using KNN:")
print(df_knn_imputed)

# Mean Imputation for comparison
mean_imputer = SimpleImputer(strategy='mean')
df_mean_imputed = df.copy()
df_mean_imputed[['Age', 'Score']] = mean_imputer.fit_transform(df[['Age', 'Score']])

# Display the dataset after mean imputation
print("\nDataset with missing values imputed using mean:")
print(df_mean_imputed)

# Median Imputation for comparison
median_imputer = SimpleImputer(strategy='median')
df_median_imputed = df.copy()
df_median_imputed[['Age', 'Score']] = median_imputer.fit_transform(df[['Age', 'Score']])

# Display the dataset after median imputation
print("\nDataset with missing values imputed using median:")
print(df_median_imputed)

Original Dataset with missing values:
      Name   Age  Score
0    Alice  25.0   85.0
1      Bob  30.0   90.0
2  Charlie  35.0    NaN
3    David   NaN   88.0
4      Eve  28.0   95.0

Dataset with missing values imputed using KNN:
      Name   Age  Score
0    Alice  25.0   85.0
1      Bob  30.0   90.0
2  Charlie  35.0   92.5
3    David  27.5   88.0
4      Eve  28.0   95.0

Dataset with missing values imputed using mean:
      Name   Age  Score
0    Alice  25.0   85.0
1      Bob  30.0   90.0
2  Charlie  35.0   89.5
3    David  29.5   88.0
4      Eve  28.0   95.0

Dataset with missing values imputed using median:
      Name   Age  Score
0    Alice  25.0   85.0
1      Bob  30.0   90.0
2  Charlie  35.0   89.0
3    David  29.0   88.0
4      Eve  28.0   95.0
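
The KNN result for Charlie's Score (92.5) can be reproduced by hand. KNNImputer uses the `nan_euclidean` metric by default, which measures distance only over the features both rows have observed. The sketch below is a simplified re-computation for one cell, not the library's internal code; the names `target`, `donors` and `manual_score` are hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Age and Score columns from the example above.
X = np.array([
    [25, 85],       # Alice
    [30, 90],       # Bob
    [35, np.nan],   # Charlie (Score missing)
    [np.nan, 88],   # David (Age missing)
    [28, 95],       # Eve
], dtype=float)

dist = nan_euclidean_distances(X)

target = 2                 # Charlie's row
score = X[:, 1]
d = dist[target].copy()
d[np.isnan(score)] = np.inf   # rows without a Score cannot donate one
d[np.isnan(d)] = np.inf       # rows sharing no observed feature with Charlie

# Average the Score of the two closest donors (n_neighbors=2).
donors = np.argsort(d)[:2]
manual_score = score[donors].mean()

print("donor rows:", donors)
print("imputed Score for Charlie:", manual_score)
```

The two closest donors are Bob and Eve, whose ages (30 and 28) are nearest to Charlie's 35, so the imputed value is (90 + 95) / 2 = 92.5. David cannot be compared with Charlie at all because the two rows share no observed feature.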


* Comparison:

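    * KNN Imputation: Fills each missing value from the rows that are most similar on the other features, so it can preserve relationships between variables. In the output above, Charlie's Score (92.5) is the average of Bob's and Eve's scores, the two rows closest in Age. It is slower on large datasets and sensitive to feature scaling.

    * Mean Imputation: Replaces every missing value with the column mean (29.5 for Age and 89.5 for Score here). It is simple and fast, but it ignores relationships between columns, shrinks the variance, and is pulled by outliers and skew.

    * Median Imputation: Replaces missing values with the column median (29.0 for Age and 89.0 for Score here), which is more robust to outliers than the mean but otherwise shares its limitations.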
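
The five-row example is too small to show how the methods differ in accuracy. The following sketch compares them on synthetic data where the true values are known. It assumes two strongly correlated features, which is the situation where KNN is expected to help; with uncorrelated features the advantage would largely disappear. All names are hypothetical.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(0)
n = 500

# Two correlated features (assumption): x2 is roughly twice x1.
x1 = rng.normal(0, 1, n)
x2 = 2 * x1 + rng.normal(0, 0.3, n)
X_true = np.column_stack([x1, x2])

# Remove about 20% of x2 completely at random.
X_missing = X_true.copy()
mask = rng.random(n) < 0.2
X_missing[mask, 1] = np.nan


def rmse(imputer):
    """Root mean squared error of the imputed x2 values against the truth."""
    X_imp = imputer.fit_transform(X_missing)
    return float(np.sqrt(np.mean((X_imp[mask, 1] - X_true[mask, 1]) ** 2)))


errors = {
    "knn": rmse(KNNImputer(n_neighbors=5)),
    "mean": rmse(SimpleImputer(strategy="mean")),
    "median": rmse(SimpleImputer(strategy="median")),
}
for name, err in errors.items():
    print(f"{name}: RMSE = {err:.3f}")
```

Under these assumptions KNN should recover the removed values far more closely than mean or median imputation, because it uses x1 to locate rows with similar x2.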

Question 4


4. Write a Python code example to demonstrate mode imputation using the data provided below.

In [2]:
import pandas as pd
import numpy as np

# Creating the DataFrame with missing values
data = {
    'Category': ['A', 'B', None, 'A', 'B', 'C', None, 'C', 'A', 'B'],
    'Subcategory': ['X', None, 'Y', 'Z', 'X', None, 'Y', 'X', None, 'Z'],
    'Value': [10, np.nan, 30, 40, 50, np.nan, 70, 80, 90, np.nan],
    'Score': [np.nan, 88, 77, np.nan, 95, 85, np.nan, 92, 100, 78]
}

df = pd.DataFrame(data)

# Impute missing values with the mode (most frequent value) of each column.
# mode() returns every tied value in sorted order, so [0] picks the smallest:
# 'A' for Category (tied with 'B'), 'X' for Subcategory, and, because all
# observed Value and Score entries are unique, their minimums 10.0 and 77.0.
for col in ['Category', 'Subcategory', 'Value', 'Score']:
    # Assign the result back instead of calling fillna(..., inplace=True) on
    # df[col], which triggers a chained-assignment FutureWarning in pandas.
    df[col] = df[col].fillna(df[col].mode()[0])

# Display the DataFrame after imputation
print(df)


  Category Subcategory  Value  Score
0        A           X   10.0   77.0
1        B           X   10.0   88.0
2        A           Y   30.0   77.0
3        A           Z   40.0   77.0
4        B           X   50.0   95.0
5        C           X   10.0   85.0
6        A           Y   70.0   77.0
7        C           X   80.0   92.0
8        A           X   90.0  100.0
9        B           Z   10.0   78.0
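
The per-column loop can also be written as a single call. This is a minimal alternative sketch on the same data; `df_q4`, `first_modes` and `df_filled` are hypothetical names. `DataFrame.mode()` returns one row per tied mode, so `.iloc[0]` takes the first mode of every column, and `fillna` then matches those values to columns by name.

```python
import numpy as np
import pandas as pd

df_q4 = pd.DataFrame({
    'Category': ['A', 'B', None, 'A', 'B', 'C', None, 'C', 'A', 'B'],
    'Subcategory': ['X', None, 'Y', 'Z', 'X', None, 'Y', 'X', None, 'Z'],
    'Value': [10, np.nan, 30, 40, 50, np.nan, 70, 80, 90, np.nan],
    'Score': [np.nan, 88, 77, np.nan, 95, 85, np.nan, 92, 100, 78],
})

# First mode of each column (ties are resolved by sorted order).
first_modes = df_q4.mode().iloc[0]

# Fill every column, categorical and numeric, in one step without inplace=True.
df_filled = df_q4.fillna(first_modes)
print(df_filled)
```

Because every observed Value and Score is unique, the mode of those columns is not a meaningful "most frequent" value; for continuous columns like these, mean, median or KNN imputation is usually a better fit, while mode imputation suits the categorical columns.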
