<a href="https://colab.research.google.com/github/Saisathvik-Achanta/Jyesta-Data-Science-Internship/blob/main/handling_missing_inputs_using_various_imputations_techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Handle missing values in a randomly generated dataset using mean, median, mode, k-nearest neighbors (knn), and iterative imputation techniques.

## Generate a random dataset

### Subtask:
Create a dataset with some columns containing missing values.


**Reasoning**:
Generate a NumPy array of random integers, cast it to float (`np.nan` is a float and cannot be stored in an integer array), introduce missing values, convert the array to a Pandas DataFrame, assign column names, and store the result in `df_original`.



In [2]:
import numpy as np
import pandas as pd

# Generate a NumPy array of random integers and change its data type to float
data = np.random.randint(1, 21, size=(100, 5)).astype(float)

# Randomly introduce missing values (approximately 10%)
missing_percentage = 0.1
num_missing = int(np.prod(data.shape) * missing_percentage)
missing_indices = np.random.choice(data.size, num_missing, replace=False)
data.ravel()[missing_indices] = np.nan

# Convert to a Pandas DataFrame
df_original = pd.DataFrame(data)

# Assign meaningful column names
df_original.columns = [f'Feature_{i+1}' for i in range(df_original.shape[1])]

# Display the first few rows to verify
display(df_original.head())

Unnamed: 0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5
0,17.0,8.0,19.0,12.0,9.0
1,19.0,11.0,5.0,11.0,4.0
2,18.0,15.0,14.0,16.0,5.0
3,15.0,12.0,11.0,18.0,7.0
4,9.0,8.0,3.0,16.0,14.0
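The first five rows shown above contain no NaN, so the head display alone cannot confirm that missing values were introduced. A minimal sketch of a per-column check follows. It assumes a seeded generator (`rng` with seed 0) and a separate frame named `df_check`; both are illustrative and not part of the original notebook, which is unseeded.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed for repeatability; the notebook itself does not set one.
rng = np.random.default_rng(0)

# Same shape and value range as the notebook's dataset.
data = rng.integers(1, 21, size=(100, 5)).astype(float)

# Blank out 10% of the cells, chosen without replacement.
num_missing = int(data.size * 0.1)
missing_indices = rng.choice(data.size, num_missing, replace=False)
data.flat[missing_indices] = np.nan

df_check = pd.DataFrame(
    data, columns=[f"Feature_{i+1}" for i in range(data.shape[1])]
)

# Count NaNs per column and in total.
missing_per_column = df_check.isnull().sum()
print(missing_per_column)
print("Total missing:", int(missing_per_column.sum()))
```

Running `df_original.isnull().sum()` on the notebook's own frame gives the same kind of per-column count and shows which columns actually need imputation.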


## Handle missing values using mean imputation

### Subtask:
Impute the missing values in one column using the mean of the non-missing values in that column.


**Reasoning**:
Create a copy of the original DataFrame, select a column with missing values, calculate its mean, and impute the missing values in that column with the calculated mean.



In [3]:
# Create a copy of the original DataFrame
df_mean_imputed = df_original.copy()

# Select a column to impute. The first five rows show no NaNs, so use
# df_original.isnull().sum() to confirm that 'Feature_2' has missing values.
column_to_impute = 'Feature_2'

# Calculate the mean of the non-missing values in the selected column
mean_value = df_mean_imputed[column_to_impute].mean()

# Fill the missing values (NaN) in the selected column with the calculated mean
df_mean_imputed[column_to_impute] = df_mean_imputed[column_to_impute].fillna(mean_value)

# Display the first few rows to verify the imputation
display(df_mean_imputed.head())

# Display the number of missing values in the imputed column to confirm
print(f"Number of missing values in '{column_to_impute}' after mean imputation: {df_mean_imputed[column_to_impute].isnull().sum()}")

Unnamed: 0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5
0,17.0,8.0,19.0,12.0,9.0
1,19.0,11.0,5.0,11.0,4.0
2,18.0,15.0,14.0,16.0,5.0
3,15.0,12.0,11.0,18.0,7.0
4,9.0,8.0,3.0,16.0,14.0


Number of missing values in 'Feature_2' after mean imputation: 0
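To make the effect of mean imputation concrete, here is a small worked example on a hypothetical six-value Series (the values are made up for illustration). It shows that `mean()` skips NaNs, that the column mean is unchanged after filling, and that the spread shrinks because every filled value sits exactly at the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries.
s = pd.Series([2.0, 4.0, np.nan, 6.0, np.nan, 8.0])

# mean() ignores NaNs by default: (2 + 4 + 6 + 8) / 4 = 5.0
mean_value = s.mean()
filled = s.fillna(mean_value)

print(filled.tolist())   # [2.0, 4.0, 5.0, 6.0, 5.0, 8.0]
print(filled.mean())     # 5.0, same as before imputation

# The standard deviation drops because the filled values add no spread.
print(round(s.std(), 3), round(filled.std(), 3))
```

The reduced standard deviation is the usual trade-off of mean imputation: it preserves the average but understates variability.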


## Handle missing values using median imputation

### Subtask:
Impute the missing values in another column using the median of the non-missing values in that column.


**Reasoning**:
Create a copy of the original DataFrame, select a column with missing values, calculate the median of that column, fill the missing values with the median, display the head of the updated DataFrame, and print the number of missing values in the imputed column.



In [4]:
# Create a copy of the original DataFrame
df_median_imputed = df_original.copy()

# Select a different column for median imputation ('Feature_3')
column_to_impute_median = 'Feature_3'

# Calculate the median of the non-missing values in the selected column
median_value = df_median_imputed[column_to_impute_median].median()

# Fill the missing values (NaN) in the selected column with the calculated median
df_median_imputed[column_to_impute_median] = df_median_imputed[column_to_impute_median].fillna(median_value)

# Display the first few rows to verify the imputation
display(df_median_imputed.head())

# Print the number of missing values in the imputed column to confirm
print(f"Number of missing values in '{column_to_impute_median}' after median imputation: {df_median_imputed[column_to_impute_median].isnull().sum()}")

Unnamed: 0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5
0,17.0,8.0,19.0,12.0,9.0
1,19.0,11.0,5.0,11.0,4.0
2,18.0,15.0,14.0,16.0,5.0
3,15.0,12.0,11.0,18.0,7.0
4,9.0,8.0,3.0,16.0,14.0


Number of missing values in 'Feature_3' after median imputation: 0
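The median is usually preferred over the mean when a column contains outliers. A short sketch on a hypothetical column with one extreme value illustrates the difference; the numbers are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical column: four ordinary values, one outlier, one missing entry.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, np.nan])

mean_value = s.mean()      # (1 + 2 + 3 + 4 + 100) / 5 = 22.0
median_value = s.median()  # middle of [1, 2, 3, 4, 100] = 3.0

filled_mean = s.fillna(mean_value)
filled_median = s.fillna(median_value)

print("Mean fill:  ", filled_mean.iloc[-1])    # 22.0
print("Median fill:", filled_median.iloc[-1])  # 3.0
```

The mean fill lands far from every typical value, while the median fill stays inside the bulk of the data.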


## Handle missing values using mode imputation

### Subtask:
Impute the missing values in another column using the mode of the non-missing values in that column.


**Reasoning**:
The task is to impute missing values in a column using the mode. This involves creating a copy, selecting a column with NaNs, calculating the mode, filling NaNs with the mode, and verifying the result. These steps can be done in a single code block.



In [5]:
# 1. Create a copy of the original DataFrame
df_mode_imputed = df_original.copy()

# 2. Select a column for mode imputation that was not used for mean or median imputation.
# The first five rows show no NaNs, so df_original.isnull().sum() is the reliable check
# that 'Feature_5' contains missing values.
column_to_impute_mode = 'Feature_5'

# 3. Calculate the mode of the non-missing values in the selected column.
# The mode() method returns a Series, even if there's only one mode.
# We take the first value [0] in case there are multiple modes.
mode_value = df_mode_imputed[column_to_impute_mode].mode()[0]

# 4. Fill the missing values (NaN) in the selected column of df_mode_imputed with the calculated mode.
df_mode_imputed[column_to_impute_mode] = df_mode_imputed[column_to_impute_mode].fillna(mode_value)

# 5. Display the first few rows of df_mode_imputed to visually verify the imputation.
display(df_mode_imputed.head())

# 6. Print the number of missing values in the imputed column of df_mode_imputed to confirm that the imputation was successful.
print(f"Number of missing values in '{column_to_impute_mode}' after mode imputation: {df_mode_imputed[column_to_impute_mode].isnull().sum()}")

Unnamed: 0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5
0,17.0,8.0,19.0,12.0,9.0
1,19.0,11.0,5.0,11.0,4.0
2,18.0,15.0,14.0,16.0,5.0
3,15.0,12.0,11.0,18.0,7.0
4,9.0,8.0,3.0,16.0,14.0


Number of missing values in 'Feature_5' after mode imputation: 0
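When several values tie for most frequent, `mode()` returns all of them in sorted order, so `[0]` picks the smallest. The sketch below demonstrates this on a hypothetical column with a two-way tie, and shows that scikit-learn's `SimpleImputer(strategy="most_frequent")` resolves the tie the same way. The toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column where 1.0 and 3.0 each appear twice.
s = pd.Series([3.0, 1.0, 3.0, 1.0, np.nan])

modes = s.mode()          # both tied values, sorted ascending
mode_value = modes[0]     # the smallest of the tied modes
print(modes.tolist(), mode_value)

# SimpleImputer expects a 2D array, hence the reshape to one column.
imputer = SimpleImputer(strategy="most_frequent")
imputed = imputer.fit_transform(s.to_numpy().reshape(-1, 1))
print(imputed.ravel().tolist())
```

With random integers between 1 and 20 and only about 90 observed values per column, ties like this are plausible, so the choice of `[0]` does influence the result.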


## Handle missing values using k-nearest neighbors (knn) imputation

### Subtask:
Impute the missing values in another column using the KNN imputer.


**Reasoning**:
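Create a copy of the original DataFrame, fit a `KNNImputer` with `n_neighbors=5`, replace the selected column with the imputed values, and confirm that no missing values remain in that column.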



In [6]:
# 1. Create a copy of the original DataFrame and name it df_knn_imputed.
df_knn_imputed = df_original.copy()

# 2. Select a column for KNN imputation that was not used by an earlier technique.
# 'Feature_1' and 'Feature_4' are still unused; pick 'Feature_1'.
column_to_impute_knn = 'Feature_1'

# 3. Import the KNNImputer from sklearn.impute.
from sklearn.impute import KNNImputer

# 4. Instantiate a KNNImputer object.
knn_imputer = KNNImputer(n_neighbors=5)

# 5. Use every feature column as input. KNN measures distances on the other
# features of a row; given a single column it has nothing to compare and
# effectively falls back to that column's mean.
data_to_impute = df_knn_imputed

# 6. Fit the imputer on the full DataFrame and transform it (returns a NumPy array).
imputed_values = knn_imputer.fit_transform(data_to_impute)

# 7. Replace only the selected column in df_knn_imputed with its imputed values.
column_index = df_knn_imputed.columns.get_loc(column_to_impute_knn)
df_knn_imputed[column_to_impute_knn] = imputed_values[:, column_index]

# 8. Display the first few rows of df_knn_imputed to visually verify the imputation.
display(df_knn_imputed.head())

# 9. Print the number of missing values in the imputed column of df_knn_imputed to confirm.
print(f"Number of missing values in '{column_to_impute_knn}' after KNN imputation: {df_knn_imputed[column_to_impute_knn].isnull().sum()}")

Unnamed: 0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5
0,17.0,8.0,19.0,12.0,9.0
1,19.0,11.0,5.0,11.0,4.0
2,18.0,15.0,14.0,16.0,5.0
3,15.0,12.0,11.0,18.0,7.0
4,9.0,8.0,3.0,16.0,14.0


Number of missing values in 'Feature_1' after KNN imputation: 0
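KNN imputation finds neighbors by comparing the features a row does have, so what the imputer is given as input matters. The following sketch uses a hypothetical two-column array (values chosen so the arithmetic is easy to follow) and contrasts fitting on both columns with fitting on the target column alone.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: column 1 is always 10x column 0; the last row lacks column 0.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [8.0, 80.0],
    [9.0, 90.0],
    [np.nan, 25.0],
])

# Both columns: the two nearest rows by column 1 are (2, 20) and (3, 30),
# so the missing value becomes the average of 2 and 3.
both_columns = KNNImputer(n_neighbors=2).fit_transform(X)
print(both_columns[-1, 0])

# Target column only: no other feature is available to measure distance,
# so the imputer uses the mean of the observed values (1, 2, 3, 8, 9).
single_column = KNNImputer(n_neighbors=2).fit_transform(X[:, [0]])
print(single_column[-1, 0])
```

The two results differ, which shows that a KNN imputer fitted on one column alone behaves like mean imputation rather than a neighbor-based estimate.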


## Handle missing values using iterative imputation

### Subtask:
Impute the missing values in another column using the Iterative imputer.


**Reasoning**:
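Create a copy of the original DataFrame, select the column for imputation, enable and import `IterativeImputer` (an experimental scikit-learn feature that requires the explicit `sklearn.experimental.enable_iterative_imputer` import), instantiate it, fit and transform the data, write the imputed values back to the selected column, display the head, and print the missing value count.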



In [8]:
# Import the necessary module to enable IterativeImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# 1. Create a copy of the original DataFrame and name it df_iterative_imputed.
df_iterative_imputed = df_original.copy()

# 2. Select the column for iterative imputation.
# 'Feature_4' is the only column not used by an earlier technique.
column_to_impute_iterative = 'Feature_4'

# 3. Instantiate an IterativeImputer object.
iterative_imputer = IterativeImputer()

# 4. Use every feature column as input. IterativeImputer models each feature
# from the other features, so a single column gives it nothing to regress on.
data_to_impute_iterative = df_iterative_imputed

# 5. Fit the imputer on the full DataFrame and transform it (returns a NumPy array).
imputed_values_iterative = iterative_imputer.fit_transform(data_to_impute_iterative)

# 6. Replace only the selected column in df_iterative_imputed with its imputed values.
column_index = df_iterative_imputed.columns.get_loc(column_to_impute_iterative)
df_iterative_imputed[column_to_impute_iterative] = imputed_values_iterative[:, column_index]

# 7. Display the first few rows of df_iterative_imputed to visually verify the imputation.
display(df_iterative_imputed.head())

# 8. Print the number of missing values in the imputed column to confirm that the imputation was successful.
print(f"Number of missing values in '{column_to_impute_iterative}' after iterative imputation: {df_iterative_imputed[column_to_impute_iterative].isnull().sum()}")

Unnamed: 0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5
0,17.0,8.0,19.0,12.0,9.0
1,19.0,11.0,5.0,11.0,4.0
2,18.0,15.0,14.0,16.0,5.0
3,15.0,12.0,11.0,18.0,7.0
4,9.0,8.0,3.0,16.0,14.0


Number of missing values in 'Feature_4' after iterative imputation: 0
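Iterative imputation predicts each feature from the other features, so it pays off when columns are related. The sketch below uses a hypothetical two-column array in which the second column is exactly twice the first, hides one value whose true answer is 10.0, and compares the imputed value with the plain column mean. The data and `random_state=0` are assumptions for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Hypothetical data: second column is 2x the first.
x = np.arange(1.0, 11.0)
X = np.column_stack([x, 2.0 * x])

# Hide one value; its true value is 2 * 5 = 10.0.
X[4, 1] = np.nan

imputed = IterativeImputer(random_state=0).fit_transform(X)
column_mean = np.nanmean(X[:, 1])  # what mean imputation would have used

print("Iterative estimate:", imputed[4, 1])
print("Column mean:       ", column_mean)
```

Because the imputer can learn the relationship between the two columns, its estimate should land much closer to the hidden value than the column mean does. On independently generated random columns, such as the ones in this notebook, no such relationship exists to exploit.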


## Display the results

### Subtask:
Show the original dataset and the datasets after applying each imputation technique to compare the results.


**Reasoning**:
Display the head of the original and imputed dataframes to compare the results of different imputation techniques.



In [9]:
print("Original DataFrame:")
display(df_original.head())

print("\nDataFrame after Mean Imputation (Feature_2):")
display(df_mean_imputed.head())

print("\nDataFrame after Median Imputation (Feature_3):")
display(df_median_imputed.head())

print("\nDataFrame after Mode Imputation (Feature_5):")
display(df_mode_imputed.head())

print("\nDataFrame after KNN Imputation (Feature_1):")
display(df_knn_imputed.head())

print("\nDataFrame after Iterative Imputation (Feature_4):")
display(df_iterative_imputed.head())

Original DataFrame:


Unnamed: 0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5
0,17.0,8.0,19.0,12.0,9.0
1,19.0,11.0,5.0,11.0,4.0
2,18.0,15.0,14.0,16.0,5.0
3,15.0,12.0,11.0,18.0,7.0
4,9.0,8.0,3.0,16.0,14.0



DataFrame after Mean Imputation (Feature_2):


Unnamed: 0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5
0,17.0,8.0,19.0,12.0,9.0
1,19.0,11.0,5.0,11.0,4.0
2,18.0,15.0,14.0,16.0,5.0
3,15.0,12.0,11.0,18.0,7.0
4,9.0,8.0,3.0,16.0,14.0



DataFrame after Median Imputation (Feature_3):


Unnamed: 0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5
0,17.0,8.0,19.0,12.0,9.0
1,19.0,11.0,5.0,11.0,4.0
2,18.0,15.0,14.0,16.0,5.0
3,15.0,12.0,11.0,18.0,7.0
4,9.0,8.0,3.0,16.0,14.0



DataFrame after Mode Imputation (Feature_5):


Unnamed: 0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5
0,17.0,8.0,19.0,12.0,9.0
1,19.0,11.0,5.0,11.0,4.0
2,18.0,15.0,14.0,16.0,5.0
3,15.0,12.0,11.0,18.0,7.0
4,9.0,8.0,3.0,16.0,14.0



DataFrame after KNN Imputation (Feature_1):


Unnamed: 0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5
0,17.0,8.0,19.0,12.0,9.0
1,19.0,11.0,5.0,11.0,4.0
2,18.0,15.0,14.0,16.0,5.0
3,15.0,12.0,11.0,18.0,7.0
4,9.0,8.0,3.0,16.0,14.0



DataFrame after Iterative Imputation (Feature_4):


Unnamed: 0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5
0,17.0,8.0,19.0,12.0,9.0
1,19.0,11.0,5.0,11.0,4.0
2,18.0,15.0,14.0,16.0,5.0
3,15.0,12.0,11.0,18.0,7.0
4,9.0,8.0,3.0,16.0,14.0
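All six tables above look identical because none of the first five rows contained a missing value. To see what an imputation actually changed, it helps to display only the rows that were NaN beforehand. A minimal sketch on a hypothetical single-column frame (`df_before` and `df_after` are illustrative names, not variables from the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
df_before = pd.DataFrame({"Feature_2": [8.0, np.nan, 15.0, 12.0, np.nan, 10.0]})

# Mean imputation, as in the mean section: (8 + 15 + 12 + 10) / 4 = 11.25
df_after = df_before.copy()
df_after["Feature_2"] = df_after["Feature_2"].fillna(df_after["Feature_2"].mean())

# Boolean mask of the rows that were missing before imputation.
was_missing = df_before["Feature_2"].isnull()

# Side-by-side view restricted to the imputed rows.
comparison = pd.DataFrame({
    "before": df_before.loc[was_missing, "Feature_2"],
    "after": df_after.loc[was_missing, "Feature_2"],
})
print(comparison)
```

The same mask approach applies to the notebook's frames, for example `df_mean_imputed.loc[df_original['Feature_2'].isnull(), 'Feature_2']`.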


## Summary:

### Data Analysis Key Findings

*   A random dataset of 100 rows and 5 columns was generated, and 10% of its cells (50 of 500) were set to NaN at randomly chosen positions.
*   Mean imputation was applied to 'Feature\_2', successfully filling all missing values in that column with the mean of the non-missing values.
*   Median imputation was applied to 'Feature\_3', successfully filling all missing values in that column with the median of the non-missing values.
*   Mode imputation was applied to 'Feature\_5', successfully filling all missing values in that column with the mode of the non-missing values.
*   KNN imputation (with `n_neighbors=5`) was applied to 'Feature\_1', successfully filling all missing values in that column.
*   Iterative imputation was applied to 'Feature\_4', successfully filling all missing values in that column.
*   The final step successfully displayed the head of the original DataFrame and each of the DataFrames after applying the different imputation techniques to demonstrate the results.

### Insights or Next Steps

*   Different imputation methods yield different imputed values, which can impact subsequent analysis. The choice of method should depend on the data distribution and the nature of the missingness.
*   For a more complete analysis, it would be beneficial to evaluate the performance of each imputation method, perhaps by comparing the distribution of the imputed data to the original data or by training a model on the different imputed datasets and comparing their performance metrics.
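One way to carry out the evaluation suggested above is to hide values whose true answer is known, impute them, and measure the error. The sketch below is an assumption-laden illustration rather than part of the original notebook: it builds hypothetical correlated data (the notebook's columns are drawn independently, which gives neighbor-based and model-based imputers nothing to learn from), masks roughly 10% of the cells, and reports the RMSE of each imputer on the masked cells. Mode imputation is left out because the synthetic values are continuous.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Assumption: seeded generator and synthetic data, for a repeatable illustration.
rng = np.random.default_rng(0)

# Four columns that share a common signal plus a little independent noise.
base = rng.normal(size=(200, 1))
X_true = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(4)])

# Hide roughly 10% of the cells; the true values stay available in X_true.
mask = rng.random(X_true.shape) < 0.1
X_missing = X_true.copy()
X_missing[mask] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(random_state=0),
}

# RMSE between imputed and true values, computed only on the hidden cells.
rmse = {}
for name, imputer in imputers.items():
    X_filled = imputer.fit_transform(X_missing)
    rmse[name] = float(np.sqrt(np.mean((X_filled[mask] - X_true[mask]) ** 2)))

for name, value in rmse.items():
    print(f"{name}: RMSE = {value:.3f}")
```

A lower RMSE means the imputed values are closer to the hidden truth. The ranking depends on the data, so the comparison is worth repeating on the dataset of interest.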
