### **Handling missing values in the dataset**

| **Method**                 | **When to Use** |
|----------------------------|----------------|
| **Drop Rows** | When missing values are few and random. |
| **Drop Columns** | When a column has too many missing values (>50%). |
| **Zero Imputation** | When the missing value is expected to be zero or when it is not informative. |
| **Mean Imputation** | When the data is roughly symmetric (e.g., normally distributed) and missing values are random. |
| **Median Imputation** | When the distribution of the data is skewed. |
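
To apply the table in practice, it helps to first measure how much data is missing and how skewed each column is. The snippet below is a minimal sketch of that check; the DataFrame `df_demo` and its column names `income` and `age` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: 'income' has one large outlier, 'age' is symmetric
df_demo = pd.DataFrame({
    'income': [30, 32, 35, None, 31, 200],
    'age': [25, None, 35, 45, 55, 65],
})

# Fraction of missing values per column (between 0.0 and 1.0)
missing_frac = df_demo.isna().mean()
print(missing_frac)

# Sample skewness per column; NaN values are skipped by default
skewness = df_demo.skew()
print(skewness)
```

A skewness close to zero suggests that mean imputation is reasonable, while a large positive or negative value points towards the median. The missing fraction helps decide between dropping and imputing.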

### **Examples**

In [None]:
# Set up: import pandas and build the example data
import pandas as pd

# Creating a simple dataset with NaN values for demonstration
data = {'A': [1, 2, 5, 4], 'B':[None, 2, 3, None]}
df = pd.DataFrame(data)
print(df)

   A    B
0  1  NaN
1  2  2.0
2  5  3.0
3  4  NaN


In [None]:
# Drop rows with NaN values
df_new = df.dropna()
print("\nDataFrame after dropping NaN values:")
print(df_new)


DataFrame after dropping NaN values:
   A    B
1  2  2.0
2  5  3.0


In [None]:
# Drop columns with NaN values
df_new = df.dropna(axis=1)
print("\nDataFrame after dropping columns with NaN values:")
print(df_new)


DataFrame after dropping columns with NaN values:
   A
0  1
1  2
2  5
3  4
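
Note that `dropna(axis=1)` removes every column that contains at least one NaN. To follow the ">50% missing" rule from the table instead, one option is to filter columns by their missing fraction. This is a sketch under assumptions: `df_demo`, `max_missing`, and the extra column `C` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical data: column 'B' is 25% missing, column 'C' is 75% missing
df_demo = pd.DataFrame({
    'A': [1, 2, 5, 4],
    'B': [None, 2, 3, 1],
    'C': [None, None, None, 7],
})

# Keep only columns whose fraction of missing values is at most 50%
max_missing = 0.5
keep = df_demo.columns[df_demo.isna().mean() <= max_missing]
df_reduced = df_demo[keep]
print(df_reduced)
```

Here column `C` is dropped, while `B` is kept and can still be imputed afterwards.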


In [None]:
# Imputation methods
# 1. Zero Imputation
# fillna returns a new DataFrame; the original df is left unchanged
df_filled1 = df.fillna(0)
print("\nDataFrame after zero imputation:")
print(df_filled1)


DataFrame after zero imputation:
   A    B
0  1  0.0
1  2  2.0
2  5  3.0
3  4  0.0


In [None]:
# 2. Mean Imputation (method 1)
# df.mean() gives one mean per column; each NaN is filled with its column's mean
df_filled2 = df.fillna(df.mean())
print("\nDataFrame after mean imputation:")
print(df_filled2)


DataFrame after mean imputation:
   A    B
0  1  2.5
1  2  2.0
2  5  3.0
3  4  2.5


In [None]:
# 2. Mean Imputation (method 2: using SimpleImputer)
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
# fit_transform returns a NumPy array, so wrap it back into a DataFrame
df_filled3 = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
print("\nDataFrame after mean imputation using SimpleImputer:")
print(df_filled3)


DataFrame after mean imputation using SimpleImputer:
     A    B
0  1.0  2.5
1  2.0  2.0
2  5.0  3.0
3  4.0  2.5


In [None]:
# 3. Median Imputation (method 1)
# df.median() gives one median per column; each NaN is filled with its column's median
df_filled4 = df.fillna(df.median())
print("\nDataFrame after median imputation:")
print(df_filled4)


DataFrame after median imputation:
   A    B
0  1  2.5
1  2  2.0
2  5  3.0
3  4  2.5


In [None]:
# 3. Median Imputation (method 2: using SimpleImputer)
imp = SimpleImputer(strategy='median')
# Store the result under its own name so it does not overwrite df_filled4
df_filled5 = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
print("\nDataFrame after median imputation using SimpleImputer:")
print(df_filled5)


DataFrame after median imputation using SimpleImputer:
     A    B
0  1.0  2.5
1  2.0  2.0
2  5.0  3.0
3  4.0  2.5
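
The examples above apply a single strategy to the whole DataFrame, but the table suggests that different columns may call for different methods. Since `fillna` also accepts a dict mapping column names to fill values, the strategies can be mixed. The sketch below assumes hypothetical columns `count` (where a missing value means zero) and `price` (skewed).

```python
import pandas as pd

# Hypothetical data: 'count' is missing where nothing was recorded,
# 'price' is skewed by one large value
df_demo = pd.DataFrame({
    'count': [3, None, 1, None],
    'price': [10.0, 12.0, None, 100.0],
})

# Map each column to its own fill value
fill_values = {
    'count': 0,                          # zero imputation
    'price': df_demo['price'].median(),  # median imputation
}
df_mixed = df_demo.fillna(fill_values)
print(df_mixed)
```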


In [None]:
# # Example code snippet for histogram plot comparison
# import matplotlib.pyplot as plt
# import seaborn as sns
# plt.figure(figsize=(10, 5))  # Set the figure size
# plt.subplot(1, 2, 1)         # Create a subplot
# sns.histplot(df)             # Plot the histogram, replace 'df' with the actual data
# plt.title("Label of first subplot goes here")
# plt.subplot(1, 2, 2)         # Creating second plot
# sns.histplot(df_filled4)     # Plot the histogram, replace 'df_filled4' with the actual data
# plt.title("Label of the second subplot goes here") # Set the title of the second subplot
# plt.show()