### 🛠️ Handling Missing Values

Handling missing values is a crucial step in data preprocessing, especially for real-world data such as the Titanic dataset. The sections below walk through the common methods for handling them, each with Python code examples.

1. **🚮 Removing Missing Values**  
   - **Dropping Rows with Missing Values:** If only a small proportion of rows contain missing values, you can drop those rows.  
   - **Dropping Columns with Missing Values:** If a column is missing a large share of its values, consider dropping the whole column.

In [1]:
import pandas as pd

# Load the Titanic dataset
titanic_df = pd.read_csv('Titanic-Dataset.csv')

# Drop rows with any missing values
titanic_df_dropped_rows = titanic_df.dropna()

# Drop columns with any missing values
titanic_df_dropped_columns = titanic_df.dropna(axis=1)
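
Before dropping anything, it helps to check how much data each option would discard. The sketch below uses a small made-up DataFrame (the `toy` frame and its values are illustrative assumptions, not rows from the Titanic file) to compare dropping rows with any missing value against requiring only one column to be present via `subset`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the effect of each call
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Fraction of missing values in each column
missing_share = toy.isna().mean()

# Dropping every row that has any missing value
rows_all = toy.dropna()

# Dropping only the rows where 'Age' is missing
rows_age = toy.dropna(subset=["Age"])

print(missing_share)
print(len(toy), len(rows_all), len(rows_age))
```

In this toy frame every row is missing at least one value, so a plain `dropna()` leaves nothing, while restricting the check to `Age` keeps three of the four rows.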

In [12]:
# Count the missing values in each column
titanic_df.isnull().sum()

### Univariate Imputation 🔍
2. **🔄 Imputation**  
   - **📊 Mean/Median/Mode Imputation:** Replace missing values with the mean, median, or mode of the respective column.
   - **⏩ Forward/Backward Fill:** Replace missing values with the previous (forward fill) or next (backward fill) value in the column.
   - **🔢 Constant Value Imputation:** Replace missing values with a specific constant value, such as 0 or -1.
   - **🔍 K-Nearest Neighbors (KNN) Imputation:** A multivariate method that imputes each missing value from the K most similar rows, where similarity is measured on the other features. It is covered under Multivariate Imputation.

In [22]:
from sklearn.impute import SimpleImputer
import pandas as pd

titanic_df = pd.read_csv('Titanic-Dataset.csv')

# Keep the raw 'Age' column so each strategy starts from the same missing values
age_raw = titanic_df[['Age']].copy()

# Mean Imputation (stored back in the DataFrame)
mean_imputer = SimpleImputer(strategy='mean')
titanic_df['Age'] = mean_imputer.fit_transform(age_raw).ravel()

# Median Imputation (an alternative to the mean, computed from the raw column)
median_imputer = SimpleImputer(strategy='median')
age_median = median_imputer.fit_transform(age_raw).ravel()

# Mode Imputation
mode_imputer = SimpleImputer(strategy='most_frequent')
titanic_df['Embarked'] = mode_imputer.fit_transform(titanic_df[['Embarked']]).ravel()

# Constant Value Imputation
constant_imputer = SimpleImputer(strategy='constant', fill_value=-1) # <-- Fill value with -1
titanic_df['Cabin'] = constant_imputer.fit_transform(titanic_df[['Cabin']]).ravel()

# Forward Fill (from the raw column; a missing first row stays NaN)
age_ffill = age_raw['Age'].ffill()

# Backward Fill (from the raw column; a missing last row stays NaN)
age_bfill = age_raw['Age'].bfill()

In [23]:
titanic_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.000000,1,0,A/5 21171,7.2500,-1,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.000000,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.000000,0,0,STON/O2. 3101282,7.9250,-1,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.000000,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.000000,0,0,373450,8.0500,-1,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.000000,0,0,211536,13.0000,-1,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.000000,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,29.699118,1,2,W./C. 6607,23.4500,-1,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.000000,0,0,111369,30.0000,C148,C
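
To make the difference between these strategies easier to see, here is a minimal sketch on a five-value Series. The numbers are made up for illustration, and it uses plain pandas `fillna`, `ffill`, and `bfill` rather than `SimpleImputer`, so every strategy starts from the same gaps and the results sit side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical values with two gaps
s = pd.Series([10.0, np.nan, 30.0, np.nan, 80.0])

filled = pd.DataFrame({
    "original": s,
    "mean": s.fillna(s.mean()),      # mean of 10, 30, 80
    "median": s.fillna(s.median()),  # median of 10, 30, 80
    "ffill": s.ffill(),              # copy the previous value forward
    "bfill": s.bfill(),              # copy the next value backward
    "constant": s.fillna(-1),        # sentinel value
})

print(filled)
```

The mean (40) is pulled upward by the large value 80, while the median (30) is not, which is why the median is often preferred for skewed columns.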


### Multivariate Imputation 🔍

In [None]:
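# KNN Imputation (requires sklearn.impute.KNNImputer)
from sklearn.impute import KNNImputer

# 'Age' in titanic_df was already mean-imputed above, so start again from the raw file
knn_df = pd.read_csv('Titanic-Dataset.csv')

knn_imputer = KNNImputer(n_neighbors=5)
knn_df[['Age', 'Fare']] = knn_imputer.fit_transform(knn_df[['Age', 'Fare']])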

- KNN imputation fills each missing value using the K rows most similar to the row that contains it. Because it draws on the other features, it is particularly useful when missingness depends on those features (the values are not missing completely at random) and you want to preserve the relationships between variables.

### How the KNN Imputer Works 🔍
- **📏 Calculate Distances:** For each row with missing values, the algorithm finds the k nearest neighbors using a distance metric (such as Euclidean distance) computed only over the features that are present in both rows.
- **🔄 Impute Missing Values:** Each missing value is then replaced with the average of that feature across the nearest neighbors. scikit-learn's KNNImputer uses a uniform or distance-weighted mean; other implementations may use the median or, for categorical features, the mode.
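
The two steps above can be traced on a tiny array. This is a minimal sketch that assumes scikit-learn's `KNNImputer` with `n_neighbors=2` and uniform weights; the four-row array is illustrative, not Titanic data. `nan_euclidean_distances` is the scikit-learn helper for the missing-value-aware distance described in the first step.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Illustrative array with one gap in the first column and one in the last
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Step 1: pairwise distances that skip coordinates missing in either row
distances = nan_euclidean_distances(X, X)

# Step 2: replace each gap with the mean of that feature over the
# 2 nearest rows that have the feature present
imputed = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)

print(np.round(distances, 2))
print(imputed)
```

For the third row, the two nearest rows are the second and fourth, so its missing first value becomes the mean of 3 and 8, which is 5.5.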

3. **🚩 Using an Indicator for Missing Values**  
   - **Creating a new feature that indicates the presence of missing values:** Sometimes the fact that a value is missing is itself informative. You can create a binary indicator (0 or 1) for missingness and use it as a feature in your model. Build the indicator from the data before imputation; once the gaps are filled, there is nothing left to flag.

In [24]:
# Create an indicator column for missing 'Age'.
# 'Age' in titanic_df was already imputed above, so read the missingness from the raw file.
raw_age = pd.read_csv('Titanic-Dataset.csv')['Age']
titanic_df['Age_missing'] = raw_age.isna().astype(int)
titanic_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_missing
0,1,0,3,"Braund, Mr. Owen Harris",male,22.000000,1,0,A/5 21171,7.2500,-1,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.000000,1,0,PC 17599,71.2833,C85,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.000000,0,0,STON/O2. 3101282,7.9250,-1,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.000000,1,0,113803,53.1000,C123,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.000000,0,0,373450,8.0500,-1,S,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.000000,0,0,211536,13.0000,-1,S,0
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.000000,0,0,112053,30.0000,B42,S,0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,29.699118,1,2,W./C. 6607,23.4500,-1,S,1
889,890,1,1,"Behr, Mr. Karl Howell",male,26.000000,0,0,111369,30.0000,C148,C,0
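
If you impute with scikit-learn anyway, `SimpleImputer` can produce the indicator in the same step through its `add_indicator=True` option. The sketch below uses a made-up three-row column to show the shape of the result; treating it as the right choice for a given model is an assumption to verify on your own data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column input with one missing value
ages = np.array([[22.0], [np.nan], [26.0]])

# add_indicator=True appends one 0/1 column per feature that had
# missing values during fit
imputer = SimpleImputer(strategy="mean", add_indicator=True)
result = imputer.fit_transform(ages)

# Column 0 holds the imputed ages, column 1 the missingness flag
print(result)
```

Because the imputer sees the data before filling it, the flag cannot be lost by running the steps in the wrong order.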


In [27]:
# Reload the raw dataset so the next examples start from the original, unimputed data
titanic_df = pd.read_csv('Titanic-Dataset.csv')

4. **🚫 Dropping Features**  
   - **Dropping features with high percentages of missing values:** If a feature has too many missing values, it might be better to drop it altogether.

In [26]:
# Drop columns where more than 50% of the data is missing
threshold = 0.5
missing_share = titanic_df.isna().mean()  # fraction of missing values per column
titanic_df = titanic_df.loc[:, missing_share <= threshold]
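
To check what a threshold actually removes, it can help to run it on a frame small enough to verify by eye. The sketch below uses a made-up `toy` frame and shows two equivalent ways to express the rule: a mask on the per-column missing fraction, and `dropna` with `thresh`, which counts the minimum number of non-missing values a column needs in order to be kept. Rounding that count up with `math.ceil` is a choice made here so that both forms agree.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],           # nothing missing
    "b": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "c": [np.nan, np.nan, 1.0, 2.0],     # exactly 50% missing
})
threshold = 0.5

# Form 1: keep columns whose missing fraction does not exceed the threshold
missing_share = toy.isna().mean()
kept_mask = toy.loc[:, missing_share <= threshold]

# Form 2: the same rule as a minimum count of non-missing values
min_non_missing = math.ceil((1 - threshold) * len(toy))
kept_thresh = toy.dropna(thresh=min_non_missing, axis=1)

print(list(kept_mask.columns), list(kept_thresh.columns))
```

Column `c` sits exactly at the threshold and is kept, since only columns with more than 50% missing are dropped.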

5. **🔄 Interpolation**  
   - **Interpolate missing values:** Interpolation estimates each missing value from its neighboring values, by default along a straight line between them. It is most useful for ordered data such as time series. The Titanic rows have no meaningful order, so the example here only demonstrates the syntax.

In [28]:
# Linear interpolation: fill each gap from the nearest non-missing values before and after it
titanic_df['Age'] = titanic_df['Age'].interpolate()
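
Since interpolation is mainly meant for ordered data, here is a minimal sketch on a short time-indexed Series. The dates and readings are made up for illustration. It contrasts the default linear method, which treats observations as equally spaced, with `method="time"`, which accounts for the actual gaps between timestamps.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with an uneven gap: Jan 3 has no row at all
readings = pd.Series(
    [1.0, np.nan, 4.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"]),
)

# Default linear interpolation ignores the index and treats rows as evenly spaced
linear = readings.interpolate()

# Time-based interpolation weights by the elapsed time between timestamps
by_time = readings.interpolate(method="time")

print(linear)
print(by_time)
```

The missing reading falls one day into a three-day span, so the time-based estimate is 2.0, while the position-based estimate is the midpoint 2.5.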