# **Handling Missing Values**

**1. Dropping Missing Values:**

**Pros:**
- Simple and quick.
- Works well when values are missing completely at random and the dropped rows carry little information.

**Cons:**
- Discards every observed value in a dropped row; in the example below, only one of five rows survives.
- If the data is not missing at random, or a large share of rows is dropped, the analysis can be biased.

In [12]:
df

Unnamed: 0,A,B,C
0,1.0,10.0,100.0
1,2.0,,200.0
2,,,300.0
3,,40.0,
4,5.0,50.0,


In [9]:
df.dropna()

Unnamed: 0,A,B,C
0,1.0,10.0,100.0
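
Dropping every row that has any gap is often too aggressive. As a minimal sketch, assuming the same values as the example DataFrame above, the `subset`, `thresh`, and `axis` parameters of `dropna` give finer control (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

# Same values as the example DataFrame above.
df = pd.DataFrame({
    "A": [1, 2, np.nan, np.nan, 5],
    "B": [10, np.nan, np.nan, 40, 50],
    "C": [100, 200, 300, np.nan, np.nan],
})

# Drop a row only when column A is missing.
rows_with_a = df.dropna(subset=["A"])

# Keep rows that have at least two non-missing values.
rows_min_two = df.dropna(thresh=2)

# Drop columns instead of rows; every column has a gap here, so none remain.
cols_complete = df.dropna(axis=1)

print(rows_with_a)
print(rows_min_two)
print(cols_complete.shape)
```

Which variant is appropriate depends on which columns the later analysis actually needs.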


**2. Imputation using Mean, Median, or Mode:**

**Pros:**
- Easy to implement.
- Maintains the same sample size.

**Cons:**
- Imputing with a central value shrinks the column's variance and can distort its distribution.
- Doesn't consider relationships between features.

In [11]:
df

Unnamed: 0,A,B,C
0,1.0,10.0,100.0
1,2.0,,200.0
2,,,300.0
3,,40.0,
4,5.0,50.0,


In [10]:
df.fillna(df.mean())

Unnamed: 0,A,B,C
0,1.0,10.0,100.0
1,2.0,33.333333,200.0
2,2.666667,33.333333,300.0
3,2.666667,40.0,200.0
4,5.0,50.0,200.0
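
The cell above shows only the mean. The median and mode variants might look like the following sketch, which assumes the same example values; the `colors` Series is a hypothetical categorical column added purely for illustration:

```python
import numpy as np
import pandas as pd

# Same values as the example DataFrame above.
df = pd.DataFrame({
    "A": [1, 2, np.nan, np.nan, 5],
    "B": [10, np.nan, np.nan, 40, 50],
    "C": [100, 200, 300, np.nan, np.nan],
})

# Median imputation: less affected by extreme values than the mean.
df_median = df.fillna(df.median())

# Mode imputation is mostly used for categorical data.
# `colors` is a hypothetical column, not part of the original example.
colors = pd.Series(["red", "blue", None, "red", None])
# mode() can return several values on ties; take the first one.
colors_filled = colors.fillna(colors.mode().iloc[0])

print(df_median)
print(colors_filled)
```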


**3. Forward Fill (ffill) or Backward Fill (bfill):**

**Pros:**
- Maintains the temporal order of data.
- Useful for time series data.

**Cons:**
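- Only meaningful when row order matters; on unordered data the neighboring row is arbitrary.
- A run of consecutive missing values is filled with the same, increasingly stale value, and gaps at the start (ffill) or end (bfill) stay missing.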

In [13]:
df

Unnamed: 0,A,B,C
0,1.0,10.0,100.0
1,2.0,,200.0
2,,,300.0
3,,40.0,
4,5.0,50.0,


In [14]:
df.ffill()

Unnamed: 0,A,B,C
0,1.0,10.0,100.0
1,2.0,10.0,200.0
2,2.0,10.0,300.0
3,2.0,40.0,300.0
4,5.0,50.0,300.0
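
Only the forward fill is shown above. As a sketch under the same example values, a backward fill and a forward fill capped with `limit` might look like this (variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Same values as the example DataFrame above.
df = pd.DataFrame({
    "A": [1, 2, np.nan, np.nan, 5],
    "B": [10, np.nan, np.nan, 40, 50],
    "C": [100, 200, 300, np.nan, np.nan],
})

# Backward fill: each gap takes the next observed value below it.
# Trailing gaps (column C) have nothing below them and stay missing.
df_bfill = df.bfill()

# Forward fill at most one consecutive gap, so long runs are not
# filled with an old value.
df_ffill_limited = df.ffill(limit=1)

print(df_bfill)
print(df_ffill_limited)
```

Capping the fill with `limit` leaves longer gaps visible, so they can be handled by another method.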


**4. Interpolation:**

**Pros:**
- Estimates each missing value from its neighbors in the same column (linearly by default).
- Suitable for time series data and ordered datasets.

**Cons:**
- Sensitive to outliers, since an extreme neighbor pulls the interpolated values toward it.
- Choice of interpolation method can affect results.

In [15]:
df

Unnamed: 0,A,B,C
0,1.0,10.0,100.0
1,2.0,,200.0
2,,,300.0
3,,40.0,
4,5.0,50.0,


In [16]:
df.interpolate()

Unnamed: 0,A,B,C
0,1.0,10.0,100.0
1,2.0,20.0,200.0
2,3.0,30.0,300.0
3,4.0,40.0,300.0
4,5.0,50.0,300.0
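
In the output above, the trailing gaps in column C were filled by repeating the last observed value rather than by interpolating. The sketch below, assuming the same example values, shows how `limit_area="inside"` restricts filling to gaps with observed values on both sides, and how `method="time"` uses the spacing of a datetime index. The `ts` Series and its dates are hypothetical:

```python
import numpy as np
import pandas as pd

# Same values as the example DataFrame above.
df = pd.DataFrame({
    "A": [1, 2, np.nan, np.nan, 5],
    "B": [10, np.nan, np.nan, 40, 50],
    "C": [100, 200, 300, np.nan, np.nan],
})

# Fill only gaps that are surrounded by observed values.
inside_only = df.interpolate(limit_area="inside")

# Hypothetical unevenly spaced time series.
ts = pd.Series(
    [1.0, np.nan, 4.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"]),
)
ts_linear = ts.interpolate()             # treats rows as equally spaced
ts_time = ts.interpolate(method="time")  # weights by elapsed time

print(inside_only)
print(ts_linear)
print(ts_time)
```

The two results for `ts` differ because the missing point lies one day after the first observation and two days before the last.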


**5. Imputation using Machine Learning Algorithms:**

**Pros:**
- Considers relationships between features.
- Can be more accurate than simple imputation methods.

**Cons:**
- Computationally expensive.
- Requires careful model selection and tuning.

In [17]:
df

Unnamed: 0,A,B,C
0,1.0,10.0,100.0
1,2.0,,200.0
2,,,300.0
3,,40.0,
4,5.0,50.0,


In [32]:
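from sklearn.impute import KNNImputer

# Fill each gap with the average of that feature over the 2 nearest rows.
knn_imputer = KNNImputer(n_neighbors=2)
# fit_transform returns a NumPy array, so the column labels are lost.
df_imputed = knn_imputer.fit_transform(df)
df_imputed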

array([[  1. ,  10. , 100. ],
       [  2. ,  30. , 200. ],
       [  1.5,  25. , 300. ],
       [  3. ,  40. , 150. ],
       [  5. ,  50. , 150. ]])
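
Because the imputer returns a bare array, the column labels are gone. One way to restore them, sketched here with the same example values, is to wrap the result in a new DataFrame. The name `df_knn` is illustrative, and the imputed numbers may differ between scikit-learn versions, so none are asserted here:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Same values as the example DataFrame above.
df = pd.DataFrame({
    "A": [1, 2, np.nan, np.nan, 5],
    "B": [10, np.nan, np.nan, 40, 50],
    "C": [100, 200, 300, np.nan, np.nan],
})

knn_imputer = KNNImputer(n_neighbors=2)

# Reattach the original column labels and index to the imputed array.
df_knn = pd.DataFrame(
    knn_imputer.fit_transform(df),
    columns=df.columns,
    index=df.index,
)

print(df_knn)
```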

**6. Using a Placeholder Value:**

**Pros:**
- Simple approach.
- Preserves the original dataset structure.

**Cons:**
- The placeholder value may introduce bias.
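- Unsuitable when the placeholder could be mistaken for a real value or is treated as numeric by a model (for example, -1 shifts a column's mean).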

In [33]:
import pandas as pd
import numpy as np

# Creating a DataFrame with missing values
data = {'A': [1, 2, np.nan, 4, 5], 'B': [10, np.nan, 30, 40, 50]}
df = pd.DataFrame(data)
df

Unnamed: 0,A,B
0,1.0,10.0
1,2.0,
2,,30.0
3,4.0,40.0
4,5.0,50.0


In [34]:
# Using a placeholder value to replace missing values
placeholder_value = -1
df_with_placeholder = df.fillna(placeholder_value)

In [35]:
df_with_placeholder

Unnamed: 0,A,B
0,1.0,10.0
1,2.0,-1.0
2,-1.0,30.0
3,4.0,40.0
4,5.0,50.0
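
A placeholder on its own hides which values were originally missing. A common companion, sketched here under the assumption that the same two-column DataFrame is used, is to record that information in boolean indicator columns before filling. The `_was_missing` suffix is an illustrative naming choice:

```python
import numpy as np
import pandas as pd

# Same values as the DataFrame created above.
df = pd.DataFrame({"A": [1, 2, np.nan, 4, 5], "B": [10, np.nan, 30, 40, 50]})

# One boolean column per feature marking where values were missing.
indicators = df.isna().add_suffix("_was_missing")

# Fill with the placeholder and keep the indicators alongside.
df_flagged = pd.concat([df.fillna(-1), indicators], axis=1)

print(df_flagged)
```

With the indicators in place, a later step can tell a filled -1 apart from an observed one.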


**7. Multiple Imputation:**

**Pros:**
- Accounts for uncertainty in imputed values.
- Yields more realistic uncertainty estimates than a single imputation.

**Cons:**
- Complexity and increased computational cost.
- Requires running the analysis on each imputed dataset and pooling the results.

In [36]:
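# IterativeImputer is experimental; this import must come first to enable it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
import numpy as np
import pandas as pd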

# Create a DataFrame with missing values
data = {'A': [1, 2, np.nan, 4, 5], 'B': [10, np.nan, 30, 40, 50]}
df = pd.DataFrame(data)
df


Unnamed: 0,A,B
0,1.0,10.0
1,2.0,
2,,30.0
3,4.0,40.0
4,5.0,50.0


In [41]:
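# Iterative (MICE-style) imputation. With the default settings this
# produces a single completed dataset, not multiple imputations.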
imputer = IterativeImputer(max_iter=30, random_state=0)
df_imputed = imputer.fit_transform(df)
df_imputed

array([[ 1.        , 10.        ],
       [ 2.        , 20.00036152],
       [ 2.99987348, 30.        ],
       [ 4.        , 40.        ],
       [ 5.        , 50.        ]])
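
The cell above returns one completed dataset. To obtain several imputations, one possible approach is to set `sample_posterior=True` and repeat the fit with different seeds. The sketch below assumes five imputations and simply averages them; the names `completed`, `pooled_mean`, and `between_std` are illustrative, and a full analysis would normally fit the downstream model on each dataset and pool those results instead:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Same values as the DataFrame created above.
df = pd.DataFrame({"A": [1, 2, np.nan, 4, 5], "B": [10, np.nan, 30, 40, 50]})

n_imputations = 5  # assumed number of imputed datasets
completed = []
for seed in range(n_imputations):
    # sample_posterior=True draws each imputed value from the fitted
    # model's predictive distribution instead of using its mean.
    imputer = IterativeImputer(
        max_iter=30, sample_posterior=True, random_state=seed
    )
    completed.append(imputer.fit_transform(df))

# Shape: (n_imputations, n_rows, n_columns)
stacked = np.stack(completed)

# Average of the imputations and their spread across seeds.
pooled_mean = pd.DataFrame(stacked.mean(axis=0), columns=df.columns)
between_std = pd.DataFrame(stacked.std(axis=0, ddof=1), columns=df.columns)

print(pooled_mean)
print(between_std)
```

The spread across imputations is zero for observed cells and gives a rough sense of how uncertain each imputed cell is.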