# **Missing Values**

## Continuous Values

## Complete Case Analysis (Deletion):
1. When the missing data is minimal and occurs completely at random.
2. When the missing data is distributed uniformly across the variables.
3. When there is a large amount of data available, and the removal of a few rows or columns will not significantly affect the analysis.



## Mean, Median, or Mode Imputation:

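1. When the data is missing completely at random (MCAR) or missing at random (MAR), so that the missingness is not related to the value of the variable itself. If the data is missing not at random (MNAR), the missingness depends on the missing value and summary statistics will be biased.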
2. When the percentage of missing data is relatively small and imputing with summary statistics is reasonable.

## Interpolation:

1. When working with continuous data or time series data with a smooth relationship between the variables.
2. When the missing data points lie within the range of observed values.

## Machine Learning-based Methods:

1. When the missing data is related to other features in a complex way that cannot be easily captured by simple imputation methods.
2. When there is a substantial amount of data available to build reliable predictive models.

## Categorical Values

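Deletion, mode imputation, and machine learning-based methods from above can also be used. Mean, median, and interpolation need numeric values and do not apply to unordered categories.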

## Constant Value: 
1. You can use a specific constant value to represent missing data in a categorical variable. 
2. This constant can be a string like "Unknown" or "-1," depending on the nature of the data.

## Random Imputation: 
Missing categorical values can be imputed by randomly selecting a category from the available categories in the dataset.

# Implementation

In [19]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')  # silence library warnings to keep the output short
df = pd.read_csv('titanic.csv')    # Titanic passenger data

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [21]:
df.columns[df.isnull().any()]

Index(['Age', 'Cabin', 'Embarked'], dtype='object')
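
Knowing which columns have gaps is only half the picture; the share of missing values per column is what decides between deletion and imputation. A minimal sketch, using a small made-up frame (`toy` is hypothetical and only stands in for the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# isnull() gives True/False per cell; the column mean of that is the
# fraction missing. Multiply by 100 and sort, largest first.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

On the real data the same expression would be `df.isnull().mean().mul(100)`.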

### Deletion

In [22]:
data=df.copy()
data.dropna(inplace=True)
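
A plain `dropna()` removes every row that has any missing value. Since `Cabin` has only 204 non-null values here, that keeps at most 204 of the 891 rows. A less drastic option is to drop the mostly empty column first and then drop rows only for selected columns. A sketch on a made-up frame (`toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: Cabin is mostly missing, Age and Embarked rarely.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, np.nan, np.nan, "C123"],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Drop the mostly missing column instead of the rows it would remove.
without_cabin = toy.drop(columns=["Cabin"])

# Drop only the rows where the listed columns are missing.
rows_kept = without_cabin.dropna(subset=["Age", "Embarked"])

# Compare how many rows survive each strategy.
print(len(toy.dropna()), len(rows_kept))
```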

### Mean, Median, or Mode Imputation:

1. Mean for roughly normally distributed continuous variables
2. Median for skewed continuous variables
3. Mode for categorical variables
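
One way to apply the rule above is to check the skewness of the column before picking the statistic. The sketch below uses a made-up series, and the cut-off of 1.0 is only a rule-of-thumb assumption, not a fixed standard:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one large outlier, so the distribution is right-skewed.
ages = pd.Series([22.0, 24.0, 25.0, 27.0, np.nan, 80.0])

# skew() ignores NaN; values far from 0 indicate an asymmetric distribution.
skewness = ages.skew()

# Assumed threshold: treat |skew| > 1.0 as "skewed" and prefer the median.
fill_value = ages.median() if abs(skewness) > 1.0 else ages.mean()

filled = ages.fillna(fill_value)
print(skewness, fill_value)
```

Here the outlier pulls the mean up to 35.6 while the median stays at 25.0, which is why the median is the safer fill value for skewed data.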

In [36]:
df['Cabin'].mode()[0]  # mode() returns a Series because there can be ties; [0] takes the first mode

'B96 B98'

In [37]:
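data=df.copy()
# mode()[0] is the single string 'B96 B98' (one ticket covering two cabins);
# [0:3] slices that string down to the first cabin, 'B96'.
# Assign the result back instead of calling fillna(inplace=True) on the column,
# which newer pandas versions warn about.
data['Cabin'] = data['Cabin'].fillna(data['Cabin'].mode()[0][0:3])
data.columns[data.isnull().any()]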

Index(['Age', 'Embarked'], dtype='object')

In [39]:
data=df.copy()
# Median imputation for Age; assign back rather than using inplace=True on the column.
data['Age'] = data['Age'].fillna(data['Age'].median())
data.columns[data.isnull().any()]

Index(['Cabin', 'Embarked'], dtype='object')

### Interpolation:

In [42]:
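data=df.copy()
# Linear interpolation between neighbouring rows. This is only meaningful when
# the row order carries information (for example a time series).
data['Age']=data['Age'].interpolate()
data.columns[data.isnull().any()]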

Index(['Cabin', 'Embarked'], dtype='object')

In [41]:
data['Age']

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888    22.5
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64
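
By default `interpolate()` is linear and also fills trailing gaps by carrying the last observed value forward, which is not interpolation in the strict sense. To fill only gaps that lie between observed values, `limit_area="inside"` can be used. A sketch on a made-up series (`s` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap, an inner gap, and a trailing gap.
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 16.0, np.nan])

# Default: inner gaps are interpolated linearly, the trailing gap is filled
# with the last observed value, the leading gap stays NaN.
default = s.interpolate()

# limit_area="inside": only gaps surrounded by observed values are filled.
inside = s.interpolate(limit_area="inside")

print(default.tolist())
print(inside.tolist())
```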

### Constant Value

In [43]:
data=df.copy()
# Replace missing cabins with a constant label; assign back rather than using inplace=True.
data['Cabin'] = data['Cabin'].fillna('rare')
data.columns[data.isnull().any()]

Index(['Age', 'Embarked'], dtype='object')

In [48]:
data['Cabin'].value_counts()

rare           687
C23 C25 C27      4
G6               4
B96 B98          4
C22 C26          3
              ... 
E34              1
C7               1
C54              1
E36              1
C148             1
Name: Cabin, Length: 148, dtype: int64

### Random Imputation

In [51]:
import random
data=df.copy()
data['Cabin'].value_counts()

B96 B98        4
G6             4
C23 C25 C27    4
C22 C26        3
F33            3
              ..
E34            1
C7             1
C54            1
E36            1
C148           1
Name: Cabin, Length: 147, dtype: int64

In [54]:
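categories = df['Cabin'].dropna().unique()  # distinct observed cabins
data = df.copy()

# Fill every missing Cabin with a category drawn uniformly at random
# from the distinct observed categories.
missing = data['Cabin'].isnull()
data.loc[missing, 'Cabin'] = random.choices(list(categories), k=int(missing.sum()))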
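
The cell above draws from the distinct categories, so a rare category is as likely to be picked as a common one. If the imputed values should follow the observed distribution, one option is to sample from the observed values themselves. A sketch on a made-up series (`cabins` and the seed are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the result is reproducible

# Hypothetical column: "A" is three times as common as "B".
cabins = pd.Series(["A", "A", "A", "B", np.nan, np.nan, np.nan])

observed = cabins.dropna()
mask = cabins.isnull()

# Sampling with replacement from the observed values picks each category
# in proportion to how often it occurs.
draws = rng.choice(observed.to_numpy(), size=mask.sum())

filled = cabins.copy()
filled[mask] = draws
print(filled.value_counts())
```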

### Machine Learning-based Methods

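Any supervised learning algorithm can be used to predict the missing values: treat the column with missing values as the target, train on the rows where it is observed, and predict it for the rows where it is missing.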
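
As a minimal sketch of the idea, the example below imputes `Age` from `Pclass` and `Fare` with ordinary least squares, the simplest predictive model. It is written with NumPy only, the toy values are made up, and `design_matrix` is a hypothetical helper. Any other regressor could be swapped in for the least-squares fit.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in which Age loosely follows Pclass and Fare.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 1, 3],
    "Fare":   [80.0, 70.0, 30.0, 25.0, 8.0, 7.0, 75.0, 9.0],
    "Age":    [45.0, 40.0, 32.0, np.nan, 22.0, 20.0, np.nan, 23.0],
})

features = ["Pclass", "Fare"]
known = toy["Age"].notnull()

def design_matrix(frame):
    # Feature columns plus a constant column for the intercept.
    X = frame[features].to_numpy(dtype=float)
    return np.column_stack([np.ones(len(X)), X])

# Fit ordinary least squares on the rows where Age is observed.
coef, *_ = np.linalg.lstsq(
    design_matrix(toy[known]), toy.loc[known, "Age"].to_numpy(), rcond=None
)

# Predict Age for the rows where it is missing and fill only those rows.
imputed = toy.copy()
imputed.loc[~known, "Age"] = design_matrix(toy[~known]) @ coef
print(imputed)
```

The observed ages are left untouched; only the two missing entries are replaced by model predictions.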