# Dealing with missing values

This notebook demonstrates how to handle missing values in a dataset. Common ways of dealing with missing values:

- **Deleting rows or columns:**
    - Delete a row containing missing values.
    - Delete a column if roughly 70-75% or more of its values are missing.
    - ***NOTE***
        - Deleting incomplete records can still yield an accurate model, but it comes at the cost of losing information.
        - If more than about 30% of the rows have missing values, removing those rows tends to lead to poor model performance.
---
- **Replacing with Mean / Median / Mode (Imputation):**
    - ***Numerical Variable***: Fill missing values with the mean or median.
    - ***Categorical Variable***: Fill missing values with the class that has the highest count (i.e. the mode).
    - Look at the descriptive statistics to decide whether a numerical variable should be imputed with the mean or the median.
        - The mean is strongly affected by outliers, so choose between mean and median with care.
        - That is, generate descriptive statistics that summarize the central tendency, dispersion and shape of the distribution, excluding NaN values (for example with `describe()`).
    - ***Alternatively***: Impute with the mean of the k nearest neighbours (kNN).
---
- **Predicting the missing values:** Use the features which do not have missing values to predict the values of another feature/attribute.

## Importing and loading data

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

print(pd.__version__)
print(np.__version__)

2.1.3
1.26.1


In [36]:
#loading the data
data = pd.read_csv('datasets/titanic_train.csv')
data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [3]:
# Shape and columns of the data
print('Shape:', data.shape)
print('Columns:', data.columns)

Shape: (891, 12)
Columns: Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


## Checking the number of missing values

- Missing data is represented by NaN.
- To check for null values in a pandas DataFrame, use *isnull()* or *isna()*.
- Both functions return a DataFrame of Boolean values that are True wherever a value is NaN.
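
To make the Boolean mask concrete, here is a minimal sketch on a small made-up frame (the `toy` DataFrame and its column names are assumptions for illustration, not part of the Titanic data). It shows that `isnull()` and `isna()` produce the same mask and that summing the mask counts missing values per column or per row.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame; the values are illustrative only
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "cabin": ["C85", None, None],
})

mask = toy.isna()                  # Boolean DataFrame, True where a value is missing
same = toy.isnull().equals(mask)   # isnull() is an alias of isna()
per_column = mask.sum()            # number of missing values in each column
per_row = mask.sum(axis=1)         # number of missing values in each row

print(same)
print(per_column)
```

Because `True` counts as 1 when summed, `mask.sum()` is the same idiom the next cell applies to the full dataset.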

In [5]:
# Missing values in the data
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

- 'Age' and 'Cabin' have a large number of missing values (177 and 687 of 891 rows).
- 'Embarked' has only 2 missing values.

## Deleting rows and columns

- Delete a row containing missing values.
- Delete a column if roughly 70-75% or more of its values are missing.

### Deleting rows with missing values

Here we drop every row that contains at least one missing value.

In [6]:
# 'Age' variable before any missing-value treatment
data['Age'].head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

In [7]:
# Dropping all rows with missing values
age_data = data.dropna(axis = 0)
age_data['Age'].head()

1     38.0
3     35.0
6     54.0
10     4.0
11    58.0
Name: Age, dtype: float64

In [8]:
# Shape before and after removing missing values
print(data.shape, age_data.shape)

(891, 12) (183, 12)


**Inferences Drawn:** There is a significant loss of information: only 183 of the 891 rows remain, even though only three columns had missing values.
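
One way to limit this loss, sketched here on made-up data, is to restrict the check to selected columns with the `subset` parameter of `dropna`. The `toy` frame is hypothetical; which columns to check is a modelling decision that this notebook does not prescribe.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the situation above: three columns with gaps
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", None, "S"],
})

# Drops a row if ANY column is missing: only the fully complete row survives
all_dropped = toy.dropna(axis=0)

# Drops a row only if 'Embarked' is missing; gaps in other columns are kept
subset_dropped = toy.dropna(subset=["Embarked"])

print(len(toy), len(all_dropped), len(subset_dropped))  # prints: 4 1 3
```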

### Deleting columns with missing values

In [9]:
# Fraction of missing values per column
round(data.isnull().sum() / len(data), 2)

PassengerId    0.00
Survived       0.00
Pclass         0.00
Name           0.00
Sex            0.00
Age            0.20
SibSp          0.00
Parch          0.00
Ticket         0.00
Fare           0.00
Cabin          0.77
Embarked       0.00
dtype: float64

In [10]:
# Dropping columns with fewer than 500 non-missing values (only 'Cabin' here)
new_data = data.dropna(thresh = 500, axis=1)
new_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | S |


In [11]:
# shape before and after removing missing values
print(data.shape, new_data.shape)

(891, 12) (891, 11)
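
The `thresh` argument counts non-missing values, so it has to be derived from the row count. As an alternative sketch, the cutoff can be expressed directly as a fraction of missing values. The 0.70 cutoff and the `toy` frame below are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'b' is 25% missing, 'c' is 75% missing
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 1.0],
})

max_missing = 0.70  # assumed cutoff, mirroring the 70-75% rule of thumb

# The mean of a Boolean mask is the fraction of True values per column
missing_ratio = toy.isnull().mean()

# Columns whose missing fraction exceeds the cutoff
to_drop = missing_ratio[missing_ratio > max_missing].index
reduced = toy.drop(columns=to_drop)

print(list(reduced.columns))  # prints: ['a', 'b']
```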


## Replacing values with a new category / values

You can fill the missing values with a constant value or a new category, depending on the type of the attribute.

In [12]:
# Missing values in the data
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [13]:
# Checking the data in the Cabin
data['Cabin'].head()

0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object

In [16]:
# Fill the NaN values in 'Cabin' with 'missing'
data['Cabin'].fillna(value = 'missing').head()

0    missing
1        C85
2    missing
3       C123
4    missing
Name: Cabin, dtype: object

In [17]:
# Fill the NaN values in 'Age' with 999
# (the first five rows have no missing 'Age', so head() looks unchanged)
data['Age'].fillna(value=999).head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

In [18]:
# Checking the frequency of 'Age'
data['Age'].value_counts()

Age
24.00    30
22.00    27
18.00    26
19.00    25
28.00    25
         ..
36.50     1
55.50     1
0.92      1
23.50     1
74.00     1
Name: count, Length: 88, dtype: int64

In [19]:
# Make a copy
copied_data = data.copy() 

# Replace values
copied_data['Age'] = copied_data['Age'].fillna(value = 999)
copied_data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [21]:
# Flagging missing 'Cabin' values as 1 and present values as 0
(data['Cabin'].isnull()).astype('int').head()

0    1
1    0
2    1
3    0
4    1
Name: Cabin, dtype: int32

In [22]:
# Adding the missing-value indicator as a new column
copied_data['Cabin_NaN'] = (data['Cabin'].isnull()).astype('int')
copied_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Cabin_NaN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 1 |
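
The constant fill and the indicator column can be combined in one pass. The sketch below, on a hypothetical two-row frame, records where values were missing before filling, then passes a dictionary to `fillna` so each column gets its own replacement. Keeping the indicator matters because a sentinel such as 999 is otherwise indistinguishable from a real value and distorts statistics like the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric and one categorical gap
toy = pd.DataFrame({
    "Age": [22.0, np.nan],
    "Cabin": [None, "C85"],
})

filled = toy.copy()

# Record where values were missing BEFORE they are overwritten
for col in ["Age", "Cabin"]:
    filled[col + "_NaN"] = toy[col].isnull().astype(int)

# A dict maps each column to its own fill value
filled = filled.fillna({"Age": 999, "Cabin": "missing"})

print(filled)
```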


## Imputing values with mean, median or mode values

### Age

In [24]:
# finding mean value of 'Age'
mean_age = data['Age'].mean()
print('Mean Age:', mean_age.round(3))

Mean Age: 29.699


In [25]:
# Making a copy
cleaned_data = data.copy() 

# Imputing missing values
cleaned_data['Age'] = data['Age'].fillna(value = mean_age)
cleaned_data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
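
The mean was used above. To illustrate why the choice between mean and median deserves a look at the descriptive statistics, here is a minimal sketch on a made-up series with one artificial outlier (the values are assumptions, not Titanic ages).

```python
import numpy as np
import pandas as pd

# Hypothetical ages; 250.0 is an artificial outlier, NaN is the gap to fill
ages = pd.Series([22.0, 24.0, 26.0, 28.0, np.nan, 250.0])

summary = ages.describe()    # count, mean, std, min, quartiles, max (NaN excluded)
mean_val = ages.mean()       # pulled upward by the outlier
median_val = ages.median()   # robust to the outlier

print(summary)
print("mean:", mean_val, "median:", median_val)  # mean: 70.0 median: 26.0

# With such a skewed distribution the median is the safer fill value
filled = ages.fillna(median_val)
```

A large gap between the mean and the 50% row of `describe()` is a quick hint that outliers are present.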

### Embarked

In [26]:
# Checking the frequency of every value
data['Embarked'].value_counts()

Embarked
S    644
C    168
Q     77
Name: count, dtype: int64

In [27]:
mode_emb = data['Embarked'].mode()[0]
print('Embarked mode value:', mode_emb)

Embarked mode value: S


In [28]:
# Imputing missing values
cleaned_data['Embarked'] = data['Embarked'].fillna(value = mode_emb)
cleaned_data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64
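
The overview lists imputation with the kNN mean as an alternative, which the cells above do not show. The following is a minimal sketch of the idea, assuming the feature columns are complete and on comparable scales; the helper `knn_mean_impute` and the `toy` data are hypothetical. scikit-learn's `KNNImputer` provides a ready-made, more general version.

```python
import numpy as np
import pandas as pd

def knn_mean_impute(df, target, features, k=2):
    """Fill NaNs in `target` with the mean `target` of the k nearest rows,
    using Euclidean distance on `features` (assumed to have no NaNs)."""
    out = df.copy()
    known = out[out[target].notna()]           # rows that can serve as neighbours
    known_x = known[features].to_numpy(dtype=float)
    for idx in out.index[out[target].isna()]:
        row_x = out.loc[idx, features].to_numpy(dtype=float)
        dist = np.sqrt(((known_x - row_x) ** 2).sum(axis=1))
        nearest = known.iloc[np.argsort(dist)[:k]]
        out.loc[idx, target] = nearest[target].mean()
    return out

# Hypothetical data: the last passenger has no recorded age
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Fare": [70.0, 80.0, 8.0, 7.0, 7.5],
    "Age": [40.0, 50.0, 20.0, 24.0, np.nan],
})

imputed = knn_mean_impute(toy, target="Age", features=["Pclass", "Fare"], k=2)
print(imputed.loc[4, "Age"])  # prints: 22.0
```

The two nearest neighbours of the last row are the other third-class passengers with similar fares, so the filled age is the mean of 20 and 24. In practice the features would usually be scaled first, since an unscaled column such as 'Fare' dominates the distance.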

## Using relationship with another feature(s)

Use the features which do not have missing values to predict the values of another feature/attribute.

In [45]:
#loading the data
data = pd.read_csv('datasets/titanic_train.csv')
data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [43]:
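# Dropping all non-numerical columns into a separate frame,
# so that `data` keeps 'Name' for the cells that follow
numeric_data = data.drop(['Name', 'Sex', 'Ticket', 'Embarked', 'Cabin'], axis = 1)

# Correlation matrix of the numerical columns
numeric_data.corr().round(2)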

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| PassengerId | 1.0 | -0.01 | -0.04 | 0.04 | -0.06 | -0.0 | 0.01 |
| Survived | -0.01 | 1.0 | -0.34 | -0.08 | -0.04 | 0.08 | 0.26 |
| Pclass | -0.04 | -0.34 | 1.0 | -0.37 | 0.08 | 0.02 | -0.55 |
| Age | 0.04 | -0.08 | -0.37 | 1.0 | -0.31 | -0.19 | 0.1 |
| SibSp | -0.06 | -0.04 | 0.08 | -0.31 | 1.0 | 0.41 | 0.16 |
| Parch | -0.0 | 0.08 | 0.02 | -0.19 | 0.41 | 1.0 | 0.22 |
| Fare | 0.01 | 0.26 | -0.55 | 0.1 | 0.16 | 0.22 | 1.0 |


In [47]:
# Selecting 'Name' and 'Age' for the rows where 'Age' is missing
data.loc[data['Age'].isnull(), ['Name', 'Age']].head()

| | Name | Age |
|---|---|---|
| 5 | Moran, Mr. James | NaN |
| 17 | Williams, Mr. Charles Eugene | NaN |
| 19 | Masselmani, Mrs. Fatima | NaN |
| 26 | Emir, Mr. Farred Chehab | NaN |
| 28 | O'Dwyer, Miss. Ellen "Nellie" | NaN |
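
The notebook stops at this point. As one possible illustration of using a related feature, and not necessarily the approach the author had in mind, the sketch below fills missing ages with the median age of the passenger's class, motivated by the -0.37 correlation between 'Pclass' and 'Age' in the table above. The `toy` frame and the `Age_filled` column name are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing age in each passenger class
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [38.0, np.nan, 42.0, 20.0, 26.0, np.nan],
})

# Median age within each class, aligned to the original rows
group_median = toy.groupby("Pclass")["Age"].transform("median")

# Fill each gap with the median of that row's own class
toy["Age_filled"] = toy["Age"].fillna(group_median)

print(toy)
```

Here the first-class gap is filled with 40.0 and the third-class gap with 23.0, so the imputed values reflect the age difference between classes instead of a single global mean.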
