In [1]:
import pandas as pd
import numpy as np

## Capturing NaN values with a new feature
This works well when the data are not missing completely at random, because the fact that a value is missing may itself carry information about the target.

In [4]:
df=pd.read_csv('titanic.csv',usecols=['Age','Fare','Survived'])
df.head()

Unnamed: 0,Survived,Age,Fare
0,0,22.0,7.25
1,1,38.0,71.2833
2,1,26.0,7.925
3,1,35.0,53.1
4,0,35.0,8.05


In [5]:
df.isnull().sum()

Survived      0
Age         177
Fare          0
dtype: int64

In [6]:
# Flag rows where Age is missing (1 = missing, 0 = present) before imputing
df['Age_NAN']=np.where(df['Age'].isnull(),1,0)

In [7]:
df.head()

Unnamed: 0,Survived,Age,Fare,Age_NAN
0,0,22.0,7.25,0
1,1,38.0,71.2833,0
2,1,26.0,7.925,0
3,1,35.0,53.1,0
4,0,35.0,8.05,0


In [8]:
df.Age.median()

28.0

In [9]:
# Assign the result back: calling fillna(..., inplace=True) on df['Age'] raises a
# FutureWarning, and from pandas 3.0 it will no longer update df.
df['Age']=df['Age'].fillna(df['Age'].median())


In [10]:
df.head()

Unnamed: 0,Survived,Age,Fare,Age_NAN
0,0,22.0,7.25,0
1,1,38.0,71.2833,0
2,1,26.0,7.925,0
3,1,35.0,53.1,0
4,0,35.0,8.05,0


In [11]:
df['Age'].isnull().sum()

0

## Advantages
1. Easy to implement
2. Captures the importance of missing values: the model can learn from the fact that a value was missing
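
One way to see whether the missingness carries any signal is to compare the target between rows where the value was missing and rows where it was present. The sketch below uses a small made-up frame (`toy`, not the Titanic file) and the same `Age_NAN` indicator as above. The numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic file: Age is missing more often
# for rows where Survived is 0.
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1, 0, 1],
    "Age": [np.nan, np.nan, 30.0, 22.0, 41.0, np.nan, 35.0, 28.0],
})

# Same indicator as in the notebook: 1 where Age is missing, 0 otherwise.
toy["Age_NAN"] = np.where(toy["Age"].isnull(), 1, 0)

# Mean of the target within each indicator group.
rate_by_missing = toy.groupby("Age_NAN")["Survived"].mean()
print(rate_by_missing)
```

If the two group means differ noticeably, the indicator is likely to be a useful feature. If they are close, the values may be missing completely at random and the extra column adds little.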

## Disadvantages
1. Creates an additional feature for every column with missing values, which increases dimensionality (curse of dimensionality)
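
To keep the number of extra features down, the indicator can be added only for columns that actually contain missing values. The following is a minimal sketch under that assumption. `add_nan_indicators` is a hypothetical helper written for illustration, not a pandas function, and the `toy` frame is made up.

```python
import numpy as np
import pandas as pd


def add_nan_indicators(frame, columns=None):
    """Return a copy with a <col>_NAN flag and a median fill,
    applied only to columns that contain missing values.

    Hypothetical helper for illustration; not part of pandas.
    """
    out = frame.copy()
    if columns is None:
        # Default to numeric columns, since the median is only defined for them.
        columns = out.select_dtypes(include="number").columns
    for col in columns:
        if out[col].isnull().any():
            # Record the missingness first, then impute with the median.
            out[col + "_NAN"] = out[col].isnull().astype(int)
            out[col] = out[col].fillna(out[col].median())
    return out


toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})
result = add_nan_indicators(toy)
print(result.columns.tolist())
```

Here `Fare` has no missing values, so no `Fare_NAN` column is created and only `Age_NAN` is added.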