## Handling Missing Values

In [3]:
import numpy as np
import pandas as pd

In [4]:
new_df = pd.DataFrame({'col_a': [1,2,4,1, np.nan, np.nan, 5],
                       'col_b': [3,7, np.nan, 9, None, 5, 8],
                       'col_c': ['a', '?', 'x', 'y', '--', np.nan, 'r'],
                       'col_d': [True, True, np.nan, None, False, True, False]})

In [5]:
new_df

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,?,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,--,False
5,,5.0,,True
6,5.0,8.0,r,False


In [16]:
new_df.to_csv("data_saya2.csv", index=False)

`np.nan`, `None` and `NaT` (for datetime64[ns] types) are the standard missing-value markers in Pandas.
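As a quick check of the statement above, the following sketch passes each marker to `pd.isna()`. The names `markers`, `detected` and `dates` are illustrative only and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# pd.isna() treats all three standard markers as missing.
markers = {"np.nan": np.nan, "None": None, "pd.NaT": pd.NaT}
detected = {name: bool(pd.isna(value)) for name, value in markers.items()}
print(detected)

# In a datetime column, a missing entry is stored as NaT.
dates = pd.Series(pd.to_datetime(["2021-01-01", None]))
print(dates.isna().tolist())
```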

### Find Missing Values

Pandas provides the `isnull()` and `isna()` functions to detect missing values. `isnull()` is an alias of `isna()`, so both return the same result.
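A minimal sketch of that equivalence on a small, made-up DataFrame (`demo` is a hypothetical name used only here):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"x": [1.0, np.nan], "y": ["a", None]})

# isnull() is an alias of isna(), so the two boolean masks are identical.
same_mask = demo.isna().equals(demo.isnull())
print(same_mask)

# notna() is the complement: True where a value is present.
print(demo.notna())
```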

In [8]:
new_df.shape

(7, 4)

In [9]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col_a   5 non-null      float64
 1   col_b   5 non-null      float64
 2   col_c   6 non-null      object 
 3   col_d   5 non-null      object 
dtypes: float64(2), object(2)
memory usage: 352.0+ bytes


In [10]:
new_df.isna()

Unnamed: 0,col_a,col_b,col_c,col_d
0,False,False,False,False
1,False,False,False,False
2,False,True,False,True
3,False,False,False,True
4,True,True,False,False
5,True,False,True,False
6,False,False,False,False


`df.isna().any()` returns one boolean value per column: True if the column contains at least one missing value, False otherwise.

In [11]:
new_df.isna().any()

col_a    True
col_b    True
col_c    True
col_d    True
dtype: bool

In [12]:
new_df.isna().sum()

col_a    2
col_b    2
col_c    1
col_d    2
dtype: int64

In [13]:
new_df.isnull().sum()

col_a    2
col_b    2
col_c    1
col_d    2
dtype: int64
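Beyond absolute counts, it can be useful to see the share of missing values per column, or the rows that contain them. The sketch below shows one way to do both on a small, made-up DataFrame; `demo`, `missing_share` and `rows_with_missing` are illustrative names.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"x": [1.0, np.nan, 3.0, np.nan],
                     "y": ["a", "b", None, "d"]})

# The mean of a boolean mask is the fraction of True values,
# i.e. the share of missing entries in each column.
missing_share = demo.isna().mean()
print(missing_share)

# any(axis=1) flags rows that contain at least one missing value.
rows_with_missing = demo[demo.isna().any(axis=1)]
print(rows_with_missing)
```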

Missing values can also be encoded as placeholder characters, such as the "?" and "--" entries in col_c.\
Pandas does not detect these characters as missing values.

If we know which characters are used to mark missing values in the dataset, we can handle them by passing them to the `na_values` parameter when reading the data:


In [17]:
# Strings that should be read as NaN in addition to the pandas defaults
missing_values = ["?", "--", ">", "="]
df2 = pd.read_csv("data_saya2.csv", na_values=missing_values)

In [19]:
new_df

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,?,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,--,False
5,,5.0,,True
6,5.0,8.0,r,False


In [18]:
df2

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


In [20]:
df2.isna().sum()

col_a    2
col_b    2
col_c    3
col_d    2
dtype: int64
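`na_values` also accepts a dict, which limits each list of placeholders to one column. The sketch below reads a small in-memory CSV (a hypothetical stand-in for a file on disk) to show the idea.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk.
csv_text = "col_a,col_c\n1,a\n2,?\n3,--\n"

# A dict limits each placeholder list to the named column.
demo = pd.read_csv(io.StringIO(csv_text), na_values={"col_c": ["?", "--"]})
print(demo.isna().sum())
```

This is useful when a string such as "?" is a placeholder in one column but a legitimate value in another.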

Another option is to use the pandas `replace()` method to convert these values to NaN after the DataFrame has been created:


In [21]:
df3 = new_df.replace({"?": np.nan, "--": np.nan})

In [22]:
df3

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


In [23]:
df3.isna().sum()

col_a    2
col_b    2
col_c    3
col_d    2
dtype: int64
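When a placeholder should be replaced in only one column, `replace()` accepts a nested dict of the form `{column: {old: new}}`. A minimal sketch on made-up data (`demo` and `cleaned` are illustrative names):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"code": ["a", "?", "--"], "note": ["ok", "?", "fine"]})

# Nested dict: {column: {old_value: new_value}} restricts the replacement
# to the "code" column; the "?" in "note" is left untouched.
cleaned = demo.replace({"code": {"?": np.nan, "--": np.nan}})
print(cleaned.isna().sum())
```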

Compare with the original DataFrame:

In [24]:
new_df.isna().sum()

col_a    2
col_b    2
col_c    1
col_d    2
dtype: int64

### Drop Missing Values

In [25]:
df3

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


We can drop rows or columns that contain missing values with the `dropna()` function. The `how` parameter sets the condition:\
how='any' : drop the row/column if it contains any missing value\
how='all' : drop the row/column only if all of its values are missing
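The same conditions apply to columns when `axis=1` is passed, and the `subset` parameter limits the check to selected columns. The following sketch uses a small, made-up DataFrame to illustrate both; the names are hypothetical.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                     "b": [np.nan, np.nan, np.nan],
                     "c": ["x", "y", None]})

# axis=1 with how='all' drops columns in which every value is missing.
without_empty_cols = demo.dropna(axis=1, how="all")
print(without_empty_cols.columns.tolist())

# subset restricts the check to the listed columns when dropping rows.
rows_with_a = demo.dropna(subset=["a"])
print(rows_with_a.index.tolist())
```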

In [26]:
df3.dropna(axis=0, how='all', inplace=True)
df3

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


In [27]:
df3.dropna(axis=0, how='any')

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
6,5.0,8.0,r,False


We can use the `thresh` parameter to set a threshold for dropping a row/column. `thresh` is the minimum number of non-NA values a row/column must contain to be kept; rows/columns with fewer non-NA values are dropped.

In [28]:
df3

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


In [29]:
df3.dropna(axis=0, thresh=3)

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
3,1.0,9.0,y,
6,5.0,8.0,r,False
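To make the meaning of `thresh` concrete, the sketch below compares `dropna(thresh=2)` with an explicit count of non-missing values per row on a made-up DataFrame. `demo`, `kept` and `manual` are illustrative names.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                     "b": [2.0, np.nan, np.nan],
                     "c": ["x", "y", None]})

# Keep rows that have at least 2 non-missing values.
kept = demo.dropna(thresh=2)

# The same selection written out by counting non-missing values per row.
manual = demo[demo.notna().sum(axis=1) >= 2]
print(kept.equals(manual))
```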


### Replacing Missing Values

The `fillna()` function in Pandas replaces missing values with other values.\
Missing values can be replaced by:
1. A special (fixed) value
2. An aggregate value, such as the mean, median or mode of the column
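Both kinds of replacement can be combined in a single call, because `fillna()` accepts a dict that maps column names to fill values. A minimal sketch on made-up data (the column names and the "unknown" label are assumptions for illustration):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"num": [1.0, np.nan, 3.0], "cat": ["x", None, "x"]})

# A dict maps each column to its own fill value:
# the column mean for "num" and a fixed label for "cat".
fill_values = {"num": demo["num"].mean(), "cat": "unknown"}
filled = demo.fillna(fill_values)
print(filled)
```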

#### Replacing with a scalar

In [30]:
df5 = df3.fillna(0)
df5

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,0,True
2,4.0,0.0,x,0
3,1.0,9.0,y,0
4,0.0,0.0,0,False
5,0.0,5.0,0,True
6,5.0,8.0,r,False


In [31]:
df3

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


In [32]:
# Fill the missing values in the first column (col_a) with that column's mean
df3.iloc[:, 0] = df3.iloc[:, 0].fillna(df3.iloc[:, 0].mean())
df3

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,2.6,,,False
5,2.6,5.0,,True
6,5.0,8.0,r,False


In [33]:
dfx = new_df.replace({"?": np.nan, "--": np.nan})
dfx

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


In [34]:
# Fill the missing values in col_a with the column's mode (most frequent value)
dfx['col_a'] = dfx['col_a'].fillna(dfx['col_a'].mode()[0])
dfx

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,1.0,,,False
5,1.0,5.0,,True
6,5.0,8.0,r,False
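The `[0]` after `mode()` is needed because `mode()` returns a Series rather than a single value: several values can tie for most frequent. The sketch below illustrates this on a made-up Series.

```python
import pandas as pd

s = pd.Series([1, 1, 2, 2, 3])

# Two values tie for most frequent, so mode() returns both, sorted.
modes = s.mode()
print(modes.tolist())

# Indexing with [0] picks the smallest of the tied modes.
print(modes[0])
```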


In [34]:
df7 = new_df.replace({"?": np.nan, "--": np.nan})
df7

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


In [35]:
df8 = new_df.replace({"?": np.nan, "--": np.nan})
df8

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


Fill each missing value with the last valid value that precedes it in the same column by using `ffill` (forward fill):

In [36]:
# fillna(method='ffill') is deprecated in recent pandas versions; ffill() is the direct equivalent
df8.ffill(inplace=True)

In [37]:
df8

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,a,True
2,4.0,7.0,x,True
3,1.0,9.0,y,True
4,1.0,9.0,y,False
5,1.0,5.0,y,True
6,5.0,8.0,r,False
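Forward fill has two details worth knowing: it cannot fill a missing value at the very start of a column, and the `limit` parameter caps how many consecutive missing values are filled. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# Forward fill cannot fill the first entry: nothing precedes it.
print(s.ffill().tolist())

# limit=1 fills at most one consecutive missing value after each valid one.
limited = s.ffill(limit=1)
print(limited.tolist())
```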


In [38]:
df9 = new_df.replace({"?": np.nan, "--": np.nan})
df9

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


In [40]:
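# Backward fill: take the next valid value below each missing value.
# fillna(method='bfill') is deprecated in recent pandas versions; bfill() is the direct equivalent
df9.bfill(inplace=True)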

In [41]:
df9

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,x,True
2,4.0,9.0,x,False
3,1.0,9.0,y,False
4,5.0,5.0,r,False
5,5.0,5.0,r,True
6,5.0,8.0,r,False


## Exercise 7

1. Find how many missing values there are in each column of the Titanic data

In [47]:
import pandas as pd

In [48]:
titanic = pd.read_csv('train.csv')

2. Replace the missing values in each column as follows:\
    - Embarked: the value 'S'\
    - Age: the column mean\
    - Cabin: the column mode