## Handling Missing Values

In [2]:
import numpy as np
import pandas as pd

In [3]:
new_df = pd.DataFrame({'col_a': [1,2,4,1, np.nan, np.nan, 5],
                       'col_b': [3,7, np.nan, 9, None, 5, 8],
                       'col_c': ['a', '?', 'x', 'y', '--', np.nan, 'r'],
                       'col_d': [True, True, np.nan, None, False, True, False]})

In [4]:
new_df

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,?,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,--,False
5,,5.0,,True
6,5.0,8.0,r,False


`np.nan`, `None`, and `NaT` (for `datetime64[ns]` types) are the standard missing-value markers in Pandas.
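
As a quick check of the statement above, the following sketch puts all three markers into one Series and confirms that `isna()` flags each of them. The variable names (`markers`, `flags`, `dates`) are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# One Series holding the three standard missing-value markers.
markers = pd.Series([np.nan, None, pd.NaT], dtype="object")
flags = markers.isna().tolist()
print(flags)

# In a datetime column, a missing entry is stored as NaT.
dates = pd.to_datetime(pd.Series(["2021-01-01", None]))
print(dates.isna().tolist())
```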

### Find Missing Values

Pandas provides the `isnull()` and `isna()` functions to detect missing values. They do the same thing: `isnull()` is an alias of `isna()`.

In [5]:
new_df.shape

(7, 4)

In [6]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col_a   5 non-null      float64
 1   col_b   5 non-null      float64
 2   col_c   6 non-null      object 
 3   col_d   5 non-null      object 
dtypes: float64(2), object(2)
memory usage: 352.0+ bytes


In [7]:
new_df.isna()

Unnamed: 0,col_a,col_b,col_c,col_d
0,False,False,False,False
1,False,False,False,False
2,False,True,False,True
3,False,False,False,True
4,True,True,False,False
5,True,False,True,False
6,False,False,False,False


`df.isna().any()` returns one boolean value per column: `True` if the column contains at least one missing value, `False` otherwise.

In [8]:
new_df.isna().any()

col_a    True
col_b    True
col_c    True
col_d    True
dtype: bool

In [9]:
new_df.isna().sum()

col_a    2
col_b    2
col_c    1
col_d    2
dtype: int64

In [10]:
new_df.isnull().sum()

col_a    2
col_b    2
col_c    1
col_d    2
dtype: int64
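
The column-wise reductions above also have row-wise counterparts. The sketch below uses a small stand-in frame (`demo`, a hypothetical example rather than `new_df`) to select the rows that contain any missing value and to compute the share of missing values per column.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"col_a": [1.0, np.nan, 5.0, np.nan],
                     "col_b": [3.0, 7.0, np.nan, np.nan]})

mask = demo.isna()

# axis=1 reduces across columns, giving one boolean per row.
rows_with_missing = demo[mask.any(axis=1)]

# The mean of a boolean mask is the fraction of True values.
missing_share = mask.mean()

print(rows_with_missing.index.tolist())
print(missing_share)
```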

Missing values can also be encoded as placeholder characters, such as "?" and "--" in col_c.\
Pandas does not detect these characters as missing values.

If we know which characters are used as missing values in the dataset, we can handle them while reading the data by passing them to the `na_values` parameter of `read_csv()`:


In [11]:
new_df.to_csv("data_saya.csv")

In [12]:
missing_values = ["?", "--"]
df2 = pd.read_csv("data_saya.csv", na_values = missing_values)

In [13]:
df2.isna().sum()

Unnamed: 0    0
col_a         2
col_b         2
col_c         3
col_d         2
dtype: int64

In [14]:
df2

Unnamed: 0.1,Unnamed: 0,col_a,col_b,col_c,col_d
0,0,1.0,3.0,a,True
1,1,2.0,7.0,,True
2,2,4.0,,x,
3,3,1.0,9.0,y,
4,4,,,,False
5,5,,5.0,,True
6,6,5.0,8.0,r,False
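
The extra `Unnamed: 0` column in `df2` appears because `to_csv()` writes the index by default and `read_csv()` then reads it back as an ordinary, unnamed column. A minimal sketch of one way to avoid this, assuming the index carries no information worth saving. It uses an in-memory buffer instead of a file, and the names `raw`, `buffer`, and `clean` are illustrative.

```python
import io

import pandas as pd

raw = pd.DataFrame({"col_c": ["a", "?", "--", "r"]})

# index=False keeps the row index out of the CSV text.
buffer = io.StringIO()
raw.to_csv(buffer, index=False)
buffer.seek(0)

# The placeholders are turned into NaN while reading.
clean = pd.read_csv(buffer, na_values=["?", "--"])

print(clean.columns.tolist())
print(clean["col_c"].isna().sum())
```

If the file has already been written with its index, passing `index_col=0` to `read_csv()` is another option: the first column is then used as the index instead of becoming a data column.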


Another option is to use the Pandas `replace()` function to handle these values after a DataFrame has been created:


In [15]:
df3 = new_df.replace({"?": np.nan, "--": np.nan})

In [16]:
df3

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


In [17]:
df3.isna().sum()

col_a    2
col_b    2
col_c    3
col_d    2
dtype: int64

Compare with the original DataFrame:

In [18]:
new_df.isna().sum()

col_a    2
col_b    2
col_c    1
col_d    2
dtype: int64
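
`replace()` also accepts a list of placeholders that share one replacement value, and it can be applied to a single column when "?" or "--" might be legitimate data elsewhere. A short sketch on a hypothetical stand-in frame (`demo`):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"col_b": [3.0, 7.0, 9.0],
                     "col_c": ["a", "?", "--"]})

# Only col_c is touched; every listed placeholder becomes NaN.
demo["col_c"] = demo["col_c"].replace(["?", "--"], np.nan)

print(demo.isna().sum())
```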

### Drop Missing Values

In [19]:
df3

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


We can drop rows or columns that contain missing values using the `dropna()` function. The `how` parameter sets the condition:\
`how='any'`: drop if there is any missing value (the default)\
`how='all'`: drop only if all values are missing
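
Two further `dropna()` options are often useful alongside `how`: `subset` limits the check to chosen columns, and `axis=1` drops columns instead of rows. A minimal sketch on a hypothetical stand-in frame (`demo`):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"col_a": [1.0, np.nan, 5.0],
                     "col_b": [3.0, 7.0, np.nan],
                     "col_c": ["a", "b", "c"]})

# Only missing values in col_a cause a row to be dropped.
rows_kept = demo.dropna(subset=["col_a"])

# axis=1 drops every column that contains a missing value.
cols_kept = demo.dropna(axis=1, how="any")

print(rows_kept.index.tolist())
print(cols_kept.columns.tolist())
```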

In [20]:
# No row of df3 is entirely missing, so how='all' drops nothing here.
df3.dropna(axis=0, how='all', inplace=True)
df3

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


In [21]:
df3.dropna(axis=0, how='any')

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
6,5.0,8.0,r,False


We can use the `thresh` parameter to set a threshold for dropping a row or column. `thresh` is the minimum number of non-NA values that a row or column must contain to be kept.
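
To make the rule concrete, the sketch below counts the non-NA values in each row of a hypothetical stand-in frame (`demo`) and compares that count with what `dropna(thresh=2)` keeps.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"col_a": [1.0, np.nan, np.nan],
                     "col_b": [3.0, np.nan, 5.0],
                     "col_c": ["a", np.nan, np.nan],
                     "col_d": [True, False, True]})

# Number of non-NA values in each row.
non_na_per_row = demo.notna().sum(axis=1)

# Rows with fewer than 2 non-NA values are dropped.
kept = demo.dropna(axis=0, thresh=2)

print(non_na_per_row.tolist())
print(kept.index.tolist())
```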

In [22]:
df3

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


In [23]:
df3.dropna(axis=0, thresh=2)

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
5,,5.0,,True
6,5.0,8.0,r,False


### Replacing missing values

The `fillna()` function in Pandas replaces missing values with other values.\
Missing values can be replaced by:
1. A special value
2. An aggregate value, such as the mean, median, or mode
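
Both kinds of replacement can be combined in one call, because `fillna()` accepts a dict that maps column names to fill values. A minimal sketch on a hypothetical stand-in frame (`demo`); the label `"unknown"` is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"col_a": [1.0, np.nan, 5.0],
                     "col_c": ["a", np.nan, "r"]})

# Aggregate value for the numeric column, special value for the text column.
filled = demo.fillna({"col_a": demo["col_a"].mean(), "col_c": "unknown"})

print(filled)
```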

#### Replacing with a scalar

In [24]:
df5 = df3.fillna(0)
df5

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,0,True
2,4.0,0.0,x,0
3,1.0,9.0,y,0
4,0.0,0.0,0,False
5,0.0,5.0,0,True
6,5.0,8.0,r,False


In [25]:
df3

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


In [26]:
df3.iloc[:, 0] = df3.iloc[:, 0].fillna(df3.iloc[:, 0].mean())
df3

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,2.6,,,False
5,2.6,5.0,,True
6,5.0,8.0,r,False


In [27]:
dfmean = df3.iloc[:, 0].mean()
dfmean

2.6

In [28]:
type(dfmean)

float

In [34]:
dfx = new_df.replace({"?": np.nan, "--": np.nan})
dfx

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


In [30]:
dfx['col_a'] = dfx['col_a'].fillna(dfx['col_a'].mode()[0])
dfx

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,1.0,,,False
5,1.0,5.0,,True
6,5.0,8.0,r,False


In [35]:
dfmode = dfx['col_a'].mode()
dfmode

0    1.0
dtype: float64

In [36]:
type(dfmode)

pandas.core.series.Series
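
Unlike `mean()` and `median()`, `mode()` returns a Series because several values can be tied for most frequent. That is why the fill above selects element `[0]`. A short sketch with a hypothetical tied Series (`tied`):

```python
import pandas as pd

tied = pd.Series([1.0, 1.0, 2.0, 2.0, 5.0])

# Both 1.0 and 2.0 occur twice, so mode() returns both, in sorted order.
modes = tied.mode()
print(modes.tolist())

# Picking the first entry gives a single scalar usable as a fill value.
# Equivalent to modes[0] here because the result has a default 0..n-1 index.
first_mode = modes.iloc[0]
print(first_mode)
```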

In [30]:
df7 = new_df.replace({"?": np.nan, "--": np.nan})
df7

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


In [31]:
df7.iloc[:, 0] = df7.iloc[:, 0].fillna(df7.iloc[:, 0].median())
df7

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,2.0,,,False
5,2.0,5.0,,True
6,5.0,8.0,r,False


In [32]:
df7.iloc[:, 0].median()

2.0

In [33]:
type(df7.iloc[:, 0].median())

float

In [34]:
df8 = new_df.replace({"?": np.nan, "--": np.nan})
df8

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


Use `ffill` (forward fill) to propagate the last observed value forward into each gap:

In [35]:
# Same result as fillna(method='ffill'), which is deprecated in recent pandas versions.
df8.ffill(inplace=True)

In [36]:
df8

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,a,True
2,4.0,7.0,x,True
3,1.0,9.0,y,True
4,1.0,9.0,y,False
5,1.0,5.0,y,True
6,5.0,8.0,r,False


In [37]:
df9 = new_df.replace({"?": np.nan, "--": np.nan})
df9

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,,True
2,4.0,,x,
3,1.0,9.0,y,
4,,,,False
5,,5.0,,True
6,5.0,8.0,r,False


In [38]:
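# Backward fill: each gap takes the next observed value.
# Same result as fillna(method='bfill'), which is deprecated in recent pandas versions.
df9.bfill(inplace=True)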

In [39]:
df9

Unnamed: 0,col_a,col_b,col_c,col_d
0,1.0,3.0,a,True
1,2.0,7.0,x,True
2,4.0,9.0,x,False
3,1.0,9.0,y,False
4,5.0,5.0,r,False
5,5.0,5.0,r,True
6,5.0,8.0,r,False
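
Forward and backward fill cannot fill every gap: a forward fill leaves leading missing values untouched, and a backward fill leaves trailing ones. The sketch below shows this on a hypothetical Series (`s`) using the `ffill()` and `bfill()` methods, which are the direct equivalents of `fillna(method=...)`, and also shows the `limit` parameter, which caps how many consecutive gaps are filled.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()          # the leading NaN has nothing before it
backward = s.bfill()         # the trailing NaN has nothing after it
limited = s.ffill(limit=1)   # fill at most one consecutive gap

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```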


## Exercise 7

1. Find how many missing values there are in each column of the Titanic data.

2. Replace the missing values with the following values:\
    - Embarked: 'S'\
    - Age: the mean\
    - Cabin: the mode

In [1]:
import pandas as pd

In [2]:
titanic = pd.read_csv('titanic.csv')

In [3]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [40]:
titanic.isnull().any()

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool

In [41]:
titanic.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [42]:
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].mean())

In [43]:
titanic.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [44]:
titanic['Embarked'] = titanic['Embarked'].fillna('S')

In [45]:
titanic.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64

In [46]:
# mode() can return several tied values; [0] takes the first one in sorted order.
titanic['Cabin'] = titanic['Cabin'].fillna(titanic['Cabin'].mode()[0])

In [47]:
titanic.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

In [8]:
titanic['Cabin'].mode()

0        B96 B98
1    C23 C25 C27
2             G6
dtype: object

In [7]:
titanic['Cabin'].value_counts()

C23 C25 C27    4
B96 B98        4
G6             4
F33            3
D              3
              ..
D28            1
B78            1
D56            1
B41            1
C91            1
Name: Cabin, Length: 147, dtype: int64

In [10]:
titanic['Age'].unique()

array([22.  , 38.  , 26.  , 35.  ,   nan, 54.  ,  2.  , 27.  , 14.  ,
        4.  , 58.  , 20.  , 39.  , 55.  , 31.  , 34.  , 15.  , 28.  ,
        8.  , 19.  , 40.  , 66.  , 42.  , 21.  , 18.  ,  3.  ,  7.  ,
       49.  , 29.  , 65.  , 28.5 ,  5.  , 11.  , 45.  , 17.  , 32.  ,
       16.  , 25.  ,  0.83, 30.  , 33.  , 23.  , 24.  , 46.  , 59.  ,
       71.  , 37.  , 47.  , 14.5 , 70.5 , 32.5 , 12.  ,  9.  , 36.5 ,
       51.  , 55.5 , 40.5 , 44.  ,  1.  , 61.  , 56.  , 50.  , 36.  ,
       45.5 , 20.5 , 62.  , 41.  , 52.  , 63.  , 23.5 ,  0.92, 43.  ,
       60.  , 10.  , 64.  , 13.  , 48.  ,  0.75, 53.  , 57.  , 80.  ,
       70.  , 24.5 ,  6.  ,  0.67, 30.5 ,  0.42, 34.5 , 74.  ])

In [11]:
titanic['Age'].value_counts()

24.00    30
22.00    27
18.00    26
19.00    25
30.00    25
         ..
55.50     1
70.50     1
66.00     1
23.50     1
0.42      1
Name: Age, Length: 88, dtype: int64

In [12]:
titanic['Age'].value_counts()[70]

2