# Chapter 7 - Data Cleaning and Preparation

## 7.1 Handling Missing Data

In [1]:
import sys
import json

import pandas as pd
import numpy as np
from numpy import nan as NA

- Represent missing data with `np.nan`

- Detect missing values with `.isnull()` or `.notnull()`

- Remove missing values with `dropna()`

- Impute missing values with `fillna()`

<hr>

In [2]:
df = pd.read_csv('dataset-G-subsidies.csv', sep='|')
df

Unnamed: 0,Year Total,HDB 1- & 2- Room Flats,HDB 3-Room Flats,HDB 4-Room Flats,HDB 5-Room & Executive Flats,Condominiums & Other Apartments,Landed Properties
2014,3526,9315.0,3385,3402.0,3554,2177.0,2488
2015,4096,,3853,3906.0,4233,2887.0,3093
2016,4248,10062.0,4251,,4231,,2907
2017,4503,10424.0,4500,,4425,,3339
2018,4494,10347.0,4535,4397.0,4424,2957.0,3152


pandas uses `NaN` from the `numpy` library as a <i>sentinel value</i> to represent missing data. Use `Series.isnull()` to detect missing values; Python's `None` is also treated as null. `notnull()` returns the element-wise complement of `isnull()`.


In [16]:
display(df['HDB 1- & 2- Room Flats'])
display(df['HDB 1- & 2- Room Flats'].isnull())
display(df['HDB 1- & 2- Room Flats'].notnull())

2014     9315.0
2015        NaN
2016    10062.0
2017    10424.0
2018    10347.0
Name: HDB 1- & 2- Room Flats, dtype: float64

2014    False
2015     True
2016    False
2017    False
2018    False
Name: HDB 1- & 2- Room Flats, dtype: bool

2014     True
2015    False
2016     True
2017     True
2018     True
Name: HDB 1- & 2- Room Flats, dtype: bool

Assigning `None` to an element of a `Series` also marks it as missing: in a float `Series` the value is stored as `NaN`, and `Series.isnull()` returns `True` for it.

In [17]:
f_data = df['HDB 1- & 2- Room Flats'].copy()
f_data[2014] = None
display(f_data)
display(f_data.isnull())

2014        NaN
2015        NaN
2016    10062.0
2017    10424.0
2018    10347.0
Name: HDB 1- & 2- Room Flats, dtype: float64

2014     True
2015     True
2016    False
2017    False
2018    False
Name: HDB 1- & 2- Room Flats, dtype: bool

Use `Series.dropna()` to remove missing values. The index labels of the dropped entries are removed along with them.

In [18]:
df_condo = df['Condominiums & Other Apartments'].copy()
display(df_condo)
display(df_condo.dropna())

2014    2177.0
2015    2887.0
2016       NaN
2017       NaN
2018    2957.0
Name: Condominiums & Other Apartments, dtype: float64

2014    2177.0
2015    2887.0
2018    2957.0
Name: Condominiums & Other Apartments, dtype: float64

On a `DataFrame`, `df.dropna()` drops every row that contains <i>any</i> missing value: a row is dropped <u>as long as</u> it has <u>at least one missing value</u>. Pass `axis=1` to drop columns instead.

In [21]:
display(df)
# Use axis=0 (the default) to drop rows
display(df.dropna(axis=0)) 
# Use axis=1 to drop columns
display(df.dropna(axis=1)) 

Unnamed: 0,Year Total,HDB 1- & 2- Room Flats,HDB 3-Room Flats,HDB 4-Room Flats,HDB 5-Room & Executive Flats,Condominiums & Other Apartments,Landed Properties
2014,3526,9315.0,3385,3402.0,3554,2177.0,2488
2015,4096,,3853,3906.0,4233,2887.0,3093
2016,4248,10062.0,4251,,4231,,2907
2017,4503,10424.0,4500,,4425,,3339
2018,4494,10347.0,4535,4397.0,4424,2957.0,3152


Unnamed: 0,Year Total,HDB 1- & 2- Room Flats,HDB 3-Room Flats,HDB 4-Room Flats,HDB 5-Room & Executive Flats,Condominiums & Other Apartments,Landed Properties
2014,3526,9315.0,3385,3402.0,3554,2177.0,2488
2018,4494,10347.0,4535,4397.0,4424,2957.0,3152


Unnamed: 0,Year Total,HDB 3-Room Flats,HDB 5-Room & Executive Flats,Landed Properties
2014,3526,3385,3554,2488
2015,4096,3853,4233,3093
2016,4248,4251,4231,2907
2017,4503,4500,4425,3339
2018,4494,4535,4424,3152


In [7]:
# To drop rows based on a single column, filter with Series.notnull()
display(df[df['HDB 1- & 2- Room Flats'].notnull()])

Unnamed: 0,Year Total,HDB 1- & 2- Room Flats,HDB 3-Room Flats,HDB 4-Room Flats,HDB 5-Room & Executive Flats,Condominiums & Other Apartments,Landed Properties
2014,3526,9315.0,3385,3402.0,3554,2177.0,2488
2016,4248,10062.0,4251,,4231,,2907
2017,4503,10424.0,4500,,4425,,3339
2018,4494,10347.0,4535,4397.0,4424,2957.0,3152
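The filter above can also be written with the `subset` parameter of `dropna()`, and the `how` parameter controls whether one missing value is enough to drop a row. The sketch below compares the variants on a hypothetical frame `sample` built from values in the table above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df, using values from the table above.
sample = pd.DataFrame(
    {
        "HDB 1- & 2- Room Flats": [9315.0, np.nan, 10062.0, 10424.0, 10347.0],
        "HDB 4-Room Flats": [3402.0, 3906.0, np.nan, np.nan, 4397.0],
        "Condominiums & Other Apartments": [2177.0, 2887.0, np.nan, np.nan, 2957.0],
    },
    index=[2014, 2015, 2016, 2017, 2018],
)

col = "HDB 1- & 2- Room Flats"

# Boolean-mask filter, as in the cell above.
by_mask = sample[sample[col].notnull()]

# Same result: only look at `col` when deciding which rows to drop.
by_subset = sample.dropna(subset=[col])
print(by_mask.equals(by_subset))

# how='any' (the default) drops a row with at least one missing value.
kept_any = sample.dropna(how="any")

# how='all' drops a row only when every value in it is missing.
kept_all = sample.dropna(how="all")
print(list(kept_any.index), list(kept_all.index))
```

No row in this frame is entirely empty, so `how='all'` keeps all five rows while the default keeps only 2014 and 2018.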


In [22]:
df2 = pd.read_csv('dataset-G1-subsidies.csv', sep='|')
df2

Unnamed: 0,Year Total 3/,HDB 1- & 2- Room Flats 4/,HDB 3-Room Flats,HDB 4-Room Flats,HDB 5-Room & Executive Flats,Condominiums & Other Apartments,Landed Properties
2003,1589.0,3314.0,1386.0,1651.0,1623.0,,1407.0
2004,1674.0,,,1730.0,,1244.0,1513.0
2005,1855.0,3879.0,1565.0,1862.0,,1370.0,1800.0
2006,2280.0,4769.0,2221.0,2383.0,2251.0,1456.0,1788.0
2007,2008.0,3850.0,1729.0,,,1423.0,1748.0
2008,2865.0,5792.0,,2866.0,2889.0,2206.0,2635.0
2009,2708.0,5711.0,,2666.0,2770.0,2173.0,2416.0
2010,,6559.0,2376.0,2561.0,2696.0,1790.0,2135.0
2011,3273.0,,,,3401.0,2456.0,2812.0
2012,2870.0,7204.0,2503.0,2773.0,3049.0,1976.0,2235.0


To keep rows that have at least some filled values, use `thresh`. Its value is the **minimum** number of non-missing values a row must contain to be kept.

In [9]:
# Preserve all rows with at least 6 filled values.
df2.dropna(thresh=6, axis=0)

Unnamed: 0,Year Total 3/,HDB 1- & 2- Room Flats 4/,HDB 3-Room Flats,HDB 4-Room Flats,HDB 5-Room & Executive Flats,Condominiums & Other Apartments,Landed Properties
2003,1589.0,3314.0,1386.0,1651.0,1623.0,,1407.0
2005,1855.0,3879.0,1565.0,1862.0,,1370.0,1800.0
2006,2280.0,4769.0,2221.0,2383.0,2251.0,1456.0,1788.0
2008,2865.0,5792.0,,2866.0,2889.0,2206.0,2635.0
2009,2708.0,5711.0,,2666.0,2770.0,2173.0,2416.0
2010,,6559.0,2376.0,2561.0,2696.0,1790.0,2135.0
2012,2870.0,7204.0,2503.0,2773.0,3049.0,1976.0,2235.0
2014,3526.0,9315.0,3385.0,,3554.0,2177.0,2488.0
2015,4096.0,9474.0,3853.0,,4233.0,2887.0,3093.0
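To see how `thresh` counts, the sketch below takes three rows from the `df2` table above, counts their non-missing values, and compares two thresholds. The frame `rows` is a hypothetical stand-in for `df2`.

```python
import numpy as np
import pandas as pd

columns = [
    "Year Total 3/",
    "HDB 1- & 2- Room Flats 4/",
    "HDB 3-Room Flats",
    "HDB 4-Room Flats",
    "HDB 5-Room & Executive Flats",
    "Condominiums & Other Apartments",
    "Landed Properties",
]

# Hypothetical stand-in for df2: rows 2004, 2006 and 2007 from the table above.
rows = pd.DataFrame(
    [
        [1674.0, np.nan, np.nan, 1730.0, np.nan, 1244.0, 1513.0],
        [2280.0, 4769.0, 2221.0, 2383.0, 2251.0, 1456.0, 1788.0],
        [2008.0, 3850.0, 1729.0, np.nan, np.nan, 1423.0, 1748.0],
    ],
    index=[2004, 2006, 2007],
    columns=columns,
)

# Number of non-missing values in each row.
non_null = rows.notnull().sum(axis=1)
print(non_null)

# thresh=N keeps the rows whose non-missing count is at least N.
kept_5 = rows.dropna(thresh=5)
kept_6 = rows.dropna(thresh=6)
print(list(kept_5.index), list(kept_6.index))
```

Row 2007 has five non-missing values, so it survives `thresh=5` but not `thresh=6`, which matches its absence from the output above.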


Use `Series.fillna()` to replace missing values with a given value, such as the column mean.

In [11]:
display(df['HDB 1- & 2- Room Flats'])
display(df['HDB 1- & 2- Room Flats'].fillna(df['HDB 1- & 2- Room Flats'].mean()))

2014     9315.0
2015        NaN
2016    10062.0
2017    10424.0
2018    10347.0
Name: HDB 1- & 2- Room Flats, dtype: float64

2014     9315.0
2015    10037.0
2016    10062.0
2017    10424.0
2018    10347.0
Name: HDB 1- & 2- Room Flats, dtype: float64

`DataFrame.fillna()` replaces missing values across the whole `df`. To use a different fill value for each column, pass a `dict` that maps column names to fill values. Finally, use `method='ffill'` to carry the previous observed value in a column forward into the missing entries; recent pandas releases deprecate the `method` argument in favour of the equivalent `DataFrame.ffill()`.

In [12]:
display(df)
display(df.fillna(0))

Unnamed: 0,Year Total,HDB 1- & 2- Room Flats,HDB 3-Room Flats,HDB 4-Room Flats,HDB 5-Room & Executive Flats,Condominiums & Other Apartments,Landed Properties
2014,3526,9315.0,3385,3402.0,3554,2177.0,2488
2015,4096,,3853,3906.0,4233,2887.0,3093
2016,4248,10062.0,4251,,4231,,2907
2017,4503,10424.0,4500,,4425,,3339
2018,4494,10347.0,4535,4397.0,4424,2957.0,3152


Unnamed: 0,Year Total,HDB 1- & 2- Room Flats,HDB 3-Room Flats,HDB 4-Room Flats,HDB 5-Room & Executive Flats,Condominiums & Other Apartments,Landed Properties
2014,3526,9315.0,3385,3402.0,3554,2177.0,2488
2015,4096,0.0,3853,3906.0,4233,2887.0,3093
2016,4248,10062.0,4251,0.0,4231,0.0,2907
2017,4503,10424.0,4500,0.0,4425,0.0,3339
2018,4494,10347.0,4535,4397.0,4424,2957.0,3152


In [13]:
df.columns  # List the column names to use as dictionary keys

Index(['Year Total', 'HDB 1- & 2- Room Flats', 'HDB 3-Room Flats',
       'HDB 4-Room Flats', 'HDB 5-Room & Executive Flats',
       'Condominiums & Other Apartments', 'Landed Properties'],
      dtype='object')

In [14]:
# Use a dictionary to fill different values for different columns.
display(df.fillna({'HDB 4-Room Flats' : 4000.0, 'Condominiums & Other Apartments' : 2000.0}))

Unnamed: 0,Year Total,HDB 1- & 2- Room Flats,HDB 3-Room Flats,HDB 4-Room Flats,HDB 5-Room & Executive Flats,Condominiums & Other Apartments,Landed Properties
2014,3526,9315.0,3385,3402.0,3554,2177.0,2488
2015,4096,,3853,3906.0,4233,2887.0,3093
2016,4248,10062.0,4251,4000.0,4231,2000.0,2907
2017,4503,10424.0,4500,4000.0,4425,2000.0,3339
2018,4494,10347.0,4535,4397.0,4424,2957.0,3152
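Instead of hard-coded constants, the per-column fill values can be computed from the data. The sketch below passes a `Series` of column means to `fillna()`, which matches it to columns by label in the same way as the dictionary form. The frame `sample` is a hypothetical stand-in that reuses two columns from the table above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df, using two columns from the table above.
sample = pd.DataFrame(
    {
        "HDB 1- & 2- Room Flats": [9315.0, np.nan, 10062.0, 10424.0, 10347.0],
        "HDB 4-Room Flats": [3402.0, 3906.0, np.nan, np.nan, 4397.0],
    },
    index=[2014, 2015, 2016, 2017, 2018],
)

# mean() skips NaN by default and returns one value per column.
column_means = sample.mean()

# Each column is filled with its own mean.
filled = sample.fillna(column_means)
print(filled)
```

Mean imputation is only one option; whether it suits a given column depends on the data and is a judgment call.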


In [15]:
# Forward-fill: carry the last observed value down each column.
# Equivalent to df.fillna(method='ffill'), which recent pandas releases deprecate.
display(df.ffill())

Unnamed: 0,Year Total,HDB 1- & 2- Room Flats,HDB 3-Room Flats,HDB 4-Room Flats,HDB 5-Room & Executive Flats,Condominiums & Other Apartments,Landed Properties
2014,3526,9315.0,3385,3402.0,3554,2177.0,2488
2015,4096,9315.0,3853,3906.0,4233,2887.0,3093
2016,4248,10062.0,4251,3906.0,4231,2887.0,2907
2017,4503,10424.0,4500,3906.0,4425,2887.0,3339
2018,4494,10347.0,4535,4397.0,4424,2957.0,3152
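Forward filling can be capped with `limit`, and `bfill()` fills in the opposite direction. The sketch below compares the three on a hypothetical `Series` named `condo`, rebuilt from the condominium column above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['Condominiums & Other Apartments'].
condo = pd.Series(
    [2177.0, 2887.0, np.nan, np.nan, 2957.0],
    index=[2014, 2015, 2016, 2017, 2018],
)

# Carry the last observed value forward through every gap.
forward = condo.ffill()

# limit=1 fills at most one consecutive missing value, so 2017 stays NaN.
forward_limited = condo.ffill(limit=1)

# bfill() pulls the next observed value backward instead.
backward = condo.bfill()

print(pd.DataFrame({"ffill": forward, "ffill(limit=1)": forward_limited, "bfill": backward}))
```

A `limit` helps avoid stretching a single observation across a long run of missing years.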


**References:**

Python for Data Analysis, 2nd Edition, McKinney (2017)