# Week 11
# Data Cleaning and Preparation

During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst's time. This week, let's study tools for missing data, duplicate data, string manipulation, and some other analytical data transformations.

Reading:
- Textbook, Chapter 7

## I. Handling Missing Values

For various reasons, many real-world datasets contain missing values, often encoded as blanks, NaNs, or other placeholders. Such datasets are usually incompatible with the operations we want to apply to them during the analysis.

In this section, we will discuss several common approaches for handling missing values:
- Discard incomplete records
- Mean/median imputation
- Hot-deck imputation
- Missing value indicator
- Advanced imputation methods

**An Example Data Set**

The [Pima Indians Diabetes Dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database) involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. The variable names are as follows:

0. Number of times pregnant.
1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
2. Diastolic blood pressure (mm Hg).
3. Triceps skinfold thickness (mm).
4. 2-Hour serum insulin (mu U/ml).
5. Body mass index (weight in kg/(height in m)^2).
6. Diabetes pedigree function.
7. Age (years).
8. Class variable (0 or 1).

In [1]:
# Reference
# https://machinelearningmastery.com/handle-missing-data-python/
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Load the data set as a data frame named "data"

# 1. unzip the file `archive.zip`
import zipfile

with zipfile.ZipFile('C:/Users/lzhao/Downloads/archive.zip', 'r') as f:
    f.printdir() # display the contents
    f.extractall("Data/diabetes/") # extract files from the zip file
    
# 2. Load the csv file as a data frame
data = pd.read_csv('Data/diabetes/diabetes.csv', sep=',')
data.head(20)

File Name                                             Modified             Size
diabetes.csv                                   2019-09-19 22:44:06        23873


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1


To save time, we will skip some routine steps such as checking the data types or the distributions.

In [None]:
# Show value counts of the outcomes

data['Outcome'].value_counts()

In [3]:
data.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In the following columns, a value of zero indicates a missing value:

- Plasma glucose concentration
- Diastolic blood pressure
- Triceps skinfold thickness
- 2-Hour serum insulin
- Body mass index
- Diabetes pedigree function
- Age

In [4]:
# We should mark missing values with np.nan, so that these values are
# correctly ignored by operations such as sum, count, min, etc.
cols = list(data.columns)
cols.remove(cols[0]) # remove the Pregnancies column
cols.remove(cols[-1]) # remove the Outcome column
print(cols)

# Approach 1: use a double for loop to go through all cells one by one
for col in cols:
    for idx in data.index:
        if data.loc[idx, col] == 0:
            data.loc[idx, col] = np.nan
            
# Approach 2: use a single loop with a boolean mask per column.
# Note: chained indexing such as data[mask][col] works on a copy, and
# `==` only compares; use .loc with `=` to assign in place.
for col in cols:
    data.loc[data[col] == 0, col] = np.nan
data.head()

['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,,33.6,0.627,50,1
1,1,85.0,66.0,29.0,,26.6,0.351,31,0
2,8,183.0,64.0,,,23.3,0.672,32,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
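
The same zero-to-NaN conversion can also be written without explicit loops. The sketch below is only an illustration on a small made-up frame: `toy` and `zero_means_missing` are hypothetical names, and the values are not taken from the diabetes file.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the diabetes data (values are made up)
toy = pd.DataFrame({
    "Pregnancies": [6, 0, 8],
    "Glucose": [148, 0, 183],
    "Insulin": [0, 94, 0],
    "Outcome": [1, 0, 1],
})

# Columns in which zero is a placeholder rather than a real measurement
zero_means_missing = ["Glucose", "Insulin"]

# replace() returns a new frame; assign it back to the selected columns only,
# so legitimate zeros in Pregnancies and Outcome are left untouched
toy[zero_means_missing] = toy[zero_means_missing].replace(0, np.nan)

print(toy.isnull().sum())
```

Restricting the replacement to an explicit column list matters here, because a zero in `Pregnancies` or `Outcome` is a valid value.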


In [5]:
# How many missing values are there for each feature?
data.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

We can see that Glucose, BloodPressure, and BMI have just a few missing values (5, 35, and 11 out of 768 rows), while SkinThickness is missing in about 30% of the rows (227) and Insulin in nearly half of them (374).
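
Raw counts are easier to judge as proportions. A minimal sketch on a made-up frame (the name `toy` and its values are assumptions for illustration) that reports the share of missing values per column:

```python
import numpy as np
import pandas as pd

# Small made-up frame with some missing entries
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0],
    "Insulin": [np.nan, np.nan, 94.0, 168.0],
    "Age": [50, 31, 32, 21],
})

# isnull() gives True/False per cell; the mean of booleans is the share of True
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the diabetes data frame, the same expression would show at a glance which columns are too sparse to keep.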

## Approach 1: Discard Rows/Columns with Missing Values

The simplest strategy for handling missing data is to discard rows/columns that contain a missing value.

In [None]:
# Pandas provides the dropna() function that can be used to drop either columns or rows
# with missing data.
data1 = data.dropna()
data1.head()

In [None]:
data1.shape # the size of the dataset shrank significantly

In [None]:
# Change the axis parameter to drop columns containing missing values
data2 = data.dropna(axis=1)
data2.head(10) # too many useful features are removed

Removing rows with missing values may significantly reduce the number of rows, and thus hurt the quality of the dataset. This approach is only recommended if the number of missing values is small.
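
A middle ground between keeping everything and dropping every incomplete row is the `thresh` parameter of `dropna()`, which keeps a row only if it has at least that many non-missing values. The sketch below uses a made-up frame (`toy`, `strict`, and `lenient` are illustrative names):

```python
import numpy as np
import pandas as pd

# Made-up frame: row 0 is complete, rows 1 and 2 miss one value, row 3 misses all
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, np.nan],
    "BloodPressure": [72.0, 66.0, np.nan, np.nan],
    "BMI": [33.6, 26.6, 23.3, np.nan],
})

# Default behaviour: drop every row that has at least one missing value
strict = toy.dropna()

# Keep rows that have at least 2 non-missing values
lenient = toy.dropna(thresh=2)

print(strict.shape, lenient.shape)
```

The right threshold is a judgment call that depends on how many features a record needs to remain useful.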

In [None]:
# Drop the skinthickness column and the insulin column, because there are too many values missing
data4 = data.drop(columns=['SkinThickness', 'Insulin'])
data4.head()

In [None]:
# Find which rows have glucose, blood pressure, or BMI missing. Drop these rows.
data4.dropna(axis=0, subset=['Glucose', 'BloodPressure', 'BMI'], inplace=True) 
# inplace=True means that we modify the existing data frame
data4.isnull().sum()

## Approach 2: Replace Missing Values with Mean or Median

The mean and median represent the "average" value of the column, and thus can be a reasonable guess for the missing values.

In [None]:
# Pandas provides fillna() function for replacing missing values with a 
# specific value.

# fill the insulin column with the mean value
data3 = data.copy() # the original data frame will not be affected
mean = data3['Insulin'].mean()
print(mean)
data3['Insulin'] = data3['Insulin'].fillna(mean) # assign back instead of chained inplace
data3.head(10)

In [6]:
# Perform mean imputation for all columns
data4 = data.copy()
data4.fillna(data4.mean(), inplace=True)
data4.head(20)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,155.548223,33.6,0.627,50,1
1,1,85.0,66.0,29.0,155.548223,26.6,0.351,31,0
2,8,183.0,64.0,29.15342,155.548223,23.3,0.672,32,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
5,5,116.0,74.0,29.15342,155.548223,25.6,0.201,30,0
6,3,78.0,50.0,32.0,88.0,31.0,0.248,26,1
7,10,115.0,72.405184,29.15342,155.548223,35.3,0.134,29,0
8,2,197.0,70.0,45.0,543.0,30.5,0.158,53,1
9,8,125.0,96.0,29.15342,155.548223,32.457464,0.232,54,1


In [None]:
data4.mean()

In [None]:
data4.isnull().sum()

In [None]:
data.median()

**Discussion:** 
1. When is the median preferred over the mean?

For skewed features, the median is a better indicator of the center. When there are a few extremely large values, the mean tends to be significantly larger than a typical value from the majority. Examples: income, grades, age.

2. What are the limitations of mean/median imputation?

    1. Imputation introduces "fake" values into the dataset, which might not be appropriate.
    2. Filling every gap with the same central value pulls the column towards its center and reduces the variance.
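
Both discussion points can be checked on tiny made-up series. In the sketch below, `incomes` and `values` are hypothetical data chosen so that the arithmetic is easy to follow by hand:

```python
import numpy as np
import pandas as pd

# Point 1: one extreme value drags the mean far from the typical value
incomes = pd.Series([30, 32, 35, 38, 40, 1000], dtype=float)
print("mean:", incomes.mean(), "median:", incomes.median())

# Point 2: mean imputation adds points with zero deviation from the mean,
# so the sum of squared deviations stays the same while the count grows
values = pd.Series([10.0, 20.0, np.nan, 40.0, np.nan, 50.0])
filled = values.fillna(values.mean())

std_before = values.std()   # computed on the 4 observed values only
std_after = filled.std()    # computed on all 6 values after imputation
print("std before:", std_before, "std after:", std_after)
```

The standard deviation drops after imputation even though no observed value changed, which is the same effect visible when comparing `data.std()` with the imputed frame.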

In [7]:
# The standard deviations of the raw dataset
data.std()

Pregnancies                   3.369578
Glucose                      30.535641
BloodPressure                12.382158
SkinThickness                10.476982
Insulin                     118.775855
BMI                           6.924988
DiabetesPedigreeFunction      0.331329
Age                          11.760232
Outcome                       0.476951
dtype: float64

In [8]:
# the standard deviations of the imputed dataset
data4.std()

Pregnancies                  3.369578
Glucose                     30.435949
BloodPressure               12.096346
SkinThickness                8.790942
Insulin                     85.021108
BMI                          6.875151
DiabetesPedigreeFunction     0.331329
Age                         11.760232
Outcome                      0.476951
dtype: float64

## Approach 3: Hot Deck Imputation
**Hot deck imputation** is a method for handling missing data by replacing each missing value with a randomly chosen observed value. Because the replacements are drawn from the observed distribution, this method tends to preserve the variance of the dataset.

In [None]:
# Use np.random.choice() to randomly select a value from a list
temp = [1, 2, 3, 4, 5]
np.random.choice(temp)

In [9]:
# Write a function that implements hot deck imputation for a column, and then
# use apply() to apply this function to the data frame

def hot_deck(record, col_name, value_pool):
    """
    This function picks a random value from value_pool and puts it in record[col_name].
    """
    # 1. Skip if the col_name value is not missing in record
    if not np.isnan(record[col_name]):
        return record
    
    # 2. pick a random value from value_pool
    val = np.random.choice(value_pool)
    
    # 3. put this random value to record[col_name]
    record = record.copy() # make a copy of the original record
    record[col_name] = val
    
    return record

In [10]:
record = data.loc[0, :] # The insulin value is missing in this record
print(record)

Pregnancies                   6.000
Glucose                     148.000
BloodPressure                72.000
SkinThickness                35.000
Insulin                         NaN
BMI                          33.600
DiabetesPedigreeFunction      0.627
Age                          50.000
Outcome                       1.000
Name: 0, dtype: float64


In [11]:
# data['Insulin'].dropna() # all the non-missing values from Insulin column

In [12]:
record = hot_deck(data.loc[0, :], "Insulin", data['Insulin'].dropna())
print(record)

Pregnancies                   6.000
Glucose                     148.000
BloodPressure                72.000
SkinThickness                35.000
Insulin                      50.000
BMI                          33.600
DiabetesPedigreeFunction      0.627
Age                          50.000
Outcome                       1.000
Name: 0, dtype: float64


In [13]:
# Apply hot deck imputation to the Insulin column, then check the remaining missing values.
data5 = data.apply(hot_deck, args=("Insulin", data['Insulin'].dropna()), axis=1)
data5.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                       0
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

In [14]:
# Apply the method to all columns
data5 = data.copy()
for col in data5.columns:
    data5 = data5.apply(hot_deck, args=(col, data[col].dropna()), axis=1)
data5.head(10)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.0,148.0,72.0,35.0,75.0,33.6,0.627,50.0,1.0
1,1.0,85.0,66.0,29.0,64.0,26.6,0.351,31.0,0.0
2,8.0,183.0,64.0,23.0,58.0,23.3,0.672,32.0,1.0
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21.0,0.0
4,0.0,137.0,40.0,35.0,168.0,43.1,2.288,33.0,1.0
5,5.0,116.0,74.0,28.0,87.0,25.6,0.201,30.0,0.0
6,3.0,78.0,50.0,32.0,88.0,31.0,0.248,26.0,1.0
7,10.0,115.0,58.0,29.0,90.0,35.3,0.134,29.0,0.0
8,2.0,197.0,70.0,45.0,543.0,30.5,0.158,53.0,1.0
9,8.0,125.0,96.0,23.0,120.0,28.0,0.232,54.0,1.0


In [15]:
data5.std()

Pregnancies                   3.369578
Glucose                      30.568896
BloodPressure                12.314621
SkinThickness                10.598657
Insulin                     118.397546
BMI                           6.960495
DiabetesPedigreeFunction      0.331329
Age                          11.760232
Outcome                       0.476951
dtype: float64

In [16]:
data.std()

Pregnancies                   3.369578
Glucose                      30.535641
BloodPressure                12.382158
SkinThickness                10.476982
Insulin                     118.775855
BMI                           6.924988
DiabetesPedigreeFunction      0.331329
Age                          11.760232
Outcome                       0.476951
dtype: float64
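
Applying a function row by row is easy to read but slow on larger frames. As a rough sketch of a column-wise alternative (the helper `hot_deck_column` and the sample values are assumptions for illustration, not part of the notebook), all missing entries of a column can be filled in one draw:

```python
import numpy as np
import pandas as pd

def hot_deck_column(series, rng):
    """Return a copy of `series` with each NaN replaced by a random observed value."""
    observed = series.dropna().to_numpy()   # pool of donor values
    missing = series.isnull()               # boolean mask of cells to fill
    filled = series.copy()
    # Draw one donor value (with replacement) for every missing cell
    filled[missing] = rng.choice(observed, size=int(missing.sum()))
    return filled

rng = np.random.default_rng(0)  # seeded generator so the result is repeatable
insulin = pd.Series([np.nan, 94.0, 168.0, np.nan, 88.0, 543.0, np.nan])
filled = hot_deck_column(insulin, rng)
print(filled)
```

This sketch assumes the column has at least one observed value; an entirely empty column would need separate handling.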

**Advanced Usage:**

Approaches 2 and 3 can be refined by drawing the replacement value only from the group that each instance belongs to (for example, people of the same age).

In [None]:
# Replace the missing Glucose values using the average value from people
# of the same age.
data = data.copy() # work on a copy of the data frame that has NaN markers
index = pd.isnull(data['Glucose'])
data[index]

In [None]:
# find the mean glucose for all the people with age 41
data[data['Age'] == 41]['Glucose'].mean()

In [None]:
data['Glucose'].mean()
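
Instead of looking up one age at a time, the group means can be computed for every row at once with `groupby(...).transform("mean")`. The sketch below uses a made-up frame (`toy`, `group_mean`, and `GlucoseFilled` are illustrative names) and falls back to the overall mean when an age group has no observed value:

```python
import numpy as np
import pandas as pd

# Made-up data: the only person aged 60 has no observed glucose value
toy = pd.DataFrame({
    "Age":     [22, 22, 22, 41, 41, 60],
    "Glucose": [100.0, 120.0, np.nan, 150.0, np.nan, np.nan],
})

# Mean glucose of each row's own age group, aligned with the original index
group_mean = toy.groupby("Age")["Glucose"].transform("mean")

# First try the group mean, then fall back to the overall mean
overall_mean = toy["Glucose"].mean()
toy["GlucoseFilled"] = toy["Glucose"].fillna(group_mean).fillna(overall_mean)

print(toy)
```

With very small groups the group mean is itself noisy, so binning ages into wider ranges may be a more stable choice.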

In [None]:
# Replace the missing BloodPressure value using a random value from people 
# of the same age.



## Approach 4: Add missing value indicator

Sometimes the values are **not missing at random**, meaning that one cannot simply predict the missing values using existing values. If this is likely the case, then a safe approach is to add an indicator feature of whether the corresponding value is missing.

In [18]:
# Create a 0/1 indicator of whether the insulin value is missing (1) or not (0)
data = data.copy()
data['InsulinMissing'] = data['Insulin'].isnull().astype(int)
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,InsulinMissing
0,6,148.0,72.0,35.0,,33.6,0.627,50,1,1
1,1,85.0,66.0,29.0,,26.6,0.351,31,0,1
2,8,183.0,64.0,,,23.3,0.672,32,1,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1,0
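
The same idea extends to every column that has gaps. A minimal sketch on a made-up frame (the `Missing` suffix follows the `InsulinMissing` naming above; `toy`, `indicators`, and `toy_flagged` are illustrative names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0],
    "Insulin": [np.nan, np.nan, 94.0],
    "Age": [50, 31, 32],
})

# Only columns that actually contain missing values get an indicator
cols_with_missing = [c for c in toy.columns if toy[c].isnull().any()]

# 1 where the value is missing, 0 otherwise; column names get a "Missing" suffix
indicators = toy[cols_with_missing].isnull().astype(int).add_suffix("Missing")

toy_flagged = pd.concat([toy, indicators], axis=1)
print(toy_flagged)
```

The indicators should be created before any imputation step, since afterwards the information about which cells were missing is gone.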


## Approach 5: Use Predictive Machine Learning Model

Reference:
- [MICEFOREST](https://github.com/AnotherSamWilson/miceforest)

In [19]:
!pip install miceforest
# Try the command below if the first one does not work:
# !pip install git+https://github.com/AnotherSamWilson/miceforest.git



In [20]:
import miceforest as mf
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

# Load data and introduce missing values
iris = pd.concat(load_iris(as_frame=True,return_X_y=True),axis=1)
iris['target'] = iris['target'].astype('category')
iris_amp = mf.ampute_data(iris,perc=0.25,random_state=1991)

In [21]:
iris_amp

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,,3.5,1.4,0.2,0
1,,3.0,1.4,0.2,0
2,4.7,3.2,1.3,,0
3,4.6,3.1,1.5,0.2,
4,5.0,,1.4,0.2,0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,
146,,2.5,5.0,1.9,2
147,6.5,,5.2,2.0,2
148,,,,2.3,2


In [22]:
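# Create kernel. KernelDataSet comes from early miceforest releases;
# newer releases expose mf.ImputationKernel instead.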
kds = mf.KernelDataSet(
  iris_amp,
  save_all_iterations=True,
  random_state=1991
)

# Run the MICE algorithm for 3 iterations
kds.mice(3)

# Return the completed kernel data
completed_data = kds.complete_data()

In [23]:
completed_data

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.0,3.5,1.4,0.2,0
1,4.4,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.1,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.3,3.0,5.1,2.3,2


In [24]:
iris.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
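
Comparing the completed data with the original by eye only goes so far. One possible way to quantify the difference is the mean absolute error on the cells that were removed. The sketch below is an assumption-laden illustration: `imputation_mae` is a hypothetical helper, the frames are made up, and plain mean imputation stands in for the MICE output so that the example needs only pandas.

```python
import numpy as np
import pandas as pd

def imputation_mae(original, amputed, completed):
    """Mean absolute error per column, using only cells that were missing in `amputed`."""
    mask = amputed.isnull()                    # True where a value was removed
    abs_err = (completed - original).abs()     # error in every cell
    return abs_err.where(mask).mean()          # keep errors of imputed cells only

original = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                         "b": [10.0, 20.0, 30.0, 40.0]})

# Remove a few values by hand to mimic the amputation step
amputed = original.copy()
amputed.loc[1, "a"] = np.nan
amputed.loc[[0, 3], "b"] = np.nan

# Stand-in for an imputation result: fill with the column means
completed = amputed.fillna(amputed.mean())

mae = imputation_mae(original, amputed, completed)
print(mae)
```

This only applies to numeric columns; for a categorical column such as `target`, the share of correctly recovered labels would be the analogous measure.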
