# Week 11
# Data Cleaning and Preparation

During the course of data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst's time. This week, we study tools for handling missing data, duplicate data, string manipulation, and some other analytical data transformations.

Reading:
- Textbook, Chapter 7

## I. Handling Missing Values

For various reasons, many real-world datasets contain missing values, often encoded as blanks, NaNs, or other placeholders. Such datasets are usually incompatible with the operations we want to apply to them during the analysis.

In this section, we will discuss several common approaches for handling missing values:
- Discard incomplete records
- Mean/median imputation
- Hot-deck imputation
- Missing value indicator
- Advanced imputation methods

**An Example Data Set**

The [Pima Indians Diabetes Dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database) involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. The variable names are as follows:

0. Number of times pregnant.
1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
2. Diastolic blood pressure (mm Hg).
3. Triceps skinfold thickness (mm).
4. 2-Hour serum insulin (mu U/ml).
5. Body mass index (weight in kg/(height in m)^2).
6. Diabetes pedigree function.
7. Age (years).
8. Class variable (0 or 1).

In [1]:
# Reference
# https://machinelearningmastery.com/handle-missing-data-python/
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Load the data set as a data frame named "data"
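
One possible way to fill in this cell is sketched below. It assumes the Kaggle CSV has been downloaded locally; the filename `diabetes.csv` is an assumption, and the inlined rows are only illustrative data laid out like that file so the sketch runs without a download.

```python
import io
import pandas as pd

# Illustrative rows in the layout of the Kaggle file, inlined so the sketch
# runs on its own. In the notebook you would pass the file path instead,
# e.g. pd.read_csv("diabetes.csv")  (assumed local filename).
sample_csv = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
    "1,89,66,23,94,28.1,0.167,21,0\n"
)

data = pd.read_csv(sample_csv)
print(data.shape)
print(data["Outcome"].value_counts())  # class balance of the outcome column
```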



To save time, we will skip some routine steps such as checking the data types or the distributions.

In [None]:
# Show value counts of the outcomes



In the following columns, a value of zero indicates a missing value:

- Plasma glucose concentration
- Diastolic blood pressure
- Triceps skinfold thickness
- 2-Hour serum insulin
- Body mass index

In [None]:
# We should mark missing values with np.nan, so that these values are
# correctly ignored by operations such as sum, count, min, etc.
cols = list(data.columns)
cols.remove(cols[0]) # remove the pregnancies column
cols.remove(cols[-1]) # remove the outcome column
print(cols)

# Replace zeros with NaN in the selected columns (vectorized, no Python loop)
data[cols] = data[cols].replace(0, np.nan)

# Keep a copy with missing values marked; later cells restart from raw_data
raw_data = data.copy()

In [None]:
# How many missing values are there for each feature?
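
A minimal sketch of how the per-column counts can be obtained, shown on a small made-up frame (`toy` and its values are hypothetical) rather than on the course dataset:

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with NaN marking the missing entries
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0],
    "Insulin": [np.nan, np.nan, np.nan, 94.0],
    "BMI": [33.6, 26.6, 23.3, 28.1],
})

missing_count = toy.isnull().sum()    # number of NaNs per column
missing_share = toy.isnull().mean()   # fraction of rows missing per column
print(pd.DataFrame({"count": missing_count, "share": missing_share}))
```

Looking at the share as well as the count makes it easier to judge whether dropping rows is affordable.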



We can see that Glucose, BloodPressure, and BMI have just a few missing values, while SkinThickness and Insulin are missing in a much larger share of the rows, nearly half in the case of Insulin.

## Approach 1: Discard Rows/Columns with Missing Values

The simplest strategy for handling missing data is to discard rows/columns that contain a missing value.

In [None]:
# Pandas provides the dropna() function that can be used to drop either columns or rows
# with missing data.
data1 = data.dropna()
data1.head()

In [None]:
data1.shape # the dataset shrank significantly

In [None]:
# Change the axis parameter to drop columns containing missing values
data2 = data.dropna(axis=1)
data2.head(10) # too many useful features are removed

Removing rows with missing values may significantly reduce the number of rows and thus hurt the quality of the dataset. This approach is only recommended if the number of missing values is small.
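
When dropping every incomplete row is too costly, `dropna()` can be made less aggressive with its `subset` and `thresh` parameters. The sketch below uses a made-up frame (`toy` is hypothetical) to compare how many rows survive under each setting.

```python
import numpy as np
import pandas as pd

# Made-up values; NaN marks a missing entry
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 89.0, 137.0],
    "Insulin": [np.nan, np.nan, np.nan, 94.0, 168.0],
    "BMI": [33.6, 26.6, 23.3, np.nan, 43.1],
})

all_complete = toy.dropna()                     # rows with no NaN at all
glucose_known = toy.dropna(subset=["Glucose"])  # only require Glucose to be present
mostly_complete = toy.dropna(thresh=2)          # keep rows with at least 2 non-NaN values

print(len(toy), len(all_complete), len(glucose_known), len(mostly_complete))
```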

## Approach 2: Replace Missing Values with Mean or Median

The mean and median represent the "average" value of a column, and thus can serve as a reasonable guess for the missing values.

In [None]:
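# Pandas provides the fillna() function for replacing missing values with a
# specific value.

# Fill the Insulin column with the mean value
data3 = data.copy() # work on a copy so that data itself is not affected
mean = data3['Insulin'].mean()
print(mean)
# Assign the result back; calling fillna(inplace=True) on a selected column
# may not update data3 in recent pandas versions
data3['Insulin'] = data3['Insulin'].fillna(mean)
data3.head(10)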

In [None]:
# Perform mean imputation for all columns
data4 = data.copy()
data4.fillna(data4.mean(), inplace=True)
data4.head(10)

In [None]:
data4.isnull().sum()

**Discussion:** 
1. When is the median preferred over the mean?

For skewed features, the median is a better indicator of the center. When there are a few extremely large values, the mean tends to be significantly larger than a typical value from the majority. Examples: income, grades, age.

2. What are the limitations of mean/median imputation?

    1. Imputation introduces "fake" values into the dataset, which might not be appropriate for every analysis.
    2. Filling every gap with the mean biases values toward the center and reduces the variance.
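
Both discussion points can be checked on a small example. The sketch below uses made-up incomes (the `income` series is hypothetical) with one extreme value and two gaps: the mean is pulled far from the typical value while the median is not, and filling the gaps with a constant shrinks the standard deviation.

```python
import numpy as np
import pandas as pd

# Made-up incomes (in thousands): one extreme value, two missing entries
income = pd.Series([30.0, 35.0, 40.0, 45.0, 1000.0, np.nan, np.nan])

print(income.mean())    # pulled up by the single extreme value
print(income.median())  # stays close to a typical value

mean_filled = income.fillna(income.mean())
median_filled = income.fillna(income.median())

# Imputing a constant shrinks the spread relative to the observed values
print(income.std(), mean_filled.std())
```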

In [None]:
# The standard deviations of the raw dataset
data.std()

In [None]:
# the standard deviations of the imputed dataset
data4.std()

## Approach 3: Hot Deck Imputation
**Hot deck imputation** handles missing data by replacing each missing value with a randomly selected observed value from the same column. Because the filled-in values follow the observed distribution, this method approximately preserves the variance of the dataset.

In [None]:
# Write a function that implements hot deck imputation for a column, and then
# use apply() to apply this function to the data frame
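
One way this exercise could be approached is sketched below on a made-up frame. The helper name `hot_deck`, the `toy` data, and the fixed random seed are all assumptions chosen for illustration, not a reference solution.

```python
import numpy as np
import pandas as pd

def hot_deck(column, rng):
    """Fill NaNs in a Series with values drawn at random from its observed entries."""
    missing = column.isnull()
    observed = column[~missing]
    if not missing.any() or observed.empty:
        return column.copy()          # nothing to fill, or nothing to draw from
    filled = column.copy()
    # Sample with replacement so any number of gaps can be filled
    filled[missing] = rng.choice(observed.to_numpy(), size=int(missing.sum()))
    return filled

# Made-up values; NaN marks a missing entry
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0, 137.0],
    "Insulin": [np.nan, np.nan, 94.0, 168.0, np.nan],
})

rng = np.random.default_rng(0)        # fixed seed so the result is repeatable
imputed = toy.apply(lambda col: hot_deck(col, rng))

print(imputed)
print(toy.std(), imputed.std())       # compare the spread before and after
```

Passing the generator in explicitly keeps the function free of global state, which makes the imputation reproducible.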




In [None]:
# Compare the standard deviation of imputed dataset and the original one.


**Advanced Usage:**

Approaches 2 and 3 can be made more specific by imputing from the group that each instance belongs to, such as people of the same age.

In [None]:
# Replace the missing Glucose values using the average value from people
# of the same age.
data = raw_data.copy()
index = pd.isnull(data['Glucose'])
data[index]

In [None]:
# find the mean glucose for all the people with age 41
data[data['Age'] == 41]['Glucose'].mean()

In [None]:
data['Glucose'].mean()
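
The per-age lookup above can be applied to every missing row at once with `groupby(...).transform("mean")`. This is a minimal sketch on made-up data; the `toy` frame and the `GlucoseFilled` column name are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up values; NaN marks a missing entry
toy = pd.DataFrame({
    "Age":     [22, 22, 22, 41, 41, 41],
    "Glucose": [90.0, 110.0, np.nan, 150.0, np.nan, 170.0],
})

# Mean glucose within each age group, broadcast back to the original rows
group_mean = toy.groupby("Age")["Glucose"].transform("mean")

# Use the group mean first, then fall back to the overall mean for ages
# that have no observed glucose value at all
toy["GlucoseFilled"] = toy["Glucose"].fillna(group_mean).fillna(toy["Glucose"].mean())
print(toy)
```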

In [None]:
# Replace the missing BloodPressure value using a random value from people 
# of the same age.
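
A possible sketch for this exercise, again on made-up data: each missing value is filled with a random observed value from the same age group. The helper `hot_deck_group`, the `toy` frame, and the seed are assumptions for illustration; groups with no observed value are left as NaN.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the result is repeatable

def hot_deck_group(values):
    """Fill NaNs in one group's values by sampling that group's observed values."""
    missing = values.isnull()
    observed = values[~missing]
    if observed.empty or not missing.any():
        return values               # nothing to draw from, or nothing to fill
    filled = values.copy()
    filled[missing] = rng.choice(observed.to_numpy(), size=int(missing.sum()))
    return filled

# Made-up values; NaN marks a missing entry
toy = pd.DataFrame({
    "Age": [22, 22, 22, 41, 41, 60],
    "BloodPressure": [60.0, 70.0, np.nan, 80.0, np.nan, np.nan],
})

toy["BloodPressureFilled"] = (
    toy.groupby("Age")["BloodPressure"].transform(hot_deck_group)
)
print(toy)
```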



## Approach 4: Add Missing Value Indicator

Sometimes the values are **not missing at random**, meaning that one cannot simply predict the missing values using existing values. If this is likely the case, then a safe approach is to add an indicator feature of whether the corresponding value is missing.

In [None]:
# Create a 0/1 indicator of whether the insulin value is missing
data = raw_data.copy()
data['InsulinMissing'] = data['Insulin'].isnull().astype(int)
data.head()
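
The same idea extends to several columns, and it combines naturally with imputation: record the missingness first, then fill the gaps. The sketch below uses a made-up frame; `toy`, `prepared`, and the `Missing` suffix are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up values; NaN marks a missing entry
toy = pd.DataFrame({
    "Insulin": [np.nan, 94.0, np.nan, 168.0],
    "BMI": [33.6, np.nan, 23.3, 43.1],
    "Age": [50, 31, 32, 33],
})

# One 0/1 indicator per column that actually has missing values
cols_with_gaps = [c for c in toy.columns if toy[c].isnull().any()]
indicators = toy[cols_with_gaps].isnull().astype(int).add_suffix("Missing")

# Build the indicators before imputing, so the information is not lost
prepared = pd.concat([toy.fillna(toy.median()), indicators], axis=1)
print(prepared)
```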

## Approach 5: Use Predictive Machine Learning Model

Reference:
- [MICEFOREST](https://github.com/AnotherSamWilson/miceforest)
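
The basic idea behind model-based imputation is to treat the column with gaps as a prediction target and the other columns as inputs. The sketch below is not miceforest: it is a deliberately simple stand-in that fits a single least-squares line with NumPy on made-up data (`toy`, `x`, and `y` are hypothetical) and uses it to predict the missing entries.

```python
import numpy as np
import pandas as pd

# Made-up data where y depends linearly on x; NaN marks a missing entry
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.0, 4.0, np.nan, 8.0, np.nan, 12.0],
})

observed = toy["y"].notnull()

# Fit y ~ slope * x + intercept on the rows where y is known
slope, intercept = np.polyfit(toy.loc[observed, "x"], toy.loc[observed, "y"], deg=1)

# Predict y for the rows where it is missing
toy.loc[~observed, "y"] = slope * toy.loc[~observed, "x"] + intercept
print(toy)
```

Unlike mean imputation, the filled-in values here depend on the other features of each row.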

In [3]:
!pip install miceforest
# Try the command below if the first one does not work:
# !pip install git+https://github.com/AnotherSamWilson/miceforest.git

Successfully installed miceforest-2.0.3 seaborn-0.11.0


In [6]:
import miceforest as mf
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

# Load data and introduce missing values
iris = pd.concat(load_iris(as_frame=True,return_X_y=True),axis=1)
iris['target'] = iris['target'].astype('category')
iris_amp = mf.ampute_data(iris,perc=0.25,random_state=1991)

In [23]:
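# Create the kernel. KernelDataSet is the class name in miceforest 2.x
# (2.0.3 was installed above); later releases may use a different API.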
kds = mf.KernelDataSet(
  iris_amp,
  save_all_iterations=True,
  random_state=1991
)

# Run the MICE algorithm for 3 iterations
kds.mice(3)

# Return the completed kernel data
completed_data = kds.complete_data()

In [24]:
completed_data

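| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.0 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.4 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.1 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.3 | 3.0 | 5.1 | 2.3 | 2 |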
