# Chapter 7 Data Cleaning and Preparation

Data analysis and modeling involve a significant amount of data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst's time. In this chapter, we study tools for handling missing data, duplicate data, string manipulation, and some other analytical data transformations.

## I. An Example of Handling Missing Values

The [Pima Indians Diabetes Dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database) involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. The variable names are as follows:

0. Number of times pregnant.
1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
2. Diastolic blood pressure (mm Hg).
3. Triceps skinfold thickness (mm).
4. 2-Hour serum insulin (mu U/ml).
5. Body mass index (weight in kg/(height in m)^2).
6. Diabetes pedigree function.
7. Age (years).
8. Class variable (0 or 1).

In [1]:
# https://machinelearningmastery.com/handle-missing-data-python/
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data = pd.read_csv('Data/diabetes/diabetes.csv', delimiter=',')
data.head(10)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1


In [3]:
data['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

In [4]:
data.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [5]:
pd.isnull(data).sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In the following columns, a value of zero indicates a missing value:

- Plasma glucose concentration
- Diastolic blood pressure
- Triceps skinfold thickness
- 2-Hour serum insulin
- Body mass index

In [6]:
# Find how many missing values exist in each column
# Count how many zeros there are in each column
(data == 0).sum()

Pregnancies                 111
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                     500
dtype: int64

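Glucose, BloodPressure, and BMI have only a few zero values, while SkinThickness is zero in about 30% of the rows (227 of 768) and Insulin in nearly half (374 of 768). The zeros in Pregnancies and Outcome are valid values rather than missing data.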

In [7]:
# We should mark missing values with np.nan, so that these values can be
# correctly ignored from operations such as sum, count, min, etc.
cols = list(data.columns)
cols.remove(cols[0]) # remove the Pregnancies column (zero is a valid value)
cols.remove(cols[-1]) # remove the Outcome column (zero is a valid value)
print(cols)
for col in cols:
    for idx in data.index:
        if data.loc[idx, col] == 0:
            data.loc[idx, col] = np.nan

['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']


In [8]:
pd.isnull(data).sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

In [9]:
data = pd.read_csv('Data/diabetes/diabetes.csv', delimiter=',')
data.head(10)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1


In [10]:
# The second attempt: replace zeros one column at a time
for col in cols:
    index = (data[col] == 0)
    data.loc[index, col] = np.nan  # np.NaN was removed in NumPy 2.0
data.head(10)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,,33.6,0.627,50.0,1
1,1,85.0,66.0,29.0,,26.6,0.351,31.0,0
2,8,183.0,64.0,,,23.3,0.672,32.0,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21.0,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33.0,1
5,5,116.0,74.0,,,25.6,0.201,30.0,0
6,3,78.0,50.0,32.0,88.0,31.0,0.248,26.0,1
7,10,115.0,,,,35.3,0.134,29.0,0
8,2,197.0,70.0,45.0,543.0,30.5,0.158,53.0,1
9,8,125.0,96.0,,,,0.232,54.0,1


In [11]:
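# The third attempt: build a boolean mask for the whole data frame
data = pd.read_csv('Data/diabetes/diabetes.csv', delimiter=',')
index1 = (data == 0)
# Zeros are valid values in these two columns, so exclude them from the mask
index1['Pregnancies'] = False
index1['Outcome'] = False
data[index1] = np.nan
data.head(10)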

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,,33.6,0.627,50,1
1,1,85.0,66.0,29.0,,26.6,0.351,31,0
2,8,183.0,64.0,,,23.3,0.672,32,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
5,5,116.0,74.0,,,25.6,0.201,30,0
6,3,78.0,50.0,32.0,88.0,31.0,0.248,26,1
7,10,115.0,,,,35.3,0.134,29,0
8,2,197.0,70.0,45.0,543.0,30.5,0.158,53,1
9,8,125.0,96.0,,,,0.232,54,1


In [12]:
# Use isnull() to find the number of missing values
data.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
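
The same result can be obtained without loops or masks. The sketch below runs on a small stand-in frame built from the first five rows shown above, so it does not depend on the CSV file. It applies `DataFrame.replace` to the measurement columns only; the list name `zero_means_missing` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for the first five rows of the diabetes data (three columns only)
sample = pd.DataFrame({
    'Pregnancies': [6, 1, 8, 1, 0],
    'SkinThickness': [35, 29, 0, 23, 35],
    'Insulin': [0, 0, 0, 94, 168],
})

# Columns in which a zero cannot be a real measurement
zero_means_missing = ['SkinThickness', 'Insulin']

# Replace zeros with NaN in those columns only; Pregnancies keeps its zero
sample[zero_means_missing] = sample[zero_means_missing].replace(0, np.nan)
print(sample.isnull().sum())
```

Listing the affected columns explicitly avoids turning valid zeros (such as zero pregnancies) into missing values.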

## Approach 1: Remove Rows/Columns with Missing values

The simplest strategy for handling missing data is to remove rows/columns that contain a missing value.

In [13]:
# Pandas provides the dropna() function, which can be used to drop either
# rows or columns with missing data. By default it drops rows.
data1 = data.dropna()
data1.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
6,3,78.0,50.0,32.0,88.0,31.0,0.248,26,1
8,2,197.0,70.0,45.0,543.0,30.5,0.158,53,1
13,1,189.0,60.0,23.0,846.0,30.1,0.398,59,1


In [14]:
data1.shape # the dataset shrank significantly (from 768 rows)

(392, 9)

In [15]:
# Change the axis parameter to drop columns containing missing values
data2 = data.dropna(axis=1)
data2.head(10) # too many useful features are removed

Unnamed: 0,Pregnancies,DiabetesPedigreeFunction,Age,Outcome
0,6,0.627,50,1
1,1,0.351,31,0
2,8,0.672,32,1
3,1,0.167,21,0
4,0,2.288,33,1
5,5,0.201,30,0
6,3,0.248,26,1
7,10,0.134,29,0
8,2,0.158,53,1
9,8,0.232,54,1


Removing rows with missing values may significantly reduce the number of rows (here from 768 to 392) and thus hurt the quality of the dataset. This approach is only recommended if the number of missing values is small.
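
Before dropping anything, it may help to check how much data would be lost, and to restrict the drop to columns with few missing values. The sketch below uses a small made-up frame (the values are illustrative, not a sample of the dataset) together with the `subset` and `thresh` parameters of `dropna()`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Glucose': [148.0, 85.0, np.nan, 89.0, 137.0, 116.0],
    'Insulin': [np.nan, np.nan, np.nan, 94.0, 168.0, np.nan],
    'BMI': [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
})

# Fraction of missing values per column
missing_share = toy.isnull().mean()
print(missing_share)

# Dropping every incomplete row keeps only the fully observed rows
all_complete = toy.dropna()

# Dropping rows only when Glucose is missing loses far fewer rows
glucose_known = toy.dropna(subset=['Glucose'])

# thresh=2 keeps rows that have at least 2 non-missing values
at_least_two = toy.dropna(thresh=2)
print(len(all_complete), len(glucose_known), len(at_least_two))
```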

## Approach 2: Replace Missing Values with Mean or Median

The mean and median both represent a "typical" value of the column, and thus can be a reasonable guess for the missing values.

In [16]:
raw_data = pd.read_csv("Data/diabetes/diabetes.csv")
# replace zeros with np.nan
cols = list(raw_data.columns)
cols.remove(cols[0])
cols.remove(cols[-1])
for col in cols:
    index = (raw_data[col] == 0)
    raw_data.loc[index, col] = np.nan

In [17]:
# Pandas provides fillna() function for replacing missing values with a 
# specific value.

# fill the insulin column with the mean value
data = raw_data.copy() # raw_data will not be affected
mean = data['Insulin'].mean()
print(mean)
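# Assign the result back; calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas versions and may not modify data
data['Insulin'] = data['Insulin'].fillna(mean)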
data.head(10)

155.5482233502538


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,155.548223,33.6,0.627,50.0,1
1,1,85.0,66.0,29.0,155.548223,26.6,0.351,31.0,0
2,8,183.0,64.0,,155.548223,23.3,0.672,32.0,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21.0,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33.0,1
5,5,116.0,74.0,,155.548223,25.6,0.201,30.0,0
6,3,78.0,50.0,32.0,88.0,31.0,0.248,26.0,1
7,10,115.0,,,155.548223,35.3,0.134,29.0,0
8,2,197.0,70.0,45.0,543.0,30.5,0.158,53.0,1
9,8,125.0,96.0,,155.548223,,0.232,54.0,1


In [18]:
# Perform mean imputation for all columns
data = raw_data.copy()
data.fillna(data.mean(), inplace=True)
data.head(10)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,155.548223,33.6,0.627,50.0,1
1,1,85.0,66.0,29.0,155.548223,26.6,0.351,31.0,0
2,8,183.0,64.0,29.15342,155.548223,23.3,0.672,32.0,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21.0,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33.0,1
5,5,116.0,74.0,29.15342,155.548223,25.6,0.201,30.0,0
6,3,78.0,50.0,32.0,88.0,31.0,0.248,26.0,1
7,10,115.0,72.405184,29.15342,155.548223,35.3,0.134,29.0,0
8,2,197.0,70.0,45.0,543.0,30.5,0.158,53.0,1
9,8,125.0,96.0,29.15342,155.548223,32.457464,0.232,54.0,1


In [19]:
data.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

**Discussion:**

1. When is the median preferred over the mean?

   A. For some features, the median is a better indicator of the center. When a few values are extremely large, the mean is pulled well above the typical value of the majority. Examples: income, grades, age.

2. What are the limitations of mean/median imputation?

   A. Imputation introduces "fake" values into the dataset, which might not be appropriate.

   B. Always filling with the mean pulls values toward the center and therefore reduces the variance.
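
Median imputation follows the same pattern as mean imputation, with `median()` in place of `mean()`. The sketch below uses a made-up frame in which one Insulin value is much larger than the rest, to show how far the mean and the median can differ.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Insulin': [94.0, 168.0, 88.0, 846.0, np.nan],
    'BMI': [28.1, np.nan, 31.0, 30.5, 43.1],
})

# The single large value pulls the mean far above the median
print(toy['Insulin'].mean(), toy['Insulin'].median())

# DataFrame.median() skips NaN by default and returns one value per column
medians = toy.median()

# fillna() with a Series fills each column with its own median
filled = toy.fillna(medians)
print(filled)
```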

In [20]:
# The standard deviations of the raw dataset
raw_data.std()

Pregnancies                   3.369578
Glucose                      30.535641
BloodPressure                12.382158
SkinThickness                10.476982
Insulin                     118.775855
BMI                           6.924988
DiabetesPedigreeFunction      0.331329
Age                          11.760232
Outcome                       0.476951
dtype: float64

In [21]:
# the standard deviations of the imputed dataset
data.std()

Pregnancies                  3.369578
Glucose                     30.435949
BloodPressure               12.096346
SkinThickness                8.790942
Insulin                     85.021108
BMI                          6.875151
DiabetesPedigreeFunction     0.331329
Age                         11.760232
Outcome                      0.476951
dtype: float64
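
The drop in standard deviation has a simple form. If a column has n observed values and m missing values that are filled with the observed mean, the mean and the sum of squared deviations stay the same while the denominator of the sample variance grows from n - 1 to n + m - 1. The standard deviation is therefore multiplied by sqrt((n - 1) / (n + m - 1)). For Insulin, n = 394 and m = 374, which gives a factor of about 0.716 and is consistent with the drop from 118.78 to 85.02 above. The sketch below checks this relation on a made-up series.

```python
import numpy as np
import pandas as pd

values = pd.Series([94.0, 168.0, 88.0, 543.0, np.nan, np.nan, np.nan])

n = values.notnull().sum()   # number of observed values
m = values.isnull().sum()    # number of missing values

std_before = values.std()                        # NaN is skipped
std_after = values.fillna(values.mean()).std()   # after mean imputation

# Predicted shrink factor for mean imputation
factor = np.sqrt((n - 1) / (n + m - 1))
print(std_before, std_after, std_before * factor)
```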

## Approach 3: Hot Deck Imputation

**Hot deck imputation** handles missing data by replacing each missing value with a randomly selected observed value from the same column. Because the replacements are drawn from the observed distribution, this method tends to preserve the variance of the dataset.

In [30]:
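# Two rows with missing values, useful for testing the imputation function
subdata = raw_data.loc[1:2]
subdata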

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
1,1,85.0,66.0,29.0,,26.6,0.351,31.0,0
2,8,183.0,64.0,,,23.3,0.672,32.0,1


In [52]:
# Write a function that implements hot deck imputation, and then
# use apply() to apply this function to the data frame
data = raw_data.copy()

def hot_deck_imputation(row, cols, dataset):
    # Fill each missing entry of the row with a random observed value
    # drawn from the same column of dataset
    row2 = row.copy()
    for col in cols:
        if pd.isnull(row2[col]):
            row2[col] = np.random.choice(dataset[col].dropna())
    return row2

# Apply this function to each row of the data frame
data = data.apply(hot_deck_imputation, args=(data.columns, data), axis=1)
data.head(10)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.0,148.0,72.0,35.0,15.0,33.6,0.627,50.0,1.0
1,1.0,85.0,66.0,29.0,277.0,26.6,0.351,31.0,0.0
2,8.0,183.0,64.0,27.0,210.0,23.3,0.672,32.0,1.0
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21.0,0.0
4,0.0,137.0,40.0,35.0,168.0,43.1,2.288,33.0,1.0
5,5.0,116.0,74.0,49.0,40.0,25.6,0.201,30.0,0.0
6,3.0,78.0,50.0,32.0,88.0,31.0,0.248,26.0,1.0
7,10.0,115.0,76.0,17.0,88.0,35.3,0.134,29.0,0.0
8,2.0,197.0,70.0,45.0,543.0,30.5,0.158,53.0,1.0
9,8.0,125.0,96.0,13.0,190.0,35.1,0.232,54.0,1.0


In [41]:
row = data.loc[1, :]
row.index

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [53]:
# Use for loops
data = raw_data.copy()
for col in data.columns:
    # Donor pool: only the values that were observed originally
    observed = raw_data[col].dropna()
    for idx in data.index:
        if pd.isnull(data.loc[idx, col]):
            data.loc[idx, col] = np.random.choice(observed)
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,230.0,33.6,0.627,50.0,1
1,1,85.0,66.0,29.0,185.0,26.6,0.351,31.0,0
2,8,183.0,64.0,10.0,160.0,23.3,0.672,32.0,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21.0,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33.0,1
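
The row-by-row versions above can be slow on large frames. A column-wise variant, sketched below under the assumption that every column has at least one observed value, draws all replacements for a column in a single call. The function name `hot_deck_fill` and the seeded `np.random.default_rng` generator are choices made here for illustration, and the frame is made up.

```python
import numpy as np
import pandas as pd

def hot_deck_fill(df, rng):
    """Return a copy of df with each NaN replaced by a random observed
    value from the same column (sampling with replacement)."""
    out = df.copy()
    for col in out.columns:
        missing = out[col].isnull()
        if missing.any():
            donors = out.loc[~missing, col].to_numpy()
            out.loc[missing, col] = rng.choice(donors, size=missing.sum())
    return out

toy = pd.DataFrame({
    'SkinThickness': [35.0, 29.0, np.nan, 23.0, 35.0, np.nan],
    'Insulin': [np.nan, np.nan, np.nan, 94.0, 168.0, np.nan],
})
rng = np.random.default_rng(0)  # seeded so the result is reproducible
filled = hot_deck_fill(toy, rng)
print(filled)
```

Passing the generator as an argument keeps the function free of global state and makes a run repeatable.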


In [None]:
# How to draw a random sample from a column?
# sample() on a column (a Series) returns one randomly chosen value by default
data['Insulin'].sample()

In [None]:
df = pd.DataFrame(columns=['Age'], index=range(10))
df[df.index < 10] = 50
df.loc[9] = 10
df.loc[10] = np.nan
df

In [None]:
# Draw a random row 20 times; sample() can also return the NaN row,
# so call dropna() first when sampling values for imputation
for i in range(20):
    print(df.sample())

In [None]:
# Compare the standard deviation of imputed dataset with the original.
raw_data.std()

In [None]:
data.std()

**Advanced Usage:**

Approaches 2 and 3 can be made more specific by imputing from the group each instance belongs to, for example people of the same age, instead of from the whole column.

In [None]:
# Replace the missing Glucose values using the average value from people
# of the same age.
data = raw_data.copy()
index = pd.isnull(data['Glucose'])
data[index]

In [None]:
# find the mean glucose for all the people with age 41
data.loc[data['Age'] == 41, 'Glucose'].mean()
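
One way to apply the group mean to every missing value at once is `groupby(...).transform('mean')`, which returns a Series aligned with the original rows. The sketch below uses a made-up frame. A group with no observed value yields NaN, so a fallback to the overall mean is added here as an assumption rather than as part of the method.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Age':     [22, 22, 22, 41, 41, 41, 60],
    'Glucose': [90.0, 110.0, np.nan, 140.0, np.nan, 160.0, np.nan],
})

# Mean Glucose of each age group, repeated for every row of that group
group_mean = toy.groupby('Age')['Glucose'].transform('mean')

# Overall mean of the observed values, used when a whole group is missing
overall_mean = toy['Glucose'].mean()

# Fill from the group first, then fall back to the overall mean (age 60 here)
toy['Glucose'] = toy['Glucose'].fillna(group_mean).fillna(overall_mean)
print(toy)
```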

In [None]:
# Replace the missing BloodPressure value using a random value from people 
# of the same age.
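
One possible approach to this exercise, sketched on a made-up frame: group the rows by Age and fill each missing BloodPressure with a random observed value from the same group. The helper name `fill_from_group` is introduced here for illustration, and groups with no observed value are left as NaN.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the result is reproducible

def fill_from_group(values):
    """Fill NaN in one group's values with random observed values from
    that same group; leave the group unchanged if it has no donors."""
    values = values.copy()
    missing = values.isnull()
    donors = values[~missing].to_numpy()
    if missing.any() and len(donors) > 0:
        values[missing] = rng.choice(donors, size=missing.sum())
    return values

toy = pd.DataFrame({
    'Age':           [22, 22, 22, 41, 41, 60],
    'BloodPressure': [60.0, 64.0, np.nan, 80.0, np.nan, np.nan],
})
toy['BloodPressure'] = toy.groupby('Age')['BloodPressure'].transform(fill_from_group)
print(toy)
```

The row with age 60 stays missing because nobody of that age has an observed value; it would need a fallback such as column-wide hot deck imputation.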



## Approach 4: Add missing value indicator

Sometimes the values are **not missing at random**, meaning that one cannot simply predict the missing values using existing values. If this is likely the case, then a safe approach is to add an indicator feature of whether the corresponding value is missing.
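
A minimal sketch of this idea on a made-up frame: record which values were missing in a new 0/1 column before imputing, so that a model can still use the fact that a measurement was absent. The column name `Insulin_missing` is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Insulin': [np.nan, 94.0, 168.0, np.nan, 88.0]})

# 1 where the value was missing, 0 otherwise; compute this before imputing
toy['Insulin_missing'] = toy['Insulin'].isnull().astype(int)

# Any imputation can follow; the median is used here
toy['Insulin'] = toy['Insulin'].fillna(toy['Insulin'].median())
print(toy)
```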

## Approach 5: Use Predictive Machine Learning Model (omitted)

## II. Data Transformation

### 1. Removing Duplicates

In [None]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data

In [None]:
# Identify duplicated rows
data.duplicated()

In [None]:
# Drop duplicated rows
data.drop_duplicates()

In [None]:
# Drop rows whose k1 value duplicates an earlier row (only k1 is compared)
data.drop_duplicates(['k1'])
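
By default, `duplicated()` and `drop_duplicates()` treat the first occurrence as the one to keep. The `keep` parameter changes this, as sketched below on the same frame.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# Only the last row repeats an earlier row exactly
flags = data.duplicated().tolist()
print(flags)

# keep='last' retains the last occurrence of each (k1, k2) pair instead
last_kept = data.drop_duplicates(keep='last')

# keep=False drops every row that has a duplicate anywhere
no_repeats = data.drop_duplicates(keep=False)
print(last_kept.index.tolist(), no_repeats.index.tolist())
```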

### 2. Transforming Data Using a Function or Mapping

In [None]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

In [None]:
# Suppose that we want to map the meat type to the kind of animal:
meat_to_animal = {
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon'
}

In [None]:
# To make matching simpler, change strings to lowercase first
lowercased = data['food'].str.lower()
lowercased
data['animal'] = lowercased.map(meat_to_animal)
data

In [None]:
# We can also pass a function
data['food'].map(lambda x: meat_to_animal[x.lower()])
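
The two forms differ when a value has no entry in the dictionary: mapping with the dictionary yields a missing value, while a lambda that indexes the dictionary directly raises a `KeyError`. Using `dict.get` inside the function, as in the sketch below, avoids the error. The shortened dictionary and the extra food 'tofu' are made up for illustration.

```python
import pandas as pd

meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow', 'nova lox': 'salmon'}
food = pd.Series(['Bacon', 'pastrami', 'tofu'])

# Dictionary mapping: values without an entry become missing
by_dict = food.str.lower().map(meat_to_animal)

# Function mapping with dict.get: no KeyError for unknown foods
by_func = food.map(lambda x: meat_to_animal.get(x.lower()))

print(by_dict.isnull().tolist(), by_func.isnull().tolist())
```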