# Chapter 7 Data Cleaning and Preparation

Data analysis and modeling involve a significant amount of data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst's time. In this chapter, let's study tools for handling missing data, duplicate data, string manipulation, and some other analytical data transformations.

## I. An Example of Handling Missing Values

The [Pima Indians Diabetes Dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database) involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.

It is a binary (2-class) classification problem, and the classes are not balanced: about 35% of the observations belong to class 1. There are 768 observations with 8 input variables and 1 output variable. The variables are as follows:

0. Number of times pregnant.
1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
2. Diastolic blood pressure (mm Hg).
3. Triceps skinfold thickness (mm).
4. 2-Hour serum insulin (mu U/ml).
5. Body mass index (weight in kg/(height in m)^2).
6. Diabetes pedigree function.
7. Age (years).
8. Class variable (0 or 1).

In [3]:
# https://machinelearningmastery.com/handle-missing-data-python/
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
data = pd.read_csv('Data/diabetes/diabetes.csv', delimiter=',')
data.head(3)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1


In [34]:
data.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In the following columns, a value of zero indicates a missing value:

- Plasma glucose concentration
- Diastolic blood pressure
- Triceps skinfold thickness
- 2-Hour serum insulin
- Body mass index

In [5]:
# Count how many zeros (i.e. missing values) exist in each column
columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

for column in columns:
    print(column + ": " + str((data[column] == 0).sum()))


Glucose: 5
BloodPressure: 35
SkinThickness: 227
Insulin: 374
BMI: 11


We can see that Glucose, BloodPressure, and BMI have just a few zero values, while SkinThickness is zero in about 30% of the rows (227 of 768) and Insulin in nearly half of them (374 of 768).
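To put such counts in proportion, the share of zeros per column can be computed directly. The sketch below uses a small made-up frame (`toy` is hypothetical, not the diabetes data) so that it runs on its own; on the real data the same expression could be applied to the five columns listed above.

```python
import pandas as pd

# Hypothetical toy data standing in for two of the diabetes columns
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "Insulin": [0, 0, 94, 168],
})

# (toy == 0) is a boolean frame; the column mean of booleans is the
# fraction of rows that are zero in each column
zero_share = (toy == 0).mean()
print(zero_share)
```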

In [6]:
# We should mark missing values with np.nan, so that these values are
# correctly ignored by operations such as sum, count, min, etc.
# replace() works on whole columns at once and avoids the chained
# assignment data[column][i] = ..., which triggers SettingWithCopyWarning.
data[columns] = data[columns].replace(0, np.nan)


In [7]:
# Use isnull() to find the number of missing values
data.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

## Approach 1: Remove Rows/Columns with Missing values

The simplest strategy for handling missing data is to remove rows/columns that contain a missing value.

In [68]:
# Pandas provides the dropna() function, which can drop either rows or
# columns that contain missing data (rows by default).
data3 = data.dropna()
data3 = data3.reset_index(drop=True)

data3.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
1,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
2,3,78.0,50.0,32.0,88.0,31.0,0.248,26,1
3,2,197.0,70.0,45.0,543.0,30.5,0.158,53,1
4,1,189.0,60.0,23.0,846.0,30.1,0.398,59,1


In [69]:
data3.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

Removing rows with missing values may significantly reduce the number of rows and thus hurt the quality of the dataset. This approach is only recommended if the number of missing values is small. Here it discards almost half of the data:

In [64]:
len(data)

768

In [52]:
len(data3)

392
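`dropna()` also accepts parameters that make the removal less drastic. The following is a minimal sketch on a made-up frame (`toy` and its column names are hypothetical) showing the `axis`, `subset`, and `thresh` parameters; which option suits the diabetes data is a judgment call that this sketch does not settle.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" is mostly missing, "a" has one gap
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, 8.0],
    "c": [10.0, 20.0, 30.0, 40.0],
})

rows_complete = toy.dropna()             # keep rows with no missing values
cols_complete = toy.dropna(axis=1)       # keep columns with no missing values
rows_subset = toy.dropna(subset=["a"])   # only require "a" to be present
rows_thresh = toy.dropna(thresh=2)       # keep rows with at least 2 non-missing values

print(len(rows_complete), list(cols_complete.columns),
      len(rows_subset), len(rows_thresh))
```

Dropping a mostly empty column first and then dropping incomplete rows usually keeps more observations than dropping rows alone.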

## Approach 2: Replace Missing Values with Mean or Median

The mean and median each summarize the "typical" value of a column, and thus can be a reasonable guess for the missing values.


In [78]:
# Pandas provides the fillna() function for replacing missing values with a
# specific value. Work on a copy so that the original data is left unchanged.
data4 = data.copy()
for column in columns:
    data4[column] = data4[column].fillna(data4[column].mean())


In [79]:
data4.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [87]:
# Reload the data and mark zeros as missing again, so that the following
# cells start from data that has not been imputed
data = pd.read_csv('Data/diabetes/diabetes.csv', delimiter=',')
data[columns] = data[columns].replace(0, np.nan)


In [88]:
data.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

**Discussion:**
1. When is the median preferred over the mean?

    __The median is preferred when the column contains significant outliers or is strongly skewed, because extreme values pull the mean but barely affect the median.__

2. What are the limitations of mean/median imputation?

    __Mean/median imputation assumes that every missing value is close to the typical observed value. Because all gaps receive the same number, the variance of the column is understated.__
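Both points can be checked on a few numbers. This is a small sketch with made-up values (the Series `s` is hypothetical): one outlier pulls the mean far from the median, and filling the gaps with a single value reduces the standard deviation.

```python
import numpy as np
import pandas as pd

# Hypothetical values: one extreme outlier (500) and two missing entries
s = pd.Series([10.0, 12.0, 11.0, 13.0, 500.0, np.nan, np.nan])

# The outlier drags the mean far above the median
print(s.mean(), s.median())

# Every gap receives the same value, so the spread shrinks
filled = s.fillna(s.median())
print(s.std(), filled.std())
```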

## Approach 3: Hot Deck Imputation
**Hot deck imputation** is a method for handling missing data by replacing each missing value with a randomly chosen observed value from the same column. This imputation method approximately preserves the variance of the dataset.

In [89]:
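# Write a function that implements hot deck imputation, and then
# use apply() to apply this function to each column of the data frame
def hot_deck(column):
    missing = column.isnull()
    if not missing.any():
        return column
    # Draw one observed (non-missing) value at random for every gap
    donors = column.dropna().to_numpy()
    filled = column.copy()
    filled[missing] = np.random.choice(donors, size=missing.sum())
    return filled

data5 = data.apply(hot_deck)
data5.head(10)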

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,36.0,33.6,0.627,50,1
1,1,85.0,66.0,29.0,207.0,26.6,0.351,31,0
2,8,183.0,64.0,33.0,215.0,23.3,0.672,32,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
5,5,116.0,74.0,37.0,231.0,25.6,0.201,30,0
6,3,78.0,50.0,32.0,88.0,31.0,0.248,26,1
7,10,115.0,66.0,20.0,94.0,35.3,0.134,29,0
8,2,197.0,70.0,45.0,543.0,30.5,0.158,53,1
9,8,125.0,96.0,29.0,325.0,27.4,0.232,54,1


In [90]:
# Compare the standard deviation of imputed dataset with the original.
data.std()

Pregnancies                   3.369578
Glucose                      30.535641
BloodPressure                12.382158
SkinThickness                10.476982
Insulin                     118.775855
BMI                           6.924988
DiabetesPedigreeFunction      0.331329
Age                          11.760232
Outcome                       0.476951
dtype: float64

In [91]:
data5.std()

Pregnancies                   3.369578
Glucose                      30.577145
BloodPressure                12.333788
SkinThickness                10.727015
Insulin                     113.769802
BMI                           6.903038
DiabetesPedigreeFunction      0.331329
Age                          11.760232
Outcome                       0.476951
dtype: float64

**Advanced Usage:**

Approaches 2 and 3 can be refined by imputing within the group each instance belongs to, for example using only people of the same age.

In [8]:
# Replace the missing Glucose values using the average value from people
# of the same age.
data6 = data.copy()

glucose_null_filter = pd.isnull(data6["Glucose"])
data6[glucose_null_filter]


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
75,1,,48.0,20.0,,24.7,0.14,22,0
182,1,,74.0,20.0,23.0,27.7,0.299,21,0
342,1,,68.0,35.0,,32.0,0.389,22,0
349,5,,80.0,32.0,,41.0,0.346,37,1
502,6,,68.0,41.0,,39.0,0.727,41,1


In [18]:
# Loop over the rows with a missing Glucose value and fill each one with
# the mean Glucose of people of the same age
for i in data6.index[data6["Glucose"].isnull()]:
    age_filter = data6["Age"] == data6.loc[i, "Age"]
    data6.loc[i, "Glucose"] = data6.loc[age_filter, "Glucose"].mean()


In [20]:
# Replace the missing BloodPressure value using a random value from people
# of the same age.
observed = data6.loc[data6["BloodPressure"].notnull(), ["Age", "BloodPressure"]]

for i in data6.index[data6["BloodPressure"].isnull()]:
    donors = observed.loc[observed["Age"] == data6.loc[i, "Age"], "BloodPressure"]
    # Leave the value missing if nobody of the same age has an observed value
    if len(donors) > 0:
        data6.loc[i, "BloodPressure"] = np.random.choice(donors.to_numpy())
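Group-wise imputation leaves a gap whenever a group has no observed value at all, because the mean of an all-missing group is itself missing. One possible remedy, sketched here on made-up data (`toy` is hypothetical), is to compute the group means with `groupby(...).transform("mean")` and fall back to the overall mean. The fallback rule is an assumption of this sketch, not part of the approach described above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the only 40-year-old has no Glucose reading
toy = pd.DataFrame({
    "Age":     [21, 21, 21, 40, 33, 33],
    "Glucose": [90.0, np.nan, 110.0, np.nan, 120.0, 130.0],
})

# Mean Glucose within each age group, aligned to the original rows
group_mean = toy.groupby("Age")["Glucose"].transform("mean")

# First try the same-age mean, then fall back to the overall mean
toy["Glucose"] = toy["Glucose"].fillna(group_mean).fillna(toy["Glucose"].mean())
print(toy)
```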

## Approach 4: Add a Missing Value Indicator

Sometimes the values are **not missing at random**, meaning that one cannot simply predict the missing values using existing values. If this is likely the case, then a safe approach is to add an indicator feature of whether the corresponding value is missing.

In [25]:
data7 = data.copy()

data7["isGlucoseMissing"] = data7["Glucose"].isnull()

In [28]:
data7.sample(10)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,isGlucoseMissing
421,2,94.0,68.0,18.0,76.0,26.0,0.561,21,0,False
113,4,76.0,62.0,,,34.0,0.391,25,0,False
397,0,131.0,66.0,40.0,,34.3,0.196,22,1,False
155,7,152.0,88.0,44.0,,50.0,0.337,36,1,False
130,4,173.0,70.0,14.0,168.0,29.7,0.361,33,1,False
225,1,87.0,78.0,27.0,32.0,34.6,0.101,22,0,False
368,3,81.0,86.0,16.0,66.0,27.5,0.306,22,0,False
144,4,154.0,62.0,31.0,284.0,32.8,0.237,23,0,False
512,9,91.0,68.0,,,24.2,0.2,58,0,False
248,9,124.0,70.0,33.0,402.0,35.4,0.282,34,0,False
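The same idea extends to every column that has gaps, and it combines naturally with imputation: record the indicators first, then fill the values. The sketch below uses a made-up frame, and the `_missing` suffix is a hypothetical naming choice, not a pandas convention.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two columns
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0],
    "Insulin": [np.nan, 94.0, np.nan],
    "Age":     [50, 31, 32],
})

# Columns that actually contain missing values
missing_columns = list(toy.columns[toy.isnull().any()])

# Add one boolean indicator per such column before imputing
for column in missing_columns:
    toy[column + "_missing"] = toy[column].isnull()

# Impute afterwards; the flags keep the information that a value was absent
toy[missing_columns] = toy[missing_columns].fillna(toy[missing_columns].mean())
print(toy)
```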


## II. Data Transformation

### 1. Removing Duplicates

In [29]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [30]:
# Identify duplicated rows
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [31]:
# Drop duplicated rows
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


In [32]:
# Drop rows whose k1 value already appeared in an earlier row
# (the first occurrence of each value is kept)
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2
0,one,1
1,two,1
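By default the first occurrence is kept. The `keep` parameter changes that, as this short sketch on the same seven rows shows (the variable names `frame`, `last_per_k1`, and `no_repeats` are introduced here for illustration).

```python
import pandas as pd

frame = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                      'k2': [1, 1, 2, 3, 3, 4, 4]})

# keep='last' retains the final row seen for each k1 value instead of the first
last_per_k1 = frame.drop_duplicates(['k1'], keep='last')
print(last_per_k1)

# keep=False drops every row that has an exact duplicate anywhere in the frame
no_repeats = frame.drop_duplicates(keep=False)
print(no_repeats)
```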


### 2. Transforming Data Using a Function or Mapping

In [33]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [34]:
# Suppose that we want to map the meat type to the kind of animal:
meat_to_animal = {
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon'
}

In [35]:
# To make matching simpler, change strings to lowercase first
lowercased = data['food'].str.lower()
data['animal'] = lowercased.map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


In [68]:
# We can also pass a function that performs the same lookup
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
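The two styles differ when a value has no entry in the mapping: passing the dict yields `NaN`, while indexing the dict inside a function raises a `KeyError`. A minimal sketch, using a shortened mapping and a deliberately unmapped value (`'tofu'` and the `'unknown'` default are made up for illustration):

```python
import pandas as pd

meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}
food = pd.Series(['Bacon', 'pastrami', 'tofu'])   # 'tofu' is deliberately unmapped

# Passing the dict: unmatched values become NaN
by_dict = food.str.lower().map(meat_to_animal)

# Passing a function: dict.get supplies a default instead of raising KeyError
by_func = food.map(lambda x: meat_to_animal.get(x.lower(), 'unknown'))

print(by_dict.tolist())
print(by_func.tolist())
```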