## Introduction to Data Preparation Techniques

Surveys of data analysts have found that they spend most of their time massaging rather than mining or modeling data; data preparation accounts for 60% to 80% of their work. In this lesson, we will introduce some commonly used Python data preparation techniques. Mastering these techniques may save you a lot of time and trouble.

## Table of Contents

[Load Data](#Load-Data)

[Change Column Name](#Change-Column-Name)  

[Handle Missing Data](#Handle-Missing-Data)  

[Manipulate String Values](#Manipulate-String-Values)  

[Transform Datetime](#Transform-Datetime)  

[Lambda Function](#Lambda-Function)


-----
[[Back to TOC]](#Table-of-Contents)

## Load Data

A dataset can be one of several different types, distinguished by how the data is stored and structured. In this lesson, we will briefly introduce how to load **structured** data from three storage types (CSV files, Excel files, and relational databases) using `pandas` functions.

#### From CSV File

A comma-separated values (CSV) file is one of the most popular ways to store structured data. We call the `pandas` function `read_csv()` to load a dataset from a CSV file into a `pandas` `DataFrame`.

> ~~~
>import pandas as pd
>df = pd.read_csv('filename.csv')
> ~~~

`read_csv()` has many other arguments in addition to the file name. For example, the following code loads a CSV file that uses '\|' as the delimiter, skips the first 5 rows in the file, and sets the first column as the DataFrame index:

> ~~~
>df = pd.read_csv('filename.csv', delimiter='|', index_col=0, skiprows=5)
> ~~~
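To see these arguments in action without needing a file on disk, here is a minimal sketch that reads from an in-memory string instead. The text content, the `raw` variable, and the `df_demo` name are made up for illustration; only the `read_csv()` arguments come from the example above.

```python
import io

import pandas as pd

# Hypothetical file content: 5 preamble lines, then a '|'-delimited table
raw = "\n".join([
    "exported by a hypothetical tool",
    "preamble line 2",
    "preamble line 3",
    "preamble line 4",
    "preamble line 5",
    "id|name|age",
    "1|Kim|20",
    "2|Rob|21",
])

# skiprows=5 drops the preamble, so the 6th line becomes the header;
# index_col=0 turns the first column (id) into the DataFrame index
df_demo = pd.read_csv(io.StringIO(raw), delimiter='|', index_col=0, skiprows=5)
print(df_demo)
```

Because `id` is used as the index, the resulting DataFrame has only two data columns, `name` and `age`.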

To learn the details of the function, check out the function's docstring:

> ~~~
>#display function doc string
>help(pd.read_csv)
> ~~~

#### From Spreadsheet

To load a dataset from an Excel file, we use the `pandas` function `read_excel()`. The following code loads a dataset from the sheet "sheet1" of the Excel file "filename.xlsx":

> ~~~
>df = pd.read_excel('filename.xlsx', sheet_name='sheet1')
> ~~~

Like `read_csv()`, `read_excel()` takes many other arguments. Please refer to the docstring of the function (`help(pd.read_excel)`) for more details.

#### From Relational Database

The `pandas` function `read_sql()` can be used to load data from a relational database. We need to provide the function with a database connection and a query that retrieves values from the database. In the following code, we assume there is a SQLite database "mydb" in the current folder that contains a table named tblCustomer. We will load all rows from tblCustomer into a DataFrame. The column names of tblCustomer become the column names of the resulting DataFrame.

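> ~~~
>import sqlite3 as sql
>from contextlib import closing
>import pandas as pd
>#closing() calls con.close() when the block exits
>with closing(sql.connect('mydb')) as con:
>    df_customer = pd.read_sql('select * from tblCustomer', con)
> ~~~

Wrapping the connection in `contextlib.closing()` ensures that the connection is closed after the block is executed. Note that using a `sqlite3` connection directly in a `with` statement only commits or rolls back the open transaction; it does not close the connection.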
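The following self-contained sketch can be run without an existing database file. It builds a throwaway in-memory SQLite database (`':memory:'`), so the table contents and column names here are invented for illustration; the connection is wrapped in `contextlib.closing()` so that it is closed when the block exits.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# ':memory:' creates a temporary database that disappears when closed
with closing(sqlite3.connect(':memory:')) as con:
    # Hypothetical table and rows, created only for this demonstration
    con.execute('CREATE TABLE tblCustomer (id INTEGER, name TEXT)')
    con.executemany('INSERT INTO tblCustomer VALUES (?, ?)',
                    [(1, 'Kim'), (2, 'Rob')])
    # The query result is loaded into a DataFrame; table column names
    # become DataFrame column names
    df_customer = pd.read_sql('SELECT * FROM tblCustomer', con)

print(df_customer)
```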


-----
[[Back to TOC]](#Table-of-Contents)

## Change Column Name

When a dataset is loaded into a DataFrame, it may or may not have column names. If there are no column names, we usually need to assign meaningful ones based on our understanding of the data. When the DataFrame does have column names, we may still need to modify them to make subsequent work easier. Industry practice suggests column names should:
- **Not** contain spaces. Instead of `Last Name` as a column name, use `LastName` or `last_name`.
- **Not** contain special characters (e.g., $, &, %, -, and other punctuation).
- Be descriptive and provide some information about the field.

DataFrame has a `columns` property, and you can reset all column names with:

`df.columns = [col_name1, col_name2...]`

You can also change a particular column name with function `rename()`:

`df.rename(columns={original_name: new_name}, inplace=True)`

Notice that `df.rename()` will return a renamed DataFrame by default. If you want to rename `df` in place, you need to set the function argument `inplace=True`. For more details of the function, check out the doc string with `help(df.rename)` or `df.rename?`.

In [1]:
import pandas as pd
df1 = pd.DataFrame({'Last Name':['Brunner', 'Lu'], 'First Name':['Rob', 'Lin'], 'Age':[21, 22]})
df1

Unnamed: 0,Last Name,First Name,Age
0,Brunner,Rob,21
1,Lu,Lin,22


In [2]:
#print current column names
df1.columns

Index(['Last Name', 'First Name', 'Age'], dtype='object')

In [3]:
df1_1 = df1.copy()
#set new column names
df1_1.columns = ['LastName', 'FirstName', 'Age']
df1_1

Unnamed: 0,LastName,FirstName,Age
0,Brunner,Rob,21
1,Lu,Lin,22


In [4]:
df1_2 = df1.copy()
#change specific column names
df1_2.rename(columns={'Last Name':'LastName', 'First Name':'FirstName'}, inplace=True)
df1_2

Unnamed: 0,LastName,FirstName,Age
0,Brunner,Rob,21
1,Lu,Lin,22
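When a dataset has many columns, renaming them one by one becomes tedious. As a rough sketch, the naming rules above can be applied to every column at once with the string methods available on `df.columns`. The DataFrame `df_cols` and its messy column names are made up for illustration, and the choice of lower-case names with underscores is just one possible convention.

```python
import pandas as pd

# Hypothetical DataFrame with spaces and special characters in column names
df_cols = pd.DataFrame({'Last Name': ['Brunner'],
                        'First-Name': ['Rob'],
                        'Age (yrs)': [21]})

cleaned = (df_cols.columns
           .str.strip()                                     # drop outer spaces
           .str.replace(r'[^0-9a-zA-Z]+', '_', regex=True)  # non-alphanumerics -> _
           .str.strip('_')                                  # drop leading/trailing _
           .str.lower())                                    # unify case
df_cols.columns = cleaned
print(list(df_cols.columns))
```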


-----
[[Back to TOC]](#Table-of-Contents)

## Handle Missing Data

Data sets can have missing values for various reasons such as observations not recorded or data corruption. Handling missing data is very important as many analytic algorithms do not support data with missing values.

Missing data is normally represented as `NaN`, which stands for "Not a Number", in a Pandas DataFrame (`NaT` for datetime data). We will demonstrate common techniques for handling missing values in a DataFrame, which include:
- Find general information about missing values in the DataFrame.
- Drop rows that contain missing values.
- Fill missing values
    - with constant
    - with mode (for categorical features)
    - with mean/median (for continuous features)
    - with estimated values generated from other columns
    
First, we will create a DataFrame with missing values.




In [5]:
import numpy as np
df2 = pd.DataFrame({'LastName':['Mend', 'Brun', 'Lu', 'Guy'], 'FirstName':['Kim', 'Rob', 'Lin', 'Ron'],
                    'Gender':['F', 'M', 'M', np.nan], 'Age':[20, 21, np.nan, 22]})
df2

Unnamed: 0,LastName,FirstName,Gender,Age
0,Mend,Kim,F,20.0
1,Brun,Rob,M,21.0
2,Lu,Lin,M,
3,Guy,Ron,,22.0


#### General information about missing values

A Pandas DataFrame has a function `info()` that prints general DataFrame information, including the number of rows, the number of columns, each column's data type, and the number of non-null values in each column. The cell below shows that df2 has 4 rows and 4 columns. LastName and FirstName have 4 non-null values, so they have no missing values; Gender and Age have 3 non-null values, so each has one missing value.


In [6]:
#general information about missing values
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
LastName     4 non-null object
FirstName    4 non-null object
Gender       3 non-null object
Age          3 non-null float64
dtypes: float64(1), object(3)
memory usage: 256.0+ bytes


We can also use `df.isnull().sum()` to get the number of missing values in each column. The output is a pandas Series, with column names as the index and the number of missing values as the values. From the following cell, we can see that there are no missing values in LastName and FirstName, and one missing value each in Gender and Age. You may also check the number of missing values in a specific column, as shown below.

In [7]:
df2.isnull().sum()

LastName     0
FirstName    0
Gender       1
Age          1
dtype: int64

In [8]:
#check number of missing value in specific column
df2.Gender.isnull().sum()

1
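Two related checks are often useful alongside the counts above: the share of missing values per column, and the rows that contain at least one missing value. The sketch below rebuilds a small DataFrame similar to df2 (named `df_missing` here so it does not interfere with the cells above); taking the mean of the boolean mask gives the fraction of missing values.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({'LastName': ['Mend', 'Brun', 'Lu', 'Guy'],
                           'Gender': ['F', 'M', 'M', np.nan],
                           'Age': [20, 21, np.nan, 22]})

# isnull() gives True/False per cell; the mean of booleans is the share of True
missing_share = df_missing.isnull().mean()

# any(axis=1) flags rows where at least one column is missing
rows_with_gaps = df_missing[df_missing.isnull().any(axis=1)]

print(missing_share)
print(rows_with_gaps)
```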

#### Simply Drop Rows with Missing Values

The simplest way to handle missing values is simply dropping rows that have missing values in particular columns with DataFrame function `dropna()`. The function by default will return a new DataFrame with the missing values dropped. If you want to modify the DataFrame in place, you'll have to set `inplace=True`. Try `help(df2.dropna)` or `df2.dropna?` to see details of this function.

In [9]:
#drop all rows that have missing values
df2_1 = df2.dropna()
df2_1

Unnamed: 0,LastName,FirstName,Gender,Age
0,Mend,Kim,F,20.0
1,Brun,Rob,M,21.0


In [10]:
#drop rows that miss Age
df2_2 = df2.dropna(subset=['Age'])
df2_2

Unnamed: 0,LastName,FirstName,Gender,Age
0,Mend,Kim,F,20.0
1,Brun,Rob,M,21.0
3,Guy,Ron,,22.0
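`dropna()` also accepts `how` and `thresh` arguments for cases where dropping every row with any missing value is too aggressive. The following is a small sketch on a made-up DataFrame (`df_sparse` is a hypothetical name): `how='all'` drops only rows in which every value is missing, and `thresh=3` keeps only rows with at least 3 non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 misses one value, row 2 is empty
df_sparse = pd.DataFrame({'A': [1, np.nan, np.nan],
                          'B': [2, 5, np.nan],
                          'C': [3, 6, np.nan]})

# Drop rows only when all values are missing
no_empty_rows = df_sparse.dropna(how='all')

# Keep rows that have at least 3 non-missing values
mostly_complete = df_sparse.dropna(thresh=3)

print(no_empty_rows)
print(mostly_complete)
```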


#### Fill Missing Values

We can also fill missing values with the `fillna()` function. Like `dropna()`, `fillna()` returns a new object by default, so the result has to be assigned back (or `inplace=True` has to be set) for the change to take effect.

We will demonstrate three ways to fill missing values:

- Fill with constant.
- Fill with mean/mode.
- Fill with values calculated based on other column.

In [11]:
df2

Unnamed: 0,LastName,FirstName,Gender,Age
0,Mend,Kim,F,20.0
1,Brun,Rob,M,21.0
2,Lu,Lin,M,
3,Guy,Ron,,22.0


##### Fill with constant
Fill missing Age with 20.

In [12]:
#fill missing age with 20
df2_3 = df2.copy()
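#assign the result back; fillna(inplace=True) on a selected column may not update the DataFrame in newer pandas versions
df2_3['Age'] = df2_3['Age'].fillna(20)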
df2_3

Unnamed: 0,LastName,FirstName,Gender,Age
0,Mend,Kim,F,20.0
1,Brun,Rob,M,21.0
2,Lu,Lin,M,20.0
3,Guy,Ron,,22.0


##### Fill with mean/mode

In [13]:
#fill missing age with mean age
df2_4 = df2.copy()
df2_4['Age'] = df2_4['Age'].fillna(df2_4['Age'].mean())
df2_4

Unnamed: 0,LastName,FirstName,Gender,Age
0,Mend,Kim,F,20.0
1,Brun,Rob,M,21.0
2,Lu,Lin,M,21.0
3,Guy,Ron,,22.0


In [14]:
#fill missing Gender with mode
df2_5 = df2.copy()
df2_5['Gender'] = df2_5['Gender'].fillna(df2_5['Gender'].mode()[0])
df2_5

Unnamed: 0,LastName,FirstName,Gender,Age
0,Mend,Kim,F,20.0
1,Brun,Rob,M,21.0
2,Lu,Lin,M,
3,Guy,Ron,M,22.0
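Filling with the median works the same way as filling with the mean and is less sensitive to outliers. The sketch below uses a made-up Series (`ages`) containing one unusually large value to show the difference between the two fill values.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value and one outlier (40)
ages = pd.Series([20, 21, np.nan, 40])

# mean() and median() both skip missing values by default
mean_age = ages.mean()      # (20 + 21 + 40) / 3 = 27.0
median_age = ages.median()  # middle value of 20, 21, 40 = 21.0

filled_with_median = ages.fillna(median_age)
print(mean_age, median_age)
print(filled_with_median)
```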


##### Fill with values calculated based on other column

In [15]:
#fill missing Age with average age based on gender
df2_6 = df2.copy()
df2_6['Age'] = df2_6['Age'].fillna(df2_6.groupby('Gender')['Age'].transform('mean'))
df2_6

Unnamed: 0,LastName,FirstName,Gender,Age
0,Mend,Kim,F,20.0
1,Brun,Rob,M,21.0
2,Lu,Lin,M,21.0
3,Guy,Ron,,22.0
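The key step in the cell above is `transform('mean')`: unlike a plain group aggregation, it returns one value per original row, so the result lines up with the column being filled. The sketch below shows the intermediate Series on a made-up DataFrame (`df_grp`) in which every row has a Gender value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing Age in each Gender group
df_grp = pd.DataFrame({'Gender': ['F', 'F', 'M', 'M'],
                       'Age': [20, np.nan, 30, np.nan]})

# One group mean per row, aligned with the original index
group_mean = df_grp.groupby('Gender')['Age'].transform('mean')

# fillna() with a Series fills each missing value from the matching index
df_grp['Age'] = df_grp['Age'].fillna(group_mean)

print(group_mean)
print(df_grp)
```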


-----
[[Back to TOC]](#Table-of-Contents)

## Manipulate String Values

There are countless ways you might want to transform string values. We will introduce some common manipulations. We will mostly use pandas Series attribute `str` to invoke string functions.

- Strip leading/trailing white spaces.
- Unify string case (this is especially critical if you need to compare or group by the values in the column).
- Replace certain characters in string (common when dealing with currency values).

First, we will create a DataFrame to demonstrate string manipulations. We will:
- strip leading/trailing white spaces in Gender column and convert values to all upper case, and
- remove '$' and ',' from Tuition column and convert the column to float data type.


In [16]:
df3 = pd.DataFrame({'Name':['Kim Mend', 'Rob Brun', 'Lin Lu', 'Ron Guy'],
                   'Gender':['Femal  ', '  Male  ', 'male', 'Male'],
                    'Birthday':['3/31/1998', 'Jan 15, 1999', '20/1/1998', '5/6/1997'],
                   'Tuition':['$10,125', '$9999', '0', '$2000']})
df3

Unnamed: 0,Name,Gender,Birthday,Tuition
0,Kim Mend,Femal,3/31/1998,"$10,125"
1,Rob Brun,Male,"Jan 15, 1999",$9999
2,Lin Lu,male,20/1/1998,0
3,Ron Guy,Male,5/6/1997,$2000


##### Strip leading/trailing white spaces and unify string case

In [17]:
#Strip leading/trailing spaces in Gender column
df3['Gender'] = df3.Gender.str.strip()
#Convert to all upper case
df3['Gender'] = df3.Gender.str.upper()
df3

Unnamed: 0,Name,Gender,Birthday,Tuition
0,Kim Mend,FEMAL,3/31/1998,"$10,125"
1,Rob Brun,MALE,"Jan 15, 1999",$9999
2,Lin Lu,MALE,20/1/1998,0
3,Ron Guy,MALE,5/6/1997,$2000
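To illustrate why this cleanup matters for comparing or grouping, the sketch below counts distinct values before and after normalizing. It uses a standalone Series (`gender`) with the same raw values as the Gender column above.

```python
import pandas as pd

gender = pd.Series(['Femal  ', '  Male  ', 'male', 'Male'])

# Before cleaning, every spelling/spacing variant counts as its own category
distinct_before = gender.nunique()

# Method calls on the str accessor can be chained
normalized = gender.str.strip().str.upper()
distinct_after = normalized.nunique()

print(distinct_before, distinct_after)
print(normalized.value_counts())
```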


##### Replace substring

Currency values are sometimes stored as strings like $12,345.99. We need to strip the dollar sign and comma from the values and then convert them to float for further operations. The following code achieves all three tasks in one line. Notice that you need to invoke the `str` attribute for both calls of the string function `replace()`.

In [18]:
#Remove $ and , in Tuition (regex=False treats '$' as a literal character), convert to float
df3['Tuition'] = df3.Tuition.str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
df3

Unnamed: 0,Name,Gender,Birthday,Tuition
0,Kim Mend,FEMAL,3/31/1998,10125.0
1,Rob Brun,MALE,"Jan 15, 1999",9999.0
2,Lin Lu,MALE,20/1/1998,0.0
3,Ron Guy,MALE,5/6/1997,2000.0
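As an alternative sketch, both characters can be removed in a single `replace()` call by passing a regular expression. Inside a character class such as `[$,]`, the dollar sign is treated literally. The Series `tuition` below repeats the raw Tuition values for illustration.

```python
import pandas as pd

tuition = pd.Series(['$10,125', '$9999', '0', '$2000'])

# [$,] matches either a dollar sign or a comma; regex=True enables patterns
cleaned = tuition.str.replace(r'[$,]', '', regex=True).astype(float)
print(cleaned)
```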


-----
[[Back to TOC]](#Table-of-Contents)

## Transform Datetime

Date and time values are very common in business-related datasets. We usually want to convert date and time information from strings to the datetime data type. We sometimes also need to create datetime-related columns for analysis, e.g., sales on different days of the week.

#### Convert Datetime String to Datetime Object

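Pandas has a function `to_datetime()` that converts date and time strings to the datetime data type. This function can deal with various date and time formats. Note that in pandas 2.0 and later, parsing a column whose values use different formats, like the Birthday column here, requires passing `format='mixed'`.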

In [19]:
df3['Birthday_clean'] = pd.to_datetime(df3.Birthday)
df3

Unnamed: 0,Name,Gender,Birthday,Tuition,Birthday_clean
0,Kim Mend,FEMAL,3/31/1998,10125.0,1998-03-31
1,Rob Brun,MALE,"Jan 15, 1999",9999.0,1999-01-15
2,Lin Lu,MALE,20/1/1998,0.0,1998-01-20
3,Ron Guy,MALE,5/6/1997,2000.0,1997-05-06


We need to be very careful when dealing with datetime values. For example, for '5/6/1997', it's unclear whether it means May 6 or June 5. By default, `pd.to_datetime()` assumes '5' is the month and '6' is the day, but this may not be the case. You may pass a datetime format to the function to eliminate the ambiguity, as the following example shows. However, a format can only be applied to a DataFrame column if all values in the column share that format. A dataset with very messy datetime formats requires more advanced cleaning techniques, which are beyond the scope of this lesson.

You may review the list of all [Python datetime format codes](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior).

In [20]:
#set datetime format
pd.to_datetime('5/6/1997', format='%d/%m/%Y')

Timestamp('1997-06-05 00:00:00')
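The sketch below shows two related options on made-up values. `dayfirst=True` resolves the day/month ambiguity without a full format string, and `errors='coerce'` turns values that do not match the given format into `NaT`, which is one way to spot inconsistent entries in a column.

```python
import pandas as pd

# Default interpretation is month first; dayfirst=True flips it
month_first = pd.to_datetime('5/6/1997')
day_first = pd.to_datetime('5/6/1997', dayfirst=True)

# Hypothetical column: the last value does not follow the day/month/year format
dates = pd.Series(['31/03/1998', '15/01/1999', 'Jan 20, 1998'])
parsed = pd.to_datetime(dates, format='%d/%m/%Y', errors='coerce')

print(month_first, day_first)
print(parsed)
```

The non-matching value becomes `NaT`, so `parsed.isnull()` can be used to locate rows that need manual attention.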

#### Create New Columns from Datetime Column

A Pandas datetime column provides attributes for datetime parts such as year, month, day, day of week, and hour. Similar to string functions, we need to go through the `dt` attribute to access them. In the following cells, we will add new datetime-related columns to the DataFrame created above.

In [21]:
df4 = df3.copy()
df4

Unnamed: 0,Name,Gender,Birthday,Tuition,Birthday_clean
0,Kim Mend,FEMAL,3/31/1998,10125.0,1998-03-31
1,Rob Brun,MALE,"Jan 15, 1999",9999.0,1999-01-15
2,Lin Lu,MALE,20/1/1998,0.0,1998-01-20
3,Ron Guy,MALE,5/6/1997,2000.0,1997-05-06


In [22]:
df4['Year'] = df4.Birthday_clean.dt.year
df4['Month'] = df4.Birthday_clean.dt.month
df4['Day'] = df4.Birthday_clean.dt.day
df4['DayOfWeek'] = df4.Birthday_clean.dt.dayofweek
df4['Hour'] = df4.Birthday_clean.dt.hour
df4

Unnamed: 0,Name,Gender,Birthday,Tuition,Birthday_clean,Year,Month,Day,DayOfWeek,Hour
0,Kim Mend,FEMAL,3/31/1998,10125.0,1998-03-31,1998,3,31,1,0
1,Rob Brun,MALE,"Jan 15, 1999",9999.0,1999-01-15,1999,1,15,4,0
2,Lin Lu,MALE,20/1/1998,0.0,1998-01-20,1998,1,20,1,0
3,Ron Guy,MALE,5/6/1997,2000.0,1997-05-06,1997,5,6,1,0
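The `dayofweek` values are integers, with Monday as 0 and Sunday as 6. If readable labels are preferred, the `dt` attribute also offers `day_name()` and `strftime()`. This is a small sketch on a standalone Series (`birthdays`) holding two of the dates from above.

```python
import pandas as pd

birthdays = pd.to_datetime(pd.Series(['1998-03-31', '1999-01-15']))

# Integer day of week: Monday=0 ... Sunday=6
day_numbers = birthdays.dt.dayofweek

# Full weekday names
day_names = birthdays.dt.day_name()

# Custom string formatting using the datetime format codes
year_month = birthdays.dt.strftime('%Y-%m')

print(day_numbers.tolist(), day_names.tolist(), year_month.tolist())
```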


-----
[[Back to TOC]](#Table-of-Contents)

## Lambda Function

Python lambda functions, also known as anonymous functions, are inline functions that do not have a name. They are created with the lambda keyword. The syntax is `lambda arguments : expression`. The following cell defines a lambda function that adds up 2 numbers.

In [23]:
x = lambda a, b: a + b
x(3,4)

7
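A lambda function is most useful when a small function is needed only once, for example as an argument to another function. The sketch below sorts a made-up list of names by last name, first with a named function and then with an equivalent lambda; both give the same result.

```python
names = ['Kim Mend', 'Rob Brun', 'Lin Lu', 'Ron Guy']

# Named function: returns the second word of a full name
def last_name(full_name):
    return full_name.split()[1]

sorted_with_def = sorted(names, key=last_name)

# Equivalent inline lambda, no separate definition needed
sorted_with_lambda = sorted(names, key=lambda full_name: full_name.split()[1])

print(sorted_with_lambda)
```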

Lambda functions are very handy in DataFrame manipulations. We will use two examples to demonstrate this. The first splits one column into two columns with a lambda function. The second creates a datetime column from multiple columns.

#### Apply Lambda Function on One Column

The syntax to apply a lambda function on a column is `df.column.apply(lambda x:expression(x))`, which applies the lambda function to each value in the column and returns a Pandas Series containing the return values. In the following example, we will create Firstname and Lastname columns by applying lambda functions on the Name column.

In [24]:
df5 = df4.copy()
df5

Unnamed: 0,Name,Gender,Birthday,Tuition,Birthday_clean,Year,Month,Day,DayOfWeek,Hour
0,Kim Mend,FEMAL,3/31/1998,10125.0,1998-03-31,1998,3,31,1,0
1,Rob Brun,MALE,"Jan 15, 1999",9999.0,1999-01-15,1999,1,15,4,0
2,Lin Lu,MALE,20/1/1998,0.0,1998-01-20,1998,1,20,1,0
3,Ron Guy,MALE,5/6/1997,2000.0,1997-05-06,1997,5,6,1,0


In [25]:
#create last name and first name from Name
df5['Firstname'] = df5.Name.apply(lambda x:x.split()[0])
df5['Lastname'] = df5.Name.apply(lambda x:x.split()[1])
df5

Unnamed: 0,Name,Gender,Birthday,Tuition,Birthday_clean,Year,Month,Day,DayOfWeek,Hour,Firstname,Lastname
0,Kim Mend,FEMAL,3/31/1998,10125.0,1998-03-31,1998,3,31,1,0,Kim,Mend
1,Rob Brun,MALE,"Jan 15, 1999",9999.0,1999-01-15,1999,1,15,4,0,Rob,Brun
2,Lin Lu,MALE,20/1/1998,0.0,1998-01-20,1998,1,20,1,0,Lin,Lu
3,Ron Guy,MALE,5/6/1997,2000.0,1997-05-06,1997,5,6,1,0,Ron,Guy


In the above example, the lambda function `lambda x:x.split()[0]` takes one string argument, x. `x.split()[0]` first calls the split function on x, which splits x on whitespace, and then takes the first word, which is the first name. Similarly, the second word is the last name.
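For this particular task there is also a vectorized alternative that does not use a lambda function: `str.split()` with `expand=True` returns the pieces as separate columns. This is a sketch of that different technique on a standalone Series (`names`), not the approach used in the cell above.

```python
import pandas as pd

names = pd.Series(['Kim Mend', 'Rob Brun', 'Lin Lu', 'Ron Guy'], name='Name')

# n=1 splits at the first whitespace only; expand=True returns a DataFrame
parts = names.str.split(n=1, expand=True)
parts.columns = ['Firstname', 'Lastname']
print(parts)
```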

#### Apply Lambda Function on DataFrame Rows/Columns

You may also apply a lambda function on DataFrame rows or columns with the syntax `df.apply(lambda x: expression(x), axis=1)`. Notice that there's a new argument, `axis`. This is because when `apply()` is called on a DataFrame, x is either a column or a row of the DataFrame. `axis=0` (the default) means x is a column of the DataFrame; `axis=1` means x is a row. In most cases, we apply the lambda function on each row, which means `axis=1`. In the following example, we will create a new datetime column based on the year, month and day columns.

In [26]:
from datetime import datetime
df5['Birthday_created'] = df5.apply(lambda x: datetime(x.Year, x.Month, x.Day), axis=1)
df5

Unnamed: 0,Name,Gender,Birthday,Tuition,Birthday_clean,Year,Month,Day,DayOfWeek,Hour,Firstname,Lastname,Birthday_created
0,Kim Mend,FEMAL,3/31/1998,10125.0,1998-03-31,1998,3,31,1,0,Kim,Mend,1998-03-31
1,Rob Brun,MALE,"Jan 15, 1999",9999.0,1999-01-15,1999,1,15,4,0,Rob,Brun,1999-01-15
2,Lin Lu,MALE,20/1/1998,0.0,1998-01-20,1998,1,20,1,0,Lin,Lu,1998-01-20
3,Ron Guy,MALE,5/6/1997,2000.0,1997-05-06,1997,5,6,1,0,Ron,Guy,1997-05-06
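To make the difference between the two `axis` values concrete, here is a minimal sketch on a made-up numeric DataFrame (`df_axis`). With `axis=0` the lambda receives each column and the result has one value per column; with `axis=1` it receives each row and the result has one value per row.

```python
import pandas as pd

df_axis = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0: the lambda is called once per column
col_max = df_axis.apply(lambda col: col.max(), axis=0)

# axis=1: the lambda is called once per row
row_sum = df_axis.apply(lambda row: row['a'] + row['b'], axis=1)

print(col_max)
print(row_sum)
```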


-----

## Ancillary Information

The following links are to additional documentation that you might find helpful in learning this material. Reading these web-accessible documents is completely optional.

1. [Pandas documentation][pdd]
2. A complete Pandas [tutorial][pdt]
3. The [Pandas chapter][pdc] from the book _Python Data Science Handbook_ by Jake VanderPlas

-----

[pdd]: http://pandas.pydata.org/pandas-docs/stable/index.html
[pdt]: https://github.com/TomAugspurger/effective-pandas
[pdc]: http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.00-Introduction-to-Pandas.ipynb

**&copy; 2019: Gies College of Business at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode