# Data Cleaning

* Anywhere from 50% to 80% of data science is data cleaning
    * of course I hear 70% of statistics are made up on the spot
* Dealing with dirty data is a fact of life when doing data intensive research
* Especially if you are collecting or creating the data yourself
* Fortunately, Pandas is excellent at data cleaning and once you get the hang of it you might even enjoy it!


In [1]:
# load the necessary libraries
import pandas as pd
import numpy as np


## Missing Values 

* One of the challenges you may face when working with messy data is *missing* or **null** values
* There are multiple conventions for representing null values when doing data science in Python
* There is a Pythonic way using the `None` object
* There is a Numpy/Pandas-y way using `NaN`

### None - Pythonic Missing Data

* `None` is the standard way of representing nothing in plain Python
* It is useful, but it is also a complex data structure
* It can be stored alongside numbers in an array, but only by falling back to a generic data type

In [2]:
# create a numpy array of numbers and a null value represented by None
some_numbers = np.array([1,None,3,4])
some_numbers

array([1, None, 3, 4], dtype=object)

* Because every element of a NumPy array (and of a Pandas Series/column) has to share one data type, NumPy falls back to the most general and least efficient data type for the array, `object`
    * Note: Pandas will automatically convert `None` to `NaN` (Not a Number) in numeric data, so we use `np.array` here
* This means any operations running over the array/column/series are going to run slower than they could if the data type was numeric

In [3]:
# create an array of objects and an array of integers
# compute their sum and time how long it takes
for dtype in ['object','int']:
    print("data type = ", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

data type =  object
94.8 ms ± 3.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

data type =  int
2.7 ms ± 842 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)



* Notice the integer array was ***a lot*** faster than the object array
* Also, the vectorized math operations don't like `None`

In [None]:
some_numbers.sum()
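
Running the cell above raises an error instead of returning a number. A minimal sketch of what to expect, wrapping the call in `try`/`except` so the rest of a notebook keeps running (the variable `outcome` is only there for illustration, and the exact message text may vary between versions):

```python
import numpy as np

# an object array holding integers and a None
some_numbers = np.array([1, None, 3, 4])

try:
    some_numbers.sum()
    outcome = "no error"
except TypeError as err:
    # adding an int and None is not defined in Python
    outcome = "TypeError"
    print("TypeError:", err)
```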

### NaN - Numpy/Pandas-y Missing Numeric data

* The Numpy third-party library has a mechanism for representing missing numeric values
* Under the hood, NaNs are standards-compliant (IEEE 754) floating point values
    * Note for R users: There is no `Null` only `NaN`
* This means you can use them with other numeric arrays for fast computations

In [5]:
# Create a numeric Pandas Series with a missing value
nanny = pd.Series([1, np.nan, 3, 4])
nanny.dtype

dtype('float64')

* Now we can use all the fast and easy computations in Pandas without worrying about missing values

In [6]:
# compute the sum of all the numbers in the Series
nanny.sum()

8.0
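
Pandas returns `8.0` here because its aggregations skip `NaN` by default. Plain NumPy does not, so the same numbers can give a different answer depending on which library does the sum. A small sketch of the difference (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

values = [1, np.nan, 3, 4]

# plain NumPy: NaN propagates through arithmetic, so the sum is NaN
numpy_sum = np.array(values).sum()

# NumPy's NaN-aware variant ignores the missing value
numpy_nansum = np.nansum(np.array(values))

# Pandas skips NaN by default; skipna=False restores the NumPy behavior
pandas_sum = pd.Series(values).sum()
pandas_strict_sum = pd.Series(values).sum(skipna=False)

print(numpy_sum, numpy_nansum, pandas_sum, pandas_strict_sum)
```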

## Operating on Null Values

* There are four functions in Pandas that are useful for working with missing data
* The examples below operate on Series, but they can work on Dataframes as well


### Null value functions

* `isna()` - Generate a boolean mask of the missing values (can also use `isnull()`)
* `notna()` - Do the opposite of `isna()` (can also use `notnull()`)
* `dropna()` - Create a filtered copy of the data with no null values
* `fillna(value)` - Create a copy of the data with null values filled in

In [7]:
# display the Series
nanny

0    1.0
1    NaN
2    3.0
3    4.0
dtype: float64

In [8]:
# what values are null
nanny.isna()

0    False
1     True
2    False
3    False
dtype: bool

In [9]:
# what values are not null
nanny.notna()

0     True
1    False
2     True
3     True
dtype: bool

* These masks can be used to filter the data and create a view of missing or not missing 

In [10]:
# not super useful in a Series, but handy with Dataframes
nanny[nanny.isna()]

1   NaN
dtype: float64
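
With a DataFrame, the same kind of mask picks out whole rows. A minimal sketch using a made-up table (the `people` frame and its column names are hypothetical, not part of the lesson data):

```python
import numpy as np
import pandas as pd

# hypothetical toy table with one missing weight
people = pd.DataFrame({"name": ["Ann", "Bob", "Cy"],
                       "weight": [121.0, np.nan, 183.0]})

# rows where weight is missing
missing_weight = people[people["weight"].isna()]

# rows where weight is present
has_weight = people[people["weight"].notna()]

print(missing_weight)
print(has_weight)
```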

* Rather than creating a view, we can create *copies* of the data with the null values removed or filled in

In [11]:
# Just get rid of all the null values
no_null_nanny = nanny.dropna()
no_null_nanny

0    1.0
2    3.0
3    4.0
dtype: float64

In [12]:
# fill in the null values with zero
fill_null_nanny = nanny.fillna(0)
fill_null_nanny

0    1.0
1    0.0
2    3.0
3    4.0
dtype: float64

In [13]:
# fill in the null values with a different value
fill_null_nanny = nanny.fillna(999)
fill_null_nanny

0      1.0
1    999.0
2      3.0
3      4.0
dtype: float64

In [14]:
# The original nanny Series remains untouched #noreboot
# Fran Drescher frowns with disappointment
nanny

0    1.0
1    NaN
2    3.0
3    4.0
dtype: float64

* These functions work with dataframes as well
* But you will need to pay closer attention to what they are doing

In [15]:
df_nanny = pd.DataFrame([[1, np.nan, 2],
                        [2, 3, 5],
                        [np.nan, 4, 6]])
df_nanny

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


* Dropping null values with `dropna()` removes the entire axis (row or column) and returns a new copy of the dataframe
* You can specify dropping rows or columns with the axis parameter

In [16]:
# dropna gets rid of rows by default
df_nanny.dropna() # axis="rows" or axis=0

Unnamed: 0,0,1,2
1,2.0,3.0,5


In [17]:
# use the axis="columns" or axis=1 to drop columns
df_nanny.dropna(axis="columns")

Unnamed: 0,2
0,2
1,5
2,6


* There are a couple of other parameters that let you specify other behaviors
* Like only dropping rows/columns with all null values or setting a threshold
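
Those two behaviors correspond to the `how` and `thresh` parameters of `dropna()`. A short sketch on a toy frame (`df_demo` is made up for illustration so that one row is entirely null):

```python
import numpy as np
import pandas as pd

# toy frame: row 1 is entirely null, row 2 has a single non-null value
df_demo = pd.DataFrame([[1, np.nan, 2],
                        [np.nan, np.nan, np.nan],
                        [np.nan, 4, np.nan]])

# how="all" drops a row only if every value in it is null
dropped_all = df_demo.dropna(how="all")

# thresh=2 keeps only the rows with at least 2 non-null values
dropped_thresh = df_demo.dropna(thresh=2)

print(dropped_all)
print(dropped_thresh)
```

Both calls return new DataFrames and leave `df_demo` unchanged.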

## Working with null values in real data

* Here is an example of some real data, the diabetes data from week 2

In [18]:
# Import data file into a Pandas dataframe
df = pd.read_csv("../2 - data python two/diabetes.csv")

# Display the first 5 rows of the data
df.head()

Unnamed: 0,id,chol,stab.glu,hdl,age,gender,height,weight,frame,bp.1s,bp.1d
0,1000,203.0,82,56.0,46,female,62.0,121.0,medium,118.0,59.0
1,1001,165.0,97,24.0,29,female,64.0,218.0,large,112.0,68.0
2,1002,228.0,92,37.0,58,female,61.0,256.0,large,190.0,92.0
3,1003,78.0,93,12.0,67,male,67.0,119.0,large,110.0,50.0
4,1005,249.0,90,28.0,64,male,68.0,183.0,medium,138.0,80.0


In [19]:
# Display the metadata about the data, making sure to display null values
df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 403 entries, 0 to 402
Data columns (total 11 columns):
id          403 non-null int64
chol        402 non-null float64
stab.glu    403 non-null int64
hdl         402 non-null float64
age         403 non-null int64
gender      403 non-null object
height      398 non-null float64
weight      402 non-null float64
frame       391 non-null object
bp.1s       398 non-null float64
bp.1d       398 non-null float64
dtypes: float64(6), int64(3), object(2)
memory usage: 34.7+ KB


* If we look closely at this information we can see there are a few null values in this dataset
* There are 403 rows, but some columns have fewer than 403 non-null values
* Now let's check which values in the dataset are missing

In [20]:
# Create a boolean mask where True indicates a null value
df.isna().head()

Unnamed: 0,id,chol,stab.glu,hdl,age,gender,height,weight,frame,bp.1s,bp.1d
0,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False


* Gak! Too much data, how can we just get a quick count of the null values?
* What if we combined `isna()` with the `sum()` function?

In [21]:
# Use the sum function to count the True values in the boolean mask
df.isna().sum()

id           0
chol         1
stab.glu     0
hdl          1
age          0
gender       0
height       5
weight       1
frame       12
bp.1s        5
bp.1d        5
dtype: int64

* If we wanted to look at a specific column we can do the same operation 
* These functions work with Series as well as DataFrames

In [22]:
# How many null values in the chol column
df["chol"].isnull().sum()

1
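
Counts are easier to judge as a share of the rows. Because `True` counts as 1, the mean of a boolean mask is the fraction of missing values. A minimal sketch on a made-up frame (the `toy` data is illustrative, not the diabetes dataset):

```python
import numpy as np
import pandas as pd

# made-up data with one missing chol value and two missing frame values
toy = pd.DataFrame({"chol": [203.0, np.nan, 228.0, 78.0],
                    "frame": ["medium", "large", None, None]})

# the mean of a boolean mask is the fraction of True values per column
fraction_missing = toy.isna().mean()
print(fraction_missing)
```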

* Now let's deal with missing values
* Solution 1: Remove rows with empty values
* If there are only a few null values and you know that deleting values will not cause adverse effects on your result, remove them from your DataFrame
* Make sure to save the new dataframe to a new variable!

In [23]:
# Display missing value counts
print("Missing values before dropping rows: ")
print(df.isnull().sum())


# Drop rows with null values and display the new counts
mod_df = df.dropna() # make a copy of the dataframe with null values removed
print("Missing values after dropping rows: ")
print(mod_df.isnull().sum())


Missing values before dropping rows: 
id           0
chol         1
stab.glu     0
hdl          1
age          0
gender       0
height       5
weight       1
frame       12
bp.1s        5
bp.1d        5
dtype: int64
Missing values after dropping rows: 
id          0
chol        0
stab.glu    0
hdl         0
age         0
gender      0
height      0
weight      0
frame       0
bp.1s       0
bp.1d       0
dtype: int64


### EXERCISE

A reviewer of the article you submitted to the most prestigious journal in your field loves your analysis but doesn't like the fact that you dropped rows with missing cholesterol values. You can't drop them and you can't just put in zero, so you need to identify a technique to deal with those missing values: some kind of *imputation* that *fills in* a new value in place of the null values. Hopefully it won't drastically change the interpretation!

1. Create a new `filler_value` by deriving a number (mean, median or something else) from the column of cholesterol values (`df['chol']`)
2. Use the `fillna()` function to fill in the missing values of the cholesterol column


In [38]:
# Put your code here

## Create a filler value (note the parentheses: mean() has to be called)
filler_value = df['chol'].mean()

# assign the filled column back to the DataFrame
df['chol'] = df['chol'].fillna(filler_value)

df['chol'].head(20)

0     203.0
1     165.0
2     228.0
3      78.0
4     249.0
5     248.0
6     195.0
7     227.0
8     177.0
9     263.0
10    242.0
11    215.0
12    238.0
13    183.0
14    191.0
15    213.0
16    255.0
17    230.0
18    194.0
19    196.0
Name: chol, dtype: float64

### Solution

* One quick and easy way is to fill in missing values with the mean value of a given column

In [None]:
# Find the mean
filler_value = df["chol"].mean()
filler_value

In [41]:
# Fill missing values with a mean (average) value of a given column
# Assigning the result back to df["chol"] overwrites the data in the existing dataset
# (fillna(..., inplace=True) on a selected column may not update df in recent Pandas versions)
df["chol"] = df["chol"].fillna(filler_value)
df.isnull().sum()

id           0
chol         0
stab.glu     0
hdl          1
age          0
gender       0
height       5
weight       1
frame       12
bp.1s        5
bp.1d        5
dtype: int64

* No more null values in the `chol` column
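
The mean is only one choice of filler. As a sketch of two alternatives, the example below fills a numeric column with its median, which is less sensitive to outliers, and fills several columns at once by passing a dict to `fillna()`. The `toy` frame and the `"unknown"` label are made up for illustration:

```python
import numpy as np
import pandas as pd

# toy stand-in for the diabetes data (values are made up)
toy = pd.DataFrame({"chol": [200.0, np.nan, 180.0, 400.0],
                    "frame": ["medium", None, "large", "large"]})

# median() skips the missing value, like mean() does
chol_median = toy["chol"].median()

# a dict maps each column name to its own filler value
filled = toy.fillna({"chol": chol_median, "frame": "unknown"})

print(filled)
```

Because `fillna()` returns a copy here, `toy` still contains its missing values afterwards.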

## Vectorized String Operations

* If you are dealing with textual or categorical data, you often have to clean strings
* Pandas has a set of *Vectorized String Operations* that are much faster and easier than the Python equivalents 
* Especially handling bad data!

In [40]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']

for s in data:
    print(s.capitalize())

Peter
Paul
Mary
Guido


* But like above, this breaks very easily with missing values

In [None]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

for s in data:
    print(s.capitalize())

* The Pandas library has *vectorized string operations* that handle missing data

In [42]:
# convert our list into a Series
names = pd.Series(data)
names

0    peter
1     Paul
2     MARY
3    gUIDO
dtype: object

In [43]:
# Use the string vector function to capitalize everything
names.str.capitalize()

0    Peter
1     Paul
2     Mary
3    Guido
dtype: object

* Look ma! No errors!
* Pandas includes a bunch of methods for doing things to strings.
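
To see the missing-data handling explicitly, here is a small sketch that keeps the `None` entry in the list before building the Series (the names `names_with_gap` and `capitalized` are illustrative):

```python
import pandas as pd

# the same names as above, with a missing entry in the middle
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
names_with_gap = pd.Series(data)

# the missing entry stays missing instead of raising an AttributeError
capitalized = names_with_gap.str.capitalize()
print(capitalized)
```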

|  Functions  |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |

### Exercise

* In the cells below, try three of the string operations listed above on the Pandas Series `monte`
* Remember, you can hit tab to autocomplete and shift-tab to see documentation

In [44]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
monte

0    Graham Chapman
1       John Cleese
2     Terry Gilliam
3         Eric Idle
4       Terry Jones
5     Michael Palin
dtype: object

In [45]:
# First
monte.str.split()


0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object

In [46]:
# Second
monte.str.len()




0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

In [49]:
# Third
monte.str.endswith('n')




0     True
1    False
2    False
3    False
4    False
5     True
dtype: bool

### String Vector Operations with Real Data

* Let's try some string vector operations using real data!

In [51]:
# open the chipotle data and look at the first 5 rows
orders = pd.read_csv("../4 - data management one/chipotle.tsv", sep="\t")
orders.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,2.39
1,1,1,Izze,[Clementine],3.39
2,1,1,Nantucket Nectar,[Apple],3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",16.98


We have loaded the data from a local tab-separated file into a dataframe.

In [52]:
# get the rows and columns of the dataframe
orders.shape

(4622, 5)

* We see there are 4,622 rows and 5 columns.
* Let's take a look at the row at position 4 (the fifth row) to see what textual information we have:

In [53]:
# display the row at position 4 (the fifth row) of the DataFrame
orders.iloc[4]

order_id                                                              2
quantity                                                              2
item_name                                                  Chicken Bowl
choice_description    [Tomatillo-Red Chili Salsa (Hot), [Black Beans...
item_price                                                        16.98
Name: 4, dtype: object

* We can use Vectorized String Operations to explore the textual data

In [54]:
# Summarize the length of the choice_description string
orders['choice_description'].str.len().describe()

count    3376.000000
mean       64.215344
std        31.629927
min         6.000000
25%        49.000000
50%        68.000000
75%        83.000000
max       201.000000
Name: choice_description, dtype: float64

In [55]:
# which row has the longest ingredients string
orders['choice_description'].str.len().idxmax()

3659

In [56]:
# use loc to fetch that specific row (idxmax returns an index label)
orders.loc[3659]

order_id                                                           1463
quantity                                                              1
item_name                                                   Veggie Bowl
choice_description    [[Fresh Tomato Salsa (Mild), Tomatillo-Green C...
item_price                                                         8.49
Name: 3659, dtype: object

In [57]:
# use loc to fetch the max row automatically
orders.loc[orders['choice_description'].str.len().idxmax()]

order_id                                                           1463
quantity                                                              1
item_name                                                   Veggie Bowl
choice_description    [[Fresh Tomato Salsa (Mild), Tomatillo-Green C...
item_price                                                         8.49
Name: 3659, dtype: object

In [58]:
# only look at the description string
orders.loc[orders['choice_description'].str.len().idxmax(), 'choice_description']

'[[Fresh Tomato Salsa (Mild), Tomatillo-Green Chili Salsa (Medium), Roasted Chili Corn Salsa (Medium), Tomatillo-Red Chili Salsa (Hot)], [Pinto Beans, Rice, Fajita Veggies, Cheese, Sour Cream, Lettuce]]'

* WOW! That is a lot of ingredients! It looks like that string is semi-structured, I wonder if we can do something with it...
* We could start by doing some string matching

In [59]:
# How many orders contain salsa
orders['choice_description'].str.contains('Salsa').sum()

2808

* Note, you can use dot notation with column names
* This is useful because then you can use autocomplete with the string vector functions

In [60]:
# How many orders contain salsa
orders.choice_description.str.contains('Salsa').sum()

2808

In [61]:
# How many Burritos
orders.item_name.str.contains("Burrito").sum()

1172

In [62]:
# How many burritos...capitalization matters!
orders.item_name.str.contains("burrito").sum()

0
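
If capitalization should not matter, `str.contains()` accepts `case=False`. It also accepts `na=False`, which treats missing values as "no match" so the result is a clean boolean mask. A minimal sketch on made-up item names (the `items` Series is hypothetical, not the Chipotle data):

```python
import pandas as pd

# toy item names, including a missing value
items = pd.Series(["Chicken Burrito", "steak burrito", "Chips", None])

# case=False ignores capitalization; na=False counts missing values as "no match"
burrito_mask = items.str.contains("burrito", case=False, na=False)

print(burrito_mask.sum())
```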

* Let's find the burrito with the most items in it

In [66]:
# build a boolean mask of the rows whose item name contains "Burrito"
burrito_mask = orders.item_name.str.contains("Burrito")
burrito_mask.head()

0    False
1    False
2    False
3    False
4    False
Name: item_name, dtype: bool

In [67]:
# get the id of the burrito with the longest description
max_burrito_id = orders[burrito_mask]["choice_description"].str.len().idxmax()
max_burrito_id

1320

In [68]:
# get the description column of the row with the max_burrito_id
# (idxmax returns an index label, so look it up with loc)
orders.loc[max_burrito_id, "choice_description"]

'[[Tomatillo-Green Chili Salsa (Medium), Roasted Chili Corn Salsa (Medium), Tomatillo-Red Chili Salsa (Hot)], [Pinto Beans, Rice, Fajita Veggies, Cheese, Sour Cream, Guacamole, Lettuce]]'

* That is a LOADED BURRITO!
* This data is interesting, but not very useful because it is one big string
* But we can probably do more with that `choice_description` column
* Let's pretend [it doesn't look like Python code](https://stackoverflow.com/questions/33281450/right-way-to-use-eval-statement-in-pandas-dataframe-map-function) and instead treat it as a comma separated list
* What string function could we use?

|  Functions  |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |


In [70]:
# Use the split function to break the description string apart at each comma
orders.choice_description.str.split(",").head()

0                                                  NaN
1                                       [[Clementine]]
2                                            [[Apple]]
3                                                  NaN
4    [[Tomatillo-Red Chili Salsa (Hot),  [Black Bea...
Name: choice_description, dtype: object

* But what about those pesky brackets! Let's get rid of them!

In [71]:
# remove the left brackets
orders.choice_description.str.replace("[","" ).head()

0                                                  NaN
1                                          Clementine]
2                                               Apple]
3                                                  NaN
4    Tomatillo-Red Chili Salsa (Hot), Black Beans, ...
Name: choice_description, dtype: object

In [72]:
# remove the left and right brackets
orders.choice_description.str.replace("[","" ).str.replace("]","").head()

0                                                  NaN
1                                           Clementine
2                                                Apple
3                                                  NaN
4    Tomatillo-Red Chili Salsa (Hot), Black Beans, ...
Name: choice_description, dtype: object
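
A note on `str.replace()`: `[` and `]` are special characters in regular expressions, and the default of the `regex` parameter has differed between Pandas releases. Spelling it out removes the ambiguity. The sketch below shows a literal replacement and, as an alternative, a single regular expression that matches either bracket; the `desc` Series is a made-up sample in the same shape as the Chipotle descriptions:

```python
import pandas as pd

# two sample descriptions plus a missing one
desc = pd.Series(["[Clementine]",
                  "[Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice]]",
                  None])

# option 1: literal replacement, one bracket at a time
literal = desc.str.replace("[", "", regex=False).str.replace("]", "", regex=False)

# option 2: one regular expression matching either bracket
pattern = desc.str.replace(r"[\[\]]", "", regex=True)

print(literal.tolist())
```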

In [73]:
# remove the left and right brackets and split on commas
orders.choice_description.str.replace("[","" ).str.replace("]","").str.split(",").head(10)

0                                                  NaN
1                                         [Clementine]
2                                              [Apple]
3                                                  NaN
4    [Tomatillo-Red Chili Salsa (Hot),  Black Beans...
5    [Fresh Tomato Salsa (Mild),  Rice,  Cheese,  S...
6                                                  NaN
7    [Tomatillo Red Chili Salsa,  Fajita Vegetables...
8    [Tomatillo Green Chili Salsa,  Pinto Beans,  C...
9    [Fresh Tomato Salsa,  Rice,  Black Beans,  Pin...
Name: choice_description, dtype: object

* Wait what!? The brackets are back!(*@&#^$
* Yes, but now they indicate Python lists instead of `[` and `]` characters (confusing, I know)
* How can we grab items from those lists of ingredients?

In [74]:
# remove the left and right brackets and split on commas and grab the first element
orders.choice_description.str.replace("[","" ).str.replace("]","").str.split(",").str[0].head()

0                                NaN
1                         Clementine
2                              Apple
3                                NaN
4    Tomatillo-Red Chili Salsa (Hot)
Name: choice_description, dtype: object

In [75]:
# remove the left and right brackets and split on commas and grab the last element
orders.choice_description.str.replace("[","" ).str.replace("]","").str.split(",").str[-1].head()

0            NaN
1     Clementine
2          Apple
3            NaN
4     Sour Cream
Name: choice_description, dtype: object

In [76]:
# remove the left and right brackets and split on commas and grab the first 3 elements
orders.choice_description.str.replace("[","" ).str.replace("]","").str.split(",").str[0:3].head()

0                                                  NaN
1                                         [Clementine]
2                                              [Apple]
3                                                  NaN
4    [Tomatillo-Red Chili Salsa (Hot),  Black Beans...
Name: choice_description, dtype: object

In [78]:
# Put the split descriptions into a new Series
split_description = orders.choice_description.str.replace("[","" ).str.replace("]","").str.split(",")
split_description.head()

0                                                  NaN
1                                         [Clementine]
2                                              [Apple]
3                                                  NaN
4    [Tomatillo-Red Chili Salsa (Hot),  Black Beans...
Name: choice_description, dtype: object

In [84]:
# look at the element at position 4604 of the split_description series
split_description.iloc[4604]

['Fresh Tomato Salsa', ' Rice', ' Black Beans', ' Cheese', ' Sour Cream']

* Every non-null item in the series is a list
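
Notice the leading spaces on `' Rice'` and the items after it: splitting on `","` keeps the space that followed each comma. One possible fix, sketched on a made-up sample string (the `cleaned` Series is illustrative), is to split on the comma and the space together:

```python
import pandas as pd

# one cleaned description string plus a missing value
cleaned = pd.Series(["Fresh Tomato Salsa, Rice, Black Beans", None])

# splitting on "," alone leaves a leading space on later items
with_spaces = cleaned.str.split(",")

# splitting on ", " (comma plus space) avoids it
without_spaces = cleaned.str.split(", ")

print(with_spaces[0])
print(without_spaces[0])
```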

In [85]:
# Count how many items are in each description list
split_description.str.len().fillna(0).head(10)

0    0.0
1    1.0
2    1.0
3    0.0
4    5.0
5    6.0
6    0.0
7    8.0
8    5.0
9    7.0
Name: choice_description, dtype: float64

In [82]:
split_description.value_counts().head(10)

[Diet Coke]                                                                              134
[Coke]                                                                                   123
[Sprite]                                                                                  77
[Fresh Tomato Salsa,  Rice,  Black Beans,  Cheese,  Sour Cream,  Lettuce]                 42
[Fresh Tomato Salsa,  Rice,  Black Beans,  Cheese,  Sour Cream,  Guacamole,  Lettuce]     40
[Fresh Tomato Salsa (Mild),  Pinto Beans,  Rice,  Cheese,  Sour Cream]                    36
[Lemonade]                                                                                33
[Fresh Tomato Salsa,  Rice,  Black Beans,  Cheese,  Sour Cream]                           33
[Fresh Tomato Salsa,  Rice,  Cheese,  Sour Cream,  Lettuce]                               29
[Fresh Tomato Salsa,  Rice,  Black Beans,  Cheese]                                        28
Name: choice_description, dtype: int64
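
The counts above treat each whole list as one value, and since Python lists are not hashable, this call may not work in every Pandas version. To count individual ingredients instead, one option is `explode()`, which gives each list element its own row. A minimal sketch on a small stand-in for `split_description` (the `split_demo` data is made up):

```python
import pandas as pd

# small stand-in for split_description: lists of ingredients and a missing row
split_demo = pd.Series([["Fresh Tomato Salsa", " Rice", " Cheese"],
                        ["Coke"],
                        None,
                        ["Fresh Tomato Salsa", " Rice"]])

# explode() gives each list element its own row; strip() removes stray spaces
ingredients = split_demo.explode().str.strip()

# value_counts() leaves out the missing row by default
ingredient_counts = ingredients.value_counts()
print(ingredient_counts)
```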