# 7 Cleaning data and preparation

## Handling Missing Data

- Loading, cleaning, transforming, and rearranging data are often reported to take up 80% of an analyst's time.
- One of the goals of pandas is to make working with missing data as painless as possible.
- For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. We call this a sentinel value because it can be easily detected.

In [1]:
import pandas as pd
import numpy as np

In [3]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [4]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In pandas, we adopted a convention used in the R programming language by referring to missing data as NA, which stands for 'Not Available'. 

When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data.

The Python value ```None``` is also treated as NA in object arrays.

In [5]:
string_data[0] = None

In [6]:
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

Some NA handling methods:

Method | Description
---------|-------------
dropna | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate
fillna | Fill in missing data with some value or using an interpolation method such as ```ffill``` or ```bfill```
isnull | Return boolean values indicating which values are missing
notnull | Negation of ```isnull```
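
As a minimal sketch of how these four methods fit together (the Series ```s``` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: two of the five values are missing.
s = pd.Series([1.0, np.nan, 3.5, None, 7.0])

missing_mask = s.isnull()             # True where a value is NA
present_mask = s.notnull()            # exact negation of isnull
n_missing = int(missing_mask.sum())   # each True counts as 1

dropped = s.dropna()                  # keeps only the non-NA entries
filled = s.fillna(0)                  # replaces every NA with 0

print(n_missing)
print(dropped.index.tolist())
print(filled.tolist())
```

Summing the boolean mask is a quick way to count missing values before deciding whether to drop or fill them.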

## Filtering out Missing Data

On a Series, ```dropna``` returns a new Series containing only the non-null data and their index values.

In [7]:
from numpy import nan as NA

In [8]:
data = pd.Series([1,NA,3.5,NA,7])

In [9]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [10]:
# equivalent to

data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, things are a little more complex. You may want to drop rows or columns that are all NA, or only those containing any NAs. By default, ```dropna``` drops any row containing a missing value.

In [19]:
data = pd.DataFrame([[1,2.2,3.5,6,7],[1,NA,4.5,NA,4],[NA,NA,NA,NA,NA],[1,NA,6.5,NA,0]])

In [20]:
cleaned = data.dropna()

In [21]:
data

Unnamed: 0,0,1,2,3,4
0,1.0,2.2,3.5,6.0,7.0
1,1.0,,4.5,,4.0
2,,,,,
3,1.0,,6.5,,0.0


In [23]:
cleaned # dropping all rows that contain NA

Unnamed: 0,0,1,2,3,4
0,1.0,2.2,3.5,6.0,7.0


In [24]:
# passing how='all' will only drop rows that are all NA

data.dropna(how='all')

Unnamed: 0,0,1,2,3,4
0,1.0,2.2,3.5,6.0,7.0
1,1.0,,4.5,,4.0
3,1.0,,6.5,,0.0


In [26]:
# to drop columns in the same way, pass axis=1

data[4] = NA
data

Unnamed: 0,0,1,2,3,4
0,1.0,2.2,3.5,6.0,
1,1.0,,4.5,,
2,,,,,
3,1.0,,6.5,,


In [27]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2,3
0,1.0,2.2,3.5,6.0
1,1.0,,4.5,
2,,,,
3,1.0,,6.5,


A related way to filter out DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing a certain number of observations. You can indicate this with the ```thresh``` argument:

In [28]:
df = pd.DataFrame(np.random.randn(7,3))

In [29]:
df.loc[:4, 1] = NA

In [30]:
df.iloc[:2, 2] = NA

In [31]:
df

Unnamed: 0,0,1,2
0,0.84299,,
1,0.320703,,
2,0.845518,,1.070171
3,-1.020682,,-1.231241
4,-1.798502,,-0.441074
5,0.674863,-0.717168,0.307461
6,-1.981553,1.606253,0.783189


In [32]:
df.dropna()

Unnamed: 0,0,1,2
5,0.674863,-0.717168,0.307461
6,-1.981553,1.606253,0.783189


In [33]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,0.845518,,1.070171
3,-1.020682,,-1.231241
4,-1.798502,,-0.441074
5,0.674863,-0.717168,0.307461
6,-1.981553,1.606253,0.783189
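
Because the frame above is random, the effect of ```thresh``` can be easier to see on fixed data. The following is a small sketch with a hypothetical frame whose rows hold 3, 2, 1 and 0 observations:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 3, 2, 1 and 0 non-NA values per row.
frame = pd.DataFrame(
    [[1.0, 2.0, 3.0],
     [1.0, np.nan, 3.0],
     [1.0, np.nan, np.nan],
     [np.nan, np.nan, np.nan]]
)

# thresh=N keeps a row only if it has at least N non-NA values.
kept_any = frame.dropna()            # default how='any': only complete rows
kept_two = frame.dropna(thresh=2)    # rows with at least 2 observations
kept_all = frame.dropna(how='all')   # drop only rows that are entirely NA

print(kept_any.index.tolist())
print(kept_two.index.tolist())
print(kept_all.index.tolist())
```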


## Filling in Missing data

Rather than filtering out missing data, and potentially discarding other data along with it, you can fill in the 'holes' in any number of ways.

For most purposes the ```fillna``` method is the workhorse function to use. 

In [34]:
# calling fillna with a constant replaces missing values with that value

df.fillna(0)

Unnamed: 0,0,1,2
0,0.84299,0.0,0.0
1,0.320703,0.0,0.0
2,0.845518,0.0,1.070171
3,-1.020682,0.0,-1.231241
4,-1.798502,0.0,-0.441074
5,0.674863,-0.717168,0.307461
6,-1.981553,1.606253,0.783189


In [36]:
# calling fillna with a dict, you can use a different fill value for each column

df.fillna({1: 0.5, 2: 0}) # col 1 we fill with 0.5, col 2 with 0

Unnamed: 0,0,1,2
0,0.84299,0.5,0.0
1,0.320703,0.5,0.0
2,0.845518,0.5,1.070171
3,-1.020682,0.5,-1.231241
4,-1.798502,0.5,-0.441074
5,0.674863,-0.717168,0.307461
6,-1.981553,1.606253,0.783189


In [37]:
# fillna returns a new object, but you can modify the existing object in place

_ = df.fillna(0, inplace=True)
df

Unnamed: 0,0,1,2
0,0.84299,0.0,0.0
1,0.320703,0.0,0.0
2,0.845518,0.0,1.070171
3,-1.020682,0.0,-1.231241
4,-1.798502,0.0,-0.441074
5,0.674863,-0.717168,0.307461
6,-1.981553,1.606253,0.783189


In [38]:
# The same interpolation methods available for reindexing can be used with fillna

df = pd.DataFrame(np.random.randn(6,3))

In [39]:
df.iloc[2:, 1] = NA

In [40]:
df.iloc[4:, 2] = NA

In [41]:
df

Unnamed: 0,0,1,2
0,0.415758,-0.423893,0.087863
1,-0.102544,-0.367362,0.907813
2,0.146272,,-1.319144
3,-0.063737,,-1.263808
4,-0.332929,,
5,0.629834,,


In [42]:
# fillna(method='ffill') is deprecated in recent pandas releases; ffill() is the equivalent
df.ffill()

Unnamed: 0,0,1,2
0,0.415758,-0.423893,0.087863
1,-0.102544,-0.367362,0.907813
2,0.146272,-0.367362,-1.319144
3,-0.063737,-0.367362,-1.263808
4,-0.332929,-0.367362,-1.263808
5,0.629834,-0.367362,-1.263808


In [43]:
# limit caps how many consecutive missing values are filled
df.ffill(limit=2)

Unnamed: 0,0,1,2
0,0.415758,-0.423893,0.087863
1,-0.102544,-0.367362,0.907813
2,0.146272,-0.367362,-1.319144
3,-0.063737,-0.367362,-1.263808
4,-0.332929,,-1.263808
5,0.629834,,-1.263808


In [45]:
# with fillna you can do lots of other things with a little creativity. 
# For example you might pass the mean or median value of a Series

data = pd.Series([1.,NA,3.5,NA,7])

In [46]:
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

#### fillna function arguments

Argument | Description
---------|------------
value | Scalar value or dict-like object to use to fill missing values
method | Interpolation method, ```ffill``` or ```bfill``` (deprecated in recent pandas releases in favor of the ```ffill``` and ```bfill``` methods)
axis | Axis to fill on; default axis = 0
inplace | Modify the calling object without making a copy
limit | For forward and backward filling, maximum number of consecutive periods to fill
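
The arguments above can be combined in several ways. As a hedged sketch on a made-up frame, the following fills each column with its own mean and, separately, forward fills at most one row:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns.
frame = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, 2.0, np.nan, np.nan],
})

# Passing a Series to fillna fills each column with its own value,
# here the column mean (NaN is skipped when the mean is computed).
by_mean = frame.fillna(frame.mean())

# Forward fill, but carry each value forward at most one row.
limited = frame.ffill(limit=1)

print(by_mean)
print(limited)
```

Note that forward filling cannot fill a gap at the very start of a column, since there is no earlier value to carry forward.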

# Data Transformation

## Removing duplicates

The DataFrame method ```duplicated``` returns a boolean Series indicating whether each row is a duplicate or not

In [48]:
data = pd.DataFrame({'k1':['one','two'] * 3 + ['two'], 'k2': [1,1,2,3,3,4,4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [49]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [50]:
# drop_duplicates returns a DataFrame where the duplicated array is False

data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


In [51]:
# you can specify a subset of columns to detect duplicates

data['v1'] = range(7)

In [52]:
data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [53]:
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


In [54]:
# duplicated and drop_duplicates by default keep the first observed value combination
# passing keep='last' will return the last one

data.drop_duplicates(['k1','k2'], keep='last')

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
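
To make the difference between the two ```keep``` settings easier to see, here is a small sketch on a hypothetical frame in which every key combination occurs twice:

```python
import pandas as pd

# Hypothetical frame: ('one', 1) and ('two', 1) each appear twice.
frame = pd.DataFrame({
    "k1": ["one", "two", "one", "two"],
    "k2": [1, 1, 1, 1],
    "v1": [10, 20, 30, 40],
})

# Only k1 and k2 are compared; v1 is ignored when looking for duplicates.
first = frame.drop_duplicates(["k1", "k2"])               # first occurrence
last = frame.drop_duplicates(["k1", "k2"], keep="last")   # last occurrence

print(first["v1"].tolist())
print(last["v1"].tolist())
```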


### Transforming Data using a function or Mapping

For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a Dataframe. 

In [60]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', ' bacon', 'Pastrami',
 'corned beef', 'Bacon', 'pastrami', 
 'honey ham', 'nova lox'], 
 'ounces':[4,3,12,6,7.5,8,3,5,6]})

data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


Suppose you want to add a column indicating the type of animal each food came from

In [61]:
meat_to_animal = {'bacon':'pig',
'pulled pork':'pig',
'pastrami':'cow', 
'corned beef':'cow',
'honey ham':'pig', 
'nova lox':'salmon'}

The ```map``` method on a Series accepts a function or dict-like object containing a mapping, but here we have a small problem in that some of the meats are capitalized and others are not.

We need to convert each value to lowercase using ```str.lower```.

In [62]:
lowercased = data['food'].str.lower()
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [64]:
data['animal'] = lowercased.map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


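In [71]:
# we could also have passed a function that does all the work;
# strip() is needed because ' bacon' in row 2 has a leading space,
# which is also why that row maps to a missing value in the output above

data['food'].map(lambda x: meat_to_animal[x.strip().lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object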

Using ```map``` is a convenient way to perform element-wise transformations and other data cleaning-related operations.
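
Mapping with a dict and mapping with a function behave differently when a key is missing. The sketch below uses made-up labels and a hypothetical ```"unknown"``` default to illustrate the difference:

```python
import pandas as pd

# Hypothetical, deliberately messy labels.
food = pd.Series(["bacon", " Bacon", "PASTRAMI", "tofu"])
meat_to_animal = {"bacon": "pig", "pastrami": "cow"}

# Normalize whitespace and case before the lookup.
cleaned = food.str.strip().str.lower()

# Mapping with a dict leaves unmatched keys as missing values ...
animal = cleaned.map(meat_to_animal)

# ... while dict.get lets a function-based map supply a default
# instead of raising KeyError.
animal_default = cleaned.map(lambda x: meat_to_animal.get(x, "unknown"))

print(animal.tolist())
print(animal_default.tolist())
```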

### Replacing values

```replace``` provides a simple and flexible way of modifying a subset of values.

In [72]:
data = pd.Series([1., -999.,2.,-999.,-1000.,3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

The -999 values may be sentinel values for missing data. To replace these with NA values that pandas understands, we can use ```replace```, producing a new Series.

In [73]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [74]:
# if you want to replace multiple values at once

data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [75]:
# to use a different replacement for each value, pass a list of substitutes

data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [76]:
# the argument passed can also be a dict

data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
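
One reason to replace sentinels with NA is that pandas statistics skip NA values by default. A minimal sketch, using hypothetical readings:

```python
import numpy as np
import pandas as pd

# Hypothetical readings where -999 marks a missing measurement.
readings = pd.Series([1.0, -999.0, 2.0, -999.0, 3.0])

# With the sentinel left in, the mean is meaningless.
raw_mean = readings.mean()

# After replacing the sentinel with NaN, pandas skips it automatically.
clean = readings.replace(-999, np.nan)
clean_mean = clean.mean()

print(raw_mean)
print(clean_mean)
```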

### Renaming Axis Indexes

In [78]:
data = pd.DataFrame(np.arange(12).reshape((3,4)), 
    index=['ohio', 'utah', 'new york'], 
    columns=['one', 'two', 'three', 'four'])

In [79]:
transform = lambda x: x[:4].upper()

In [80]:
data.index.map(transform)

Index(['OHIO', 'UTAH', 'NEW '], dtype='object')

In [81]:
# you can assign to index, modifying the DataFrame in-place

data.index = data.index.map(transform)
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
UTAH,4,5,6,7
NEW,8,9,10,11


In [82]:
# if you want to create a transformed version of a dataset without modifying the original, you can use rename

data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Utah,4,5,6,7
New,8,9,10,11


In [83]:
# rename can be used with a dict-like object providing new values for a subset of the axis labels

data.rename(index={'OHIO': 'INDIANA'},columns={'three':'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
UTAH,4,5,6,7
NEW,8,9,10,11
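
The point that ```rename``` leaves the original untouched can be checked directly. This sketch uses a smaller, hypothetical frame and mixes a function (for the index) with a dict (for the columns):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=["ohio", "utah"],
    columns=["one", "two", "three"],
)

# rename returns a new object; the original labels are untouched.
renamed = frame.rename(index=str.upper, columns={"three": "THREE"})

print(renamed.index.tolist(), renamed.columns.tolist())
print(frame.index.tolist(), frame.columns.tolist())
```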


### Discretization and Binning

(e.g. grouping continuous values into bins)

In [84]:
ages = [20,22,25,27,21,23,37,31,61,45,41,32]
bins=[18,25,35,60,100]

In [85]:
cats = pd.cut(ages,bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object returned is a special Categorical object. The output describes the bins computed by ```pandas.cut```.
You can treat it like an array of strings indicating the bin name; internally it contains a ```categories``` array specifying the distinct category names along with a labeling for the ages data in the ```codes``` attribute.

In [86]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [87]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')

In [88]:
# the top-level pd.value_counts is deprecated in recent pandas releases
pd.Series(cats).value_counts()

(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
dtype: int64

A parenthesis means that the side is open (exclusive), while a square bracket means that it is closed (inclusive). You can change which side is closed by passing ```right=False```.

In [89]:
pd.cut(ages,[18,26,36,61,100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64, left]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]

For more on cut, read page 208

## Detecting and Filtering Outliers

In [90]:
data = pd.DataFrame(np.random.randn(1000,4))

In [91]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.038092,-0.022077,0.006957,-0.025138
std,0.993568,1.002833,1.028322,0.993141
min,-3.013759,-3.020988,-3.375419,-2.865589
25%,-0.663872,-0.658286,-0.709772,-0.682107
50%,0.034277,-0.010671,0.024882,-0.062806
75%,0.717087,0.680254,0.705795,0.651943
max,3.60567,2.982751,2.983483,3.004844


In [92]:
# suppose you want to find values in one column that exceed 3 in absolute value

col = data[2]

In [93]:
col[np.abs(col) > 3]

41    -3.375419
205   -3.100405
Name: 2, dtype: float64

In [94]:
# to select all rows having a value exceeding 3 or -3, you can use the any method on a boolean DataFrame
# (axis must be passed by keyword in recent pandas releases)

data[(np.abs(data) > 3).any(axis=1)]

Unnamed: 0,0,1,2,3
41,0.405879,-0.633622,-3.375419,0.229272
96,3.60567,1.065272,0.253023,0.378747
142,1.553422,-0.151267,-0.430065,3.004844
205,1.502456,2.122463,-3.100405,0.96759
698,-1.438033,-3.020988,0.872672,0.340035
738,3.295181,0.716391,0.569749,-0.449451
763,-3.013759,0.949583,-0.597258,1.436113


In [96]:
# values can be set based on these criteria. Here is code to cap values outside the interval -3 to 3

data[np.abs(data) > 3 ] = np.sign(data)*3

In [98]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.037205,-0.022056,0.007433,-0.025143
std,0.990604,1.002771,1.026855,0.993126
min,-3.0,-3.0,-3.0,-2.865589
25%,-0.663872,-0.658286,-0.709772,-0.682107
50%,0.034277,-0.010671,0.024882,-0.062806
75%,0.717087,0.680254,0.705795,0.651943
max,3.0,2.982751,2.983483,3.0


In [100]:
# produces 1 or -1 based on whether the values in data are positive or negative
np.sign(data).head()

Unnamed: 0,0,1,2,3
0,1.0,1.0,1.0,1.0
1,-1.0,-1.0,1.0,-1.0
2,-1.0,-1.0,1.0,-1.0
3,1.0,-1.0,1.0,-1.0
4,-1.0,1.0,1.0,1.0
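
The capping step can also be written with ```DataFrame.clip```. The following sketch, on seeded random data (the seed is arbitrary), compares the sign-based approach from above with ```clip```:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
frame = pd.DataFrame(rng.standard_normal((1000, 4)))

# Approach from the text: overwrite outliers with +/-3 according to sign.
capped = frame.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3

# Alternative: DataFrame.clip bounds every value to [-3, 3] in one call.
clipped = frame.clip(lower=-3, upper=3)

print(np.allclose(capped.to_numpy(), clipped.to_numpy()))
print(float(np.abs(capped).max().max()))
```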


## Permutation and Random Sampling

Calling ```numpy.random.permutation``` with the length of the axis you want to permute produces an array of integers indicating the new ordering.

In [101]:
df = pd.DataFrame(np.arange(5 * 4).reshape((5,4)))

In [103]:
sampler = np.random.permutation(5)
sampler

array([3, 0, 1, 2, 4])

In [104]:
# the sampler array can then be used in iloc-based indexing or with the equivalent take function

df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [105]:
df.take(sampler)

Unnamed: 0,0,1,2,3
3,12,13,14,15
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
4,16,17,18,19


In [106]:
# To select a random subset without replacement, you can use the sample method on series and DataFrame

df.sample(n=3)

Unnamed: 0,0,1,2,3
0,0,1,2,3
2,8,9,10,11
4,16,17,18,19


In [107]:
# to generate a sample with replacement pass replace=True to sample

choices = pd.Series([5,7,-1,6,4])

In [110]:
draws = choices.sample(n=10, replace=True)
draws

3    6
0    5
3    6
0    5
0    5
0    5
1    7
3    6
2   -1
1    7
dtype: int64
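
The outputs above change on every run. If repeatable results are needed, one option is to seed the random number generator; the sketch below assumes an arbitrary seed of 42:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# A seeded generator makes the shuffle repeatable (the seed is arbitrary).
rng = np.random.default_rng(42)
order = rng.permutation(len(frame))

shuffled = frame.take(order)                  # same rows as frame.iloc[order]
subset = frame.sample(n=3, random_state=42)   # 3 rows, without replacement

print(shuffled.index.tolist())
print(subset.index.tolist())
```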

### Computing Indicator/ Dummy Variables

Read page 212
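
As a brief sketch of the idea, ```pandas.get_dummies``` turns a categorical column into one indicator column per distinct value. The frame and the ```key``` prefix here are made up for illustration:

```python
import pandas as pd

frame = pd.DataFrame({"key": ["b", "b", "a", "c"], "value": [0, 1, 2, 3]})

# One indicator column per distinct value of 'key'.
dummies = pd.get_dummies(frame["key"], prefix="key")

# Attach the indicators to the remaining data.
with_dummies = frame[["value"]].join(dummies)

print(with_dummies.columns.tolist())
```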

## String manipulation

pandas enables you to apply string operations and regular expressions concisely on whole arrays of data, while also handling the annoyance of missing data.

### String object methods

In [112]:
# split
val = 'a,b,  guido'
val.split(',')

['a', 'b', '  guido']

In [114]:
# split and strip (of whitespace) combined
pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

In [115]:
'guido' in val

True

In [116]:
val.index(',')

1

In [117]:
val.find(':')

-1

In [118]:
val.count(',')

2

In [119]:
val.replace(',', '')

'ab  guido'

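Python built-in string methods

Method | Description
-------|------------
count | Return the number of non-overlapping occurrences of a substring
endswith | Return True if the string ends with the given suffix
startswith | Return True if the string starts with the given prefix
join | Use the string as a delimiter for concatenating a sequence of other strings
index | Return the position of the first occurrence of a substring; raise ValueError if not found
find | Return the position of the first occurrence of a substring; return -1 if not found
rfind | Return the position of the last occurrence of a substring; return -1 if not found
replace | Replace occurrences of a substring with another string
strip | Trim whitespace (or the given characters) from both ends
rstrip | Trim whitespace (or the given characters) from the right end
lstrip | Trim whitespace (or the given characters) from the left end
split | Break the string into a list of substrings using the given delimiter
lower | Convert alphabetic characters to lowercase
upper | Convert alphabetic characters to uppercase
casefold | Convert characters to a normalized lowercase form suitable for caseless comparison
ljust | Left-justify the string, padding the right side with spaces (or another fill character) to a minimum width
rjust | Right-justify the string, padding the left side with spaces (or another fill character) to a minimum width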
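
A short sketch of a few of these methods on the ```val``` string from above, including the difference between ```find``` and ```index``` when the substring is absent:

```python
val = "a,b,  guido"

# Split on commas, trim each piece, then join with a new delimiter.
pieces = [x.strip() for x in val.split(",")]
joined = "::".join(pieces)

# find returns -1 for a missing substring, while index raises ValueError.
position = val.find(":")
try:
    val.index(":")
    index_raised = False
except ValueError:
    index_raised = True

print(joined)
print(position, index_raised)
```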

### Regular expressions

RegEx provides a flexible way to search or match complex string patterns in text. 

Python's built-in module ```re``` is responsible for applying regular expressions to strings.

Read more on page 217
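
As a minimal sketch of the ```re``` module, the following splits a made-up string on runs of whitespace, once with the module-level function and once with a compiled pattern:

```python
import re

text = "foo    bar\t baz  \tqux"

# \s+ matches one or more whitespace characters (spaces, tabs, newlines).
parts = re.split(r"\s+", text)

# Compiling is worthwhile when the same pattern is applied many times.
whitespace = re.compile(r"\s+")
same_parts = whitespace.split(text)
runs = whitespace.findall(text)   # the whitespace runs themselves

print(parts)
print(len(runs))
```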

### Vectorized String Functions in pandas

Read more on page 220
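
A brief sketch of the ```str``` accessor on a Series that contains a missing value; the names and addresses are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing entry.
emails = pd.Series({
    "Dave": "dave@google.com",
    "Steve": "steve@gmail.com",
    "Wes": np.nan,
})

# .str methods propagate missing values instead of raising.
upper = emails.str.upper()

# na=False treats missing entries as "no match", giving a clean boolean mask.
has_gmail = emails.str.contains("gmail", na=False)
gmail_users = emails[has_gmail].index.tolist()

print(upper)
print(gmail_users)
```

Calling plain ```str.upper``` on each element in a loop would fail on the missing entry, which is the annoyance the ```str``` accessor handles.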