<h1> Data Cleaning and Preparation</h1>

During the course of data analysis and modelling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up to 80% of an analyst's time. Sometimes the way the data is stored in files or databases is not in the right format for a particular task. Fortunately, pandas, along with the built-in Python language features, provides a high-level, flexible, and fast set of tools for manipulating data into the right form.

<h3>Handling Missing Data</h3>

Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default.

The way that missing data is represented in pandas objects is somewhat imperfect, but it is functional for a lot of users. For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. We call this a <b><i>sentinel value</i></b> that can be easily detected:

In [1]:
import pandas as pd
import numpy as np

In [2]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [3]:
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [4]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In pandas, we follow a convention from the R programming language and refer to missing data as NA, which stands for <b>not available</b>. In statistics applications, NA data may either be data that does not exist or data that exists but was not observed.
When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data.
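
A minimal sketch of such an analysis follows. The `survey` DataFrame and its column names are hypothetical and only serve as an illustration; `isna` is the pandas alias of `isnull`. Counting missing entries per column and pulling out the incomplete rows is often a reasonable first look at where data collection went wrong:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; the column names are illustrative only.
survey = pd.DataFrame({
    'age': [34, np.nan, 29, 41, np.nan],
    'income': [52000, 61000, np.nan, 58000, 49000],
    'city': ['Austin', None, 'Boston', 'Austin', 'Denver'],
})

# Number and share of missing entries in each column.
na_counts = survey.isna().sum()
na_share = survey.isna().mean()

# Rows with at least one missing value, for closer inspection.
incomplete_rows = survey[survey.isna().any(axis=1)]

print(na_counts)
print(na_share)
print(incomplete_rows)
```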

Note: The built-in Python <b>None</b> value is also treated as NA in object arrays:

In [5]:
string_data[0] = None

In [6]:
string_data

0         None
1    artichoke
2          NaN
3      avocado
dtype: object

In [7]:
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

![alt Text](Images/DataCleaningandPreparation/na_handling.png)

In [8]:
string_data.notnull()

0    False
1     True
2    False
3     True
dtype: bool

<h3>Filtering Out Missing Data</h3>

While we always have the option to filter out missing data by hand using pandas.isnull and boolean indexing, the dropna method can be helpful.
On a Series, it returns the Series with only the non-null data and index values:

In [9]:
string_data

0         None
1    artichoke
2          NaN
3      avocado
dtype: object

In [10]:
string_data[~string_data.isnull()]

1    artichoke
3      avocado
dtype: object

In [11]:
string_data[string_data.isnull()]

0    None
2     NaN
dtype: object

In [12]:
from numpy import nan as NA

In [13]:
data = pd.Series([1, NA, 3.5, NA, 7])

In [14]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

This is equivalent to:

In [15]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, things are a bit more complex. We may want to drop rows or columns that are all NA, or only those containing any NAs. <b>dropna</b> by default drops any row containing a missing value:

In [16]:
data = pd.DataFrame([[1., 6.5, 3.],[1., NA, NA],
                    [NA, NA, NA], [NA, 6.5, 3.]])

In [17]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [18]:
cleaned = data.dropna()

In [19]:
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


Passing how='all' will only drop rows that are all NA:

In [20]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


To drop columns in the same way, pass axis = 1:

In [21]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [22]:
data[4] = NA

In [23]:
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [24]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


A related way to filter out DataFrame rows tends to concern time series data. Suppose we want to keep only rows containing at least a certain number of observations. We can indicate this with the thresh argument:

In [25]:
df  = pd.DataFrame(np.random.randn(7,3))

In [26]:
df

Unnamed: 0,0,1,2
0,-1.093586,0.100343,0.279438
1,0.234243,-0.608648,-0.291345
2,0.79889,0.909428,0.381803
3,0.409746,-0.81396,0.833808
4,1.047743,-0.47179,0.013256
5,0.713144,0.034476,-0.729821
6,-0.713952,-0.259725,-0.050535


In [27]:
df.iloc[:4,1] = NA

In [28]:
df

Unnamed: 0,0,1,2
0,-1.093586,,0.279438
1,0.234243,,-0.291345
2,0.79889,,0.381803
3,0.409746,,0.833808
4,1.047743,-0.47179,0.013256
5,0.713144,0.034476,-0.729821
6,-0.713952,-0.259725,-0.050535


In [29]:
df.iloc[:2,2] = NA

In [30]:
df

Unnamed: 0,0,1,2
0,-1.093586,,
1,0.234243,,
2,0.79889,,0.381803
3,0.409746,,0.833808
4,1.047743,-0.47179,0.013256
5,0.713144,0.034476,-0.729821
6,-0.713952,-0.259725,-0.050535


In [31]:
df.dropna()

Unnamed: 0,0,1,2
4,1.047743,-0.47179,0.013256
5,0.713144,0.034476,-0.729821
6,-0.713952,-0.259725,-0.050535


In [32]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,0.79889,,0.381803
3,0.409746,,0.833808
4,1.047743,-0.47179,0.013256
5,0.713144,0.034476,-0.729821
6,-0.713952,-0.259725,-0.050535
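
To make the meaning of thresh concrete, here is a small sketch on a hand-built DataFrame (the name `frame` and its values are made up for illustration). A row is kept when it has at least `thresh` non-NA values, which we can check against a per-row count:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [np.nan, np.nan, 3.0, 5.0],
    'c': [7.0, np.nan, 8.0, 9.0],
})

# Number of non-NA values in each row: 2, 0, 2, 3
non_na_per_row = frame.notna().sum(axis=1)

kept_2 = frame.dropna(thresh=2)   # keeps rows 0, 2 and 3
kept_3 = frame.dropna(thresh=3)   # keeps row 3 only

print(non_na_per_row)
print(kept_2)
print(kept_3)
```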


<h3>Filling in Missing Data</h3>

Rather than filtering out missing data (and potentially discarding other data along with it), we may want to fill in the "holes" in any number of ways. For most purposes, the <b>fillna</b> method is the workhorse function to use. Calling <b>fillna</b> with a constant replaces missing values with that value:

In [33]:
df

Unnamed: 0,0,1,2
0,-1.093586,,
1,0.234243,,
2,0.79889,,0.381803
3,0.409746,,0.833808
4,1.047743,-0.47179,0.013256
5,0.713144,0.034476,-0.729821
6,-0.713952,-0.259725,-0.050535


In [34]:
df.fillna(0)

Unnamed: 0,0,1,2
0,-1.093586,0.0,0.0
1,0.234243,0.0,0.0
2,0.79889,0.0,0.381803
3,0.409746,0.0,0.833808
4,1.047743,-0.47179,0.013256
5,0.713144,0.034476,-0.729821
6,-0.713952,-0.259725,-0.050535


Calling fillna with a dict, we can use a different fill value for each column:

In [35]:
df.fillna({1: 0, 2:5})

Unnamed: 0,0,1,2
0,-1.093586,0.0,5.0
1,0.234243,0.0,5.0
2,0.79889,0.0,0.381803
3,0.409746,0.0,0.833808
4,1.047743,-0.47179,0.013256
5,0.713144,0.034476,-0.729821
6,-0.713952,-0.259725,-0.050535


In [36]:
df

Unnamed: 0,0,1,2
0,-1.093586,,
1,0.234243,,
2,0.79889,,0.381803
3,0.409746,,0.833808
4,1.047743,-0.47179,0.013256
5,0.713144,0.034476,-0.729821
6,-0.713952,-0.259725,-0.050535


<b>fillna</b> returns a new object, but we can modify the existing object in-place:

In [37]:
df

Unnamed: 0,0,1,2
0,-1.093586,,
1,0.234243,,
2,0.79889,,0.381803
3,0.409746,,0.833808
4,1.047743,-0.47179,0.013256
5,0.713144,0.034476,-0.729821
6,-0.713952,-0.259725,-0.050535


In [38]:
_ = df.fillna(0, inplace=True)

In [39]:
df

Unnamed: 0,0,1,2
0,-1.093586,0.0,0.0
1,0.234243,0.0,0.0
2,0.79889,0.0,0.381803
3,0.409746,0.0,0.833808
4,1.047743,-0.47179,0.013256
5,0.713144,0.034476,-0.729821
6,-0.713952,-0.259725,-0.050535


Interpolation methods available for reindexing can be used with fillna:

In [41]:
df = pd.DataFrame(np.random.randn(6,3))

In [42]:
df

Unnamed: 0,0,1,2
0,-0.398841,-0.545017,1.196409
1,-0.844412,0.447414,0.791969
2,0.774942,1.881216,0.397568
3,0.52022,0.587132,1.272337
4,1.391185,1.136893,-0.553587
5,-1.179683,1.273934,0.781801


In [43]:
df.iloc[2:,1] = NA
df.iloc[4:,2] = NA

In [44]:
df

Unnamed: 0,0,1,2
0,-0.398841,-0.545017,1.196409
1,-0.844412,0.447414,0.791969
2,0.774942,,0.397568
3,0.52022,,1.272337
4,1.391185,,
5,-1.179683,,


In [45]:
df.fillna(method='ffill')

Unnamed: 0,0,1,2
0,-0.398841,-0.545017,1.196409
1,-0.844412,0.447414,0.791969
2,0.774942,0.447414,0.397568
3,0.52022,0.447414,1.272337
4,1.391185,0.447414,1.272337
5,-1.179683,0.447414,1.272337


In [47]:
df.fillna(method='ffill', limit=2)

Unnamed: 0,0,1,2
0,-0.398841,-0.545017,1.196409
1,-0.844412,0.447414,0.791969
2,0.774942,0.447414,0.397568
3,0.52022,0.447414,1.272337
4,1.391185,,1.272337
5,-1.179683,,1.272337
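
Recent pandas releases deprecate the `method` keyword of fillna, so the cells above may emit a warning or fail depending on the installed version. A sketch of the same idea using the dedicated `ffill` and `bfill` methods, on a small made-up DataFrame, looks like this:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'x': [1.0, np.nan, np.nan, np.nan],
    'y': [np.nan, 2.0, np.nan, 4.0],
})

filled_forward = frame.ffill()          # carry the last valid value downward
filled_limited = frame.ffill(limit=2)   # fill at most 2 consecutive gaps
filled_backward = frame.bfill()         # carry the next valid value upward

print(filled_forward)
print(filled_limited)
print(filled_backward)
```

Note that a leading gap (the first value of `y`) stays missing under forward filling because there is nothing earlier to carry forward.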


With <b>fillna</b> we can do lots of other things with a little creativity. For example, we might pass the mean or median value of a Series:

In [48]:
data = pd.Series([1., NA, 3.5, NA, 7])

In [49]:
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
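
The same idea extends to a DataFrame. In the sketch below (the column names are invented for illustration), `frame.mean()` is a Series indexed by column name, and fillna aligns on those labels so that each column is filled with its own mean:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'height': [1.0, np.nan, 3.0],
    'weight': [10.0, 20.0, np.nan],
})

# Column means are computed with missing values excluded.
column_means = frame.mean()

# Each column's gaps are filled with that column's mean.
filled = frame.fillna(column_means)

print(column_means)
print(filled)
```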

![alt Text](Images/DataCleaningandPreparation/fillna.png)

<h3>Data Transformation</h3>

So far we've been concerned with rearranging data. Filtering, cleaning, and other transformations are another class of important operations.

<h3>Removing Duplicates</h3>

Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:

In [58]:
data = pd.DataFrame({'k1' : ['one', 'two'] * 3 + ['two'],
                    'k2': [1,1,2,3,3,4,4]})

In [59]:
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not:

In [60]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

Relatedly, drop_duplicates returns a DataFrame where the duplicated array is False:

In [61]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


Both of these methods by default consider all of the columns; alternatively, we can specify any subset of them to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates only based on the 'k1' column:

In [62]:
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [63]:
data['v1']  = range(7)

In [64]:
data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [65]:
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


<b>duplicated</b> and <b>drop_duplicates</b> by default keep the first observed value combination. Passing <b>keep='last'</b> will return the last one:

In [67]:
data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [68]:
data.drop_duplicates(['k1','k2'], keep = 'last')

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
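
Besides 'first' and 'last', the keep argument also accepts False, which treats every member of a duplicate group as a duplicate. A short sketch on the same k1/k2 data shows how this can be used either to inspect all duplicated rows or to discard them entirely:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# Every row that belongs to a duplicate group (rows 5 and 6 here).
all_dupes = data[data.duplicated(keep=False)]

# Only rows whose value combination occurs exactly once.
unique_only = data.drop_duplicates(keep=False)

print(all_dupes)
print(unique_only)
```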


<h3>Transforming Data Using a Function or Mapping</h3>

For many datasets, we may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following hypothetical data collected about various kinds of meat:

In [71]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                             'Pastrami', 'corned beef', 'Bacon', 
                             'pastrami', 'honey ham', 'nova lox'],
                    'ounces': [4,3,12,6,7.5,8,3,5,6]})

In [72]:
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


Suppose we wanted to add a column indicating the type of animal that each food came from. Let's write down a mapping of each distinct meat type to the kind of animal:

In [73]:
meat_to_animal = {
    'bacon' : 'pig', 
    'pulled pork' : 'pig', 
    'pastrami': 'cow',
    'corned beef': 'cow', 
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

The map method on a Series accepts a function or dict-like object containing a mapping, but here we have a small problem in that some of the meats are capitalized and others are not. Thus, we need to convert each value to lowercase using the str.lower Series method:

In [79]:
lowercased = data['food'].str.lower()

In [81]:
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [82]:
data['animal'] = lowercased.map(meat_to_animal)

In [83]:
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


We could also have passed a function that does all the work:

In [85]:
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

Using map is a convenient way to perform element-wise transformations and other data cleaning-related operations.
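
One difference between the two approaches is worth a small sketch. When map is given a dict, values without an entry become NaN, whereas a lambda that indexes the dict directly would raise a KeyError. The shortened mapping and the 'tofu' entry below are made up for illustration:

```python
import pandas as pd

meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}
foods = pd.Series(['Bacon', 'pastrami', 'tofu'])

# Dict lookup: values with no entry in the mapping come back as NaN.
mapped = foods.str.lower().map(meat_to_animal)

# The unmapped originals are easy to list for follow-up.
unmapped = foods[mapped.isna()]

# A function using dict.get can supply a default instead of failing.
with_default = foods.str.lower().map(lambda x: meat_to_animal.get(x, 'unknown'))

print(mapped)
print(unmapped)
print(with_default)
```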

<h3>Replacing Values</h3>

Filling in missing data with the <b>fillna</b> method is a special case of more general value replacement. As we've already seen, <b>map</b> can be used to modify a subset of values in an object, but <b>replace</b> provides a simpler and more flexible way to do so.

In [91]:
data = pd.Series([1., -999., 2., -999.,-1000., 3.])

In [92]:
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

The -999 values might be sentinel values for missing data. To replace these with NA values that pandas understands, we can use replace, producing a new Series (unless we pass inplace=True):

In [93]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [94]:
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

If we want to replace multiple values at once, we can instead pass a list and then the substitute value:

In [95]:
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [96]:
data.replace([-999,-1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

To use a different replacement for each value, pass a list of substitutes:

In [97]:
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [98]:
data.replace([-999,-1000],[np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The argument passed can also be a dict:

In [99]:
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [100]:
data.replace({-999: np.nan, -1000:0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

<b>Note:</b> The data.replace method is distinct from data.str.replace, which performs string substitution element-wise.
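
The distinction is easiest to see side by side. In this sketch (the `codes` Series is invented for illustration), replace only touches elements that equal the target as a whole, while str.replace substitutes inside every string:

```python
import pandas as pd

codes = pd.Series(['N/A', 'N/A-1', 'ok'])

# Series.replace matches whole values only: 'N/A-1' is left alone.
whole = codes.replace('N/A', 'missing')

# Series.str.replace substitutes within each string.
# regex=False treats the pattern as a literal string.
inside = codes.str.replace('N/A', 'missing', regex=False)

print(whole.tolist())    # ['missing', 'N/A-1', 'ok']
print(inside.tolist())   # ['missing', 'missing-1', 'ok']
```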

<h3>Renaming Axis Indexes</h3>

Like values in a Series, axis labels can be transformed by a function or mapping of some form to produce new, differently labeled objects. We can also modify the axes in-place without creating a new data structure. For example:

In [112]:
data = pd.DataFrame(np.arange(12).reshape((3,4)),
                   index = ['Ohio', 'Colorado', 'New York'],
                   columns = ['one', 'two', 'three', 'four'])

Like a Series, the axis indexes have a map method:

In [113]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [116]:
transform = lambda x: x[:4].upper()

In [117]:
data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

We can assign to index, modifying the DataFrame in-place:

In [119]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [121]:
data.index = data.index.map(transform)

In [122]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


If we want to create a transformed version of a dataset without modifying the original, a useful method is rename:

In [124]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


In [125]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


Notably, rename can be used in conjunction with a dict-like object providing new values for a subset of the axis labels:

In [127]:
data.rename(index = {'OHIO': 'INDIANA'},
           columns = {'three': 'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [128]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


<b>rename</b> saves us from the chore of copying the DataFrame manually and assigning to its index and column attributes. Should we wish to modify a dataset in-place, pass <b>inplace=True</b>:

In [129]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [130]:
data.rename(index={'OHIO':'INDIANA'}, inplace=True)

In [131]:
data

Unnamed: 0,one,two,three,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


<h3>Discretization and Binning</h3>

Continuous data is often discretized or otherwise separated into "bins" for analysis. Suppose we have data about a group of people in a study, and we want to group them into discrete age buckets:

In [132]:
ages = [20,22,25,27,21,23,37,31,61,45,41,32]

Now, let's divide these into bins of 18 to 25, 26 to 35, 36 to 69, and finally 70 and older. To do so, we use the <b>cut</b> function in pandas:

In [133]:
bins = [18,25,35,69,100]

In [134]:
cats = pd.cut(ages, bins)

In [135]:
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (35, 69], (35, 69], (35, 69], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 69] < (69, 100]]

The object pandas returns is a special <b>Categorical</b> object. The output we see describes the bins computed by <b>pandas.cut</b>. We can treat it like an array of strings indicating the bin name; internally it contains a categories array specifying the distinct category names along with a labeling for the ages data in the codes attribute:

In [139]:
type(cats)

pandas.core.arrays.categorical.Categorical

In [145]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 2, 2, 2, 1], dtype=int8)

In [146]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 69], (69, 100]],
              closed='right',
              dtype='interval[int64]')

In [147]:
pd.value_counts(cats)

(18, 25]     5
(35, 69]     4
(25, 35]     3
(69, 100]    0
dtype: int64

<b>Note:</b> pd.value_counts(cats) gives the bin counts for the result of pandas.cut.

Consistent with mathematical notation for intervals, a parenthesis means that the side is open, while the square bracket means it is closed (inclusive). We can change which side is closed by passing right = False:

In [148]:
ages

[20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

In [149]:
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (35, 69], (35, 69], (35, 69], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 69] < (69, 100]]

In [151]:
pd.cut(ages, [18,26,36,61,100], right = False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
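
The effect of the closed side shows up most clearly for values that sit exactly on a bin edge. The sketch below uses a small made-up list of ages and inspects the codes attribute, where -1 marks a value that falls in no bin. The include_lowest option used in the last call is an additional cut parameter that makes the first interval include its left edge:

```python
import pandas as pd

ages = [18, 25, 26, 35]
bins = [18, 25, 35]

# Default: intervals (18, 25] and (25, 35]; 18 itself falls outside.
right_closed = pd.cut(ages, bins)

# right=False: intervals [18, 25) and [25, 35); 35 itself falls outside.
left_closed = pd.cut(ages, bins, right=False)

# include_lowest=True keeps right-closed bins but also admits the lowest edge.
with_lowest = pd.cut(ages, bins, include_lowest=True)

print(list(right_closed.codes))   # [-1, 0, 1, 1]
print(list(left_closed.codes))    # [0, 1, 1, -1]
print(list(with_lowest.codes))    # [0, 0, 1, 1]
```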

We can also pass our own bin names by passing  a list or array to the labels option:

In [152]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

In [153]:
pd.cut(ages, bins, labels=group_names)

['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'MiddleAged', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']

If we pass an integer number of bins to cut instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data.
Consider the case of some uniformly distributed data chopped into fourths:

In [154]:
data = np.random.rand(20)

In [155]:
data

array([0.38818142, 0.13414338, 0.92992229, 0.35167794, 0.22145741,
       0.71696227, 0.23591592, 0.15406344, 0.66309627, 0.96023341,
       0.34725869, 0.46901634, 0.38287671, 0.1793147 , 0.99850405,
       0.67770429, 0.38869429, 0.7872978 , 0.96130118, 0.96968764])

In [160]:
pd.cut(data, 4,precision=2)

[(0.35, 0.57], (0.13, 0.35], (0.78, 1.0], (0.35, 0.57], (0.13, 0.35], ..., (0.57, 0.78], (0.35, 0.57], (0.78, 1.0], (0.78, 1.0], (0.78, 1.0]]
Length: 20
Categories (4, interval[float64]): [(0.13, 0.35] < (0.35, 0.57] < (0.57, 0.78] < (0.78, 1.0]]

<b>Note: </b> The precision = 2 option limits the decimal precision to two digits.

A closely related function, qcut, bins the data based on sample quantiles. Depending on the distribution of the data, using <b>cut</b> will not usually result in each bin having the same number of data points. Since qcut uses sample quantiles instead, by definition we will obtain roughly equal-sized bins:

In [162]:
data = np.random.randn(1000) #Normally distributed

In [163]:
cats = pd.qcut(data, 4) # Cut into quartiles

In [164]:
cats

[(-0.645, -0.0262], (-0.645, -0.0262], (-0.0262, 0.692], (-0.645, -0.0262], (0.692, 3.127], ..., (-3.12, -0.645], (0.692, 3.127], (-0.0262, 0.692], (0.692, 3.127], (-0.645, -0.0262]]
Length: 1000
Categories (4, interval[float64]): [(-3.12, -0.645] < (-0.645, -0.0262] < (-0.0262, 0.692] < (0.692, 3.127]]

In [167]:
pd.value_counts(cats)

(0.692, 3.127]       250
(-0.0262, 0.692]     250
(-0.645, -0.0262]    250
(-3.12, -0.645]      250
dtype: int64

Similar to <b>cut</b>, we can pass our own quantiles (numbers between 0 and 1, inclusive):

In [168]:
pd.qcut(data, [0,0.1,0.5,0.9,1.])

[(-1.237, -0.0262], (-1.237, -0.0262], (-0.0262, 1.323], (-1.237, -0.0262], (-0.0262, 1.323], ..., (-3.12, -1.237], (-0.0262, 1.323], (-0.0262, 1.323], (-0.0262, 1.323], (-1.237, -0.0262]]
Length: 1000
Categories (4, interval[float64]): [(-3.12, -1.237] < (-1.237, -0.0262] < (-0.0262, 1.323] < (1.323, 3.127]]
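
To contrast the two functions directly, the following sketch bins the same normally distributed sample with both. The seed and the variable names are arbitrary choices for illustration. Equal-width bins from cut end up with very different counts, while the quartile bins from qcut each hold a quarter of the points:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)            # seeded for reproducibility
values = pd.Series(rng.standard_normal(1000))

# Four equal-width bins between the minimum and maximum.
equal_width = pd.cut(values, 4).value_counts(sort=False)

# Four bins with edges at the sample quartiles.
equal_count = pd.qcut(values, 4).value_counts(sort=False)

print(equal_width)
print(equal_count)
```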

<h3>Detecting and Filtering Outliers</h3>

Filtering or transforming outliers is largely a matter of applying array operations.
Consider a DataFrame with some normally distributed data:

In [191]:
data = pd.DataFrame(np.random.randn(1000,4))

In [192]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.001608,0.033378,-0.003834,-0.013093
std,0.966568,1.029671,0.97882,0.971457
min,-3.72149,-3.014772,-3.617963,-3.577774
25%,-0.617673,-0.660361,-0.62746,-0.6115
50%,0.015129,0.03108,-0.001962,0.020487
75%,0.633368,0.772434,0.671778,0.642917
max,2.92532,3.297286,3.113154,3.142328


Suppose we want to find values in one of the columns exceeding 3 in absolute value:

In [193]:
col = data[2]

In [194]:
col[np.abs(col) > 3]

876   -3.617963
912    3.113154
Name: 2, dtype: float64

To select all rows having a value exceeding 3 or below -3, we can use the <b>any</b> method on a boolean DataFrame:

In [195]:
data[(np.abs(data)>3).any(axis=1)]

Unnamed: 0,0,1,2,3
118,-3.23614,0.592883,-0.388365,-0.240623
159,-0.812902,-3.014772,0.263072,0.277824
220,-0.225892,-0.562257,0.253268,-3.577774
357,0.602428,3.297286,-1.734352,1.347913
433,-3.188174,0.750683,-0.095125,-1.170726
541,-3.72149,-0.466513,-0.314228,0.422662
670,-3.398616,-0.961187,0.700108,-0.038042
708,-3.423431,1.626983,-1.862014,-0.038457
752,0.015685,1.089468,-0.247975,3.142328
876,0.60942,-1.119193,-3.617963,1.140989


Values can be set based on these criteria. Here is code to cap values outside the interval -3 to 3:

In [198]:
data[np.abs(data)>3] = np.sign(data) * 3

In [199]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.00036,0.033096,-0.003329,-0.012658
std,0.959941,1.028727,0.976374,0.969051
min,-3.0,-3.0,-3.0,-3.0
25%,-0.617673,-0.660361,-0.62746,-0.6115
50%,0.015129,0.03108,-0.001962,0.020487
75%,0.633368,0.772434,0.671778,0.642917
max,2.92532,3.0,3.0,3.0


The statement np.sign(data) produces 1 and -1 values based on whether the values in data are positive or negative:

In [200]:
np.sign(data).head()

Unnamed: 0,0,1,2,3
0,1.0,1.0,-1.0,-1.0
1,1.0,1.0,-1.0,1.0
2,-1.0,1.0,-1.0,-1.0
3,1.0,-1.0,1.0,-1.0
4,1.0,1.0,-1.0,-1.0
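
As an alternative to the sign-based assignment, the clip method caps values to a range in one call and returns a new object. The sketch below, on freshly generated data with an arbitrary seed, checks that both approaches give the same result:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)            # seeded for reproducibility
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Approach from the text: overwrite outliers with +/-3 on a copy.
capped_sign = data.copy()
capped_sign[np.abs(capped_sign) > 3] = np.sign(capped_sign) * 3

# Same capping with clip, leaving the original untouched.
capped_clip = data.clip(lower=-3, upper=3)

print(np.allclose(capped_sign.values, capped_clip.values))
print(capped_clip.describe())
```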


<h3>Permutation and Random Sampling</h3>

Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the <b>numpy.random.permutation</b> function. Calling <b>permutation</b> with the length of the axis we want to permute produces an array of integers indicating the new ordering:

In [201]:
df  = pd.DataFrame(np.arange(5*4).reshape((5,4)))

In [202]:
sampler = np.random.permutation(5)

In [203]:
sampler

array([3, 0, 1, 2, 4])

This array can then be used in iloc-based indexing or the equivalent take function:

In [204]:
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [205]:
df.iloc[sampler]

Unnamed: 0,0,1,2,3
3,12,13,14,15
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
4,16,17,18,19


In [206]:
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [207]:
df.take(sampler)

Unnamed: 0,0,1,2,3
3,12,13,14,15
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
4,16,17,18,19


To select a random subset <b>without replacement</b>, we can use the <b>sample</b> method on Series and DataFrame:

In [208]:
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [209]:
df.sample(n=3)

Unnamed: 0,0,1,2,3
4,16,17,18,19
1,4,5,6,7
2,8,9,10,11


To generate a sample <b>with replacement (to allow repeat choices)</b>, pass <b>replace=True</b> to sample:

In [210]:
choices = pd.Series([5,7,-1,6,4])

In [212]:
draws = choices.sample(n=10, replace = True)

In [213]:
draws

0    5
2   -1
4    4
3    6
2   -1
0    5
3    6
4    4
4    4
4    4
dtype: int64
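
Because sampling is random, results change from run to run. When a repeatable result is needed, sample accepts a random_state argument; passing frac=1 draws every row once, which gives a shuffled copy. A brief sketch (the seed values are arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# The same seed yields the same sample on every call.
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)

# frac=1 samples all rows without replacement, i.e. a permutation.
shuffled = df.sample(frac=1, random_state=0)

print(first.equals(second))
print(shuffled)
```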

<h3>Computing Indicator/Dummy Variables </h3>

Another type of transformation for statistical modelling or machine learning applications is converting a categorical variable into a "dummy" or "indicator" matrix. If a column in a DataFrame has k distinct values, we would derive a matrix or DataFrame with k columns containing all 1s and 0s. pandas has a <b>get_dummies</b> function for doing this, though devising one yourself is not difficult.

In [214]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                  'data1': range(6)})

In [215]:
df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [222]:
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0
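
As a sketch of what devising one yourself might look like, the code below builds one 0/1 column per distinct key by comparison and checks it against get_dummies. The variable names are illustrative. Recent pandas versions return True/False columns from get_dummies by default, so dtype=int is passed here to obtain the 1s and 0s shown above:

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# One column per distinct value, 1 where the row matches that value.
categories = sorted(df['key'].unique())
manual = pd.DataFrame({c: (df['key'] == c).astype(int) for c in categories})

# Built-in equivalent, forced to integer output.
builtin = pd.get_dummies(df['key'], dtype=int)

print(manual)
print((manual.values == builtin.values).all())
```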


In some cases, we may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data. <b>get_dummies</b> has a prefix argument for doing this:

In [217]:
dummies = pd.get_dummies(df['key'], prefix='key')

In [219]:
df_with_dummy =  df[['data1']].join(dummies)

In [220]:
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0


If a row in a DataFrame belongs to multiple categories, things are a bit more complicated:

In [223]:
mnames = ['movie_id', 'title', 'genres']

In [224]:
movies = pd.read_table('pydata-book-2nd-edition/datasets/movielens/movies.dat',
                      sep = '::', header = None, names= mnames,
                      engine = 'python')  # a multi-character separator needs the Python parser


In [225]:
movies[:10]

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


Adding indicator variables for each genre requires a little bit of wrangling. First, we extract the list of unique genres in the dataset:


In [227]:
all_genres = []

In [229]:
for x in movies.genres:
    all_genres.extend(x.split('|'))

In [233]:
genres = pd.unique(all_genres)

In [234]:
genres

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)

One way to construct the indicator DataFrame is to start with a DataFrame of all zeros:

In [236]:
zero_matrix = np.zeros((len(movies), len(genres)))

In [237]:
dummies = pd.DataFrame(zero_matrix, columns=genres)

Now, iterate through each movie and set entries in each row of dummies to 1. To do this, we use the <b>dummies.columns</b> to compute the column indices for each genre:

In [238]:
gen = movies.genres[0]

In [239]:
gen

"Animation|Children's|Comedy"

In [240]:
gen.split('|')

['Animation', "Children's", 'Comedy']

In [241]:
dummies.columns.get_indexer(gen.split('|'))

array([0, 1, 2], dtype=int64)

Then we can use <b>.iloc</b> to set values based on these indices:

In [242]:
movies.genres

0        Animation|Children's|Comedy
1       Adventure|Children's|Fantasy
2                     Comedy|Romance
3                       Comedy|Drama
4                             Comedy
                    ...             
3878                          Comedy
3879                           Drama
3880                           Drama
3881                           Drama
3882                  Drama|Thriller
Name: genres, Length: 3883, dtype: object

In [243]:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

Then as before, we can combine this with movies:

In [244]:
movies_windic = movies.join(dummies.add_prefix('Genre_'))

In [245]:
movies_windic.iloc[0]

movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Adventure                                0
Genre_Fantasy                                  0
Genre_Romance                                  0
Genre_Drama                                    0
Genre_Action                                   0
Genre_Crime                                    0
Genre_Thriller                                 0
Genre_Horror                                   0
Genre_Sci-Fi                                   0
Genre_Documentary                              0
Genre_War                                      0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Film-Noir                                0
Genre_Western                                  0

Note: for much larger data, this method of constructing indicator variables with multiple membership is not especially speedy. It would be better to write a lower-level function that writes directly to a NumPy array, and then wrap the result in a DataFrame.
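
One possible shape for such a lower-level approach is sketched below. It is an illustration under assumptions rather than a tuned implementation: the three genre strings stand in for the movies.genres column, and the variable names are made up. The indicator values are written into a plain NumPy array, which is wrapped in a DataFrame only once at the end:

```python
import numpy as np
import pandas as pd

# Stand-in for movies.genres; the real column has thousands of rows.
genres_col = pd.Series(["Animation|Children's|Comedy",
                        "Comedy|Romance",
                        "Drama"])

# Split once and collect the distinct genres in order of first appearance.
split_rows = [g.split('|') for g in genres_col]
genres = list(dict.fromkeys(g for row in split_rows for g in row))
col_pos = {g: j for j, g in enumerate(genres)}

# Write indicators straight into a NumPy array.
out = np.zeros((len(split_rows), len(genres)), dtype=np.uint8)
for i, row in enumerate(split_rows):
    out[i, [col_pos[g] for g in row]] = 1

# Wrap the finished array in a DataFrame.
dummies = pd.DataFrame(out, columns=genres,
                       index=genres_col.index).add_prefix('Genre_')
print(dummies)
```

Avoiding a per-row assignment through .iloc is the main design choice here, since indexing into a DataFrame carries more overhead than indexing into the underlying array.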

A useful recipe for statistical applications is to combine <b>get_dummies</b> with a discretization function like <b>cut</b>:

In [246]:
np.random.seed(12345)

In [247]:
values = np.random.rand(10)

In [248]:
values

array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
       0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])

In [249]:
bins = [0,0.2,0.4,0.6,0.8,1]

In [250]:
pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,0,0,0,0,1
1,0,1,0,0,0
2,1,0,0,0,0
3,0,1,0,0,0
4,0,0,1,0,0
5,0,0,1,0,0
6,0,0,0,0,1
7,0,0,0,1,0
8,0,0,0,1,0
9,0,0,0,1,0


<h3>String Manipulation</h3>

Python has long been a popular raw data manipulation language in part due to its ease of use for string and text processing. Most text operations are made simple with the string object's built-in methods. For more complex pattern matching and text manipulations, regular expressions may be needed. pandas adds to the mix by enabling us to apply string and regular expression operations concisely on whole arrays of data, additionally handling the annoyance of missing data.

<h3>String Object Methods</h3>

In many string munging and scripting applications, built-in string methods are sufficient. As an example, a comma-separated string can be broken into pieces with split:

In [295]:
val = 'a,b,    guido'

In [254]:
val.split(',')

['a', 'b', '    guido']

<b>split</b> is often combined with <b>strip</b> to trim whitespace (including line breaks):

In [256]:
pieces = [x.strip() for x in val.split(',')]

In [257]:
pieces

['a', 'b', 'guido']

These substrings could be concatenated together with a two-colon delimiter using addition:

In [258]:
first, second, third = pieces

In [259]:
first + "::" + second +"::"+third

'a::b::guido'

But this isn't a practical generic method. A faster and more Pythonic way is to pass a list or tuple to the join method on the string "::" :

In [260]:
"::".join(pieces)

'a::b::guido'

Other methods are concerned with locating substrings. Using Python's <b>in</b> keyword is the best way to detect a substring, though <b>index</b> and <b>find</b> can also be used:

In [261]:
val

'a,b,    guido'

In [262]:
'guido' in val

True

In [265]:
val.index(',')

1

In [267]:
val.find(':')

-1

Note that the difference between <b>find</b> and <b>index</b> is that <b>index</b> raises an exception if the string isn't found, whereas <b>find</b> returns -1:

In [268]:
val.index(':')

ValueError: substring not found

Relatedly, <b>count</b> returns the number of occurrences of a particular substring:

In [269]:
val.count(',')

2

<b>replace</b> will substitute occurrences of one pattern for another. It is commonly used to delete patterns, too, by passing an empty string:

In [270]:
val

'a,b,    guido'

In [271]:
val.replace(',', '::')

'a::b::    guido'

In [273]:
val.replace(',', '')

'ab    guido'

![alt Text](Images/DataCleaningandPreparation/built_in_string_methods.png)
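A few other built-in string methods come up often in cleaning tasks. The sketch below runs some of them on a hypothetical input string; the variable names are ours and are not used elsewhere in this section:

```python
# Hypothetical input with surrounding whitespace and a trailing newline
raw = "  Guido,van,Rossum \n"

# strip removes leading and trailing whitespace, including the newline
cleaned = raw.strip()

# Case conversion returns new strings; the original is unchanged
lowered = cleaned.lower()
uppered = cleaned.upper()

# Prefix and suffix tests return Booleans
starts = cleaned.startswith("Guido")
ends = cleaned.endswith("Rossum")

# rfind returns the position of the last occurrence, or -1 if absent
last_comma = cleaned.rfind(",")

# ljust pads on the right up to the requested width
padded = "a".ljust(3, "-")
```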

<h3>Regular Expressions</h3>

Regular expressions provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called a regex, is a string formed according to the regular expression language. Python's built-in <b>re</b> module is responsible for applying regular expressions to strings.

The <b>re</b> module functions fall into three categories: <b>pattern matching, substitution, and splitting</b>. Naturally these are all related; a regex describes a pattern to locate in the text, which can then be used for many purposes.

Let's look at a simple example: 
Suppose we wanted to split a string with a variable number of whitespace characters (tabs, spaces, and newlines). The regex describing one or more whitespace characters is <b>\s+</b>:

In [297]:
import re

In [298]:
text = "foo   bar\t baz  \tqux"

In [299]:
text

'foo   bar\t baz  \tqux'

In [300]:
re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

When we call re.split(r'\s+', text), the regular expression is first compiled, and then its split method is called on the passed text. We can compile the regex ourselves with re.compile, forming a reusable regex object:

In [301]:
regex = re.compile(r'\s+')

In [302]:
regex.split(text)

['foo', 'bar', 'baz', 'qux']

If instead we wanted to get a list of all patterns matching the regex, we can use the <b>findall</b> method:

In [304]:
regex.findall(text)

['   ', '\t ', '  \t']

<b>Note:</b> To avoid unwanted escaping with \ in a regular expression, use raw string literals like r'C:\x' instead of the equivalent 'C:\\x'.
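To make the equivalence concrete, here is a small sketch comparing a raw literal with its escaped form. The variable names are illustrative only:

```python
import re

# In a raw literal the backslash is kept as-is
raw_path = r'C:\x'

# In a normal literal the backslash must be doubled to mean one backslash
escaped_path = 'C:\\x'

# Both literals produce the same four-character string
same = raw_path == escaped_path
length = len(raw_path)

# Raw literals keep regex escapes such as \s readable and warning-free
parts = re.split(r'\s+', 'foo   bar\tbaz')
```

Writing the whitespace pattern as a normal literal ('\s+') still works today, but recent Python versions emit a warning for the unrecognized escape sequence, which is one more reason to prefer the raw form.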

Creating a regex object with re.compile is highly recommended if we intend to apply the same expression to many strings; doing so will save CPU cycles.
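A minimal sketch of reusing one compiled pattern across several strings follows; the list `lines` is made-up sample input. The re module also keeps an internal cache of recently used patterns, so the saving from compiling ourselves is usually modest, but a named regex object tends to read more clearly:

```python
import re

# Compile once, outside the loop
whitespace = re.compile(r'\s+')

# Hypothetical sample input
lines = ["foo   bar", "baz\tqux  quux", "one two"]

# Reuse the same compiled object for every string
tokens = [whitespace.split(line) for line in lines]

# The same object also supports substitution, e.g. collapsing runs to one space
normalized = [whitespace.sub(' ', line) for line in lines]
```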

<b>match</b> and <b>search</b> are closely related to <b>findall</b>. While findall returns all matches in a string, search returns only the first match. More rigidly, match only matches at the beginning of the string. As a less trivial example, let's consider a block of text and a regular expression capable of identifying most email addresses:

In [305]:
text = """Dave dave@google.com
Steve steve@gamil.com
Rob rob@gmail.com
Ryan ryan@yahoo.com"""

In [306]:
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

In [308]:
#re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags = re.IGNORECASE)

In [310]:
regex.findall(text)

['dave@google.com', 'steve@gamil.com', 'rob@gmail.com', 'ryan@yahoo.com']

<b>search</b> returns a special match object for the first email address in the text. For the preceding regex, the match object can only tell us the start and end position of the pattern in the string:

In [311]:
m = regex.search(text)

In [312]:
m

<re.Match object; span=(5, 20), match='dave@google.com'>

In [313]:
m.start()

5

In [314]:
m.end()

20

In [315]:
text[m.start():m.end()]

'dave@google.com'

<b>regex.match</b> returns None, as it will only match if the pattern occurs at the start of the string:

In [316]:
print(regex.match(text))

None
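The difference in anchoring can be sketched side by side. The snippet rebuilds the same email pattern on shorter, made-up strings and also shows <b>fullmatch</b>, a related method not covered above that requires the whole string to match:

```python
import re

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

# Hypothetical inputs: the address appears mid-string in one, at the start in the other
middle = "Dave dave@google.com"
leading = "dave@google.com is Dave"

# search scans the whole string for the first match
found_middle = regex.search(middle).group()

# match only succeeds when the pattern begins at position 0
match_middle = regex.match(middle)
match_leading = regex.match(leading).group()

# fullmatch additionally requires the match to reach the end of the string
full_leading = regex.fullmatch(leading)
full_exact = regex.fullmatch("dave@google.com").group()
```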


Relatedly, <b>sub</b> will return a new string with occurrences of the pattern replaced by a new string:


In [318]:
print(regex.sub('REDACTED', text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED


In [319]:
text

'Dave dave@google.com\nSteve steve@gamil.com\nRob rob@gmail.com\nRyan ryan@yahoo.com'

Suppose we wanted to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment:

In [320]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

In [322]:
regex = re.compile(pattern, flags = re.IGNORECASE)

A match object produced by this modified regex returns a tuple of the pattern components with its groups method:

In [323]:
m = regex.match('wesm@bright.net')

In [324]:
m.groups()

('wesm', 'bright', 'net')

In [326]:
username, domain_name, domain_suffix = m.groups()

In [327]:
username

'wesm'

In [328]:
domain_name

'bright'

In [329]:
domain_suffix

'net'

<b>findall</b> returns a list of tuples when the pattern has groups:

In [330]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gamil', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

<b>sub</b> also has access to groups in each match using special symbols like \1 and \2. The symbol \1 corresponds to the first matched group, \2 corresponds to the second, and so forth:

In [331]:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gamil, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
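Numbered groups get harder to follow as patterns grow. One option, sketched here as a variation on the pattern above, is to name the groups with the `(?P<name>...)` syntax. The group names `username`, `domain`, and `suffix` are our own choice:

```python
import re

# Same pattern as above, with each group given a name
named_pattern = (r'(?P<username>[A-Z0-9._%+-]+)@'
                 r'(?P<domain>[A-Z0-9.-]+)\.'
                 r'(?P<suffix>[A-Z]{2,4})')
named_regex = re.compile(named_pattern, flags=re.IGNORECASE)

m = named_regex.match('wesm@bright.net')

# groupdict returns the named groups as a dictionary
parts = m.groupdict()

# Individual groups can be fetched by name
user = m.group('username')

# In sub, named groups are referenced as \g<name>
masked = named_regex.sub(r'\g<username>@REDACTED', 'Dave dave@google.com')
```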


![alt Text](Images/DataCleaningandPreparation/re_methods.png)

<h3>Vectorized String Functions in Pandas</h3>

Cleaning up a messy dataset for analysis often requires a lot of string munging and regularization. To complicate matters, a column containing strings will sometimes have missing data:

In [332]:
data = {'Dave': 'dave@google.com', 'Steve':'steve@gmail.com',
       'Rob':'rob@mgail.com', 'Wes': np.nan}

In [333]:
data = pd.Series(data)

In [334]:
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@mgail.com
Wes                  NaN
dtype: object

In [336]:
data.isnull()

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

We can apply string and regular expression methods (passing a lambda or other function) to each value using data.map, but it will fail on the NA (null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values. These are accessed through Series's str attribute; for example, we could check whether each email address has 'gmail' in it with <b>str.contains</b>:

In [339]:
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@mgail.com
Wes                  NaN
dtype: object

In [340]:
data.str.contains('gmail')

Dave     False
Steve     True
Rob      False
Wes        NaN
dtype: object
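The failure mode with map is easy to reproduce. The sketch below rebuilds the same Series under the hypothetical name `emails` (with `dtype=object` to match the output shown above) and contrasts a plain lambda with two NA-aware alternatives:

```python
import numpy as np
import pandas as pd

emails = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                    'Rob': 'rob@mgail.com', 'Wes': np.nan}, dtype=object)

# A plain Python function receives the float NaN and cannot search inside it
try:
    emails.map(lambda x: 'gmail' in x)
    map_failed = False
except TypeError:
    map_failed = True

# Option 1: ask map to pass NA values through without calling the function
via_map = emails.map(lambda x: 'gmail' in x, na_action='ignore')

# Option 2: use the str accessor; na=False reports missing values as False
via_str = emails.str.contains('gmail', na=False)
```

With `na=False` the result is a plain Boolean Series, which is convenient when it will be used directly as a filter mask.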

Regular expressions can be used, too, along with any re options like IGNORECASE:

In [341]:
pattern

'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [342]:
data.str.findall(pattern, flags = re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, mgail, com)]
Wes                        NaN
dtype: object

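There are a couple of ways to do vectorized element retrieval: either use <b>str.get</b> or index into the <b>str</b> attribute. Relatedly, <b>str.match</b> reports whether each value matches the pattern from the start of the string: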

In [344]:
matches = data.str.match(pattern, flags = re.IGNORECASE)

In [348]:
matches

Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object
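The element retrieval mentioned earlier can be sketched as follows. The Series is rebuilt under the hypothetical name `emails`, and the intermediate names `found` and `first` are ours:

```python
import re
import numpy as np
import pandas as pd

emails = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                    'Rob': 'rob@mgail.com', 'Wes': np.nan}, dtype=object)
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# findall gives a list of tuples per row; NA rows stay NA
found = emails.str.findall(pattern, flags=re.IGNORECASE)

# Indexing into str picks the first tuple from each list
first = found.str[0]

# str.get and str[...] both retrieve one element from each tuple
usernames = first.str.get(0)
domains = first.str[1]

# Slicing works the same way on the original strings
prefixes = emails.str[:5]
```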

![alt Text](Images/DataCleaningandPreparation/vectorized_strings.png)