### Imports

In [62]:
import pandas as pd
import numpy as np
from numpy import nan as NA
import re

# Chapter 7. Data Cleaning and Preparation

During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst’s time. Sometimes the way that data is stored in files or databases is not in the right format for a particular task. Many researchers choose to do ad hoc processing of data from one form to another using a general-purpose programming language, like Python, Perl, R, or Java, or Unix text-processing tools like sed or awk. Fortunately, pandas, along with the built-in Python language features, provides you with a high-level, flexible, and fast set of tools to enable you to manipulate data into the right form. 

If you identify a type of data manipulation that isn’t anywhere in this book or elsewhere in the pandas library, feel free to share your use case on one of the Python mailing lists or on the pandas GitHub site. Indeed, much of the design and implementation of pandas has been driven by the needs of real-world applications. In this chapter I discuss tools for missing data, duplicate data, string manipulation, and some other analytical data transformations. In the next chapter, I focus on combining and rearranging datasets in various ways.

## Handling Missing Data

Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default. 

The way that missing data is represented in pandas objects is somewhat imperfect, but it is functional for a lot of users. For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. We call this a sentinel value that can be easily detected:

In [4]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [5]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In pandas, we’ve adopted a convention used in the R programming language by referring to missing data as NA, which stands for not available. In statistics applications, NA data may either be data that does not exist or that exists but was not observed (through problems with data collection, for example). When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data. 

The built-in Python *None* value is also treated as NA in object arrays:

In [6]:
string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
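As a rough sketch of the "analysis on the missing data itself" mentioned above, the boolean mask from isnull can be summed to count missing entries per column. The survey table and its column names are hypothetical and used only for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical survey table; None and np.nan are both treated as missing.
survey = pd.DataFrame({
    'name': ['Ann', None, 'Cy', 'Di'],
    'age': [34, 41, np.nan, np.nan],
})

# isnull() gives a boolean mask; summing it counts missing entries per column.
missing_per_column = survey.isnull().sum()
print(missing_per_column)

# The mean of the mask is the fraction of missing entries per column.
missing_fraction = survey.isnull().mean()
print(missing_fraction)
```

A column with an unusually high missing fraction is often a first hint of a data collection problem.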

### Filtering Out Missing Data

There are a few ways to filter out missing data. While you always have the option to do it by hand using pandas.isnull and boolean indexing, the dropna method can be helpful. On a Series, it returns the Series with only the non-null data and index values:

In [7]:
data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64
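The "by hand" alternative mentioned above can be sketched with notnull and boolean indexing. This is a minimal illustration that uses np.nan directly for the missing values:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

# Boolean mask that is True wherever a value is present.
mask = data.notnull()

# Indexing with the mask keeps the same rows that dropna() keeps.
by_hand = data[mask]
print(by_hand)
print(by_hand.equals(data.dropna()))
```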

With DataFrame objects, things are a bit more complex. You may want to drop rows or columns that are all NA or only those containing any NAs. dropna by default drops any row containing a missing value:

In [8]:
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan], [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
cleaned = data.dropna()
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


Passing how='all' will only drop rows that are all NA:

In [9]:
data.dropna(how = 'all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,NaN,NaN
3,NaN,6.5,3.0


To drop columns in the same way, pass axis=1. No column in this example is entirely NA, so nothing is removed here:

In [11]:
data.dropna(axis = 1, how = 'all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,NaN,NaN
2,NaN,NaN,NaN
3,NaN,6.5,3.0

Another related way to filter out DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing a certain number of observations. You can indicate this with the thresh argument:

In [12]:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan
df

Unnamed: 0,0,1,2
0,0.390156,NaN,NaN
1,0.20612,NaN,NaN
2,0.017142,NaN,-0.538494
3,-0.755182,NaN,-1.297024
4,-0.072054,-1.168741,0.077882
5,-0.26262,-1.630141,0.829446
6,-0.685054,0.586024,0.368876


In [13]:
df.dropna()

Unnamed: 0,0,1,2
4,-0.072054,-1.168741,0.077882
5,-0.26262,-1.630141,0.829446
6,-0.685054,0.586024,0.368876


In [14]:
df.dropna(thresh = 2)

Unnamed: 0,0,1,2
2,0.017142,NaN,-0.538494
3,-0.755182,NaN,-1.297024
4,-0.072054,-1.168741,0.077882
5,-0.26262,-1.630141,0.829446
6,-0.685054,0.586024,0.368876


### Filling in Missing Data

Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the "holes" in any number of ways. For most purposes, the fillna method is the workhorse function to use. Calling fillna with a constant replaces missing values with that value:

In [15]:
df.fillna(0)

Unnamed: 0,0,1,2
0,0.390156,0.0,0.0
1,0.20612,0.0,0.0
2,0.017142,0.0,-0.538494
3,-0.755182,0.0,-1.297024
4,-0.072054,-1.168741,0.077882
5,-0.26262,-1.630141,0.829446
6,-0.685054,0.586024,0.368876


Calling fillna with a dict lets you use a different fill value for each column:

In [16]:
df.fillna({1: 0.5, 2 : 0})

Unnamed: 0,0,1,2
0,0.390156,0.5,0.0
1,0.20612,0.5,0.0
2,0.017142,0.5,-0.538494
3,-0.755182,0.5,-1.297024
4,-0.072054,-1.168741,0.077882
5,-0.26262,-1.630141,0.829446
6,-0.685054,0.586024,0.368876
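Constants are not the only way to fill holes. The sketch below shows two other common choices: carrying the last observed value forward, and filling with the mean of the observed values. The readings Series is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with gaps.
readings = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: carry the last observed value into each gap.
forward_filled = readings.ffill()

# limit caps how many consecutive missing values get filled.
limited = readings.ffill(limit=1)

# Fill with a summary statistic; mean() skips missing values by default.
mean_filled = readings.fillna(readings.mean())

print(forward_filled)
print(limited)
print(mean_filled)
```

Which strategy is appropriate depends on the data. Forward filling suits ordered observations such as time series, while a mean fill ignores ordering entirely.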


## Data Transformation

### Removing duplicates

Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:

In [17]:
data = pd.DataFrame({'k1' : ['one', 'two'] * 3 + ['two'],
                     'k2' : [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [18]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [19]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


In [20]:
data['v1'] = range(7)
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
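Both duplicated and drop_duplicates keep the first observed combination by default. A small sketch, assuming you would rather keep the last occurrence, passes keep='last':

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data['v1'] = range(7)

# Rows 5 and 6 share the same (k1, k2) pair; keep='last' retains row 6.
last_kept = data.drop_duplicates(['k1', 'k2'], keep='last')
print(last_kept)
```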


### Transforming data using a function or mapping

For many datasets, you may wish to perform some transformation based on the values in an array, Series or column in a DataFrame. Consider the following hypothetical data collected about various kinds of meat:

In [21]:
data = pd.DataFrame({'food' : ['bacon', 'pulled pork', 'bacon', 'Pastrami', 'corned beef', 'Bacon',
                               'pastrami', 'honey ham', 'nova lox'],
                     'ounces' : [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


Suppose you want to add a column indicating the type of animal that each food came from. Let's write down a mapping of each distinct meat type to the kind of animal:

In [22]:
meat_to_animal = {
    'bacon' : 'pig',
    'pulled pork' : 'pig',
    'pastrami' : 'cow',
    'corned beef' : 'cow',
    'honey ham' : 'pig',
    'nova lox' : 'salmon'
}

The map method on a Series accepts a function or dict-like object containing a mapping, but here we have a small problem in that some of the meats are capitalized and others are not. Thus, we need to convert each value to lowercase using the str.lower Series method:

In [23]:
lowercased = data['food'].str.lower()
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [24]:
data['animal'] = lowercased.map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon
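Since map also accepts a function, the lowercasing and the lookup can be combined in one step. This sketch uses a shortened, illustrative version of the food column:

```python
import pandas as pd

meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon',
}

# Shortened sample of the food column, with mixed capitalization.
food = pd.Series(['bacon', 'Pastrami', 'nova lox', 'Bacon'])

# The function lowercases each value before looking it up in the dict.
animal = food.map(lambda x: meat_to_animal[x.lower()])
print(animal)
```

Note that the dict lookup raises a KeyError for a food that is not in the mapping, whereas passing the dict itself to map yields NaN for unmatched values.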


### Replacing Values

Filling in missing data with the fillna method is a special case of more general value replacement. As you've already seen, map can be used to modify a subset of values in an object, but replace provides a simpler and more flexible way to do so. Let's consider this Series:

In [25]:
data = pd.Series([1., -999, 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [26]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [27]:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [28]:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [29]:
data.replace({-999 : np.nan, -1000 : 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

### Renaming Axis Indexes

Like values in Series, axis labels can be similarly transformed by a function or mapping of some form to produce new, differently labeled objects. You can also modify the axes in-place without creating a new data structure. Here's a simple example:

In [30]:
data = pd.DataFrame(
    np.arange(12).reshape((3, 4)),
    index = ['Ohio', 'Colorado', 'New York'],
    columns = ['one', 'two', 'three', 'four']
)

transform = lambda x : x[:4].upper()

data.index = data.index.map(transform)
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11
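If you want a transformed copy without modifying the original, the rename method is one option. A minimal sketch, with the replacement labels chosen arbitrarily for illustration:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(12).reshape((3, 4)),
    index=['Ohio', 'Colorado', 'New York'],
    columns=['one', 'two', 'three', 'four'],
)

# rename returns a new object; functions are applied to every label.
renamed = data.rename(index=str.upper, columns=str.title)

# A dict renames only the labels it lists (new names here are arbitrary).
partial = data.rename(index={'Ohio': 'INDIANA'}, columns={'three': 'peekaboo'})

print(renamed)
print(partial)
print(data.index)  # the original labels are unchanged
```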


### Discretization and Binning

Continuous data is often discretized or otherwise separated into "bins" for analysis. Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets:

In [31]:
ages = [20, 22, 25, 27, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let's divide these into bins of 18 to 25, 26 to 35, 36 to 60, and 61 and older:

In [32]:
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (25, 35], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 13
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

pandas returns a Categorical object. The output you see describes the bins computed by pandas.cut. You can treat it like an array of strings indicating the bin name; internally it contains a categories array specifying the distinct category names along with a labeling for the ages data in the codes attribute:

In [33]:
cats.codes

array([0, 0, 0, 1, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [34]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')

In [35]:
pd.value_counts(cats)

(18, 25]     5
(25, 35]     4
(35, 60]     3
(60, 100]    1
dtype: int64

Note that pd.value_counts(cats) gives the bin counts for the result of pandas.cut.

Consistent with mathematical notation for intervals, a parenthesis means that the side is *open*, while a square bracket means it is *closed* (inclusive). You can change which side is closed by passing right=False:

In [36]:
pd.cut(ages, [18, 26, 36, 61, 100], right = False)

[[18, 26), [18, 26), [18, 26), [26, 36), [26, 36), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 13
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]

You can also pass your own bin names by passing a list or array to the labels option:

In [37]:
group_names = ['Youth', 'Young Adult', 'Middle Aged', 'Senior']
pd.cut(ages, bins, labels = group_names)

['Youth', 'Youth', 'Youth', 'Young Adult', 'Young Adult', ..., 'Young Adult', 'Senior', 'Middle Aged', 'Middle Aged', 'Young Adult']
Length: 13
Categories (4, object): ['Youth' < 'Young Adult' < 'Middle Aged' < 'Senior']

If you pass an integer number of bins to cut instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data.

Consider the case of some normally distributed data chopped into fourths:

In [38]:
data = np.random.randn(20)
pd.cut(data, 4, precision = 2)

[(-0.6, 0.25], (-0.6, 0.25], (-0.6, 0.25], (1.1, 1.95], (0.25, 1.1], ..., (1.1, 1.95], (-0.6, 0.25], (-0.6, 0.25], (0.25, 1.1], (-1.45, -0.6]]
Length: 20
Categories (4, interval[float64]): [(-1.45, -0.6] < (-0.6, 0.25] < (0.25, 1.1] < (1.1, 1.95]]
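To see which edges cut computes from an integer bin count, you can ask for them with retbins=True. The following sketch uses a small, evenly spaced array (chosen for illustration) and compares the result with np.linspace:

```python
import numpy as np
import pandas as pd

# Illustrative values spanning 0 to 8.
values = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# retbins=True also returns the array of computed bin edges.
cats, edges = pd.cut(values, 4, retbins=True)

# Equal-length edges between the minimum and maximum.
expected = np.linspace(values.min(), values.max(), 5)

print(edges)
print(expected)
print(cats.codes)
```

The interior and right-hand edges match the evenly spaced ones. The leftmost edge is nudged slightly below the minimum so that the smallest value still falls inside the first right-closed interval.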

A closely related function is qcut, which bins the data based on sample quantiles. Depending on the distribution of the data, using *cut* will not usually result in each bin having the same number of data points. Since *qcut* uses sample quantiles instead, by definition, you will obtain roughly equal-size bins:

In [39]:
data = np.random.randn(100)
cats = pd.qcut(data, 5)
cats

[(0.891, 2.232], (0.312, 0.891], (-0.826, -0.279], (0.312, 0.891], (0.891, 2.232], ..., (-0.826, -0.279], (-0.279, 0.312], (-0.279, 0.312], (-0.826, -0.279], (-0.826, -0.279]]
Length: 100
Categories (5, interval[float64]): [(-2.1599999999999997, -0.826] < (-0.826, -0.279] < (-0.279, 0.312] < (0.312, 0.891] < (0.891, 2.232]]

In [40]:
pd.value_counts(cats)

(0.891, 2.232]                   20
(0.312, 0.891]                   20
(-0.279, 0.312]                  20
(-0.826, -0.279]                 20
(-2.1599999999999997, -0.826]    20
dtype: int64

You can also pass your own quantiles (numbers between 0 and 1, inclusive):

In [41]:
pd.qcut(data, [0., 0.1, 0.5, 0.9, 1.])

[(1.378, 2.232], (0.0978, 1.378], (-1.276, 0.0978], (0.0978, 1.378], (0.0978, 1.378], ..., (-1.276, 0.0978], (-1.276, 0.0978], (0.0978, 1.378], (-1.276, 0.0978], (-1.276, 0.0978]]
Length: 100
Categories (4, interval[float64]): [(-2.1599999999999997, -1.276] < (-1.276, 0.0978] < (0.0978, 1.378] < (1.378, 2.232]]
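When only the bin membership matters, passing labels=False to qcut returns integer bin numbers instead of intervals. A small sketch on deterministic data (the integers 0 through 99, chosen so the counts are easy to check):

```python
import numpy as np
import pandas as pd

values = np.arange(100)

# labels=False returns the integer bin number of each value.
quartile = pd.qcut(values, 4, labels=False)

# Each quartile of 100 distinct values holds the same number of points.
counts = pd.Series(quartile).value_counts().sort_index()
print(counts)
```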

### Detecting and filtering Outliers

Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:

In [42]:
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.01757,0.05288,-0.030803,0.017207
std,1.024012,0.98622,0.965596,1.009916
min,-3.286707,-3.543714,-3.265643,-2.918023
25%,-0.667339,-0.59377,-0.659105,-0.661184
50%,0.039549,0.053952,-0.020406,-0.009808
75%,0.695083,0.729569,0.641472,0.713299
max,3.467195,3.202953,3.600287,3.54612


Suppose you wanted to find values in one of the columns exceeding 3 in absolute value:

In [43]:
col = data[2]
col[np.abs(col) > 3]

238    3.048525
568    3.600287
702   -3.265643
Name: 2, dtype: float64

To select all rows having a value exceeding 3 or -3, you can use the *any* method on a boolean DataFrame:

In [44]:
data[(np.abs(data) > 3).any(axis=1)]

Unnamed: 0,0,1,2,3
13,-3.079434,-0.227662,-1.050335,0.429975
72,3.467195,0.555816,-0.658473,-0.736075
238,-2.027088,1.077093,3.048525,-0.041575
267,1.022494,3.011934,0.200792,-0.656785
281,-0.534094,-3.543714,1.239407,0.385387
397,-3.0179,0.280159,-0.691191,-0.446412
417,-0.511682,3.147952,0.458012,-0.384289
536,-3.131254,-0.442136,-0.796523,1.113741
546,1.062108,-1.100956,-0.939133,3.339225
568,-1.139334,-2.162619,3.600287,-0.530165
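Once outliers are located, one possible transformation is to cap them at the threshold. The sketch below, on a tiny made-up frame, shows two equivalent ways to limit values to the interval -3 to 3:

```python
import numpy as np
import pandas as pd

# Small illustrative frame with a few values outside [-3, 3].
data = pd.DataFrame({'a': [0.5, -3.7, 2.9], 'b': [4.2, 1.0, -3.0]})

# clip bounds every value to the given interval.
capped = data.clip(lower=-3, upper=3)

# Same result: keep values within the bounds, otherwise use sign * 3.
alt = data.where(np.abs(data) <= 3, np.sign(data) * 3)

print(capped)
print(alt)
```

np.sign returns 1 or -1 depending on the sign of each value, so multiplying by 3 produces the cap on the correct side.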


### Permutation and Random Sampling

Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the numpy.random.permutation function. Calling *permutation* with the length of the axis you want to permute produces an array of integers indicating the new ordering:

In [45]:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
sampler = np.random.permutation(5)
sampler

array([0, 4, 3, 1, 2])

In [46]:
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [47]:
df.take(sampler)

Unnamed: 0,0,1,2,3
0,0,1,2,3
4,16,17,18,19
3,12,13,14,15
1,4,5,6,7
2,8,9,10,11


To select a random subset without replacement, you can use the sample method on Series and DataFrame:

In [48]:
df.sample(n = 3)

Unnamed: 0,0,1,2,3
2,8,9,10,11
3,12,13,14,15
0,0,1,2,3


To generate a sample *with* replacement (to allow repeat choices), pass replace=True to sample:

In [49]:
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n = 10, replace = True)
draws

4    4
1    7
1    7
1    7
2   -1
3    6
0    5
0    5
2   -1
3    6
dtype: int64
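Random draws differ from run to run, which can make results hard to reproduce. One way to pin them down, sketched here with an arbitrary seed of 42, is to pass random_state to sample or to permute with a seeded NumPy generator:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# The same random_state yields the same sample every time.
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)
print(first.equals(second))

# A seeded generator gives a repeatable permutation of the row positions.
rng = np.random.default_rng(42)
shuffled = df.take(rng.permutation(len(df)))
print(shuffled)
```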

### Computing Indicator/Dummy Variables

Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a "dummy" or "indicator" matrix. If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns containing all 1s and 0s. pandas has a get_dummies function for doing this:

In [50]:
df = pd.DataFrame(
    {'key' : ['b', 'b', 'a', 'c', 'a', 'b'],
     'data1' : range(6)}
)

pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0
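In practice you may want the indicator columns to carry a prefix and to sit alongside the other data. A minimal sketch using the prefix argument and join:

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# prefix is prepended to each indicator column name.
dummies = pd.get_dummies(df['key'], prefix='key')

# Join the indicator columns back onto the remaining data.
df_with_dummy = df[['data1']].join(dummies)
print(df_with_dummy)
```

Depending on the pandas version, the indicator columns hold either 1/0 integers or True/False booleans. Passing dtype=int to get_dummies requests integers explicitly.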


## String manipulation

Python has long been a popular raw data manipulation language in part due to its ease of use for string and text processing. Most text operations are made simple with the string object's built-in methods. For more complex pattern matching and text manipulations, regular expressions may be needed. pandas adds to the mix by enabling you to apply string and regular expressions concisely on whole arrays of data, additionally handling the annoyance of missing data.

### String object methods

A comma-separated string can be broken into pieces with split:

In [51]:
val = 'a,b, guido'
val.split(',')

['a', 'b', ' guido']

split is often combined with strip to trim whitespace (including line breaks):

In [52]:
pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

In [53]:
first, second, third = pieces

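These substrings could be concatenated with a two-colon delimiter using addition (first + '::' + second + '::' + third), but this is not a practical generic method. A faster and more Pythonic way is to pass a list or tuple to the join method on the string '::':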

In [55]:
'::'.join(pieces)

'a::b::guido'

Other methods are concerned with locating substrings. Using Python's in keyword is the best way to detect a substring, though index and find can also be used. Note that index raises an exception if the substring isn't found, whereas find returns -1:

In [56]:
'guido' in val

True

In [57]:
val.index(',')

1

In [58]:
val.find(':')

-1

In [59]:
val.index(':')

ValueError: substring not found

Relatedly, count returns the number of occurrences of a particular substring:

In [60]:
val.count(',')

2

replace will substitute occurrences of one pattern for another. It is commonly used to delete patterns, too, by passing an empty string:

In [61]:
val.replace(',', '::')

'a::b:: guido'

### Regular expressions

Regular expressions provide a flexible way to search or match string patterns in text. A single expression, commonly called a regex, is a string formed according to the regular expression language. Python's built-in re module is responsible for applying regular expressions to strings.

In [63]:
text = "foo    bar\t bax  \tqux"
re.split(r'\s+', text)

['foo', 'bar', 'bax', 'qux']

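When you call re.split, the regular expression is first compiled, and then its split method is called on the passed text. You can compile the regex yourself with re.compile, forming a reusable regex object: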

In [64]:
regex = re.compile(r'\s+')
regex

re.compile(r'\s+', re.UNICODE)

In [65]:
regex.split(text)

['foo', 'bar', 'bax', 'qux']

If instead you wanted to get a list of all patterns matching the regex, you can use the findall method:

In [66]:
regex.findall(text)

['    ', '\t ', '  \t']

In [67]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

In [68]:
regex = re.compile(pattern, flags=re.IGNORECASE)

In [69]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

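The search method returns a special match object for the first email address in the text. For this regex, the match object can only tell us the start and end positions of the pattern in the string: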

In [71]:
m = regex.search(text)
m

<re.Match object; span=(5, 20), match='dave@google.com'>

In [73]:
text[m.start():m.end()]

'dave@google.com'
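It is worth contrasting search with match, which only succeeds if the pattern occurs at the very start of the string. A short sketch using a trimmed version of the text above:

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
"""

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

# search scans the whole string for the first occurrence.
found = regex.search(text)
print(found.group())

# match only tries at position 0, where the text starts with a name.
at_start = regex.match(text)
print(at_start)
```

Because the text begins with "Dave " rather than an email address, match returns None here.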

Relatedly, sub will return a new string with occurrences of the pattern replaced by a new string:

In [75]:
print(regex.sub('REDACTED', text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED



Suppose you wanted to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment:

In [79]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()

('wesm', 'bright', 'net')

In [80]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
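Groups are also available to sub through backreferences such as \1 and \2 in the replacement string. A small sketch on a shortened version of the text, with an arbitrary output format:

```python
import re

text = """Dave dave@google.com
Rob rob@gmail.com
"""

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

# \1, \2, \3 refer to the first, second, and third captured groups.
labeled = regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text)
print(labeled)
```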

### Vectorized String Functions in pandas

Cleaning up a messy dataset for analysis often requires a lot of string munging and regularization. To complicate matters, a column containing strings will sometimes have missing data:

In [83]:
data = {'Dave' : 'dave@google.com', 'Steve' : 'steve@gmail.com', 'Rob' : 'rob@gmail.com', 'Wez' : np.nan}
data = pd.Series(data)

In [84]:
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wez                  NaN
dtype: object

In [87]:
data.isnull()

Dave     False
Steve    False
Rob      False
Wez       True
dtype: bool

You can apply string and regular expression methods to each value using map (passing a lambda or other function), but it will fail on the NA values. To cope with this, Series has array-oriented methods for string operations that skip NA values. These are accessed through the Series's str attribute:

In [88]:
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wez        NaN
dtype: object

In [90]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wez                        NaN
dtype: object
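Two further str methods are often handy here. The sketch below uses str.extract to spread the captured groups into columns (the column names are chosen for illustration) and the na argument of str.contains to obtain a clean boolean mask:

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wez': np.nan})

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# extract returns one column per capture group; NA rows stay NA.
parts = data.str.extract(pattern, flags=re.IGNORECASE)
parts.columns = ['user', 'domain', 'suffix']  # illustrative names
print(parts)

# na=False treats missing values as non-matches, giving a pure boolean mask.
is_gmail = data.str.contains('gmail', na=False)
gmail_only = data[is_gmail]
print(gmail_only)
```

Without na=False, the mask would contain NaN for the missing entry and could not be used directly for boolean indexing.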