# Data Cleaning and Preparation

During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst's time. Effective data preparation can significantly improve productivity by enabling you to spend more time analysing data and less time getting it ready for analysis.

Fortunately, pandas, along with built-in Python language features, provides you with a high-level, flexible, and fast set of tools to enable you to manipulate data into the right form. 

## Handling Missing Data

Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. The way that missing data is represented in pandas objects is somewhat imperfect, but it is functional for a lot of users. For numeric data, pandas uses the floating-point value `NaN` (Not a Number) to represent missing data:

In [1]:
import pandas as pd
import numpy as np

In [2]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

Detecting null values:

In [3]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In pandas, we refer to missing data as NA, which stands for *not available*. In statistics applications, NA data may either be data that does not exist or data that exists but was not observed (through problems with data collection, for example). When cleaning up data for analysis, it is often important to analyse the missing data itself to identify data collection problems or potential biases caused by missing data.

The built-in Python `None` value is also treated as NA in object arrays:

In [4]:
string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

The table below lists some methods related to missing data handling:

**Method** | **Description**
--- | ---
`dropna` | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
`fillna` | Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
`isnull` | Return boolean values indicating which values are missing/NA.
`notnull` | Negation of `isnull`.
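
As a minimal sketch of how these four methods relate to one another, the snippet below applies each of them to a small Series. The Series `s` and the variable names are made up for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical Series mixing NaN and None; pandas treats both as missing.
s = pd.Series([1.0, np.nan, 3.0, None])

missing_mask = s.isnull()    # True where a value is missing
present_mask = s.notnull()   # element-wise negation of isnull
dropped = s.dropna()         # keeps only the non-missing entries
filled = s.fillna(0.0)       # replaces missing entries with 0.0

print(missing_mask.tolist())
print(dropped.index.tolist())
print(filled.tolist())
```

Note that `dropna` keeps the original index labels of the surviving values, so the result is not renumbered.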

### Filtering Out Missing Data

There are a few ways to filter out missing data. While you always have the option to do it by hand using `pandas.isnull` and boolean indexing, the `dropna` method can be helpful. On a Series, it returns the Series with only the non-null data and index values:

In [5]:
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

This is equivalent to:

In [6]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

Handling missing data in DataFrame objects can be a bit more complex. You may want to drop rows or columns that are all NA, or only those containing any NAs. By default, `dropna` drops any row containing a missing value:

In [7]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], 
                     [NA, NA, NA], [NA, 6.5, 3.]])

cleaned = data.dropna()
print("data:\n{}\n".format(data))
cleaned

data:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0



Unnamed: 0,0,1,2
0,1.0,6.5,3.0


Passing `how='all'` will only drop rows that are all NA:

In [8]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


To drop columns in the same way, pass `axis=1`:

In [9]:
data[4] = NA
print("data:\n{}\n".format(data))

data.dropna(axis=1, how='all')

data:
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN



Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


Another useful option is `thresh`, which is handy when dealing with time series data. Suppose you want to keep only rows containing at least a certain number of observations. You can indicate this with the `thresh` argument:

In [10]:
df = pd.DataFrame(np.random.randn(7, 3))

# insert some NA values
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA

print("{}\n".format(df))

df.dropna(thresh=2)

          0         1         2
0 -0.712025       NaN       NaN
1 -1.045048       NaN       NaN
2  1.257166       NaN  2.432204
3  0.565182       NaN -1.014928
4  0.775161 -0.521766  0.773446
5  0.613069  0.834693  0.737515
6 -0.493877  1.437790  1.513845



Unnamed: 0,0,1,2
2,1.257166,,2.432204
3,0.565182,,-1.014928
4,0.775161,-0.521766,0.773446
5,0.613069,0.834693,0.737515
6,-0.493877,1.43779,1.513845
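
Because the example above uses random numbers, the following sketch repeats the idea with a small, fixed DataFrame so the effect of `thresh` is easier to check by hand. The frame and its column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; non-missing values per row are 2, 0, 1 and 3.
frame = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [np.nan, np.nan, 3.0, 5.0],
    'c': [2.0, np.nan, np.nan, 6.0],
})

# Keep rows with at least two non-missing values.
at_least_two = frame.dropna(thresh=2)

# For comparison: drop only the rows in which every value is missing.
not_all_missing = frame.dropna(how='all')

print(at_least_two.index.tolist())
print(not_all_missing.index.tolist())
```

In other words, `thresh` counts the observations that are present in a row, not the ones that are missing.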


### Filling in Missing Data

Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the "holes" in any number of ways. For most purposes, the `fillna` method is the workhorse function to use. Calling `fillna` with a constant replaces missing values with that value:

In [11]:
df.fillna(0)

Unnamed: 0,0,1,2
0,-0.712025,0.0,0.0
1,-1.045048,0.0,0.0
2,1.257166,0.0,2.432204
3,0.565182,0.0,-1.014928
4,0.775161,-0.521766,0.773446
5,0.613069,0.834693,0.737515
6,-0.493877,1.43779,1.513845


Calling `fillna` with a dict, you can use a different fill value for each column:

In [12]:
df.fillna({1: 0.5, 2: 0})

Unnamed: 0,0,1,2
0,-0.712025,0.5,0.0
1,-1.045048,0.5,0.0
2,1.257166,0.5,2.432204
3,0.565182,0.5,-1.014928
4,0.775161,-0.521766,0.773446
5,0.613069,0.834693,0.737515
6,-0.493877,1.43779,1.513845


The `fillna` method returns a new object, but you can also modify the existing object in-place:

In [13]:
_ = df.fillna(0, inplace=True)
df

Unnamed: 0,0,1,2
0,-0.712025,0.0,0.0
1,-1.045048,0.0,0.0
2,1.257166,0.0,2.432204
3,0.565182,0.0,-1.014928
4,0.775161,-0.521766,0.773446
5,0.613069,0.834693,0.737515
6,-0.493877,1.43779,1.513845


The same interpolation methods available for reindexing can be used with `fillna`:

In [14]:
df = pd.DataFrame(np.random.randn(6,3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
print("{}\n".format(df))

df.fillna(method='ffill')

          0         1         2
0 -0.256399 -1.245849  0.963731
1 -0.994927  2.385819  0.014212
2 -0.097819       NaN -1.068646
3  0.820815       NaN  0.219291
4 -0.944163       NaN       NaN
5  0.565351       NaN       NaN



Unnamed: 0,0,1,2
0,-0.256399,-1.245849,0.963731
1,-0.994927,2.385819,0.014212
2,-0.097819,2.385819,-1.068646
3,0.820815,2.385819,0.219291
4,-0.944163,2.385819,0.219291
5,0.565351,2.385819,0.219291


In [15]:
df.fillna(method='ffill', limit=2)

Unnamed: 0,0,1,2
0,-0.256399,-1.245849,0.963731
1,-0.994927,2.385819,0.014212
2,-0.097819,2.385819,-1.068646
3,0.820815,2.385819,0.219291
4,-0.944163,,0.219291
5,0.565351,,0.219291
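
Newer pandas releases (2.1 and later, as far as I am aware) emit a deprecation warning for `fillna(method=...)` and point to the dedicated `ffill` and `bfill` methods instead. The sketch below shows the same forward filling with those methods on a small, made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a run of three missing values.
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill()                 # carry the last valid value forward
forward_limited = s.ffill(limit=2)  # fill at most two consecutive gaps
backward = s.bfill()                # carry the next valid value backward

print(forward.tolist())
print(forward_limited.tolist())
print(backward.tolist())
```

With `limit=2`, the third missing value in the run stays NA, mirroring the `limit=2` output above.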


You might also pass the mean or median value of a Series with `fillna()`:

In [16]:
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

The table below summarises the arguments of `fillna`:

**Argument** | **Description**
--- | ---
`value` | Scalar value or dict-like object to use to fill missing values
`method` | Interpolation method, such as `'ffill'` or `'bfill'`; default `None` (no interpolation)
`axis` | Axis to fill on; default `axis=0`
`inplace` | Modify the calling object without producing a copy
`limit` | For forward and backward filling, maximum number of consecutive periods to fill

## Data Transformation

Filtering, cleaning, and other transformations are another class of important operations when dealing with data for analysis.

### Removing Duplicates

Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:

In [17]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


The DataFrame method `duplicated` returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not:

In [18]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

Relatedly, `drop_duplicates` returns a DataFrame where the `duplicated` array is `False`:

In [19]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


Alternatively, you can specify any subset of the data to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates only based on the 'k1' column:

In [20]:
data['v1'] = range(7)
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


By default, `duplicated` and `drop_duplicates` keep the first observed value combination. Passing `keep='last'` keeps the last one instead:

In [21]:
data.drop_duplicates(['k1','k2'], keep='last')

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
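
To make the effect of `keep` easier to see, here is a small sketch on a made-up frame. Besides `'first'` and `'last'`, `keep` also accepts `False`, which drops every row that has a duplicate:

```python
import pandas as pd

# Hypothetical frame in which rows 0 and 2 are identical.
frame = pd.DataFrame({'k1': ['one', 'two', 'one', 'two'],
                      'k2': [1, 1, 1, 2]})

first = frame.drop_duplicates(keep='first')  # keeps row 0, drops row 2
last = frame.drop_duplicates(keep='last')    # keeps row 2, drops row 0
neither = frame.drop_duplicates(keep=False)  # drops both row 0 and row 2

print(first.index.tolist())
print(last.index.tolist())
print(neither.index.tolist())
```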


### Transforming Data Using a Function or Mapping

For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following hypothetical data collected about various kinds of meat:

In [22]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 
                              'Pastrami', 'corned beef', 'Bacon', 
                              'pastrami', 'honey ham', 'nova lox'],
                    'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


Suppose we want to add a column indicating the type of animal that each food came from. We can achieve this using the `map` method - a convenient way to perform element-wise transformations and other data cleaning-related operations. The `map` method on a Series accepts a function or dict-like object containing a mapping. 

But first, we'll need a mapping of each distinct meat type to the kind of animal. We also need to convert each value in the `food` column to lowercase using the `str.lower` method before applying the `map` method:

In [23]:
# initialise dict for mapping later on
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

# convert all food labels to lowercase before applying mapping
lowercased = data['food'].str.lower()

# append 'animal' column based on mapping earlier
data['animal']  = lowercased.map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


We could also pass a function that does all the work, such as a `lambda`:

In [24]:
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
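
The two approaches differ when a value has no entry in the mapping. The sketch below, which uses a shortened, hypothetical mapping and Series, illustrates this: mapping with a dict yields NaN for unknown keys, whereas indexing the dict inside a function raises `KeyError`. Using `dict.get` with a default is one way to handle that case:

```python
import pandas as pd

# Shortened, hypothetical mapping; 'tofu' is deliberately absent.
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}
food = pd.Series(['bacon', 'Pastrami', 'tofu'])

# Dict mapping: values without an entry become NaN.
via_dict = food.str.lower().map(meat_to_animal)

# Function mapping with direct indexing: a missing key raises KeyError.
try:
    food.map(lambda x: meat_to_animal[x.lower()])
    missing_key = None
except KeyError as exc:
    missing_key = exc.args[0]

# Function mapping with a default value for unknown foods.
via_get = food.map(lambda x: meat_to_animal.get(x.lower(), 'unknown'))

print(via_dict.tolist())
print(missing_key)
print(via_get.tolist())
```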

### Replacing Values

Filling in missing data with the `fillna` method is a special case of more general value replacement. As you've already seen, `map` can be used to modify a subset of values in an object, but the `replace` method provides a simpler and more flexible way to do so. Let's consider this Series, in which values such as -999 might be sentinel values for missing data:

In [25]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [26]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [27]:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

To use a different replacement for each value, we can pass a list of substitutes:

In [28]:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The argument passed can also be a dict:

In [29]:
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
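
One point that is easy to trip over: `replace` on a Series matches whole values, while `str.replace` performs substring substitution within each string. The following sketch contrasts the two on a made-up Series of strings:

```python
import pandas as pd

# Hypothetical Series of strings containing hyphens.
s = pd.Series(['a-b', '-', 'c'])

# Series.replace: only elements exactly equal to '-' are replaced.
whole_value = s.replace('-', 'missing')

# Series.str.replace: every '-' inside each string is substituted.
substring = s.str.replace('-', '_', regex=False)

print(whole_value.tolist())
print(substring.tolist())
```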

### Renaming Axis Indexes

Like values in a Series, axis labels can be similarly transformed by a function or mapping of some form to produce new, differently labeled objects. You can also modify the axes in-place without creating a new data structure. Here's a simple example:

In [30]:
data = pd.DataFrame(np.arange(12).reshape((3,4)),
                   index=['Ohio', 'Colorado', 'New York'],
                   columns=['one', 'two', 'three', 'four'])

transform = lambda x: x[:4].upper()

data.index = data.index.map(transform)
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


You can also use the `rename` method if you want to create a transformed version of a dataset without modifying the original:

In [31]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


Notably, `rename` can also be used in conjunction with a dict-like object providing new values for a subset of the axis labels:

In [32]:
data.rename(index={'OHIO': 'INDIANA'},
           columns={'three': 'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


Should you wish to modify a dataset in-place, pass `inplace=True`:

In [33]:
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data

Unnamed: 0,one,two,three,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


### Discretisation and Binning

Continuous data is often discretised or otherwise separated into "bins" for analysis. Suppose we have data about a group of people in a study, and we want to group them into discrete age buckets. We can divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To do this, we can use the `cut` function in pandas:

In [34]:
ages = [20, 22, 25, 27, 21, 23, 37, 61, 45, 41, 32]
bins = [17, 25, 35, 60, 100]

cats = pd.cut(ages, bins)
cats

[(17, 25], (17, 25], (17, 25], (25, 35], (17, 25], ..., (35, 60], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 11
Categories (4, interval[int64]): [(17, 25] < (25, 35] < (35, 60] < (60, 100]]

The object returned is a special `Categorical` object. The output you see describes the bins computed by `pandas.cut`. 

You can treat it like an array of strings indicating the bin name; internally it contains a `categories` array specifying the distinct category names along with a labelling for the `ages` data in the `codes` attribute:

In [35]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 3, 2, 2, 1], dtype=int8)

In [36]:
cats.categories

IntervalIndex([(17, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')

In [37]:
# doing a value count for cats array
pd.value_counts(cats)

(17, 25]     5
(35, 60]     3
(25, 35]     2
(60, 100]    1
dtype: int64

Note the use of parentheses and square brackets. Consistent with mathematical notation for intervals, a parenthesis means that the side is *open* (exclusive), while a square bracket means it is *closed* (inclusive). You can change which side is closed by passing `right=False`:

In [38]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [36, 61), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 11
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]

Alternatively, you can pass your own bin names by passing a list or array to the `labels` option:

In [39]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels = group_names)

[Youth, Youth, Youth, YoungAdult, Youth, ..., MiddleAged, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 11
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
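
To see how the open and closed sides affect values that fall exactly on a bin edge, consider this small sketch. The ages are chosen by hand to sit on the edges, and the variable names are illustrative only:

```python
import pandas as pd

# Hypothetical ages that sit exactly on bin edges.
ages = [18, 25, 26, 60, 61, 100]
labels = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

# Right-closed bins: (17, 25], (25, 35], (35, 60], (60, 100]
right_closed = pd.cut(ages, [17, 25, 35, 60, 100], labels=labels)

# Left-closed bins: [18, 26), [26, 36), [36, 61), [61, 100)
left_closed = pd.cut(ages, [18, 26, 36, 61, 100], right=False, labels=labels)

print(list(right_closed))
print(list(left_closed))
```

The two calls agree on every age except 100: it belongs to `(60, 100]` but lies outside `[61, 100)`, so the left-closed version marks it as missing.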

Instead of passing in explicit bin edges, you can also pass in an integer number of bins to cut. By doing so, `pandas` will compute equal-length bins based on the minimum and maximum values in the data. Consider the case of some uniformly distributed data chopped into fourths:

In [40]:
data = np.random.rand(20)
pd.cut(data, 4, precision=2)

[(0.53, 0.76], (0.3, 0.53], (0.53, 0.76], (0.76, 0.98], (0.3, 0.53], ..., (0.068, 0.3], (0.76, 0.98], (0.3, 0.53], (0.3, 0.53], (0.53, 0.76]]
Length: 20
Categories (4, interval[float64]): [(0.068, 0.3] < (0.3, 0.53] < (0.53, 0.76] < (0.76, 0.98]]

The `precision=2` option limits the decimal precision to two digits.

A closely related function, `qcut`, bins the data based on sample quantiles. Depending on the distribution of the data, using `cut` will not usually result in each bin having the same number of data points. Since `qcut` uses sample quantiles instead, by definition, you will obtain roughly equal-sized bins:

In [41]:
data = np.random.randn(1000)  # normally distributed

cats = pd.cut(data, 4)  # cut based on value intervals derived from maximum and minimum values
print(cats)
pd.value_counts(cats)   # this will not result in equal-sized bins

[(0.279, 1.853], (0.279, 1.853], (0.279, 1.853], (0.279, 1.853], (0.279, 1.853], ..., (-1.294, 0.279], (-1.294, 0.279], (0.279, 1.853], (-1.294, 0.279], (-1.294, 0.279]]
Length: 1000
Categories (4, interval[float64]): [(-2.874, -1.294] < (-1.294, 0.279] < (0.279, 1.853] < (1.853, 3.426]]


(-1.294, 0.279]     506
(0.279, 1.853]      355
(-2.874, -1.294]     99
(1.853, 3.426]       40
dtype: int64

In [42]:
cats = pd.qcut(data, 4, precision=2)  # cut data distribution into quartiles
print(cats)
pd.value_counts(cats)

[(0.72, 3.43], (0.72, 3.43], (0.011, 0.72], (0.72, 3.43], (0.72, 3.43], ..., (0.011, 0.72], (-0.71, 0.011], (0.011, 0.72], (-0.71, 0.011], (-0.71, 0.011]]
Length: 1000
Categories (4, interval[float64]): [(-2.88, -0.71] < (-0.71, 0.011] < (0.011, 0.72] < (0.72, 3.43]]


(0.72, 3.43]      250
(0.011, 0.72]     250
(-0.71, 0.011]    250
(-2.88, -0.71]    250
dtype: int64

Similar to `cut`, you can pass your own quantiles (numbers between 0 and 1, inclusive):

In [43]:
cats = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1])
print(cats)
pd.value_counts(cats)

[(0.0114, 1.353], (0.0114, 1.353], (0.0114, 1.353], (0.0114, 1.353], (0.0114, 1.353], ..., (0.0114, 1.353], (-1.287, 0.0114], (0.0114, 1.353], (-1.287, 0.0114], (-1.287, 0.0114]]
Length: 1000
Categories (4, interval[float64]): [(-2.8689999999999998, -1.287] < (-1.287, 0.0114] < (0.0114, 1.353] < (1.353, 3.426]]


(0.0114, 1.353]                  400
(-1.287, 0.0114]                 400
(1.353, 3.426]                   100
(-2.8689999999999998, -1.287]    100
dtype: int64
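
Since the examples above depend on random draws, here is a deterministic sketch of the same contrast between `cut` and `qcut`. It uses the squares of 1 to 100 as a stand-in for skewed data; this data set is an assumption made purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed data: 1, 4, 9, ..., 10000.
values = np.arange(1, 101) ** 2

# Equal-width bins between the minimum and maximum value.
width_counts = pd.Series(pd.cut(values, 4)).value_counts().sort_index()

# Quantile-based bins: each bin holds a quarter of the observations.
quantile_counts = pd.Series(pd.qcut(values, 4)).value_counts().sort_index()

print(width_counts.tolist())
print(quantile_counts.tolist())
```

With equal-width bins, half of the observations land in the lowest bin, while the quantile-based bins each receive 25 observations.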

### Detecting and Filtering Outliers

Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:

In [44]:
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.068484,0.026913,-0.058024,0.009585
std,0.960168,0.980398,0.961977,0.983646
min,-2.707738,-3.513005,-2.797748,-3.348951
25%,-0.579887,-0.670651,-0.721559,-0.625822
50%,0.064303,0.032526,-0.096069,-0.035012
75%,0.673561,0.682266,0.577471,0.643988
max,2.979471,2.829505,2.499345,3.144543


Now suppose you wanted to find values in one of the columns exceeding 3 in absolute value:

In [45]:
col = data[2]   # taking only 3rd column
col[np.abs(col) > 3]

Series([], Name: 2, dtype: float64)

To select all rows having a value above 3 or below -3, you can use the `any` method on a boolean DataFrame:

In [46]:
data[(np.abs(data) > 3).any(axis=1)]

Unnamed: 0,0,1,2,3
345,2.14113,-3.513005,0.865643,-0.152237
789,0.806581,1.914245,-1.30614,3.144543
827,0.822624,1.317006,0.8002,-3.348951


To cap values outside the interval -3 to 3, you can use the `np.sign` function. `np.sign(data)` produces 1 and -1 values based on whether the values in `data` are positive or negative:

In [47]:
np.sign(data).head()

Unnamed: 0,0,1,2,3
0,1.0,1.0,1.0,-1.0
1,-1.0,1.0,1.0,1.0
2,1.0,-1.0,-1.0,-1.0
3,-1.0,1.0,-1.0,1.0
4,-1.0,-1.0,-1.0,-1.0


In [48]:
data[np.abs(data) > 3] = np.sign(data) * 3  # capping values beyond 3 and -3
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.068484,0.027426,-0.058024,0.00979
std,0.960168,0.978676,0.961977,0.982063
min,-2.707738,-3.0,-2.797748,-3.0
25%,-0.579887,-0.670651,-0.721559,-0.625822
50%,0.064303,0.032526,-0.096069,-0.035012
75%,0.673561,0.682266,0.577471,0.643988
max,2.979471,2.829505,2.499345,3.0
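
As an alternative to the `np.sign` approach, the `clip` method on a DataFrame caps values at given lower and upper bounds. The sketch below applies both techniques to a small, made-up frame and is meant only to show that they agree here:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few values outside the interval [-3, 3].
frame = pd.DataFrame({'x': [-4.2, 0.5, 3.7],
                      'y': [1.0, -3.0, 2.9]})

# Approach from the text: overwrite outliers with sign * 3.
capped_sign = frame.copy()
capped_sign[capped_sign.abs() > 3] = np.sign(capped_sign) * 3

# Alternative: clip every value to the interval [-3, 3].
capped_clip = frame.clip(lower=-3, upper=3)

print(capped_sign)
print(capped_clip)
```

`clip` returns a new object and leaves `frame` untouched, whereas the assignment form modifies the DataFrame it is applied to.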


### Permutation and Random Sampling

Permuting (randomly reordering) a Series or the rows in a DataFrame can be easily done using the `numpy.random.permutation` function. Calling `permutation` with the length of the axis you want to permute produces an array of integers indicating the new ordering: 

In [49]:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))  # generating new DataFrame
sampler = np.random.permutation(5)  # random ordering of the row positions 0 to 4
sampler

array([0, 3, 1, 4, 2])

That array can then be used in iloc-based indexing or the equivalent `take` function:

In [50]:
print(df)

df.take(sampler)

    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19


Unnamed: 0,0,1,2,3
0,0,1,2,3
3,12,13,14,15
1,4,5,6,7
4,16,17,18,19
2,8,9,10,11


To select a random subset without replacement, you can use the `sample` method on Series and DataFrame:

In [51]:
df.sample(n=3)

Unnamed: 0,0,1,2,3
2,8,9,10,11
0,0,1,2,3
4,16,17,18,19


For generating a sample _with_ replacement (i.e. to allow repeat choices), pass `replace=True` to `sample`:

In [52]:
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True)
draws

1    7
0    5
3    6
2   -1
0    5
1    7
0    5
1    7
2   -1
2   -1
dtype: int64
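
Random samples differ from run to run, which can make results hard to reproduce. One option is the `random_state` argument of `sample`, which seeds the draw. The frame and the seed values below are arbitrary choices for illustration:

```python
import pandas as pd

# Hypothetical frame with ten rows.
frame = pd.DataFrame({'value': range(10)})

# The same seed gives the same sample each time.
first = frame.sample(n=3, random_state=42)
second = frame.sample(n=3, random_state=42)

# Sampling with replacement can draw more rows than the frame holds.
bootstrap = frame['value'].sample(n=20, replace=True, random_state=0)

print(first.equals(second))
print(len(bootstrap))
```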

### Computing Indicator/Dummy Variables

Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a "dummy" or "indicator" matrix. If a column in a DataFrame has `k` distinct values, you would derive a matrix or DataFrame with `k` columns containing all 1s and 0s. `pandas` has a `get_dummies` function for doing this, though devising one yourself is not difficult. Consider this example:

In [53]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data. `get_dummies` has a prefix argument for doing this:

In [54]:
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0


If a row in a DataFrame belongs to multiple categories, things are a bit more complicated. Let's look at the MovieLens 1M dataset:

In [55]:
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('datasets/movielens/movies.dat', sep='::', header=None, names=mnames, engine='python')  # multi-character separators need the Python engine
movies[:10]


Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


Adding indicator variables for each genre requires a little bit of wrangling. First, we extract the list of unique genres in the dataset:

In [56]:
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))  # extend list, append elements from the iterable

genres = pd.unique(all_genres)
genres

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)

One way to construct the indicator DataFrame is to start with a DataFrame of all zeroes, then iterate through each movie and set the relevant entries in each row of `dummies` to 1. To do this, we use `dummies.columns` to compute the column indices for each genre:

In [57]:
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)

In [58]:
gen = movies.genres[0]
gen.split('|')

['Animation', "Children's", 'Comedy']

In [59]:
dummies.columns.get_indexer(gen.split('|'))

array([0, 1, 2])

Then, we can use `.iloc` to set values based on these indices:

In [60]:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

In [61]:
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]

movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Adventure                                0
Genre_Fantasy                                  0
Genre_Romance                                  0
Genre_Drama                                    0
Genre_Action                                   0
Genre_Crime                                    0
Genre_Thriller                                 0
Genre_Horror                                   0
Genre_Sci-Fi                                   0
Genre_Documentary                              0
Genre_War                                      0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Film-Noir                                0
Genre_Western                                  0
Name: 0, dtype: object

_**Note**: For much larger data, this method of constructing indicator variables with multiple membership is not especially speedy. It is better to write a lower-level function that writes directly to a NumPy array, and then wrap the result in a DataFrame._
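
A minimal sketch of that lower-level idea follows, assuming a tiny made-up `movies` frame rather than the MovieLens data. It fills a NumPy array directly and wraps it in a DataFrame at the end; the helper names (`split_genres`, `column_of`) are hypothetical. For comparison, it also calls the built-in `str.get_dummies`, which splits on a separator and builds the indicator columns in one step:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the MovieLens movies table.
movies = pd.DataFrame({'title': ['A', 'B', 'C'],
                       'genres': ['Comedy|Drama', 'Action', 'Drama|Action|Crime']})

# Split once, then collect the distinct genres in order of first appearance.
split_genres = [g.split('|') for g in movies['genres']]
genres = list(dict.fromkeys(g for row in split_genres for g in row))
column_of = {genre: j for j, genre in enumerate(genres)}

# Write the indicators straight into a NumPy array.
indicator = np.zeros((len(movies), len(genres)), dtype=np.int8)
for i, row in enumerate(split_genres):
    indicator[i, [column_of[g] for g in row]] = 1

dummies = pd.DataFrame(indicator, columns=genres,
                       index=movies.index).add_prefix('Genre_')

# Built-in alternative; its columns come back in sorted order.
built_in = movies['genres'].str.get_dummies('|')

print(dummies)
```

Whether this is faster than the `.iloc` loop on real data would need to be measured; the sketch only shows the shape of the approach.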

A useful recipe for statistical applications is to combine `get_dummies` with a discretisation function like `cut`:

In [62]:
values = np.random.rand(10)

bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,1,0,0,0,0
1,0,0,0,0,1
2,0,0,0,0,1
3,0,0,0,0,1
4,1,0,0,0,0
5,1,0,0,0,0
6,0,0,0,1,0
7,0,1,0,0,0
8,0,0,0,0,1
9,1,0,0,0,0


## String Manipulation 

Python has long been a popular raw data manipulation language in part due to its ease of use for string and text processing. Most text operations are made simple with the string object's built-in methods. For more complex pattern matching and text manipulations, regular expressions may be needed. `pandas` adds to the mix by enabling you to apply string and regular expressions concisely on whole arrays of data, additionally handling the annoyance of missing data.

### String Object Methods

In many string munging and scripting applications, built-in string methods are sufficient. As an example, a comma-separated string can be broken into pieces with `split`:

In [63]:
val = 'a,b, guido'
val.split(',')

['a', 'b', ' guido']

`split` is often combined with `strip` to trim whitespace (including line breaks):

In [64]:
pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

These substrings can be concatenated together with a two-colon delimiter using addition:

In [65]:
first, second, third = pieces
first + '::' + second + '::' + third

'a::b::guido'

A faster and more Pythonic way to concatenate in the above example is to pass a list or tuple to the `join` method on the string '::':

In [66]:
'::'.join(pieces)

'a::b::guido'

Other methods are concerned with locating substrings. Using Python's `in` keyword is the best way to detect a substring, though `index` and `find` can also be used:

In [67]:
print('{}'.format('guido' in val))
print('{}'.format(val.index(',')))
print('{}'.format(val.find(':')))

True
1
-1


The table below lists some of Python's string methods:

**Method** | **Description**
--- | ---
`count` | Return the number of non-overlapping occurrences of substring in the string
`endswith` | Returns `True` if string ends with suffix
`startswith` | Returns `True` if string starts with prefix
`join` | Use string as delimiter for concatenating a sequence of other strings
`index` | Return position of first character in substring if found in the string; raises `ValueError` if not found.
`find` | Return position of first character of _first_ occurrence of substring in the string; like `index` but returns -1 if not found
`rfind` | Returns position of first character of _last_ occurrence of substring in the string; returns -1 if not found.
`replace` | Replace occurrences of string with another string
`strip, rstrip, lstrip` | Trim whitespace, including newlines, from both sides, the right side, or the left side of the string, respectively
`split` | Break string into list of substrings using passed delimiter.
`lower` | Convert alphabet characters to lowercase
`upper` | Convert alphabet characters to uppercase
`casefold` | Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form.
`ljust, rjust` | Left justify or right justify, respectively; pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width
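
The short sketch below exercises a handful of the methods from the table on the `val` string used earlier. The variable names are illustrative only:

```python
val = 'a,b, guido'

comma_count = val.count(',')        # number of commas in the string
starts_with_a = val.startswith('a')
last_comma = val.rfind(',')         # position of the last comma
replaced = val.replace(',', '::')   # every comma becomes '::'
padded = 'abc'.ljust(6, '.')        # pad on the right to width 6
folded = 'Straße'.casefold()        # 'ß' is folded to 'ss'

# index raises ValueError where find would return -1.
try:
    val.index(':')
    index_failed = False
except ValueError:
    index_failed = True

print(comma_count, starts_with_a, last_comma)
print(replaced, padded, folded, index_failed)
```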

## Regular Expressions

_Regular expressions_ provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called a _regex_, is a string formed according to the regular expression language. Python's built-in `re` module is responsible for applying regular expressions to strings; I'll give a number of examples of its use here.

The `re` module functions fall into three categories: pattern matching, substitution, and splitting. Naturally these are all related; a regex describes a pattern to locate in the text, which can then be used for many purposes. Let's look at a simple example:

In [68]:
import re

text = "foo    bar\t baz    \tqux"
re.split(r'\s+', text)  # \s+: regex for describing one or more whitespace characters

['foo', 'bar', 'baz', 'qux']

When you call `re.split(r'\s+', text)`, the regular expression is first _compiled_, and then its `split` method is called on the passed text. You can compile the regex yourself with `re.compile`, forming a reusable regex object:

In [69]:
regex = re.compile(r'\s+')  # store regex object reference
regex.split(text)

['foo', 'bar', 'baz', 'qux']

If, instead, you wanted to get a list of all patterns matching the regex, you can use the `findall` method:

In [70]:
regex.findall(text)

['    ', '\t ', '    \t']

_**Note:** To avoid unwanted escaping with \ in a regular expression, use raw string literals like `r'C:\x'` instead of the equivalent `'C:\\x'`._

Creating a regex object with `re.compile` is highly recommended if you intend to apply the same expression to many strings; doing so will save CPU cycles. 

`match` and `search` are closely related to `findall`. While `findall` returns all matches in a string, `search` returns only the first match. More rigidly, `match` _only_ matches at the beginning of the string. As a less trivial example, let's consider a block of text and a regex capable of identifying most email addresses:

In [71]:
text = """Dave dave@google.com
    Steve steve@gmail.com
    Rob rob@gmail.com
    Ryan ryan@yahoo.com
    """

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

`search` returns a special match object for the first email address in the text. For the preceding regex, the match object can only tell us the start and end position of the pattern in the string:

In [72]:
m = regex.search(text)
m

<re.Match object; span=(5, 20), match='dave@google.com'>

In [73]:
text[m.start():m.end()]

'dave@google.com'

`regex.match` returns `None`, as it will only match if the pattern occurs at the start of the string:

In [74]:
print(regex.match(text))

None
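
To make the difference between `match` and `search` concrete, here is a minimal sketch using the same email pattern on two short strings. The strings `at_start` and `in_middle` are invented for illustration.

```python
import re

# Same email pattern as in the text above
regex = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', flags=re.IGNORECASE)

at_start = 'dave@google.com is Dave'
in_middle = 'Dave is dave@google.com'

# match only succeeds when the pattern begins at position 0
m_start = regex.match(at_start)     # a match object
m_middle = regex.match(in_middle)   # None

# search scans the whole string and returns the first match it finds
m_search = regex.search(in_middle)

print(m_start.group())
print(m_middle)
print(m_search.group(), m_search.span())
```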


Relatedly, `sub` will return a new string with occurrences of the pattern replaced by a new string:

In [75]:
print(regex.sub('REDACTED', text))

Dave REDACTED
    Steve REDACTED
    Rob REDACTED
    Ryan REDACTED
    


Suppose you wanted to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment:

In [76]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

m = regex.match('wesm@bright.net')
m.groups()  # display each text segment captured

('wesm', 'bright', 'net')

`findall` returns a list of tuples when the pattern has groups:

In [77]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

`sub` also has access to groups in each match using special symbols like `\1` and `\2`. The symbol `\1` corresponds to the first match group, `\2` corresponds to the second, and so forth:

In [78]:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com
    Steve Username: steve, Domain: gmail, Suffix: com
    Rob Username: rob, Domain: gmail, Suffix: com
    Ryan Username: ryan, Domain: yahoo, Suffix: com
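
Groups can also be given names instead of numbers. The sketch below is a variation on the pattern above, not part of the original example: it uses the `(?P<name>...)` syntax, with group names (`username`, `domain`, `suffix`) chosen here for illustration, and shows `groupdict` and the `\g<name>` replacement form.

```python
import re

# Same three groups as before, but each one is named
named = re.compile(
    r'(?P<username>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})',
    flags=re.IGNORECASE,
)

m = named.match('wesm@bright.net')
info = m.groupdict()   # dict mapping group name to captured text
print(info)

# In a replacement string, \g<name> refers to a named group
masked = named.sub(r'\g<username>@***.\g<suffix>', 'Rob rob@gmail.com')
print(masked)
```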
    


There is much more to regular expressions in Python, most of which is not covered in this note. The table below gives a brief summary of the regex methods:

**Method** | **Description**
--- | ---
`findall` | Return all non-overlapping matching patterns in a string as a list
`finditer` | Like `findall`, but returns an iterator
`match` | Match pattern at start of the string, and optionally segment pattern components into groups; if the pattern matches, returns a match object, and otherwise `None`
`search` | Scan string for a match to the pattern, returning a match object if one is found; unlike `match`, the match can be anywhere in the string as opposed to only at the beginning
`split` | Break string into pieces at each occurrence of pattern
`sub, subn` | Replace occurrences of pattern in string with replacement expression (all of them by default, or at most `count` if given); `subn` does the same but returns a tuple of the new string and the number of substitutions made; use symbols `\1`, `\2`, ... to refer to match group elements in the replacement string
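
Two entries in the table, `finditer` and `subn`, are not demonstrated above. The following sketch shows them along with the `count` argument of `sub`. The shortened `text` is a stand-in for the block of text used earlier.

```python
import re

regex = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', flags=re.IGNORECASE)
text = "Dave dave@google.com\nSteve steve@gmail.com\nRob rob@gmail.com"

# finditer yields match objects one at a time instead of building a list
found = [(m.group(), m.start()) for m in regex.finditer(text)]
print(found)

# subn returns a tuple: (new string, number of substitutions made)
new_text, n_subs = regex.subn('REDACTED', text)
print(n_subs)

# count limits how many occurrences sub replaces
first_only = regex.sub('REDACTED', text, count=1)
print(first_only)
```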

### Vectorised String Functions in pandas

Cleaning up a messy dataset for analysis often requires a lot of string munging and regularisation. To complicate matters, a column containing strings will sometimes have missing data:

In [79]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com', 'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)

In [80]:
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [81]:
data.isnull()

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

You can apply string and regular expression methods (passing a `lambda` or other function) to each value using `data.map`, but it will fail on the NA (null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values. These are accessed through the Series's `str` attribute; for example, we could check whether each email address has `'gmail'` in it with `str.contains`:

In [82]:
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
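
The failure mode mentioned above can be seen directly. This sketch rebuilds the same Series, shows that a plain `map` raises `TypeError` on the NaN entry, and then shows two ways around it: `na_action='ignore'` on `map`, and the `na` argument of `str.contains`. The names `map_failed`, `mapped`, and `has_gmail` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# A plain Python function is also called on the NaN (a float), so `in` fails
try:
    data.map(lambda x: 'gmail' in x)
    map_failed = False
except TypeError:
    map_failed = True
print(map_failed)

# na_action='ignore' tells map to pass NA values through untouched
mapped = data.map(lambda x: 'gmail' in x, na_action='ignore')
print(mapped)

# str.contains skips NA by default; na=False fills those slots with False instead
has_gmail = data.str.contains('gmail', na=False)
print(has_gmail)
```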

Regular expressions can be used too, along with any `re` options like `re.IGNORECASE`:

In [83]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

There are a couple of ways to do vectorised element retrieval. Either use `str.get` or index into the `str` attribute:

In [84]:
matches = data.str.findall(pattern, flags=re.IGNORECASE).str[0]

# .str[0] takes the first element of each row's list; in the previous example
# each list contained a single tuple, so each row is now that tuple
matches

Dave     (dave, google, com)
Steve    (steve, gmail, com)
Rob        (rob, gmail, com)
Wes                      NaN
dtype: object

In [85]:
matches.str.get(1)

Dave     google
Steve     gmail
Rob       gmail
Wes         NaN
dtype: object

You can similarly slice strings using this syntax:

In [86]:
data.str[:5]

Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object

The `extract` method will return the captured groups of a regular expression as a DataFrame:

In [87]:
data.str.extract(pattern, flags=re.IGNORECASE)

           0       1    2
Dave    dave  google  com
Steve  steve   gmail  com
Rob      rob   gmail  com
Wes      NaN     NaN  NaN
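
If the pattern uses named groups, `extract` uses those names as the column labels rather than 0, 1, 2. The sketch below assumes the same Series as above; the group names (`username`, `domain`, `suffix`) are chosen here for illustration.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# Named groups become the DataFrame's column names
named_pattern = (r'(?P<username>[A-Z0-9._%+-]+)'
                 r'@(?P<domain>[A-Z0-9.-]+)'
                 r'\.(?P<suffix>[A-Z]{2,4})')

parts = data.str.extract(named_pattern, flags=re.IGNORECASE)
print(parts)
```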


The table below lists more pandas string methods:

**Method** | **Description**
--- | ---
`cat` | Concatenate strings element-wise with optional delimiter
`contains` | Return boolean array indicating whether each string contains pattern/regex
`count` | Count occurrences of pattern
`extract` | Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group
`endswith` | Equivalent to `x.endswith(pattern)` for each element
`startswith` | Equivalent to `x.startswith(pattern)` for each element
`findall` | Compute list of all occurrences of pattern/regex for each string
`get` | Index into each element (retrieve _i_-th element)
`isalnum` | Equivalent to built-in `str.isalnum`
`isalpha` | Equivalent to built-in `str.isalpha`
`isdecimal` | Equivalent to built-in `str.isdecimal`
`isdigit` | Equivalent to built-in `str.isdigit`
`islower` | Equivalent to built-in `str.islower`
`isnumeric` | Equivalent to built-in `str.isnumeric`
`isupper`| Equivalent to built-in `str.isupper`
`join` | Join strings in each element of the Series with passed separator
`len` | Compute length of each string
`lower, upper` | Convert cases; equivalent to `x.lower()` or `x.upper()` for each element
`match` | Use `re.match` with the passed regular expression on each element, returning `True` or `False` whether it matches
`pad` | Add whitespace to left, right or both sides of the string
`center` | Equivalent to `pad(side='both')`
`repeat` | Duplicate values (e.g., `s.str.repeat(3)` is equivalent to `x * 3` for each string)
`replace` | Replace occurrences of pattern/regex with some other string
`slice` | Slice each string in the Series
`split` | Split strings on delimiter or regular expression
`strip` | Trim whitespace from both sides, including newlines
`rstrip` | Trim whitespace on right side
`lstrip` | Trim whitespace on left side
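
As a rough sketch of how several of these methods combine in practice, the example below cleans a small Series of messy email strings. The input values in `raw` are invented for illustration, and `regex=True` is passed explicitly to `str.replace` because its default has differed between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical messy input: stray whitespace, mixed case, a missing value
raw = pd.Series(['  Dave@Google.com ', 'steve@gmail.com\n', np.nan])

# strip removes surrounding whitespace (including newlines); lower normalises case
cleaned = raw.str.strip().str.lower()

lengths = cleaned.str.len()                      # length of each string, NaN for NA
domains = cleaned.str.split('@').str.get(1)      # part after the '@'
is_com = cleaned.str.endswith('.com')            # suffix check per element
masked = cleaned.str.replace(r'^[^@]+', '***', regex=True)  # hide the username

print(cleaned)
print(domains)
print(masked)
```

Every step passes the missing value through as NA rather than raising an error.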