# Data Cleaning and Preparation

During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst's time. Fortunately, pandas, along with built-in Python language features, provides a high-level, flexible, and fast set of tools for manipulating data into the right form.

## Handling Missing Data

Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. The way that missing data is represented in pandas objects is somewhat imperfect, but it is functional for a lot of users. For numeric data, pandas uses the floating-point value `NaN` (Not a Number) to represent missing data:

In [1]:
import pandas as pd
import numpy as np

In [2]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

Detecting null values:

In [3]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In pandas, we refer to missing data as NA, which stands for *not available*. In statistics applications, NA data may be data that does not exist or data that exists but was not observed (for example, because of problems with data collection). When cleaning up data for analysis, it is often important to analyze the missing data itself to identify data collection problems or potential biases caused by the missing data.
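
As a minimal sketch of what analyzing the missing data itself can look like, the count and share of NA values per column can be computed by combining `isnull` with `sum` and `mean`. The `survey` DataFrame and its column names are hypothetical and serve only as an illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; the column names are made up for illustration.
survey = pd.DataFrame({
    'age': [34, np.nan, 29, 41],
    'income': [52000, np.nan, np.nan, 61000],
    'city': ['Oslo', 'Lima', None, 'Pune'],
})

# Number of missing values in each column
missing_counts = survey.isnull().sum()

# Fraction of missing values in each column
missing_share = survey.isnull().mean()

print(missing_counts)
print(missing_share)
```

A column with an unusually high share of missing values is a natural place to start looking for a data collection problem.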

The built-in Python `None` value is also treated as NA in object arrays:

In [4]:
string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

See the table below for a list of some methods related to missing data handling:

**Method** | **Description**
--- | ---
`dropna` | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
`fillna` | Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
`isnull` | Return boolean values indicating which values are missing/NA.
`notnull` | Negation of `isnull`.

### Filtering Out Missing Data

There are a few ways to filter out missing data. While you always have the option to do it by hand using `pandas.isnull` and boolean indexing, the `dropna` method can be helpful. On a Series, it returns the Series with only the non-null data and index values:

In [5]:
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

This is equivalent to:

In [6]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

Handling missing data in DataFrame objects can be a bit more complex. You may want to drop rows or columns that are all NA, or only those containing any NAs. By default, `dropna` drops any row containing a missing value:

In [7]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], 
                     [NA, NA, NA], [NA, 6.5, 3.]])

cleaned = data.dropna()
print("data:\n{}\n".format(data))
cleaned

data:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0



     0    1    2
0  1.0  6.5  3.0


Passing `how='all'` drops only the rows that are all NA:

In [8]:
data.dropna(how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0


To drop columns in the same way, pass `axis=1`:

In [9]:
data[4] = NA
print("data:\n{}\n".format(data))

data.dropna(axis=1, how='all')

data:
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN



     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


A related option, `thresh`, tends to be useful for time series data. Suppose you want to keep only rows containing at least a certain number of non-NA observations. You can indicate this with the `thresh` argument:

In [10]:
df = pd.DataFrame(np.random.randn(7, 3))

# introduce some NA values
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA

print("{}\n".format(df))

df.dropna(thresh=2)

          0         1         2
0 -0.296363       NaN       NaN
1 -1.793640       NaN       NaN
2  0.806509       NaN -1.766476
3  0.121775       NaN -0.328653
4 -0.270465  2.483854 -0.144814
5  0.332265  0.931763  0.806218
6  0.696322  1.096967  1.197714



          0         1         2
2  0.806509       NaN -1.766476
3  0.121775       NaN -0.328653
4 -0.270465  2.483854 -0.144814
5  0.332265  0.931763  0.806218
6  0.696322  1.096967  1.197714
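
To make the meaning of `thresh` concrete, the following sketch uses a small hand-made DataFrame (`frame` is a hypothetical name) and compares `dropna(thresh=2)` with the same filter written by hand, counting the non-NA values in each row:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1.0, np.nan, np.nan],
                      [2.0, np.nan, 5.0],
                      [3.0, 4.0, 6.0]])

# Keep rows that have at least two non-NA values
by_thresh = frame.dropna(thresh=2)

# The same filter written by hand: count the non-NA values per row
non_na_per_row = frame.notnull().sum(axis=1)
by_hand = frame[non_na_per_row >= 2]

print(by_thresh.equals(by_hand))
```

In other words, `thresh` is a minimum count of observed values, not a maximum count of missing ones.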


### Filling in Missing Data

Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the "holes" in any number of ways. For most purposes, the `fillna` method is the workhorse function to use. Calling `fillna` with a constant replaces missing values with that value:

In [11]:
df.fillna(0)

          0         1         2
0 -0.296363  0.000000  0.000000
1 -1.793640  0.000000  0.000000
2  0.806509  0.000000 -1.766476
3  0.121775  0.000000 -0.328653
4 -0.270465  2.483854 -0.144814
5  0.332265  0.931763  0.806218
6  0.696322  1.096967  1.197714


By calling `fillna` with a dict, you can use a different fill value for each column:

In [12]:
df.fillna({1: 0.5, 2: 0})

          0         1         2
0 -0.296363  0.500000  0.000000
1 -1.793640  0.500000  0.000000
2  0.806509  0.500000 -1.766476
3  0.121775  0.500000 -0.328653
4 -0.270465  2.483854 -0.144814
5  0.332265  0.931763  0.806218
6  0.696322  1.096967  1.197714


The `fillna` method returns a new object by default, but you can modify the existing object in-place by passing `inplace=True`:

In [13]:
_ = df.fillna(0, inplace=True)
df

          0         1         2
0 -0.296363  0.000000  0.000000
1 -1.793640  0.000000  0.000000
2  0.806509  0.000000 -1.766476
3  0.121775  0.000000 -0.328653
4 -0.270465  2.483854 -0.144814
5  0.332265  0.931763  0.806218
6  0.696322  1.096967  1.197714


The same interpolation methods available for reindexing can be used with `fillna`:

In [14]:
df = pd.DataFrame(np.random.randn(6,3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
print("{}\n".format(df))

df.fillna(method='ffill')

          0         1         2
0 -1.008460  0.011764  1.953649
1  0.017188 -1.617426 -0.670752
2 -1.508648       NaN -0.364697
3 -1.021564       NaN  0.518296
4 -0.414409       NaN       NaN
5  0.045485       NaN       NaN



          0         1         2
0 -1.008460  0.011764  1.953649
1  0.017188 -1.617426 -0.670752
2 -1.508648 -1.617426 -0.364697
3 -1.021564 -1.617426  0.518296
4 -0.414409 -1.617426  0.518296
5  0.045485 -1.617426  0.518296


In [15]:
df.fillna(method='ffill', limit=2)

          0         1         2
0 -1.008460  0.011764  1.953649
1  0.017188 -1.617426 -0.670752
2 -1.508648 -1.617426 -0.364697
3 -1.021564 -1.617426  0.518296
4 -0.414409       NaN  0.518296
5  0.045485       NaN  0.518296
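
In recent pandas releases the `method` argument of `fillna` is deprecated in favor of the dedicated `ffill` and `bfill` methods, so the forward fill above can also be written as in the sketch below. The small `frame` used here is made up for illustration:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan, np.nan, np.nan],
                      'b': [np.nan, 2.0, np.nan, 4.0]})

# Forward fill: propagate the last valid value down each column,
# filling at most two consecutive gaps
forward = frame.ffill(limit=2)

# Backward fill: pull the next valid value up each column
backward = frame.bfill()

print(forward)
print(backward)
```

Note that a forward fill cannot fill a gap at the start of a column, and a backward fill cannot fill one at the end, because there is no value to propagate from.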


You might also pass the mean or median value of a Series to `fillna`:

In [16]:
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
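
The same idea carries over to a DataFrame. As a small sketch with made-up data, passing `frame.mean()` fills each column with its own mean, because the resulting Series is indexed by column label:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                      'y': [np.nan, 10.0, 20.0]})

# frame.mean() is a Series indexed by column label, so each column
# is filled with its own mean rather than one global value
filled = frame.fillna(frame.mean())
print(filled)
```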

The table below summarises the function arguments for `fillna`:

**Argument** | **Description**
--- | ---
`value` | Scalar value or dict-like object to use to fill missing values
`method` | Interpolation method, such as `'ffill'` or `'bfill'`; default `None`, in which case `value` must be given
`axis` | Axis to fill on; default `axis=0`
`inplace` | Modify the calling object without producing a copy
`limit` | For forward and backward filling, maximum number of consecutive periods to fill

## Data Transformation

Filtering, cleaning, and other transformations are another class of important operations when preparing data for analysis.

### Removing Duplicates

Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:

In [17]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4


The DataFrame method `duplicated` returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not:

In [18]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

Relatedly, `drop_duplicates` returns a DataFrame containing only the rows for which the `duplicated` array is `False`:

In [19]:
data.drop_duplicates()

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
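
The relationship between the two methods can be checked directly. In this sketch, built on a small made-up DataFrame, indexing with the negated `duplicated` mask gives the same rows as `drop_duplicates`:

```python
import pandas as pd

frame = pd.DataFrame({'k1': ['one', 'two', 'two'],
                      'k2': [1, 4, 4]})

# Rows flagged True repeat an earlier row
mask = frame.duplicated()

# Keeping the rows where the mask is False matches drop_duplicates
by_mask = frame[~mask]
print(by_mask.equals(frame.drop_duplicates()))
```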


Both methods consider all of the columns by default. Alternatively, you can specify any subset of the columns to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates based only on the 'k1' column:

In [20]:
data['v1'] = range(7)
data.drop_duplicates(['k1'])

    k1  k2  v1
0  one   1   0
1  two   1   1


`duplicated` and `drop_duplicates` by default keep the first observed value combination. Passing `keep='last'` will return the last one:

In [21]:
data.drop_duplicates(['k1','k2'], keep='last')

    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6


### Transforming Data Using a Function or Mapping

For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following hypothetical data collected about various kinds of meat:

In [22]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 
                              'Pastrami', 'corned beef', 'Bacon', 
                              'pastrami', 'honey ham', 'nova lox'],
                    'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0


Suppose we want to add a column indicating the type of animal that each food came from. We can achieve this using the `map` method, a convenient way to perform element-wise transformations and other data cleaning-related operations. The `map` method on a Series accepts a function or dict-like object containing a mapping.

But first, we'll need a mapping of each distinct meat type to the kind of animal. Because some of the meats are capitalized and others are not, we also need to convert each value in the `food` column to lowercase using the `str.lower` method before applying the `map` method:

In [23]:
# initialise dict for mapping later on
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

# convert all food labels to lowercase before applying mapping
lowercased = data['food'].str.lower()

# append 'animal' column based on mapping earlier
data['animal'] = lowercased.map(meat_to_animal)
data

          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon


We could also pass a function that does all the work, here written as a `lambda`:

In [24]:
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
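
One difference between the two approaches is worth noting. With a dict, values that have no matching key become `NaN`, whereas a function that indexes the dict directly, as the `lambda` above does, raises a `KeyError` for them. The sketch below uses a shortened mapping and a made-up `'tofu'` entry to illustrate this, with `dict.get` as one possible way to keep the function form tolerant of unknown values:

```python
import pandas as pd

# Shortened mapping for illustration
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}

# 'tofu' is a made-up entry with no key in the mapping
foods = pd.Series(['Bacon', 'tofu', 'pastrami'])

# Mapping with a dict leaves unmatched values as NaN
via_dict = foods.str.lower().map(meat_to_animal)

# meat_to_animal[x.lower()] would raise KeyError for 'tofu';
# dict.get returns None instead, which pandas also treats as missing
via_get = foods.map(lambda x: meat_to_animal.get(x.lower()))

print(via_dict)
print(via_get.isnull())
```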

### Replacing Values

Filling in missing data with the `fillna` method is a special case of more general value replacement. As you've already seen, `map` can be used to modify a subset of values in an object, but the `replace` method provides a simpler and more flexible way to do so. Let's consider this Series, in which `-999` and `-1000` might be sentinel values for missing data:

In [25]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [26]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [27]:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

To use a different replacement for each value, we can pass a list of substitutes:

In [28]:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The argument passed can also be a dict:

In [29]:
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
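
To illustrate the earlier remark that `fillna` is a special case of value replacement, this short sketch (with made-up values) treats `NaN` itself as the value to replace and compares the result with `fillna`:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, -999.0])

# Treating NaN as the value to replace gives the same result as fillna
via_replace = values.replace(np.nan, 0)
via_fillna = values.fillna(0)

print(via_replace.equals(via_fillna))
```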

### Renaming Axis Indexes

Like values in a Series, axis labels can be transformed by a function or mapping of some form to produce new, differently labeled objects. You can also modify the axes in-place without creating a new data structure. Here's a simple example:

In [30]:
data = pd.DataFrame(np.arange(12).reshape((3,4)),
                   index=['Ohio', 'Colorado', 'New York'],
                   columns=['one', 'two', 'three', 'four'])

transform = lambda x: x[:4].upper()

data.index = data.index.map(transform)
data

      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11


You can also use the `rename` method if you want to create a transformed version of a dataset without modifying the original:

In [31]:
data.rename(index=str.title, columns=str.upper)

      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11


Notably, `rename` can also be used in conjunction with a dict-like object providing new values for a subset of the axis labels:

In [32]:
data.rename(index={'OHIO': 'INDIANA'},
           columns={'three': 'peekaboo'})

         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11


Should you wish to modify a dataset in-place, pass `inplace=True`:

In [33]:
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data

         one  two  three  four
INDIANA    0    1      2     3
COLO       4    5      6     7
NEW        8    9     10    11


### Discretisation and Binning

Continuous data is often discretised or otherwise separated into "bins" for analysis. Suppose we have data about a group of people in a study, and we want to group them into discrete age buckets. We can divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To do this, we can use the `cut` function in pandas:

In [34]:
ages = [20, 22, 25, 27, 21, 23, 37, 61, 45, 41, 32]
bins = [17, 25, 35, 60, 100]

cats = pd.cut(ages, bins)
cats

[(17, 25], (17, 25], (17, 25], (25, 35], (17, 25], ..., (35, 60], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 11
Categories (4, interval[int64]): [(17, 25] < (25, 35] < (35, 60] < (60, 100]]

The object returned is a special `Categorical` object. The output you see describes the bins computed by `pandas.cut`. 

You can treat it like an array of strings indicating the bin name; internally it contains a `categories` array specifying the distinct category names along with a labelling for the `ages` data in the `codes` attribute:

In [35]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 3, 2, 2, 1], dtype=int8)

In [36]:
cats.categories

IntervalIndex([(17, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')

In [37]:
# doing a value count for cats array
pd.value_counts(cats)

(17, 25]     5
(35, 60]     3
(25, 35]     2
(60, 100]    1
dtype: int64

Note the use of parentheses and square brackets. Consistent with mathematical notation for intervals, a parenthesis means that the side is *open* (exclusive), while a square bracket means it is *closed* (inclusive). You can change which side is closed by passing `right=False`:

In [38]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [36, 61), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 11
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
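
The effect of `right` is easiest to see for a value that sits exactly on a bin edge. In this sketch, which uses a shortened, illustrative list of edges, the age 25 lands in the first bin by default and in the second bin with `right=False`:

```python
import pandas as pd

edges = [17, 25, 35]

# Default right=True: 25 belongs to the interval (17, 25]
right_closed = pd.cut([25], edges)

# right=False: 25 belongs to the interval [25, 35)
left_closed = pd.cut([25], edges, right=False)

print(right_closed.codes, left_closed.codes)
```

This is why the example above shifts every edge up by one when switching to `right=False`: doing so keeps each integer age in the same bucket as before.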

Alternatively, you can supply your own bin names by passing a list or array to the `labels` option:

In [39]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)

[Youth, Youth, Youth, YoungAdult, Youth, ..., MiddleAged, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 11
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]

Instead of passing in explicit bin edges, you can also pass in an integer number of bins to cut. By doing so, `pandas` will compute equal-length bins based on the minimum and maximum values in the data. Consider the case of some uniformly distributed data chopped into fourths:

In [40]:
data = np.random.rand(20)
pd.cut(data, 4, precision=2)

[(0.1, 0.28], (0.28, 0.46], (0.64, 0.81], (0.1, 0.28], (0.46, 0.64], ..., (0.28, 0.46], (0.46, 0.64], (0.64, 0.81], (0.1, 0.28], (0.28, 0.46]]
Length: 20
Categories (4, interval[float64]): [(0.1, 0.28] < (0.28, 0.46] < (0.46, 0.64] < (0.64, 0.81]]

The `precision=2` option limits the decimal precision to two digits.
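
To see which edges pandas actually computes when given a number of bins, `cut` also accepts `retbins=True`, which returns the edges alongside the categories. A minimal sketch with made-up, evenly spaced values:

```python
import numpy as np
import pandas as pd

values = np.arange(9.0)  # 0.0, 1.0, ..., 8.0

# retbins=True also returns the bin edges that pandas computed
cats, edges = pd.cut(values, 4, retbins=True)

# The interior and upper edges sit at equal steps between the minimum
# and maximum; the lowest edge is nudged slightly below the minimum so
# that the smallest value still falls inside the first (left-open) bin
print(edges)
```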

A closely related function, `qcut`, bins the data based on sample quantiles. Depending on the distribution of the data, using `cut` will not usually result in each bin having the same number of data points. Since `qcut` uses sample quantiles instead, by definition, you will obtain roughly equal-sized bins:

In [41]:
data = np.random.randn(1000)  # normally distributed

cats = pd.cut(data, 4)  # cut based on value intervals derived from maximum and minimum values
print(cats)
pd.value_counts(cats)   # this will not result in equal-sized bins

[(1.398, 3.083], (-0.287, 1.398], (-0.287, 1.398], (-0.287, 1.398], (-0.287, 1.398], ..., (-1.972, -0.287], (-0.287, 1.398], (-0.287, 1.398], (-0.287, 1.398], (-1.972, -0.287]]
Length: 1000
Categories (4, interval[float64]): [(-3.664, -1.972] < (-1.972, -0.287] < (-0.287, 1.398] < (1.398, 3.083]]


(-0.287, 1.398]     515
(-1.972, -0.287]    374
(1.398, 3.083]       75
(-3.664, -1.972]     36
dtype: int64

In [42]:
cats = pd.qcut(data, 4, precision=2)  # cut data distribution into quartiles
print(cats)
pd.value_counts(cats)

[(0.67, 3.08], (-0.69, -0.026], (0.67, 3.08], (-0.026, 0.67], (0.67, 3.08], ..., (-3.67, -0.69], (-0.026, 0.67], (-0.026, 0.67], (-0.69, -0.026], (-0.69, -0.026]]
Length: 1000
Categories (4, interval[float64]): [(-3.67, -0.69] < (-0.69, -0.026] < (-0.026, 0.67] < (0.67, 3.08]]


(0.67, 3.08]       250
(-0.026, 0.67]     250
(-0.69, -0.026]    250
(-3.67, -0.69]     250
dtype: int64
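
The contrast is easier to verify on deterministic data. The following sketch uses the squares of 0 through 99 as an illustrative right-skewed sample and counts the observations in each bin from the integer `codes`:

```python
import numpy as np
import pandas as pd

# Deterministic, right-skewed data: the squares 0, 1, 4, ..., 9801
values = np.arange(100) ** 2

by_width = pd.cut(values, 4)      # four equal-width bins
by_quantile = pd.qcut(values, 4)  # four bins split at the quartiles

# Count the observations per bin, in bin order, from the integer codes
width_counts = np.bincount(by_width.codes, minlength=4)
quantile_counts = np.bincount(by_quantile.codes, minlength=4)

print(width_counts)
print(quantile_counts)
```

Here the equal-width bins hold 50, 21, 15, and 14 values, while each quantile bin holds 25.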

Similar to `cut`, you can pass your own quantiles (numbers between 0 and 1, inclusive):

In [43]:
cats = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1])
print(cats)
pd.value_counts(cats)

[(1.266, 3.083], (-1.359, -0.0255], (-0.0255, 1.266], (-0.0255, 1.266], (-0.0255, 1.266], ..., (-1.359, -0.0255], (-0.0255, 1.266], (-0.0255, 1.266], (-1.359, -0.0255], (-1.359, -0.0255]]
Length: 1000
Categories (4, interval[float64]): [(-3.658, -1.359] < (-1.359, -0.0255] < (-0.0255, 1.266] < (1.266, 3.083]]


(-0.0255, 1.266]     400
(-1.359, -0.0255]    400
(1.266, 3.083]       100
(-3.658, -1.359]     100
dtype: int64

### Detecting and Filtering Outliers

