# Data Cleaning and Preparation

During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst's time. Fortunately, pandas, along with built-in Python language features, provides you with a high-level, flexible, and fast set of tools to enable you to manipulate data into the right form.

## Handling Missing Data

Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. The way that missing data is represented in pandas objects is somewhat imperfect, but it is functional for a lot of users. For numeric data, pandas uses the floating-point value `NaN` (Not a Number) to represent missing data:

In [1]:
import pandas as pd
import numpy as np

In [2]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

Detecting null values:

In [3]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In pandas, we refer to missing data as NA, which stands for *not available*. In statistics applications, NA data may either be data that does not exist or data that exists but was not observed (through problems with data collection, for example). When cleaning up data for analysis, it is often important to analyze the missing data itself to identify data collection problems or potential biases caused by the missing values.
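
One way to start that kind of analysis is to count how much is missing in each column. The sketch below assumes a small, made-up `survey` DataFrame (its column names and values are hypothetical and not part of this chapter's data) and uses only `isnull`, `sum`, `mean`, and `any`:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with gaps in every column
survey = pd.DataFrame({
    'age': [34, np.nan, 29, np.nan],
    'income': [52000.0, 61000.0, np.nan, 48000.0],
    'city': ['Oslo', None, 'Lima', 'Pune'],
})

# Number of missing values per column
missing_counts = survey.isnull().sum()

# Share of missing values per column (the mean of a boolean mask)
missing_share = survey.isnull().mean()

# Rows that have at least one missing value
rows_with_gaps = survey[survey.isnull().any(axis=1)]

print(missing_counts)
print(missing_share)
print(rows_with_gaps)
```

A column with an unusually high share of missing values is often a hint that something went wrong upstream, which is worth checking before dropping or filling anything.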

The built-in Python `None` value is also treated as NA in object arrays:

In [4]:
string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

The table below lists some methods related to missing data handling:

**Method** | **Description**
--- | ---
`dropna` | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
`fillna` | Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
`isnull` | Return boolean values indicating which values are missing/NA.
`notnull` | Negation of `isnull`.
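
As a quick illustration of the four methods in the table, here is a minimal sketch on a small Series. The values are made up for the example:

```python
import numpy as np
import pandas as pd

# Both np.nan and None end up as NaN in a float Series
s = pd.Series([1.0, np.nan, 3.0, None])

missing_mask = s.isnull()    # True where the value is missing
present_mask = s.notnull()   # the negation of isnull
dropped = s.dropna()         # keep only the non-missing values
filled = s.fillna(0.0)       # replace missing values with a constant

print(missing_mask.tolist())  # [False, True, False, True]
print(dropped.tolist())       # [1.0, 3.0]
print(filled.tolist())        # [1.0, 0.0, 3.0, 0.0]
```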

### Filtering Out Missing Data

There are a few ways to filter out missing data. While you always have the option to do it by hand using `pandas.isnull` and boolean indexing, the `dropna` method can be helpful. On a Series, it returns the Series with only the non-null data and index values:

In [5]:
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

This is equivalent to:

In [6]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

Handling missing data in DataFrame objects can be a bit more complex. You may want to drop rows or columns that are all NA, or only those containing any NAs. By default, `dropna` drops any row containing a missing value:

In [7]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], 
                     [NA, NA, NA], [NA, 6.5, 3.]])

cleaned = data.dropna()
print("data:\n{}\n".format(data))
cleaned

data:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0



     0    1    2
0  1.0  6.5  3.0


Passing `how='all'` drops only the rows that are all NA:

In [8]:
data.dropna(how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0


To drop columns in the same way, pass `axis=1`:

In [9]:
data[4] = NA
print("data:\n{}\n".format(data))

data.dropna(axis=1, how='all')

data:
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN



     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


Another option is `thresh`, which is often useful when dealing with time series data. Suppose you want to keep only rows containing at least a certain number of non-NA observations. You can indicate this with the `thresh` argument:

In [10]:
df = pd.DataFrame(np.random.randn(7, 3))

# introduce NA values into columns 1 and 2
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA

print("{}\n".format(df))

df.dropna(thresh=2)

          0         1         2
0  0.530999       NaN       NaN
1  1.241754       NaN       NaN
2  0.374597       NaN  0.472291
3 -0.672533       NaN  0.596092
4  0.103727  2.486859  0.072569
5 -1.019245 -0.284886  1.233716
6 -1.128860  0.392091 -1.499893



          0         1         2
2  0.374597       NaN  0.472291
3 -0.672533       NaN  0.596092
4  0.103727  2.486859  0.072569
5 -1.019245 -0.284886  1.233716
6 -1.128860  0.392091 -1.499893
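
Because the example above uses random numbers, the effect of each option can be easier to see on fixed data. The following sketch uses a hypothetical `readings` DataFrame (not part of the original example) to compare the default behavior, `how='all'`, and `thresh=2` side by side:

```python
import numpy as np
import pandas as pd

# Non-null counts per row are 2, 0, 1, and 3
readings = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [np.nan, np.nan, 3.0, 5.0],
    'c': [2.0, np.nan, np.nan, 6.0],
})

any_dropped = readings.dropna()            # drop rows with any NA
all_dropped = readings.dropna(how='all')   # drop rows that are entirely NA
thresh_kept = readings.dropna(thresh=2)    # keep rows with at least 2 non-NA values

print(any_dropped.index.tolist())   # [3]
print(all_dropped.index.tolist())   # [0, 2, 3]
print(thresh_kept.index.tolist())   # [0, 3]
```

Note that `thresh` counts the non-NA values a row must have in order to be kept, not the number of NAs it is allowed to contain.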


### Filling in Missing Data

Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the "holes" in any number of ways. For most purposes, the `fillna` method is the workhorse function to use. Calling `fillna` with a constant replaces missing values with that value:

In [11]:
df.fillna(0)

          0         1         2
0  0.530999  0.000000  0.000000
1  1.241754  0.000000  0.000000
2  0.374597  0.000000  0.472291
3 -0.672533  0.000000  0.596092
4  0.103727  2.486859  0.072569
5 -1.019245 -0.284886  1.233716
6 -1.128860  0.392091 -1.499893


Calling `fillna` with a dict, you can use a different fill value for each column:

In [12]:
df.fillna({1: 0.5, 2: 0})

          0         1         2
0  0.530999  0.500000  0.000000
1  1.241754  0.500000  0.000000
2  0.374597  0.500000  0.472291
3 -0.672533  0.500000  0.596092
4  0.103727  2.486859  0.072569
5 -1.019245 -0.284886  1.233716
6 -1.128860  0.392091 -1.499893
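
One detail worth knowing about the dict form is that columns not named in the dict are left untouched. A minimal sketch, assuming a made-up `orders` DataFrame with hypothetical column names:

```python
import numpy as np
import pandas as pd

# Hypothetical data: every column has at least one missing value
orders = pd.DataFrame({
    'price': [9.5, np.nan, 7.0],
    'qty': [np.nan, 2.0, np.nan],
    'note': [np.nan, np.nan, 1.0],
})

# Fill 'price' and 'qty' with different values; 'note' is not in the dict
filled = orders.fillna({'price': 0.0, 'qty': 1.0})

print(filled)
print(filled['note'].isnull().sum())  # 2, the 'note' column still has its NAs
```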


The `fillna` method returns a new object, but you can modify the existing object in place by passing `inplace=True`:

In [13]:
_ = df.fillna(0, inplace=True)
df

          0         1         2
0  0.530999  0.000000  0.000000
1  1.241754  0.000000  0.000000
2  0.374597  0.000000  0.472291
3 -0.672533  0.000000  0.596092
4  0.103727  2.486859  0.072569
5 -1.019245 -0.284886  1.233716
6 -1.128860  0.392091 -1.499893


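The same interpolation methods available for reindexing can be used with `fillna`. Note that pandas 2.1 and later deprecate the `method` argument of `fillna` in favor of the dedicated `ffill` and `bfill` methods, so on newer versions `df.ffill()` is the preferred spelling of the call below: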

In [14]:
df = pd.DataFrame(np.random.randn(6,3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
print("{}\n".format(df))

df.fillna(method='ffill')

          0         1         2
0 -0.743189 -1.425826  0.644702
1  0.418718  0.140889 -0.512678
2 -0.371659       NaN -0.782360
3  0.222653       NaN -1.164093
4  0.202138       NaN       NaN
5  0.563572       NaN       NaN



          0         1         2
0 -0.743189 -1.425826  0.644702
1  0.418718  0.140889 -0.512678
2 -0.371659  0.140889 -0.782360
3  0.222653  0.140889 -1.164093
4  0.202138  0.140889 -1.164093
5  0.563572  0.140889 -1.164093


In [15]:
df.fillna(method='ffill', limit=2)

          0         1         2
0 -0.743189 -1.425826  0.644702
1  0.418718  0.140889 -0.512678
2 -0.371659  0.140889 -0.782360
3  0.222653  0.140889 -1.164093
4  0.202138       NaN -1.164093
5  0.563572       NaN -1.164093
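
The effect of `limit` is easiest to follow on a single Series with a known gap. The sketch below uses the `ffill` and `bfill` methods directly, which perform the same forward and backward filling as above. The values are made up for illustration:

```python
import numpy as np
import pandas as pd

# A gap of three consecutive missing values between 1.0 and 5.0
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill()                  # carry 1.0 forward through the whole gap
forward_limited = s.ffill(limit=2)   # fill at most 2 consecutive NAs
backward_limited = s.bfill(limit=1)  # pull 5.0 backward by at most 1 position

print(forward.tolist())                            # [1.0, 1.0, 1.0, 1.0, 5.0]
print(forward_limited.isnull().tolist())           # [False, False, False, True, False]
print(backward_limited.isnull().tolist())          # [False, True, True, False, False]
```

With a limit, the part of the gap that is farther than `limit` positions from the last valid value stays NA.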


You might also pass the mean or median value of a Series to `fillna`:

In [16]:
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
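
The same idea extends to a DataFrame: passing a Series of per-column statistics fills each column with its own value, because the Series index is matched against the column labels. A minimal sketch, assuming a hypothetical `scores` DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value per column
scores = pd.DataFrame({
    'math': [50.0, np.nan, 90.0, 100.0],
    'art': [np.nan, 70.0, 75.0, 95.0],
})

# mean() and median() skip NA values by default
mean_filled = scores.fillna(scores.mean())
median_filled = scores.fillna(scores.median())

print(mean_filled)
print(median_filled)
```

Here the `math` column has mean 80.0 and median 90.0, while `art` has mean 80.0 and median 75.0, so the two strategies produce different fills. The median is less sensitive to outliers, which can matter for skewed data.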

The table below summarises the function arguments for `fillna`:

**Argument** | **Description**
--- | ---
`value` | Scalar value or dict-like object to use to fill missing values
`method` | Fill method, `'ffill'` or `'bfill'`; defaults to `None`, so it applies only when passed explicitly
`axis` | Axis to fill on; default `axis=0`
`inplace` | Modify the calling object without producing a copy
`limit` | For forward and backward filling, maximum number of consecutive periods to fill

## Data Transformation

Filtering, cleaning, and other transformations are another important class of operations when preparing data for analysis.

### Removing Duplicates

Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:

In [17]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4


The DataFrame method `duplicated` returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not:

In [18]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

Relatedly, `drop_duplicates` returns a DataFrame containing only the rows where the `duplicated` array is `False`:

In [19]:
data.drop_duplicates()

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4


Alternatively, you can specify any subset of the data to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates only based on the 'k1' column:

In [20]:
data['v1'] = range(7)
data.drop_duplicates(['k1'])

    k1  k2  v1
0  one   1   0
1  two   1   1


`duplicated` and `drop_duplicates` by default keep the first observed value combination. Passing `keep='last'` will return the last one:

In [21]:
data.drop_duplicates(['k1','k2'], keep='last')

    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
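
To compare the `keep` options directly, the sketch below runs them on a small, made-up `events` DataFrame. It also shows `keep=False`, an option not covered above that marks every member of a duplicated group instead of sparing one:

```python
import pandas as pd

# Hypothetical data: ('one', 1) and ('two', 1) each appear twice
events = pd.DataFrame({
    'k1': ['one', 'two', 'one', 'two', 'three'],
    'k2': [1, 1, 1, 1, 2],
    'v1': [10, 20, 30, 40, 50],
})

first_kept = events.drop_duplicates(['k1', 'k2'])               # keep='first' is the default
last_kept = events.drop_duplicates(['k1', 'k2'], keep='last')
none_kept = events.drop_duplicates(['k1', 'k2'], keep=False)    # drop all duplicated rows

print(first_kept.index.tolist())  # [0, 1, 4]
print(last_kept.index.tolist())   # [2, 3, 4]
print(none_kept.index.tolist())   # [4]
```

Which row survives matters whenever the other columns differ, as `v1` does here: `keep='first'` retains the values 10 and 20, while `keep='last'` retains 30 and 40.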


### Transforming Data Using a Function or Mapping

For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following hypothetical data collected about various kinds of meat:

In [22]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 
                              'Pastrami', 'corned beef', 'Bacon', 
                              'pastrami', 'honey ham', 'nova lox'],
                    'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0


Suppose we want to add a column indicating the type of animal that each food came from. We can achieve this using the `map` method, a convenient way to perform element-wise transformations and other data cleaning-related operations. The `map` method on a Series accepts a function or dict-like object containing a mapping.

But first, we'll need a mapping of each distinct meat type to the kind of animal. Because some of the meats are capitalized and others are not, we also need to convert each value in the `food` column to lowercase using the `str.lower` method before applying the `map` method:

In [23]:
# initialise dict for mapping later on
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

# convert all food labels to lowercase before applying mapping
lowercased = data['food'].str.lower()

# append 'animal' column based on mapping earlier
data['animal']  = lowercased.map(meat_to_animal)
data

          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon


We could also pass a single `lambda` function that does all the work:

In [24]:
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
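
The two approaches behave differently when a value has no entry in the mapping. The sketch below uses a shortened, hypothetical version of the mapping plus a food (`'tofu'`) that is deliberately missing from it: mapping with a dict yields NA for the unknown value, indexing the dict inside a `lambda` raises `KeyError`, and `dict.get` lets you supply a default (the `'unknown'` label is an assumption for this example):

```python
import pandas as pd

foods = pd.Series(['Bacon', 'pastrami', 'tofu'])
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}  # no entry for 'tofu'

# Dict mapping: values without a match become NA
by_dict = foods.str.lower().map(meat_to_animal)

# Function mapping with dict.get: unmatched values get an explicit default
by_get = foods.map(lambda x: meat_to_animal.get(x.lower(), 'unknown'))

# Function mapping with plain indexing: unmatched values raise KeyError
try:
    foods.map(lambda x: meat_to_animal[x.lower()])
    raised = False
except KeyError:
    raised = True

print(by_dict.isnull().tolist())  # [False, False, True]
print(by_get.tolist())            # ['pig', 'cow', 'unknown']
print(raised)                     # True
```

Passing the dict directly is usually the safer choice during cleaning, since the resulting NA values can then be inspected with `isnull` rather than stopping the whole operation.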