# Data Cleaning and Preparation

During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst's time. Fortunately, pandas, along with built-in Python language features, provides a high-level, flexible, and fast set of tools for manipulating data into the right form.

## Handling Missing Data

Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. The way that missing data is represented in pandas objects is somewhat imperfect, but it is functional for a lot of users. For numeric data, pandas uses the floating-point value `NaN` (Not a Number) to represent missing data:

In [1]:
import pandas as pd
import numpy as np

In [2]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

Detecting null values:

In [3]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In pandas, we refer to missing data as NA, which stands for *not available*. In statistics applications, NA data may either be data that does not exist or data that exists but was not observed (through problems with data collection, for example). When cleaning up data for analysis, it is often important to analyze the missing data itself to identify data collection problems or potential biases caused by the missing values.

The built-in Python `None` value is also treated as NA in object arrays:

In [4]:
string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
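
As a minimal sketch of analyzing the missing data itself, the example below counts the missing values in each column and computes the share they represent. The `survey` DataFrame and its column names are hypothetical, invented only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data, invented for illustration only
survey = pd.DataFrame({
    "age": [34, np.nan, 29, 41],
    "income": [np.nan, np.nan, 52000.0, 61000.0],
    "city": ["Oslo", "Lima", None, "Pune"],
})

# isnull() gives a boolean frame; summing counts the True values per column
missing_counts = survey.isnull().sum()

# The mean of a boolean column is the fraction of True values
missing_share = survey.isnull().mean()

print(missing_counts)
print(missing_share)
```

A column with an unusually high share of missing values is often a good first place to look for a data collection problem.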

See the table below for a list of some methods related to missing data handling:

**Method** | **Description**
--- | ---
`dropna` | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
`fillna` | Fill in missing data with some value or using an interpolation method such as `'ffill'` or `'bfill'`.
`isnull` | Return boolean values indicating which values are missing/NA.
`notnull` | Negation of `isnull`.

### Filtering Out Missing Data

There are a few ways to filter out missing data. While you always have the option to do it by hand using `pandas.isnull` and boolean indexing, the `dropna` method can be helpful. On a Series, it returns the Series with only the non-null data and index values:

In [5]:
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

This is equivalent to:

In [6]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

Handling missing data in DataFrame objects can be a bit more complex. You may want to drop rows or columns that are all NA, or only those containing any NAs. By default, `dropna` drops any row containing a missing value:

In [7]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], 
                     [NA, NA, NA], [NA, 6.5, 3.]])

cleaned = data.dropna()
print("data:\n{}\n".format(data))
cleaned

data:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0



     0    1    2
0  1.0  6.5  3.0


Passing `how='all'` will drop only rows that are all NA:

In [8]:
data.dropna(how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0


To drop columns in the same way, pass `axis=1`:

In [9]:
data[4] = NA
print("data:\n{}\n".format(data))

data.dropna(axis=1, how='all')

data:
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN



     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
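
To compare the combinations of `how` and `axis` side by side, here is a small sketch that rebuilds the same frame (named `frame` here, a name chosen for this example) and reports the shape of each result. Note how the all-NA column causes the default `dropna()` to discard every row.

```python
import numpy as np
import pandas as pd

NA = np.nan
frame = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                      [NA, NA, NA], [NA, 6.5, 3.]])
frame[4] = NA  # add a column that is entirely NA

rows_any = frame.dropna()                   # drop rows containing any NA
rows_all = frame.dropna(how="all")          # drop rows that are all NA
cols_any = frame.dropna(axis=1)             # drop columns containing any NA
cols_all = frame.dropna(axis=1, how="all")  # drop columns that are all NA

for name, result in [("rows_any", rows_any), ("rows_all", rows_all),
                     ("cols_any", cols_any), ("cols_all", cols_all)]:
    print(name, result.shape)
```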


Another option, `thresh`, is particularly handy for time series data. Suppose you want to keep only rows containing at least a certain number of non-NA observations. You can indicate this with the `thresh` argument:

In [20]:
df = pd.DataFrame(np.random.randn(7, 3))

# introduce some NA values
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA

print("{}\n".format(df))

df.dropna(thresh=2)

          0         1         2
0  2.163935       NaN       NaN
1  0.636145       NaN       NaN
2 -0.692907       NaN -1.001913
3 -0.381253       NaN -2.138760
4  1.324190 -1.855107  1.008716
5  0.817317  1.613417 -0.175767
6 -0.110647  0.272229 -1.198221



          0         1         2
2 -0.692907       NaN -1.001913
3 -0.381253       NaN -2.138760
4  1.324190 -1.855107  1.008716
5  0.817317  1.613417 -0.175767
6 -0.110647  0.272229 -1.198221
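
The effect of `thresh` can be checked by hand: `thresh=2` keeps the rows that have at least two non-NA values. The sketch below uses a small deterministic frame (`obs`, a name chosen for this example) and compares `dropna(thresh=2)` with an explicit count of non-NA values per row.

```python
import numpy as np
import pandas as pd

NA = np.nan
obs = pd.DataFrame([[1.0, NA, NA],
                    [2.0, NA, 5.0],
                    [3.0, 4.0, 6.0],
                    [NA, NA, NA]])

# Count the non-NA values in each row
non_na_per_row = obs.notnull().sum(axis=1)

# Keep rows with at least two non-NA values
kept = obs.dropna(thresh=2)

# The same filter written with boolean indexing
by_hand = obs[non_na_per_row >= 2]

print(kept)
```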


### Filling in Missing Data

Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the "holes" in any number of ways. For most purposes, the `fillna` method is the workhorse function to use. Calling `fillna` with a constant replaces missing values with that value:

In [21]:
df.fillna(0)

          0         1         2
0  2.163935  0.000000  0.000000
1  0.636145  0.000000  0.000000
2 -0.692907  0.000000 -1.001913
3 -0.381253  0.000000 -2.138760
4  1.324190 -1.855107  1.008716
5  0.817317  1.613417 -0.175767
6 -0.110647  0.272229 -1.198221


Calling `fillna` with a dict lets you use a different fill value for each column:

In [22]:
df.fillna({1: 0.5, 2: 0})

          0         1         2
0  2.163935  0.500000  0.000000
1  0.636145  0.500000  0.000000
2 -0.692907  0.500000 -1.001913
3 -0.381253  0.500000 -2.138760
4  1.324190 -1.855107  1.008716
5  0.817317  1.613417 -0.175767
6 -0.110647  0.272229 -1.198221


The `fillna` method returns a new object by default, but you can modify the existing object in place by passing `inplace=True`:

In [23]:
_ = df.fillna(0, inplace=True)
df

          0         1         2
0  2.163935  0.000000  0.000000
1  0.636145  0.000000  0.000000
2 -0.692907  0.000000 -1.001913
3 -0.381253  0.000000 -2.138760
4  1.324190 -1.855107  1.008716
5  0.817317  1.613417 -0.175767
6 -0.110647  0.272229 -1.198221


The same interpolation methods available for reindexing, such as forward fill, can be used with `fillna`, optionally with `limit` to cap the number of consecutive values filled:

In [28]:
df = pd.DataFrame(np.random.randn(6,3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
print("{}\n".format(df))

df.fillna(method='ffill')

          0         1         2
0  1.177772 -1.607769 -0.484385
1  0.330288 -0.573075 -1.065586
2 -0.025674       NaN  0.654641
3  0.209889       NaN  1.290323
4 -0.456622       NaN       NaN
5 -0.990816       NaN       NaN



          0         1         2
0  1.177772 -1.607769 -0.484385
1  0.330288 -0.573075 -1.065586
2 -0.025674 -0.573075  0.654641
3  0.209889 -0.573075  1.290323
4 -0.456622 -0.573075  1.290323
5 -0.990816 -0.573075  1.290323


In [29]:
df.fillna(method='ffill', limit=2)

          0         1         2
0  1.177772 -1.607769 -0.484385
1  0.330288 -0.573075 -1.065586
2 -0.025674 -0.573075  0.654641
3  0.209889 -0.573075  1.290323
4 -0.456622       NaN  1.290323
5 -0.990816       NaN  1.290323
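
Forward and backward filling are also available as the dedicated `ffill` and `bfill` methods. Recent pandas releases favor these over `fillna(method=...)`, and newer versions may warn about the `method` keyword, so the sketch below may be the safer spelling there. The `gaps` frame is a small deterministic example made up for this illustration.

```python
import numpy as np
import pandas as pd

NA = np.nan
gaps = pd.DataFrame({"a": [1.0, NA, NA, NA, 5.0],
                     "b": [NA, 2.0, NA, 4.0, NA]})

# Propagate the last valid value forward; a leading NA stays NA
forward = gaps.ffill()

# Fill at most two consecutive NA values per gap
forward_limited = gaps.ffill(limit=2)

# Propagate the next valid value backward; a trailing NA stays NA
backward = gaps.bfill()

print(forward_limited)
```

With `limit=2`, the third missing value in column `a` is left as NA because only two consecutive values are filled.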


You might also pass the mean or median value of a Series to `fillna`:

In [30]:
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
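
The same idea extends to a DataFrame. As a sketch, passing the Series returned by `mean()` or `median()` fills each column with its own statistic, since `fillna` matches the fill values to columns by label. The `scores` frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

NA = np.nan
scores = pd.DataFrame({"x": [1.0, NA, 3.0, 8.0],
                       "y": [10.0, 20.0, NA, 30.0]})

# Column-wise statistics; NA values are skipped by default
col_means = scores.mean()      # x: 4.0, y: 20.0
col_medians = scores.median()  # x: 3.0, y: 20.0

# Each column is filled with its own mean or median
mean_filled = scores.fillna(col_means)
median_filled = scores.fillna(col_medians)

print(mean_filled)
print(median_filled)
```

The median is less sensitive to outliers than the mean, which is why the two fills differ for column `x` here.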

The table below summarizes the arguments of `fillna`:

**Argument** | **Description**
--- | ---
`value` | Scalar value or dict-like object to use to fill missing values
`method` | Interpolation method to use instead of a fill value, such as `'ffill'` or `'bfill'`
`axis` | Axis to fill on; default `axis=0`
`inplace` | Modify the calling object without producing a copy
`limit` | For forward and backward filling, maximum number of consecutive periods to fill

## Data Transformation

Filtering, cleaning, and other transformations are another class of important operations when preparing data for analysis: