# CHAPTER 7
# Data Cleaning and Preparation
- Data preparation (loading, cleaning, transforming, rearranging) is often reported to take up 80% or more of an analyst's time.
- **pandas**, along with built-in Python features, provides a high-level, flexible, and fast set of tools for manipulating data into the right form.

## Handling Missing Data
- Missing data occurs in many data analysis applications.
- **pandas** tries to make working with missing data as painless as possible.
- For example, all of the descriptive statistics on **pandas objects** exclude missing data by default.
- For numeric data, pandas uses the floating-point value **NaN (Not a Number)** to represent missing data - this is called a *sentinel value*.
- More generally, pandas refers to missing values of any type as **NA (not available)**; in object arrays, Python's None is also treated as NA.
- In statistics applications, **NA data** may either be data that does not exist or that exists but was not observed (through problems with data collection, for example). 
- When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data.
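
As a small illustration of the points above, the sketch below shows a descriptive statistic skipping NaN by default, and a simple count of missing values as a starting point for studying the missingness itself. The Series `readings` is made-up example data, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one missing observation
readings = pd.Series([1.0, np.nan, 3.0])

# Default behaviour: the missing value is excluded, so this is (1.0 + 3.0) / 2
mean_skipping_na = readings.mean()

# With skipna=False the NaN propagates into the result
mean_with_na = readings.mean(skipna=False)

# Counting missing values is a first step toward analysing the gaps themselves
n_missing = readings.isnull().sum()

print(mean_skipping_na, mean_with_na, n_missing)
```

The `skipna` argument is available on most pandas reductions (sum, mean, min, max, and so on).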

In [2]:
# Import libraries
import pandas as pd
import numpy as np

In [3]:
# Create a pandas Series of strings containing one missing value
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [4]:
# Use the isnull method to check for missing values
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [5]:
# The built-in Python None value is also treated as NA in object arrays

# Replace the first element in string_data with None
string_data[0] = None

# Check for missing values
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

**TABLE**: NA handling methods

| Method                  | Description |
| :---                  |    :----    |
|dropna| Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
|fillna| Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
|isnull| Return boolean values indicating which values are missing/NA.
|notnull| Negation of isnull.

### Filtering Out Missing Data
- You can filter out missing data by hand using **pandas.isnull** and boolean indexing.
- Or you can use the **dropna** function. 
- On a Series, **dropna** returns the Series with only the non-null data and index values.
- With DataFrame objects, things are a bit more complex. You may want to drop rows or columns that are all NA or only those containing any NAs. **dropna** by default drops any row containing a missing value.
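
A minimal sketch of the "by hand" approach next to **dropna**, to show that they select the same values. The names `values` and `frame` are illustrative only.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, np.nan, 3.5, np.nan, 7])

# Filtering by hand: keep the entries where notnull() is True
by_hand = values[values.notnull()]

# The same result using dropna
with_dropna = values.dropna()
print(by_hand.equals(with_dropna))

# For a DataFrame, the default dropna() keeps only rows with no missing values,
# which matches a row-wise boolean mask
frame = pd.DataFrame([[1.0, np.nan], [2.0, 3.0]])
complete_rows = frame[frame.notnull().all(axis=1)]
print(complete_rows)
```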

In [6]:
# Create a Series that contains missing values
data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [7]:
# Use dropna to remove the missing values
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [8]:
# Create a DataFrame with missing values
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [9]:
# Use dropna to remove rows with missing values
data.dropna()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [10]:
# Passing how='all' will only drop rows that are all NA
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [11]:
# Add a new column (labeled 4) in which all values are NA
data[4] = np.nan
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [12]:
# To drop columns in the same way, pass axis=1
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


- Suppose you want to keep only rows containing a certain number of observations. 
- You can indicate this with the **thresh** argument for **dropna**.

In [13]:
# Create a DataFrame with 7 rows and 3 columns
df = pd.DataFrame(np.random.randn(7, 3))

# Replace first 4 values for column 1 with NA values
df.iloc[:4, 1] = np.nan

# Replace the first 2 values for column 2 with NA values
df.iloc[:2, 2] = np.nan

df

Unnamed: 0,0,1,2
0,1.757228,,
1,-0.787853,,
2,0.120561,,0.390708
3,1.128888,,0.832685
4,-1.023418,0.21294,-0.885268
5,-1.308407,0.691072,-0.963753
6,1.313737,1.89295,0.123536


In [14]:
# If we use dropna with no arguments, all rows with at least 1 NA value will be filtered out
df.dropna()

Unnamed: 0,0,1,2
4,-1.023418,0.21294,-0.885268
5,-1.308407,0.691072,-0.963753
6,1.313737,1.89295,0.123536


In [15]:
# The thresh argument sets the minimum number of non-NA values a row must have to be kept
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,0.120561,,0.390708
3,1.128888,,0.832685
4,-1.023418,0.21294,-0.885268
5,-1.308407,0.691072,-0.963753
6,1.313737,1.89295,0.123536
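
Because the example above uses random numbers, here is a small deterministic sketch of the same idea. It assumes a made-up three-row frame and checks that **thresh=2** keeps exactly the rows with at least two non-NA values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows 0, 1, 2 have 1, 2, and 3 non-NA values respectively
frame = pd.DataFrame({
    'a': [1.0, 2.0, 3.0],
    'b': [np.nan, np.nan, 4.0],
    'c': [np.nan, 5.0, 6.0],
})

# Count the non-NA values in each row
non_na_counts = frame.notnull().sum(axis=1)

# thresh=2 keeps rows with at least 2 non-NA values
kept = frame.dropna(thresh=2)

# The equivalent filter written with boolean indexing
same = frame[non_na_counts >= 2]

print(kept)
```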


### Filling In Missing Data
- Rather than filtering out missing data you may want to fill in the “holes” in any number of ways. 
- For most purposes, the **fillna** method is the workhorse function to use.

In [16]:
# Calling fillna with a constant replaces missing values with that value
df.fillna(0)

Unnamed: 0,0,1,2
0,1.757228,0.0,0.0
1,-0.787853,0.0,0.0
2,0.120561,0.0,0.390708
3,1.128888,0.0,0.832685
4,-1.023418,0.21294,-0.885268
5,-1.308407,0.691072,-0.963753
6,1.313737,1.89295,0.123536


In [17]:
# Calling fillna with a dict, you can use a different fill value for each column
df.fillna({1: 0.5, 2: 0})

Unnamed: 0,0,1,2
0,1.757228,0.5,0.0
1,-0.787853,0.5,0.0
2,0.120561,0.5,0.390708
3,1.128888,0.5,0.832685
4,-1.023418,0.21294,-0.885268
5,-1.308407,0.691072,-0.963753
6,1.313737,1.89295,0.123536


In [18]:
# fillna returns a new object, but you can modify the existing object in-place
df.fillna(0, inplace=True)
df

Unnamed: 0,0,1,2
0,1.757228,0.0,0.0
1,-0.787853,0.0,0.0
2,0.120561,0.0,0.390708
3,1.128888,0.0,0.832685
4,-1.023418,0.21294,-0.885268
5,-1.308407,0.691072,-0.963753
6,1.313737,1.89295,0.123536


In [19]:
# Create a new DataFrame
df2 = pd.DataFrame(np.random.randn(6, 3))

# Insert some missing values
df2.iloc[2:, 1] = np.nan
df2.iloc[4:, 2] = np.nan

df2

Unnamed: 0,0,1,2
0,0.462895,-0.141267,0.05833
1,0.250372,0.104162,2.621661
2,1.19389,,3.230661
3,0.070931,,-0.8125
4,-0.404251,,
5,1.816715,,


In [20]:
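# The same interpolation methods available for reindexing can be used with fillna;
# 'ffill' is the forward-fill method. Recent pandas versions (2.1 and later) deprecate
# the method argument in favor of df2.ffill(), which gives the same result.
df2.fillna(method='ffill')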

Unnamed: 0,0,1,2
0,0.462895,-0.141267,0.05833
1,0.250372,0.104162,2.621661
2,1.19389,0.104162,3.230661
3,0.070931,0.104162,-0.8125
4,-0.404251,0.104162,-0.8125
5,1.816715,0.104162,-0.8125


In [21]:
# Create a new Series with missing values
data = pd.Series([1., np.nan, 3.5, np.nan, 7])

# For example, you might fill with the mean or median value of the Series
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

**TABLE**: *fillna* function arguments

| Argument                  | Description |
| :---                  |    :----    |
|value| Scalar value or dict-like object to use to fill missing values
|method| Interpolation method to use instead of a fill value: 'ffill' (forward fill) or 'bfill' (backward fill)
|axis| Axis to fill on; default axis=0
|inplace| Modify the calling object without producing a copy
|limit| For forward and backward filling, maximum number of consecutive periods to fill
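
The **limit** argument is not demonstrated in the cells above, so here is a minimal sketch using made-up data. It uses the **ffill** method, which also accepts **limit**, and additionally shows filling each column with its own mean.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a run of three consecutive missing values
frame = pd.DataFrame({'x': [1.0, np.nan, np.nan, np.nan, 5.0]})

# Forward fill, but fill at most 2 consecutive missing values
limited = frame.ffill(limit=2)
print(limited)

# Passing a Series of column means fills each column with its own mean
mean_filled = frame.fillna(frame.mean())
print(mean_filled)
```

With **limit=2**, the third missing value in the run stays NaN.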

## Data Transformation
### Removing Duplicates

In [22]:
# Create a DataFrame containing duplicates
df3 = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                    'k2': [1, 1, 2, 3, 3, 4, 4]})
df3

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [23]:
# The DataFrame method duplicated returns a boolean Series indicating whether each
# row is a duplicate (has been observed in a previous row) or not

df3.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [24]:
# drop_duplicates returns a DataFrame where the duplicated array is False
df3.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


In [25]:
# Filter duplicates only based on the 'k1' column
df3.drop_duplicates(['k1'])

Unnamed: 0,k1,k2
0,one,1
1,two,1


In [26]:
# duplicated and drop_duplicates by default keep the first observed value combination
# Passing keep='last' will return the last one

df3.drop_duplicates(['k1', 'k2'], keep='last')

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
6,two,4


### Transforming Data Using a Function or Mapping
- For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame.

In [27]:
# Create a DataFrame with data about various kinds of meats
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [28]:
# Suppose you wanted to add a column indicating the type of animal that each food came from

# Create a dict to map each meat to the animal
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}

# The map method on a Series accepts a function or dict-like object containing a mapping

In [29]:
# First we need to convert each value in the 'food' column to lowercase using the str.lower method
lowercased = data['food'].str.lower()
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [30]:
# Now we can use the Series map method to create an extra column called 'animal'
data['animal'] = lowercased.map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


In [31]:
# Doing the same thing using a lambda function
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
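
The two approaches above differ when a value has no entry in the mapping. The sketch below uses a hypothetical extra food, 'tofu', that is missing from the dict: mapping with the dict produces a missing value, while indexing the dict inside a lambda raises a KeyError.

```python
import pandas as pd

# Hypothetical data: 'tofu' has no entry in the mapping
foods = pd.Series(['bacon', 'Pastrami', 'tofu'])
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}

# Mapping with the dict: unmapped values become missing
mapped = foods.str.lower().map(meat_to_animal)
print(mapped)

# Indexing the dict inside a lambda raises KeyError for unmapped values
try:
    foods.map(lambda x: meat_to_animal[x.lower()])
    raised = False
except KeyError:
    raised = True
print(raised)

# dict.get returns None instead of raising, which pandas treats as missing
safe = foods.map(lambda x: meat_to_animal.get(x.lower()))
print(safe)
```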

### Replacing Values
- Filling in missing data with the **fillna** method is a special case of more general value replacement. 
- **map** can be used to modify a subset of values in an object but **replace** provides a simpler and more flexible way to do so.

In [32]:
# Create a Series
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [33]:
# We can use replace to modify certain values
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [34]:
# Replace multiple values at once by passing a list
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [35]:
# To use a different replacement for each value, pass a list of substitutes
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
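
The argument passed to **replace** can also be a dict that pairs each old value with its replacement, which some readers find easier to follow than two parallel lists. A minimal sketch with the same sentinel values (the name `readings` is illustrative):

```python
import numpy as np
import pandas as pd

readings = pd.Series([1., -999., 2., -999., -1000., 3.])

# Each dict key is a value to replace; each dict value is its substitute
cleaned = readings.replace({-999: np.nan, -1000: 0})
print(cleaned)
```

Note that **replace** is distinct from **str.replace**, which performs element-wise string substitution.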

### Renaming Axis Indexes
- Like the values in a Series, axis labels can be transformed by a function or mapping of some form to produce new, differently labeled objects.
- You can also modify the axes in-place without creating a new data structure.

In [36]:
# Create a DataFrame
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [37]:
# Like a Series, the axis indexes have a map method
data.index = data.index.map(lambda x: x[:4].upper())
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [38]:
# If you want to create a transformed version of a dataset without modifying the original, 
# a useful method is rename
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11
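
**rename** also accepts dict-like objects, which is convenient when only a subset of the labels should change. The sketch below rebuilds a frame with the shortened labels from above; the replacement labels 'INDIANA' and 'peekaboo' are arbitrary examples.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((3, 4)),
                     index=['OHIO', 'COLO', 'NEW'],
                     columns=['one', 'two', 'three', 'four'])

# Only the labels listed in the dicts are changed; the rest are left alone
renamed = frame.rename(index={'OHIO': 'INDIANA'},
                       columns={'three': 'peekaboo'})
print(renamed)

# The original frame keeps its labels, since rename returns a new object
print(frame.index.tolist())
```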


### Discretization and Binning
- Continuous data is often discretized or otherwise separated into “bins” for analysis.

In [39]:
# Suppose you have data about a group of people and you want to group
# them into discrete age buckets

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

In [40]:
# Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older
bins = [18, 25, 35, 60, 100]

In [41]:
# To create the actual bins for the data we can use the pandas.cut function
cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

**pandas.cut**:
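- The object that **pandas.cut** returns is a special **Categorical object**.
- The output describes the bins computed by **pandas.cut**.
- It contains a **categories** array specifying the distinct category names, along with a labeling for the ages data in the **codes** attribute.
- **pd.value_counts(cats)** returns the bin counts for the result of **pandas.cut**.
- Consistent with mathematical interval notation, a parenthesis means that side is open (exclusive) and a square bracket means it is closed (inclusive). By default the right side is closed; you can change which side is closed by passing **right=False**.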
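
A short sketch of the effect of **right=False**, using the same ages and bin edges as above. The age 25 sits exactly on a bin edge, so it moves from the first bin to the second when the closed side changes.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# Default: intervals are closed on the right, e.g. (18, 25]
right_closed = pd.cut(ages, bins)

# right=False: intervals are closed on the left, e.g. [18, 25)
left_closed = pd.cut(ages, bins, right=False)

# ages[2] is 25, which lies exactly on the edge between the first two bins
print(right_closed[2], left_closed[2])
print(left_closed.categories)
```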

In [42]:
# Check the codes
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [43]:
# Check the categories
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')

In [44]:
# Check the bin count
pd.value_counts(cats)

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

In [45]:
# You can also pass your own bin names by passing a list or array to the labels option
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)

['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']

In [46]:
# If you pass an integer number of bins to cut instead of explicit bin edges, it will compute 
# equal-length bins based on the minimum and maximum values in the data

data = np.random.randint(20, size=20)
data

array([ 6,  3, 16, 14,  8, 13,  8,  3,  2,  7, 12, 17,  7,  2,  3,  2, 15,
       11, 16, 10])

In [47]:
# Create 4 equal-length bins
cats = pd.cut(data, 4, precision=0)
cats

[(6.0, 10.0], (2.0, 6.0], (13.0, 17.0], (13.0, 17.0], (6.0, 10.0], ..., (2.0, 6.0], (13.0, 17.0], (10.0, 13.0], (13.0, 17.0], (10.0, 13.0]]
Length: 20
Categories (4, interval[float64]): [(2.0, 6.0] < (6.0, 10.0] < (10.0, 13.0] < (13.0, 17.0]]

In [48]:
# Count the number of values in each bin
pd.value_counts(cats)

(2.0, 6.0]      6
(13.0, 17.0]    5
(6.0, 10.0]     5
(10.0, 13.0]    4
dtype: int64

- A closely related function, **qcut**, bins the data based on sample **quantiles**. 
- Depending on the distribution of the data, using cut will not usually result in each bin having the same number of data points. 
- Since qcut uses sample quantiles instead, by definition you will obtain roughly equal-size bins:
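
Besides an integer number of quantiles, **qcut** accepts a list of quantiles (numbers between 0 and 1, inclusive). The sketch below is an illustration with a seeded random sample, so the bin edges will differ from any other run, but the bin sizes follow the requested quantiles.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is repeatable
rng = np.random.default_rng(0)
sample = rng.standard_normal(1000)

# Custom quantiles: bottom 10%, next 40%, next 40%, top 10%
cats = pd.qcut(sample, [0, 0.1, 0.5, 0.9, 1.0])

# Count the values in each bin, in bin order
counts = pd.Series(cats).value_counts().sort_index()
print(counts)
```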

In [49]:
# Create a sample of normally distributed numbers
data = np.random.randn(1000)

# Cut into quartiles
cats = pd.qcut(data, 4) 
cats

[(-0.0183, 0.598], (0.598, 2.886], (0.598, 2.886], (-0.699, -0.0183], (0.598, 2.886], ..., (-3.053, -0.699], (-0.699, -0.0183], (0.598, 2.886], (-0.0183, 0.598], (-0.699, -0.0183]]
Length: 1000
Categories (4, interval[float64]): [(-3.053, -0.699] < (-0.699, -0.0183] < (-0.0183, 0.598] < (0.598, 2.886]]

In [50]:
# Count the number of values in each bin
pd.value_counts(cats)

(0.598, 2.886]       250
(-0.0183, 0.598]     250
(-0.699, -0.0183]    250
(-3.053, -0.699]     250
dtype: int64

### Detecting and Filtering Outliers
- Filtering or transforming outliers is largely a matter of applying array operations.

In [51]:
# Consider a DataFrame with some normally distributed data
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.029931,0.056211,-0.016759,0.017143
std,1.018538,1.020406,1.032588,1.018075
min,-3.545938,-3.140599,-3.574483,-2.83886
25%,-0.660644,-0.629101,-0.663264,-0.732615
50%,0.016914,0.062919,0.004181,0.046796
75%,0.717914,0.769892,0.679126,0.70804
max,3.40696,3.462624,3.105705,3.048711


In [52]:
# Find values in one of the columns exceeding 3 in absolute value
col = data[2]
col[np.abs(col) > 3]

205   -3.574483
607    3.105705
Name: 2, dtype: float64

In [53]:
# To select all rows having a value exceeding 3 or –3, you can use the any method on a
# boolean DataFrame (the axis must be passed by keyword in recent pandas versions)

data[(np.abs(data) > 3).any(axis=1)]

Unnamed: 0,0,1,2,3
42,0.623133,3.206916,-0.427564,0.261203
63,3.122911,0.922723,1.000699,0.443686
177,-3.545938,-0.323821,0.327629,-0.370684
187,0.408279,-3.140599,-0.46645,0.793055
205,-0.286145,1.665412,-3.574483,-0.896578
259,-1.535319,3.083255,-0.427867,0.440696
315,0.165413,3.462624,1.50334,-0.927113
373,-0.945507,-3.126693,0.942789,-0.309851
542,-0.942523,0.059258,-0.66006,3.027876
607,-0.720062,1.079372,3.105705,-0.61332


### Permutation and Random Sampling
- Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the **numpy.random.permutation** function. 
- Calling **permutation** with the length of the axis you want to permute produces an array of integers indicating the new ordering.

In [54]:
# Create a DataFrame
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [55]:
# Use the permutation function to create a sampler array
sampler = np.random.permutation(5)
sampler

array([1, 4, 0, 2, 3])

In [58]:
# Pass the sampler array to the take method, which returns the elements at the given
# positional indices along an axis
df.take(sampler)

Unnamed: 0,0,1,2,3
1,4,5,6,7
4,16,17,18,19
0,0,1,2,3
2,8,9,10,11
3,12,13,14,15


In [59]:
# To select a random subset without replacement, you can use the sample method
df.sample(n=3)

Unnamed: 0,0,1,2,3
1,4,5,6,7
0,0,1,2,3
4,16,17,18,19
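
To draw a sample with replacement (allowing repeated choices), pass **replace=True** to **sample**. The sketch below uses an arbitrary Series of values; **random_state** is set only to make the draw repeatable.

```python
import pandas as pd

# Hypothetical pool of values to draw from
choices = pd.Series([5, 7, -1, 6, 4])

# Draw 10 values from a pool of 5; this is only possible with replacement
draws = choices.sample(n=10, replace=True, random_state=0)
print(draws)
```

The index of the result shows which original positions were drawn, so repeated index labels indicate repeated picks.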


### Computing Indicator/Dummy Variables