# CHAPTER 7
# Data Cleaning and Preparation
- Data preparation (loading, cleaning, transforming, rearranging) is often reported to take up 80% or more of an analyst's time.
- **pandas**, along with built-in Python features, provides a high-level, flexible, and fast set of tools to manipulate data into the right form.

## Handling Missing Data
- Missing data occurs in many data analysis applications.
- **pandas** tries to make working with missing data as painless as possible.
- For example, all of the descriptive statistics on **pandas objects** exclude missing data by default.
- For numeric data, pandas uses the floating-point value **NaN (Not a Number)** to represent missing data - this is called a *sentinel value*.
- For other data types pandas uses **NA (not available)** to represent missing values. 
- In statistics applications, **NA data** may either be data that does not exist or that exists but was not observed (through problems with data collection, for example). 
- When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data.
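
As a small illustration of the point that descriptive statistics exclude missing data by default, the sketch below compares the default behavior with `skipna=False`. The Series `values` is made up for this example.

```python
import numpy as np
import pandas as pd

# A made-up Series with one missing value
values = pd.Series([1.0, np.nan, 3.0])

# By default the NaN is skipped: the mean is taken over 1.0 and 3.0
mean_default = values.mean()

# Passing skipna=False makes the missing value propagate into the result
mean_strict = values.mean(skipna=False)

# count() reports only the non-missing observations
n_observed = values.count()

print(mean_default, mean_strict, n_observed)
```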

In [None]:
# Import libraries
import pandas as pd
import numpy as np

In [None]:
# Create a pandas Series of strings containing one missing value
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

In [None]:
# Use isnull function to check for missing values
string_data.isnull()

In [None]:
# The built-in Python None value is also treated as NA in object arrays

# Replace the first element in string_data with None
string_data[0] = None

# Check for missing values
string_data.isnull()

**TABLE**: NA handling methods

| Method                  | Description |
| :---                  |    :----    |
|dropna| Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
|fillna| Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
|isnull| Return boolean values indicating which values are missing/NA.
|notnull| Negation of isnull.

### Filtering Out Missing Data
- You can filter out missing data by hand using **pandas.isnull** and boolean indexing.
- Or you can use the **dropna** function. 
- On a Series, **dropna** returns the Series with only the non-null data and index values.
- With DataFrame objects, things are a bit more complex. You may want to drop rows or columns that are all NA or only those containing any NAs. **dropna** by default drops any row containing a missing value.
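
Filtering "by hand" with boolean indexing and calling **dropna** should give the same result on a Series. The sketch below checks this on a made-up Series named `readings`.

```python
import numpy as np
import pandas as pd

# A made-up Series with two missing values
readings = pd.Series([1, np.nan, 3.5, np.nan, 7])

# Boolean indexing with notnull keeps only the observed entries
by_hand = readings[readings.notnull()]

# dropna gives the same values and index labels in one call
with_dropna = readings.dropna()

print(by_hand.equals(with_dropna))
```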

In [None]:
# Create a Series that contains missing values
data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data

In [None]:
# Use dropna to remove the missing values
data.dropna()

In [None]:
# Create a DataFrame with missing values
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data

In [None]:
# Use dropna to remove rows with missing values
data.dropna()

In [None]:
# Passing how='all' will only drop rows that are all NA
data.dropna(how='all')

In [None]:
# Create a 4th column with all values NA
data[4] = np.nan
data

In [None]:
# To drop columns in the same way, pass axis=1
data.dropna(axis=1, how='all')

- Suppose you want to keep only rows containing a certain number of observations. 
- You can indicate this with the **thresh** argument for **dropna**.

In [None]:
# Create a DataFrame with 7 rows and 3 columns
df = pd.DataFrame(np.random.randn(7, 3))

# Replace first 4 values for column 1 with NA values
df.iloc[:4, 1] = np.nan

# Replace the first 2 values for column 2 with NA values
df.iloc[:2, 2] = np.nan

df

In [None]:
# If we use dropna with no arguments, all rows with at least 1 NA value will be filtered out
df.dropna()

In [None]:
# The thresh argument sets the minimum number of non-NA values a row needs to be kept
df.dropna(thresh=2)
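
Because the DataFrame above is random, it can help to see **thresh** on fixed data. The following is a minimal sketch on a made-up DataFrame, `small`, where the number of observed values per row is known in advance.

```python
import numpy as np
import pandas as pd

# A made-up DataFrame: rows 0..3 hold 1, 3, 0 and 2 observed values
small = pd.DataFrame({'a': [1.0, 2.0, np.nan, np.nan],
                      'b': [np.nan, 5.0, np.nan, 8.0],
                      'c': [np.nan, 6.0, np.nan, 9.0]})

# Count the non-NA values in each row
observed_per_row = small.notnull().sum(axis=1)

# thresh=2 keeps the rows with at least 2 non-NA values (rows 1 and 3)
kept = small.dropna(thresh=2)

print(observed_per_row.tolist())
print(kept)
```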

### Filling In Missing Data
- Rather than filtering out missing data you may want to fill in the “holes” in any number of ways. 
- For most purposes, the **fillna** method is the workhorse function to use.

In [None]:
# Calling fillna with a constant replaces missing values with that value
df.fillna(0)

In [None]:
# Calling fillna with a dict, you can use a different fill value for each column
df.fillna({1: 0.5, 2: 0})

In [None]:
# fillna returns a new object, but you can modify the existing object in-place
df.fillna(0, inplace=True)
df

In [None]:
# Create a new DataFrame
df2 = pd.DataFrame(np.random.randn(6, 3))

# Insert some missing values
df2.iloc[2:, 1] = np.nan
df2.iloc[4:, 2] = np.nan

df2

In [None]:
# The same interpolation methods available for reindexing can be used to fill missing values
# ffill() forward-fills; fillna(method='ffill') is the older spelling, deprecated since pandas 2.1
df2.ffill()

# 'ffill' = forward fill method

In [None]:
# Create a new Series with missing values
data = pd.Series([1., np.nan, 3.5, np.nan, 7])

# For example you might pass the mean or median value of a Series
data.fillna(data.mean())

**TABLE**: *fillna* function arguments

| Argument                  | Description |
| :---                  |    :----    |
|value| Scalar value or dict-like object to use to fill missing values
|method| Interpolation; by default 'ffill' if function called with no other arguments
|axis| Axis to fill on; default axis=0
|inplace| Modify the calling object without producing a copy
|limit| For forward and backward filling, maximum number of consecutive periods to fill
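
The **limit** argument is perhaps easiest to understand on a small case. The sketch below forward-fills a made-up Series, `gaps`, with and without a limit; it uses the `ffill` method, which accepts the same `limit` argument.

```python
import numpy as np
import pandas as pd

# A made-up Series with a run of three consecutive missing values
gaps = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Without a limit, every hole in the run is filled with the last observed value
filled_all = gaps.ffill()

# With limit=2, only the first two consecutive holes are filled; the third stays NaN
filled_limited = gaps.ffill(limit=2)

print(filled_all.tolist())
print(filled_limited.tolist())
```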

## Data Transformation
### Removing Duplicates

In [None]:
# Create a DataFrame containing duplicates
df3 = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                    'k2': [1, 1, 2, 3, 3, 4, 4]})
df3

In [None]:
# The DataFrame method duplicated returns a boolean Series indicating whether each
# row is a duplicate (has been observed in a previous row) or not

df3.duplicated()

In [None]:
# drop_duplicates returns a DataFrame where the duplicated array is False
df3.drop_duplicates()

In [None]:
# Filter duplicates only based on the 'k1' column
df3.drop_duplicates(['k1'])

In [None]:
# duplicated and drop_duplicates by default keep the first observed value combination
# Passing keep='last' will return the last one

df3.drop_duplicates(['k1', 'k2'], keep='last')

### Transforming Data Using a Function or Mapping
- For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame.

In [None]:
# Create a DataFrame with data about various kinds of meats
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

In [None]:
# Suppose you wanted to add a column indicating the type of animal that each food came from

# Create a dict to map each meat to the animal
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}

# The map method on a Series accepts a function or dict-like object containing a mapping

In [None]:
# First we need to convert each value from 'food' column to lowercase using the str.lower
lowercased = data['food'].str.lower()
lowercased

In [None]:
# Now we can use the Series map method to create an extra column called 'animal'
data['animal'] = lowercased.map(meat_to_animal)
data

In [None]:
# Doing the same thing using a lambda function
data['food'].map(lambda x: meat_to_animal[x.lower()])
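
One difference between the two approaches shows up when a value has no entry in the mapping. The sketch below uses a shortened, made-up mapping and a food (`'tofu'`) that is not in it: mapping with the dict yields NaN, whereas a lambda that indexes the dict directly would raise `KeyError`. Using `dict.get` with a default is one possible way around that.

```python
import pandas as pd

# A shortened mapping and a Series containing one unmapped value (both made up)
animal_lookup = {'bacon': 'pig', 'pastrami': 'cow'}
foods = pd.Series(['bacon', 'Pastrami', 'tofu'])

# Mapping with a dict leaves unmatched values ('tofu') as NaN
mapped = foods.str.lower().map(animal_lookup)

# dict.get with a default avoids the KeyError that animal_lookup[...] would raise
mapped_with_default = foods.map(lambda x: animal_lookup.get(x.lower(), 'unknown'))

print(mapped.tolist())
print(mapped_with_default.tolist())
```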

### Replacing Values
- Filling in missing data with the **fillna** method is a special case of more general value replacement. 
- **map** can be used to modify a subset of values in an object but **replace** provides a simpler and more flexible way to do so.

In [None]:
# Create a Series
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

In [None]:
# We can use replace to modify certain values
data.replace(-999, np.nan)

In [None]:
# replace multiple values at once
data.replace([-999, -1000], np.nan)

In [None]:
# To use a different replacement for each value, pass a list of substitutes
data.replace([-999, -1000], [np.nan, 0])
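
**replace** also accepts a dict that pairs each value with its substitute, which can be easier to read than two parallel lists. A minimal sketch, using a copy of the Series above under the made-up name `sensor`:

```python
import numpy as np
import pandas as pd

# Same values as the Series above, under a separate name
sensor = pd.Series([1., -999., 2., -999., -1000., 3.])

# Each dict key is replaced by its value: -999 becomes NaN, -1000 becomes 0
cleaned = sensor.replace({-999: np.nan, -1000: 0})

print(cleaned.tolist())
```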

### Renaming Axis Indexes
- Like values in a Series, axis labels can be similarly transformed by a function or mapping of some form to produce new, differently labeled objects. 
- You can also modify the axes in-place without creating a new data structure.

In [None]:
# Create a DataFrame
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

In [None]:
# Like a Series, the axis indexes have a map method
data.index = data.index.map(lambda x: x[:4].upper())
data

In [None]:
# If you want to create a transformed version of a dataset without modifying the original, 
# a useful method is rename
data.rename(index=str.title, columns=str.upper)
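
**rename** can also take dict-like objects, so that only a subset of the axis labels gets new values. The sketch below rebuilds a similar DataFrame under the made-up name `states`; the replacement labels are arbitrary.

```python
import numpy as np
import pandas as pd

# A DataFrame shaped like the one above, under a separate name
states = pd.DataFrame(np.arange(12).reshape((3, 4)),
                      index=['OHIO', 'COLO', 'NEW '],
                      columns=['one', 'two', 'three', 'four'])

# Only the labels listed in the dicts change; all other labels are kept
renamed = states.rename(index={'OHIO': 'INDIANA'},
                        columns={'three': 'peekaboo'})

# rename returns a new object, so the original labels are untouched
print(renamed.index.tolist(), renamed.columns.tolist())
print(states.index.tolist())
```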

### Discretization and Binning
- Continuous data is often discretized or otherwise separated into “bins” for analysis.

In [None]:
# Suppose you have data about a group of people and you want to group
# them into discrete age buckets

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

In [None]:
# Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older
bins = [18, 25, 35, 60, 100]

In [None]:
# To create the actual bins for the data we can use the pandas.cut function
cats = pd.cut(ages, bins)
cats

**pandas.cut**:
- The object pandas returns is a special **Categorical object**. 
- The output describes the bins computed by **pandas.cut**.
- It contains a categories array specifying the distinct **category names** along with a labeling for the ages data in the **codes attribute**.
- **pd.value_counts(cats)** returns the bin counts for the result of **pandas.cut**.
- You can change which side is closed by passing **right=False**.
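
To illustrate **right=False**, the sketch below bins the same ages with intervals that are closed on the left. The edges in `left_edges` are an assumption chosen so that the buckets still mean 18 to 25, 26 to 35, 36 to 60, and 61 and older.

```python
import pandas as pd

sample_ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

# With right=False each bin includes its left edge and excludes its right edge,
# so the edges are shifted by one year to describe the same age buckets
left_edges = [18, 26, 36, 61, 100]
left_closed = pd.cut(sample_ages, left_edges, right=False)

# Count the number of ages per bin, in bin order
left_counts = pd.Series(left_closed).value_counts().sort_index()

print(left_closed.categories)
print(left_counts.tolist())
```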

In [None]:
# Check the codes
cats.codes

In [None]:
# Check the categories
cats.categories

In [None]:
# Check the bin count
pd.value_counts(cats)

In [None]:
# You can also pass your own bin names by passing a list or array to the labels option
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)

In [None]:
# If you pass an integer number of bins to cut instead of explicit bin edges, it will compute 
# equal-length bins based on the minimum and maximum values in the data

data = np.random.randint(20, size=20)
data

In [None]:
# Create 4 bins of equal-length
cats = pd.cut(data, 4, precision=0)
cats

In [None]:
# Count the number of values in each bin
pd.value_counts(cats)

- A closely related function, **qcut**, bins the data based on sample **quantiles**. 
- Depending on the distribution of the data, using cut will not usually result in each bin having the same number of data points. 
- Since qcut uses sample quantiles instead, by definition you will obtain roughly equal-size bins:

In [None]:
# Create a sample of normally distributed numbers
data = np.random.randn(1000)

# Cut into quartiles
cats = pd.qcut(data, 4) 
cats

In [None]:
# Count the number of values in each bin
pd.value_counts(cats)
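
As with **cut**, you can pass your own quantiles (numbers between 0 and 1, inclusive) to **qcut**. The sketch below uses a seeded generator so the draw is reproducible; the seed and the names `draws` and `custom` are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

# Reproducible sample of normally distributed numbers (seed chosen arbitrarily)
rng = np.random.default_rng(12345)
draws = rng.standard_normal(1000)

# Bin edges at the 10th, 50th and 90th percentiles instead of quartiles
custom = pd.qcut(draws, [0, 0.1, 0.5, 0.9, 1.0])

# The bin sizes follow the quantile spacing: about 10%, 40%, 40%, 10% of the data
custom_counts = pd.Series(custom).value_counts().sort_index()
print(custom_counts)
```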

### Detecting and Filtering Outliers
- Filtering or transforming outliers is largely a matter of applying array operations.

In [None]:
# Consider a DataFrame with some normally distributed data
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()

In [None]:
# Find values in one of the columns exceeding 3 in absolute value
col = data[2]
col[np.abs(col) > 3]

In [None]:
# To select all rows having a value exceeding 3 or –3, you can use the any method on a
# boolean DataFrame

data[(np.abs(data) > 3).any(axis=1)]
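
Once outliers are located, one common option is to cap them rather than drop the rows. The sketch below does this with `clip` on a small made-up DataFrame, `measurements`; the threshold of 3 matches the examples above, and capping is only one of several reasonable treatments.

```python
import pandas as pd

# A made-up DataFrame with two values outside the interval [-3, 3]
measurements = pd.DataFrame({'x': [0.5, -3.4, 1.2],
                             'y': [3.7, 0.1, -0.2]})

# Rows containing at least one value exceeding 3 in absolute value
outlier_rows = measurements[(measurements.abs() > 3).any(axis=1)]

# clip caps every value to the interval [-3, 3] and returns a new DataFrame
capped = measurements.clip(lower=-3, upper=3)

print(outlier_rows)
print(capped)
```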

### Permutation and Random Sampling
- Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the **numpy.random.permutation** function. 
- Calling **permutation** with the length of the axis you want to permute produces an array of integers indicating the new ordering.

In [None]:
# Create a DataFrame
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
df

In [None]:
# Use the permutation function to create a sampler array
sampler = np.random.permutation(5)
sampler

In [None]:
# Pass the sampler array to the take method, which returns the elements at the given
# positional indices along an axis
df.take(sampler)

In [None]:
# To select a random subset without replacement, you can use the sample method
df.sample(n=3)
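
To generate a sample with replacement (allowing repeated choices), **sample** takes `replace=True`. A minimal sketch on a made-up Series, `choices`; `random_state` is set only to make the draw repeatable.

```python
import pandas as pd

# A made-up pool of values to draw from
choices = pd.Series([5, 7, -1, 6, 4])

# Draw more items than the pool holds, which is only possible with replacement
resampled = choices.sample(n=10, replace=True, random_state=0)

print(resampled)
```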

### Computing Indicator/Dummy Variables
- Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a **dummy** or **indicator matrix**. 
- If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns containing all 1s and 0s. 
- pandas has a **get_dummies** function for doing this, though devising one yourself is not difficult.

In [None]:
# Create a DataFrame
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
df

In [None]:
# Use the get_dummies function on the 'key' column to derive a matrix with 3 columns
pd.get_dummies(df['key'])

In [None]:
# get_dummies has a prefix argument if you want to add a prefix to the columns in the 
# indicator DataFrame

# Create the dummy matrix with prefix
dummies = pd.get_dummies(df['key'], prefix='key')

# Join the dummy matrix with the original data
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
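
A useful recipe for statistical applications is to combine **get_dummies** with a discretization function like **cut**. The sketch below uses made-up values and bin edges so that each value falls into a different bin; `dtype=int` is passed so the indicators are 0/1 integers rather than booleans.

```python
import numpy as np
import pandas as pd

# Made-up values and bin edges; each value lands in a different bin
scores = [0.05, 0.3, 0.5, 0.7, 0.95]
edges = [0, 0.2, 0.4, 0.6, 0.8, 1]

# cut assigns each value to a bin; get_dummies turns the bins into indicator columns
indicators = pd.get_dummies(pd.cut(scores, edges), dtype=int)

print(indicators)
```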

**EXAMPLE**: MovieLens 1M dataset

In [None]:
# Define name for the columns
mnames = ['movie_id', 'title', 'genres']

# Read the data from movies.dat file
movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
                       header=None, names=mnames, engine='python')

movies.head(10)

In [None]:
# Since a row in the DataFrame can belong to multiple genres,
# some wrangling is necessary to get the dummy matrix

# First, we extract the list of unique genres in the dataset
all_genres = []

for x in movies.genres:
    all_genres.extend(x.split('|'))

genres = pd.unique(all_genres)

genres

In [None]:
# We start with a NumPy array of all zeros
zero_matrix = np.zeros((len(movies), len(genres)))
zero_matrix

In [None]:
# Build a DataFrame from it, naming the columns with the list of unique genres we created before
dummies = pd.DataFrame(zero_matrix, columns=genres)
dummies.head()

In [None]:
# Select first element as an example
gen = movies.genres[0]

# Split the text based on '|'
gen.split('|')

In [None]:
# Use the dummies.columns to compute the column indices for each genre
dummies.columns.get_indexer(gen.split('|'))

In [None]:
# We can use .iloc to set values based on these indices:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

In [None]:
# Combine this with movies
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[5]
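
For delimiter-separated categories like these, the vectorized string method `str.get_dummies` may be a shorter alternative to the loop above. The sketch below runs it on a tiny made-up DataFrame (`toy_movies`, with placeholder titles) rather than on the MovieLens file.

```python
import pandas as pd

# A tiny stand-in for the movies table; titles and genres are made up
toy_movies = pd.DataFrame({'title': ['Film A', 'Film B', 'Film C'],
                           'genres': ['Animation|Comedy', 'Drama', 'Comedy|Drama']})

# Split each string on '|' and build one 0/1 indicator column per distinct genre
genre_dummies = toy_movies['genres'].str.get_dummies(sep='|')

# Combine the indicators with the original table
toy_windic = toy_movies.join(genre_dummies.add_prefix('Genre_'))

print(toy_windic)
```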

## String Manipulation
### String Object Methods
- In many string munging and scripting applications, built-in string methods are sufficient.

In [None]:
# As an example, a comma-separated string can be broken into pieces with split
val = 'a,b, guido'
val.split(',')

In [None]:
# split is often combined with strip to trim whitespace (including line breaks)
pieces = [x.strip() for x in val.split(',')]
pieces

In [None]:
# These substrings could be concatenated together with a two-colon delimiter using addition
first, second, third = pieces
first + '::' + second + '::' + third

In [None]:
# But this isn’t a practical generic method. A faster and more Pythonic way is to pass a
# list or tuple to the join method on the string '::'
'::'.join(pieces)

In [None]:
# Using Python’s in keyword is the best way to detect a substring
'guido' in val

In [None]:
# Get the index of a certain substring
val.index(',')

# index raises an exception if the substring isn’t found

In [None]:
# You can also use find to get the index of a substring
val.find('a')

# find returns -1 if a substring isn't found

In [None]:
# count returns the number of occurrences of a particular substring
val.count(',')

In [None]:
# replace will substitute occurrences of one pattern for another
# It is commonly used to delete patterns, too, by passing an empty string
val.replace(',', '')

**TABLE**: Python built-in string methods

| Method                  | Description |
| :---                  |    :----    |
|count| Return the number of non-overlapping occurrences of substring in the string.
|endswith| Returns True if string ends with suffix.
|startswith| Returns True if string starts with prefix.
|join| Use string as delimiter for concatenating a sequence of other strings.
|index |Return position of first character in substring if found in the string; raises ValueError if not found.
|find| Return position of first character of first occurrence of substring in the string; like index, but returns –1 if not found.
|rfind| Return position of first character of last occurrence of substring in the string; returns –1 if not found.
|replace| Replace occurrences of string with another string.
|strip, rstrip, lstrip| Trim whitespace, including newlines, from both sides (strip), the right side (rstrip), or the left side (lstrip).
|split| Break string into list of substrings using passed delimiter.
|lower| Convert alphabet characters to lowercase.
|upper| Convert alphabet characters to uppercase.
|casefold| Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form.
|ljust, rjust| Left justify or right justify, respectively; pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width.
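
A few of the methods in the table that were not demonstrated above can be tried on the same example string. This is only a quick sketch; the variable names are made up.

```python
# The same example string used in the cells above
sample = 'a,b, guido'

# startswith / endswith test the two ends of the string
starts = sample.startswith('a')
ends = sample.endswith('guido')

# rfind gives the position of the last comma; find returns -1 when nothing matches
last_comma = sample.rfind(',')
missing = sample.find(':')

# casefold lowercases for comparison; ljust pads on the right to a minimum width
folded = 'Guido'.casefold()
padded = 'ab'.ljust(5, '.')

print(starts, ends, last_comma, missing, folded, padded)
```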

### Regular Expressions
- Regular expressions provide a flexible way to search or match (often more complex) string patterns in text. 
- A single expression, commonly called a **regex**, is a string formed according to the regular expression language. 
- Python’s built-in **re module** is responsible for applying regular expressions to strings.
- The **re module** functions fall into three categories: pattern matching, substitution, and splitting. 
- A **regex** describes a pattern to locate in the text, which can then be used for many purposes.

In [None]:
# Import the re module
import re

# Define a string with variable number of whitespace characters (tabs, spaces, and newlines)
text = "foo bar\t baz \tqux"

In [None]:
# To split the text on whitespace you can use the regex describing one or more whitespace 
# characters, \s+, combined with the re.split function (a raw string avoids escape warnings)
re.split(r'\s+', text)

In [None]:
# If you want to get a list of all patterns matching the regex, you can use the findall function
re.findall(r'\s+', text)

**Compile Regex**:
- When you call **re.split('\s+', text)**, the regular expression is first compiled, and then its split method is called on the passed text. 
- You can compile the regex yourself with **re.compile**, forming a reusable regex object.
- Creating a regex object with re.compile is highly recommended if you intend to apply the same expression to many strings; doing so will save CPU cycles.

In [None]:
# Let’s consider a block of text and a regular expression capable of identifying 
# most email addresses

text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

In [None]:
# Using findall on the text produces a list of the email addresses
regex.findall(text)

In [None]:
# search returns a special match object for the first email address in the text
regex.search(text)

In [None]:
# regex.match returns None, as it only will match if the pattern 
# occurs at the start of the string
print(regex.match(text))

In [None]:
# sub will return a new string with occurrences of the pattern replaced by
# a new string
print(regex.sub('REDACTED', text))
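
If you also want to split each address into its username, domain name, and domain suffix, you can put parentheses around those parts of the pattern. The sketch below shows how the groups surface in `match`, `findall`, and `sub`; the sample address and the shortened text are made up for this example.

```python
import re

# Same email pattern as above, with parentheses around the three components
grouped_pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
grouped_regex = re.compile(grouped_pattern, flags=re.IGNORECASE)

# A match object returns the components as a tuple via its groups method
m = grouped_regex.match('wesm@bright.net')
components = m.groups()

# With groups in the pattern, findall returns a list of tuples
sample_text = "Dave dave@google.com\nSteve steve@gmail.com\n"
found = grouped_regex.findall(sample_text)

# In sub, \1, \2, \3 refer to the first, second and third matched groups
relabelled = grouped_regex.sub(r'Username: \1, Domain: \2, Suffix: \3', sample_text)

print(components)
print(found)
print(relabelled)
```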

**TABLE**: Regular expression methods

| Method                  | Description |
| :---                  |    :----    |
|findall| Return all non-overlapping matching patterns in a string as a list
|finditer| Like findall, but returns an iterator
|match| Match pattern at start of string and optionally segment pattern components into groups; if the pattern matches, returns a match object, and otherwise None
|search| Scan string for match to pattern; returning a match object if so; unlike match, the match can be anywhere in the string as opposed to only at the beginning
|split| Break string into pieces at each occurrence of pattern
|sub, subn| Replace occurrences of pattern in string with replacement expression; subn does the same but returns a tuple of the new string and the number of replacements made; use symbols \1, \2, ... to refer to match group elements in the replacement string

### Vectorized String Functions in pandas
- Cleaning up a messy dataset for analysis often requires a lot of string munging and regularization. 
- To complicate matters, a column containing strings will sometimes have missing data.
- Series has array-oriented methods for string operations that skip NA values. These are accessed through Series’s **str** attribute.

In [None]:
# Create a Series of strings with NA values
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data

In [None]:
# Check whether each email address has 'gmail' in it with str.contains
data.str.contains('gmail')

In [None]:
# Regular expressions can be used, too, along with any re options like IGNORECASE
data.str.findall('([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})', flags=re.IGNORECASE)

In [None]:
# Match pattern at start of each string - returns a boolean Series (missing values stay NA)
matches = data.str.match('([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})', flags=re.IGNORECASE)
matches

In [None]:
# You can slice strings using this syntax
data.str[:5]
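
To pull the matched groups out into columns, the **extract** method listed in the table below can be used. A minimal sketch on the same email data, stored under the made-up name `emails`:

```python
import re

import numpy as np
import pandas as pd

# Same email data as above, under a separate name
emails = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                    'Rob': 'rob@gmail.com', 'Wes': np.nan})

email_pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# extract returns a DataFrame with one column per group; the NA row stays NA
parts = emails.str.extract(email_pattern, flags=re.IGNORECASE)

print(parts)
```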

**TABLE**: Partial listing of vectorized string methods

| Method                  | Description |
| :---                  |    :----    |
|cat| Concatenate strings element-wise with optional delimiter
|contains| Return boolean array indicating whether each string contains pattern/regex
|count| Count occurrences of pattern
|extract| Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group
|endswith| Equivalent to x.endswith(pattern) for each element
|startswith| Equivalent to x.startswith(pattern) for each element
|findall| Compute list of all occurrences of pattern/regex for each string
|get| Index into each element (retrieve i-th element)
|isalnum| Equivalent to built-in str.isalnum
|isalpha| Equivalent to built-in str.isalpha
|isdecimal| Equivalent to built-in str.isdecimal
|isdigit| Equivalent to built-in str.isdigit
|islower| Equivalent to built-in str.islower
|isnumeric| Equivalent to built-in str.isnumeric
|isupper| Equivalent to built-in str.isupper
|join| Join strings in each element of the Series with passed separator
|len| Compute length of each string
|lower, upper| Convert cases; equivalent to x.lower() or x.upper() for each element
|match| Use re.match with the passed regular expression on each element, returning True or False for whether it matches
|pad| Add whitespace to left, right, or both sides of strings
|center| Equivalent to pad(side='both')
|repeat| Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string)
|replace| Replace occurrences of pattern/regex with some other string
|slice| Slice each string in the Series
|split| Split strings on delimiter or regular expression
|strip| Trim whitespace from both sides, including newlines
|rstrip| Trim whitespace on right side
|lstrip| Trim whitespace on left side