# Data Cleaning and Preparation

During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst’s time. Sometimes the way that data is stored in files or databases is not in the right format for a particular task.

Fortunately, pandas, along with the built-in Python language features, provides you with a high-level, flexible, and fast set of tools to enable you to manipulate data into the right form.

In this chapter we discuss tools for missing data, duplicate data, string manipulation, and some other analytical data transformations.

In [2]:
import pandas as pd
import numpy as np

## Handling Missing Data

In [5]:
string_data = pd.Series(['orange', 'watermelon', np.nan, 'mango'])

In [6]:
string_data

0        orange
1    watermelon
2           NaN
3         mango
dtype: object

In [7]:
pd.isnull(string_data)

0    False
1    False
2     True
3    False
dtype: bool

In [10]:
from numpy import nan as NA

data = pd.Series([1, NA, 3.5, NA, 7])

data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [11]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [12]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

In [13]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],[NA, NA, NA], [NA, 6.5, 3.]])

In [14]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [15]:
data.dropna()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [16]:
data.dropna(axis=1)

0
1
2
3
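By default, `dropna` drops every row (or, with `axis=1`, every column) that contains at least one missing value, which is why the column-wise call above leaves nothing behind. The following is a minimal sketch, using a hypothetical `frame` variable, of the `how='all'` argument, which drops only rows or columns that are entirely NA:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                      [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# Drop only the rows in which every value is missing (row 2 here)
rows_kept = frame.dropna(how='all')

# Add an all-NA column, then drop only the columns that are entirely missing
frame[4] = np.nan
cols_kept = frame.dropna(axis=1, how='all')

print(rows_kept)
print(cols_kept)
```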


In [17]:
df = pd.DataFrame(np.random.randn(7, 3))

df.iloc[:4, 1] = NA

df.iloc[:2, 2] = NA

In [18]:
df

Unnamed: 0,0,1,2
0,-0.055479,,
1,-0.669051,,
2,-0.603817,,-0.449571
3,-1.213257,,-0.814114
4,-0.303501,1.303041,0.482263
5,-0.301736,0.880402,-1.365236
6,-0.413435,1.319963,0.86739


A related way to filter out DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing a certain number of observations. You can indicate this with the thresh argument:

In [19]:
df.dropna(thresh=2)  # keep rows with at least 2 non-NA values

Unnamed: 0,0,1,2
2,-0.603817,,-0.449571
3,-1.213257,,-0.814114
4,-0.303501,1.303041,0.482263
5,-0.301736,0.880402,-1.365236
6,-0.413435,1.319963,0.86739
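To make the meaning of `thresh` concrete, here is a small sketch on a hypothetical three-row frame. It compares `dropna(thresh=2)` with an explicit filter that counts the non-NA values in each row:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[0.1, np.nan, np.nan],
                      [0.2, np.nan, 1.5],
                      [0.3, 2.0, 2.5]])

# thresh=2 keeps rows that have at least two non-NA values
by_thresh = frame.dropna(thresh=2)

# The same filter written out explicitly: count non-NA values per row
non_na_counts = frame.notna().sum(axis=1)
by_mask = frame[non_na_counts >= 2]

print(by_thresh)
```

Row 0 has a single observation, so it is the only row removed.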


### Filling in missing data

In [25]:
df

Unnamed: 0,0,1,2
0,-0.055479,,
1,-0.669051,,
2,-0.603817,,-0.449571
3,-1.213257,,-0.814114
4,-0.303501,1.303041,0.482263
5,-0.301736,0.880402,-1.365236
6,-0.413435,1.319963,0.86739


In [26]:
df.fillna(0)

Unnamed: 0,0,1,2
0,-0.055479,0.0,0.0
1,-0.669051,0.0,0.0
2,-0.603817,0.0,-0.449571
3,-1.213257,0.0,-0.814114
4,-0.303501,1.303041,0.482263
5,-0.301736,0.880402,-1.365236
6,-0.413435,1.319963,0.86739


In [27]:
df.fillna({1:0.5, 2:0})

Unnamed: 0,0,1,2
0,-0.055479,0.5,0.0
1,-0.669051,0.5,0.0
2,-0.603817,0.5,-0.449571
3,-1.213257,0.5,-0.814114
4,-0.303501,1.303041,0.482263
5,-0.301736,0.880402,-1.365236
6,-0.413435,1.319963,0.86739


In [31]:
df

Unnamed: 0,0,1,2
0,-0.055479,,
1,-0.669051,,
2,-0.603817,,-0.449571
3,-1.213257,,-0.814114
4,-0.303501,1.303041,0.482263
5,-0.301736,0.880402,-1.365236
6,-0.413435,1.319963,0.86739


In [39]:
df.fillna({1:df[1].mean(),2:df[2].mean()})

Unnamed: 0,0,1,2
0,-0.055479,1.167802,-0.255853
1,-0.669051,1.167802,-0.255853
2,-0.603817,1.167802,-0.449571
3,-1.213257,1.167802,-0.814114
4,-0.303501,1.303041,0.482263
5,-0.301736,0.880402,-1.365236
6,-0.413435,1.319963,0.86739


In [40]:
df[1].mean()

1.167802092483264

In [41]:
df[1]

0         NaN
1         NaN
2         NaN
3         NaN
4    1.303041
5    0.880402
6    1.319963
Name: 1, dtype: float64
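Filling each column with its own mean does not require spelling out a dict by hand. As a sketch on a hypothetical two-column frame, passing the Series returned by `mean()` to `fillna` aligns the fill values on column labels:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                      'b': [np.nan, 4.0, 8.0]})

# frame.mean() is a Series indexed by column name, so each column's
# missing values are filled with that column's own mean
filled = frame.fillna(frame.mean())

print(filled)
```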

## Data Transformation

### Removing duplicates

In [42]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],'k2': [1, 1, 2, 3, 3, 4, 4]})

In [43]:
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [44]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


In [46]:
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2
0,one,1
1,two,1


In [47]:
data.drop_duplicates(['k1'], keep='last')

Unnamed: 0,k1,k2
4,one,3
6,two,4
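It can help to inspect which rows are considered duplicates before dropping them. The sketch below rebuilds the same example frame under the hypothetical name `frame` and uses the `duplicated` method, which returns a Boolean Series:

```python
import pandas as pd

frame = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                      'k2': [1, 1, 2, 3, 3, 4, 4]})

# True for each row that repeats an earlier row
is_dup = frame.duplicated()

# Filtering with the negated mask gives the same result as drop_duplicates()
deduped = frame[~is_dup]

print(is_dup)
```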


### Transforming data using a function or mapping

In [48]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon','Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

In [49]:
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [50]:
meat_to_animal = {
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon'
}

In [51]:
def get_animal(x):
    return meat_to_animal[x]

In [58]:
data['animal'] = data['food'].map(lambda x: get_animal(x.lower()))

In [59]:
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon
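`map` also accepts a dict directly, so the helper function is optional. The sketch below uses a hypothetical `foods` Series with one entry (`'tofu'`) that is not in the mapping. With a dict, unmapped values become NaN, whereas the `get_animal` lookup above would raise a `KeyError`:

```python
import pandas as pd

foods = pd.Series(['bacon', 'Pastrami', 'nova lox', 'tofu'])
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow', 'nova lox': 'salmon'}

# Lowercase with the vectorized .str accessor, then look each value up
# in the dict; keys that are absent (here 'tofu') become NaN
animals = foods.str.lower().map(meat_to_animal)

print(animals)
```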


In [82]:
def count_len(x):
    return len(str(x))

In [83]:
data.apply(count_len)

food      178
ounces    108
animal    135
dtype: int64

In [84]:
data.applymap(count_len)

Unnamed: 0,food,ounces,animal
0,5,3,3
1,11,3,3
2,5,4,3
3,8,3,3
4,11,3,3
5,5,3,3
6,8,3,3
7,9,3,3
8,8,3,6


In [92]:
data.index.map(lambda x: x+100)

Int64Index([100, 101, 102, 103, 104, 105, 106, 107, 108], dtype='int64')

In [93]:
index2 = data.index.map(lambda x: x+100)

In [94]:
data.set_index(index2)

Unnamed: 0,food,ounces,animal
100,bacon,4.0,pig
101,pulled pork,3.0,pig
102,bacon,12.0,pig
103,Pastrami,6.0,cow
104,corned beef,7.5,cow
105,Bacon,8.0,pig
106,pastrami,3.0,cow
107,honey ham,5.0,pig
108,nova lox,6.0,salmon


### Replacing values

In [85]:
data.replace([3.0, 8.0], 9999)

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,9999.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,9999.0,pig
6,pastrami,9999.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon
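When different values need different replacements, `replace` can take a dict instead of a list. This is a minimal sketch on a hypothetical `readings` Series in which -999 and -1000 are assumed to be sentinel codes:

```python
import numpy as np
import pandas as pd

readings = pd.Series([1., -999., 2., -999., -1000., 3.])

# Map each sentinel to its own replacement: -999 becomes NA, -1000 becomes 0
cleaned = readings.replace({-999: np.nan, -1000: 0})

print(cleaned)
```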


### Discretization and binning

If a variable can take on any value between two specified values, it is called a continuous variable; otherwise, it is called a discrete variable.

Continuous data is often discretized or otherwise separated into “bins” for analysis. Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets:


In [95]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To do so, you have to use cut, a function in pandas:


In [96]:

bins = [18, 25, 35, 60, 100]

cats = pd.cut(ages, bins)

In [97]:
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

In [98]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [99]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
              closed='right',
              dtype='interval[int64]')

In [101]:
pd.value_counts(cats)

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

In [103]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

In [104]:
pd.cut(ages, bins, labels=group_names)

[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
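In the interval notation shown above, a parenthesis marks an open side and a square bracket a closed side, so by default an age of exactly 25 lands in `(18, 25]`. As a sketch, passing `right=False` to `cut` closes the bins on the left instead:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# Default: intervals are closed on the right, so 25 falls in (18, 25]
right_closed = pd.cut(ages, bins)

# right=False closes intervals on the left, so 25 falls in [25, 35)
left_closed = pd.cut(ages, bins, right=False)

# Bin codes for the third age (25) under each convention
print(right_closed.codes[2], left_closed.codes[2])
```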

If you pass an integer number of bins to cut instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data. Consider the case of some uniformly distributed data chopped into fourths:


In [105]:
data = np.random.rand(20)

pd.cut(data, 4, precision=2)

[(0.29, 0.53], (0.76, 0.99], (0.76, 0.99], (0.76, 0.99], (0.76, 0.99], ..., (0.061, 0.29], (0.29, 0.53], (0.76, 0.99], (0.76, 0.99], (0.061, 0.29]]
Length: 20
Categories (4, interval[float64]): [(0.061, 0.29] < (0.29, 0.53] < (0.53, 0.76] < (0.76, 0.99]]

A closely related function, qcut, bins the data based on sample quantiles. Depending on the distribution of the data, using cut will not usually result in each bin having the same number of data points. Since qcut uses sample quantiles instead, by definition you will obtain roughly equal-size bins:


In [106]:
data = np.random.randn(1000)  # Normally distributed

cats = pd.qcut(data, 4)  # Cut into quartiles

In [107]:
cats.value_counts()

(-2.927, -0.7]      250
(-0.7, 0.00905]     250
(0.00905, 0.688]    250
(0.688, 3.23]       250
dtype: int64
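`qcut` also accepts explicit quantiles (numbers between 0 and 1) in place of a bin count. The sketch below uses a seeded generator and a hypothetical `values` array so that it is reproducible; the quantile edges are chosen only for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = rng.standard_normal(1000)

# Quantile edges between 0 and 1: bottom 10%, next 40%, next 40%, top 10%
cats = pd.qcut(values, [0, 0.1, 0.5, 0.9, 1.0])

# Count the observations that landed in each bin, in bin order
counts = np.bincount(cats.codes)
print(counts)
```

The bin sizes follow the requested quantile widths rather than being equal.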

### Detecting and Filtering Outliers

Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:


In [108]:
data = pd.DataFrame(np.random.randn(1000, 4))

data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.093739,0.038241,-0.035956,0.023261
std,0.993193,0.985946,0.971084,0.977097
min,-3.155654,-3.592892,-3.143519,-3.205319
25%,-0.578782,-0.611162,-0.66789,-0.640727
50%,0.109837,0.048154,-0.078959,0.004329
75%,0.788611,0.735856,0.61401,0.724184
max,3.123581,3.081811,3.256618,3.394928


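Values can be set based on a Boolean condition such as np.abs(data) > 3. Here is code to cap values outside the interval –3 to 3, using np.sign so that each capped value keeps the sign of the original: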

In [109]:
data[np.abs(data) > 3] = np.sign(data) * 3

In [111]:
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.093771,0.038937,-0.036069,0.023072
std,0.992325,0.983099,0.969797,0.975153
min,-3.0,-3.0,-3.0,-3.0
25%,-0.578782,-0.611162,-0.66789,-0.640727
50%,0.109837,0.048154,-0.078959,0.004329
75%,0.788611,0.735856,0.61401,0.724184
max,3.0,3.0,3.0,3.0


In [113]:
np.sign(data).head()

Unnamed: 0,0,1,2,3
0,1.0,-1.0,1.0,-1.0
1,1.0,-1.0,-1.0,-1.0
2,-1.0,-1.0,-1.0,-1.0
3,1.0,1.0,1.0,1.0
4,1.0,-1.0,1.0,1.0
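The same capping can be sketched with the `clip` method, together with a Boolean filter that selects the rows containing at least one outlier. This uses a seeded generator and a hypothetical `frame` variable, and is an alternative to the `np.sign` approach above rather than a replacement for it:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((1000, 4)))

# Rows that contain at least one value beyond +/-3
outlier_rows = frame[(frame.abs() > 3).any(axis=1)]

# clip caps every value to the closed interval [-3, 3]
capped = frame.clip(lower=-3, upper=3)

print(len(outlier_rows))
print(capped.describe())
```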


### Permutation and random sampling

In [114]:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

sampler = np.random.permutation(5)

sampler

array([3, 1, 2, 0, 4])

In [115]:
df.take(sampler)

Unnamed: 0,0,1,2,3
3,12,13,14,15
1,4,5,6,7
2,8,9,10,11
0,0,1,2,3
4,16,17,18,19


In [117]:
df.sample(3)

Unnamed: 0,0,1,2,3
4,16,17,18,19
1,4,5,6,7
2,8,9,10,11


In [118]:
df.sample(3, replace=True)

Unnamed: 0,0,1,2,3
3,12,13,14,15
3,12,13,14,15
2,8,9,10,11
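Because `sample` draws at random, repeated runs give different rows. A minimal sketch of making the draw reproducible with the `random_state` argument, and of shuffling a whole frame with `frac=1` (the seed value 42 is arbitrary):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# The same random_state yields the same rows on every call
first_draw = frame.sample(n=3, random_state=42)
second_draw = frame.sample(n=3, random_state=42)

# frac=1 returns all rows in random order, i.e. a shuffled copy
shuffled = frame.sample(frac=1, random_state=42)

print(first_draw)
```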


### Dummy Variables

Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a “dummy” or “indicator” matrix. If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns containing all 1s and 0s.
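For a column that holds a single category per row, `pd.get_dummies` builds this indicator matrix directly. A small sketch on a hypothetical frame whose `key` column has k = 3 distinct values:

```python
import pandas as pd

frame = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                      'data1': range(6)})

# One indicator column per distinct value of 'key' (k = 3 here);
# dtype=int requests 0/1 integers rather than booleans
dummies = pd.get_dummies(frame['key'], prefix='key', dtype=int)

print(dummies)
```

Each row has exactly one 1, in the column of its category. A row that belongs to several categories at once, as in the movie genres example, needs a little more work.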

In [120]:
mnames = ['movie_id', 'title', 'genres']
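# A multi-character separator is treated as a regex and needs the Python parsing engine
movies = pd.read_table('../datasets/movielens/movies.dat', sep='::', header=None, names=mnames, engine='python')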

  


In [121]:
movies.shape

(3883, 3)

In [122]:
movies.head()

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [123]:
all_genres = []

for x in movies.genres:
    all_genres.extend(x.split('|'))

genres = pd.unique(all_genres)

In [124]:
genres

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)

In [125]:
zero_matrix = np.zeros((len(movies), len(genres)))

dummies = pd.DataFrame(zero_matrix, columns=genres)

In [126]:
dummies

Unnamed: 0,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [128]:
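from tqdm import tqdm_notebook

# For each movie, set the indicator columns of its genres to 1
for i, gen in tqdm_notebook(enumerate(movies.genres)):
    # Integer positions of the columns matching this movie's genres
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1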





In [131]:
dummies = dummies.astype(int)

In [134]:
df_movies = movies.join(dummies.add_prefix('Genre_'))

In [135]:
df_movies.head()

Unnamed: 0,movie_id,title,genres,Genre_Animation,Genre_Children's,Genre_Comedy,Genre_Adventure,Genre_Fantasy,Genre_Romance,Genre_Drama,...,Genre_Crime,Genre_Thriller,Genre_Horror,Genre_Sci-Fi,Genre_Documentary,Genre_War,Genre_Musical,Genre_Mystery,Genre_Film-Noir,Genre_Western
0,1,Toy Story (1995),Animation|Children's|Comedy,1,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),Adventure|Children's|Fantasy,0,1,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),Comedy|Romance,0,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,4,Waiting to Exhale (1995),Comedy|Drama,0,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II (1995),Comedy,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
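For delimiter-separated strings like the genres column, pandas also offers the string method `str.get_dummies`, which can stand in for the explicit loop. The sketch below uses a three-row stand-in table (`toy_movies`, hypothetical) instead of the full MovieLens file:

```python
import pandas as pd

# A tiny stand-in for the movies table
toy_movies = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               'Comedy|Romance'],
})

# Split each string on '|' and build one 0/1 column per distinct genre
genre_dummies = toy_movies['genres'].str.get_dummies('|')

toy_with_dummies = toy_movies.join(genre_dummies.add_prefix('Genre_'))
print(toy_with_dummies.head())
```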


This is also called one-hot encoding of categorical variables. A useful recipe is to combine one-hot encoding with binning.

In [136]:
np.random.seed(12345)

values = np.random.rand(10)

values

array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
       0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])

In [137]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

In [138]:
pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,0,0,0,0,1
1,0,1,0,0,0
2,1,0,0,0,0
3,0,1,0,0,0
4,0,0,1,0,0
5,0,0,1,0,0
6,0,0,0,0,1
7,0,0,0,1,0
8,0,0,0,1,0
9,0,0,0,1,0


## String manipulation

In [142]:
val = 'a,b,  guido'

val.split(',')

['a', 'b', '  guido']

split is often combined with strip to trim whitespace (including line breaks):

In [143]:
pieces = [x.strip() for x in val.split(',')]

pieces

['a', 'b', 'guido']

These substrings could be concatenated together with a two-colon delimiter using addition:

In [144]:
first, second, third = pieces

first + '::' + second + '::' + third

'a::b::guido'
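A faster and more idiomatic way to build the two-colon string is to pass the list to the `join` method of the delimiter. As a short sketch:

```python
val = 'a,b,  guido'
pieces = [x.strip() for x in val.split(',')]

# join inserts the delimiter between every pair of neighbouring items
joined = '::'.join(pieces)
print(joined)  # a::b::guido
```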

### Regular expressions

Regular expressions provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called a regex, is a string formed according to the regular expression language. Python’s built-in re module is responsible for applying regular expressions to strings; I’ll give a number of examples of its use here.

The re module functions fall into three categories: pattern matching, substitution, and splitting. Naturally these are all related; a regex describes a pattern to locate in the text, which can then be used for many purposes.
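The three categories can be sketched with one pattern, `\s+` (one or more whitespace characters), applied to a sample string. The variable names are illustrative:

```python
import re

text = "foo    bar\t baz  \tqux"

# Pattern matching: find the first run of whitespace
first_gap = re.search(r'\s+', text)
print(first_gap.start(), first_gap.end())

# Substitution: collapse every run of whitespace into a single space
collapsed = re.sub(r'\s+', ' ', text)
print(collapsed)

# Splitting: break the text apart wherever whitespace occurs
parts = re.split(r'\s+', text)
print(parts)
```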

#### Basic Patterns: Ordinary Characters

Many basic patterns can be handled with ordinary characters alone. Ordinary characters are the simplest regular expressions: they match themselves exactly and have no special meaning in regular expression syntax.

Examples are 'A', 'a', 'X', '5'.

Ordinary characters can be used to perform simple exact matches:

In [191]:
import re
pattern = r"Cookie"
sequence = "Cookie"
if re.match(pattern, sequence):
    print("Match!")
else:
    print("Not a match!")

Match!
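Note that `re.match` only tests for a match at the beginning of the string, whereas `re.search` scans the whole string. A minimal sketch of the difference, using made-up sample strings:

```python
import re

pattern = r"Cookie"

# re.match only succeeds if the pattern matches at the start of the string
at_start = re.match(pattern, "Cookie jar")
not_at_start = re.match(pattern, "A Cookie jar")

# re.search scans the whole string for the first match
found = re.search(pattern, "A Cookie jar")

print(at_start is not None, not_at_start is not None, found is not None)
```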


Special characters do not match themselves literally; instead, they have a special meaning when used in a regular expression.

The most widely used special characters are:

. - A period. Matches any single character except a newline character.

In [192]:
re.search(r'Co.k.e', 'Cookie').group()

'Cookie'

\w - Lowercase w. Matches any single letter, digit, or underscore.

\W - Uppercase W. Matches any single character that \w does not match.

In [194]:
re.search(r'C\Wke', 'C@ke').group()

'C@ke'

\s - Lowercase s. Matches a single whitespace character such as a space, newline, tab, or carriage return.

\S - Uppercase S. Matches any single character that \s does not match.
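These character classes are usually combined with the `+` quantifier (one or more repetitions). The sketch below runs `re.findall` with three of them on a made-up sample string:

```python
import re

sample = "data_1 cleaning,\tstep-2"

# \w+ : runs of letters, digits, or underscores
words = re.findall(r'\w+', sample)

# \S+ : runs of anything that is not whitespace
chunks = re.findall(r'\S+', sample)

# \s : each single whitespace character (a space and a tab here)
spaces = re.findall(r'\s', sample)

print(words)
print(chunks)
print(spaces)
```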

[Tutorial in python regular expressions](https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial)

In [145]:
import re

In [146]:
text = "foo    bar\t baz  \tqux"

re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

When you call re.split(r'\s+', text), the regular expression is first compiled, and then its split method is called on the passed text. The r prefix makes the pattern a raw string, so that Python does not interpret the backslash as an escape character. You can compile the regex yourself with re.compile, forming a reusable regex object:


In [147]:
regex = re.compile(r'\s+')

In [148]:
regex.split(text)

['foo', 'bar', 'baz', 'qux']

If, instead, you wanted to get a list of all patterns matching the regex, you can use the findall method:

In [195]:
regex.findall(text)

['    ', '\t ', '  \t']

![Fig](imgs/re_001.png)

### Vectorized string functions in pandas

In [151]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com', 'Rob': 'rob@gmail.com', 'Wes': np.nan}

In [155]:
data = pd.DataFrame(data, index=[0])

In [156]:
data

Unnamed: 0,Dave,Steve,Rob,Wes
0,dave@google.com,steve@gmail.com,rob@gmail.com,


In [157]:
data.isnull()

Unnamed: 0,Dave,Steve,Rob,Wes
0,False,False,False,True


In [159]:
data["Dave"].str.contains('gmail')

0    False
Name: Dave, dtype: bool

In [188]:
data["Steve"].str.contains('gmail')

0    True
Name: Steve, dtype: bool

In [189]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

In [190]:
data["Steve"].str.findall(pattern, flags=re.IGNORECASE)

0    [(steve, gmail, com)]
Name: Steve, dtype: object
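Vectorized string methods are often easier to use when the addresses are stored as one Series rather than as columns of a single-row DataFrame. The sketch below assumes that layout (the `emails` Series and the column names are illustrative) and uses `str.extract`, which returns the capture groups of the pattern as DataFrame columns:

```python
import re

import numpy as np
import pandas as pd

emails = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                    'Rob': 'rob@gmail.com', 'Wes': np.nan})

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# extract returns one column per capture group; missing input gives NaN
parts = emails.str.extract(pattern, flags=re.IGNORECASE)
parts.columns = ['user', 'domain', 'suffix']  # illustrative column names

print(parts)
```

The string methods skip over the missing value for Wes instead of raising an error.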