## Data cleaning and preparation



-   We have looked at how some functions in pandas handle missing data
-   We have also cleaned up some malformed data
-   We will now look at some of the tools for dealing with
    -   missing data
    -   duplicate data
    -   string manipulation
    -   other data transformations



### Handling missing data



We can use either `np.nan` or `None` to represent missing data; pandas treats both as missing



In [1]:
import numpy as np
import pandas as pd
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado', None])
string_data.isnull()

0    False
1    False
2     True
3    False
4     True
dtype: bool
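
As a small sketch of how the two markers behave (the variable names `numeric` and `strings` are illustrative, not from the original cells): in a numeric Series `None` is converted to `NaN`, and in a Series of strings both markers are reported as missing. `isna` is an alias of `isnull`.

```python
import numpy as np
import pandas as pd

# In a numeric Series, None is converted to NaN and the dtype is float64
numeric = pd.Series([1.0, None, 3.0])

# In a Series of strings, both np.nan and None count as missing
strings = pd.Series(['aardvark', np.nan, None])

print(numeric.dtype)
print(numeric.isna().tolist())
print(strings.isna().tolist())
```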

### NA handling methods



![img](images/na_methods.png)



### Filter out missing data



Find two different ways to filter out the missing data



In [3]:
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])

In [6]:
#data.dropna()
#data[~data.isnull()]
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

Look at the documentation of `dropna` and filter out the missing data in this DataFrame



In [11]:
data = pd.DataFrame([[1., 6.5, 3., NA], [1., NA, NA, NA], [NA, NA, NA, NA], [NA, 6.5, 3., NA]])
?data.dropna

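Try to remove only rows or columns whose elements are all missing (`how='all'`). Alternatively, `thresh=n` keeps only the rows with at least `n` non-missing values; recent pandas versions do not accept `how` and `thresh` in the same call



In [15]:
#data.dropna(how='all')
data.dropna(thresh=2)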

Unnamed: 0,0,1,2,3
0,1.0,6.5,3.0,
3,,6.5,3.0,
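
The options of `dropna` can be compared side by side on the same DataFrame. This is a minimal sketch; the names `rows_kept`, `cols_kept` and `at_least_two` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

NA = np.nan
data = pd.DataFrame([[1., 6.5, 3., NA], [1., NA, NA, NA],
                     [NA, NA, NA, NA], [NA, 6.5, 3., NA]])

rows_kept = data.dropna(how='all')          # drops row 2, the only all-NA row
cols_kept = data.dropna(how='all', axis=1)  # drops column 3, the only all-NA column
at_least_two = data.dropna(thresh=2)        # keeps rows with at least 2 non-NA values

print(rows_kept.index.tolist())
print(cols_kept.columns.tolist())
print(at_least_two.index.tolist())
```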


### Filling in missing data



In [16]:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

Unnamed: 0,0,1,2
0,-1.083945,,
1,-1.345471,,
2,1.091692,,0.655203
3,-0.670729,,1.090486
4,0.41953,-0.5537,-0.515702
5,-1.147186,1.281315,0.462877
6,-0.195988,-0.67981,-0.614315


Fill in the missing values with zero



In [18]:
df.fillna(0)

Unnamed: 0,0,1,2
0,-1.083945,0.0,0.0
1,-1.345471,0.0,0.0
2,1.091692,0.0,0.655203
3,-0.670729,0.0,1.090486
4,0.41953,-0.5537,-0.515702
5,-1.147186,1.281315,0.462877
6,-0.195988,-0.67981,-0.614315


Now fill the missing data with 0.5 in column 1 and 0 in column 2



In [20]:
df.fillna({1:0.5, 2:0})

Unnamed: 0,0,1,2
0,-1.083945,0.5,0.0
1,-1.345471,0.5,0.0
2,1.091692,0.5,0.655203
3,-0.670729,0.5,1.090486
4,0.41953,-0.5537,-0.515702
5,-1.147186,1.281315,0.462877
6,-0.195988,-0.67981,-0.614315
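
`fillna` is not limited to constants. As a minimal sketch on a small made-up frame (the column names `a` and `b` are illustrative only), passing the Series of column means fills each column with its own mean. The original frame stays untouched, because `fillna` returns a new object.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 4.0, 8.0]})

# df.mean() is a Series indexed by column label, so each column gets its own fill value
filled = df.fillna(df.mean())

print(filled)
print(df.isna().sum().sum())  # the original still contains its missing values
```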


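The same interpolation methods used for reindexing can be used here. Fill up to 2 consecutive missing values in each row with the last non-missing value to their left (`axis=1`)



In [26]:
print(df)
# fillna(method='ffill') is deprecated in recent pandas versions; ffill is the equivalent call
df.ffill(limit=2, axis=1)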

          0         1         2
0 -1.083945       NaN       NaN
1 -1.345471       NaN       NaN
2  1.091692       NaN  0.655203
3 -0.670729       NaN  1.090486
4  0.419530 -0.553700 -0.515702
5 -1.147186  1.281315  0.462877
6 -0.195988 -0.679810 -0.614315


Unnamed: 0,0,1,2
0,-1.083945,-1.083945,-1.083945
1,-1.345471,-1.345471,-1.345471
2,1.091692,1.091692,0.655203
3,-0.670729,-0.670729,1.090486
4,0.41953,-0.5537,-0.515702
5,-1.147186,1.281315,0.462877
6,-0.195988,-0.67981,-0.614315
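
The direction of the fill depends on `axis`. The following sketch uses a small made-up frame (not the random data above) to contrast filling down the columns with filling across the rows, and to show how `limit` stops the propagation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan, np.nan, np.nan],
                   [np.nan, 2.0, np.nan, 5.0],
                   [np.nan, np.nan, 3.0, np.nan]])

down = df.ffill(limit=1)            # default axis=0: propagate values down each column
across = df.ffill(limit=2, axis=1)  # axis=1: propagate values left to right along each row

print(down)
print(across)
```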


## Data transformation



We have already seen how we can transform data using a function or mapping!



### Removing duplicates



Duplicate rows can be detected with `duplicated` and removed with `drop_duplicates`



In [28]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'], 'k2': [1, 1, 2, 3, 3, 4, 4]})
#data

In [29]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [30]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


Now remove the duplicates based only on the `k1` column



In [32]:
print(data)
data.drop_duplicates('k1')

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4


Unnamed: 0,k1,k2
0,one,1
1,two,1
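
By default the first row of each group of duplicates is kept. A short sketch of the `keep` parameter on the same data (the names `first` and `last` are illustrative):

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

first = data.drop_duplicates(subset=['k1'])              # default keep='first'
last = data.drop_duplicates(subset=['k1'], keep='last')  # keep the last row per k1 value

print(first.index.tolist())
print(last.index.tolist())
```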


### Replacing values



In [34]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
#data

We can replace the sentinel value -999 with NA



In [35]:
data.replace(-999, NA)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [36]:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
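
The pair of lists can also be written as a dict that maps each old value to its replacement, which may be easier to read when there are many sentinels. A minimal sketch (the name `cleaned` is illustrative):

```python
import numpy as np
import pandas as pd

data = pd.Series([1., -999., 2., -999., -1000., 3.])

# Each key is a value to look for, each dict value is its replacement
cleaned = data.replace({-999: np.nan, -1000: 0})

print(cleaned)
```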

### Renaming index labels



We've seen how we can transform the row and/or column labels, but here's another way



In [37]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                      index=['ohio', 'colorado', 'new york'],
                      columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
ohio,0,1,2,3
colorado,4,5,6,7
new york,8,9,10,11


In [38]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [40]:
# rename returns a new DataFrame, so data itself is unchanged unless the result is assigned
data.rename(index={'ohio': 'Indiana'}, columns={'three': 'foo'})
data

Unnamed: 0,one,two,three,four
ohio,0,1,2,3
colorado,4,5,6,7
new york,8,9,10,11
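
To keep the renamed labels, assign the result to a name. A minimal sketch (the name `renamed` is illustrative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['ohio', 'colorado', 'new york'],
                    columns=['one', 'two', 'three', 'four'])

# Labels missing from the mappings are left as they are
renamed = data.rename(index={'ohio': 'Indiana'}, columns={'three': 'foo'})

print(renamed.index.tolist())
print(renamed.columns.tolist())
print(data.index.tolist())  # the original labels are still in place
```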


## Detecting and filtering outliers



Filtering or transforming outliers is largely a matter of applying array operations



In [41]:
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.024187,0.061139,0.029868,-0.064112
std,0.997065,1.019774,0.982827,1.00844
min,-3.249334,-3.893956,-2.825533,-3.043383
25%,-0.707042,-0.627301,-0.649955,-0.770615
50%,-0.019374,0.067603,0.021833,-0.07513
75%,0.664962,0.727022,0.701222,0.614738
max,2.61602,3.200973,3.794216,4.349922


Let's mask the values that are larger than 3 or smaller than -3 (they become `NaN`) and look at the smallest remaining value in each column



In [46]:
data[np.abs(data) < 3].min()

0   -2.900370
1   -2.819559
2   -2.825533
3   -2.981231
dtype: float64
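
To look at the outliers themselves rather than at what remains, one option is to select every row that contains at least one value beyond the bounds. This is a sketch on freshly generated data; the fixed seed is an assumption made only for reproducibility, so the values differ from the ones above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Boolean DataFrame marking every value whose absolute value exceeds 3
is_outlier = np.abs(data) > 3

# Rows that contain at least one such value
outlier_rows = data[is_outlier.any(axis=1)]
print(outlier_rows)
```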

Now use the `np.sign` function to cap values outside the interval -3 to 3
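
One possible solution, sketched on freshly generated data (the seed and the name `capped` are assumptions for illustration): `np.sign` returns -1, 0 or 1, so `np.sign(data) * 3` gives the bound with the matching sign for every value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
data = pd.DataFrame(rng.standard_normal((1000, 4)))

capped = data.copy()
# Overwrite only the values outside [-3, 3] with -3 or 3, depending on their sign
capped[np.abs(capped) > 3] = np.sign(capped) * 3

print(capped.describe())
```

For this particular task `data.clip(-3, 3)` produces the same values.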



## String manipulation



-   We have looked at string operations, but not *regular expressions*!
-   Regular expressions provide a flexible way to search or match (often more complex) string patterns in text
-   Suppose we wanted to split a string with a variable number of whitespace characters



In [1]:
import re
text = "foo     bar\t baz \tqux"
text

'foo     bar\t baz \tqux'

In [7]:
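re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

When we call `re.split`, the pattern is first **compiled** and then its `split` method is called on the text. We can also compile the pattern ourselves and reuse the resulting regex object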



In [9]:
regex = re.compile(r'\s+')
regex.split(text)

['foo', 'bar', 'baz', 'qux']

In [10]:
regex.findall(text)

['     ', '\t ', ' \t']

Creating a regex object with `re.compile` is highly recommended if you intend to
apply the same expression to many strings.
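
A minimal sketch of that reuse, with made-up input strings (`lines` and `whitespace` are illustrative names): the pattern is compiled once and then applied to every string.

```python
import re

# Compile once, then reuse the regex object for every string
whitespace = re.compile(r'\s+')

lines = ['foo     bar', 'baz \tqux', 'a\nb  c']
tokens = [whitespace.split(line) for line in lines]

print(tokens)
```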



### A less trivial example



In [26]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)
regex

re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', re.IGNORECASE|re.UNICODE)

`findall` returns all matches



In [27]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

`search` returns a match object for the first match anywhere in the string, while `match` only succeeds if the pattern matches at the start of the string; both return `None` when there is no match
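
The difference can be sketched with the email pattern on a shortened copy of the text (the names `found` and `at_start` are illustrative):

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

found = regex.search(text)    # first address anywhere in the text
at_start = regex.match(text)  # None, because the text starts with "Dave ", not an address

print(found.group(), found.span())
print(text[found.start():found.end()])
print(at_start)
```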

Try it out!



In [24]:
# match succeeds here because the string starts with 'fo'
re.compile('fo').match('foo     bar\t baz \tqux')

<re.Match object; span=(0, 2), match='fo'>

`sub` will return a new string with occurrences of the pattern replaced



In [28]:
print(regex.sub('REDACTED', text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED



-   Suppose you wanted to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix
-   To do this, put parentheses around the parts of the pattern to segment



In [29]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

`sub` also has access to the groups matched



In [30]:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
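
Groups can also be given names with the `(?P<name>...)` syntax, so that a match can be turned into a dict. This is a sketch that goes slightly beyond the cells above; the group names and the address `wesm@bright.net` are chosen here for illustration.

```python
import re

# Same pattern as before, but each group carries a name
pattern = (r'(?P<username>[A-Z0-9._%+-]+)@'
           r'(?P<domain>[A-Z0-9.-]+)\.'
           r'(?P<suffix>[A-Z]{2,4})')
regex = re.compile(pattern, flags=re.IGNORECASE)

m = regex.match('wesm@bright.net')
parts = m.groupdict()  # maps each group name to the text it matched

print(m.groups())
print(parts)
```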

