Here, you'll dive into some of the grittier aspects of data cleaning. You'll learn about string manipulation and pattern matching to deal with unstructured data, and then explore techniques to deal with missing or duplicate data. You'll also learn the valuable skill of programmatically checking your data for consistency, which will give you confidence that your code is running correctly and that the results of your analysis are reliable!

### Prepare and clean data

In [2]:
ls

01Exploring_your_data.ipynb*             literacy_birth_rate.csv*
02Tidying_data_for_analysis.ipynb*       mp_data.csv*
03Combining_data_for_analysis.ipynb*     nyc_uber_2014.csv*
04Cleaning_data_for_analysis.ipynb*      state_cod.csv*
05Case_study.ipynb*                      state_pop.csv*
airquality.csv*                          tb.csv*
concat2.PNG*                             tiddy.csv*
concat.PNG*                              tips.csv*
dob_job_application_filings_subset.csv*  w1.csv*
ebola.csv*                               w2.csv*
gapminder.csv*                           weather_tidy.csv*


In [5]:
import pandas as pd
df = pd.read_csv('treat.csv')
df

Unnamed: 0,name,sex,treatment a,treatment b
0,Daniel,male,-,42
1,Jhon,male,12,31
2,Jane,female,24,27


### Datatypes

- There may be times we want to convert from one type to another
- Numeric columns can be strings, or vice versa

In [7]:
df.dtypes

name           object
sex            object
treatment a    object
treatment b     int64
dtype: object

### Converting data types

In [8]:
df['treatment b'] = df['treatment b'].astype(str)

df['sex'] = df['sex'].astype('category')

df.dtypes

name             object
sex            category
treatment a      object
treatment b      object
dtype: object

### Categorical data
- Converting categorical data to the ‘category’ dtype:
    - Can make the DataFrame smaller in memory
    - Lets other Python libraries treat the column as categorical during analysis
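The memory claim above can be checked directly. The following is a minimal sketch on a made-up column with many repeated labels (the Series `labels` is hypothetical, not part of the course data); actual savings depend on how many distinct values the column holds.

```python
import pandas as pd

# Hypothetical column: 10,000 values but only two distinct labels
labels = pd.Series(['male', 'female'] * 5000)

# deep=True counts the memory held by the string values themselves
as_strings = labels.memory_usage(deep=True)
as_category = labels.astype('category').memory_usage(deep=True)

print('strings :', as_strings, 'bytes')
print('category:', as_category, 'bytes')
```

The category version stores each distinct label once plus a small integer code per row, which is why it tends to be smaller when labels repeat often.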

### Cleaning data
- Numeric data loaded as a string

In [10]:
df

Unnamed: 0,name,sex,treatment a,treatment b
0,Daniel,male,-,42
1,Jhon,male,12,31
2,Jane,female,24,27


In [11]:
df['treatment a'] = pd.to_numeric(df['treatment a'],
                                  errors='coerce')

df.dtypes

name             object
sex            category
treatment a     float64
treatment b      object
dtype: object
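Because `errors='coerce'` silently turns anything unparseable into NaN, it can help to look at which raw values were lost. This is a small sketch under the assumption that the column looks like the `treatment a` column above; the names `raw`, `converted`, and `failed` are illustrative only.

```python
import pandas as pd

# Same values as the 'treatment a' column, held as strings
raw = pd.Series(['-', '12', '24'], name='treatment a')

# Unparseable entries become NaN instead of raising an error
converted = pd.to_numeric(raw, errors='coerce')

# Entries that were present in the raw data but could not be converted
failed = raw[converted.isna() & raw.notna()]
print(failed.tolist())  # ['-']
```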

---
# Let’s practice!

# Using regular expressions to clean strings

### String manipulation
- Much of data cleaning involves string manipulation
- Most of the world’s data is unstructured text
- String manipulation is also needed to make datasets consistent with one another

### Validate values
- 17
- $17
- $17.89
- $17.895

### String manipulation
- Many built-in and external libraries
- ‘re’ library for regular expressions
    - A formal way of specifying a pattern
    - Sequence of characters
- Pattern matching
    - Similar to globbing

### Example match

![re1](re1.PNG)

### Using regular expressions
- Compile the pattern
- Use the compiled pattern to match values
- This lets us use the pattern over and over again
- Useful since we want to match values down a column of values

In [13]:
import re
pattern = re.compile(r'\$\d*\.\d{2}')  # raw string keeps the backslashes literal
result = pattern.match('$17.89')
bool(result)

True
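As a rough illustration, the same pattern can be run over the four candidate values from the "Validate values" list. One thing worth noticing is that `match` only anchors at the start of the string, so `$17.895` still passes; `fullmatch` (or adding `^` and `$` to the pattern) requires the whole string to fit. The variable names below are illustrative.

```python
import re

pattern = re.compile(r'\$\d*\.\d{2}')
values = ['17', '$17', '$17.89', '$17.895']

# match() checks only that the string STARTS with the pattern
prefix_ok = [bool(pattern.match(v)) for v in values]

# fullmatch() requires the ENTIRE string to fit the pattern
whole_ok = [bool(pattern.fullmatch(v)) for v in values]

for v, p, w in zip(values, prefix_ok, whole_ok):
    print(f'{v!r:10} match={p!s:5} fullmatch={w}')
```

Only `$17.89` satisfies `fullmatch`, which is usually the behaviour wanted when validating a column of monetary values.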

---

# Using functions to clean data

### Complex cleaning
- Cleaning often requires multiple steps
    - Extract number from string
    - Perform transformation on extracted number
- Wrap the steps in a Python function
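The two steps listed above might be wrapped in one function along these lines. This is only a sketch: the function name `dollars_to_cents`, the regular expression, and the choice of transformation (dollars to whole cents) are assumptions made for illustration.

```python
import re
import numpy as np

# One or more digits, optionally followed by a decimal part
number = re.compile(r'\d+(?:\.\d+)?')

def dollars_to_cents(text):
    """Extract the first number in `text` and convert it from dollars to cents."""
    found = number.search(str(text))
    if found is None:
        return np.nan                 # nothing numeric to extract
    dollars = float(found.group())    # step 1: extract the number
    return round(dollars * 100)       # step 2: transform it

print(dollars_to_cents('$17.89'))
print(dollars_to_cents('n/a'))
```

Keeping both steps in one function means it can later be passed to `.apply()` without repeating the logic.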

### Apply

In [16]:
import numpy as np
df

Unnamed: 0,name,sex,treatment a,treatment b
0,Daniel,male,,42
1,Jhon,male,12.0,31
2,Jane,female,24.0,27


In [26]:
df = pd.read_csv('tiddy_done.csv',index_col='name')
df

name,treatment a,treatment b
Daniel,18,42
Jhon,12,31
Jane,24,27


In [27]:
df.apply(np.mean, axis=0)

treatment a    18.000000
treatment b    33.333333
dtype: float64

In [28]:
df.apply(np.mean, axis=1)

name
Daniel    30.0
Jhon      21.5
Jane      25.5
dtype: float64
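For readers without the CSV file, the same table can be built inline. The sketch below reproduces the two calls above and adds one custom function (the `value_range` lambda is an illustrative addition, not from the original notebook).

```python
import numpy as np
import pandas as pd

# Same values as the table above, indexed by name
df = pd.DataFrame(
    {'treatment a': [18, 12, 24], 'treatment b': [42, 31, 27]},
    index=pd.Index(['Daniel', 'Jhon', 'Jane'], name='name'),
)

col_means = df.apply(np.mean, axis=0)   # one value per column
row_means = df.apply(np.mean, axis=1)   # one value per row

# Any function that accepts a Series works, e.g. the spread of each column
value_range = df.apply(lambda col: col.max() - col.min(), axis=0)

print(col_means, row_means, value_range, sep='\n\n')
```

`axis=0` hands each column to the function, while `axis=1` hands it each row.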

### Applying functions

In [45]:
df = pd.read_csv('dob_job_application_filings_subset.csv')
df = df[['Job #', 'Doc #', 'Borough', 'Initial Cost', 'Total Est. Fee']]
df_subset = df.head()
df_subset


Unnamed: 0,Job #,Doc #,Borough,Initial Cost,Total Est. Fee
0,121577873,2,MANHATTAN,$75000.00,$986.00
1,520129502,1,STATEN ISLAND,$0.00,$1144.00
2,121601560,1,MANHATTAN,$30000.00,$522.50
3,121601203,1,MANHATTAN,$1500.00,$225.00
4,121601338,1,MANHATTAN,$19500.00,$389.50


In [46]:
df_subset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
Job #             5 non-null int64
Doc #             5 non-null int64
Borough           5 non-null object
Initial Cost      5 non-null object
Total Est. Fee    5 non-null object
dtypes: int64(2), object(3)
memory usage: 280.0+ bytes


### Write the regular expression

In [47]:
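import re
from numpy import nan as NaN  # numpy.NaN was removed in NumPy 2.0; nan is the supported name

# Raw string keeps the backslashes literal; ^ and $ anchor the whole value
pattern = re.compile(r'^\$\d*\.\d{2}$')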

### Writing a function

In [48]:
def my_function(input1, input2):
    # function body
    return value

### Write the function

In [49]:
def diff_money(row, pattern):
    icost = row['Initial Cost']
    tef = row['Total Est. Fee']
    if bool(pattern.match(icost)) and bool(pattern.match(tef)):
        icost = icost.replace("$", "")
        tef = tef.replace("$", "")
        icost = float(icost)
        tef = float(tef)
        return icost - tef
    else:
        return NaN

In [50]:
df_subset['diff'] = df_subset.apply(diff_money,
                                    axis=1,
                                    pattern=pattern)

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [51]:
df_subset

Unnamed: 0,Job #,Doc #,Borough,Initial Cost,Total Est. Fee,diff
0,121577873,2,MANHATTAN,$75000.00,$986.00,74014.0
1,520129502,1,STATEN ISLAND,$0.00,$1144.00,-1144.0
2,121601560,1,MANHATTAN,$30000.00,$522.50,29477.5
3,121601203,1,MANHATTAN,$1500.00,$225.00,1275.0
4,121601338,1,MANHATTAN,$19500.00,$389.50,19110.5
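A row-wise `apply` is not the only way to get this column. As an alternative sketch (not the approach used in the notebook), the dollar signs can be stripped from whole columns at once and the result converted with `pd.to_numeric`. The helper name `to_dollars` is hypothetical, and unlike `diff_money` it does not validate against the regular expression: anything that fails to parse simply becomes NaN.

```python
import pandas as pd

# First three rows of the subset above, built inline so the example is self-contained
df_subset = pd.DataFrame({
    'Initial Cost': ['$75000.00', '$0.00', '$30000.00'],
    'Total Est. Fee': ['$986.00', '$1144.00', '$522.50'],
})

def to_dollars(col):
    """Strip the literal '$' and convert to float; unparseable values become NaN."""
    return pd.to_numeric(col.str.replace('$', '', regex=False), errors='coerce')

df_subset['diff'] = (to_dollars(df_subset['Initial Cost'])
                     - to_dollars(df_subset['Total Est. Fee']))
print(df_subset)
```

When the subset is taken from a larger DataFrame, for example with `df.head()`, calling `.copy()` on it before assigning a new column avoids the `SettingWithCopyWarning`.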


---

# Duplicate and missing data

### Duplicate data
- Can skew results
- ‘.drop_duplicates()’ method

In [52]:
df = pd.read_csv('treat_duplicate.csv')
df

Unnamed: 0,name,sex,treatment a,treatment b
0,Daniel,male,-,42
1,Jhon,male,12,31
2,Jane,female,24,27
3,Daniel,male,-,42


### Drop duplicates

In [53]:
df = df.drop_duplicates()
df

Unnamed: 0,name,sex,treatment a,treatment b
0,Daniel,male,-,42
1,Jhon,male,12,31
2,Jane,female,24,27
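Before dropping anything, it can be useful to see which rows pandas considers duplicates. The sketch below rebuilds the table inline and also shows the `subset` and `keep` parameters of `drop_duplicates`; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Daniel', 'Jhon', 'Jane', 'Daniel'],
    'sex': ['male', 'male', 'female', 'male'],
    'treatment a': ['-', '12', '24', '-'],
    'treatment b': [42, 31, 27, 42],
})

# True for every row that repeats an earlier row exactly
dup_mask = df.duplicated()
n_dups = int(dup_mask.sum())
print(n_dups, 'duplicate row(s)')

# Compare on selected columns only, keeping the last occurrence instead of the first
by_name = df.drop_duplicates(subset=['name'], keep='last')
print(by_name)
```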


### Missing data
- Leave as-is
- Drop them
- Fill missing value

In [64]:
tips_nan = pd.read_csv('tips_nan.csv')
tips_nan.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,,Male,No,Sun,Dinner,2
4,24.59,3.61,,,Sun,,4


### Count missing values

In [65]:
tips_nan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    243 non-null float64
tip           243 non-null float64
sex           243 non-null object
smoker        243 non-null object
day           244 non-null object
time          243 non-null object
size          244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.4+ KB
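`.info()` reports non-null counts, so the number of missing values has to be worked out by subtraction. A more direct count is possible with `isna().sum()`. The sketch below uses three columns from the first five rows shown above, built inline as an assumption-free stand-in for the CSV file.

```python
import numpy as np
import pandas as pd

tips_nan = pd.DataFrame({
    'total_bill': [16.99, np.nan, 21.01, 23.68, 24.59],
    'tip': [1.01, 1.66, 3.50, np.nan, 3.61],
    'sex': ['Female', 'Male', 'Male', 'Male', np.nan],
})

# Missing values per column
missing_per_column = tips_nan.isna().sum()
print(missing_per_column)

# Number of rows with at least one missing value
rows_with_missing = int(tips_nan.isna().any(axis=1).sum())
print(rows_with_missing)
```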


### Drop missing values

In [66]:
tips_dropped = tips_nan.dropna()
tips_dropped.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 241 entries, 0 to 243
Data columns (total 7 columns):
total_bill    241 non-null float64
tip           241 non-null float64
sex           241 non-null object
smoker        241 non-null object
day           241 non-null object
time          241 non-null object
size          241 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 15.1+ KB
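By default `dropna()` removes a row if any column is missing, which can discard more data than intended. A small sketch of the `subset` parameter, again on an inline stand-in for the first rows of the tips data:

```python
import numpy as np
import pandas as pd

tips = pd.DataFrame({
    'total_bill': [16.99, np.nan, 21.01, 23.68, 24.59],
    'tip': [1.01, 1.66, 3.50, np.nan, 3.61],
    'sex': ['Female', 'Male', 'Male', 'Male', np.nan],
})

# Default: drop rows with a missing value in ANY column
dropped_any = tips.dropna()

# Only drop rows where 'tip' itself is missing
dropped_tip = tips.dropna(subset=['tip'])

print(len(tips), len(dropped_any), len(dropped_tip))
```

Restricting the check to the columns an analysis actually uses keeps more rows available.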


### Fill missing values with .fillna()
- Fill with provided value
- Use a summary statistic

In [67]:
tips_nan['sex'] = tips_nan['sex'].fillna('missing')

tips_nan[['total_bill', 'size']] = tips_nan[['total_bill', 'size']].fillna(0)

tips_nan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           243 non-null float64
sex           244 non-null object
smoker        243 non-null object
day           244 non-null object
time          243 non-null object
size          244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.4+ KB


### Fill missing values with a summary statistic
- Be careful when using summary statistics to fill
- Make sure the value you are filling in makes sense
- Median is a better statistic in the presence of outliers

In [68]:
mean_value = tips_nan['tip'].mean()
mean_value

2.9969958847736624

In [69]:
tips_nan['tip'] = tips_nan['tip'].fillna(mean_value)

In [70]:
tips_nan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null object
smoker        243 non-null object
day           244 non-null object
time          243 non-null object
size          244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.4+ KB
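The point about outliers can be seen on a tiny made-up column (the values below are hypothetical and chosen so that one tip is far larger than the rest).

```python
import numpy as np
import pandas as pd

# Hypothetical tips with one extreme value and one missing value
tip = pd.Series([1.0, 2.0, 3.0, np.nan, 50.0])

mean_value = tip.mean()       # pulled upward by the 50.0 outlier
median_value = tip.median()   # middle of the observed values

mean_fill = tip.fillna(mean_value)
median_fill = tip.fillna(median_value)

print('mean  :', mean_value)
print('median:', median_value)
```

Here the mean is 14.0 while the median is 2.5, so the median-filled value sits much closer to a typical tip.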


---

# Testing with asserts

### Assert statements
- Checking programmatically vs. visually
- If we drop or fill NaNs, we expect 0 missing values
- We can write an assert statement to verify this
- We can detect problems early, before they affect later steps
- This gives us confidence that our code is running correctly

### Asserts

In [71]:
assert 1 == 1

In [72]:
assert 1 == 2

AssertionError: 
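Applied to a DataFrame, the same idea might look like the sketch below. The column and its values are a made-up stand-in for the tips data, and the assertion messages are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'tip': [1.01, np.nan, 3.50]})

# Fill the missing value, then verify that nothing is missing any more
df['tip'] = df['tip'].fillna(df['tip'].mean())

# Each assert passes silently when the condition holds and raises AssertionError otherwise
assert df['tip'].notnull().all(), 'tip still has missing values'
assert (df['tip'] >= 0).all(), 'tip contains negative values'

print('all checks passed')
```

The optional message after the comma appears in the `AssertionError`, which makes a failing check easier to trace.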

---