Here, you'll dive into some of the grittier aspects of data cleaning. You'll learn about string manipulation and pattern matching to deal with unstructured data, and then explore techniques to deal with missing or duplicate data. You'll also learn the valuable skill of programmatically checking your data for consistency, which will give you confidence that your code is running correctly and that the results of your analysis are reliable!

### Prepare and clean data

In [1]:
ls

01Exploring_your_data.ipynb*             re1.PNG*
02Tidying_data_for_analysis.ipynb*       state_cod.csv*
03Combining_data_for_analysis.ipynb*     state_pop.csv*
04Cleaning_data_for_analysis.ipynb*      tb.csv*
05Case_study.ipynb*                      tiddy.csv*
airquality.csv*                          tiddy_done.csv*
concat2.PNG*                             tips.csv*
concat.PNG*                              tips_nan.csv*
dob_job_application_filings_subset.csv*  treat.csv*
ebola.csv*                               treat_duplicate.csv*
gapminder.csv*                           w1.csv*
literacy_birth_rate.csv*                 w2.csv*
mp_data.csv*

In [2]:
import pandas as pd
df = pd.read_csv('treat.csv')
df

Unnamed: 0,name,sex,treatment a,treatment b
0,Daniel,male,-,42
1,Jhon,male,12,31
2,Jane,female,24,27


### Datatypes

- There may be times we want to convert from one type to another
- Numeric columns can be strings, or vice versa

In [3]:
df.dtypes

name           object
sex            object
treatment a    object
treatment b     int64
dtype: object

### Converting data types

In [4]:
df['treatment b'] = df['treatment b'].astype(str)

df['sex'] = df['sex'].astype('category')

df.dtypes

name             object
sex            category
treatment a      object
treatment b      object
dtype: object

### Categorical data
- Converting categorical data to the ‘category’ dtype:
    - Can make the DataFrame smaller in memory
    - Lets other Python libraries recognize the column as categorical during analysis
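
As a rough illustration of the memory claim, the sketch below stores the same column once as plain strings and once as `category`, then compares their sizes. The column contents and the repeat count are made up for the example, and actual savings depend on how few distinct values a column holds.

```python
import pandas as pd

# Hypothetical column: two distinct labels repeated many times
sex = pd.Series(['male', 'female'] * 5000)

# deep=True counts the bytes used by the string values themselves
object_bytes = sex.memory_usage(deep=True)
category_bytes = sex.astype('category').memory_usage(deep=True)

print(object_bytes, category_bytes)
```

With only two distinct labels, the categorical version stores each label once plus a small integer code per row, which is why it tends to be much smaller.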

### Cleaning data
- Numeric data loaded as a string

In [5]:
df

Unnamed: 0,name,sex,treatment a,treatment b
0,Daniel,male,-,42
1,Jhon,male,12,31
2,Jane,female,24,27


In [6]:
df['treatment a'] = pd.to_numeric(df['treatment a'],
                                  errors='coerce')

df.dtypes

name             object
sex            category
treatment a     float64
treatment b      object
dtype: object
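
To make the effect of `errors='coerce'` concrete, here is a minimal sketch using the three `treatment a` values shown above, rebuilt inline so the snippet runs on its own. It also shows that the default behaviour raises an error on the `'-'` entry instead.

```python
import pandas as pd

raw = pd.Series(['-', '12', '24'], name='treatment a')

# Default behaviour (errors='raise'): the '-' entry cannot be parsed
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True

# errors='coerce' turns anything unparseable into NaN instead of raising
numeric = pd.to_numeric(raw, errors='coerce')

print(raised)                # True
print(numeric.dtype)         # float64
print(numeric.isna().sum())  # 1
```

The column becomes `float64` rather than `int64` because NaN is a floating-point value.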

---
# Let’s practice!

In [33]:
tips = pd.read_csv('tips.csv')
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null object
smoker        244 non-null object
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.4+ KB


In [34]:
# Convert the sex column to type 'category'
tips.sex = tips.sex.astype('category')

# Convert the smoker column to type 'category'
tips.smoker = tips.smoker.astype('category')

# Print the info of tips
print(tips.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(2), float64(2), int64(1), object(2)
memory usage: 10.3+ KB
None


In [35]:
# Convert 'total_bill' to a numeric dtype
tips['total_bill'] = pd.to_numeric(tips['total_bill'], errors='coerce')

# Convert 'tip' to a numeric dtype
tips['tip'] = pd.to_numeric(tips['tip'], errors='coerce')

# Print the info of tips
print(tips.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(2), float64(2), int64(1), object(2)
memory usage: 10.3+ KB
None


# Using regular expressions to clean strings

### String manipulation
- Much of data cleaning involves string manipulation
- Most of the world’s data is unstructured text
- Also have to do string manipulation to make datasets consistent with one another

### Validate values
- `17`
- `$17`
- `$17.89`
- `$17.895`

### String manipulation
- Many built-in and external libraries
- ‘re’ library for regular expressions
    - A formal way of specifying a pattern
    - Sequence of characters
- Pattern matching
    - Similar to globbing

- `\d*` : any digit, matched zero or more times
- `\$\d*` : a literal dollar sign (escaped as `\$`) followed by zero or more digits
- `\$\d*\.\d*` : a monetary value with a decimal point: the previous pattern extended with an escaped `.` and zero or more digits
- `\$\d*\.\d{2}` : same as above, but with exactly two digits after the decimal point
- `^\$\d*\.\d{2}$` : stricter still: `^` and `$` anchor the pattern to the start and end of the string, so nothing may come before or after the value
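
The difference between the last two patterns is easiest to see by running them against the four sample values from the "Validate values" list. This is a small sketch; the variable names `loose` and `strict` are chosen here for illustration.

```python
import re

values = ['17', '$17', '$17.89', '$17.895']

loose = re.compile(r'\$\d*\.\d{2}')      # no end anchor
strict = re.compile(r'^\$\d*\.\d{2}$')   # anchored at both ends

loose_hits = [v for v in values if loose.match(v)]
strict_hits = [v for v in values if strict.match(v)]

print(loose_hits)   # ['$17.89', '$17.895']
print(strict_hits)  # ['$17.89']
```

The unanchored pattern accepts `$17.895` because `match` only requires the pattern to fit at the start of the string; the trailing `5` is simply ignored.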




### Example match

![re1](re1.PNG)

### Using regular expressions
- Compile the pattern
- Use the compiled pattern to match values
- This lets us use the pattern over and over again
- Useful since we want to match values down a column of values

In [40]:
import re
pattern = re.compile(r'\$\d*\.\d{2}')  # raw string keeps the backslashes intact
result = pattern.match('$17.89')
bool(result)

True
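
Since the point of compiling is reuse down a column, a sketch of that step might look like the following. The `prices` Series is hypothetical sample data, and the anchored form of the pattern is used so that partial matches are rejected.

```python
import re
import pandas as pd

pattern = re.compile(r'^\$\d*\.\d{2}$')

# Hypothetical column of raw monetary strings
prices = pd.Series(['$17.89', '$5', '12.50', '$0.99'])

# Reuse the one compiled pattern for every value in the column
is_valid = prices.apply(lambda value: bool(pattern.match(value)))

print(is_valid.tolist())  # [True, False, False, True]
```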

---
# Let’s practice!

In [41]:
# Import the regular expression module
import re

# Compile the pattern: prog
prog = re.compile(r'\d{3}-\d{3}-\d{4}')

# See if the pattern matches
result = prog.match('123-456-7890')
print(bool(result))

# See if the pattern matches a number with an extra leading digit
result = prog.match('1123-456-7890')
print(bool(result))


True
False
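
The second result is `False` because `match` only tries the pattern at the very start of the string. The sketch below contrasts `match` with `search` and `fullmatch` from the standard `re` module on the same phone-number pattern; the extra test strings are made up for illustration.

```python
import re

prog = re.compile(r'\d{3}-\d{3}-\d{4}')

# match: pattern must fit starting at position 0
print(bool(prog.match('1123-456-7890')))      # False

# search: pattern may fit anywhere in the string
print(bool(prog.search('1123-456-7890')))     # True

# match ignores trailing characters...
print(bool(prog.match('123-456-78901')))      # True

# ...while fullmatch requires the whole string to fit
print(bool(prog.fullmatch('123-456-78901')))  # False
```

For validation, `fullmatch` (or explicit `^` and `$` anchors) is usually the safer choice.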


In [42]:
# Import the regular expression module
import re

# Find the numeric values: matches
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')

# Print the matches
print(matches)

['10', '1']


In [43]:
# Write the first pattern
pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
print(pattern1)

# Write the second pattern
pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
print(pattern2)

# Write the third pattern
pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
print(pattern3)

True
True
True


# Using functions to clean data

### Complex cleaning
- Cleaning step requires multiple steps
    - Extract number from string
    - Perform transformation on extracted number
- Wrap these steps in a Python function

### Apply

In [8]:
import numpy as np
df

Unnamed: 0,name,sex,treatment a,treatment b
0,Daniel,male,,42
1,Jhon,male,12.0,31
2,Jane,female,24.0,27


In [9]:
#df.apply(np.mean, axis=0)
df = pd.read_csv('tiddy_done.csv',index_col='name')
df

name,treatment a,treatment b
Daniel,18,42
Jhon,12,31
Jane,24,27


In [10]:
df.apply(np.mean, axis=0)

treatment a    18.000000
treatment b    33.333333
dtype: float64

In [11]:
df.apply(np.mean, axis=1)

name
Daniel    30.0
Jhon      21.5
Jane      25.5
dtype: float64
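
To make the `axis` argument explicit, the sketch below rebuilds the same three-row table inline (so it does not depend on `tiddy_done.csv`) and checks both directions. `axis=0` applies the function to each column; `axis=1` applies it to each row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'treatment a': [18, 12, 24], 'treatment b': [42, 31, 27]},
    index=pd.Index(['Daniel', 'Jhon', 'Jane'], name='name'),
)

col_means = df.apply(np.mean, axis=0)  # one value per column
row_means = df.apply(np.mean, axis=1)  # one value per row

print(row_means.tolist())  # [30.0, 21.5, 25.5]

# For a simple mean, the built-in method gives the same result
same_as_builtin = np.allclose(col_means, df.mean())
print(same_as_builtin)  # True
```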

### Applying functions

In [12]:
df = pd.read_csv('dob_job_application_filings_subset.csv')
df = df[['Job #', 'Doc #', 'Borough', 'Initial Cost', 'Total Est. Fee']]
df_subset = df.head()
df_subset


Unnamed: 0,Job #,Doc #,Borough,Initial Cost,Total Est. Fee
0,121577873,2,MANHATTAN,$75000.00,$986.00
1,520129502,1,STATEN ISLAND,$0.00,$1144.00
2,121601560,1,MANHATTAN,$30000.00,$522.50
3,121601203,1,MANHATTAN,$1500.00,$225.00
4,121601338,1,MANHATTAN,$19500.00,$389.50


In [13]:
df_subset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
Job #             5 non-null int64
Doc #             5 non-null int64
Borough           5 non-null object
Initial Cost      5 non-null object
Total Est. Fee    5 non-null object
dtypes: int64(2), object(3)
memory usage: 280.0+ bytes


### Write the regular expression

In [14]:
import re
from numpy import nan as NaN  # the NaN alias was removed in NumPy 2.0

pattern = re.compile(r'^\$\d*\.\d{2}$')

### Writing a function

In [15]:
def my_function(input1, input2):
    # function body
    return value

### Write the function

- Inputs: a row of `df` and the pattern we will use to validate monetary values
- Create a variable for each value pulled from the row

In [16]:
def diff_money(row, pattern):
    icost = row['Initial Cost']
    tef = row['Total Est. Fee']
    if bool(pattern.match(icost)) and bool(pattern.match(tef)):
        icost = icost.replace("$", "")
        tef = tef.replace("$", "")
        icost = float(icost)
        tef = float(tef)
        return icost - tef
    else:
        return NaN

In [17]:
df_subset['diff'] = df_subset.apply(diff_money,
                                    axis=1,
                                    pattern=pattern)

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

The warning appears because `df_subset` was created as a slice of `df` with `df.head()`; building it with `df.head().copy()` avoids it.


In [18]:
df_subset

Unnamed: 0,Job #,Doc #,Borough,Initial Cost,Total Est. Fee,diff
0,121577873,2,MANHATTAN,$75000.00,$986.00,74014.0
1,520129502,1,STATEN ISLAND,$0.00,$1144.00,-1144.0
2,121601560,1,MANHATTAN,$30000.00,$522.50,29477.5
3,121601203,1,MANHATTAN,$1500.00,$225.00,1275.0
4,121601338,1,MANHATTAN,$19500.00,$389.50,19110.5
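
Row-wise `apply` is easy to read but can be slow on large tables. As an alternative sketch (not the approach used above), the same difference can be computed with vectorized string methods. The five rows are copied from the output above, and the helper name `to_amount` is made up for this example.

```python
import pandas as pd

costs = pd.DataFrame({
    'Initial Cost': ['$75000.00', '$0.00', '$30000.00', '$1500.00', '$19500.00'],
    'Total Est. Fee': ['$986.00', '$1144.00', '$522.50', '$225.00', '$389.50'],
})

MONEY = r'^\$\d*\.\d{2}$'

def to_amount(column):
    """Convert '$12.34'-style strings to floats; anything else becomes NaN."""
    valid = column.str.match(MONEY, na=False)
    stripped = column.str.replace('$', '', regex=False)
    return pd.to_numeric(stripped, errors='coerce').where(valid)

costs['diff'] = to_amount(costs['Initial Cost']) - to_amount(costs['Total Est. Fee'])

print(costs['diff'].tolist())  # [74014.0, -1144.0, 29477.5, 1275.0, 19110.5]
```

Unlike `diff_money`, this version validates each column independently, but the outcome is the same: a row yields NaN whenever either value fails the pattern.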


---
# Let’s practice!

In [44]:
# Define recode_sex()
def recode_sex(sex_value):

    # Return 1 if sex_value is 'Male'
    if sex_value == 'Male':
        return 1
    
    # Return 0 if sex_value is 'Female'    
    elif sex_value == 'Female':
        return 0
    
    # Return np.nan    
    else:
        return np.nan

# Apply the function to the sex column
tips['sex_recode'] = tips.sex.apply(recode_sex)

# Print the first five rows of tips
print(tips.head())


   total_bill   tip     sex smoker  day    time  size sex_recode
0       16.99  1.01  Female     No  Sun  Dinner     2          0
1       10.34  1.66    Male     No  Sun  Dinner     3          1
2       21.01  3.50    Male     No  Sun  Dinner     3          1
3       23.68  3.31    Male     No  Sun  Dinner     2          1
4       24.59  3.61  Female     No  Sun  Dinner     4          0
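
For a simple lookup like this, a dictionary passed to `Series.map` is a common shorthand. The sketch below uses a small made-up Series in place of `tips.sex`; values missing from the dictionary become NaN, which mirrors the `else` branch of `recode_sex`.

```python
import pandas as pd

# Hypothetical sample standing in for tips.sex
sex = pd.Series(['Female', 'Male', 'Unknown'])

recoded = sex.map({'Male': 1, 'Female': 0})

print(recoded.tolist())  # [0.0, 1.0, nan]
```

The result is a float column here because NaN cannot be stored in a plain integer column.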


```python
# Write the lambda function using replace
tips['total_dollar_replace'] = tips.total_dollar.apply(lambda x: x.replace('$', ''))

# Write the lambda function using regular expressions
tips['total_dollar_re'] = tips.total_dollar.apply(lambda x: re.findall(r'\d+\.\d+', x)[0])

# Print the head of tips
print(tips.head())
```
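
The exercise above relies on a `total_dollar` column that these notes never create. The following self-contained sketch builds a tiny made-up version of that column so both lambdas can be run, and adds a conversion step, since both approaches still return strings.

```python
import re
import pandas as pd

# Hypothetical stand-in for the total_dollar column
tips = pd.DataFrame({'total_dollar': ['$16.99', '$10.34', '$21.01']})

# Strip the dollar sign with str.replace
tips['total_dollar_replace'] = tips.total_dollar.apply(lambda x: x.replace('$', ''))

# Extract the number with a regular expression
tips['total_dollar_re'] = tips.total_dollar.apply(lambda x: re.findall(r'\d+\.\d+', x)[0])

# Both new columns hold strings; convert before doing arithmetic
tips['total'] = pd.to_numeric(tips['total_dollar_re'])

print(tips['total'].tolist())  # [16.99, 10.34, 21.01]
```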

# Duplicate and missing data

### Duplicate data
- Can skew results
- ‘.drop_duplicates()’ method

In [19]:
df = pd.read_csv('treat_duplicate.csv')
df

Unnamed: 0,name,sex,treatment a,treatment b
0,Daniel,male,-,42
1,Jhon,male,12,31
2,Jane,female,24,27
3,Daniel,male,-,42


### Drop duplicates

In [20]:
df = df.drop_duplicates()
df

Unnamed: 0,name,sex,treatment a,treatment b
0,Daniel,male,-,42
1,Jhon,male,12,31
2,Jane,female,24,27
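
Before dropping anything, it can help to see which rows pandas considers duplicates. The sketch below rebuilds the four-row table inline, flags duplicates with `duplicated()`, and shows the `subset` and `keep` parameters of `drop_duplicates()`, which restrict the comparison to chosen columns.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Daniel', 'Jhon', 'Jane', 'Daniel'],
    'sex': ['male', 'male', 'female', 'male'],
    'treatment a': ['-', '12', '24', '-'],
    'treatment b': [42, 31, 27, 42],
})

# True marks a row that repeats an earlier row in every column
flags = df.duplicated()
print(flags.tolist())  # [False, False, False, True]

# Compare on 'name' only and keep the first occurrence
deduped = df.drop_duplicates(subset=['name'], keep='first')
print(len(deduped))  # 3
```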


### Missing data
- Leave as-is
- Drop them
- Fill missing value

In [21]:
tips_nan = pd.read_csv('tips_nan.csv')
#tips_nan = tips_nan.head()
tips_nan.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,,Male,No,Sun,Dinner,2
4,24.59,3.61,,,Sun,,4


### Count missing values

In [22]:
tips_nan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    243 non-null float64
tip           243 non-null float64
sex           243 non-null object
smoker        243 non-null object
day           244 non-null object
time          243 non-null object
size          244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.4+ KB
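
Reading missing counts off `.info()` means subtracting from the row total by hand. A more direct sketch is `isna().sum()`, shown here on a small frame built from the first five rows above (only three columns are reproduced to keep it short).

```python
import numpy as np
import pandas as pd

tips_nan = pd.DataFrame({
    'total_bill': [16.99, np.nan, 21.01, 23.68, 24.59],
    'tip': [1.01, 1.66, 3.50, np.nan, 3.61],
    'sex': ['Female', 'Male', 'Male', 'Male', np.nan],
})

# isna() marks each missing cell; sum() counts them per column
missing_per_column = tips_nan.isna().sum()
print(missing_per_column)
```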


### Drop missing values

In [23]:
tips_dropped = tips_nan.dropna()
tips_dropped.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 241 entries, 0 to 243
Data columns (total 7 columns):
total_bill    241 non-null float64
tip           241 non-null float64
sex           241 non-null object
smoker        241 non-null object
day           241 non-null object
time          241 non-null object
size          241 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 15.1+ KB
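
By default `dropna()` removes a row if any column is missing, which can discard more data than intended. The sketch below, using the same small made-up frame as an example, shows the `subset` parameter limiting the check to the columns that matter for a given analysis.

```python
import numpy as np
import pandas as pd

tips_nan = pd.DataFrame({
    'total_bill': [16.99, np.nan, 21.01, 23.68, 24.59],
    'tip': [1.01, 1.66, 3.50, np.nan, 3.61],
    'sex': ['Female', 'Male', 'Male', 'Male', np.nan],
})

# Drop rows with a missing value in any column
all_complete = tips_nan.dropna()
print(len(all_complete))  # 2

# Drop rows only when 'tip' is missing
tip_complete = tips_nan.dropna(subset=['tip'])
print(len(tip_complete))  # 4
```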


### Fill missing values with .fillna()
- Fill with provided value
- Use a summary statistic

In [24]:
tips_nan['sex'] = tips_nan['sex'].fillna('missing')

tips_nan[['total_bill', 'size']] = tips_nan[['total_bill', 'size']].fillna(0)

tips_nan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           243 non-null float64
sex           244 non-null object
smoker        243 non-null object
day           244 non-null object
time          243 non-null object
size          244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.4+ KB


### Fill missing values with a summary statistic
- Be careful when using summary statistics to fill
- Have to make sure the value you are filling in makes sense
- Median is a better statistic in the presence of outliers
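
A small made-up example shows why the median is the safer fill value when outliers are present: one extreme tip pulls the mean far from the typical value, while the median barely moves.

```python
import numpy as np
import pandas as pd

# Hypothetical tips with one outlier and one missing value
tip = pd.Series([2.0, 3.0, 3.5, np.nan, 100.0])

print(tip.mean())    # 27.125
print(tip.median())  # 3.25

# Fill the gap with the median instead of the mean
filled = tip.fillna(tip.median())
print(filled.tolist())  # [2.0, 3.0, 3.5, 3.25, 100.0]
```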

In [25]:
mean_value = tips_nan['tip'].mean()
mean_value

2.9969958847736624

In [26]:
tips_nan['tip'] = tips_nan['tip'].fillna(mean_value)

In [27]:
tips_nan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null object
smoker        243 non-null object
day           244 non-null object
time          243 non-null object
size          244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.4+ KB


---
# Let’s practice!

```python
# Create the new DataFrame: tracks
tracks = billboard[['year', 'artist', 'track', 'time']]

# Print info of tracks
print(tracks.info())

# Drop the duplicates: tracks_no_duplicates
tracks_no_duplicates = tracks.drop_duplicates()

# Print info of tracks
print(tracks_no_duplicates.info())
```

```python
# Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality.Ozone.mean()

# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality['Ozone'].fillna(oz_mean)

# Print the info of airquality
print(airquality.info())
```

# Testing with asserts

### Assert statements
- Programmatically vs visually checking
- If we drop or fill NaNs, we expect 0 missing values
- We can write an assert statement to verify this
- We can detect early warnings and errors
- This gives us confidence that our code is running correctly

### Asserts

In [28]:
assert 1 == 1  # True: passes silently

In [29]:
assert 1 == 2  # False: raises an AssertionError

AssertionError: 
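
A bare `AssertionError` says nothing about what went wrong. An assert can carry a message after a comma, as in this sketch; the helper `check_non_negative` is a made-up name for illustration.

```python
def check_non_negative(values):
    # The text after the comma becomes the AssertionError message
    assert all(v >= 0 for v in values), 'found a negative value'
    return True

try:
    check_non_negative([3, -1, 2])
except AssertionError as err:
    message = str(err)
    print(message)  # found a negative value
```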

- Check for missing values: `.notnull()` returns True where a value is present and False where it is missing. Chaining the `.all()` method tests whether every value is non-null.

```python
assert google.Close.notnull().all()

---------------------------------------------------------------------------
AssertionError 
```


- Fill all the missing values in the DataFrame with the value 0
```python
google_0 = google.fillna(value=0)
```


```python
assert google_0.Close.notnull().all()
```
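
The `google` DataFrame is not loaded anywhere in these notes, so here is a self-contained sketch of the same fill-then-assert sequence using a made-up stand-in with one missing closing price. Note that `fillna` returns a new DataFrame, so the assert has to check the filled copy.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `google` DataFrame (values are invented)
google = pd.DataFrame({'Close': [741.84, np.nan, 742.58]})

google_0 = google.fillna(value=0)

# Passes: the filled copy has no gaps
assert google_0.Close.notnull().all()

# The original is unchanged and still has a missing value
original_complete = bool(google.Close.notnull().all())
print(original_complete)  # False
```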

---
# Let’s practice!

```python
# Assert that there are no missing values
assert pd.notnull(ebola).all().all()

# Assert that all values are >= 0
assert (ebola >= 0).all().all()
```
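
The double `.all().all()` is needed because the first call reduces a DataFrame to one boolean per column, and the second reduces those to a single value that `assert` can test. The sketch below demonstrates this on a tiny made-up stand-in for `ebola`; the column names and numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the `ebola` DataFrame
ebola = pd.DataFrame({'Cases': [49, 0, 12], 'Deaths': [29, 0, 3]})

# First .all(): one True/False per column
per_column = pd.notnull(ebola).all()
print(per_column.tolist())  # [True, True]

# Second .all(): a single True/False for the whole table
assert pd.notnull(ebola).all().all()
assert (ebola >= 0).all().all()
```

The `>= 0` comparison assumes every column is numeric; a string column would need to be excluded or converted first.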