# Cleaning data for analysis

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-types" data-toc-modified-id="Data-types-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data types</a></span></li><li><span><a href="#Using-regular-expressions-to-clean-strings" data-toc-modified-id="Using-regular-expressions-to-clean-strings-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Using regular expressions to clean strings</a></span></li><li><span><a href="#Using-functions-to-clean-data" data-toc-modified-id="Using-functions-to-clean-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Using functions to clean data</a></span></li><li><span><a href="#Duplicate-and-missing-data" data-toc-modified-id="Duplicate-and-missing-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Duplicate and missing data</a></span></li><li><span><a href="#Testing-with-asserts" data-toc-modified-id="Testing-with-asserts-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Testing with asserts</a></span></li></ul></div>

## Data types

- Data types
        In [1]: print(df.dtypes)
            name           object
            sex            object
            treatment a    object
            treatment b    int64
            dtype: object
    - There may be times we want to convert from one type to another
        - Numeric columns can be strings, or vice versa
- Converting data types
        In [2]: df['treatment b'] = df['treatment b'].astype(str)
        In [3]: df['sex'] = df['sex'].astype('category')
        In [4]: df.dtypes
        Out[4]:
            name           object
            sex            category
            treatment a    object
            treatment b    object
            dtype: object
    - Categorical data
        - Converting categorical data to ‘category’ dtype
            - Can make the DataFrame smaller in memory
            - Lets other Python libraries make use of the categories during analysis
- Cleaning bad data
        In [5]: df['treatment a'] = pd.to_numeric(df['treatment a'],
        ...:                                   errors='coerce')
        In [6]: df.dtypes
        Out[6]:
            name           object
            sex            category
            treatment a    float64
            treatment b    object
            dtype: object
    - errors='coerce'
        - Forces values that cannot be converted to become NaN
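
The memory claim for the 'category' dtype can be checked directly. This is a minimal sketch on made-up data: the labels and the names as_object and as_category are assumptions for illustration, not part of the tips dataset.

```python
import pandas as pd

# Hypothetical data: 1,000 repeated labels stored as plain Python strings
labels = ['Male', 'Female'] * 500
as_object = pd.Series(labels, dtype=object)

# Convert to the 'category' dtype: each distinct label is stored once,
# and the column itself holds small integer codes
as_category = as_object.astype('category')

# deep=True counts the string objects themselves, not just the pointers
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(as_category.dtype)
print(object_bytes, category_bytes)
```

The saving depends on how often each label repeats; a column of mostly unique strings gains little from the conversion.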

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import jupyterthemes.jtplot as jtplot
%matplotlib inline
jtplot.style(theme='onedork')

In [2]:
tips = pd.read_csv('exercise/tips.csv')

# Convert the sex column to type 'category'
#tips.sex = tips.sex.astype('category')
tips['sex'] = tips['sex'].astype('category')
# Convert the smoker column to type 'category'
tips.smoker = tips.smoker.astype('category')

# Convert 'total_bill' to a numeric dtype
tips['total_bill'] = pd.to_numeric(tips['total_bill'], errors='coerce')

# Convert 'tip' to a numeric dtype
tips['tip'] = pd.to_numeric(tips['tip'], errors='coerce')

# Print the info of tips
print(tips.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(2), float64(2), int64(1), object(2)
memory usage: 10.3+ KB
None
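
The tips columns above convert cleanly, so the effect of errors='coerce' is not visible in that output. Here is a small sketch on a made-up Series (the values and the names raw and cleaned are assumptions) where one entry cannot be parsed.

```python
import pandas as pd

# Hypothetical column: numbers stored as strings, with one bad entry
raw = pd.Series(['16.99', '10.34', 'missing'])

# errors='coerce' turns anything that cannot be parsed into NaN
cleaned = pd.to_numeric(raw, errors='coerce')

print(cleaned.dtype)         # float64
print(cleaned.isna().sum())  # 1
```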


## Using regular expressions to clean strings

- String manipulation
    - Much of data cleaning involves string manipulation
        - Most of the world’s data is unstructured text
    - Also have to do string manipulation to make datasets consistent with one another
    - Many built-in and external libraries
    - ‘re’ library for regular expressions
        - A formal way of specifying a pattern
        - Sequence of characters
    - Pattern matching
        - Similar to globbing
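- Example match (string to match, followed by a pattern for it)
        17         \d*
        $17        \$\d*
        $17.00     \$\d*\.\d*
        $17.89     \$\d*\.\d{2}
        $17.895    ^\$\d*\.\d{2}$    (anchored, so this one rejects $17.895)
    - Rules for match
        - Matching starts at the first character of the string
        - Once the pattern has been satisfied, whatever follows is ignored
        - Special case
                ^ ... pattern goes here ... $
            - A pattern that starts with '^' and ends with '$' only matches strings that fit the pattern exactly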
                
- Using regular expressions
    - Compile the pattern
    - Use the compiled pattern to match values
    - This lets us use the pattern over and over again
    - Useful since we want to match values down a column of values
            In [1]: import re
            In [2]: pattern = re.compile(r'\$\d*\.\d{2}')
            In [3]: result = pattern.match('$17.89')
            In [4]: bool(result)
            True
- Write patterns to match:
    - \d
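        - Any digit
    - [A-Z]
        - One capital letter within the given range
        - Ranges also work for lowercase letters, e.g. [a-z]
        - And for digits, e.g. [0-9]
    - \w
        - A word character: a-z, A-Z, 0-9 or underscore
    - Quantifiers, placed after the element they repeat
        - {x}
            - Exactly x times
        - *
            - Any number of times (including 0)
        - +
            - One or more times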
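
The matching rules can be checked with a short sketch that uses only the standard library. The names loose and strict are illustrative. The unanchored pattern accepts a value with three decimals because match stops caring once the pattern is satisfied, while the version wrapped in '^' and '$' insists on exactly two.

```python
import re

# Unanchored: only the start of the string has to fit the pattern
loose = re.compile(r'\$\d*\.\d{2}')
# Anchored with ^ and $: the whole string has to fit the pattern
strict = re.compile(r'^\$\d*\.\d{2}$')

print(bool(loose.match('$17.89')))    # True
print(bool(loose.match('$17.895')))   # True, the trailing 5 is ignored
print(bool(strict.match('$17.89')))   # True
print(bool(strict.match('$17.895')))  # False, three decimals do not fit
```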

In [3]:
# Import the regular expression module
import re

# Compile the pattern: prog
prog = re.compile(r'\d{3}-\d{3}-\d{4}')

# See if the pattern matches
result = prog.match('123-456-7890')
print(bool(result))

# An extra leading digit breaks the match, since match() starts at the first character
result2 = prog.match('1123-456-7890')
print(bool(result2))


True
False


In [4]:
# Find the numeric values: matches
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')

# Print the matches
print(matches)


['10', '1']


In [5]:
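# First pattern: three digits, three digits, four digits, separated by hyphens
pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
print(pattern1)

# Second pattern: a dollar sign, any number of digits, a decimal point, two digits
pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$0.45'))
print(pattern2)

# Third pattern: a capital letter followed by one or more word characters
pattern3 = bool(re.match(pattern=r'[A-Z]\w+', string='Australia'))
print(pattern3)

# \S matches any single non-whitespace character, including '$'
pattern4 = bool(re.match(pattern=r'\S', string='$'))
print(pattern4)

# The anchored \w* fails here because '$' is not a word character
pattern5 = bool(re.match(pattern=r'^\w*$', string='qedaM$'))
print(pattern5)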

True
True
True
True
False
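
Since the point of compiling a pattern is to reuse it down a column, here is a hedged sketch of that step. The phone numbers, the column names phone and valid, and the choice to anchor the pattern are assumptions made for illustration.

```python
import re
import pandas as pd

# Hypothetical column of phone numbers, one of them malformed
df = pd.DataFrame({'phone': ['123-456-7890', '1123-456-7890', '555-000-1111']})

# Compile once, then reuse the same pattern for every value
prog = re.compile(r'^\d{3}-\d{3}-\d{4}$')

# apply() calls the function on each value in the column
df['valid'] = df['phone'].apply(lambda value: bool(prog.match(value)))

print(df)
```

Rows flagged False can then be inspected or dropped before further analysis.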


## Using functions to clean data

- Complex cleaning
    - Some cleaning tasks require multiple steps
        - Extract number from string
        - Perform transformation on extracted number
- Apply
        In [1]: print(df)
                    treatment a  treatment b
        Daniel           18           42
        John             12           31
        Jane             24           27
        In [2]: df.apply(np.mean, axis=0)
        Out[2]:
        treatment a    18.000000
        treatment b    33.333333
        dtype: float64
        In [4]: df.apply(np.mean, axis=1)
        Out[4]:
        Daniel    30.0
        John      21.5
        Jane      25.5
        dtype: float64
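    - axis=0 applies the function to each column (working down the rows)
    - axis=1 applies the function to each row (working across the columns)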
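
To make the axis argument concrete, the sketch below rebuilds the small treatment table and applies a custom function in both directions. The helper value_range is a made-up name. With axis=0 the function receives one column at a time, so the result is indexed by column name; with axis=1 it receives one row at a time, so the result is indexed by person.

```python
import pandas as pd

# The same toy table as in the apply example
df = pd.DataFrame(
    {'treatment a': [18, 12, 24], 'treatment b': [42, 31, 27]},
    index=['Daniel', 'John', 'Jane'],
)

def value_range(values):
    """Difference between the largest and smallest value."""
    return values.max() - values.min()

# axis=0: the function receives one column at a time
per_column = df.apply(value_range, axis=0)
# axis=1: the function receives one row at a time
per_row = df.apply(value_range, axis=1)

print(per_column)
print(per_row)
```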

In [6]:
tips = pd.read_csv('exercise/tips.csv')

# Define recode_gender()
def recode_gender(gender):

    # Return 0 if gender is 'Female'
    if gender == 'Female':
        return 0
    
    # Return 1 if gender is 'Male'    
    elif gender == 'Male':
        return 1
    
    # Return np.nan    
    else:
        return np.nan

# Apply the function to the sex column
tips['recode'] = tips['sex'].apply(recode_gender)

# Print the last five rows of tips
print(tips.tail())

     total_bill   tip     sex smoker   day    time  size  recode
239       29.03  5.92    Male     No   Sat  Dinner     3       1
240       27.18  2.00  Female    Yes   Sat  Dinner     2       0
241       22.67  2.00    Male    Yes   Sat  Dinner     2       1
242       17.82  1.75    Male     No   Sat  Dinner     2       1
243       18.78  3.00  Female     No  Thur  Dinner     2       0


In [7]:
# Build a string column with a leading dollar sign to practice cleaning on
tips['$total_dollar'] = tips.total_bill.apply(lambda x: '$' + str(x))
# Write the lambda function using replace
tips['total_dollar'] = tips['$total_dollar'].apply(lambda x: float(x.replace('$', '')))

# Write the lambda function using regular expressions
# re.findall returns a list, so use [0] to access the value
#tips['total_dollar2'] = tips['$total_dollar'].apply(lambda x: re.findall(r'\d+\.\d+', x)[0])

# Print the head of tips
print(tips.head())

   total_bill   tip     sex smoker  day    time  size  recode $total_dollar  \
0       16.99  1.01  Female     No  Sun  Dinner     2       0        $16.99   
1       10.34  1.66    Male     No  Sun  Dinner     3       1        $10.34   
2       21.01  3.50    Male     No  Sun  Dinner     3       1        $21.01   
3       23.68  3.31    Male     No  Sun  Dinner     2       1        $23.68   
4       24.59  3.61  Female     No  Sun  Dinner     4       0        $24.59   

   total_dollar  
0         16.99  
1         10.34  
2         21.01  
3         23.68  
4         24.59  
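
The two-step cleaning described in this section (extract a number from a string, then transform it) can also be written as a named function. This is a sketch under assumptions: the helper dollars_to_float, the sample values and the choice to return NaN when nothing matches are all illustrative.

```python
import re
import numpy as np
import pandas as pd

# Compile once so the pattern can be reused for every value
amount_pattern = re.compile(r'\d+\.\d+')

def dollars_to_float(text):
    """Extract the first decimal number from text and return it as a float."""
    found = amount_pattern.findall(text)
    # Step 1: extract the number; step 2: convert it; NaN if nothing matched
    return float(found[0]) if found else np.nan

# Hypothetical values, including one with no amount in it
prices = pd.Series(['$16.99', '$10.34', 'n/a'])
cleaned = prices.apply(dollars_to_float)

print(cleaned)
```

A named function is easier to test on its own than a lambda, which helps once the cleaning logic grows beyond one expression.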


## Duplicate and missing data

- Duplicate data
    - Can skew results 
    - ‘.drop_duplicates()’ method
             In [1]: df = df.drop_duplicates()
- Missing data
    - Leave as-is
    - Drop them
            In [4]: tips_dropped = tips_nan.dropna()
    - Fill missing value
            tips_nan['sex'] = tips_nan['sex'].fillna('missing')
            tips_nan[['total_bill', 'size']] = tips_nan[['total_bill','size']].fillna(0)
    - Fill missing values with a summary statistic (e.g. the mean or median)
        - Be careful when using a statistic to fill
        - Have to make sure the value you are filling in makes sense
        - Median is a better statistic in the presence of outliers
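
A small worked example shows why the median is the safer fill value when outliers are present. The numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical measurements with one outlier and one missing value
values = pd.Series([1.0, 2.0, 3.0, 100.0, None])

mean_value = values.mean()      # pulled upward by the outlier
median_value = values.median()  # stays near the typical values

# Fill the gap with the median
filled = values.fillna(median_value)

print(mean_value, median_value)  # 26.5 2.5
print(filled.tolist())           # [1.0, 2.0, 3.0, 100.0, 2.5]
```

Filling with the mean here would insert 26.5, a value far from every typical observation.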

In [8]:
airquality = pd.read_csv('exercise/airquality.csv')
print(airquality.info())
# Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality['Ozone'].mean()

# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality['Ozone'].fillna(oz_mean)
airquality = airquality.drop_duplicates()

# Print the info of airquality
print(airquality.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone      116 non-null float64
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.3 KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone      153 non-null float64
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 8.4 KB
None
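
drop_duplicates() keeps the original row labels of the rows that survive, so gaps can appear in the index once rows are removed. A minimal sketch on a made-up frame (the values are assumptions, not taken from airquality.csv):

```python
import pandas as pd

# Hypothetical readings in which the second row repeats the first
readings = pd.DataFrame({'Ozone': [41.0, 41.0, 12.0], 'Temp': [67, 67, 74]})

# drop_duplicates() keeps the first copy of each repeated row
deduped = readings.drop_duplicates()
print(deduped.index.tolist())     # [0, 2]

# Renumber the rows from 0 and discard the old labels
renumbered = deduped.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]
```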


## Testing with asserts

- Assert statements
    - Programmatically vs visually checking
    - If we drop or fill NaNs, we expect 0 missing values
    - We can write an assert statement to verify this 
    - We can detect early warnings and errors
    - This gives us confidence that our code is running correctly
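    - ex:
            assert google.Close.notnull().all()
            # general form: assert df.column.notnull().all()
    - assert
        - Checks whether the expression that follows is True
            - If it is, nothing happens
            - If not, an AssertionError is raised
            - The expression must reduce to a single truth value; a whole DataFrame or Series has none (pandas raises ValueError), so collapse it with .all() first
    - pd.notnull()
        - For scalar input, returns a scalar boolean.
        - For array input, returns an array of booleans indicating whether each corresponding element is valid.
    - all(iterable, /)
        - Return True if bool(x) is True for all values x in the iterable.
        - If the iterable is empty, return True.
        - The pandas .all() method can be seen as reducing the dimension by one
                ebola.notnull().all().all()
            - notnull() reports, for every value in the DataFrame, whether it is not NaN
            - The first all() reports, for each column of the DataFrame, whether all of its values are True; the result is a Series
                - Iterating over a DataFrame yields its column names
            - The second all() reports whether all values in the Series from the first all() are True
            - df -> Series -> bool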
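
The DataFrame -> Series -> bool reduction can be followed step by step in a small sketch. The frame and its column names are made up; the last part shows what a failing assert looks like when it is caught.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value
df = pd.DataFrame({'cases': [1.0, np.nan], 'deaths': [0.0, 2.0]})

per_value = df.notnull()      # DataFrame of booleans
per_column = per_value.all()  # Series: one boolean per column
overall = per_column.all()    # single boolean for the whole frame

print(per_column.tolist())
print(bool(overall))          # False

# After filling, the same check passes silently
filled = df.fillna(0)
assert filled.notnull().all().all()

# A failing assert raises AssertionError with the optional message
try:
    assert df.notnull().all().all(), 'missing values found'
except AssertionError as err:
    print('caught:', err)
```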


In [10]:
ebola = pd.read_csv('exercise/ebola.csv')
#print(ebola.info())
ebola = ebola.fillna(0)
#print(ebola.info())
# Assert that there are no missing values
print(type(ebola.notnull()))
print(type(ebola.notnull().all()))
print(type(ebola.notnull().all().all()))

assert ebola.notnull().all().all()

# Assert that all values are >= 0
#assert (ebola >= 0).all().all()


<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<class 'numpy.bool_'>
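
A check such as "all values are >= 0" only makes sense for numeric columns; comparing a text column with 0 can raise a TypeError. One hedged way to handle this is to restrict the check to numeric columns first. The frame below and its column names are made up for illustration, and select_dtypes is one option among several.

```python
import pandas as pd

# Hypothetical frame mixing a text column with numeric columns
df = pd.DataFrame({
    'label': ['a', 'b'],
    'count_x': [12.0, 0.0],
    'count_y': [3.0, 0.0],
})

# Keep only the numeric columns so that >= 0 is well defined everywhere
numeric = df.select_dtypes(include='number')

# DataFrame -> Series -> single boolean
assert (numeric >= 0).all().all()

print(numeric.columns.tolist())  # ['count_x', 'count_y']
```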
