Here, you'll dive into some of the grittier aspects of data cleaning. You'll learn about string manipulation and pattern matching to deal with unstructured data, and then explore techniques to deal with missing or duplicate data. You'll also learn the valuable skill of programmatically checking your data for consistency, which will give you confidence that your code is running correctly and that the results of your analysis are reliable!

## Converting data types
In this exercise, you'll see how ensuring all categorical variables in a DataFrame are of type category reduces memory usage.

The tips dataset has been loaded into a DataFrame called tips. This data contains information about how much a customer tipped, whether the customer was male or female, a smoker or not, etc.

Look at the output of tips.info() in the IPython Shell. You'll note that two columns that should be categorical - sex and smoker - are instead of type object, which is pandas' way of storing arbitrary strings. Your job is to convert these two columns to type category and note the reduced memory usage.



In [10]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [11]:
tips = pd.read_csv('tips.csv')

In [12]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [13]:
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null object
smoker        244 non-null object
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 9.6+ KB


In [14]:
# Convert the sex column to type 'category'
tips.sex = tips.sex.astype('category')


In [15]:
# Convert the smoker column to type 'category'
tips.smoker = tips.smoker.astype('category')


In [16]:
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(2), float64(2), int64(1), object(2)
memory usage: 8.2+ KB
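
The `+` in the `memory usage` line means pandas has not counted the contents of the object columns, so the saving shown by `.info()` is only approximate. A minimal sketch of a fuller comparison follows, assuming a small made-up DataFrame (`df` is a hypothetical stand-in, not the tips data) and using `memory_usage(deep=True)` to include the string contents:

```python
import pandas as pd

# Hypothetical stand-in for a low-cardinality string column such as 'sex'.
df = pd.DataFrame({"sex": pd.Series(["Female", "Male"] * 500, dtype=object)})

# deep=True counts the Python string objects, not only the pointers to them.
bytes_as_object = df["sex"].memory_usage(deep=True)

# A category column stores each distinct label once plus small integer codes.
df["sex"] = df["sex"].astype("category")
bytes_as_category = df["sex"].memory_usage(deep=True)

print(bytes_as_object, bytes_as_category)
```

The gain is largest when a column has few distinct values relative to its length, as is the case for sex and smoker.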


## Working with numeric data
If you expect a column to be numeric (int or float) but it is instead of type object, this typically means the column contains at least one non-numeric value, which is a sign of bad data.

You can use the pd.to_numeric() function to convert a column into a numeric data type. If the function raises an error, you can be sure that there is a bad value within the column. You can either use the techniques you learned in Chapter 1 to do some exploratory data analysis and find the bad value, or you can choose to ignore the value or coerce it into a missing value, NaN.

To demonstrate, let's replace one numeric value in the 'total_bill' column with a string, which turns the column into type object.

In [17]:
import warnings; warnings.simplefilter('ignore')

In [18]:
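# Cast to object so the column can hold a string, then assign with .loc
# (chained indexing such as tips['total_bill'][4] may not modify tips)
tips['total_bill'] = tips['total_bill'].astype(object)
tips.loc[4, 'total_bill'] = 'missing'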
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null object
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(2), float64(1), int64(1), object(3)
memory usage: 7.3+ KB


In [19]:
# Convert 'total_bill' to a numeric dtype
tips['total_bill'] = pd.to_numeric(tips['total_bill'], errors='coerce')

In [20]:
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    243 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(2), float64(2), int64(1), object(2)
memory usage: 8.2+ KB
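
As a rough sketch of the two behaviours described above, the following compares the default `errors='raise'` with `errors='coerce'` and then locates the offending rows. The `bills` Series is a hypothetical stand-in for the modified 'total_bill' column:

```python
import pandas as pd

# Hypothetical column with one bad entry, mirroring the 'missing' value above.
bills = pd.Series([16.99, 10.34, "missing", 23.68], dtype=object)

# errors='raise' is the default: an unparseable value triggers a ValueError.
try:
    pd.to_numeric(bills)
    raised = False
except ValueError:
    raised = True

# errors='coerce' turns unparseable values into NaN instead of raising.
coerced = pd.to_numeric(bills, errors="coerce")

# Values that were present originally but became NaN are the bad ones.
bad_rows = bills[coerced.isna() & bills.notna()]
print(bad_rows)
```

Inspecting `bad_rows` before discarding anything makes it easier to judge whether coercing to NaN is appropriate for your data.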


## String parsing with regular expressions
When working with data, it is sometimes necessary to write a regular expression to look for properly entered values. Phone numbers are a common field that needs to be checked for validity. Your job in this exercise is to define a regular expression to match US phone numbers that fit the pattern xxx-xxx-xxxx.

The regular expression module in Python is re. Because the pattern will be matched against many rows, it's better to compile it first using re.compile(), and then use the compiled pattern to match values.

### Instructions
- Import re.
- Compile a pattern that matches a phone number of the format xxx-xxx-xxxx.
  - Use \d{x} to match x digits. Here you'll need to use it three times: twice to match 3 digits, and once to match 4 digits.
  - Place the regular expression inside re.compile().
- Using the .match() method on the compiled pattern, check whether the pattern matches the string '123-456-7890'.
- Using the same approach, now check whether the pattern matches the string '1123-456-7890'.

In [21]:
import re
# Use a raw string so that \d reaches the regex engine unchanged
pattern = re.compile(r'\d{3}-\d{3}-\d{4}')
result = pattern.match('123-456-7890')
bool(result)


True

In [22]:
result = pattern.match('1123-456-7890')
bool(result)

False
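
Note that `.match()` only anchors the pattern at the start of the string, so a value with extra trailing characters would still pass. A small sketch of the difference, using the standard library's `.fullmatch()` as one possible stricter check (the test strings here are made up for illustration):

```python
import re

phone = re.compile(r"\d{3}-\d{3}-\d{4}")

# .match() anchors only at the start, so a trailing digit slips through.
loose = bool(phone.match("123-456-78901"))

# .fullmatch() requires the entire string to fit the pattern.
strict = bool(phone.fullmatch("123-456-78901"))
valid = bool(phone.fullmatch("123-456-7890"))

print(loose, strict, valid)
```

Adding `^` and `$` anchors to the pattern is another way to get the stricter behaviour.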

### Extracting numerical values from strings
Extracting numbers from strings is a common task, particularly when working with unstructured data or log files.

Say you have the following string: 'the recipe calls for 6 strawberries and 2 bananas'.

It would be useful to extract the 6 and the 2 from this string to be saved for later use when comparing strawberry to banana ratios.

When using a regular expression to extract multiple numbers (or multiple pattern matches, to be exact), you can use the re.findall() function. It is straightforward to use: you pass in a pattern and a string to re.findall(), and it will return a list of the matches.

### Instructions
- Import re.
- Write a pattern that will find all the numbers in the following string: 'the recipe calls for 10 strawberries and 1 banana'. To do this:
  - Use the re.findall() function and pass it two arguments: the pattern, followed by the string.
  - \d is the pattern required to find digits. This should be followed with a + so that the previous element is matched one or more times. This ensures that 10 is viewed as one number and not as 1 and 0.

In [23]:
import re
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')
matches

['10', '1']
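
The matches come back as strings, so they need converting before any arithmetic. A short sketch of the strawberry-to-banana ratio mentioned earlier, using the example sentence with 6 and 2 (the variable names are illustrative only):

```python
import re

text = "the recipe calls for 6 strawberries and 2 bananas"

# findall returns a list of strings; convert each match to int before dividing.
strawberries, bananas = (int(n) for n in re.findall(r"\d+", text))
ratio = strawberries / bananas

print(ratio)
```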

### Instructions
__Write patterns to match:__
1. A telephone number of the format xxx-xxx-xxxx. You already did this in a previous exercise.
2. A string of the format: A dollar sign, an arbitrary number of digits, a decimal point, 2 digits.
   Use \$ to match the dollar sign, \d* to match an arbitrary number of digits, \. to match the decimal point, and \d{x} to match x number of digits.
3. A capital letter, followed by an arbitrary number of alphanumeric characters.
   Use [A-Z] to match any capital letter followed by \w* to match an arbitrary number of alphanumeric characters.


In [24]:
from IPython import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [25]:
# Write the first pattern (raw strings keep the backslashes intact)
pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
pattern1

# Write the second pattern
pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
pattern2

# Write the third pattern
pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
pattern3


True

True

True
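
In practice these patterns are usually checked against a whole column instead of a single string. One way to do that, sketched here with a made-up `prices` Series, is the pandas `Series.str.match` method, which applies `re.match` to each element:

```python
import pandas as pd

# Hypothetical price strings; only some follow the "$digits.dd" format.
prices = pd.Series(["$123.45", "$.50", "12.00", "USD 3.00"])

# Series.str.match tests each element against the pattern from its start.
is_dollar_amount = prices.str.match(r"\$\d*\.\d{2}")

print(is_dollar_amount.tolist())
```

Because `\d*` allows zero digits, a value such as '$.50' also counts as a match. Use `\d+` if at least one digit before the decimal point is required.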

## Custom functions to clean data
You'll now practice writing functions to clean data.

The tips dataset has been pre-loaded into a DataFrame called tips. It has a 'sex' column that contains the values 'Male' or 'Female'. Your job is to write a function that will recode 'Female' to 0, 'Male' to 1, and return np.nan for all entries of 'sex' that are neither 'Female' nor 'Male'.

Recoding variables like this is a common data cleaning task. Functions provide a mechanism for you to abstract away complex bits of code as well as reuse code. This makes your code more readable and less error prone.

As Dan showed you in the videos, you can use the .apply() method to apply a function across entire rows or columns of DataFrames. However, note that each column of a DataFrame is a pandas Series. Functions can also be applied across Series. Here, you will apply your function over the 'sex' column.



In [28]:
# Define recode_gender()
def recode_gender(gender):

    # Return 0 if gender is 'Female'
    if gender == 'Female':
        return 0
    
    # Return 1 if gender is 'Male'    
    elif gender == 'Male':
        return 1
    
    # Return np.nan    
    else: 
        return np.nan

# Apply the function to the sex column
tips['recode'] = tips.sex.apply(recode_gender)

# Print the first five rows of tips
tips.head()


Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,recode
0,16.99,1.01,Female,No,Sun,Dinner,2,0
1,10.34,1.66,Male,No,Sun,Dinner,3,1
2,21.01,3.5,Male,No,Sun,Dinner,3,1
3,23.68,3.31,Male,No,Sun,Dinner,2,1
4,,3.61,Female,No,Sun,Dinner,4,0
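
For a simple one-to-one recoding like this, a dictionary passed to `Series.map` is a common alternative to a custom function. The sketch below assumes a small hypothetical `sex` Series and is not part of the original exercise:

```python
import pandas as pd

# Hypothetical column including a value outside the two expected labels.
sex = pd.Series(["Female", "Male", "Unknown"], dtype=object)

# Values that are missing from the dictionary become NaN automatically.
recoded = sex.map({"Female": 0, "Male": 1})

print(recoded.tolist())
```

A custom function remains the better fit when the recoding needs conditional logic that a lookup table cannot express.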


## Lambda functions
You'll now be introduced to a powerful Python feature that will help you clean your data more effectively: lambda functions. Instead of using the def syntax that you used in the previous exercise, lambda functions let you make simple, one-line functions.

For example, here's a function that squares a variable used in an .apply() method:
```python
def my_square(x):
    return x ** 2

df.apply(my_square)
```
The equivalent code using a lambda function is:
```python
df.apply(lambda x: x ** 2)
```
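
To confirm that the two forms behave identically, here is a self-contained sketch with a small made-up DataFrame standing in for `df`:

```python
import pandas as pd

# Hypothetical DataFrame standing in for df in the snippets above.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

def my_square(x):
    return x ** 2

# DataFrame.apply passes each column to the function as a Series.
squared_def = df.apply(my_square)
squared_lambda = df.apply(lambda x: x ** 2)

print(squared_def.equals(squared_lambda))
```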

__Instructions__
- Use the .replace() method inside a lambda function to remove the dollar sign from the 'total_dollar' column of tips.
  - You need to specify two arguments to the .replace() method: the string to be replaced ('$'), and the string to replace it with ('').
  - Apply the lambda function over the 'total_dollar' column of tips.
- Use a regular expression to remove the dollar sign from the 'total_dollar' column of tips.
  - The pattern has been provided for you: it is the first argument of the re.findall() function.
  - Complete the rest of the lambda function and apply it over the 'total_dollar' column of tips. Notice that because re.findall() returns a list, you have to index into it to access the actual value.


In [30]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,recode
0,16.99,1.01,Female,No,Sun,Dinner,2,0
1,10.34,1.66,Male,No,Sun,Dinner,3,1
2,21.01,3.5,Male,No,Sun,Dinner,3,1
3,23.68,3.31,Male,No,Sun,Dinner,2,1
4,,3.61,Female,No,Sun,Dinner,4,0


In [33]:
tips['total_dollar'] = '$' + tips['total_bill'].astype(str)

In [34]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,recode,total_dollar
0,16.99,1.01,Female,No,Sun,Dinner,2,0,$16.99
1,10.34,1.66,Male,No,Sun,Dinner,3,1,$10.34
2,21.01,3.5,Male,No,Sun,Dinner,3,1,$21.01
3,23.68,3.31,Male,No,Sun,Dinner,2,1,$23.68
4,,3.61,Female,No,Sun,Dinner,4,0,$nan


In [37]:
# Write the lambda function using replace
tips['total_dollar_replace'] = tips.total_dollar.apply(lambda x: x.replace('$', ''))
    
# Print the head of tips
print(tips.head())


   total_bill   tip     sex smoker  day    time  size recode total_dollar  \
0       16.99  1.01  Female     No  Sun  Dinner     2      0       $16.99   
1       10.34  1.66    Male     No  Sun  Dinner     3      1       $10.34   
2       21.01  3.50    Male     No  Sun  Dinner     3      1       $21.01   
3       23.68  3.31    Male     No  Sun  Dinner     2      1       $23.68   
4         NaN  3.61  Female     No  Sun  Dinner     4      0         $nan   

  total_dollar_replace  
0                16.99  
1                10.34  
2                21.01  
3                23.68  
4                  nan  
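
The regular-expression variant from the instructions is not shown above. A minimal sketch follows, assuming the pattern `\d+\.\d+` (the provided pattern is not reproduced in this notebook, so this one is a guess) and a hypothetical `total_dollar` Series that includes the '$nan' row:

```python
import re
import pandas as pd

# Hypothetical column mirroring total_dollar, including the '$nan' row.
total_dollar = pd.Series(["$16.99", "$10.34", "$nan"])

# findall returns a list, so take its first element; fall back to [None]
# when nothing matches (as for '$nan') to avoid an IndexError.
total_dollar_re = total_dollar.apply(
    lambda x: (re.findall(r"\d+\.\d+", x) or [None])[0]
)

print(total_dollar_re.tolist())
```

Indexing with a bare `[0]` would fail on the '$nan' row because the list of matches is empty there, which is why the fallback is included.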
