## String manipulation with regular expressions

In [1]:
import re

In [6]:
pattern = re.compile(r'\$\d*\.\d{2}')

In [7]:
result = pattern.match("$17.89")

In [10]:
print(bool(result))

True


## String parsing with regular expressions
In the video, Dan introduced you to the basics of regular expressions, which are powerful ways of defining patterns to match strings. This exercise will get you started with writing them.

When working with data, it is sometimes necessary to write a regular expression to check that values were entered properly. Phone numbers are a common field that needs to be checked for validity. Your job in this exercise is to define a regular expression to match US phone numbers that fit the pattern xxx-xxx-xxxx.

The regular expression module in Python is re. When a pattern will be matched against many rows of data, it is better to compile it once with re.compile() and then reuse the compiled pattern for each value.

- Import re.
- Compile a pattern that matches a phone number of the format xxx-xxx-xxxx.
    - Use \d{x} to match x digits. Here you'll need to use it three times: twice to match 3 digits, and once to match 4 digits.
    - Place the regular expression inside re.compile().
- Using the .match() method on prog, check whether the pattern matches the string '123-456-7890'.
- Using the same approach, now check whether the pattern matches the string '1123-456-7890'.

In [11]:
# Import the regular expression module
#import re

In [12]:
# Compile the pattern: prog
prog = re.compile(r'\d{3}-\d{3}-\d{4}')

In [13]:
# See if the pattern matches
result = prog.match('123-456-7890')

print(bool(result))

True


In [14]:
# See if the pattern matches
result2 = prog.match('1123-456-7890')
print(bool(result2))

False
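
Note that .match() only anchors at the start of the string, so a value with extra trailing characters would still pass this check. The following is a minimal sketch, assuming the goal is to validate whole values across several rows. The sample list numbers is made up for illustration, and .fullmatch() is used here as a stricter alternative to the .match() call from the exercise.

```python
import re

# Compile once, then reuse the compiled pattern for every value
prog = re.compile(r'\d{3}-\d{3}-\d{4}')

# Made-up sample values standing in for a column of phone numbers
numbers = ['123-456-7890', '1123-456-7890', '123-456-78901']

# .match() anchors only at the start, so trailing characters slip through
print([bool(prog.match(n)) for n in numbers])      # [True, False, True]

# .fullmatch() requires the whole string to fit the pattern
print([bool(prog.fullmatch(n)) for n in numbers])  # [True, False, False]
```

Adding a trailing $ to the pattern and keeping .match() would be another way to get the stricter behavior.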


### Extracting numerical values from strings
Extracting numbers from strings is a common task, particularly when working with unstructured data or log files.

Say you have the following string: _'the recipe calls for 6 strawberries and 2 bananas'._

It would be useful to extract the 6 and the 2 from this string to be saved for later use when comparing strawberry to banana ratios.

When using a regular expression to extract multiple numbers (or multiple pattern matches, to be exact), you can use the __re.findall()__ function. Dan did not discuss this in the video, but it is straightforward to use: You pass in a pattern and a string to __re.findall()__, and it will return __a list__ of the matches.

- Import re.
- Write a pattern that will find all the numbers in the following string: 'the recipe calls for 10 strawberries and 1 banana'. To do this:
    - Use the re.findall() function and pass it two arguments: the pattern, followed by the string.
    - \d is the pattern required to find digits. This should be followed with a + so that the previous element is matched one or more times. This ensures that 10 is viewed as one number and not as 1 and 0.
- Print the matches to confirm that your regular expression found the values 10 and 1.

In [15]:
# Import the regular expression module
#import re

In [18]:
# Find the numeric values: matches
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')

In [19]:
# Print the matches
print(matches)

['10', '1']
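
Because re.findall() returns strings, the matches need converting before any arithmetic. Here is a small sketch using the strawberry and banana sentence quoted earlier; the variable names are illustrative and not part of the exercise.

```python
import re

text = 'the recipe calls for 6 strawberries and 2 bananas'

# findall returns a list of strings, one per match
matches = re.findall(r'\d+', text)
print(matches)  # ['6', '2']

# Convert to integers before computing the ratio
strawberries, bananas = (int(m) for m in matches)
print(strawberries / bananas)  # 3.0
```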


#### Pattern matching
In this exercise, you'll continue practicing your regular expression skills. For each provided string, your job is to write the appropriate pattern to match it.

Write patterns to match:
- A telephone number of the format xxx-xxx-xxxx. You already did this in a previous exercise.
- A string of the format: A dollar sign, an arbitrary number of digits, a decimal point, 2 digits.
    - Use \$ to match the dollar sign, \d* to match an arbitrary number of digits, \. to match the decimal point, and \d{x} to match x number of digits.
- A capital letter, followed by an arbitrary number of alphanumeric characters.
    - Use [A-Z] to match any capital letter followed by \w* to match an arbitrary number of alphanumeric characters.

In [20]:
# Write the first pattern
pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
print(pattern1)

True


In [21]:
# Write the second pattern
pattern2 = bool(re.match(pattern=r'^\$\d*\.\d{2}', string='$123.45'))
print(pattern2)

True


In [23]:
# Write the third pattern
pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
print(pattern3)

True
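
The decimal point in the second pattern has to be written as \. because a bare . matches any character. The sketch below contrasts the two forms; the names loose and strict and the test string '$123x45' are made up for illustration.

```python
import re

# An unescaped dot matches any character, so 'x' is accepted as the separator
loose = re.compile(r'\$\d*.\d{2}')
print(bool(loose.match('$123x45')))   # True

# Escaping the dot restricts the match to a literal decimal point
strict = re.compile(r'\$\d*\.\d{2}')
print(bool(strict.match('$123x45')))  # False
print(bool(strict.match('$123.45')))  # True
```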


# Using functions to clean data

In [24]:
# NaN marks values that cannot be parsed (the NaN alias was removed in NumPy 2.0)
from numpy import nan as NaN

In [25]:
pattern = re.compile(r"^\$\d*\.\d{2}$")

In [26]:
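def diff_money(row, pattern):
    """Return Initial Cost minus Total Est. Fee, or NaN if either value is malformed."""
    icost = row["Initial Cost"]
    tef = row["Total Est. Fee"]
    # Only compute the difference when both values look like $123.45
    if pattern.match(icost) and pattern.match(tef):
        # Strip the dollar sign and convert to float
        icost = float(icost.replace("$", ""))
        tef = float(tef.replace("$", ""))
        return icost - tef
    else:
        return NaN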

In [27]:
import pandas as pd

In [28]:
fee = pd.read_csv("dob_job_application_filings_subset.csv")



In [30]:
#print(fee.info())

In [31]:
fee["diff"] = fee.apply(diff_money, axis=1, pattern=pattern)

In [36]:
print(fee["diff"].head())

0    74014.0
1    -1144.0
2    29477.5
3     1275.0
4    19110.5
Name: diff, dtype: float64
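
Since the CSV file is not reproduced here, the behavior of diff_money can be checked on a couple of hand-made rows. This sketch assumes that plain dictionaries can stand in for DataFrame rows, which works because the function only indexes the row by column name. The sample values are invented, and math.nan is used in place of the NumPy import to keep the example self-contained.

```python
import math
import re

pattern = re.compile(r'^\$\d*\.\d{2}$')

def diff_money(row, pattern):
    """Return Initial Cost minus Total Est. Fee, or NaN if either is malformed."""
    icost = row["Initial Cost"]
    tef = row["Total Est. Fee"]
    if pattern.match(icost) and pattern.match(tef):
        return float(icost.replace("$", "")) - float(tef.replace("$", ""))
    return math.nan

# Invented rows: one well-formed, one with a malformed fee (no dollar sign or cents)
good = {"Initial Cost": "$100.50", "Total Est. Fee": "$20.25"}
bad = {"Initial Cost": "$100.50", "Total Est. Fee": "20"}

print(diff_money(good, pattern))  # 80.25
print(diff_money(bad, pattern))   # nan
```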


### Custom functions to clean data
You'll now practice writing functions to clean data.

The tips dataset has been pre-loaded into a DataFrame called tips. It has a 'sex' column that contains the values 'Male' or 'Female'. Your job is to write a function that will recode 'Female' to 0, 'Male' to 1, and return np.nan for all entries of 'sex' that are neither 'Female' nor 'Male'.

Recoding variables like this is a common data cleaning task. Functions provide a mechanism for you to abstract away complex bits of code as well as reuse code. This makes your code more readable and less error prone.

As Dan showed you in the videos, you can use the .apply() method to apply a function across entire rows or columns of DataFrames. However, note that each column of a DataFrame is a pandas Series. Functions can also be applied across Series. Here, you will apply your function over the 'sex' column.

- Define a function named recode_gender() that has one parameter: gender.
    - If gender equals 'Female', return 0.
    - Else, if gender equals 'Male', return 1.
    - If gender does not equal 'Male' or 'Female', return np.nan. NumPy has been pre-imported for you.
- Apply your recode_gender() function over tips.sex using the .apply() method to create a new column: 'recode'. Note that when passing in a function inside the .apply() method, you don't need to specify the parentheses after the function name.

In [54]:
tips = pd.read_csv("tips.csv")

In [55]:
def recode_gender(gender):
    # Return 0 if gender is 'Female'
    if gender == "Female":
        return 0

    # Return 1 if gender is 'Male'
    elif gender == "Male":
        return 1

    # Return NaN for any other value
    else:
        return NaN

In [56]:
# Apply the function to the sex column
tips['recode'] = tips.sex.apply(recode_gender)

In [57]:
print(tips.head())

   total_bill   tip     sex smoker  day    time  size  recode
0       16.99  1.01  Female     No  Sun  Dinner     2       0
1       10.34  1.66    Male     No  Sun  Dinner     3       1
2       21.01  3.50    Male     No  Sun  Dinner     3       1
3       23.68  3.31    Male     No  Sun  Dinner     2       1
4       24.59  3.61  Female     No  Sun  Dinner     4       0


In [42]:
print(tips.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 8 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null object
smoker        244 non-null object
day           244 non-null object
time          244 non-null object
size          244 non-null int64
recode        244 non-null int64
dtypes: float64(2), int64(2), object(4)
memory usage: 15.4+ KB
None
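
For a simple lookup like this one, passing a dictionary to the Series .map() method is an alternative to a custom function: values missing from the dictionary come back as NaN. This is a minimal sketch on a made-up Series, not the tips data, and it is not the approach the exercise asks for.

```python
import pandas as pd

# Made-up values, including one that is neither 'Female' nor 'Male'
sex = pd.Series(['Female', 'Male', 'Unknown'])

# Keys absent from the mapping are recoded to NaN
recode = sex.map({'Female': 0, 'Male': 1})
print(recode.tolist())  # [0.0, 1.0, nan]
```

Because NaN is a float, the resulting column is float64 rather than int64 whenever at least one value falls outside the mapping.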


## Lambda functions
You'll now be introduced to a powerful Python feature that will help you clean your data more effectively: lambda functions. Instead of using the def syntax that you used in the previous exercise, lambda functions let you make simple, one-line functions.

For example, here's a function that squares a variable used in an .apply() method:

In [None]:
# Hypothetical numeric DataFrame, used only for illustration
df = pd.DataFrame({"a": [1, 2, 3]})

def my_square(x):
    return x ** 2

df.apply(my_square)

# The equivalent code using a lambda function is:
df.apply(lambda x: x ** 2)

The lambda function takes one parameter - the variable x. The function itself just squares x and returns the result, which is whatever the one line of code evaluates to. In this way, lambda functions can make your code concise and Pythonic.

The tips dataset has been pre-loaded into a DataFrame called tips. Your job is to clean its 'total_dollar' column by removing the dollar sign. You'll do this using two different methods: With the .replace() method, and with regular expressions. The regular expression module re has been pre-imported.

- Use the .replace() method inside a lambda function to remove the dollar sign from the 'total_dollar' column of tips.
    - You need to specify two arguments to the .replace() method: The string to be replaced ('$'), and the string to replace it by ('').
    - Apply the lambda function over the 'total_dollar' column of tips.
- Use a regular expression to remove the dollar sign from the 'total_dollar' column of tips.
    - The pattern has been provided for you: It is the first argument of the re.findall() function.
    - Complete the rest of the lambda function and apply it over the 'total_dollar' column of tips. Notice that because re.findall() returns a list, you have to index it (e.g. using [0]) in order to access the actual value.

In [58]:
tips["total_dollar"] = tips.total_bill.apply(lambda x: "$" + str(x))

In [61]:
tips.head()

   total_bill   tip     sex smoker  day    time  size  recode  total_dollar
0       16.99  1.01  Female     No  Sun  Dinner     2       0        $16.99
1       10.34  1.66    Male     No  Sun  Dinner     3       1        $10.34
2       21.01  3.50    Male     No  Sun  Dinner     3       1        $21.01
3       23.68  3.31    Male     No  Sun  Dinner     2       1        $23.68
4       24.59  3.61  Female     No  Sun  Dinner     4       0        $24.59


In [62]:
# Write the lambda function using replace
tips['total_dollar_replace'] = tips.total_dollar.apply(lambda x: x.replace('$', ''))

In [63]:
tips.head()

   total_bill   tip     sex smoker  day    time  size  recode  total_dollar  total_dollar_replace
0       16.99  1.01  Female     No  Sun  Dinner     2       0        $16.99                 16.99
1       10.34  1.66    Male     No  Sun  Dinner     3       1        $10.34                 10.34
2       21.01  3.50    Male     No  Sun  Dinner     3       1        $21.01                 21.01
3       23.68  3.31    Male     No  Sun  Dinner     2       1        $23.68                 23.68
4       24.59  3.61  Female     No  Sun  Dinner     4       0        $24.59                 24.59


In [64]:
# Write the lambda function using regular expressions
tips['total_dollar_re'] = tips.total_dollar.apply(lambda x: re.findall(r'\d+\.\d+', x)[0])

In [65]:
tips.head()

   total_bill   tip     sex smoker  day    time  size  recode  total_dollar  total_dollar_replace  total_dollar_re
0       16.99  1.01  Female     No  Sun  Dinner     2       0        $16.99                 16.99            16.99
1       10.34  1.66    Male     No  Sun  Dinner     3       1        $10.34                 10.34            10.34
2       21.01  3.50    Male     No  Sun  Dinner     3       1        $21.01                 21.01            21.01
3       23.68  3.31    Male     No  Sun  Dinner     2       1        $23.68                 23.68            23.68
4       24.59  3.61  Female     No  Sun  Dinner     4       0        $24.59                 24.59            24.59
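
One caveat with indexing the result of re.findall() by [0]: a value with no match produces an empty list and raises an IndexError. Below is a hedged sketch of a guard, where first_amount is a hypothetical helper name rather than anything from the exercise.

```python
import re

amount = re.compile(r'\d+\.\d+')

def first_amount(text):
    """Return the first decimal number in text as a string, or None if absent."""
    found = amount.findall(text)
    return found[0] if found else None

print(first_amount('$16.99'))  # 16.99
print(first_amount('n/a'))     # None
```

Assuming every entry in the column is a string, the helper could be passed to .apply() in place of the lambda, for example tips.total_dollar.apply(first_amount).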
