In [1]:
import pandas as pd

## Regular Expressions
Regular expressions are useful when you want to search for, pull out, remove, and/or edit parts of text whose characters share some *pattern*. For instance, what do all of these strings have in common?
- abc123xyz
- define "123"
- var g = 123;

Let's look at what we can do with regular expressions, using Python's built-in **re** module:

In [62]:
# Creating a data frame so we can put this in a context we're familiar with
df = pd.DataFrame([['abc123xyz'], ['define "123"'], ['var g = 123']], columns=['text'])
df

           text
0     abc123xyz
1  define "123"
2   var g = 123


In [None]:
import re

In [21]:
# to pull out the common characteristic, we can use "findall":
function = lambda x: re.findall('123', x)

df.text.apply(function)

0    [123]
1    [123]
2    [123]
Name: text, dtype: object

Notice the output is a list, because "findall" looks for *all* (non-overlapping) matches. If you want *just the string*, you need to index into the list (here every list is exactly 1 item long, so we'll take index 0):

In [22]:
function = lambda x: re.findall('123', x)[0]

df.text.apply(function)

0    123
1    123
2    123
Name: text, dtype: object
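The list form matters as soon as a string contains more than one match, or none at all. A minimal sketch on a made-up string (`sample` is an illustration, not part of the data frame above):

```python
import re

# Hypothetical string in which the pattern occurs twice
sample = 'abc123xyz123'

all_matches = re.findall('123', sample)  # every non-overlapping match
no_matches = re.findall('999', sample)   # no match gives an empty list

print(all_matches)  # ['123', '123']
print(no_matches)   # []
```

Because a string with no match returns an empty list, indexing with `[0]` would raise an `IndexError` there, so it is only safe when every row is known to match.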

But what if we don't want just the "123", but the whole string? This is where regular expressions become powerful: special characters give a pattern flexibility. (**Vocabulary:** regular characters are called "literals" and special characters are called "metacharacters" in the regex world.) We'll start with three of the most important ones:
- a period (" . ") represents *any character* (except a newline, by default)
- an asterisk (" \* ") represents *zero or more* repetitions of the preceding literal or metacharacter
- a plus (" + ") represents *one or more* repetitions of the preceding literal or metacharacter

In combination, " . \* " means zero or more repetitions of *any* character. So in the example above, we can put that before and after the "123" to tell "re.findall" that we want it to find and return the *entire* string (including everything before and after the "123"), like this:

In [38]:
# Find the entire string that contains "123" somewhere in it.
function = lambda x: re.findall('.*123.*', x)

df.text.apply(function)

0       [abc123xyz]
1    [define "123"]
2     [var g = 123]
Name: text, dtype: object

Notice we used the " \* " instead of the " + ". See for yourself what would have happened if you'd used the plus. Why did that happen?

In [32]:
# Try with a '+' instead of a '*'


But wait, what if I want to search for a " . " or a " \* " or a " + ", and not have RE think that it's a special character? Use a *backslash*:

In [36]:
# A raw string (r'...') passes the backslash to the regex engine unchanged
re.findall(r'\.', '.why are there dots?.')

['.', '.']
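The same backslash trick works for the other metacharacters. A small sketch, using made-up arithmetic strings as input:

```python
import re

# Raw strings (r'...') keep the backslash intact for the regex engine
plus_signs = re.findall(r'\+', '1+1=2')  # a literal plus sign
asterisks = re.findall(r'\*', '2*3*4')   # a literal asterisk

print(plus_signs)  # ['+']
print(asterisks)   # ['*', '*']
```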

Another important set of characters:
- square brackets (" [ ] "), which indicate a *set* of characters ("[abc]" matches any "a", "b", or "c"). Some common things to put in here are
    - "[a-z]", which matches any lowercase letter
    - "[A-Z]", which matches any capital letter
    - "[0-9]", which matches any digit
- carets (" ^ "), which have two different meanings:
    - Outside of square brackets, a caret means "at the beginning" ("^abc" searches for "abc" at the beginning of a string)
    - As the first character inside square brackets, it means *NOT* whatever else is inside the brackets ("[^abc]" matches anything that is not an "a", "b", or "c")
- dollar signs (" \$ "), which mean "at the end" ("abc$" searches for "abc" at the end of a string)

In [44]:
# find the strings "123" that are preceded by a letter. Return that letter and "123"
function = lambda x: re.findall('[A-Za-z]123', x)

df.text.apply(function)

0    [c123]
1        []
2        []
Name: text, dtype: object

In [45]:
# find the strings "123" that are preceded by something that is NOT a letter, and return that:
function = lambda x: re.findall('[^A-Za-z]123', x)

df.text.apply(function)

0        []
1    ["123]
2    [ 123]
Name: text, dtype: object

In [58]:
# find the strings "123" and ALL of the non-letters leading up to it:
function = lambda x: re.findall('[^A-Za-z]+123', x)

df.text.apply(function)

0          []
1     [ "123]
2    [ = 123]
Name: text, dtype: object

In [60]:
# find the strings that start with "a" or "d"  and return everything through the "123":
function = lambda x: re.findall('^[ad].*123', x)

df.text.apply(function)

0         [abc123]
1    [define "123]
2               []
Name: text, dtype: object

In [61]:
# find the strings "123" that are at the end of a string
function = lambda x: re.findall('123$', x)

df.text.apply(function)

0       []
1       []
2    [123]
Name: text, dtype: object

Another important pair of metacharacters is the parentheses " ( ) ", which mark a *group*. When a pattern contains a group, "re.findall" returns only the text matched inside the parentheses, so you can require a specific pattern *before or after* the text you want without returning it. What you want to return goes inside the parens:

In [102]:
# find the strings that start with "a" or "d" and return ONLY the stuff before the "123":
function = lambda x: re.findall('(^[ad].*)123', x)

df.text.apply(function)

0         [abc]
1    [define "]
2            []
Name: text, dtype: object
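A pattern can also contain more than one group, in which case "re.findall" returns a tuple per match. A minimal sketch on one of the strings from the data frame (the pattern itself is just an illustration):

```python
import re

# Two groups: a run of word characters before " = " and the digits after it
pairs = re.findall(r'(\w+) = (\d+)', 'var g = 123')

print(pairs)  # [('g', '123')]
```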

We're almost done!
- A question mark (" ? ") indicates that the preceding literal or metacharacter is optional (note: " ? " means something different when it appears in front of a set of characters, but we won't cover that today)
- Special backslash characters:
    - " \d " means *any digit*
    - " \D " means *any NON-digit*
    - " \w " means *any alphanumeric character* (or an underscore)
    - " \W " means *any NON-alphanumeric character* (anything " \w " does not match)
    - " \s " means *any whitespace character* (e.g. space, tab, newline, return, etc.)
        - " \t " is for horizontal (normal) tab
        - " \n " is for newline
        - " \r " is for return
        - " \f " is for "form feed" (like a page break)
        - " \v " is for vertical tab
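Here is a quick sketch of the digit, word-character, and optional-character patterns from the list above. The "color colour" string is a made-up example; the other inputs come from the data frame.

```python
import re

digits = re.findall(r'\d+', 'var g = 123')           # runs of digits
non_digits = re.findall(r'\D+', 'abc123xyz')         # runs of non-digits
word_chars = re.findall(r'\w+', 'define "123"')      # runs of word characters
optional_u = re.findall(r'colou?r', 'color colour')  # the "u" is optional

print(digits)      # ['123']
print(non_digits)  # ['abc', 'xyz']
print(word_chars)  # ['define', '123']
print(optional_u)  # ['color', 'colour']
```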

In [63]:
# find the strings that contain white space:
function = lambda x: re.findall(r'.*\s.*', x)

df.text.apply(function)

0                []
1    [define "123"]
2     [var g = 123]
Name: text, dtype: object

In [66]:
# Return the Non-alphanumeric characters:
function = lambda x: re.findall(r'\W', x)

df.text.apply(function)

0              []
1       [ , ", "]
2    [ ,  , =,  ]
Name: text, dtype: object

In [67]:
# Return "123" at the end of a string, optionally followed by one non-alphanumeric character:
function = lambda x: re.findall(r'123\W?$', x)

df.text.apply(function)

0        []
1    [123"]
2     [123]
Name: text, dtype: object

Now go to https://regexone.com/lesson/introduction_abcs and practice. Complete Exercises 1-11, and keep going if you feel like it! (Just do the exercises at the bottom; you don't need to read the text unless you really want to, since we covered it all here.)

## More things you can do with "re"

In [70]:
# Find and replace with re.sub():
re.sub("h[a-z]+", "vinegar", "sweet as honey")

'sweet as vinegar'
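"re.sub" also works with the groups introduced earlier. As a sketch (the replacement strings here are made up for illustration), `\1` in the replacement refers back to the first group, and the `count` argument limits how many matches get replaced:

```python
import re

# Wrap the matched digits in angle brackets by referring back to group 1
wrapped = re.sub(r'(\d+)', r'<\1>', 'abc123xyz')

# Replace only the first digit found
first_only = re.sub(r'\d', '#', 'abc123xyz', count=1)

print(wrapped)     # abc<123>xyz
print(first_only)  # abc#23xyz
```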

In [75]:
# Split on runs of non-alphanumeric characters with re.split():
re.split(r'\W+', 'Fee, Fi, Fo, Fum.')

['Fee', 'Fi', 'Fo', 'Fum', '']
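The empty string at the end appears because the final "." is itself a separator with nothing after it. One possible way to drop such empty pieces is a list comprehension, sketched here:

```python
import re

parts = re.split(r'\W+', 'Fee, Fi, Fo, Fum.')

# Keep only the non-empty pieces
words = [p for p in parts if p]

print(parts)  # ['Fee', 'Fi', 'Fo', 'Fum', '']
print(words)  # ['Fee', 'Fi', 'Fo', 'Fum']
```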

## Practice

Play around with https://regex101.com/ to try some things out.

In [143]:
# The following is a list of salaries (given as yearly or monthly amounts), but all as one string
# Step 1: Use RE to split the string into a list of salaries (i.e. ['$199,000 a year', '75,000 per year', etc.])
# call this list salary_list
string = "$199,000 a year, 75,000 per year, $45,000 per year, 50000 a year, $150,000 a year, 230,000 per year, 2000 per month, 6,183 a month, $150,000 a year, 100,000 a year, $160,000 a year"

In [144]:
salary_list = 

In [146]:
# Step 2: Use RE to get rid of the salaries that aren't yearly, by subbing them with an empty string, ''
# call this yearly_list. Hint: consider using a list comprehension or for-loop.

In [147]:
yearly_list = 

In [148]:
# Step 3: Use RE to extract the salary if it is given in a yearly amount
# (e.g. '$199,000 a year' turns into $199,000)
# call this better_list

In [151]:
better_list = 

In [None]:
# Step 4: Use RE to extract the salary in a numerical format, (e.g. $199,000 becomes 199000.0 as a float)
# call this best_list

In [167]:
best_list = 