# Introduction to Regular Expressions

Regular expressions, or regexes, are patterns written in a condensed formatting language. You give a pattern to a regex processor along with some source data. The processor then parses that source data using the pattern and returns chunks of text to the data scientist or programmer for further manipulation.

The main reasons you would want to do this are:

- to check whether a pattern exists within some source data

- to get all instances of a complex pattern from some source data

- to clean your source data using a pattern, generally through string splitting.

Regexes are not trivial, but they are a foundational technique for data cleaning in data science applications. A solid understanding of regexes will help you manipulate text data quickly and efficiently for further data science work.

First we'll import the re module, which is Python's standard library module for regular expressions.


In [1]:
import re

There are several main processing functions in re that you might use:
- match(pattern, string, flags=0) checks for a match at the beginning of the string and returns a match object, or None if there is no match.
- search(pattern, string, flags=0) checks for a match anywhere in the string and returns a match object for the first one found, or None if there is no match.
- findall(pattern, string, flags=0) looks for a pattern and returns a list of all non-overlapping occurrences.
- split(pattern, string, maxsplit=0, flags=0) splits the string at each occurrence of the pattern and returns the pieces as a list.

In [2]:
#creating a text for an example
text = "This is a good day."
print('text: ' + text)
print('----- MATCH ----')
print(re.match("This", text))
print(re.match("this", text))
print(re.match("this is", text))
print('----- SEARCH ----')
print(re.search("good", text))
print(re.search("a good day", text))
print(re.search("bad", text))

text2 = "Amy works diligently. Amy gets good grades. Our student Amy is successful."
print('text: ' + text2)
print('----- FINDALL ----')
print(re.findall("Amy", text2))
print(re.findall("good", text2))
print(re.findall("day", text2))
print('----- SPLIT ----')
print(re.split("Amy", text2))
print(re.split("this", text2))

text: This is a good day.
----- MATCH ----
<_sre.SRE_Match object; span=(0, 4), match='This'>
None
None
----- SEARCH ----
<_sre.SRE_Match object; span=(10, 14), match='good'>
<_sre.SRE_Match object; span=(8, 18), match='a good day'>
None
text: Amy works diligently. Amy gets good grades. Our student Amy is successful.
----- FINDALL ----
['Amy', 'Amy', 'Amy']
['good']
[]
----- SPLIT ----
['', ' works diligently. ', ' gets good grades. Our student ', ' is successful.']
['Amy works diligently. Amy gets good grades. Our student Amy is successful.']
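Because match and search hand back a match object on success and None on failure, the result can be used directly as a condition. The following is a minimal sketch of that idea, reusing the example sentence from above; the variable names `found` and `missing` are only illustrative.

```python
import re

text = "This is a good day."

# A match object is truthy and None is falsy, so the result of
# re.search can be tested directly in an if statement.
found = re.search("good", text)
if found:
    # group() returns the matched text, span() its (start, end) position
    print("Found", found.group(), "at", found.span())
else:
    print("No match")

# A pattern that does not occur anywhere gives None.
missing = re.search("bad", text)
print(missing is None)
```

The match object also carries the position of the match, which is handy when you need to slice the original string around it.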


# Patterns and Character Classes

Let's imagine we have a string containing a single learner's grades across all of their assignments in one course over a semester. We might ask:
1. How many B's were in the grade list?
2. How many A's or B's were in the grade list?
3. How many non-A grades were in the grade list?

For the second question we can't use "AB", since that pattern matches an A followed immediately by a B. Instead, we put the characters A and B inside square brackets, which matches any single character from the set.

In [4]:
print('----- GRADES -----')
grades="ACAAAABCBCBAA"
print(grades)
print('----- (1) -----')
print(re.findall("B",grades))
print('----- (2) using set operator -----') # "AB" would only match an A followed immediately by a B
print(re.findall("[AB]",grades)) # square brackets match any single character in the set
print('----- (3) -----')
print(re.findall("[^A]",grades)) # a leading ^ inside the brackets negates the set

----- GRADES -----
ACAAAABCBCBAA
----- (1) -----
['B', 'B', 'B']
----- (2) using set operator -----
['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A']
----- (3) -----
['C', 'B', 'C', 'B', 'C', 'B']
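Since each question asks "how many", one way to get the numbers directly is to take the length of the list that findall returns. The sketch below assumes the same grades string as above, and also shows what the plain "AB" pattern matches for contrast; the variable names are only illustrative.

```python
import re

grades = "ACAAAABCBCBAA"

count_b = len(re.findall("B", grades))          # (1) number of B's
count_a_or_b = len(re.findall("[AB]", grades))  # (2) number of A's or B's
count_non_a = len(re.findall("[^A]", grades))   # (3) number of grades that are not A
print(count_b, count_a_or_b, count_non_a)       # 3 10 6

# Without brackets, "AB" only matches an A followed immediately by a B.
a_then_b = re.findall("AB", grades)
print(a_then_b)                                 # ['AB']
```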


# Quantifiers
Quantifiers specify the number of times a pattern must be repeated in order to match.

The most basic quantifier is expressed as e{m,n}, where e is the expression or character we are matching, m is the minimum number of times you want it to be matched, and n is the maximum number of times the item could be matched.

1. How many times has this student been on a back-to-back A's streak?

In [6]:
print('----- (1) -----')
print(re.findall("A{2,10}",grades)) # 2 as our min and 10 as our max: two streaks, one of four A's and one of two A's
print(re.findall("A{1,1}A{1,1}",grades)) # the same idea using single values and just repeating the pattern
print(re.findall("AA",grades)) # if we don't include a quantifier then the default is {1,1}
print(re.findall("A{2}",grades)) # if you just have one number in the braces, it's considered to be both m and n

----- (1) -----
['AAAA', 'AA']
['AA', 'AA', 'AA']
['AA', 'AA', 'AA']
['AA', 'AA', 'AA']


The first pattern looks for anywhere from two to ten A's in a row, so it sees the four A's as a single streak. The second pattern looks for exactly two A's back to back, so it splits the run of four into two A's followed immediately by two more A's.
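To make the streak question concrete, the list of matches can be post-processed with ordinary Python. The following is a small sketch, assuming the same grades string as above. It uses the open-ended form A{2,}, which Python's re module accepts to mean "two or more", and the variable names are only illustrative.

```python
import re

grades = "ACAAAABCBCBAA"

# A{2,} leaves the maximum open: two or more A's in a row.
streaks = re.findall("A{2,}", grades)
print(streaks)  # ['AAAA', 'AA']

# Number of streaks and the length of the longest one.
streak_count = len(streaks)
longest = max(len(s) for s in streaks)
print(streak_count, longest)  # 2 4
```

Leaving the upper bound open avoids having to guess a maximum such as ten in advance.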

Some other quantifiers are used as shorthand: an asterisk * to match zero or more times, a question mark ? to match zero or one time, and a plus sign + to match one or more times. Let's look at a more complex example and load some data scraped from Wikipedia.

## Example
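Before turning to that larger example, here is a quick sketch of the three shorthand quantifiers applied to the grades string used earlier. In Python's re module, + means one or more, ? means zero or one, and * means zero or more. The patterns themselves are chosen purely for illustration.

```python
import re

grades = "ACAAAABCBCBAA"

# + : one or more A's in a row
one_or_more = re.findall("A+", grades)
print(one_or_more)   # ['A', 'AAAA', 'AA']

# ? : an A optionally followed by a single B
zero_or_one = re.findall("AB?", grades)
print(zero_or_one)   # ['A', 'A', 'A', 'A', 'AB', 'A', 'A']

# * : a B followed by zero or more C's
zero_or_more = re.findall("BC*", grades)
print(zero_or_more)  # ['BC', 'BC', 'B']
```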