## **Programming with Python for Data Science**


###**Lesson: Regular Expressions** 

#**Lesson University: Regular Expressions (part 1)**

![](https://drive.google.com/uc?export=view&id=16DhFYyorS2RLfO0_HN2lIoOQGslFQKEN)

Grab yourself a cup of coffee (or tea) and settle in; this lesson's a bit longer than most. If you don't have the time to work through it slowly, schedule a better time for it. If there's one thing that will help you parse and wrangle data the most, it's regular expressions.

We have seen some methods (functions attached to objects) on the string data type that provided some very handy capabilities. For example, split() transforms a string into a list of tokens. Similarly strip() and replace() give an easy way to remove or replace unwanted values. The string has so many useful features that it's always worth re-visiting its [documentation](https://docs.python.org/3.6/library/stdtypes.html#string-methods).

However, even with mastery of all those methods, there are some things that would be extremely tedious if that's all we had to work with. Let's take a look at some examples.

###**Words from Huck**
We will use the text from 'The Adventures of Huckleberry Finn' (from Project Gutenberg). The file is available on Moodle to download and use here.

Let's get the text into our notebook. We will use this text throughout the lesson. If your notebook gets disconnected, upload the huck.txt file again and then run this cell before continuing:


In [1]:
# you should upload huck.txt file first before running this cell
# you should already know how to upload from previous lessons

def read_huck():
  with open('huck.txt', 'r') as fd:
    txt = fd.read()
  return txt

BOOK_TEXT = read_huck()
idx = BOOK_TEXT.find("CHAPTER II.")
CHAPTER_ONE = BOOK_TEXT[0:idx].strip()

print(CHAPTER_ONE[0:250])

CHAPTER I.

YOU don't know about me without you have read a book by the name of The
Adventures of Tom Sawyer; but that ain't no matter.  That book was made
by Mr. Mark Twain, and he told the truth, mainly.  There was things
which he stretched, but ma


###**Finding Words**
When doing text analysis, the first task is to figure out how to get all the words from a passage of text.

Let's take a quick look at how using string's split() method to get all the "words" of a book falls short.

In [2]:
def find_words(text):
    words = text.split()
    uniq  = sorted(set(words))
    print(len(uniq))

find_words(BOOK_TEXT)

13104


 
We get 13,104 'words'. However, in this set, words with different case (You and you) will be considered different words. We can fix this quickly:

In [3]:
def get_uniq_words(text):
    words = text.split()
    words  = sorted(words)
    uniq = set([x.lower() for x in words])
    return uniq
    
print(len(get_uniq_words(BOOK_TEXT)))

12615


With this improvement, we get 12,615 words. However, if you inspect the contents of uniq (e.g. list(uniq)[0:20]), you will see a few issues:

* words have punctuation in them
* phrases like hi!--hi! (those without spaces) are treated as a single word

If we pre-clean the text (i.e. before we split()) by removing all punctuation except the single quote (so we don't split contractions -- e.g. ain't), we end up with 6,893 unique tokens and 6,326 tokens after case normalization (validation is left to the reader). However, that punctuation may be valuable in our analysis as well.
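
To make the two passes concrete, here is a minimal sketch on a made-up sample sentence (not the book text). It assumes that "removing" punctuation means replacing each punctuation character with a space, so that phrases like hi!--hi! fall apart into separate tokens; the names `sample`, `unwanted`, and `table` are only for illustration.

```python
import string

# Made-up sample text; the lesson itself works on BOOK_TEXT
sample = "Hi!--hi! You ain't seen the sugar-hogshead, Huck."

# Every punctuation character except the single quote
unwanted = string.punctuation.replace("'", "")

# Pass 1: replace each unwanted character with a space (an assumption;
# deleting the characters instead would glue hi!--hi! into hihi)
table = str.maketrans(unwanted, " " * len(unwanted))
cleaned = sample.translate(table)

# Pass 2: split on whitespace, then normalize the case
tokens = cleaned.split()
uniq = set(t.lower() for t in tokens)

print(tokens)
print(sorted(uniq))
```

Note that the hyphen in sugar-hogshead is lost along with the rest of the punctuation, which is one reason the punctuation may be worth keeping.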

#**Entering Regular Expressions**
As we have just seen, we need to do a double pass over the tokens, once to remove unwanted punctuation and another to split the text based on whitespace (i.e. str.split()). Although efficiency isn't always a goal, it becomes necessary as the datasets grow. However, the goal in this lesson is to see if we can do this by describing what we want to extract from the text, rather than writing a lot of code to do it.

The "tool" we are about to introduce, *regular expressions*, provides a language to make it 'easy' to extract patterns from text. You don't need to use them, but they become very handy to express what you want to extract, rather than writing the code to tell the computer how to do it. This is essentially the difference between imperative languages (like Python) and declarative languages (like SQL), although calling regular expressions a declarative language is a bit of a stretch. The regular expression capability is provided by the re module. You must import that module at the top of your Python code to use regular expressions:

```
import re
```

If the only thing you were interested in is tokenizing the text, then split() would be fine. But we want the words. Of course we have to define what it means to be a word. Let's say that a word is any *token* (a group of 1 or more characters) made up entirely of letters. With regular expressions, we can capture that definition using a pattern. Using regular expressions usually involves three basic steps:

1. Define the pattern:
```
pattern = '[A-Za-z]+'
```
* the pattern is always a string (hence, you need the quotes)
* in this case, we put what we are looking for inside [] brackets. These brackets hold groups of characters (called a character set or *character class*). So [A-Z] means ***match*** any uppercase letter ([a-z] matches any lowercase letter).
* the + means 1 or more of the previous pattern or the thing to its left (in this case the stuff inside the brackets).
* the bracket matches *unordered* characters -- regardless of the order of the characters inside the brackets.

We would describe this pattern as "one or more characters that are either upper or lower case letters".


The defined pattern would attempt to find any token that consists of all letters and any non letter (something that is **not** A-Za-z) would serve as a split or demarcation point.

2. Compile the pattern using the re.compile() method:
```
 pattern = '[A-Za-z]+'
 regex   = re.compile(pattern)
```

This creates a regular expression object that you can use to call different methods on. In this lesson we will only be looking at the regular expression findall method. The pattern is used to determine what to look for in a body of text.

3. Use the findall() method on the object returned by compile. It's a regular expression object:

```
import re 

def regex_find_words_demo(text):
  pattern = '[A-Za-z]+'          #1 create a pattern
  regex   = re.compile(pattern)  #2 compile it
  return regex.findall(text)     #3 return those tokens that match the pattern

a = regex_find_words_demo(BOOK_TEXT)
print(len(a))
```
Be sure to type and run this code (you should see 116,312)

In [4]:
# type&run the above example/exercise in this cell

> ***Coder's Log:*** This is another lesson where you need to run each code sample and understand what it did before moving on to the next one; otherwise it will be impossible to learn the nuances being taught.
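
Before trusting a count like 116,312, it helps to look at the actual tokens on a short piece of text. The following is a small sketch using a made-up sentence (the name `sample` is only for illustration) and the same three steps as above.

```python
import re

# Made-up sample; any short piece of text will do
sample = "YOU don't know about me, 1st of all."

pattern = '[A-Za-z]+'          # 1 create a pattern
regex   = re.compile(pattern)  # 2 compile it
tokens  = regex.findall(sample)  # 3 find every run of letters

print(tokens)
```

Every non-letter acts as a split point, so don't becomes don and t, and 1st becomes st. The tokens also keep their original case.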

One small fix is needed. The pattern matches both uppercase and lowercase letters, and findall() returns the words in their original case (it does NOT transform text). We need to be sure words like 'You' and 'you' are treated as the same word.

The normalization step is still needed with regular expressions. Let's fix that.

Type in the following (either using a new code cell or a previous one). When you run it, you should get 5,983 'words' -- any token that consists of all letters.

```
def get_uniq_wordset(words):
  return set([x.lower() for x in words])

a = get_uniq_wordset(regex_find_words_demo(BOOK_TEXT))
print(len(a))
```

Let's start adjusting the pattern to see how the number of words changes as we change the pattern. We will create a new function where we can pass in the pattern for the regular expression engine to use:

In [5]:
# type&run the above example/exercise in this cell

In [6]:
import re 
def regex_find_words(text, pattern):
  regex = re.compile(pattern)  #2  compile it
  return regex.findall(text)   #3  return those tokens that match the pattern

pattern = '[A-Za-z]+'
print(len(regex_find_words(BOOK_TEXT, pattern)))

116312


Now let's consider keeping tokens that have numbers in them (e.g. 1st) or those that are all numbers (e.g. 10 cents). We can extend our pattern to include numbers (we use the character class 0-9).

In [7]:
def pattern_demo():
    pattern = '[0-9A-Za-z]+'
    words = regex_find_words(BOOK_TEXT, pattern)
    uniq = get_uniq_wordset(words)
    
    print(len(words))
    print(len(uniq))

pattern_demo()


Now the total is 5,991 unique tokens. Can you use the set data type and the difference method to find what numbers are captured now with this new pattern?
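
As a hint for the set question, here is a sketch of the idea on a made-up sentence rather than on the book (so the result is not the answer for BOOK_TEXT). The variable names are only for illustration.

```python
import re

# Made-up sample text with a few numeric tokens
sample = "He paid 10 cents on the 1st and 25 more on the 2nd."

letters_only = set(re.compile(r'[A-Za-z]+').findall(sample))
with_digits  = set(re.compile(r'[0-9A-Za-z]+').findall(sample))

# Tokens found by the new pattern that the old pattern did not find
new_tokens = with_digits.difference(letters_only)
print(sorted(new_tokens))
```

The difference contains every token that has at least one digit in it, because the letters-only pattern could never have produced those.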

###**Getting Closer**
So why is this NOT the 6,326 that we got using split()? The issue is that with the regular expression we didn't capture any punctuation, including the single quote. So we need to add that in:
```
pattern = '[\'0-9A-Za-z]+'
```
Since the pattern is wrapped with single quotes you have to escape the single quote you want to find (i.e. \'). Alternatively, you could use double quotes:
```
pattern = "['0-9A-Za-z]+"
```
Once you run that pattern, you get the magical 6,326 match!
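
To see what adding the single quote changes, the sketch below runs both patterns on a made-up sentence (`sample` is only for illustration).

```python
import re

# Made-up sample with two contractions
sample = "I don't know; that ain't no matter."

no_quote   = re.compile("[0-9A-Za-z]+").findall(sample)
with_quote = re.compile("['0-9A-Za-z]+").findall(sample)

print(no_quote)
print(with_quote)
```

Without the quote in the character set, each contraction is split in two; with it, don't and ain't survive as single tokens.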

###**Raw Strings**
There's a small issue when a regular expression pattern contains any special 'commands' (called escape or control sequences). We have already been using escape sequences -- the '\n' is one of those special sequences that is replaced with a newline when it appears in a string. To tell Python NOT to interpret any escape sequences, you need to preface the string with an r -- which means a *raw* string. You can see how this affects the string evaluation in the following:


In [8]:
print( "Hello\nWorld\n", end='')
print(r"Hello\nWorld\n", end='')

Hello
World
Hello\nWorld\n

In the second print statement, the \n is printed as opposed to being interpreted to force a return or linefeed.
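
Another way to see the difference is to count characters. This short sketch (variable names are only for illustration) compares a regular string with its raw twin.

```python
plain = "a\nb"    # \n is interpreted: a, newline, b
raw   = r"a\nb"   # nothing is interpreted: a, backslash, n, b

print(len(plain))        # 3
print(len(raw))          # 4
print(raw == "a\\nb")    # True: a raw \n is a backslash followed by n
```

The raw string keeps the backslash as a real character, which is exactly what the regular expression engine needs to see.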

The same rule applies to Unicode escape sequences: in a regular string they are replaced by the characters they name, while in a raw string they are left exactly as typed. We will discuss Unicode in another lesson. But you can think of it as a way to provide a representation of all possible characters. Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

In [9]:
print( "Hello\nWorld\n", end='')
print(r'\U0001f441\U00002764\U0000FE0F\U0001f40d')
print('\U0001f441\U00002764\U0000FE0F\U0001f40d')

Hello
World
\U0001f441\U00002764\U0000FE0F\U0001f40d
👁❤️🐍


Once again, you should know how the last two statements are treated differently when one of the strings is prefaced with r.

With regular expressions we ALWAYS use raw strings.
```
pattern = r'[\'0-9A-Za-z]+'
```
We now have 6,326 unique words (after case normalization). Yikes. That's where we were with split() and some preprocessing! However, the story is not over.

###**Back to Huck Finn.**
At this point, we need to inspect the tokens and decide if we are getting the right values. For Huckleberry Finn, there are a lot of contractions and hyphenated words.

For example here's the text starting at line 5947:
```
My breff mos' hop outer me; en I feel so--so--I doan' know HOW I feel. I crope out, all a-tremblin', en crope aroun' en open de do' easy en slow, en poke my head in behine de chile, sof' en still, en all uv a sudden I says POW!
```
> ***Reader's Log:*** You can use the sparknotes service to help translate this passage: https://www.sparknotes.com/nofear/lit/huckfinn/chapter-23/page_3/

For this specific text, any word with a hyphen (e.g. sugar-hogshead) was split into two (because we didn't include the hyphen in the regular expression). But we also want to split words separated by two hyphens (e.g. Polly--Tom's).

The issue is that in our pattern we are *excluding* words with a single hyphen in them. Let's find them using the following pattern:

In [10]:
pattern = r"['A-Za-z0-9]+[-]['A-Za-z0-9]+"
print(len(regex_find_words(BOOK_TEXT, pattern)))

971


 
This finds all the words that have a SINGLE hyphen ([-]) with at least one character (a letter, a digit, or a single quote) before it and at least one after it. The cell above counts all 971 matches; among them there are 604 unique words.
Doing that without using regular expressions would be very tedious.

We also can make that single hyphen optional by using the ? (a special character that means 0 or 1 of the previous pattern).

With a few quick changes, we can quickly see the power of using a regular expression to find different token patterns in the text.

In [11]:
pattern = r"['0-9A-Za-z]+-?['0-9A-Za-z]+"
print(len(regex_find_words(BOOK_TEXT, pattern)))

104763


We now have 6,678 unique tokens (the cell above prints the total number of matches, 104,763).

Note: since the hyphen (-) is a single character, we do not have to enclose it in brackets. Instead of [-]? we can just use -? within the pattern.

Don't worry, after a while reading the patterns becomes much easier. The hardest thing to understand is the +-? in the middle. Here's how you would read the pattern:

"1 or more (that's the +) characters that can be a single quote, a letter, or a number; FOLLOWED by a hyphen (-) that is optional (?) FOLLOWED by 1 or more (the very last +) characters that are either a single quote, a letter or a number."

The remaining issue is that this pattern forces all words to be at least 2 characters long. We lose all the single character tokens (e.g. a, 4, 3, o). We can fix that in the next lesson.
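
The two-character minimum is easy to see on a short, made-up sample (`sample` is only for illustration):

```python
import re

pattern = r"['0-9A-Za-z]+-?['0-9A-Za-z]+"
regex   = re.compile(pattern)

# Made-up sample with single-character tokens and both kinds of hyphens
sample = "I see a sugar-hogshead en 4 so--so"
tokens = regex.findall(sample)

print(tokens)
```

The single hyphen keeps sugar-hogshead together and the double hyphen still splits so--so, but the tokens I, a, and 4 are dropped because each half of the pattern demands at least one character.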

#**A Few Regular Expression Mechanics**
We now have seen enough to realize there's probably a lot of mechanics to learn about using regular expressions. You won't have to ever memorize them, but you should know what you can do. You can always look up the syntax later.

###**Specific Sequences**
If you wanted to find a specific string, you can just specify the exact order:
```
pattern = r"Aunt-Polly"
```
This pattern would read "find the word Aunt followed by a dash and then followed by the word Polly".

###**Character Sets**
The **square brackets** [] are used to hold multiple characters or character sets that can occur *in any order*.

* [abc] matches a or b or c
* [abc]+ matches any combination of the letters: a, b, c
```
pattern = r"P[hoe]+l"
```
This pattern would find words that have a capital P followed by any combination of h,o,e followed by an l.

This pattern would match parts of **Pol**ly and **Phel**ps. Do you see why?
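
You can check this on a made-up sentence (the text below is only for illustration, not a line from the book):

```python
import re

regex = re.compile(r"P[hoe]+l")

# Pl and Paul are included to show what does NOT match
matches = regex.findall("Aunt Polly and Mr. Phelps met Pl and Paul")

print(matches)
```

Pl fails because the + requires at least one of h, o, or e after the P, and Paul fails because a is not in the character set.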

###**Matching Specific Counts of Characters**
We already have seen the + (1 or more of the previous character set). The following shows how to specify the number of match counts that can be used after a pattern:
```
?     0 or 1 time
*     0 or more times
+     1 or more times
{m}   m times
{m,}  at least m times
{,n}  0 through n times (inclusive)
{m,n} m through n times (inclusive)
```
The following pattern specifies that the match must include two or more l's:

```
pattern = r"P[hoe]+l{2,}"
```
We'll see some more examples of these soon.
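
In the meantime, here is a small sketch that applies each count to the letter l on a made-up string. The names `sample`, `patterns`, and `counts_demo` are only for illustration; in every pattern the count applies only to the l, the character directly to its left.

```python
import re

sample = "Pol Poll Polll Pl"   # made-up text for illustration

patterns = [r"Pol?", r"Pol*", r"Pol+", r"Pol{2}",
            r"Pol{2,}", r"Pol{,2}", r"Pol{1,2}"]

counts_demo = {}
for pattern in patterns:
    # findall returns every non-overlapping match, left to right
    counts_demo[pattern] = re.compile(pattern).findall(sample)
    print(pattern, counts_demo[pattern])
```

Pl never matches because every pattern requires the o. Comparing the lines for {2} and {2,} on Polll shows the difference between exactly two and at least two.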

###**Character Classes and Special Symbols**
The following can be used to specify matching a character or a set of characters:
```
.  match any character except \n
\. match the period
\? match the question mark
\s match a whitespace character (\s+ matches one or more)
\S match a non-whitespace character
\d match a digit (same as [0-9] for ASCII text)
\D match a non-digit (same as [^0-9] for ASCII text)
\w match a word character (same as [a-zA-Z0-9_] for ASCII text)
\W match a non-word character (same as [^a-zA-Z0-9_] for ASCII text)
\' match a single quote
\" match a double quote
```
###**Example**
As an example, the pattern .o{2}.[ed] will match any character (the .) followed by 2 o's (o{2}) followed by any character (.) and then followed by either an e or a d ([ed]).
So this pattern would match: looke, hoose, cooke. Note that these are most likely partial word matches. But that's correct since we didn't specify any white space or word boundaries (to be discussed later).
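
Here is the same pattern run on a made-up sentence (only for illustration), which shows the partial word matches:

```python
import re

regex   = re.compile(r".o{2}.[ed]")
matches = regex.findall("He looked; we choose a cooked goose.")

print(matches)
```

Only goose is matched as a whole word; the other three matches are pieces of longer words.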

###**Special Characters**
In a character set (the square brackets) any character in the brackets is a literal (meaning it doesn't represent something else). However, there are four characters that are exceptions to this:
1. ^
2. -
3. ]
4. \\

In other words, if you wanted to match a caret ^ you would have to escape it (e.g. [\^abc]) using the backslash \ .

###**The Anti-Match**
If you want to match anything BUT a specific character class, you add the caret ^ as the first item in square brackets:

[^abc] matches anything BUT a or b or c. The caret 'negates' everything that follows.

This shows why the ^ is considered a special character when used inside brackets.
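
A short sketch of both uses of the caret, on made-up strings (the variable names are only for illustration):

```python
import re

# Caret first: match anything BUT a, b, or c
negated = re.compile(r"[^abc]").findall("abcdef")

# Escaped caret: match a literal ^ (or a, b, c)
literal = re.compile(r"[\^abc]").findall("a^b-c")

print(negated)
print(literal)
```

In the second result the hyphen is skipped because it is not in the set, while the caret is matched like any other character.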

###**Simplification**
As we have seen, the regular expression pattern can get a bit long and we are always striving to keep the pattern as short and readable as possible. We can clean up the pattern by telling the compiler of the regular expression to ignore case:

In [12]:
def find_words_v1(text):
  pattern = '[_0-9a-z]+'
  regex   = re.compile(pattern, re.IGNORECASE)
  return regex.findall(text)
  
print(len(find_words_v1(BOOK_TEXT)))

116339


 
Since we want to ignore the case (i.e. case insensitive) for the entire pattern we just pass the re.IGNORECASE flag to the compiler.

This is such a popular pattern that Python provides a special sequence (\w) that represents the character set above; it matches both uppercase and lowercase letters without needing the flag. Either create a new code cell or add the code to a previous one:

```
def find_words_v2(text):
  pattern = r'\w+'
  regex   = re.compile(pattern)
  return regex.findall(text)
  
```
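
As a quick check, the sketch below compares the explicit character set with \w on a made-up ASCII sentence (`sample`, `explicit`, and `shorthand` are only for illustration). For text with accented or other non-ASCII letters the two can differ, since \w also matches those.

```python
import re

sample = "Some _families_ paid 10 cents"  # made-up ASCII text

explicit  = re.compile('[_0-9a-z]+', re.IGNORECASE).findall(sample)
shorthand = re.compile(r'\w+').findall(sample)

print(explicit)
print(shorthand)
print(explicit == shorthand)
```

Both patterns keep the underscores attached to _families_, since the underscore counts as a word character.
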
So we can shorten our final pattern for Huckleberry Finn tokens as:
```
def find_words_v3(text):
  pattern = r"['\da-z]+-?['\da-z]+"
  regex   = re.compile(pattern, re.IGNORECASE)
  return regex.findall(text)
  
```

###**Greedy Matching**
One thing (among many) to remember is that the regular expression engine will try to match the longest string possible. It's called greedy matching. You can change that behavior, but we will save that for another lesson. So if you have the pattern:

In [13]:
def find_words_v4(text):
  pattern = r'ab.*'
  reg_ex = re.compile(pattern, re.IGNORECASE)
  return reg_ex.findall(text)

text = "Abra abracadabra"
print(find_words_v4(text))

['Abra abracadabra']


This will match the entire string (and NOT 3 different 'ab' substrings). In general, if you have .* in your regular expression, it will most likely match more than you want it to. In almost all cases, the greed will harm you.
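
One way to stay out of trouble, using only what this lesson covers, is to replace the . with a narrower character set. The sketch below (variable names are only for illustration) compares two alternatives on the same text.

```python
import re

text = "Abra abracadabra"

# Just the two letters: finds each 'ab' on its own
just_ab = re.compile(r'ab', re.IGNORECASE).findall(text)

# 'ab' followed by letters only: still greedy, but it stops at the space
ab_words = re.compile(r'ab[a-z]*', re.IGNORECASE).findall(text)

print(just_ab)
print(ab_words)
```

The second pattern is still greedy (it takes all of abracadabra), but because the set excludes whitespace it cannot run past the end of a word.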

#**More By Example**
###**Finding Italicized**
As a data scientist, it's important to be very familiar with the data being processed. In this case after reading some of the raw text we notice that for this book, italicized words or phrases are encoded by surrounding the word with an underscore. In Huckleberry Finn for example (from Chapter 2, line 243) the text:
```
Some thought it would be good to kill the families of boys that told the secrets.
```
Gets encoded as follows:
```
Some thought it would be good to kill the _families_ of boys that told
```
Here's a quick example to find all italicized words (those that begin and end with an underscore):

In [14]:
def find_words_v5(text):
  pattern = r"_['A-Za-z0-9]+_"
  regex   = re.compile(pattern)
  return regex.findall(text)

uniq = get_uniq_wordset(find_words_v5(BOOK_TEXT))
print(len(uniq))


You should find 330 words that were emphasized in the book.

However, phrases such as "'Nough!--I \_own up!\_" would not be found. Can you see why?

###**Finding Digits**
To find all the tokens with only digits in them, we just update the pattern:

In [None]:
def find_words_v6(text):
  pattern = r"[0-9]+"
  regex   = re.compile(pattern)
  return regex.findall(text)
  
uniq = get_uniq_wordset(find_words_v6(BOOK_TEXT))
print(len(uniq))

We can now easily extract the 8 unique numeric tokens: ['200','300','25','10','1','2','3','4']

###**Experimenting**
When testing regular expressions, it's easier to work with a small sample of text to see if things are working or not. You can always extract a paragraph of text from your book and test (using set differences) between different patterns, what they match and what they don't.

For example:
```
sentence = "'Deed you _ain't!_  You never said no truer thing 'n that, you bet\nyou."
print(find_words_v3(sentence))
```

###**Additional Normalization**
You can also decide whether your tokens need further normalization after they are extracted. It usually depends on the project's goals and objectives and the regular expression used. For example, the following could be done with the results from findall():

* removing whitespace before or after the token
* case normalization
* replacing contractions with the fully spelled set of words (e.g. can't becomes cannot)
* deciding on a common spelling (e.g. can not becomes cannot)
* removing the plural (e.g. songs becomes song)
* fixing spelling errors
* stemming (a topic to be discussed in another lesson), which is similar to extracting the root of a word

We will leave all the tokens alone.

###**Before you go, you should know:**


* what is a regular expression?
* why are regular expressions useful?
* how do you create a regular expression in Python?
* what does findall return?
* what does . match?
* what does * match?
* what does \s match?
* what does \d match?
* when do you use the [] notation?

# ------------- Questions ----------------------

#**Lesson Assignment**
There is a lot to learn in this lesson. Be sure to **re-read** it and type&run all the examples.

For all the questions in this lesson, you don't need to consult external documentation (you can of course, but everything required to solve these puzzles is given to you).

**Notes:**

* If you already know regular expressions and perhaps know a different solution, you still MUST ONLY USE what is taught in this lesson. Otherwise, you may not pass the tests.
* Do NOT normalize the input or output. The tests are only looking at the results of the regular expression.
* Use https://regex101.com for an easier way to develop/debug a working regular expression (or see the Coder's Log on using Chrome's developer tools)
* Testing hints are given at the end
* **Do NOT use the 'or' symbol** (e.g the pipe: |) -- it's something that will be covered in the next lesson. For any question that asks to find 'this or that', you need to use a standard regular expression shown in this lesson.

The answer to each question is the result of using re.compile. All the questions will be using the text from Huckleberry Finn (BOOK_TEXT).

The first question is done for you to see how to format your answers.

###**Question 0: Total Sentences**
#####**Write the regular expression to find all the sentences.**
Assume that all sentences end with one of the following three punctuation marks: ? ! .

#####**Answer**

In [15]:
def q0():
  pattern = r'[^?.!]+[?.!]+'  
  return re.compile(pattern, re.IGNORECASE)

Which you read as "1 or more of any character that is NOT a terminator followed by at least one terminator. A terminator is one of (? . !)".

#####**Testing**
Once you have the question return the result of re.compile, you can test it as follows:

In [16]:
raw = read_huck()
reg_ex = q0()
result = reg_ex.findall(raw)
size = len(result)
print(size, result[size-3:]) #show the last 3

5960 ['"\n\nTom\'s most well now, and got his bullet around his neck on a watch-guard\nfor a watch, and is always seeing what time it is, and so there ain\'t\nnothing more to write about, and I am rotten glad of it, because if I\'d\na knowed what a trouble it was to make a book I wouldn\'t a tackled it,\nand ain\'t a-going to no more.', "  But I reckon I got to light out for the\nTerritory ahead of the rest, because Aunt Sally she's going to adopt me\nand sivilize me, and I can't stand it.", '  I been there before.']


You should get 5960 sentences (based on our definition).

If you wanted to capture a closing " that follows the end of a sentence, you would add the quote to the second character set:
```
pattern = r'[^?.!]+[?.!"]+'
```

The + at the end is what picks up any extra punctuation that ends some sentences (sentences that end like this!!!! or like this?"). Without it, only the first terminator would be captured:
```
pattern = r'[^?.!]+[?.!"]'
```
Note that the following sentence would be considered 2 sentences:
```
Mr. Kean played Richmond.
```

So that 5960 is an approximation. We can do better, but the regular expression required would become very complex and involve more mechanics that we need to learn.
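
You can confirm the Mr. Kean problem directly with the q0 pattern (the variable names below are only for illustration):

```python
import re

regex  = re.compile(r'[^?.!]+[?.!]+')
pieces = regex.findall("Mr. Kean played Richmond.")

print(len(pieces), pieces)
```

The period after Mr is treated as a terminator, so the single sentence comes back as two pieces.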

The best way to solve this is to FIRST define some sample text:


In [17]:
def test_q0():
   sample = "He pulled the lever all the way down to where it said full steam ahead. A bell rang. The motors made a grinding sound and the ferry began to move. The passengers were surprised because the captain was still on deck talking to the Man in the Yellow Hat. Who was running the boat? It was George!!!"
   reg_ex = q0()
   result = reg_ex.findall(sample)
   print(len(result))

In [18]:
test_q0()

6



You get 6 for an answer. You can verify that by hand. Once you have it working on sample text, then try it on the full text.

* all of your answers should be in the same format as q0.
* return a compiled regular expression (with any flag if necessary).
* the ONLY flag (if we use one) will be re.IGNORECASE. There are other flags, but those will be used in subsequent lessons.

Before asking for help, be sure to test each part of your answer. Test it with a few words, a short sentence, a long paragraph. Now that you know how to slice and dice a string, it's easy to extract sections of text.

###**Question 1: !!!**
#####**Define the regular expression to answer:**
How many times does an exclamation mark happen 3 times in a row?

hint: 3

In [23]:
def q1():
  pattern = r"!!!"  # fill me in
  return re.compile(pattern)

In [26]:
raw = read_huck()
reg_ex = q1()
result = reg_ex.findall(raw)
len(result)

3

###**Question 2: only numbers please**
#####**Define the regular expression to answer:**
How many tokens consist of only numbers?

hint: 8

In [27]:
def q2():
  pattern = r'\b[0-9]+\b'  # fill me in
  return re.compile(pattern)

In [28]:
raw = read_huck()
reg_ex = q2()
result = reg_ex.findall(raw)
len(result)

8

###**Question 3: cost of numbers**
#####**Define the regular expression to answer:**
How many tokens represent money (i.e. a value that starts with a $) ?

* the $ has special meaning, you will need to escape it.

In [29]:
def q3():
  pattern = r'\$[0-9]+'  # fill me in
  return re.compile(pattern)

In [30]:
raw = read_huck()
reg_ex = q3()
result = reg_ex.findall(raw)
len(result)

1

###**Question 4: boom--boom**
#####**Define the regular expression to answer:**
How many times is there a double dash within a word?

* A word consists of only letters.
* hint: 763

In [31]:
def q4():
  pattern = r'[a-zA-Z]+--[a-zA-Z]+'  # fill me in
  return re.compile(pattern)
  

In [32]:
raw = read_huck()
reg_ex = q4()
result = reg_ex.findall(raw)
len(result)

763

###**Question 5: any body or Anybody**
#####**Define the regular expression that finds:**
**any body anybody** or **Any body**.

* hint: 74

In [33]:
def q5():
  pattern = r'[Aa]ny ?body'  # A or a, then ny, an optional space, then body
  return re.compile(pattern)

In [34]:
raw = read_huck()
reg_ex = q5()
result = reg_ex.findall(raw)
len(result)

###**Question 6: quote me**
#####**Define a regular expression that finds:**
single quoted words that contain at least 3 letters.

* hint: 7
* a word is only letters

In [35]:
def q6():
  pattern = r"'[a-zA-Z]{3,}'"  # a single quote, 3 or more letters, a single quote
  return re.compile(pattern)
  

###**Question 7: the Dr., the Mr. and Mrs.**
#####**Define the regular expression that finds:**
any of the following (includes the period): Dr. Mr. Mrs. (that specific letter case as well)

* hint: 34

In [37]:
def q7():
  pattern = r'Dr\.|Mrs?\.'  # fill me in
  return re.compile(pattern)


In [38]:
raw = read_huck()
reg_ex = q7()
result = reg_ex.findall(raw)
len(result)

34

###**Question 8: Aw Shucks**
#####**Define a regular expression that finds:**
any word that contains huck

* this is a regular expression where you could use the greedy *
* a word consists of only letters
* capitalization is not relevant
* hint: 103

In [39]:
def q8():
  pattern = r"[a-z]*huck[a-z]*"  # zero or more letters, then huck, then zero or more letters
  return re.compile(pattern, re.IGNORECASE)  # capitalization is not relevant

In [40]:
raw = read_huck()
reg_ex = q8()
result = reg_ex.findall(raw)
len(result)

###**Question 9: he or she**
#####**Define a regular expression that finds:**
either **he** or **she**

* capitalization is not relevant
* each must be surrounded by whitespace so the 'he' in 'them' would not be found.
* hint: 2110

In [41]:
def q9():
  pattern = r"\ss?he\s"  # whitespace, an optional s, then he, then whitespace
  return re.compile(pattern, re.IGNORECASE)  # capitalization is not relevant

In [42]:
raw = read_huck()
reg_ex = q9()
result = reg_ex.findall(raw)
len(result)

###**Question 10: How many chapters?**
#####**Define the regular expressions to find:**
the chapter markers in Huck Finn.

* the result should contain 43 items.

In [43]:
def q10():
  pattern = r"CHAPTER [A-Z]+"  # the word CHAPTER, a space, then 1 or more capital letters
  return re.compile(pattern)

In [44]:
raw = read_huck()
reg_ex = q10()
result = reg_ex.findall(raw)
len(result)

##**Submission**

After implementing all the functions and testing them please download the notebook as "solution.py" and submit to gradescope under "Week10: UPY: RegEx1" assignment tab and Moodle.

**NOTES**

* Be sure to use the function names and parameter names as given. 
* Do NOT use your own function or parameter names. 
* Your file MUST be named "solution.py". 
* Comment out any lines of code and/or function calls to those functions that produce errors. If your solution has errors, then you have to work on them, but if there were any errors in the examples/exercises then comment them out before submitting to Gradescope.
* Grading cannot be performed if any of these are violated.

 
###**Readings**
Python Docs
* https://docs.python.org/3.6/library/re.html

Testing Frameworks

* https://www.regextester.com/97589
* https://pythex.org
* https://www.regular-expressions.info

CheatSheets

* https://cdn.activestate.com/wp-content/uploads/2020/03/Python-RegEx-Cheatsheet.pdf
* https://www.tutorialspoint.com/python/pdf/python_reg_expressions.pdf 