### Just Regular Expressions ###

Here we isolate the regular expression lesson from the ELIZA bot. Make sure you understand how regular expressions work as they are so so so very important. We will still use the 2019 State of the Union address as our data set -- a list with one entry per sentence in the speech ordered from Trump's first sentence `new_sentences[0]` to his last `new_sentences[-1]`.

In [None]:
# read in the speech
sentences = open("sotu_sentences.txt").readlines()

# use a list comprehension to strip out the newlines
new_sentences = [s.strip() for s in sentences]

**Regular expressions (second pass)**

As we mentioned in class, regular expressions constitute a language for specifying patterns in text. They are implemented in almost every programming language you will come across. Python supports a variety of activities involving regular expressions, including searching for patterns and making substitutions based on identified patterns. All of these features are available through a package called `re` (aptly named). 

At first, we are going to make use of the `search()` function in this package. It takes a pattern definition (a regular expression) and searches for it in a string, returning a match object for the first match it finds, or `None` if there is no match. You can use this in a list comprehension because a match object is treated as `True` in an `if` statement, while `None` is treated as `False`.
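To see what `search()` actually hands back, here is a minimal sketch. The two test strings are made up for illustration and are not taken from the speech.

```python
from re import search

# Two made-up strings, used only for illustration
hit = search("wall", "We will build a wall.")
miss = search("wall", "We meet tonight.")

print(hit)          # a match object describing the first match
print(hit.group())  # the text that matched
print(miss)         # None, because the pattern was not found

# This is why search() works inside an `if`: a match object counts as True, None as False
print(bool(hit), bool(miss))
```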

In [None]:
from re import search

**Literals** 

As a way of specifying patterns, let's start with so-called **"literals" -- these characters just match themselves.** For example, the literal *"wall"* matches the following sentences from Trump’s transcript. This should be equivalent to the results we had by using the operator `in` in our loop.

In [None]:
[s for s in new_sentences if search("wall",s)]

Often we don't care about the case of the text we are matching. We can add a flag to the regular expression to explicitly ignore the case of the literals we are searching for. Tweeting is not a particularly grammatical exercise, so this option might be good to add. It's a flag we can import from `re`.

In [None]:
from re import IGNORECASE

[s for s in new_sentences if search("wall",s,IGNORECASE)]

Replace the literal *"wall"* with a search for *"immigration"*. Do you find any sentences matching this pattern? What other searches like this might you do to highlight sentences about immigration, or other topics that this search suggests?

In [None]:
# your code here


Now, in the search for "wall" we also had "walls" turn up. If we were interested in "media" we might look for the **literal string** consisting of the letters m-e-d-i-a. See what we get...

In [None]:
[s for s in new_sentences if search("media",s)]

Hmm, well, that's not what we wanted at all. So, how do we get just the word "media"? Let's take a moment and write down the kinds of patterns in text we might be interested in finding, either from this speech or from your reporting or other work. What sorts of things do you have to specify? I mean word boundaries seem like a good idea so we can differentiate "media" from "immediate". What else do you need? Regular expressions describe useful patterns in text. What patterns have you needed in the past?

.

.

.

To get us to more complex searches, we need some way to express these ideas. And that's where metacharacters come to the rescue!

**In walk metacharacters**

Any character except for [ ]\\^\$.|?\*+( )\{ and \} can be used to specify a literal -- they match a single instance of themselves. The string *"wall"* represented a series of literals and to have a match we need to find a "w" followed by an "a" followed by an "l" and so on. The non-literals, on the other hand, are known as **metacharacters** and are used to specify much more complicated text patterns.

They help us specify "whitespace," word boundaries, sets or classes of literals, the beginning and end of a line, and various alternatives ("war" or "peace"). For example **^ represents the start of a line.** Let's look at what we get by searching the SOTU for a pattern that includes this character.

*(We are now going to preface our strings representing regular expressions with the letter `r`. We have seen `u` before to mean Unicode. Here `r` means a "raw" string. Basically it tells Python that every character is to be interpreted as it is. We have seen that `\n` in a string means a single character, newline. In a raw string, `\n` is interpreted as two characters, a backslash and an "n". This will be important and is really a way around the fact that both regular expressions and Python use the backslash to build special sequences, like Python's newline or a regular expression's shorthand for a word boundary. Sometimes Python's syntax gets in the way of that used by a regular expression and so the `r` turns that off.)*
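A quick way to convince yourself of the difference is to count characters. This is only a sketch. The `\b` shorthand for a word boundary is introduced later in this notebook, and the test string "the wall" is made up.

```python
from re import search

plain = "\n"   # one character: a newline
raw = r"\n"    # two characters: a backslash and the letter n
print(len(plain), len(raw))

# In a plain string Python turns "\b" into a backspace character before the
# regular expression engine ever sees it; the raw string keeps backslash + b
print(bool(search("\bwall\b", "the wall")))   # looks for backspace characters
print(bool(search(r"\bwall\b", "the wall")))  # looks for "wall" as a whole word
```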

In [None]:
[s for s in new_sentences if search(r"^I have",s)]

Try a new sentence starter and see what you find...

In [None]:
# your code here


**Some help**

To interpret what a regular expression is doing, we can use a special tool from [regexper.com](http://regexper.com/). It takes regular expressions and renders them graphically so you can get a better sense of how the machinery is functioning. For example, here is the display for our *"^I have"* example.

[Graphical view of the pattern "^I have"](https://regexper.com/#%5EI%20have)

You see a window where you can change the regular expression and then a graphical interpretation of what you've asked for. This is an extremely handy tool. (Notice that you don't have to include the quotes or the `r` in the regexper.com interface.) 

Now, if specifying the start of a line is important, having a special character for the end of a line is likely to be handy also. **The \$ represents the end of a line.** Consider the pattern *"it.\$"*. Here are the lines it matches in the SOTU. Do you notice anything odd here?

In [None]:
[s for s in new_sentences if search(r"it.$",s)]

We see two sentences that end in "it." but really end with the words "credit." and "await." Word breaks -- gotta introduce those. Ah, but notice that we also have one sentence that ends with "it?". 

What's going on here? **Well, the dot, ".", is a metacharacter and represents a wildcard.** It is used to refer to any character. So, *it.\$* will match lines that end in "it." or "it?" or even "it9" (should the transcriber be typing sloppily one day). Have a look at this at regexper.com.

[Graphical view of the pattern "it.\$"](http://regexper.com/#it.%24)

Putting a backslash \\ before one of the special metacharacters [ ] \\^\$.|?\*+()\{ and \} lets us include these in a pattern as literals -- in technical terms, we have "escaped" the special meaning of these characters and they return to their literal meanings. 

Consider the pattern "\\\\$2". With the backslash, \$ no longer means the end of a line. We have returned the dollar sign to its literal meaning and the following lines match from the Trump transcript. 

>Therefore, we recently imposed tariffs on $250 billion of Chinese goods — and now our Treasury is receiving billions of dollars a month from a country that never gave us a dime

And to bring the point home, look at the following.

[Graphical view of the pattern "\\\\$2"](http://regexper.com/#%5C%242)
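Here is a small sketch of the escaped and unescaped dollar sign side by side, using a shortened copy of the sentence quoted above.

```python
from re import search

# A shortened copy of the sentence quoted above
s = "Therefore, we recently imposed tariffs on $250 billion of Chinese goods"

# Escaped: a literal dollar sign followed by a 2
print(bool(search(r"\$2", s)))

# Unescaped: "$" still means end of line, and no "2" can come after the end
print(bool(search(r"$2", s)))
```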

So, given this, what do we need to do to match sentences ending with the word "country" followed by a period?

In [None]:
# your code here



**A character class matches a single character out of all the possibilities contained in brackets, [  ]** Within the brackets we offer choices. So it's one step up from a literal. We are not matching a single fixed character like a literal would; instead we are giving some options, and a match occurs if we find just one of them in our string. There are certain rules that apply when specifying these classes that we’ll get to in a second. Let's look at the pattern *"[Tt]onight"* and see what lines it matches in the transcript. Literally this means we match either "T" or "t"...

In [None]:
[s for s in new_sentences if search(r"[Tt]onight",s)]

[Graphical view of the pattern "[Tt]onight"](http://regexper.com/#%5BTt%5Donight)

Keep in mind that while there might be lots of options in the square brackets, we are only trying to match one character out of this group. The graphical display makes this clear. We'll talk about specifying more than one match in a few minutes.

In terms of the rules that work within character classes, you can specify a range of letters [a-z] or [A-Z] or numbers [0-9]. Keep in mind that the order within the character class doesn’t matter; it specifies a bag of characters from which we select one item. Let's look at the pattern *"[0-9] years"* and see which sentences it will match.

In [None]:
[s for s in new_sentences if search(r"[0-9] years",s)]

[Graphical view of the pattern "[0-9] years"](http://regexper.com/#%5B0-9%5D%20years)

It's important to keep in mind what's being matched here. The expression *"[0-9] years"* will match "5 years" in the sentence that started "CJ served 15 years in the Air Force". The expression wants a number then a space then the word "years". Get it? 

**When used at the beginning of a character class ^ is also a metacharacter and it indicates matching characters NOT in the indicated class.** So the pattern *"[^?.]\$"* will match sentences that don't end in a period or a question mark (you don't have to "escape" characters in a character class -- or between [ and ]). 

In [None]:
[s for s in new_sentences if search(r"[^?.]$",s)]

Here is the graphical representation of the expression.

[Graphical view of the pattern "[^?.]\$"](http://regexper.com/#%5B%5E%3F.%5D%24)

Continuing on our survey of metacharacters, the **vertical bar "|" translates to “or”**. We can use it to combine expressions, the subexpressions being called alternatives. The expression *"good|bad"* will match these lines from the transcript file.

In [None]:
[s for s in new_sentences if search(r"good|bad",s)]

We'll jump ahead just briefly and say that we can solve the problem of matching "goodness" and "badly" when all we want are the words "good" or "bad". Some collections of characters, some "character classes", are used so often that they are given special notation. For example `\w` is a word character. A related shorthand, `\b`, represents a "word boundary": it matches the position between a word character and a non-word character rather than a character itself. Here's a (not elegant but works) way to say you want "good" or "bad" alone.

In [None]:
[s for s in new_sentences if search(r"\bgood\b|\bbad\b",s)]

We will give a complete set of character classes represented in this way at the end of this section.

Returning to choices, of course we can join several alternatives together. Consider *"year|month|day"*.

In [None]:
[s for s in new_sentences if search(r"day|month|year",s)]

Again, here we might see a lot of matches to patterns like "birthday" or "someday". Both contain the literal *"day"* but they might not be what we had in mind. 

[Graphical view of the pattern "year|month|day"](http://regexper.com/#year%7Cmonth%7Cday)

Oh and the alternatives separated by "|" can be any regular expressions and not just literals. Here we ask for time or money.

In [None]:
[s for s in new_sentences if search(r"[0-9] year|\$[0-9]",s)]

And again, regexper.com to help us out.

[Graphical view of the pattern "[0-9] year|\$[0-9]"](https://regexper.com/#%5B0-9%5D%20year%7C%5C%24%5B0-9%5D)

**Subexpressions are often contained in parentheses (more metacharacters) to constrain the
alternatives in some way.** Think of this like the use of parentheses you encountered in HS algebra. You used parentheses to make sure that things were added before they were divided, etc. 

In the same way, the parentheses in regular expressions let you define groupings or subexpressions. For example *"^(I am|I have)"* matches either expression, but only at the start of a sentence. 

Later we will see that we can identify each subexpression separately, allowing us to extract (or capture) the content they match.

In [None]:
[s for s in new_sentences if search("^(I am|I have)",s)]

And the graphical representation -- notice the new reference to groups that are formed by the parentheses.

<a href=https://regexper.com/#%5E%28I%20have%7CI%20am%29>Graphical view of the pattern "^(I have|I am)"</a>

We're building up quite a vocabulary. Try a more complex expression on your own.

In [None]:
# Your code here


Before we leave this, let's use the shorthand \\b  for word boundaries and our parentheses construction to match what we really wanted with our original search a few cells back, *"\b(day|month|year)\b"*. Let's have a look.

In [None]:
[s for s in new_sentences if search(r"\b(day|month|year)\b",s)]

**The question mark indicates that the indicated expression is optional.** The expression *"George( W\\.)? Bush"* will match references to “George W. Bush” or just “George Bush”.

<a href=http://regexper.com/#George(%20W%5C.)%3F%20Bush>Graphical view of the pattern "George( W\\.)? Bush"</a>
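We can try the optional group on a few names. The list below is made up for illustration (these strings are not from the speech), so treat this as a sketch.

```python
from re import search

# Made-up test strings, not taken from the speech
names = ["George W. Bush", "George Bush", "George H. W. Bush"]

# The group " W\." (a space, a W and a literal period) may appear once or not at all
pattern = r"George( W\.)? Bush"

print([n for n in names if search(pattern, n)])
```

The third name is left out because " H." sits between "George" and the rest, and the pattern makes no allowance for it.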

**The \* and + signs are metacharacters used to indicate repetition**: the \* means “any number, including zero, of the item” and + means “at least one of the item”. So we can specify clauses after a colon with the following regular expression. It says look for a colon that is followed by a sequence of one or more characters (any characters because we are using the "." dot) until the end of the sentence. 

In [None]:
[s for s in new_sentences if search(r":.+$",s)]

**What, exactly, is being matched?** 

When we specify a pattern, it can be easy to misunderstand what, exactly, is being matched. The command `findall` in the `re` package makes this crystal clear. Given a pattern and a string, it returns all the portions of the string that match the pattern. 

Let's try it out. 

In [None]:
from re import findall

Here we take a sentence from the state of the union address and try matching some literals. `findall` will pull out all the matching strings. Warning, these won't be so exciting.

In [None]:
s = 'In 1996, at age 30, Matthew was sentenced to 35 years for selling drugs and related offenses.'

findall(r"en",s)

In [None]:
findall(r"at",s)

Notice in each case it just pulls out the literals that match. In this case we find the literal "en" twice in the word "sentenced" and once in "offenses." So we get three matches. Make sure you understand why we get three "at" matches.

Now, a character class like `[0-9]` means match one of zero through nine. So when we use this in `findall()` we get all the single digits as matches. Again `[0-9]` refers to matching just one member of the class, in this case a single number. 

In [None]:
findall(r"[0-9]",s)

The pattern first matches the 1 in 1996, and then the 9 in 1996, and then the second 9 in 1996, and then the 6 in 1996, and the 3 in 30 and so on.

Now, if we wanted to match one or more numbers like, say, the year `1996` (as opposed to the individual single numbers), we would use the `+` to represent multiplicity -- one or more times. The regular expression would be `[0-9]+` which literally translates into matching a sequence of digits (zero through nine) of length one or more. Here's what we get now.

In [None]:
findall(r"[0-9]+",s)

The other metacharacter representing multiplicity is the `*`. It means matching something **zero or more times.** Suppose we use `findall()` but now ask for `[0-9]*`, meaning give me as matches every sequence of zero or more numbers. This means that the word "In" that starts the sentence will be counted as two matches: the "I" will match "", the string with 0 numbers in it, as will "n". 

In [None]:
findall(r"[0-9]*",s)

Each character in the sentence `s` that is not a digit will create a match of "" because the pattern `[0-9]*` offers the possibility of a match not including any digits. But that match is just the empty string... no digits.

**The multiplicity metacharacters `*` and `+` are "greedy"** in the sense that they will match, in some sense, the largest pattern they can. As an example, the regular expression `,.+,` means "match a comma, then any character one or more times, and then a comma". It should help us find the content between commas. Here's what `findall()` produces.

In [None]:
s = 'Madam Speaker, Mr. Vice President, Members of Congress, the First Lady of the United States, and my fellow Americans: We meet tonight at a moment of unlimited potential.'

findall(r",.+,",s)

So, what it did was to match the material between the first comma (after "Madam Speaker") and the last comma (before "and my fellow Americans"). This is a greedy match. We can switch off the greed by following the multiplicity metacharacter with a `?` (this is the second use we've seen for this metacharacter, the first being to mark something as optional). Let's see what we get.

In [None]:
findall(r",.+?,",s)

See? We now have two matches instead of just one by making the `+` nongreedy. It finds the pattern in the string ", Mr. Vice President," and in ", the First Lady of the United States,". (It leaves out "Members of Congress" because it uses the comma after "Vice President" to form the first match and so it is not available to match around "Members of Congress".)

The use of regexper.com and `findall()` will help you understand what you are matching and why. 

**Finding data with more stakes.** Now, to make this seem a bit more practical, we are going to consider another data source. You can download some of it, but it's big. It's essentially all of Jeb Bush's emails while he was in office in Florida. I mean literally in `mbox` format. The site is shown here via the Internet Archive...
<br><br>
<img src=https://github.com/computationaljournalism/columbia2018/raw/master/images/jb2.jpg style="width: 65%; border: #000000 1px outset;"/>
<br><br>
... because this was such a bad idea, they took the site down quickly. 
<br><br>
<img src=https://github.com/computationaljournalism/columbia2018/raw/master/images/jb1.jpg style="width: 65%; border: #000000 1px outset;"/>
<br><br>
Why a bad idea? Well, no one masked any possibly sensitive data from the emails. So it's easy to find phone numbers or social security numbers or military ID's in these text files.

For example, to grab phone numbers, what regular expression might we use? 

Write your expression here



For social security numbers or phone numbers we could use this expression *"[0-9]+-[0-9]+-[0-9]+"*. Why? Here's a series of lines that match this pattern from Jeb's emails.

>Phone: 407-240-1891<br><br>In reference to your letter dated october 29, 1998 in which you offer to help me with my inmigration question, i am a us citizen who is petition for my husband (a mexican citizen) petition #SRC-98-204-50114 his name is FRANCISCO JAVIER CORTEZ HERNANDEZ.<br><br>Fax: 407-888-2445<br><br>Pager: 850-301-8072<br><br>Cell: 407-484-8167<br><br>The Reverned uses his pager# 813-303-4726 to get in contact with, or you may email
and I will get in touch with him. <br><br>

And from regexper.com... 

<a href=http://regexper.com/#%5B0-9%5D%2B-%5B0-9%5D%2B-%5B0-9%5D%2B>Graphical view of the pattern "[0-9]+-[0-9]+-[0-9]+"</a>

In words, we are looking for one or more numbers followed by a hyphen, followed by one or more numbers, and then another hyphen, and finally one or more numbers.

**The curly braces \{ and \} are referred to as interval quantifiers**: they let us specify the exact number of occurrences of a pattern, its minimum, maximum or an acceptable range. For a Social Security Number we might want "[0-9]{3}-[0-9]{2}-[0-9]{4}", for example: three numbers, a hyphen, two numbers, a hyphen, then four numbers.  

<a href=https://regexper.com/#%5B0-9%5D%7B3%7D-%5B0-9%5D%7B2%7D-%5B0-9%5D%7B4%7D>View of the pattern "[0-9]{3}-[0-9]{2}-[0-9]{4}"</a>

Here's the full list of what you can do with the curly braces as metacharacters. 
<table>
          <tr>
            <th>Expression</th>
            <th>What does it mean?</th>
          </tr>
          <tr>
            <td>{3}</td>
            <td>Looks for 3 occurrences of a pattern</td>
          </tr>
          <tr>
            <td>{,3}</td>
            <td>Matches at most 3 occurrences</td>
          </tr>
          <tr>
            <td>{3,}</td>
            <td>Matches at least 3 occurrences</td>
          </tr>
          <tr>
            <td>{3,5}</td>
            <td>Matches between 3 and 5 occurrences</td>
          </tr>
 </table>
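To see what the interval quantifiers buy us, here is a sketch comparing the loose pattern from earlier with a stricter one. The string joins two fragments from the email lines quoted above. The 3-3-4 digit layout for a phone number is our assumption, not something stated in the data.

```python
from re import findall

# Two fragments from the email lines quoted above, joined for illustration
s = "Phone: 407-240-1891 ... petition #SRC-98-204-50114"

loose = r"[0-9]+-[0-9]+-[0-9]+"
strict = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"   # assumed phone layout: 3 digits, 3 digits, 4 digits

print(findall(loose, s))    # picks up the petition number too
print(findall(strict, s))   # only the phone number fits the 3-3-4 layout
```

The stricter pattern can still match inside a longer run of digits, so in practice you might add word boundaries around it.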

With this information, how would we skim these emails for specific kinds of numbers? Credit card numbers? Phone numbers? VIN numbers?

**Subexpressions and groupings.** In most implementations of regular expressions, the parentheses not only limit the scope
of alternatives divided by a “|”, but also can be used to “remember” text matched by the
subexpression enclosed. We refer to the matched text with \1, \2, etc., depending on how many sets of parentheses we have. 

As an example, the expression
` ([a-zA-Z]+) \1 ` (it starts with a space) will match these lines in Jeb Bush’s inbox from January of 2000.

>I feel this is a **win win** situation for the Governor, the Reverend and the people that need help.<br><br>I insisted **that that** be the outcome in that court and that we did not recede from that position.<br><br>I guess you're embarrassed **that that** line got out.<br><br>

The pattern is asking for repeated words. Whatever matches in the first subexpression in parentheses can be referred to as `\1`, so the pattern asks for a double occurrence of the subexpression -- like "win win". We highlighted them in the text above. Also have a look at the graphical representation of this regular expression.

<a href=http://regexper.com/#%20(%5Ba-zA-Z%5D%2B)%20%5C1%20>Graphical view of the pattern " ([a-zA-Z]+) \1 "</a>.
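Here is a short sketch that runs the repeated-word pattern against the first of the email lines quoted above and asks the match object what each piece captured.

```python
from re import search

s = "I feel this is a win win situation for the Governor, the Reverend and the people that need help."

# A space, a word (group 1), a space, the same word again (\1), and a space
m = search(r" ([a-zA-Z]+) \1 ", s)

print(repr(m.group(0)))  # the whole match, spaces included
print(m.group(1))        # just the repeated word captured by the parentheses
```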

**Substitution**

Groupings are also helpful when you want to make substitutions. We have already seen that string types offer you the opportunity to replace text...

In [None]:
s = 'San Diego used to have the most illegal border crossings in the country..'
s.replace("Diego","Francisco")

The `re` package includes a function `sub` that lets you make changes to strings based on regular expression matches. One common form of change is to act on groups. In the next cell, we define a group to be a dollar value and extract it. With this kind of construction, you can see how you might start to pull structured information from unstructured text. 

The expression is `.*\$([0-9,]+).*` Try it out in regexper.com! It says look for a string of any characters of length zero or more `.*`, followed by a literal dollar sign `\$`. Then, we create a subexpression matching a string of either numbers 0 through 9 or a comma that is at least one character long `([0-9,]+)`. Finally, it then matches strings of any character that is of length zero or more `.*`. 

Again, the regular expression below creates a single group indicated by the parentheses and referred to as `\1`. Here we use `sub()` to replace the whole string `s` with just the contents of this group 1. 

In [None]:
from re import sub

s = "At the same time, she rallied her community and raised more than $40,000 for the fight against cancer."

# pull out the dollar value
sub(r".*\$([0-9,]+).*",r"\1",s)

To get a better sense of how this works, let's add three sets of parentheses. The first is the material before the dollar sign, the second is the dollar value and the last is the material after the sequence of one or more numbers. These are denoted groups `\1`, `\2` and `\3` respectively. See how `sub()` changes when we use each group in turn. Fun!

In [None]:
# the first subexpression

sub(r"(.*)\$([0-9,]+)(.*)",r"\1",s)

In [None]:
# the second subexpression

sub(r"(.*)\$([0-9,]+)(.*)",r"\2",s)

In [None]:
# the third subexpression

sub(r"(.*)\$([0-9,]+)(.*)",r"\3",s)
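
One thing to watch with the `sub()` approach: if the pattern does not match, `sub()` hands back the original string unchanged. As a possible alternative (a sketch, not the only way to do it), we can use `search()` and ask the match object for group 1 directly. The second sentence, `no_dollars`, is borrowed from the opening of the speech as an example with no dollar value.

```python
from re import search, sub

s = "At the same time, she rallied her community and raised more than $40,000 for the fight against cancer."
no_dollars = "We meet tonight at a moment of unlimited potential."

# search() returns a match object; group(1) is the text inside the parentheses
m = search(r"\$([0-9,]+)", s)
if m:
    print(m.group(1))

# With no dollar value, search() gives None while sub() returns the whole sentence
print(search(r"\$([0-9,]+)", no_dollars))
print(sub(r".*\$([0-9,]+).*", r"\1", no_dollars))
```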

**To Sum**

The presentation here is meant to give you a flavor of how regular expressions are structured; you have seen the major metacharacters and how to use them to create patterns. Below I provide a useful cheat sheet to remember what the different metacharacters mean and what some of the useful shorthand character classes are. In addition, I can recommend [an interactive cheat sheet](https://www.debuggex.com/cheatsheet/regex/python), and the site [http://www.regular-expressions.info/](http://www.regular-expressions.info/) is also an excellent resource.

**Metacharacters**

<table>
          <tr>
            <th>Metacharacter</th>
            <th>What does it do?</th>
            <th>Examples</th>
            <th>Matches</th>
          </tr>
          <tr>
            <td>^</td>
            <td>Matches beginning of line</td>
            <td>^abc</td>
            <td>abc, abcdef.., abc123</td>
          </tr>
          <tr>
            <td>\$</td>
            <td>Matches end of line</td>
            <td>abc\$</td>
            <td>my:abc, 123abc, theabc</td>
          </tr>
          <tr>
            <td>.</td>
            <td>Match any character</td>
            <td>a.c</td>
            <td>abc, a1c, a-c</td>
          </tr>
          <tr>
            <td>[...]</td>
            <td>Matches one character contained in brackets</td>
            <td>[abc]</td>
            <td>a,b, or c</td>
          </tr>
          <tr>
            <td>[^...]</td>
            <td>Matches one character not contained in brackets</td>
            <td>[^abc]</td>
            <td>xyz, 123, 1de</td>
          </tr>
          <tr>
            <td>[a-z]</td>
            <td>Matches one character between 'a' and 'z'</td>
            <td>[b-z]</td>
            <td>bc, mind, xyz</td>
          </tr>
          <tr>
            <td>*</td>
            <td>Matches character before * 0 or more times</td>
            <td>ab\*c</td>
            <td>abc, abbc, ac</td>
          </tr>
          <tr>
            <td>+</td>
            <td>Matches character before + one or more times</td>
            <td>a+c</td>
            <td>ac, aac, aaac</td>
          </tr>
          <tr>
            <td>?</td>
            <td>Matches the character before the ? zero or one times. Also, used as a non-greedy match</td>
            <td>ab?c</td>
            <td>ac, abc</td>
          </tr>
          <tr>
            <td>{x}</td>
            <td>Match exactly 'x' number of times</td>
            <td>(abc){2}</td>
            <td>abcabc</td>
          </tr>
          <tr>
            <td>{x,}</td>
            <td>Match 'x' number of times or more</td>
            <td>(abc){2,}</td>
            <td>abcabc, abcabcabc</td>
          </tr>
           <tr>
            <td>{,x}</td>
            <td>Match up to 'x' number of times</td>
            <td>(abc){,2}</td>
            <td>(empty string), abc, abcabc</td>
          </tr>
          <tr>
            <td>{x,y}</td>
            <td>Match between 'x' and 'y' times.</td>
            <td>(a){2,4}</td>
            <td>aa, aaa, aaaa</td>
          </tr>
           <tr>
            <td>|</td>
            <td>OR operator</td>
            <td>abc|xyz</td>
            <td>abc or xyz</td>
          </tr>
          <tr>
            <td>(...)</td>
            <td>Capture anything matched</td>
            <td>(a)b(c)</td>
            <td>Captures 'a' and 'c'</td>
          </tr>
          <tr>
            <td>(?:...)</td>
            <td>Non-capturing group</td>
            <td>(a)b(?:c)</td>
            <td>Captures 'a' but only groups 'c'</td>
          </tr>
           <tr>
            <td>\</td>
            <td>Escape the character after the backslash; or create a special sequence (like word boundaries, \b, or a character representing a space, \s).</td>
            <td>a\sc</td>
            <td>a c</td>
          </tr>
        </table>

The special "metacharacters" () [] {} ^ \$ . | \* + ?  and \\ become "literals" again if you put a \\ in front of them -- That is, \\. matches a period and is no longer the wild card. We say we have "escaped" the metacharacter.

**Shorthand character classes** Remember we saw that `\b` represented a word boundary. There are many character classes that are defined in this way. For example, instead of typing `[0-9]` we can use `\d` to represent a digit. This is just shorthand for commonly used character classes. Again, programming languages try to make common activities easy... both easy to write as well as easy to read.

<table>
          <tr>
            <td>\d</td>
            <td>Match any digit (0-9)</td>
          </tr>
          <tr>
            <td>\D</td>
            <td>Match any non digit</td>
          </tr>
          <tr>
            <td >\t</td>
            <td>Match a tab</td>
          </tr>
          <tr>
            <td>\n</td>
            <td>Match a new line</td>
          </tr>
          <tr>
            <td>\r</td>
            <td>Match a carriage return</td>
          </tr>
          <tr>
            <td>\s</td>
            <td>Matches a space character (space, \t, \r, \n)</td>
          </tr>
          <tr>
            <td>\S</td>
            <td>Matches any non-space character </td>
          </tr>
          <tr>
            <td>\b</td>
            <td>Word boundary</td>
          </tr>
          <tr>
            <td>\B</td>
            <td>Non word boundary</td>
          </tr>
          <tr>
            <td>\w</td>
            <td>Matches any one word character [a-zA-Z_0-9]</td>
          </tr>
          <tr>
            <td>\W</td>
            <td>Matches any one non word character</td>
          </tr>
          </table>
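As a quick check on a few of these shorthands, here is a sketch that reuses the sentence about Matthew from the `findall()` examples above.

```python
from re import findall

s = 'In 1996, at age 30, Matthew was sentenced to 35 years for selling drugs and related offenses.'

# \d+ is shorthand for [0-9]+ : runs of one or more digits
print(findall(r"\d+", s))

# \b and \w together: whole words that end in "ed"
print(findall(r"\b\w+ed\b", s))

# \s matches a single whitespace character; here we just count them
print(len(findall(r"\s", s)))
```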

## Regular Expression Drill ##

Download a CSV that represents all of Trump's tweets for the last year. I've loaded it to GitHub and [you can grab it here](https://github.com/computationaljournalism/columbia2019/raw/master/data/trump_tweets_year.txt). Put it in the same folder as this notebook. Now, let's read it in. Remember, `pandas` is our go-to package for this.

In [None]:
from pandas import read_csv, set_option

# prepare cells to show 280 characters
set_option("display.max_colwidth",280)

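# the download link above ends in .txt; adjust this filename if your copy was saved under that name
rdt = read_csv("trump_tweets_year.csv")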
rdt.head(10)

Now, we want to use regular expressions to identify subsets of tweets from the last year. We have seen using the `.str` object in a pandas data frame to give us access to string methods. In this case `.str.contains()` will return `True` if the regular expression is a match in the column and `False` otherwise. So, for example, let's look at whether the tweet contains a number. The tweet text is stored in the column `text` whereas things like the date it was tweeted is in the column `created_at`. 

So now, let's use the `.str.contains()` and a regular expression to return `True` when  a tweet contains a sequence of at least one number.

In [None]:
rdt["text"].str.contains("[0-9]+")

We see the second and seventh tweets (the 2nd most recent and the 7th) both contain numbers -- one has a tariff number and one counts years of the collusion investigation. Ah but we have a column of `True` and `False` that we can use to get us rows that contain numbers! We can keep just those rows with numbers by subsetting with this column...

In [None]:
# let's do this in two steps to make it clear... first we create a column of T/F ...

keep = rdt["text"].str.contains(r"[0-9]+")
keep.head(10)

In [None]:
#... and then we use it to subset the tweet data frame rdt

rdt[keep]

Now, since `keep` is made up of `True` and `False` values, when we `sum()` it up, the `True` becomes a 1 and the `False` a 0. So this will tell us how many tweets contain a sequence of at least one number.

In [None]:
sum(keep)

I get 1517. OK? Now a little more background and then we set you free!

Notice that some of these matches are actual numbers, as in they measure the quantity of things, while others are numbers that are part of a URL shortener like `https://t.co/hUK9dSBM3M`. We might want to separate these out. How would we do that?

Now, for one last helpful example, let's see how we might pull all the retweets. These are tweets that start with "RT". Remember how to do that? Here again we use the column of `True` and `False` to keep only those rows which start with an RT. We store the retweets in a new dataframe called `rts`.

In [None]:
keep = rdt["text"].str.contains(r"^RT")
rts = rdt[keep]

rts.head()

In [None]:
sum(keep)

Now, just as we use `.str.contains()` on a pandas column in place of `search()` on a string to test whether a pattern is present in each row of a data frame, we can use the pandas command `.str.replace()` instead of `sub()` to make changes to each row of a data frame. 

Here we create a regular expression to pull out who is being retweeted from the tweets in `rts`. We do this by looking for text between the "RT" and the colon. Notice that we are using `(.*?)` so that group `\1` is made up of any characters (the "at" sign and the account name) up to a colon. We use the `?` to make this search not greedy; otherwise we would take the retweeted account's name as everything up to the last colon in the tweet, not the first. 

So here we pull out the name of the person being retweeted and replace the original tweet with just this account name, with just group `\1`.

In [None]:
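# regex=True tells pandas to treat the pattern as a regular expression (newer pandas versions default to a literal match)
rts["text"].str.replace(r"^RT (.*?):.*$",r"\1",regex=True)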

To then tabulate the accounts he retweets, we can use `value_counts()` as we have many times in the past. 

In [None]:
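# regex=True tells pandas to treat the pattern as a regular expression (newer pandas versions default to a literal match)
rts["text"].str.replace(r"^RT (.*?):.*$",r"\1",regex=True).value_counts()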

The point? Well, in data cleaning, we often find ourselves working with sequences of text to try to pull out just what we want. Regular expressions are powerful tools for this. 

Now, let's start to look for some things.

1. How often does he refer to percentages?
2. How often does he refer to a summer month?
3. How often does he tweet using at least one word (not single letters) that is all caps?
4. How often does he embed a link?
5. How often does he refer to another Trump?
6. Tabulate the words Trump uses before "Hillary".

Come up with 5 searches on your own!