# [LEGALST-190] Lab 3/15: Regular Expressions

This lab will cover the basics of regular expressions: finding, extracting, and manipulating pieces of text based on specific patterns within strings.

*Estimated Time: 45 minutes*

### Table of Contents

[The Data](#section data)<br>

[Overview](#section context)<br>

0- [Matching with Regular Expressions](#section 0)<br>

1 - [Introduction to Essential RegEx](#section 1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1 - [Special Characters](#subsection 1)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2 - [Quantifiers](#subsection 2)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3 - [Sets](#subsection 3)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4 - [Special Sequences](#subsection 4)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 5 - [Groups and Logical OR](#subsection 5)

2- [Python RegEx Methods](#section 2)<br>

3 - [Valuation Extraction](#section 3)<br>

## The Data <a id='data'></a>

You will again be working with the Old Bailey data set to practice matching and manipulating pieces of the textual data.


## Overview <a id='data'></a>
Regular expression operations ("RegEx") are a very flexible version of the text search function found in most text processing software. In those ordinary search functions, you press `ctrl+F` (or `command+F`) and type in the phrase you are looking for, e.g. "Congress". If the software finds an exact match for your search phrase ("Congress"), it jumps to its position in the text and you can take it from there.

Thinking a bit more abstractly about this, "Congress" is nothing more than a very specific search. In it, we ask the search function to report the position where it finds a capital "C" followed by seven lowercase letters ("o", "n", "g", "r", "e", "s", "s"), all in a specific order. Depending on your text, it may have been sufficient to let your search function look for all words starting with the capital letter "C", or for those words starting with "C" and ending with "ess". This kind of flexibility is exactly what RegEx provides.

RegEx is more flexible than the customary search function because it does not require you to spell out the literal word, number, or phrase you are looking for. Rather, in RegEx you describe the characteristics a match must have, using the rules and special characters that make RegEx what it is.

Regular expressions are useful in a variety of applications and can be used in many different programs and programming languages. We will start by learning the general components of regular expressions using a simple online tool, Regex101. Then, at the end of the workshop, we'll learn how to use regular expressions to conduct devaluation exploration on the Old Bailey data set: we will look at how often defendants had the amount they were charged with stealing reduced when they were sentenced, by matching valuations in the text such as 'value 8s 6d'.

__IT IS IMPORTANT to have an experimental mindset as you go through today's practice problems.__ Practice and curiosity are the keys to success! Each individual character expression may match only a simple pattern, but you will need to explore different combinations to match more and more complicated sets of strings. Feel free to go beyond what the questions ask and test different expressions as you work through this notebook.


__Dependencies__: Run the cell below. We will go over what this Python library does in the Python RegEx Methods section of this lab.

In [3]:
import re

---

## Introduction to Essential RegEx<a id='section 1'></a>
### 0. Matching with Regular Expressions <a id='subsection 0'></a>


Before we dive into the different character expressions and their meanings, let's explore what it looks like to match some basic expressions. Open up [Regex101](https://regex101.com/r/Una9U7/4), an online Python regular expression editor. This editor will allow us to input any test string and practice using regular expressions while receiving verification and tips in real time. There should already be an excerpt from the Old Bailey data set (edited, for the sake of the practice problems) in the `Test String` box.

You can think of the `Regular Expression` field like the familiar `ctrl+F` search box.
Try typing in the following, one at a time, to the `Regular Expression` field:
~~~ {.input}
1. lowercase letter: d
2. uppercase letter: D
3. the word:  lady
4. the word:  Lady
5. the word:  our
6. the word:  Our
7. a single space
8. a single period
~~~

__Question 1:__ What do you notice?

__Your Answer:__

*Write your Answer Here:*

Note that:
1. RegEx is case sensitive: it matches _exactly_ what you tell it to match.
2. RegEx looks for the exact order of each character you input into the expression. In the entire text, it found 'our' in 'Hon`our`able' and 'F`our`score'. However, nowhere in the text was there the exact sequence of letters O-u-r starting with a capital 'O', so 'Our' doesn't match anything.
3. The space character ` ` highlights all the single spaces in the text.
4. The period character `.` matches all the characters in the text, not just the periods... why?

This last question takes us now to what is called __special characters__.
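
The first three observations can also be checked in Python. This is a minimal sketch using `re.findall` from the `re` library imported earlier; the `sample` sentence is a made-up stand-in, not the actual Regex101 excerpt.

```python
import re

# Hypothetical sample sentence (an assumption, not the lab's Regex101 excerpt).
sample = "The Honourable Lady lost Fourscore Pounds."

lower_our = re.findall("our", sample)    # found inside 'Honourable' and 'Fourscore'
upper_our = re.findall("Our", sample)    # no capital 'O' followed by 'ur' in the sample
lower_lady = re.findall("lady", sample)  # the sample only contains 'Lady'
spaces = re.findall(" ", sample)         # one match per single space

print(lower_our)    # ['our', 'our']
print(upper_our)    # []
print(lower_lady)   # []
print(len(spaces))  # 5
```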


---
### 1. Special Characters <a id='subsection 1'></a>

Strings are composed of characters, and we are writing patterns to match specific sequences of characters.
Various characters have special meaning in regular expressions. When we use these characters in an expression,
we aren't matching the identical character, we're using the character as a placeholder for some other character(s)
or part(s) of a string.


~~~ {.input}

.         any single character except newline character
^         start of string
$         end of entire string
\n        new line
\r        carriage return
\t        tab

~~~

Note: if you want to actually match a character that happens to be a special character, you have to escape it with a backslash
`\`.
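
As a rough illustration of anchors and escaping in Python, the sketch below uses a made-up `sample` string (an assumption, not the Regex101 text). Patterns are written as raw strings (`r"..."`) so that Python passes the backslash through to the RegEx engine unchanged.

```python
import re

# Hypothetical one-line sample; it contains no newline characters.
sample = "Samuel met the Lady. The Lady thanked Samuel"

anywhere = re.findall(r"Samuel", sample)   # every occurrence
at_start = re.findall(r"^Samuel", sample)  # only at the start of the string
at_end = re.findall(r"Samuel$", sample)    # only at the end of the string
any_char = re.findall(r".", sample)        # unescaped: any character except newline
periods = re.findall(r"\.", sample)        # escaped: literal periods only

print(len(anywhere), len(at_start), len(at_end))  # 2 1 1
print(len(any_char) == len(sample))               # True
print(periods)                                    # ['.']
```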

__Question 2:__ Try typing the following special characters into the `Regular Expression` field on the same Regex101 site. What happens
when you type:

1. `Samuel` vs. `^Samuel` vs. `Samuel$`?

2. `.` vs. `\.`

3. `the` vs. `th.` vs. `th..` ?

__Your Answer:__

*Write your Answer Here*:

1.

2.

3.

SOLUTION:
~~~ {.input}
1.
`Samuel` will match all instances of the pattern `Samuel` in the text
`^Samuel` will match only the instances of `Samuel` at the beginning of the text
`Samuel$` will match only the instance of `Samuel` at the end of the text

2.
`.` matches all individual characters in the text 
`\.` matches all periods in the text

3.
`the` matches all instances of the pattern `the` in the text
`th.` matches all instances of patterns starting with `th` and ending in any character 
`th..` matches all instances of patterns starting with `th` with any two characters following it
~~~

---
### 2. Quantifiers<a id='subsection 2'></a>

Some special characters refer to optional characters, to a specific number of characters, or to an open-ended
number of characters matching the preceding pattern.

~~~ {.input}
*        0 or more of the preceding character/expression
+        1 or more of the preceding character/expression
?        0 or 1 of the preceding character/expression
{n}      n copies of the preceding character/expression 
{n,m}    n to m copies of the preceding character/expression 
~~~
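
To see how each quantifier behaves, here is a small sketch on a made-up string (the `sample` value is an assumption chosen so that every quantifier gives a different result).

```python
import re

# Hypothetical sample: an 'o' followed by zero, one, two, and three 'f's.
sample = "o of off offf"

star = re.findall(r"of*", sample)         # 'o' plus 0 or more 'f'
plus = re.findall(r"of+", sample)         # 'o' plus 1 or more 'f'
optional = re.findall(r"of?", sample)     # 'o' plus 0 or 1 'f'
exactly = re.findall(r"of{1}", sample)    # 'o' plus exactly one 'f'
ranged = re.findall(r"of{1,2}", sample)   # 'o' plus one or two 'f'

print(star)      # ['o', 'of', 'off', 'offf']
print(plus)      # ['of', 'off', 'offf']
print(optional)  # ['o', 'of', 'of', 'of']
print(exactly)   # ['of', 'of', 'of']
print(ranged)    # ['of', 'off', 'off']
```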


__Question 3:__ For this question, click [here](https://regex101.com/r/ssAUXx/1) to open another Regex101 page.

What do the expressions `of`, `of*`, `of+`, `of?`, `of{1}`, `of{1,2}` match? Remember that the quantifier only applies to the character *immediately* preceding it. For example, the `*` in `of*` applies only to the `f`, so the expression looks for a pattern starting with __exactly one__ `o` and __0 or more__ `f`'s.

__Your Answer:__

*Write your answer here:*

SOLUTION:

~~~ {.input}

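- `of`: matches all instances of the pattern `of` in the text
- `of*`: matches all instances of a pattern starting with exactly one `o` and 0 or more `f`'s (so a lone `o` also matches)
- `of+`: matches all instances of a pattern starting with exactly one `o` and 1 or more `f`'s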
- `of?`: matches all instances of a pattern starting with exactly one `o` and at most one `f` after it.
- `of{1}`: matches all instances of a pattern starting with exactly one `o` and exactly one `f`
- `of{1,2}`: matches all instances of a pattern starting with exactly one `o` and 1 OR 2 `f`'s after it.

~~~

---
### 3. Sets<a id='subsection 3'></a>

A set by itself is merely a __collection__ of characters the computer may choose from to match a __single__ character in a pattern. We can define these sets of characters using `square brackets []`.

Within a set of square brackets, you may list characters individually, e.g. `[aeiou]`, or in a range, e.g. `[A-Z]` (note that regular expressions are case sensitive by default).


You can also create a complement set by excluding certain characters, using `^` as the first character
in the set. The set `[^A-Za-z]` will match any character except a letter. Most other special characters lose
their special meaning inside a set, so the set `[.?]` will look for a literal period or question mark.

The set will match only one character contained within that set, so to find sequences of multiple characters from
the same set, use a quantifier like `+` or a specific number or number range `{n,m}`.

~~~ {.input}
[0-9]        any numeric character
[a-z]        any lowercase alphabetic character
[A-Z]        any uppercase alphabetic character
[aeiou]      any vowel (i.e. any character within the brackets)
[0-9a-z]     to combine sets, list them one after another 
[^...]       exclude specific characters
~~~
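
The sets in the table can be tried out in Python as follows. This is only a sketch; the `sample` string is invented for illustration.

```python
import re

# Hypothetical sample string (an assumption, not from the Old Bailey text).
sample = "Lot 42: five china bowls?"

digits = re.findall(r"[0-9]+", sample)          # runs of numeric characters
vowels = re.findall(r"[aeiou]", sample)         # one lowercase vowel per match
capitals = re.findall(r"[A-Z]", sample)         # uppercase letters only
non_letters = re.findall(r"[^A-Za-z]", sample)  # complement set: anything but a letter
literal = re.findall(r"[.?]", sample)           # '.' and '?' are literal inside a set

print(digits)            # ['42']
print(vowels)            # ['o', 'i', 'e', 'i', 'a', 'o']
print(capitals)          # ['L']
print(len(non_letters))  # 8
print(literal)           # ['?']
```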

__Question 4:__ Let's switch back to the excerpt from the Old Bailey data set (link [here](https://regex101.com/r/Una9U7/2) for convenience). Can you write a regular expression that matches __all consonants__ in the text string? 

__Your Answer:__

In [2]:
# YOUR EXPRESSION HERE

In [None]:
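#SOLUTION:
# Note: [^aeiou] would also match capital vowels, digits, spaces, and punctuation,
# so we list the consonant ranges explicitly, in both cases.
[b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z]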

---

### 4. Special sequences<a id='subsection 4'></a>

Writing out a set every time we need a common category of characters, such as all digits, all letters and digits, or all whitespace, quickly becomes repetitive. Fortunately, there are several special characters that denote special sequences. These begin with a `\` followed by a letter.

Note that the uppercase version is usually the complement of the lowercase version.

~~~ {.input}
\d        Any digit
\D        Any non-digit character
\w        Any alphanumeric character [0-9a-zA-Z_] 
\W        Any non-alphanumeric character
\s        Any whitespace (space, tab, new line)
\S        Any non-whitespace character
\b        Matches the beginning or end of a word (does not consume a character)
\B        Matches only when the position is not the beginning or end of a word (does not consume a character)
~~~
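
A short sketch of these special sequences in Python, again on an invented `sample` string. The last two lines show how `\b` and `\B` test a position rather than consume a character.

```python
import re

# Hypothetical sample string (an assumption for illustration).
sample = "value 30s. on the 10th"

numbers = re.findall(r"\d+", sample)       # whole runs of digits
words = re.findall(r"\w+", sample)         # runs of letters, digits, or underscore
whitespace = re.findall(r"\s", sample)     # individual whitespace characters
word_start = re.findall(r"\bth", sample)   # 'th' at a word boundary ('the')
inside_word = re.findall(r"\Bth", sample)  # 'th' not at a word boundary ('10th')

print(numbers)          # ['30', '10']
print(words)            # ['value', '30s', 'on', 'the', '10th']
print(len(whitespace))  # 4
print(word_start)       # ['th']
print(inside_word)      # ['th']
```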

__Question 5:__ Write a regular expression that matches all numbers (without punctuation marks or spaces) in the Old Bailey excerpt. Make sure you are matching whole numbers (i.e. `250`) as opposed to individual digits within the number (i.e. `2`, `5`, `0`).

__Your Answer:__

In [None]:
# YOUR EXPRESSION HERE

In [None]:
#SOLUTION
\d+

__Question 6:__ Write a regular expression that matches all patterns with __at least__ 2 and __at most__ 3 digit and/or white space characters in the Old Bailey excerpt.

__Your Answer:__

In [None]:
#YOUR EXPRESSION HERE

In [None]:
#Solution
[\d\s]{2,3}

---

### 5. Groups and Logical OR<a id='subsection 5'></a>

Parentheses are used to designate groups of characters, to aid in logical conditions, and to be able to retrieve the
contents of certain groups separately.

The pipe character `|` serves as a logical OR operator, to match the expression before or after the pipe. Group parentheses
can be used to indicate which elements of the expression are being operated on by the `|`.

~~~ {.input}
|            Logical OR operator
(...)        Matches whatever regular expression is inside the parentheses, and notes the start and end of a group
(this|that)  Matches the expression "this" or the expression "that"
~~~
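
The sketch below shows one way alternation and groups might look in Python. The sample strings are made up, and `re.search` with `.group()` is used only to illustrate retrieving the contents of each group separately.

```python
import re

# Hypothetical sample sentence (an assumption for illustration).
sample = "Either this road or that road leads to the other town."

# Alternation inside a group: 'this' OR 'that', each followed by ' road'.
roads = re.findall(r"(?:this|that) road", sample)
print(roads)  # ['this road', 'that road']

# Capturing groups let us pull out the pieces of a single match.
match = re.search(r"(\d+) ?([a-z])\.", "value 30s.")
amount = match.group(1)  # contents of the first pair of parentheses
unit = match.group(2)    # contents of the second pair of parentheses
print(amount, unit)      # 30 s
```

The `(?:...)` form groups without capturing, which keeps `re.findall` returning the whole match rather than only the group's contents.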

__Question 7:__ Write an expression that matches groups of `Samuel` or `Prisoner` in the Old Bailey excerpt.

__Your Answer:__

In [3]:
# YOUR EXPRESSION HERE

In [None]:
#SOLUTION
(Samuel|Prisoner)

---

## Python RegEx Methods <a id='section 2'></a>

So how do we actually use RegEx for analysis in Python?

Python has a built-in RegEx library called `re` whose functions let us manipulate text using RegEx. The following are some useful functions we may use for text analysis:


- ``.findall(pattern, string)``: Searches the whole text (including the start) for your pattern and returns a list of all the non-overlapping phrases that matched, but not their positions. If the pattern contains groups, the list holds the contents of the groups instead of the whole match.
- ``.sub(pattern, repl, string)``: Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.
- ``.split(pattern, string)``: Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
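
Here is a brief sketch of the three functions listed above, applied to a made-up `sample` string (an assumption for illustration; the replacement text `"N"` is arbitrary).

```python
import re

# Hypothetical sample string written in the style of the Old Bailey text.
sample = "val. 4 s. 6 d. one silk purse"

found = re.findall(r"\d+", sample)            # list of every run of digits
replaced = re.sub(r"\d+", "N", sample)        # every run of digits replaced by 'N'
pieces = re.split(r"\s+", "one silk  purse")  # split on runs of whitespace

print(found)     # ['4', '6']
print(replaced)  # val. N s. N d. one silk purse
print(pieces)    # ['one', 'silk', 'purse']
```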


We will only be using the `.findall()` method for the purposes of today's lab, so don't worry if the functionality of each method isn't clear right now. If you are curious about all the module content within the `re` library, take a look at the [documentation for `re`](https://docs.python.org/2/library/re.html) on your own time!

---

## Extracting Valuation from Old Bailey  <a id='section 3'></a>

Let's apply our new RegEx knowledge to extract all valuation information from the text!

The next cell simply assigns a long string containing three separate theft cases to a variable called `old_bailey`. Within the text are valuations which indicate the worth of the items stolen. We will use this string, what we can observe about the format of valuation notes in the text, and what we just learned about regular expressions to __find all instances of valuations in the text__. 

Valuations will look something like: `val. 4 s. 6 d.`

*Note:* British Currency before 1971 was divided into pounds (`l`), shillings (`s`), and pennies (`d`) - that's what the letters after the values represent. We want to make sure to keep the values and units together when extracting valuations.

__STEP 1__: We will first write expression(s) that will match the valuations.
Take a moment to look for a pattern you notice across the valuations:

In [10]:
old_bailey = """Samuel Davis, of the Parish of St. James Westminster, was indicted for feloniously Stealing 58 Diamonds set in
Silver gilt, value 250 l. the Goods of the Honourable Catherine Lady Herbert, on the 28th of July last. It appeared that the 
Jewels were put up in a Closet, which was lockt, and the Prisoner being a Coachman in the House, took his opportunity to take 
them; the Lady, when missing them, offered a Reward of Fourscore Pounds to any that could give any notice of it; upon enquiry, 
the Lady heard that a Diamond was sold on London-Bridge, and they described the Prisoner who sold it, and pursuing him, found 
the Prisoner at East-Ham, with all his Goods bundled up ready to be gone, and in his Trunk found all the Diamonds but one, which
was found upon him in the Role of his Stocking, when searcht before the Justice. He denied the Fact, saying, He found them upon
a great Heap of Rubbish, but could not prove it; and that being but a weak Excuse, the Jury found him guilty. 

John Emory, was 
indicted for stealing eleven crown pieces, twenty four half crowns, one Spanish piece, val. 4 s. 6 d. one silk purse, and 
4 s. 6 d. in silver, the goods of Ann Kempster, in the dwelling house of Walter Jones. December 17. Acquitted. He was a second
time indicted for stealing one pair of stockings, val. 6 d. the goods of John Hilliard .

GEORGE MORGAN was indicted for that he, about the hour of ten in the night of the 10th of December , being in the dwelling-house
of George Brookes , feloniously did steal two hundred and three copper halfpence, five china bowls, value 30s. a tea-caddie, 
value 5s. a pound of green tea, value 8s. four glass rummers, value 2s. and a wooden drawer, called a till, value 6d. the 
property of the said George, and that he having committed the said felony about the hour of twelve at night, burglariously 
did break the dwelling-house of the said George to get out of the same."""

You might notice that there are multiple ways in which valuations are noted. It can take the form:

~~~ {.input}
value 30s.
val. 6 d.
4 s. 6 d.
~~~

...and so on.

Fortunately, we only care about the values and the associated units, so the omission or abbreviation of the word `value` can be ignored - we only care about:

~~~ {.input}
30s.
6 d.
4 s. 6 d.
~~~

Unfortunately, we can see that the format is still not consistent. The first one has no space between the number and unit, but the second and third do. The first and second have a single number and unit, but the third has two of each.

How might you write an expression that would account for the variations in how valuations are written? Can you write a single regular expression that would match all the different forms of valuations exactly? Or do we need to have a few different expressions to account for these differences, look for each pattern individually, and combine them somehow in the end?

Real data is messy. When manipulating real data, you will inevitably encounter inconsistencies and you will need to ask yourself questions such as the above. You will have to figure out how to clean and/or work with the mess. 

With that in mind, click [here](https://regex101.com/r/2lal6d/1) to open up a new Regex101 with `old_bailey` already in the Test String. We will compose a regular expression, in three parts, that will account for all forms of valuations in the string above.

__PART 1: Write an expression__ that matches __all__ valuations of the form `30s.` AND `6 d.`, but does not match _anything else_ (e.g. your expression should not match any dates). Try not to look at the hints on your first attempt! Save this expression __as a string__ in `exp1`.


_Hint1:_ Notice the structure of valuations. It begins with a number, then an _optional_ space, then a single letter followed by a period.

_Hint2:_ What _quantifier_ allows you to choose _0 or 1 of the previous character_?

_Hint3:_ If you are still stuck, look back to the practice problems and see that we've explored/written expressions to match all components of this expression! It's just a matter of putting it together.


In [None]:
#Your Expression Here
exp1 = ...

In [34]:
#SOLUTION
exp1 = r'\d+ ?[a-z]\.'  # raw string, so the backslashes reach the RegEx engine unchanged

__PART 2:__ For the third case we found above, there are multiple values and units in the valuation. What can you add to what you came up with above so that we have another expression that matches this specific case? Save this expression as a string in `exp2`.

In [None]:
#Your Expression Here
exp2 = ...

In [31]:
#SOLUTION
exp2 = r'\d+ [a-z]\. \d+ [a-z]\.'

__PART 3:__ Now that you have expressions that account for the different valuation formats, combine them into one long expression that looks for (_hint_) one expression __OR__ the other. Be careful about the order in which you ask the computer to look for patterns (i.e. should it look for the shorter expression first, or the longer expression first?). Save this final expression as a string in `final`.

In [26]:
#Your Expression Here
final = ...

In [32]:
#SOLUTION
final = r'\d+ [a-z]\. \d+ [a-z]\.|\d+ ?[a-z]\.'
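
To see why the order of the alternatives matters, the sketch below compares the two possible orderings on a short made-up string. The names `longer_first` and `shorter_first` are hypothetical and used only for this comparison.

```python
import re

# Hypothetical sample containing a two-part and a one-part valuation.
sample = "val. 4 s. 6 d. one silk purse, value 30s."

longer_first = r"\d+ [a-z]\. \d+ [a-z]\.|\d+ ?[a-z]\."
shorter_first = r"\d+ ?[a-z]\.|\d+ [a-z]\. \d+ [a-z]\."

# The engine tries the alternatives left to right and takes the first that fits.
kept_together = re.findall(longer_first, sample)
split_apart = re.findall(shorter_first, sample)

print(kept_together)  # ['4 s. 6 d.', '30s.']
print(split_apart)    # ['4 s.', '6 d.', '30s.']
```

With the shorter alternative first, `4 s.` is consumed before the longer alternative is ever tried, so the shillings and pence of one valuation end up as two separate matches.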

__STEP 2:__ Now that you have the right regular expression that would match our valuations, how would you use it to _extract_ all instances of valuations from the text saved in `old_bailey`?

Remember, you need to input your regular expression as a __string__ into the method.

In [None]:
#Your Expression Here

In [33]:
#SOLUTION
re.findall(final, old_bailey)

['250 l.',
 '4 s. 6 d.',
 '4 s. 6 d.',
 '6 d.',
 '30s.',
 '5s.',
 '8s.',
 '2s.',
 '6d.']

__Congratulations!!__ You've successfully extracted all valuations from our sample. When you are extracting valuations from a larger text for your devaluation exploration, keep in mind all the possible variations in valuation that may not have been covered by our example above. You now have all the skills necessary to tweak the expression to account for such minor variations -- Good Luck!
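
As one possible tweak, the two alternatives could be folded into a single repeated pattern. The sketch below is only an illustration under stated assumptions: the `valuation` name and `sample` string are hypothetical, and restricting the unit to `l`, `s`, or `d` assumes no other unit letters appear in your text.

```python
import re

# Hypothetical sample mixing the valuation formats seen in this lab with a date.
sample = "value 250 l. and val. 4 s. 6 d. then value 30s. on December 17. Acquitted"

# One amount-and-unit, optionally followed by further amount-and-unit pairs.
# (?:...) groups without capturing, so findall still returns whole matches.
valuation = r"\d+ ?[lsd]\.(?: \d+ ?[lsd]\.)*"

matches = re.findall(valuation, sample)
print(matches)  # ['250 l.', '4 s. 6 d.', '30s.']
```

The date `17.` is skipped because the number is followed directly by a period rather than a unit letter.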

---

## Bibliography

- The Python Standard Library. (2018, February). Regular Expression Operations. https://docs.python.org/2/library/re.html
- Bloy, Marjie. (2006, June). British Currency Before 1971. http://www.victorianweb.org/economics/currency.html
- The Proceedings of the Old Bailey. https://www.oldbaileyonline.org/

---

Notebook Developed by: Keiko Kamei

Data Science Modules: http://data.berkeley.edu/education/modules