# Regular Expressions

- Before you can use regular expressions in your program, you must import the library using **`import re`**.

- You can use **`re.search()`** to check whether a string contains a match for a regular expression, similar to using the **`find()`** method for strings.

- **`re.search()`** returns a match object if the regular expression matches anywhere in the string, and `None` if it does not. A match object is truthy and `None` is falsy, so the result can be used directly in an `if` statement.

- You can use **`re.findall()`** to extract the portions of a string that match your regular expression, similar to a combination of `find()` and slicing: `var[5:10]`.
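
As a rough illustration of the points above, the sketch below compares `re.search()` and `re.findall()` with `find()` plus slicing. The sample line and the variable names are assumptions made up for this example.

```python
import re

line = 'From: stephen.marquard@uct.ac.za'

# re.search() gives back a match object on success and None on failure.
found = re.search('From:', line)
missing = re.search('Subject:', line)
print(found is not None)   # a match object was returned
print(found.group())       # the text that matched
print(missing)             # nothing matched

# str.find() only returns an index, so extracting text needs slicing.
start = line.find('From:')
sliced = line[start:start + 5]
print(sliced)

# re.findall() returns the matching substrings directly, as a list.
extracted = re.findall('From:', line)
print(extracted)
```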


```python
# Search for lines that contain 'From:'
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('From:', line):
        print(line)
```      
-------------
```python
# Search for lines that start with 'From:'
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line):
        print(line)
```        

------------------
# Character matching in regular expressions - Wild-Card Characters

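- The **dot** character (`.`) matches any single character (except a newline, by default).
- If you add the **asterisk** character (`*`), the preceding character is matched "any number of times" (zero or more).
- If you add the **plus** character (`+`), the preceding character is matched "one or more times".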
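
The three wild-card rules can be tried out on short strings. This is a minimal sketch; the test strings are invented for illustration.

```python
import re

# '.' matches exactly one character, so 'F..m' needs two characters between F and m.
dot = re.findall('F..m', 'From Farm Fm')
print(dot)

# '*' lets the preceding character appear zero or more times.
star = re.findall('ab*', 'a ab abbb')
print(star)

# '+' requires the preceding character at least once.
plus = re.findall('ab+', 'a ab abbb')
print(plus)
```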

```python
# Search for lines that start with 'F', followed by
# 2 characters, followed by 'm:'
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^F..m:', line):
        print(line)
```        

------------

```python
# Search for lines that start with 'From:' and have an at-sign
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:.+@', line):
        print(line)
```  


The search string **`"^From:.+@"`** will successfully match lines that start with "From:", followed by one or more characters (".+"), followed by an at-sign. So this will match the following line:

**`From: stephen.marquard@uct.ac.za`**

You can think of the `".+"` wildcard as expanding to match all the characters between the colon character and the at-sign.

**`From:.+@`**

It is good to think of the plus and asterisk characters as "pushy". For example, in the following string the match extends all the way to the last at-sign, because the ".+" pushes outward as far as it can:

**`From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen @iupui.edu`**

It is possible to tell an asterisk or plus sign not to be so "greedy" by adding a question mark (`?`) after it.
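
To make the "pushy" behaviour concrete, the sketch below runs the greedy pattern and a non-greedy variant (with `?` added after the plus) against the line shown above. The variable names are assumptions for this example.

```python
import re

line = 'From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen @iupui.edu'

# Greedy: '.+' expands as far as possible, so the match ends at the LAST at-sign.
greedy = re.search('^From:.+@', line).group()
print(greedy)

# Non-greedy: '.+?' expands as little as possible, so the match ends at the FIRST at-sign.
lazy = re.search('^From:.+?@', line).group()
print(lazy)
```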

--------

#### Cheat sheet

| Meta-characters 	| Description                                         	|
|:-----------------	|:-----------------------------------------------------	|
| **^**           	| Matches the beginning of a line                     	|
| **$**           	| Matches the end of the line                         	|
| **.**           	| Matches any character                               	|
| **\s**          	| Matches whitespace                                  	|
| **\S**          	| Matches any non-whitespace character                	|
| **\***          	| Repeats a character zero or more times              	|
| **\*?**         	| Repeats a character zero or more times (non-greedy) 	|
| **+**           	| Repeats a character one or more times               	|
| **+?**          	| Repeats a character one or more times (non-greedy)  	|
| **[aeiou]**     	| Matches a single character in the listed set        	|
| **[^XYZ]**      	| Matches a single character not in the listed set    	|
| **[a-z0-9]**    	| The set of characters can include a range           	|
| **(**           	| Indicates where string extraction is to start       	|
| **)**           	| Indicates where string extraction is to end         	|
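
A few of the meta-characters from the table can be exercised on one sample string. This is only a sketch; the string and the variable names are assumptions chosen for illustration.

```python
import re

text = 'Sat Jan 5 09:14:16 2008'

# [aeiou] matches one character from the listed set.
vowels = re.findall('[aeiou]', text)
print(vowels)

# ^ anchors at the start of the line; [^0-9] matches any character that is not a digit.
leading_text = re.findall('^[^0-9]+', text)
print(leading_text)

# $ anchors at the end of the line; [0-9] is a range of characters.
year = re.findall('[0-9]+$', text)
print(year)

# \S matches a non-whitespace character (raw string keeps the backslash intact).
words = re.findall(r'\S+', text)
print(words)
```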

# Extracting data using regular expressions

If we want to extract data from a string in Python we can use the **`re.findall()`** function to extract all of the substrings which match a regular expression. Let's use the example of wanting to extract anything that looks like an email address from any line regardless of format. For example, we want to pull the email addresses from the following string:

```python
import re
s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'
lst = re.findall(r'\S+@\S+', s)  # raw string keeps the backslashes intact
print(lst)
```

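The `findall()` function searches the string in the second argument and returns a list of all of the strings that look like email addresses. We are using a two-character sequence (`\S`) that matches a non-whitespace character. The trailing `@2PM` is not returned because `\S+` requires at least one non-whitespace character before the at-sign.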

The output of the program would be:

`['csev@umich.edu', 'cwen@iupui.edu']`


We can use this regular expression in a program to read all the lines in a file and print out anything that looks like an email address as follows:

```python
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'\S+@\S+', line)
    if len(x) > 0:
        print(x)
```


------
## `[0-9]+`

```python
import re
x = 'My 2 favourite numbers are 19 and 42'
y = re.findall('[0-9]+', x)
print(y)  # ['2', '19', '42']
y = re.findall('[AEIOU]+', x)
print(y)  # [] because the string contains no uppercase vowels
```


## Warning: Greedy Matching

The repeat characters (**\*** and **+**) push outward in both directions (greedy) to match the largest possible string.

```python
import re
x = 'From: Using the : character'
y = re.findall('F.+:', x)
print(y)  # ['From: Using the :']
```

## Non-Greedy Matching

Not all repeat codes are greedy! If you add a **?** character, the **+** and **\*** chill out a bit...

```python
import re
x = 'From: Using the : character'
y = re.findall('F.+?:', x)
print(y)  # ['From:']
```

## Fine-Tuning String Extraction (Combining Searching and Extracting)

**Parentheses** are not part of the match, but they mark where the extracted string should **start** and **stop**.

```python
import re
x = 'From csev@umich.edu Sat Jan 5 09:14:16 2008'
y = re.findall(r'From (\S+@\S+)', x)
print(y)
```

```python
import re
x = 'From csev@uct.ac.za Sat Jan 5 09:14:16 2008'
y = re.findall('@([^ ]*)', x)
print(y)
```

```python
import re
x = 'From csev@uct.ac.za Sat Jan 5 09:14:16 2008'
y = re.findall('From .*@([^ ]*)', x)
print(y)
```
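
To see what the parentheses change, the sketch below runs one pattern without them and two with them on the same line. The variable names are assumptions for this example.

```python
import re

x = 'From csev@uct.ac.za Sat Jan 5 09:14:16 2008'

# No parentheses: findall() returns everything the pattern matched.
whole = re.findall(r'From \S+@\S+', x)
print(whole)

# Parentheses: 'From ' must still match, but only the address is returned.
address = re.findall(r'From (\S+@\S+)', x)
print(address)

# Parentheses after the at-sign: only the non-blank characters that follow it are returned.
host = re.findall('@([^ ]*)', x)
print(host)
```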

--------
If we want to find numbers on lines that start with the string "X-" such as:

```
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
```
we don't just want any floating-point numbers from any lines. We only want to extract numbers from lines that have the above syntax.

We can construct the following regular expression to select the lines:

`^X-.*: [0-9.]+`

Translating this, we are saying, we want lines that start with "X-", followed by zero or more characters (".*"), followed by a colon (":") and then a space. After the space we are looking for one or more characters that are either a digit (0-9) or a period "[0-9.]+". Note that inside the square brackets, the period matches an actual period (i.e., it is not a wildcard between the square brackets).

This is a very tight expression that will pretty much match only the lines we are interested in as follows:


```python
# Search for lines that start with 'X' followed by '-' and 
# any characters and ':'
# followed by a space and any number.
# The number can include a decimal.
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^X-.*: [0-9.]+', line):
        print(line)
```        


```python
import re
hand = open('mbox-short.txt')
numlist = []
for line in hand:
    line = line.rstrip()
    stuff = re.findall('^X-.*: ([0-9.]+)', line)
    if len(stuff) != 1:
        continue
    num = float(stuff[0])
    numlist.append(num)
print('Maximum: ', max(numlist))
```
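
The same extraction can be tried without the mail file. The sketch below is a self-contained variant that assumes a small, made-up list of lines (`sample_lines`) in place of `mbox-short.txt`; two of the lines are there to show what the pattern rejects.

```python
import re

# Hypothetical sample lines standing in for the contents of the file.
sample_lines = [
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'Subject: 3 new messages',   # has a number but does not start with 'X-'
    'X-Sieve: CMU Sieve 2.3',    # starts with 'X-' but no number follows ': '
]

numbers = []
for line in sample_lines:
    # The parentheses keep only the number, not the header name.
    stuff = re.findall('^X-.*: ([0-9.]+)', line)
    if len(stuff) != 1:
        continue
    numbers.append(float(stuff[0]))

print(numbers)
print('Maximum:', max(numbers))
```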

---
# Escape Character

If you want a special regular expression character to just behave normally (most of the time), you prefix it with a backslash (**`\`**).

```python
import re
x = 'We just received $10.00 for cookies.' 
y = re.findall(r'\$[0-9.]+', x)
print(y)
```
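
The effect of the backslash is easiest to see by comparing the escaped and unescaped patterns side by side. This is a small sketch on the same sentence; the variable names are assumptions for this example.

```python
import re

x = 'We just received $10.00 for cookies.'

# Unescaped: '$' means "end of the line", so no digits can follow it and nothing matches.
unescaped = re.findall(r'$[0-9.]+', x)
print(unescaped)

# Escaped: '\$' matches an actual dollar sign.
escaped = re.findall(r'\$[0-9.]+', x)
print(escaped)

# Inside square brackets the dollar sign is also treated as a literal character.
in_brackets = re.findall(r'[$][0-9.]+', x)
print(in_brackets)
```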

# Quiz

1. Which of the following regular expressions would extract 'uct.ac.za' from this string using re.findall?

    - [ ] `@\S+`
    - [ ] `..@\S+..`
    - [ ] `F.+:`
    - [ ] `@(\S+)`
  
 
2. Which of the following is the way we match the "start of a line" in a regular expression?

    - [ ] `^`
    - [ ] `str.startswith()`
    - [ ] `\linestart`
    - [ ] `String.startsWith()`
    - [ ] `variable[0:1]`
    
    
3. What would the following mean in a regular expression? `[a-z0-9]`

    - [ ] Match a lowercase letter or a digit
    - [ ] Match anything but a lowercase letter or digit
    - [ ] Match an entire line as long as it is lowercase letters or digits
    - [ ] Match any number of lowercase letters followed by any number of digits
    - [ ] Match any text that is surrounded by square braces
    
    
4. What is the type of the return value of the re.findall() method? 
 
    - [ ] A boolean
    - [ ] An integer
    - [ ] A single character
    - [ ] A list of strings
    - [ ] A string
    
    
    
5. What is the "wild card" character in a regular expression (i.e., the character that matches any character)?

    - [ ] ?
    - [ ] *
    - [ ] $
    - [ ] .
    - [ ] +
    - [ ] ^
    
    
6. What is the difference between the "+" and "*" character in regular expressions?

    - [ ] The "+" matches at least one character and the "*" matches zero or more characters
    - [ ] The "+" matches upper case characters and the "*" matches lowercase characters
    - [ ] The "+" matches the beginning of a line and the "*" matches the end of a line
    - [ ] The "+" matches the actual plus character and the "*" matches any character
    - [ ] The "+" indicates "start of extraction" and the "*" indicates the "end of extraction"
    
    

7. What does the "[0-9]+" match in a regular expression?

    - [ ] Zero or more digits
    - [ ] One or more digits
    - [ ] Several digits followed by a plus sign
    - [ ] Any mathematical expression
    - [ ] Any number of digits at the beginning of a line
    
    
    
8. What does the following Python sequence print out?
    ```python
    import re
    x = 'From: Using the : character'
    y = re.findall('^F.+:', x)
    print(y)
    ```

    - [ ] :
    - [ ] ^F.+:
    - [ ] ['From:']
    - [ ] ['From: Using the :']
    - [ ] From:
    
    
9. What character do you add to the "+" or "*" to indicate that the match is to be done in a non-greedy manner?

    - [ ] ^
    - [ ] ?
    - [ ] ++
    - [ ] **
    - [ ] $
    - [ ] \g

10. Given the following line of text:

    ```From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008```
    
    What would the regular expression `'\S+?@\S+'` match?
    
    - [ ] marquard@uct
    - [ ] stephen.marquard@uct.ac.za
    - [ ] \@\
    - [ ] From
    - [ ] d@uct.ac.za

-----
# Finding Numbers

In this assignment you will read through and parse a file with text and numbers. You will extract all the numbers in the file and compute the sum of the numbers.
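
One possible shape for this task is sketched below. It assumes the numbers are non-negative integers and uses a made-up string (`sample_text`) in place of the assignment file; `sum_numbers` is a hypothetical helper name, not something given by the assignment.

```python
import re

# Hypothetical sample text standing in for the assignment file.
sample_text = 'There are 3 apples\nand 12 oranges\nno numbers here\n7 pears and 20 plums\n'


def sum_numbers(lines):
    """Add up every run of digits found in an iterable of lines."""
    total = 0
    for line in lines:
        # findall() returns the digit runs as strings, so convert each one.
        for number in re.findall('[0-9]+', line):
            total += int(number)
    return total


total = sum_numbers(sample_text.splitlines())
print(total)
```

Because an open file can be looped over line by line, the same helper could be given a file handle instead of the sample lines.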
