In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab03.ipynb")

# Lab 3: Regular Expressions with Python

Welcome to Lab 3 of DATA 271! 

This document contains examples and small tasks ("appetizers") to make sure you understand the examples. The culminating task ("main course") at the end of the document is more complex and uses most of the topics you will have worked through. You should rarely remain stuck for more than a few minutes on questions in labs, so feel free to ask for help. Collaborating on labs is more than okay -- it's encouraged! Explaining things is beneficial -- the best way to solidify your knowledge of a subject is to explain it. Please don't just share answers, though.

For this lab and all future ones, please be sure to not re-assign variables throughout the notebook! For example, if you use `my_list` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you passed previously!

### In today's lab, we will
- Learn the basic syntax of regular expressions in Python and write simple regular expressions using common pattern-matching operations.
- Understand the flexibility regular expressions afford in searching, and articulate at least one real-world example where this is useful.
- Become familiar with Python's `re` module and some of its functions such as `findall()`, `search()`, and `sub()`.
- Become more familiar with using online resources such as documentation, "cheat sheets" and Stack Exchange to independently learn more about a technical topic.

## Overview

A regular expression (regex or regexp for short) describes a pattern to match in text. For example, when you use the "find and replace" feature in Microsoft Word, you are asking the computer to find specific strings that match a pattern and replace them with another string. Often we want a more flexible way to search and replace. For example, we might wish to locate and replace a word spelled two different ways in a text: serialise and serialize (British and American spelling). The regular expression `seriali[sz]e` matches both "serialise" and "serialize".
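
As a quick illustration of the `seriali[sz]e` pattern, here is a minimal sketch. The sample sentence and the variable names (`text`, `spellings`, `unified`) are made up for this example.

```python
import re

# Hypothetical sample sentence containing both spellings
text = "We serialise the data, then serialize it again."

# [sz] matches exactly one character: either "s" or "z"
spellings = re.findall(r"seriali[sz]e", text)
print(spellings)

# Replace either spelling with the American one
unified = re.sub(r"seriali[sz]e", "serialize", text)
print(unified)
```

A plain "find and replace" would need two passes here, one per spelling; the character class handles both at once.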

This flexibility is also useful when searching for and extracting email addresses from a file. We know each address contains an at sign (@), but we don't know how many or which characters come before it. It is possible to read through text and look for patterns using string methods like `split()` and `find()`. However, searching and extracting are so common that Python provides a powerful module for these tasks (`re`).

The `re` module provides a set of powerful regular expression facilities, which allow you to quickly check whether a given string matches a given pattern at its beginning (using the `match` function) or contains such a pattern anywhere (using the `search` function). A regular expression is a string pattern written in a compact (and quite cryptic) syntax.
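
The difference between `match` and `search` is easy to miss, so here is a small sketch of it. The sample string and variable names are chosen for illustration; `fullmatch` is an additional `re` function, not used elsewhere in this lab, that requires the whole string to match.

```python
import re

sentence = "The rain in Spain"  # sample string for illustration

# match() only succeeds if the pattern matches at the start of the string
at_start = re.match(r"rain", sentence)
print(at_start)

# search() scans the whole string for the first location that matches
anywhere = re.search(r"rain", sentence)
print(anywhere.group())

# fullmatch() requires the entire string to match the pattern
whole = re.fullmatch(r"The.*Spain", sentence)
print(whole is not None)
```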

The regex describes a pattern to locate in the text, and then we can use specific functions to accomplish tasks. You can then ask questions such as “Does this string match the pattern?” or “Is there a match for the pattern anywhere in this string?” You can also use regular expressions to modify a string or to split it apart in various ways.

The documentation provides more details: https://docs.python.org/3/library/re.html.

### Pattern matching


In [None]:
import re # Imports the re module

#Check if the string starts with "The" and ends with "Spain":

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)

if x:
    print("YES! We have a match!")
else:
    print("No match")

In [None]:
#Check if the string starts with "The" and ends with "Spain":
txt = "The Running of the Bulls occurs in Pamplona, Spain"
x = re.search("^The.*Spain$", txt)

if x:
    print("YES! We have a match!")
else:
    print("No match")

In [None]:
#Check if the string starts with "The" and ends with "Spain":

txt = "The Louvre is in Paris, France"
x = re.search("^The.*Spain$", txt)

if x:
    print("YES! We have a match!")
else:
    print("No match")

### Print Matches
The `findall()` function returns a list of all the matches found, which you can then print.

In [None]:
txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)
len(x) # how many matches are found

In [None]:
# returns an empty list if no matches are found
txt = "The rain in Spain"
x = re.findall("Portugal", txt)
print(x)
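
One detail of `findall()` that can be surprising: when the pattern contains a capturing group (a pair of parentheses), the list holds only the text captured by the group rather than the whole match. The sketch below assumes the same sample sentence; the variable names are illustrative.

```python
import re

sentence = "The rain in Spain"

# No group: each list entry is the full matched text
whole_words = re.findall(r"\w+ain", sentence)
print(whole_words)

# One capturing group: each entry is only the text inside the parentheses
prefixes = re.findall(r"(\w+)ain", sentence)
print(prefixes)
```

This is handy when you want to use surrounding text to locate something but keep only part of it.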

### Search
The `search()` function searches the string for a match and returns a match object if there is one. If there is more than one match, only the first occurrence is returned. If there are no matches, `None` is returned. The match object has methods that provide more information about the search, such as
- `start()`, which returns the start position of the match
- `span()`, which returns a tuple containing the start and end positions of the match
- `group()`, which returns the part of the string where there was a match

In [None]:
txt = "The rain in Spain"
# search for the first white-space character \s
# (a raw string, r"...", keeps Python from treating \s as a string escape)
x = re.search(r"\s", txt)

print("The first white-space character is located in position:", x.start())

In [None]:
x = re.search("Portugal", txt)
print(x)

In [None]:
x = re.search("ai", txt)
print(x) #this will print an object

In [None]:
x = re.search(r"\bS\w+", txt)
print(x)
# span returns the start and end position of the first match occurrence.
print(x.span())

In [None]:
# look for a word that starts with an upper case S
x = re.search(r"\bS\w+", txt)
# print the part of the string where there was a match
print(x.group())
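
Because `search()` returns `None` when nothing matches, calling `.group()` or `.span()` directly on the result raises an `AttributeError` in that case. A minimal sketch of guarding against this follows; the helper name `describe_match` is hypothetical and not part of the `re` module.

```python
import re

sentence = "The rain in Spain"

def describe_match(pattern, text):
    """Return a short description of the first match, or a fallback message."""
    found = re.search(pattern, text)
    if found is None:
        # No match: avoid calling methods on None
        return "no match"
    return f"{found.group()!r} at {found.span()}"

print(describe_match(r"\bS\w+", sentence))
print(describe_match(r"Portugal", sentence))
```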

### Split
The `split()` function returns a list where the string has been split on each match. For simple cases like splitting on white space, the string method `split()` does the same job.

In [None]:
# split on white space
x = re.split(r"\s", txt)
print(x)

In [None]:
txt = "The rain in Spain"
x = txt.split()
print(x)
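
Where `re.split()` goes beyond the string method is when the separator itself varies. The sketch below uses a made-up string with two kinds of delimiters, and also shows the optional `maxsplit` parameter.

```python
import re

# Hypothetical string that mixes two delimiters
messy = "rain, Spain;plain,mainly"

# Split on a comma or semicolon, plus any white space that follows it
parts = re.split(r"[,;]\s*", messy)
print(parts)

# maxsplit limits how many splits are made
first_and_rest = re.split(r"\s", "The rain in Spain", maxsplit=1)
print(first_and_rest)
```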

### Substitution
The `sub()` function replaces the matches found with a string you indicate.  You can control how many replacements are done with the optional count parameter.

In [None]:
# replace spaces with the number nine as a string
x = re.sub(r"\s", "9", txt)
print(x)

In [None]:
# count=2 replaces only the first two matches
x = re.sub(r"\s", "9", txt, count=2)
print(x)
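
A common practical use of `sub()` is tidying up irregular white space. The following sketch assumes a made-up input string; `\s+` matches a run of one or more white-space characters, so each run is replaced as a unit.

```python
import re

# Hypothetical string with uneven spacing (including a tab)
spaced_out = "The   rain \t in  Spain"

# Collapse every run of white space into a single space
tidy = re.sub(r"\s+", " ", spaced_out)
print(tidy)

# With count=1, only the first run is replaced
first_only = re.sub(r"\s+", "_", spaced_out, count=1)
print(first_only)
```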

## Appetizers
Now it's time for you to practice.

### 1. Regex exploration
We have now seen several pieces of syntax for flexibly controlling what we want to match or replace. Google "regular expression in Python cheat sheet" and download one of your choosing, or use [this one](https://canvas.humboldt.edu/courses/76930/files/folder/Uploaded%20Media%202?preview=6039402). Take a moment to read through it.
Choose four different syntaxes on the cheat sheet and write a small example with a string of your choosing like "The rain in Spain" to test them.  
Specifically experiment with
- `[]` vs `()`
- `+` vs `*`
- `{}`
- etc.

Explain your thought process and what you learned in a Markdown cell.

In [None]:
any_string = ...
some_practice = ...
some_practice

In [None]:
any_string = ...
some_practice = ...
some_practice

In [None]:
any_string = ...
some_practice = ...
some_practice

In [None]:
any_string = ...
some_practice = ...
some_practice

In [None]:
any_string = ...
some_practice = ...
some_practice

### 2. Emails
We will use the text file *emails.txt* for this exercise. Run the cell below to import the data and look at a snippet of the file.

In [None]:
hand = open('emails.txt')
emails = hand.read()
hand.close()
emails[:1000]

**Question 2.1:** With the `emails` data, make a list called `from_lines` containing all the lines in the .txt file which contain the word `"From"` (case sensitive). Be sure that the lines that end up in your list do not include any trailing spaces. 

*HINT:* The `rstrip` method will likely be helpful.

In [None]:
email_lines = ... # split email text by new lines
from_lines = ... # a list of email lines containing the word "From" (with no trailing spaces)
from_lines


In [None]:
grader.check("q2_1")

**Question 2.2:** The real power of regular expressions comes from the flexibility to more precisely control which lines match the pattern. Use a regular expression to make a list called `from_addresses` that contains all the lines from `emails` that *start* with `From:` and have an `@` sign. Be sure that the lines that end up in your list do not include any trailing spaces.

Remember that email addresses vary in length. For example, Humboldt's Math Department email is math@humboldt.edu and the Biology Department's email is biosci@humboldt.edu. They have a different number of characters before the @ sign. Therefore your regex will have to be flexible enough to account for this variability.

In [None]:
from_addresses = ...
from_addresses

In [None]:
grader.check("q2_2")

**Question 2.3:** Suppose we want to create a contact list based on the email addresses from which emails were sent. Use regular expression to create a list called `contacts` containing all the unique email addresses in `from_addresses`. Your final answer should be a list of strings. For example, the first few entries of `contacts` could look like

```python
['stuart.freeman@et.gatech.edu', 'zach.thomas@txstate.edu', 'louis@media.berkeley.edu', ...]
```

In [None]:
contacts = ...
contacts

In [None]:
grader.check("q2_3")

**Question 2.4:** Using regular expressions, extract the *hour of day* that email messages were sent and put them into a list.

*HINT:* Look for lines that start with `From` followed by a space to get time information about the emails. For example, the line 

```python
'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
```

contains time information and the hour of the day in which it was sent is `09`. Your final answer should be a list of strings. 

In [None]:
hour_of_day = ...
hour_of_day

In [None]:
grader.check("q2_4")

### 3. Comparing string methods and Regex

**Question 3.1:** Consider the following string: `'X-DSPAM-Confidence: 0.8475'`. Use `find` and string slicing to extract the number and cast it to a float. *HINT:* Everything after the colon is the number.

In [None]:
phrase = 'X-DSPAM-Confidence: 0.8475'

col_pos = ...
number = ...
number

In [None]:
grader.check("q3_1")

**Question 3.2:** Consider the same string: `'X-DSPAM-Confidence: 0.8475'`. Complete the same task you did in question 3.1, but this time use a regular expression. Please type your regular expression in `regex`, then extract the number from `phrase` in `number_with_re`.

In [None]:
regex = ...
number_with_re = ...
number_with_re

In [None]:
grader.check("q3_2")

**Question 3.3:** Compare your approach from problems 3.1 and 3.2. List pros and cons of string methods and pros and cons of regex. 

*Type your answer, replacing this text.*

### 4. Parsing data
In this problem you will read through and parse a file with text and numbers. You will extract all the numbers in the file and compute the sum of the numbers.  

The file contains text from a data science textbook introduction with random numbers inserted throughout.

For example, the text might look like this:

```python
'''
Why should you learn to write programs? 7746
12 1929 8827
Writing programs (or programming) is a very creative 
7 and rewarding activity.  You can write programs for 
many reasons, ranging from making your living to solving
8837 a difficult data analysis problem to having fun to helping 128
someone else solve a problem.  This book assumes that 
everyone needs to know how to program ...
'''
```

**Question 4.1:** Import the data from `regex_sum.txt`. Find all the integers in the text file and add them up. Store your result in `numbers_sum`.

**Tips:**
- Refer to problem 2 to see how to import data from a txt file. 
- Consecutive digits should count as a single number. For example, the line 
```python
'8837 a difficult data analysis problem to having fun to helping 128'
```
 contains two numbers: 8837 and 128. 

In [None]:
file = ...
text_file = ...
...

numbers_sum = ...
numbers_sum

In [None]:
grader.check("q4_1")

## Main Course

Imagine you're ordering party invitations from a website like Shutterfly, which mails them directly to your guests. The website requires addresses to be entered in separate fields: Address Line 1, Address Line 2, City, State, and Zip Code. However, you currently have each address stored as a single string.

Your task is to use regular expressions (regex) to separate the components from each address. Run the cell below to import the addresses. (*This address list was fabricated for the purpose of this exercise.*)

In [None]:
from csv import reader
file = open('fake_address_list.csv')
address_contents = list(reader(file))
file.close()

**Question 5.1:** The `address_contents` variable contains the names and contact information for your party guests. Use it to create a list containing just the addresses (no names). Do not include the header (column name) in your list. 

In [None]:
address_list = ...
address_list

In [None]:
grader.check("q5_1")

**Question 5.2:** Use regular expressions to separate the City, State, and Zip Code from each address. Assign the following variables:

- `city`:  a list containing all the cities in your address list (preserving the order from the original file)
- `state`: a list containing all the state abbreviations in your address list (preserving order)
- `zip_code`: a list containing all the zip codes (type int) in your address list (preserving order)

In [None]:
city = ...
city

In [None]:
state = ...
state

In [None]:
zip_code = ...
zip_code

In [None]:
grader.check("q5_2")

**Question 5.3:** Use regular expressions to separate the address lines from each address. Assign the following variables:

- `address_line1`:  a list containing all the first lines of addresses in your address list (preserving the order from the original file)
- `address_line2`: a list containing all the second lines of addresses in your address list (preserving order). For addresses with no second line, this should contain an empty string. 

As an example, for the address 
```python
'404 Elm Blvd Apt 37, Denver, CO 48403'
```

the corresponding element in `address_line1` should be 
```python
'404 Elm Blvd'
```
and the corresponding element in `address_line2` should be 
```python
'Apt 37'
```

*HINT:* The second line of an address starts with either "Apt", "Unit", or "Suite".

In [None]:
address_line1 = ...
address_line1

In [None]:
address_line2 = ...
                 ...
                 ...
                 ...
address_line2

In [None]:
grader.check("q5_3")

## Dessert

#### Background: Bioinformatics
The flexibility of regular expressions is useful in many applications, including bioinformatics. A codon is a DNA or RNA sequence of 3 nucleotides that encodes a particular amino acid or gives a stop signal. For DNA, there are three stop codons: TAG, TAA, and TGA. If we want to match any sequence of DNA terminated by a stop codon, we can use this syntax:
`([ACTG])+(TAG|TAA|TGA)`.
- `[ACTG]` indicates any of the nucleotide bases (A, C, T, G) 
- the parentheses group patterns
- `+` modifies the previous group to match one or more times
- `(TAG|TAA|TGA)` indicates followed by one of the stop codons (the | notation signifies or)
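
To see the pieces described above working together, here is a minimal sketch on a short, made-up sequence (`dna` is hypothetical, not taken from any of the lab files). Note that this pattern looks for the stop codon anywhere in the string; it does not check that the codon falls on a three-base reading frame.

```python
import re

dna = "ACGTTAGCC"  # hypothetical sequence for illustration

hit = re.search(r"([ACTG])+(TAG|TAA|TGA)", dna)

# group(0) is the whole match: the leading bases plus the stop codon
print(hit.group(0))

# group(2) is the second set of parentheses: which stop codon was found
print(hit.group(2))
```

Because `+` is applied to a capturing group, `group(1)` would hold only the last base matched by that group, not the whole run of bases.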

Curly brackets allow flexibility in terms of how many repetitions we are searching for.  For example, `(AT){10,100}` matches an "AT" repeated 10 to 100 times.  `(AT){10,}` matches an "AT" repeated 10 or more times (no upper bound).
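
The same curly-bracket syntax can be tried on a smaller scale. The sketch below uses made-up sequences and smaller bounds (3 to 5 repeats) so the behavior is easy to see; the variable names are illustrative.

```python
import re

four_repeats = "GGATATATATCC"  # "AT" repeated 4 times
two_repeats = "GGATATCC"       # "AT" repeated 2 times

# {3,5}: between 3 and 5 repetitions; the match takes as many as it can
bounded = re.search(r"(AT){3,5}", four_repeats)
print(bounded.group(0))

# {3,}: 3 or more repetitions, no upper bound
unbounded = re.search(r"(AT){3,}", four_repeats)
print(unbounded.group(0))

# Only 2 repetitions are present, so there is no match
too_short = re.search(r"(AT){3,5}", two_repeats)
print(too_short)
```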

Open the .txt file *grape.txt* provided.  This file contains information about the Vitis vinifera (common grape) genome.  

The GATA protein is a transcription factor and is important for regulating transcription (the process where cells make an RNA copy of a piece of DNA which will later be used to make proteins). It binds to any short DNA sequence which matches the pattern GATA with either an A or a T before and either a G or an A after. For example, in the sequence `AAAAAAATGATAGAAAAAGATAAAAAA` there are two matches (find the substring GATA, then check that an A or a T comes right before it and a G or an A comes right after it).

Given a specific string, we can use a regular expression to find out how many times this motif occurs.

In [None]:
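def count_motifs(seq, motif):
    # Splitting on every match yields one more piece than there are matches.
    # This assumes `motif` has no capturing groups, since re.split would
    # otherwise add the captured text to the list of pieces.
    pieces = re.split(motif, seq)
    return len(pieces) - 1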

seq = 'AAAAAAATGATAGAAAAAGATAAAAAA'
count_motifs(seq, '[AT]GATA[GA]')
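
If you also want to see which substrings matched and where, `re.finditer()` yields one match object per non-overlapping match. This is an alternative sketch, not the approach used in `count_motifs` above; the names `example_seq`, `gata_motif`, and `matches` are chosen for illustration.

```python
import re

example_seq = 'AAAAAAATGATAGAAAAAGATAAAAAA'
gata_motif = r'[AT]GATA[GA]'

# Collect the matched text of every non-overlapping occurrence
matches = [m.group() for m in re.finditer(gata_motif, example_seq)]
print(matches)
print(len(matches))

# Each match object also knows where it starts in the sequence
for m in re.finditer(gata_motif, example_seq):
    print(m.start(), m.group())
```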

Huntington's Disease is a neurodegenerative disorder linked to the anomalous expansion of the number of trinucleotide repeats in particular genes. Humans have 23 pairs of chromosomes in their cells, and each parent contributes one chromosome to each pair. The gene that causes Huntington's Disease (HD) is found on chromosome 4. Each of us gets one copy of the gene from our mother and one copy from our father.

The gene responsible for HD contains a sequence with several CAG repeats (cytosine, adenine, guanine which are bases forming this specific codon). We all have these CAG repeats in the gene that codes for the huntingtin protein, but people with HD have a greater number than usual of CAG repeats in one of the genes they inherited.  (This protein is found in many of the body's tissues, with the highest levels of activity in the brain. Within cells, this protein may be involved in chemical signaling, transporting materials, binding to proteins and other structures, etc.)

The actual number of repeats of a specific codon determines the risk of developing HD. More than 35 repeats virtually assures the disease.  In this task, we will use regular expression to find the number of repeats of the CAG codon in a specific mRNA sequence. 

Run the cell below to import data from the *HTTmRNA.txt* file ([source](https://www.ncbi.nlm.nih.gov/nuccore/NM_002111.8?report=fasta)).

In [None]:
fhand = open('HTTmRNA.txt')
htt_mRNA = fhand.read()
fhand.close()

**Question 6.1:** Use a regular expression to determine how many times CAG is repeated.

**Tips:**
- `htt_mRNA` contains some new lines (`\n`). Remove the new lines before looking for the repeating CAG pattern.
- You may have to play around with the pattern to figure this out. 

In [None]:
htt_mRNA_cleaned = ...
htt_pattern = ...
match = ...
...
num_repeats = ...

In [None]:
grader.check("q6_1")

**Question 6.2:** Based on your result from the previous problem, do you think the mRNA sequence indicates a high probability of Huntington's Disease?

*Type your answer, replacing this text.*

### Submission
Congratulations on finishing Lab 3! Gus is very proud of you. Run the cell below to download a zip and upload to Canvas. 

<img src="gus_spies_on_neighbors.JPG" alt="drawing" width="300"/>

### References
If you want to read more about these topics, check out these sources.
- Python for Everybody: Exploring Data in Python 3 by Charles Severance.  https://www.py4e.com/book.php
- Möncke‐Buchner, Elisabeth, et al. "Counting CAG repeats in the Huntington’s disease gene by restriction endonuclease Eco P15I cleavage." Nucleic Acids Research 30.16 (2002): e83-e83.
- A Primer for Computational Biology by Shawn T. O'Neil. https://open.oregonstate.education/computationalbiology/chapter/bioinformatics-knick-knacks-and-regular-expressions/
- Using Regular Expression in Genetics with Python by Stephen Fordham.  https://towardsdatascience.com/using-regular-expression-in-genetics-with-python-175e2b9395c2

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)