In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab03.ipynb")

# Lab 3: Regular Expression with Python

Welcome to Lab 3 of DATA 271! 

This document contains examples and small tasks ("appetizers") for you to make sure you understand the examples. The culminating task ("main course") at the end of the document is more complex, and uses most of the topics you will have worked through. You should rarely remain stuck for more than a few minutes on questions in labs, so feel free to ask for help. Collaborating on labs is more than okay -- it's encouraged! Explaining things is beneficial -- the best way to solidify your knowledge of a subject is to explain it. Please don't just share answers, though. 

For this lab and all future ones, please be sure to not re-assign variables throughout the notebook! For example, if you use `my_list` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you passed previously!

### In today's lab, we will
- Learn basic syntax for regular expression in Python and be able to write simple regular expressions using common operations in pattern matching.
- Understand the flexibility regular expression affords in searching and articulate at least one real world example where this is useful
- Become familiar with Python's `re` module and some of its functions such as `findall()`, `search()`, and `sub()`.
- Become more familiar with using online resources such as documentation, "cheat sheets" and Stack Exchange to independently learn more about a technical topic.

## Overview

Regular expression (shortened as regex or regexp) can be used for pattern matching in text.  For example, when you use the "find and replace" feature in Microsoft Word, you are asking the computer to find specific strings which match a pattern and replace them with another string. We might desire a more flexible way to search and replace. For example, we might wish to locate and replace a word spelled two different ways in a text: serialise and serialize (British and American spelling). The regular expression `seriali[sz]e` matches both "serialise" and "serialize". Wildcard characters also achieve this, but are more limited in the patterns they can describe.
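
A quick way to check the `seriali[sz]e` example is with `re.findall()` from Python's `re` module, which is introduced later in this lab. The sample sentence and the variable names below are made up for illustration.

```python
import re

# made-up sample text containing both spellings
spelling_text = "We serialise in London and serialize in New York."

# [sz] matches exactly one character: either "s" or "z"
spelling_matches = re.findall(r"seriali[sz]e", spelling_text)
print(spelling_matches)  # ['serialise', 'serialize']
```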

Another example where this flexibility is useful is searching for and extracting email addresses from a file.  We know there will be an at sign (@), but don't know what the constraints are in front of it in terms of word length or characters used. It is possible to read through texts and look for patterns using string methods like `split()` and `find()`. However, searching and extracting is so common that there is a powerful library for these tasks (`re`).

The `re` module provides a set of powerful regular expression facilities, which allow you to quickly check whether a given string matches a given pattern at its start (using the `match()` function), or contains such a pattern anywhere (using the `search()` function). A regular expression is a string pattern written in a compact (and quite cryptic) syntax.  
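
The difference between `match()` and `search()` can be seen in a minimal sketch like the one below. The string and the variable names are illustrative only.

```python
import re

demo_txt = "The rain in Spain"

# match() only succeeds if the pattern is found at the very start of the string
start_rain = re.match(r"rain", demo_txt)
start_the = re.match(r"The", demo_txt)
print(start_rain)         # None
print(start_the.group())  # The

# search() scans the whole string and returns the first place the pattern occurs
anywhere_rain = re.search(r"rain", demo_txt)
print(anywhere_rain.group())  # rain
```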

The module functions fall into three categories:
- pattern matching
- substitution
- splitting

The regex describes a pattern to locate in the text, and then we can use specific methods to accomplish tasks.  You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?” You can also use regular expression to modify a string or to split it apart in various ways.

The documentation provides more details: https://docs.python.org/3/library/re.html.

### Pattern matching


In [None]:
import re # Imports the re module

#Check if the string starts with "The" and ends with "Spain":

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)

if x:
  print("YES! We have a match!")
else:
  print("No match")

In [None]:
#Check if the string starts with "The" and ends with "Spain":
txt = "The Running of the Bulls occurs in Pamplona, Spain"
x = re.search("^The.*Spain$", txt)

if x:
  print("YES! We have a match!")
else:
  print("No match")

In [None]:
#Check if the string starts with "The" and ends with "Spain":

txt = "The Louvre is in Paris, France"
x = re.search("^The.*Spain$", txt)

if x:
  print("YES! We have a match!")
else:
  print("No match")

### Print Matches
You can print any matches found with the following.

In [None]:
txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)
len(x) # how many matches are found

In [None]:
# returns an empty list if no matches are found
txt = "The rain in Spain"
x = re.findall("Portugal", txt)
print(x)

### Search
The `search()` function searches the string for a match and returns a match object if there is a match.  If there is more than one match, only the first occurrence will be returned.  If there are no matches, `None` is returned.  The match object returned has methods which can provide more information about the search, such as
- `start()`, which returns the start position of the match
- `span()`, which returns a tuple containing the start and end positions of the match
- `group()`, which returns the part of the string where there was a match

In [None]:
txt = "The rain in Spain"
# search for first white space character \s
x = re.search(r"\s", txt) # raw string so that \s reaches the regex engine unchanged

print("The first white-space character is located in position:", x.start())

In [None]:
x = re.search("Portugal", txt)
print(x)

In [None]:
x = re.search("ai", txt)
print(x) #this will print an object

In [None]:
x = re.search(r"\bS\w+", txt)
print(x)
# span returns the start and end position of the first match occurrence.
print(x.span())

In [None]:
# look for a word that starts with an upper case S
x = re.search(r"\bS\w+", txt)
# print the part of the string where there was a match
print(x.group())

### Split
The `split()` function returns a list where the string has been split on each match.  Notice this can also be accomplished with the string method `split()`.

In [None]:
# split on white space
x = re.split(r"\s", txt)
print(x)

In [None]:
txt = "The rain in Spain"
x = txt.split()
print(x)
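
The two approaches agree on "The rain in Spain", but they can differ when a string contains runs of whitespace. The sketch below uses a made-up string with repeated spaces to show the difference.

```python
import re

spaced = "The  rain in   Spain"  # made-up string with repeated spaces

# \s splits at every single whitespace character, leaving empty strings behind
split_single = re.split(r"\s", spaced)
# \s+ treats each run of whitespace as one separator
split_runs = re.split(r"\s+", spaced)
# str.split() with no argument also splits on runs of whitespace
split_str = spaced.split()

print(split_single)  # ['The', '', 'rain', 'in', '', '', 'Spain']
print(split_runs)    # ['The', 'rain', 'in', 'Spain']
print(split_str)     # ['The', 'rain', 'in', 'Spain']
```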

### Substitution
The `sub()` function replaces the matches found with a string you indicate.  You can control how many replacements are done with the optional count parameter.

In [None]:
# replace spaces with the number nine as a string
x = re.sub(r"\s", "9", txt)
print(x)

In [None]:
x = re.sub(r"\s", "9", txt, count=2) # replace only the first two matches
print(x)

### Example application: Bioinformatics
The flexibility of regular expression is particularly useful in bioinformatics.  A codon is a DNA or RNA sequence of 3 nucleotides that encodes a particular amino acid or gives a stop signal.  For DNA, there are three stop codons: TAG, TAA, and TGA.  If we want to match any sequence of DNA terminated by a stop codon, we can use this syntax:
`([ACTG])+(TAG|TAA|TGA)`.
- `[ACTG]` indicates any of the nucleotide bases (A, C, T, G) 
- the parentheses group patterns
- `+` modifies the previous group to match one or more times
- `(TAG|TAA|TGA)` indicates followed by one of the stop codons (the | notation signifies or)

Curly brackets allow flexibility in terms of how many repetitions we are searching for.  For example, `(AT){10,100}` matches an "AT" repeated 10 to 100 times.  `(AT){10,}` matches an "AT" repeated 10 or more times (no upper bound).
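
The two patterns above can be tried on short, made-up sequences. This sketch uses `re.fullmatch()`, which requires the entire string to fit the pattern, and swaps in the smaller bounds `{2,3}` so the test strings stay readable.

```python
import re

stop_pattern = r"([ACTG])+(TAG|TAA|TGA)"

# one or more bases followed by a stop codon
ends_with_stop = bool(re.fullmatch(stop_pattern, "ATGCCCTAG"))
no_stop = bool(re.fullmatch(stop_pattern, "ATGCCCTAC"))
print(ends_with_stop)  # True
print(no_stop)         # False (TAC is not a stop codon)

# {2,3} means the group (AT) must repeat two or three times
repeat_pattern = r"(AT){2,3}"
one_repeat = bool(re.fullmatch(repeat_pattern, "AT"))
three_repeats = bool(re.fullmatch(repeat_pattern, "ATATAT"))
four_repeats = bool(re.fullmatch(repeat_pattern, "ATATATAT"))
print(one_repeat, three_repeats, four_repeats)  # False True False
```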

Open the .txt file *grape.txt* provided.  This file contains information about the Vitis vinifera (common grape) genome.  

The GATA protein is a transcription factor and is important for regulating transcription (the process where cells make an RNA copy of a piece of DNA which will later be used to make proteins). It binds to any short DNA sequence which matches the pattern GATA with either an A or a T before and either a G or an A after.  For example, in this sequence 
- `AAAAAAATGATAGAAAAAGATAAAAAA`
there are two matches. To check by hand, find each occurrence of the substring GATA, then confirm that the character before it is an A or a T and the character after it is a G or an A.

Given a specific string, we can use regular expression to find out how many times this motif occurs.

In [None]:
def count_motifs(seq, motif):
    # splitting on each match yields one more piece than there are matches
    pieces = re.split(motif, seq)
    return len(pieces) - 1

seq = 'AAAAAAATGATAGAAAAAGATAAAAAA'
count_motifs(seq, '[AT]GATA[GA]')
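
As an alternative sketch (not the approach used in the cell above), `re.findall()` returns the matched substrings themselves, so the motifs can be inspected as well as counted. Note that `findall()` only reports non-overlapping matches. The variable names here are illustrative.

```python
import re

demo_seq = 'AAAAAAATGATAGAAAAAGATAAAAAA'

# each list entry is one non-overlapping match of the motif
motif_hits = re.findall('[AT]GATA[GA]', demo_seq)
print(motif_hits)       # ['TGATAG', 'AGATAA']
print(len(motif_hits))  # 2
```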

## Appetizers
Now it's time for you to get some practice.

**Question 1:** We have now seen different syntaxes to flexibly control what we want to match or replace.  Google "regular expression in Python cheat sheet" and download one of your choosing. Or use [this one](https://canvas.humboldt.edu/courses/71553/files/5254145?wrap=1). Take a moment to read through it.
Choose four different syntaxes on the cheat sheet and write a small example with a string of your choosing like "The rain in Spain" to test them.  
Specifically experiment with
- `[]` vs `()`
- `+` vs `*`
- `{}`
- etc.

Find one question on Stack Overflow related to regular expression in Python and read the answer and test it out with code. (If the first one you find doesn't make sense, look for another.)  Did you learn anything about syntax from the example you found? Be prepared to explain the problem and solution to a peer.

In [None]:
any_string = ...
some_practice = ...
some_practice

In [None]:
any_string = ...
some_practice = ...
some_practice

In [None]:
any_string = ...
some_practice = ...
some_practice

In [None]:
any_string = ...
some_practice = ...
some_practice

In [None]:
any_string = ...
some_practice = ...
some_practice

**Question 2:** We will use the text file *emails.txt* for this exercise. If you are working from your local device, this file needs to be in the same directory as your Jupyter notebook. If you are using JupyterHub, this is already done for you. 

Here is a snippet of the file:

From bkirschn@umich.edu Fri Dec 21 09:55:06 2007
Return-Path: <postmaster@collab.sakaiproject.org>
Received: from murder (mail.umich.edu [141.211.14.25])
	 by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;
	 Fri, 21 Dec 2007 09:55:06 -0500
X-Sieve: CMU Sieve 2.3
Received: from murder ([unix socket])
	 by mail.umich.edu (Cyrus v2.2.12) with LMTPA;
	 Fri, 21 Dec 2007 09:55:06 -0500
Received: from dreamcatcher.mr.itd.umich.edu (dreamcatcher.mr.itd.umich.edu [141.211.14.43])
	by panther.mail.umich.edu () with ESMTP id lBLEt6x8006098;
	Fri, 21 Dec 2007 09:55:06 -0500
Received: FROM paploo.uhi.ac.uk (app1.prod.collab.uhi.ac.uk [194.35.219.184])
	BY dreamcatcher.mr.itd.umich.edu ID 476BD3C4.BFDC1.28307 ; 
	21 Dec 2007 09:55:03 -0500
Received: from paploo.uhi.ac.uk (localhost [127.0.0.1])
	by paploo.uhi.ac.uk (Postfix) with ESMTP id A4CC6A7DD7;
	Fri, 21 Dec 2007 14:51:39 +0000 (GMT)
Message-ID: <200712211454.lBLEs7d9009944@nakamura.uits.iupui.edu>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit


Use the `search()` function to make a list of the lines in the .txt file which contain the word `"From"`. Be sure that the lines that end up in your list do not include any trailing whitespace. 

*HINT:* The `rstrip()` method removes trailing characters (characters at the end of a string). By default it removes whitespace, including the newline at the end of each line read from a file.
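
A small sketch of how `rstrip()` behaves, using a made-up line with a trailing space and newline, plus a second made-up string to show the optional argument:

```python
# made-up line with a trailing space and newline, as you might read from a file
raw_line = "From bkirschn@umich.edu Fri Dec 21 09:55:06 2007 \n"

# with no argument, rstrip() removes all trailing whitespace
stripped_line = raw_line.rstrip()
print(repr(stripped_line))  # 'From bkirschn@umich.edu Fri Dec 21 09:55:06 2007'

# with an argument, rstrip() removes those trailing characters instead
stripped_stars = "data***".rstrip("*")
print(repr(stripped_stars))  # 'data'
```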

In [None]:
hand = open('emails.txt')
from_lines = []
for ... in ...: 
    clean_line = ... # remove trailing characters
    if ...: # search for the word "From"
        ... # add to list
hand.close()
from_lines

In [None]:
grader.check("q2")

The real power of regular expression comes from adding special characters to the search string to more precisely control which lines match the string.  For example, we can search for lines that start with `From:` and have an `@` sign.  
- the `^` symbol indicates the start of the line
- the `.+` means one or more characters.  Think of this as a wildcard expanding to match an unspecified number of characters.
- the `@` looks for this sign
- putting this together, we can search for `'^From:.+@'`

Notice that we don't have to specify how many characters are before the @.  This is good because email addresses vary.  For example, Humboldt's Math Department email is math@humboldt.edu and the Biology Department's email is biosci@humboldt.edu.  They have a different number of characters before the @ sign.  Therefore we need flexibility in the searching.

Here is code to accomplish this task.

In [None]:
hand = open('emails.txt')
from_lines2 = []
for line in hand:
    clean_line = line.rstrip()
    if re.search('^From:.+@', clean_line): # search for lines starting with From: followed by one or more characters (.+) followed by an @ sign
        from_lines2.append(clean_line)
hand.close()
from_lines2

### Extracting Data with Regular Expression
The method `findall()` finds *all* the matches and returns them as a list of strings, with each string representing one match.
If we would like to find all strings that look like email addresses, we can search `'\S+@\S+'`.  This works because
- `\S` matches a single character other than whitespace.  Adding the `+` means one or more characters other than whitespace, so `\S+` matches as many non-whitespace characters as possible (greedy).
- `@` looks for the sign in all email addresses
- `\S+` again looks for non-whitespace characters

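The terms greedy and lazy in regular expression describe how much text a quantifier such as `+` or `*` consumes:
- greedy (default): match as many characters as possible while still allowing the overall pattern to match
- lazy (indicated with a `?` directly after the quantifier, as in `+?` or `*?`): match as few characters as possible while still allowing the overall pattern to match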
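
The difference between greedy and lazy matching is easiest to see side by side. The string below is made up for illustration.

```python
import re

tagged = "<one> and <two>"  # made-up example string

# greedy: .+ grabs as much as it can, so the match runs to the last ">"
greedy = re.findall(r"<.+>", tagged)
# lazy: .+? stops at the first ">" that lets the pattern succeed
lazy = re.findall(r"<.+?>", tagged)

print(greedy)  # ['<one> and <two>']
print(lazy)    # ['<one>', '<two>']
```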

In [None]:
hand = open('emails.txt')
for line in hand:
    clean_line = line.rstrip()
    x = re.findall(r'\S+@\S+', clean_line) # one or more non-whitespace characters, an @, then one or more non-whitespace characters
    if len(x) > 0:
        print(x)
hand.close()

We see some of the email addresses returned have characters we might not want.  For example, we might want to remove the `<` and `>` in this address: `<postmaster@collab.sakaiproject.org>`.  We can just keep the portion of the string that starts with a letter or a number and ends with a letter.
- square brackets are used to indicate a set of multiple acceptable characters we consider matching
- `[a-zA-Z0-9]\S*@\S*[a-zA-Z]` tells us to look for substrings that
    - start with a single lowercase character, uppercase character or digit (`[a-zA-Z0-9]`)
    - followed by zero or more nonblank characters (`\S*`)
    - has an `@` sign
    - is followed by zero or more nonblank characters and then ends with a single letter (`\S*[a-zA-Z]`)
    - note that `*` means zero or more and `+` means one or more, applied to the single character or character set immediately to the left
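
Before running a pattern over the whole file, it can help to try it on a single line. The sketch below compares the earlier `\S+@\S+` pattern with the stricter one on one header line taken from the snippet of *emails.txt* shown above; the variable names are illustrative.

```python
import re

header = "Return-Path: <postmaster@collab.sakaiproject.org>"

# the looser pattern keeps the angle brackets
loose = re.findall(r"\S+@\S+", header)
# the stricter pattern must start with a letter or digit and end with a letter
strict = re.findall(r"[a-zA-Z0-9]\S*@\S*[a-zA-Z]", header)

print(loose)   # ['<postmaster@collab.sakaiproject.org>']
print(strict)  # ['postmaster@collab.sakaiproject.org']
```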

In [None]:
hand = open('emails.txt')
for line in hand:
    clean_line = line.rstrip()
    x = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', clean_line) 
    if len(x) > 0:
        print(x)
hand.close()

**Question 3.1:** Let the following string be considered: `'X-DSPAM-Confidence: 0.8475'`. Use `find` and string slicing to extract the number and convert it to a float. *HINT:* Everything after the colon is the number.

In [None]:
phrase = 'X-DSPAM-Confidence: 0.8475'

col_pos = ...
number = ...
number

In [None]:
grader.check("q3_1")

**Question 3.2:** Consider the same string: `'X-DSPAM-Confidence: 0.8475'`. Complete the same task you did in question 3.1, but this time use regular expression.

In [None]:
number_with_re = ...
number_with_re = ...
number_with_re

In [None]:
grader.check("q3_2")

**Question 4.1:** Using *emails.txt*, extract the hour of day that email messages were sent and put them into a list. Do this with two calls to `split()`: splitting on the colon and then on spaces is one way to extract the hour (e.g., 09 for 9 am).

In [None]:
hand = open('emails.txt')
hour_of_day = ... 
for ... in ...: 
    clean_line = line.rstrip() # remove trailing characters
    if not clean_line.startswith('From '): 
        continue # do nothing if it is not a line we are interested in 
    x = ... # split on colon 
    y = ... # split on space 
    ... # add hour to list 
hand.close()

hour_of_day

In [None]:
grader.check("q4_1")

**Question 4.2:** Using *emails.txt*, extract the hour of day that email messages were sent and put them into a list like you did in question 4.1. This time, use regular expression.

*HINT:* Look for lines that start with `From` then have a space, potentially some number of characters followed by a space and two digits followed by a colon.  Extract the two digits, indicated with the square brackets.

In [None]:
hand = open('emails.txt')
hour_of_day_with_re = []
...
    clean_line = ...
    x = ...
    ...
        ...
hand.close()
hour_of_day_with_re

In [None]:
grader.check("q4_2")

### 5. Main Course
In this problem you will read through and parse a file with text and numbers. You will extract all the numbers in the file and compute the sum of the numbers.  


The file contains text from a data science textbook introduction with random numbers inserted throughout the text.

For example, the text might look like this:

Why should you learn to write programs? 7746
12 1929 8827
Writing programs (or programming) is a very creative 
7 and rewarding activity.  You can write programs for 
many reasons, ranging from making your living to solving
8837 a difficult data analysis problem to having fun to helping 128
someone else solve a problem.  This book assumes that 
everyone needs to know how to program ...


The data can be found at this link: http://py4e-data.dr-chuck.net/regex_sum_1742785.txt. 

The basic outline of this problem is to 
- read the file
- look for integers using `re.findall()`
- look for a regular expression of `'[0-9]+'` 
- convert the extracted strings to integers
- sum up the integers.
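
One detail worth checking before you start: `re.findall()` returns the numbers as strings, not integers. The sketch below runs the pattern on a single line copied from the sample text above and leaves the conversion and summing to you.

```python
import re

sample_line = "8837 a difficult data analysis problem to having fun to helping 128"

# [0-9]+ matches each run of one or more digits
found = re.findall(r"[0-9]+", sample_line)
print(found)           # ['8837', '128']
print(type(found[0]))  # <class 'str'>
```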


**Question 5.1:** Download the file from the link above and read the file. Look for integers using `re.findall()`. Make a list of lists containing the numbers in each line of the file. 

In [None]:
file = open('regex_sum_1742785.txt')

numbers_in_line = ...
...
    ...
    
file.close()
numbers_in_line

In [None]:
grader.check("q5_1")

**Question 5.2:** Convert the strings from the previous question to integers.

In [None]:
strings_to_ints = ...
...
    ...
        ...
            ...
strings_to_ints         

In [None]:
grader.check("q5_2")

**Question 5.3:** Add up all the integers from problem 5.2. 

In [None]:
sum_all_nums = ...
sum_all_nums

In [None]:
grader.check("q5_3")

### 6. Dessert
Huntington's Disease is a neurodegenerative disorder and is linked to the anomalous expansion of the number of trinucleotide repeats in particular genes.  Human beings have 23 pairs of chromosomes in our cells and each of our parents contributes one chromosome to each pair. The gene that causes Huntington's Disease (HD) is found on chromosome 4.  Each of us gets one copy of the gene from our mother and one copy from our father.

The gene responsible for HD contains a sequence with several CAG repeats (cytosine, adenine, guanine which are bases forming this specific codon). We all have these CAG repeats in the gene that codes for the huntingtin protein, but people with HD have a greater number than usual of CAG repeats in one of the genes they inherited.  (This protein is found in many of the body's tissues, with the highest levels of activity in the brain. Within cells, this protein may be involved in chemical signaling, transporting materials, binding to proteins and other structures, etc.)

The actual number of repeats of a specific codon determines the risk of developing HD. More than 35 repeats virtually assures the disease.  In this task, we will use regular expression to find the number of repeats of the CAG codon in a specific mRNA sequence.

**Question 6:** Using the *HTTmRNA.txt* file ([source](https://www.ncbi.nlm.nih.gov/nuccore/NM_002111.8?report=fasta)), use regular expression to determine how many times CAG is repeated. *HINT:* You may have to play around with the pattern to figure this out. 

In [None]:
fhand = open('HTTmRNA.txt')
htt_mRNA = ...
htt_pattern = ...
match = ...
print(len(match))
num_repeats = ...
fhand.close()

In [None]:
grader.check("q6_1")

### You're done!
Congratulations on finishing Lab 3! Gus is very proud of you. Run the cell below to download a zip and upload to Canvas. 

<img src="gus_spies_on_neighbors.JPG" alt="drawing" width="300"/>

### References
- Python for Everybody: Exploring Data in Python 3 by Charles Severance.  https://www.py4e.com/book.php
- Möncke‐Buchner, Elisabeth, et al. "Counting CAG repeats in the Huntington’s disease gene by restriction endonuclease Eco P15I cleavage." Nucleic Acids Research 30.16 (2002): e83-e83.
- A Primer for Computational Biology by Shawn T. O'Neil.  https://open.oregonstate.education/computationalbiology/chapter/bioinformatics-knick-knacks-and-regular-expressions/
- Using Regular Expression in Genetics with Python by Stephen Fordham.  https://towardsdatascience.com/using-regular-expression-in-genetics-with-python-175e2b9395c2

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)