*This is a Jupyter Notebook. To run a gray code cell, click in the cell and either click the play arrow or type shift+enter (shift+return on a Mac).*

# <br>Finding Patterns in Strings

This is a task that comes up a lot in coding. Some common uses:
- see if data is present
- extract data from a software output file
- extract data from text
- remove data (names, pronouns, or other info) from text


This can be divided into two general questions:
1. Is the data there?
2. Where is the data? (to remove it or collect it)

For matching exact substrings, Python has some built-in functions. For matching more complicated patterns, you will need to use the Python module `re`, which stands for **regular expressions**.

**Regular expressions** are not unique to Python. A regular expression is a pattern written in a compact language of symbols. Each symbol designates either something to search for (for example, a digit or an uppercase letter) or some way to search (for example, "do not include", "match any number of", or "only match at the beginning or end of a word"). These symbols can be combined to search for the exact pattern you are looking for. Regular expressions are also called **regexes**, and working with them is often just called **regex**.

This pattern language has many symbols, and we do not have time to learn all possible searches today. You may be wondering, "Can I search for (fill in your pattern here)?" The answer is almost always yes, but we might not get to it today. For a more thorough tutorial on the `re` module in Python, I recommend the Real Python Regex tutorial: https://realpython.com/regex-python/. Most of RealPython's tutorials are behind a paywall, but this one is available for free.

## <br><br>Searching for exact strings using Python's built-in functions

#### Is the data there?

To see if an exact substring is present, we use the `in` boolean operator:

In [139]:
full = "Morgan Taylor 555-555-1890"

In [140]:
if "Taylor" in full:
    print("True")

True


In [141]:
if " 555-" in full:
    print("True")

True


In [142]:
" 555-" in full

True
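
One detail worth keeping in mind is that `in` is case-sensitive. A minimal sketch, assuming you want to ignore case, is to lowercase the full string before testing (the variable names `exact` and `ignoring_case` are only for illustration):

```python
full = "Morgan Taylor 555-555-1890"

# "taylor" (lowercase t) is not an exact substring of full
exact = "taylor" in full

# Lowercasing the full string first makes the test ignore case
ignoring_case = "taylor" in full.lower()

print(exact)
print(ignoring_case)
```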

#### <br><br>Where is the data?

The string method `find()` returns the index position of the first character of the first occurrence of the search string. If the search string is not found, it returns `-1`.

In [143]:
full = "Morgan Taylor 555-555-1890"

In [144]:
full.find(" 555-")

13

In [145]:
full[13]

' '

In [146]:
full[13:]

' 555-555-1890'

In [147]:
full.find("Taylor")

7

If we wanted to return the entire substring in the full string (to collect it or remove it) we could do something like this:

In [26]:
search_term = "Taylor"
length_term = len(search_term)
start = full.find(search_term)
end = start + length_term
full[start:end]

'Taylor'
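
One thing to watch for with this approach: `find()` returns `-1` when the search term is absent, and slicing with `-1` quietly gives a wrong answer instead of an error. The sketch below adds a guard for that case. The helper name `extract_term` is hypothetical and not part of the original notebook.

```python
full = "Morgan Taylor 555-555-1890"

def extract_term(text, term):
    """Return the matched substring, or None if term is not in text."""
    start = text.find(term)
    if start == -1:              # find() signals "not found" with -1
        return None
    return text[start:start + len(term)]

found = extract_term(full, "Taylor")
missing = extract_term(full, "Tylor")
print(found)
print(missing)
```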

#### <br><br>A note about collecting or removing data from strings

Sometimes you will need to search for one string in order to locate another. This happens when the data you need doesn't follow a strict pattern, but some substring near it does.

For example, let's say you wanted to extract filenames from a list of text lines. The filenames are all different, with different extensions, but each filename is preceded by the phrase "Filename: ". 

In [20]:
data = ["Date: 11-10-19\tFilename: 73820.pdf\tAuthor: User", 
        "Date: Unknown\tFilename: giantPicture.jpeg\tAuthor: Unknown", 
        "Date: 12-01-19\tFilename: gene3820028.txt\tAuthor: User", 
        "Filename: P38HAK8.EXE\tAuthor: Admin"]

Using regular expressions, you could search for the pattern of the actual filename - a series of letters or numbers followed by a dot followed by three or four letters. But it would be much easier to search for the consistent string "Filename: " and take what is after it. Keep this in mind when searching for patterns in strings: The pattern may be before or after the data you actually desire.

In [22]:
for line in data:
    start = line.find("Filename: ")
    short_line = line[start:].replace("Filename: ", "")
    end = short_line.find("\t")
    filename = short_line[:end]
    print(filename)
    

73820.pdf
giantPicture.jpeg
gene3820028.txt
P38HAK8.EXE
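
The loop above relies on a tab following every filename. If the filename were the last field on a line, `find("\t")` would return `-1` and the slice would drop the final character. As one possible alternative, the sketch below splits each line on tabs and keeps the field that starts with the label. The third line of `data` here is an invented example added to show that case.

```python
# Hypothetical data: the last line is an added example where the filename
# is the final field, so no tab follows it.
data = ["Date: 11-10-19\tFilename: 73820.pdf\tAuthor: User",
        "Filename: P38HAK8.EXE\tAuthor: Admin",
        "Author: Admin\tFilename: notes.md"]

label = "Filename: "
filenames = []
for line in data:
    for field in line.split("\t"):       # one field per tab-separated chunk
        if field.startswith(label):
            filenames.append(field[len(label):])   # drop the label itself
print(filenames)
```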


### <br><br>Exercise 1

In [148]:
output = "The file provided produced two statistics. Average time spent in the pool was 1.112 hours. Average number of swimmers was 46.0."
print(output)

The file provided produced two statistics. Average time spent in the pool was 1.112 hours. Average number of swimmers was 46.0.


Run the line of code above to store the output. Write code using the `find()` function to return the average time spent in the pool. (In this example, you want to return `1.112`). This will take more than one line of code and there are multiple ways to do it.

In [25]:
search_string = "spent in the pool was "
search_length = len(search_string)
search_string2 = " hours."
start = output.find(search_string) + search_length
end = output.find(search_string2)
print(output[start:end])

1.112
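
As noted, there are multiple ways to do this. Another sketch uses the optional second argument of `find()`, which tells it where to start searching. Here it looks for the first space after the number, so no second search string is needed. The names `marker` and `average_time` are only for illustration.

```python
output = "The file provided produced two statistics. Average time spent in the pool was 1.112 hours. Average number of swimmers was 46.0."

marker = "spent in the pool was "
start = output.find(marker) + len(marker)   # index where the number begins
end = output.find(" ", start)               # first space at or after start
average_time = output[start:end]
print(average_time)
```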


## <br><br>The `re` module

First, we need to import the module:

In [27]:
import re

<br><br>We can use the function `re.search()` to find out if a string is present AND the location of the string. `re.search()` is not a method function - i.e. it doesn't go after an object. Instead, it takes two arguments - the expression to find and the string to search in.

In [32]:
full = "Morgan Taylor 555-555-1890"
re.search("Taylor", full)

<re.Match object; span=(7, 13), match='Taylor'>

This returns a Match object. It tells us the start and end positions of the match (**span**) and the term that was matched (**match**). In this case, we asked for an exact match, so our match is the same as the search term.

If no match is found, `re.search()` returns `None`, so the notebook displays nothing.

In [34]:
re.search("Tylor", full)

<br><br>The result of `re.search()` can also be used as a Boolean, because a Match object counts as true and `None` counts as false:

In [35]:
if re.search("Taylor", full):
    print("A match!")
else:
    print("No match.")

A match!


In [36]:
if re.search("Tylor", full):
    print("A match!")
else:
    print("No match.")

No match.


<br><br>We can return only some information from our search:

The span:

In [37]:
re.search("Taylor", full).span()

(7, 13)

The start position:

In [40]:
re.search("Taylor", full).start()

7

The end position:

In [41]:
re.search("Taylor", full).end()

13

The string that was matched:

In [42]:
re.search("Taylor", full).group()

'Taylor'
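
Chaining `.span()`, `.start()`, `.end()`, or `.group()` directly onto `re.search()` works only when there is a match. When there is none, `re.search()` returns `None`, and calling a method on `None` raises an `AttributeError`. A minimal sketch of checking first, using a hypothetical helper named `match_span`:

```python
import re

full = "Morgan Taylor 555-555-1890"

def match_span(pattern, text):
    """Return the (start, end) span of the first match, or None."""
    m = re.search(pattern, text)
    if m is None:        # no match, so there is no Match object to ask
        return None
    return m.span()

hit = match_span("Taylor", full)
miss = match_span("Tylor", full)
print(hit)
print(miss)
```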

### <br><br>Exercise 2

In [149]:
output = "The file provided produced two statistics. Average time spent in the pool was 1.112 hours. Average number of swimmers was 46.0."
print(output)

The file provided produced two statistics. Average time spent in the pool was 1.112 hours. Average number of swimmers was 46.0.


Now that you have learned how to use `re.search().start()` and `re.search().end()`, you can complete the same task as in Exercise 1 more easily. Write code to return the average number of hours spent in the pool from the string above. In this case, it should return `1.112`.

In [55]:
start = re.search("spent in the pool was ", output).end()
end = re.search(" hours.", output).start()
output[start: end]

'1.112'

### <br><br>Finding multiple occurrences

`re.search()` finds the first occurrence of the search term, when searching the full string from left to right. `re.findall()` returns a list of all occurrences in the full string, and `re.finditer()` returns an iterator of Match objects, one for each occurrence, which gives you the start and end positions of every match.

In [43]:
print(full)

Morgan Taylor 555-555-1890


In [44]:
re.search("555", full)

<re.Match object; span=(14, 17), match='555'>

In [45]:
re.findall("555", full)

['555', '555']

In [47]:
re.finditer("555", full)

<callable_iterator at 0x11304e5f8>

`re.finditer()` returns an iterator that you can loop through to reveal the Match object of each match:

In [48]:
matches = re.finditer("555", full)
for m in matches:
    print(m)

<re.Match object; span=(14, 17), match='555'>
<re.Match object; span=(18, 21), match='555'>


In [50]:
for m in re.finditer("555", full):
    print(m.span())
    print(m.start())
    print(m.end())

(14, 17)
14
17
(18, 21)
18
21
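
If you want to keep the positions rather than print them, one option is to collect them into a list. The sketch below also shows that the iterator from `re.finditer()` can only be looped through once. The names `positions` and `leftover` are only for illustration.

```python
import re

full = "Morgan Taylor 555-555-1890"

matches = re.finditer("555", full)
# Store (start, end, matched text) for every occurrence
positions = [(m.start(), m.end(), m.group()) for m in matches]

# The iterator is now used up, so a second pass finds nothing
leftover = list(matches)

print(positions)
print(leftover)
```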


### <br><br>Exercise 3

In [51]:
full_text = "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?' So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her. There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, `Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge."

Run the cell above. Write code to print the start position for every occurrence of "Alice" in `full_text`.

In [56]:
for i in re.finditer("Alice", full_text):
    print(i.start())

0
260
646
1014


### <br><br>Searching for inexact strings using regular expressions

Functions that are part of the `re` module can search for regular expressions. Remember, this is like a special language designed for searching for patterns.

Most characters (such as letters and numbers) will match themselves, like how we searched for "Alice" in the exercise above.

**Metacharacters** are special characters that mean something in a regular expression. <br>These are `. ^ $ * + ? { } [ ] \ | ( )`

#### <br><br>The `[]` metacharacters

**Character classes** are groups of characters that you can define yourself by enclosing them inside the metacharacters `[` and `]`. If you create a character class, `re` will match to any of the characters inside the class - kind of like an `or`:

In [60]:
phone_numbers = "800-555-2928 247-555-2827 860-555-2029 888-555-7262 860-555-2876"

In [61]:
re.findall("[860]", phone_numbers)

['8',
 '0',
 '0',
 '8',
 '8',
 '8',
 '6',
 '0',
 '0',
 '8',
 '8',
 '8',
 '6',
 '8',
 '6',
 '0',
 '8',
 '6']

`re.findall()` returned all occurrences of 8, 6, or 0 in the string, one character at a time.

<br><br>We can also pass multiple character classes in a row to return a pattern. For example, let's treat 800, 888, and 860 as toll-free prefixes for this exercise (in reality, 860 is a regular area code). Let's say we want to find all of the toll-free numbers in the string. We can search for three digits followed by a "-". The first digit should be an 8. The second can be a 0, 6, or 8. The third can be a 0 or 8. Note that this pattern would also match other combinations, such as 808- or 880-.

In [63]:
re.findall("8[086][08]-", phone_numbers)

['800-', '860-', '888-', '860-']

In [64]:
for m in re.finditer("8[086][08]-", phone_numbers):
    print(m.start())

0
26
39
52
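
To make the difference between a plain string and a character class concrete, the sketch below runs both against the same phone numbers. The variable names `literal` and `single` are only for illustration.

```python
import re

phone_numbers = "800-555-2928 247-555-2827 860-555-2029 888-555-7262 860-555-2876"

# Without brackets: the exact three-character string "860"
literal = re.findall("860", phone_numbers)

# With brackets: any ONE character that is an 8, a 6, or a 0
single = re.findall("[860]", phone_numbers)

print(literal)
print(len(single))
```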


### <br><br>Exercise 4

In [65]:
print(full_text)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?' So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her. There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, `Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and the

Run the cell above to print the text from Alice in Wonderland. Write code to find every occurrence of the word "rabbit" in `full_text`. You also want to include any occurrence where "rabbit" is capitalized, like "Rabbit".

In [66]:
re.findall("[Rr]abbit", full_text)

['Rabbit', 'Rabbit', 'Rabbit', 'rabbit', 'rabbit']

#### <br><br>The `\` metacharacter

**Special sequences** are predefined patterns that start with the metacharacter `\` followed by a letter.

`\d` matches any digit (0-9). `\D` matches any character that is not a digit.

In [69]:
address_book = "Ross McFluff: 834.345.1254 155 Elm Street\nRonald Heathmore: 892.345.3428 436 Finley Avenue\nFrank Burger: 925.541.7625 662 South Dogwood Way\nHeather Albrecht: 548.326.4584 919 Park Place"
print(address_book)

Ross McFluff: 834.345.1254 155 Elm Street
Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way
Heather Albrecht: 548.326.4584 919 Park Place


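To find all the phone numbers, we could write the pattern below. The `.` metacharacter matches any single character except a newline, so it matches the dots between the groups of digits. The `r` before the opening quote makes the pattern a raw string, which stops Python from interpreting the backslashes itself before `re` sees them:

In [71]:
re.findall(r"\d\d\d.\d\d\d.\d\d\d\d", address_book)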

['834.345.1254', '892.345.3428', '925.541.7625', '548.326.4584']
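
Because an unescaped `.` matches any character, this pattern accepts more than dots as separators. The sketch below compares it with a version where each dot is preceded by a backslash, which matches only a literal dot (escaping metacharacters is covered later in this notebook). The `sample` string is invented for illustration.

```python
import re

# Hypothetical sample with three different separator styles
sample = "834.345.1254 and 834-345-1254 and 834x345y1254"

# Unescaped dot: any character is accepted between the digit groups
loose = re.findall(r"\d\d\d.\d\d\d.\d\d\d\d", sample)

# Escaped dot: only a literal "." is accepted
strict = re.findall(r"\d\d\d\.\d\d\d\.\d\d\d\d", sample)

print(loose)
print(strict)
```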

<br>`\s` is another special sequence that matches a whitespace character. `\S` matches a character that is not whitespace. Whitespace characters include a space, a tab, and a newline.

Let's change the address book a bit:

In [72]:
address_book = "Ross McFluff: 834.345.1254 155 Elm Street\nRonald Heathmore: 892-345-3428 436 Finley Avenue\nFrank Burger: (925)541-7625 662 South Dogwood Way\nHeather Albrecht: 548.326.4584 919 Park Place"
print(address_book)

Ross McFluff: 834.345.1254 155 Elm Street
Ronald Heathmore: 892-345-3428 436 Finley Avenue
Frank Burger: (925)541-7625 662 South Dogwood Way
Heather Albrecht: 548.326.4584 919 Park Place


Now the phone numbers follow a less predictable pattern. We can use `\S` to still write code to return the phone numbers:

In [73]:
re.findall(r"\d\d\d\S\d\d\d\S\d\d\d\d", address_book)

['834.345.1254', '892-345-3428', '925)541-7625', '548.326.4584']

### <br><br>Exercise 5

In [81]:
personal_data = "8372083726288837262622793747576548774630847399485020928374756145803098474928225677409908799902967767684738929098972233441 Herman Johnson Daniel Weber Althea Jannon Bridget Booster-Braxton Michael Lee John Forsythe Roberta Jackson Millie Samuels Lucas Dolner Edgar Boris Bridges Rich Weinhart Lisa Bressley"
print(personal_data)

8372083726288837262622793747576548774630847399485020928374756145803098474928225677409908799902967767684738929098972233441 Herman Johnson Daniel Weber Althea Jannon Bridget Booster-Braxton Michael Lee John Forsythe Roberta Jackson Millie Samuels Lucas Dolner Edgar Boris Bridges Rich Weinhart Lisa Bressley


Run the line of code above to save the text string `personal_data`. We want to know where the numbers stop and the names begin. Write code to find the position of the first character that is not a digit. (This will be the space that separates the numbers from the names.)

In [84]:
re.search(r"\D", personal_data).start()

121

#### <br><br>The `*` and `+` metacharacters

These metacharacters let you search for repeated characters or phrases in a string. `*` matches zero or more repeats of the character before it, while `+` matches one or more.

Let's look for a pattern in this gene sequence:

In [103]:
gene = "ATGATATGCTCCAGATATTTTTTGAGCCCGATTAGCCATCATATTTGAAGCTAGCGGTGTCATAGCGATTCCAAAC"

We'll look for the pattern ATA plus any repeating Ts that follow. When we use the `*` it will also return ATA without any Ts following, but using `+` means there has to be at least one T following:

In [133]:
re.findall("ATAT*", gene)

['ATAT', 'ATATTTTTT', 'ATATTT', 'ATA']

In [130]:
re.findall("ATAT+", gene)

['ATAT', 'ATATTTTTT', 'ATATTT']
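
`*` and `+` can also follow a special sequence such as `\d`. As a rough sketch, `\d+` pulls out each run of consecutive digits, while `\d*` also reports an empty match wherever there are zero digits, which is rarely what you want. The variable names are only for illustration.

```python
import re

full = "Morgan Taylor 555-555-1890"

# One or more digits in a row: each run of digits becomes one match
digit_runs = re.findall(r"\d+", full)

# Zero or more digits: also "matches" the empty string at non-digit positions
digit_star = re.findall(r"\d*", full)

print(digit_runs)
print("" in digit_star)
```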

#### <br><br>How to search for metacharacters

Sometimes you actually want to search for `*` or `[` or another metacharacter in the string. 

You can search for these by preceding the character with a backslash `\`.

In [134]:
s = "95* *Does not include potatoes."

In [136]:
re.findall(r"\*", s)

['*', '*']
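
If you would rather not add the backslashes by hand, the `re` module also provides `re.escape()`, which returns a copy of a string with its metacharacters escaped. A minimal sketch, with `pattern` and `stars` as illustrative names:

```python
import re

s = "95* *Does not include potatoes."

# re.escape() adds the backslash for us
pattern = re.escape("*")
stars = re.findall(pattern, s)

print(pattern)
print(stars)
```

This is most useful when the text you are searching for comes from a variable or user input and might contain metacharacters you did not anticipate.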

### <br><br>Exercise 6

In [137]:
my_string = "There are twenty five [25] mice in the maze."

Write code that will find the string "[25]" including the brackets.

In [138]:
re.search(r"\[25\]", my_string)

<re.Match object; span=(22, 26), match='[25]'>

### <br><br>Bonus Exercise

Here's a chance to practice parsing a difficult file. I've included two practice files of the same type. These are control files that get imported to the software program PAML (Phylogenetic Analysis for Maximum Likelihood). You don't need to know anything about the software program to do this exercise.

#### IF YOU ARE USING GOOGLE COLAB ONLY - you need to run the next line of code to upload the files. DO NOT RUN THIS LINE IF YOU ARE NOT USING GOOGLE COLAB:

In [None]:
!wget https://raw.githubusercontent.com/aGitHasNoName/patterns/master/practice.txt
!wget https://raw.githubusercontent.com/aGitHasNoName/patterns/master/practice2.txt

<br><br>To get you started, I have written code to load each of the files as a list of lines. You can run the next two cells and look at the files. You will see a line that includes "runmode = " followed by a number. After the number, there is a key to tell you what the number means. This key is spread out across two lines. These are the lines that you will be working with. Notice how they are different between the two files.

In [156]:
with open("practice.txt", "r") as f:
    file1_list = f.readlines()
for line in file1_list:
    print(line)

      seqfile = uniquegenename.pr.phy * sequence data filename

     treefile = speciesTree.tre    * tree structure file name



      outfile = uniquegenename.paml           * main result file name

        noisy = 9  * 0,1,2,3,9: how much rubbish on the screen

      verbose = 0  * 0: concise; 1: detailed, 2: too much

      runmode = 0  * 0: user tree;  1: semi-automatic;  2: automatic

                   * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise



      seqtype = 2  * 1:codons; 2:AAs; 3:codons-->AAs

   aaRatefile = /data/paml4.8/dat/wag.dat * only used for aa seqs with model=empirical(_F)

                   * dayhoff.dat, jones.dat, wag.dat, mtmam.dat, or your own



        model = 3  * 0:poisson, 1:proportional, 2:Empirical, 3:Empirical+F

                   * 6:FromCodon, 7:AAClasses, 8:REVaa_0, 9:REVaa(nr=189)

        Mgene = 1  * aaml: 0:rates, 1:separate; 



    fix_alpha = 1  * 0: estimate gamma shape parameter; 1: fix it at alpha

        alpha = 0. * 

In [157]:
with open("practice2.txt", "r") as f:
    file2_list = f.readlines()
for line in file2_list:
    print(line)

      seqfile = uniquegenename.pr.phy * sequence data filename

     treefile = speciesTree.tre    * tree structure file name



      outfile = uniquegenename.paml           * main result file name

        noisy = 9  * 0,1,2,3,9: how much rubbish on the screen

      verbose = 0  * 0: concise; 1: detailed, 2: too much

      runmode = 4  * 0: user tree;  1: semi-automatic;  2: automatic

                   * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise



      seqtype = 2  * 1:codons; 2:AAs; 3:codons-->AAs

   aaRatefile = /data/paml4.8/dat/wag.dat * only used for aa seqs with model=empirical(_F)

                   * dayhoff.dat, jones.dat, wag.dat, mtmam.dat, or your own



        model = 3  * 0:poisson, 1:proportional, 2:Empirical, 3:Empirical+F

                   * 6:FromCodon, 7:AAClasses, 8:REVaa_0, 9:REVaa(nr=189)

        Mgene = 1  * aaml: 0:rates, 1:separate; 



    fix_alpha = 1  * 0: estimate gamma shape parameter; 1: fix it at alpha

        alpha = 0. * 

<br><br>Since you will only need the runmode lines, I've written code to extract only those lines from the two files:

In [160]:
for line in file1_list:
    if "runmode" in line:
        file1_line1 = line
        place = file1_list.index(line)
        file1_line2 = file1_list[place + 1]
for line in file2_list:
    if "runmode" in line:
        file2_line1 = line
        place = file2_list.index(line)
        file2_line2 = file2_list[place + 1]

In [161]:
print("File 1:")
print(file1_line1)
print(file1_line2)
print("File 2:")
print(file2_line1)
print(file2_line2)

File 1:
      runmode = 0  * 0: user tree;  1: semi-automatic;  2: automatic

                   * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise

File 2:
      runmode = 4  * 0: user tree;  1: semi-automatic;  2: automatic

                   * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise



<br><br>Your job is to write code to parse the lines from both of these files. You will need to first return the runmode that is being used ("0" for file 1 and "4" for file 2) and then return the appropriate description of that runmode ("user tree" for file 1 and "PerturbationNNI" for file 2). In a real situation, you might be writing code to parse hundreds or thousands of these files, but let's just practice on two.

Your output should look like:
<br>File 1 - runmode 0: user tree
<br>File 2 - runmode 4: PerturbationNNI

You can use any Python skills you know, including regular expressions.

Check the answer key (patterns-Answers.ipynb) for one possible solution.

In [180]:
#I turned the lines into a dictionary, with the file name as the key and a list of lines as the value
files = {"File 1": [file1_line1, file1_line2], "File 2": [file2_line1, file2_line2]}

In [187]:
#I loop through the dictionary
for k, v in files.items():
    #I split the first line on the = and take the second character of the second half
    runmode = v[0].split("=")[1][1]
    #The description could be in either line, so first I loop through the lines
    for line in v:
        #I split each line on the star and see if the runmode is in the second half of the line
        #I had to split it because of course the runmode is in the first half of the first line - that's where I got it
        if runmode in line.split("*")[1]:
            answer_line = line.split("*")[1]
    #Now I know what line it is in
    #I find the start position of the number in the line
    runmode_position = re.search(runmode, answer_line).start()
    #I shorten the line to start with the runmode number
    short_line = answer_line[runmode_position:]
    #I find the positions of the : which comes before the description and the ; which comes after
    start = re.search(":", short_line).start()
    end = re.search(";", short_line).start()
    #I shorten the line even more to only include the description, plus I remove any leading whitespaces
    answer = short_line[start + 1: end].lstrip(" ")
    print(f"{k} - runmode {runmode}: {answer}")

File 1 - runmode 0: user tree
File 2 - runmode 4: PerturbationNNI
