# Introduction to Python for Biology
# Bonus Materials

# Code Along

## Regular Expressions

### Basic Patterns

Although we looked at some string manipulation tools in Python that help us look for fixed patterns, regular expressions offer us a lot more flexibility. 

Remember how we used special characters when we learned about string manipulation (e.g. `\n`, `\t`)? We can put an `r` (for raw string literal) in front of a string (outside the quotes) so that Python does not treat a backslash as the start of a special character. We'll need this so that the backslash sequences that regular expressions use don't clash with those that have special meaning in ordinary strings.

In a raw string, a backslash is just a backslash: the characters are stored literally as they appear.

In [1]:
print(r"\t\n")

\t\n
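
To make the difference concrete, here is a minimal sketch comparing an ordinary string with a raw string written using the same characters. The variable names `normal` and `raw` are only for illustration.

```python
# Compare an ordinary string with a raw string written with the same characters
normal = "\t\n"   # two characters: a tab and a newline
raw = r"\t\n"     # four characters: backslash, t, backslash, n

print(len(normal))       # 2
print(len(raw))          # 4
print(raw == "\\t\\n")   # True: the raw string equals the doubled-backslash version
```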


The module `re` allows us to use regular expressions in Python.

In [2]:
import re

We can use `re.match()` to check whether a pattern matches a string. In an `if` statement, the result acts like True or False, which tells us whether the pattern was found. (Really it returns a match object or `None`, but we'll get more into that later.)

This function takes two arguments: the pattern you want to search for (as a string) and the string you want to search in. 

We can use string characters to find an exact match. 

In [3]:
# note: we don't strictly need the raw notation for this pattern string (but it is good practice)
sequence = "python"
pattern = r"python"
if re.match(pattern, sequence):
    print("Match!")
else: 
    print("Not a match!")

Match!


`re.match()` only matches at the beginning of the string, so a pattern that first occurs later in the string is not found.

In [4]:
sequence = "python"
pattern = r"y"
if re.match(pattern, sequence):
    print("Match!")
else: 
    print("Not a match!")

Not a match!
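
As a quick side-by-side sketch, the same pattern can be passed to both `re.match()` and `re.search()`. The variable names `at_start` and `anywhere` are illustrative only.

```python
import re

sequence = "python"
pattern = r"y"

# re.match only looks at the start of the string, so this is None
at_start = re.match(pattern, sequence)

# re.search scans the whole string, so it finds the "y" at index 1
anywhere = re.search(pattern, sequence)

print(at_start)           # None
print(anywhere.group())   # y
print(anywhere.start())   # 1
```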


### Wild Card 

Regular expressions use a set of special characters that don't literally match themselves but can be used as flexible search wildcards.

Here is a good cheatsheet: https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf

For example, `.` can match any single character except a newline character.

This time let's try out `re.search()`, which scans the string for the first instance of the pattern. If the pattern is found, it returns a **match object**. If not, it returns `None`. 

If we also tack on `.group()`, it will return the string that matched. (If there is no match, calling `.group()` on `None` raises an `AttributeError`.)

In [5]:
re.search(r"p.thon", "python").group()

'python'

`\w` will match any single digit, letter, or underscore. If we use the uppercase (`\W`) it will search for the inverse (any character that is not a digit, letter, or underscore). 

In [6]:
re.search(r"p\wthon", "python").group()

'python'

In [7]:
re.search(r"p\Wthon", "p&thon").group()

'p&thon'

`\s` will match any whitespace character (like a space, tab, newline, or carriage return). 

`\S` matches the inverse of this (any character that is not a whitespace character).

`\t` matches tabs.

`\n` matches newlines.

`\r` matches carriage returns.

In [8]:
re.search(r"transmitting\sscience", "transmitting science").group()

'transmitting science'

`\d` will match any decimal digit (0-9).

In [9]:
re.search(r"\d\d\d-\d\d\d-\d\d\d\d", "phone: 893-483-3847").group()

'893-483-3847'

`^` (caret) will match a pattern at the start of the string.

`$` will match a pattern at the end of the string.

In [10]:
re.search(r"^To", "To be or not to be").group()

'To'

In [11]:
re.search(r"be$", "To be or not to be").group()

'be'
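
For a biological flavor, the anchors can check how a sequence begins and ends. This is a small sketch; the sequence `seq` is a made-up example, not real data.

```python
import re

seq = "ATGGCCTAA"  # hypothetical coding sequence used only for illustration

# ^ anchors the pattern to the start, $ anchors it to the end
starts_with_atg = re.search(r"^ATG", seq) is not None
ends_with_taa = re.search(r"TAA$", seq) is not None

# "GCC" is in the sequence, but not at the start, so this is None
gcc_at_start = re.search(r"^GCC", seq)

print(starts_with_atg)  # True
print(ends_with_taa)    # True
print(gcc_at_start)     # None
```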

If we put a set of characters in square brackets, the pattern matches any single one of those characters. A range such as `0-6` stands for every character from `0` through `6`. This acts like "or" did before.

In [12]:
re.search(r'Number: [0-6]', 'Number: 5').group()

'Number: 5'
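
Character sets are handy for sequence data. The sketch below uses a made-up DNA string (`dna`) and also shows a negated set, where `^` as the first character inside the brackets means "any character not listed".

```python
import re

dna = "ATGCNNATRG"  # hypothetical sequence; N and R stand in for ambiguous bases

# [ACGT] matches any one of the four standard bases
first_base = re.search(r"[ACGT]", dna)

# [^ACGT] matches any one character that is NOT a standard base
first_ambiguous = re.search(r"[^ACGT]", dna)

print(first_base.group())                                # A
print(first_ambiguous.group(), first_ambiguous.start())  # N 4
```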

`\b` will match only at the beginning or end of a word.

In [13]:
re.search(r'\b[AEIOU]', 'Alphabet').group()

'A'
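
Word boundaries matter most when a short word also appears inside longer words. Here is a minimal sketch with a made-up sentence, using `re.findall()` (which returns every match as a list) to count the matches with and without `\b`.

```python
import re

text = "the other theme, the end"  # illustrative sentence

# Without boundaries, "the" is also found inside "other" and "theme"
loose = re.findall(r"the", text)

# With \b on both sides, only the standalone word "the" is matched
strict = re.findall(r"\bthe\b", text)

print(len(loose))   # 4
print(len(strict))  # 2
```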

What if we wanted to search literally for a character that has a special meaning? We can use `\` (backslash) to escape the character so that it is treated like any other non-special character.

In [14]:
# Here '\s' is the whitespace special sequence, so it matches the space character
re.search(r'\s', ' ').group()

' '

In [15]:
# The doubled backslash matches a literal '\', so this finds the two characters '\s' rather than whitespace
re.search(r'\\s', r'\s').group()

'\\s'
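
The same idea applies to the `.` wildcard. The sketch below uses a made-up string; it also calls `re.escape()`, a standard-library helper not otherwise covered in this lesson, which adds the backslashes for us.

```python
import re

text = "3x14 and 3.14"  # illustrative string

# Unescaped, "." is a wildcard, so the first match is "3x14"
wildcard = re.search(r"3.14", text).group()

# Escaped, "\." only matches a literal period
literal = re.search(r"3\.14", text).group()

# re.escape builds the escaped pattern from a plain string
escaped_pattern = re.escape("3.14")

print(wildcard)         # 3x14
print(literal)          # 3.14
print(escaped_pattern)  # 3\.14
```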

### Repetitions

If a character or special sequence may occur several times in a row, we don't have to repeat it in the pattern. Instead, we can follow it with a quantifier that says how many times it may repeat.

`+` looks for one or more of the preceding character 

`*` looks for zero or more of the preceding character

(Both of these quantifiers are known as **greedy** because they will match as much of the search string as possible.)

In [16]:
re.search(r"\d\d\d-\d\d\d-\d\d\d\d", "phone: 893-483-3847").group()

'893-483-3847'

In [17]:
re.search(r"\d+-\d+-\d+", "phone: 893-483-3847").group()

'893-483-3847'

In [18]:
re.search(r"bo*", "bo boooo boooo booooooooo b").group()

'bo'

In [19]:
re.search(r"bo*", "booooooooo").group()

'booooooooo'

In [20]:
re.search(r"bo*", "b").group()

'b'

But sometimes a greedy match is not what we want and we only want to match as little of the text as possible. 

`*?` matches as little of the text as possible (a non-greedy, or lazy, version of `*`)

`?` is a greedy quantifier that checks for zero or one of the preceding character.

In [21]:
re.search(r"bo*?", "booooooooo").group()

'b'
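
The difference between greedy and non-greedy matching is easiest to see when the pattern has a closing delimiter. This is a small sketch with made-up strings; the variable names are illustrative.

```python
import re

tag_line = "<b>bold</b>"  # illustrative string

# Greedy: .* grabs as much as possible, up to the LAST ">"
greedy = re.search(r"<.*>", tag_line).group()

# Non-greedy: .*? stops at the FIRST ">"
lazy = re.search(r"<.*?>", tag_line).group()

# "?" on its own makes the preceding character optional
optional = re.findall(r"colou?r", "color colour")

print(greedy)    # <b>bold</b>
print(lazy)      # <b>
print(optional)  # ['color', 'colour']
```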

Or in cases where we want to check for an exact number of a pattern's repetition, we can indicate that too.

`{x}` checks that it repeats exactly x times

`{x,}` checks that it repeats x or more times

`{x,y}` checks that it repeats at least x times but no more than y times (written without a space after the comma; with a space, Python treats the braces as literal text)

In [22]:
re.search(r"\d{3}-\d{3,}-\d{3,4}", "phone: 893-483-3847").group()

'893-483-3847'
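
Counted repetitions are useful for finding runs of a base. The sketch below uses a made-up sequence (`seq`) containing a run of seven `A`s followed later by a run of two.

```python
import re

seq = "GCAAAAAAATTAAG"  # hypothetical sequence: 7 A's, then later 2 A's

exactly_three = re.search(r"A{3}", seq).group()    # stops after 3
three_or_more = re.search(r"A{3,}", seq).group()   # greedy: takes all 7
three_to_five = re.search(r"A{3,5}", seq).group()  # greedy: takes at most 5

# Only the long run qualifies; the run of two A's is too short
long_runs = re.findall(r"A{3,}", seq)

# With a space after the comma, the braces are treated as literal text
spaced = re.search(r"A{3, 5}", seq)

print(exactly_three)  # AAA
print(three_or_more)  # AAAAAAA
print(three_to_five)  # AAAAA
print(long_runs)      # ['AAAAAAA']
print(spaced)         # None
```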

### Splitting Strings with RegEx

In an earlier lesson, we used the `.split()` string method to split strings on delimiters. We can combine this idea with the power of regular expressions to customize what we split on. The `.split()` string method doesn't accept regular expressions, but the `re` module has a `re.split()` function we can use for this. We pass the pattern as the first argument and the string to be split as the second argument. Notice in the output that the comma and space between the dates are not part of our pattern, so they stay attached to the pieces.

In [23]:
d = "2.19.1876, 3-23-1856, 2/23/1865"
dates = re.split(r"[\./-]", d)
print(dates)

['2', '19', '1876, 3', '23', '1856, 2', '23', '1865']
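
If we also want to split on the comma and space between dates, one option is to add them to the character set and use `+` so that a run of delimiters counts as a single split point. This is a sketch of one possible approach; grouping the pieces in threes assumes every date has exactly three parts.

```python
import re

d = "2.19.1876, 3-23-1856, 2/23/1865"

# Split on any run of commas, whitespace, periods, slashes, or hyphens
parts = re.split(r"[,\s./-]+", d)

# Regroup the flat list into [month, day, year] triples
dates = [parts[i:i + 3] for i in range(0, len(parts), 3)]

print(parts)  # ['2', '19', '1876', '3', '23', '1856', '2', '23', '1865']
print(dates)  # [['2', '19', '1876'], ['3', '23', '1856'], ['2', '23', '1865']]
```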


### Grouping for Multiple Groups

We can also pull out parts of the matched text using groups, by calling `.group()` with a number in the parentheses. So far we've been calling it without a number, which returns the whole matched text. 

To capture different parts of the text, we surround the corresponding parts of the pattern with parentheses. Then `group(1)` returns the text matched by the first set of parentheses, `group(2)` returns the second, etc. 

In [24]:
email = "You can reach me at nichole.lynn.bennett@gmail.com"
match = re.search(r'([\w\.-]+)@([\w\.-]+)', email)
if match:
    print(match.group())   # The whole matched text
    print(match.group(1))  # The username (group 1)
    print(match.group(2))  # The host (group 2)

nichole.lynn.bennett@gmail.com
nichole.lynn.bennett
gmail.com
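
When a pattern has several groups, the match object's `.groups()` method returns all of them at once as a tuple. Here is a minimal sketch using a made-up sentence around one of the dates from the splitting example.

```python
import re

match = re.search(r"(\d+)/(\d+)/(\d+)", "born 2/23/1865")  # illustrative text

# .groups() returns every captured group as a tuple, in order
month, day, year = match.groups()

print(match.group())     # 2/23/1865
print(month, day, year)  # 2 23 1865
print(match.group(3))    # 1865
```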


### Getting Match Positions

We can also pull out information about the position of the match using the `.start()` and `.end()` methods. Don't forget that Python starts counting at zero. 

In [25]:
name = "aaa9adkien3alsdk389" 
m = re.search(r"[\d]", name) 

if m: 
    print("number!") 
    print("at position " + str(m.start()))

number!
at position 3
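
The `.end()` position is one past the last matched character, so `.start()` and `.end()` can be used directly as slice indices. The sketch below looks for the EcoRI recognition site `GAATTC` in a made-up sequence and also uses `.span()`, which returns both positions as a tuple.

```python
import re

seq = "ATGCGAATTCAG"  # hypothetical sequence used only for illustration
m = re.search(r"GAATTC", seq)

if m:
    print(m.start())               # 4
    print(m.end())                 # 10 (one past the last matched base)
    print(m.span())                # (4, 10)
    print(seq[m.start():m.end()])  # GAATTC
```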


### Getting Multiple Matches

In the example above, we could only find the first match because `re.search()` returns at most one match. If we want to process multiple matches, we can use `re.finditer()`, which gives us an iterator of match objects that we can process using our knowledge of looping. 

In [26]:
name = "aaa9adkien3alsdk389" 
matches = re.finditer(r"[\d]", name) 
for m in matches: 
    base = m.group() 
    pos  = m.start() 
    print(base + " found at position " + str(pos))

9 found at position 3
3 found at position 10
3 found at position 16
8 found at position 17
9 found at position 18


We can also get a list of all of the parts of the searched string that match a given pattern using the `re.findall()` function. It returns a list of strings (instead of match objects). 

In [27]:
name = "aaa9adkien3alsdk389" 
result = re.findall(r"\d", name) 
print(result)

['9', '3', '3', '8', '9']
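
Combining these functions with the `+` quantifier lets us pull out whole numbers rather than single digits. A brief sketch on the same string:

```python
import re

name = "aaa9adkien3alsdk389"

# \d+ matches each run of digits as one piece
numbers = re.findall(r"\d+", name)

# finditer gives match objects, so we can keep the positions too
positions = [(m.group(), m.start()) for m in re.finditer(r"\d+", name)]

print(numbers)    # ['9', '3', '389']
print(positions)  # [('9', 3), ('3', 10), ('389', 16)]
```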


### `re` Module Functions

`re.search(pattern, string, flags=0)` scans the whole string for the first location where the pattern matches. It returns a match object if found, `None` if not.

`re.match(pattern, string, flags=0)` returns a match object if the pattern matches at the beginning of the string, else it returns `None`.

`re.findall(pattern, string, flags=0)` finds all non-overlapping matches in the string and returns them as a list of strings (or a list of tuples if the pattern has more than one capture group).

`re.sub(pattern, repl, string, count=0, flags=0)` returns a new string where every non-overlapping match of the pattern, working from the left, is replaced by the string in the `repl` argument. If `count` is set to a positive number, at most that many matches are replaced. If the pattern is not found, it returns the original string. 

`re.compile(pattern, flags=0)` will compile a regular expression object. This is useful when you need to use the same regular expression several times in a single program. 

You can modify the behavior of any of these functions using the `flags` argument. Some of the available flags are `re.IGNORECASE`, `re.DOTALL`, `re.MULTILINE`, and `re.VERBOSE`.

Some next steps to look into on your journey with regular expressions (that we won't have to cover today) are lookahead and lookbehind assertions. 
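
To round out the function summary, here is a small sketch of `re.sub()`, `re.compile()`, and the `flags` argument working together. The strings are made up for illustration.

```python
import re

# re.sub replaces every match unless count limits it
all_replaced = re.sub(r"\d", "#", "a1b2c3")
first_replaced = re.sub(r"\d", "#", "a1b2c3", count=1)

# A compiled pattern can be reused; IGNORECASE matches any mix of cases
start_codon = re.compile(r"atg", flags=re.IGNORECASE)
codons = start_codon.findall("ATGccatgAtG")

print(all_replaced)    # a#b#c#
print(first_replaced)  # a#b2c3
print(codons)          # ['ATG', 'atg', 'AtG']
```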

# Independent Work

The regex tester may come in handy for working through these exercises: https://regex101.com/

### Shakespeare's Complete Works
This file comes from Project Gutenberg. Feel free to browse the site and choose a different book to use (link to the URL for its text file). I have Shakespeare's complete works as a default here for you to use. 

In [36]:
import re
import requests
shakespeare_url = 'https://www.gutenberg.org/files/100/100-0.txt'
book = requests.get(shakespeare_url).text

Write a program that answers the following questions:

* How many times does the word "the" appear in the corpus? 

* How many times is there a quote ("") in the corpus? 

* How many times does the character "d" followed by "e" appear? 

* How many times does the character "d" followed by "e" appear separated by a single letter? 

* How many times do the characters "d" and "e" appear next to each other in either order? 

* How many times does a version of the word "love" (e.g. "lovely", "lover", "love", "loving") show up?

* How many words in the corpus end with "e"?

* How many times are there three or more digits in a row in the text?

In [37]:
def preprocess(sentence): 
    return re.sub('[^A-Za-z0-9.]+' , ' ', sentence).lower()
processed_book = preprocess(book)
# Show only a short preview: printing the whole book exceeds the notebook's output rate limit
print(processed_book[:500])


In [38]:
# How many times does the word "the" appear in the corpus?
len(re.findall(r'\bthe\b', processed_book))

30400

In [39]:
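# How many times is there a quote ("") in the corpus?
# Note: preprocess() replaced every quotation mark with a space, so searching
# processed_book finds 0; search the original `book` to count the quotes
len(re.findall(r'\”', processed_book))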

0
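
The zero above is a side effect of preprocessing rather than a fact about the text. The sketch below repeats the `preprocess` function from this notebook on a short made-up sentence (`sample`) to show the quotation marks disappearing.

```python
import re

def preprocess(sentence):
    # Same cleaning step as in the notebook: keep letters, digits, and periods
    return re.sub(r'[^A-Za-z0-9.]+', ' ', sentence).lower()

sample = 'He said, “To be.”'  # illustrative sentence with curly quotes

raw_quotes = len(re.findall(r'[“”]', sample))
processed_quotes = len(re.findall(r'[“”]', preprocess(sample)))

print(raw_quotes)        # 2
print(processed_quotes)  # 0
```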

In [40]:
# How many times does the character "d" followed by "e" appear?
len(re.findall(r'de', processed_book))

17927

In [41]:
# How many times does the character "d" followed by "e" appear separated by a single letter?
len(re.findall(r"d[a-z]e", processed_book))

6689

In [42]:
# How many times do the characters "d" and "e" appear next to each other in either order?
d_by_e = re.findall(r"de", processed_book)
e_by_d = re.findall(r"ed", processed_book)
print(len(d_by_e) + len(e_by_d)) 

37531


In [43]:
# How many times does a version of the word "love" (e.g. "lovely", "lover", "love", "loving") show up?
len(re.findall(r'\blov\w+', processed_book))

3218

In [44]:
# How many words in the corpus end with "e"?
len(re.findall(r'e\b', processed_book))

182884

In [45]:
# How many times are there three or more digits in a row in the text?
# Note: r'\d{3}' counts a run of six or more digits more than once; r'\d{3,}' counts each run once
len(re.findall(r'\d{3}', processed_book))

325
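
The choice between `\d{3}` and `\d{3,}` changes what is being counted. Here is a small sketch on a made-up string (`text`) showing the two behaviors side by side.

```python
import re

text = "sonnet 154, year 1609, id 12345678"  # illustrative string

# \d{3} takes digits three at a time, so long runs yield several matches
chunks = re.findall(r"\d{3}", text)

# \d{3,} takes each run of three or more digits as a single match
runs = re.findall(r"\d{3,}", text)

print(chunks)  # ['154', '160', '123', '456']
print(runs)    # ['154', '1609', '12345678']
```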

### Names
Create a program that will reformat a name written in the format "FirstName LastName" to "LastName, FirstName". 

**Bonus** Make sure that it can handle "Jr.", "Sr." suffixes  or "Dr." prefixes correctly or other cases you can think of (e.g. middle names, hyphenated last names, etc.). Note that it is probably impossible to get regular expressions to identify *any* name, so it's okay to be limited in your approach.  

In [46]:
names = ['Mae Jemison', 'Rosalind Franklin', 'Wang Zhenyi', 'Jane Goodall', 'Katherine Johnson']

lastname_first = []
for name in names:
    match = re.match(r'^(\w+)\s(\w+)$',name)
    if match:
        lastname_first.append(match.group(2) + ", " + match.group(1))
    else:
        print("no match found")
print(lastname_first)

['Jemison, Mae', 'Franklin, Rosalind', 'Zhenyi, Wang', 'Goodall, Jane', 'Johnson, Katherine']


In [47]:
names = ['Mae Jemison', 'Dr. Rosalind Franklin', 'Wang Zhenyi, Jr.', 'Jane Goodall', 'Dr. Katherine Johnson']

lastname_first = []
for name in names:
    match = re.match(r'^(.+)?([A-Z]\w+)\s(\w+)(.+)?$',name)
    if match:
        lastname_first.append(match.group(3) + ", " + match.group(2))
    else:
        print("no match found")
print(lastname_first)

['Jemison, Mae', 'Franklin, Rosalind', 'Zhenyi, Wang', 'Goodall, Jane', 'Johnson, Katherine']
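
For the bonus, one possible approach (a sketch, not the only solution) is to spell out the optional prefix and suffix explicitly so the suffix can be kept. It uses two features not covered above: non-capturing groups `(?:...)` and named groups `(?P<name>...)`. The pattern, the helper `reformat_name`, and the name "Mary Smith-Jones" are all made up for illustration, and names with middle initials are simply rejected.

```python
import re

# Hypothetical pattern: optional "Dr." prefix, first name, last name
# (hyphens allowed), and an optional "Jr." or "Sr." suffix
NAME_PATTERN = re.compile(
    r"^(?:Dr\.\s+)?(?P<first>\w+)\s+(?P<last>[\w-]+)(?:,?\s+(?P<suffix>Jr\.|Sr\.))?$"
)

def reformat_name(name):
    """Return 'Last, First' (plus any suffix), or None if the name does not fit."""
    m = NAME_PATTERN.match(name)
    if m is None:
        return None
    result = m.group("last") + ", " + m.group("first")
    if m.group("suffix"):
        result += " " + m.group("suffix")
    return result

names = ["Mae Jemison", "Dr. Rosalind Franklin", "Wang Zhenyi, Jr.", "Mary Smith-Jones"]
reformatted = [reformat_name(n) for n in names]
print(reformatted)
# ['Jemison, Mae', 'Franklin, Rosalind', 'Zhenyi, Wang Jr.', 'Smith-Jones, Mary']
```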


### Bonus
Want more to do? Check this out: https://regexcrossword.com/

And this: https://alf.nu/RegexGolf