# Python Regular Expressions

### Raw string notation and escaping metacharacters
* Raw string notation: `r"string"`
    * `"This is a string."`
    * `r"This is a raw string.\n"` -> treats `\` as a literal character, not as the start of an escape sequence.
* Escaping metacharacters
    * Use another `\` -> `\\n`
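
The difference is easy to check by comparing string lengths. A minimal sketch using only built-ins; the variable names are illustrative:

```python
# In a normal string, "\n" is a single character (a newline).
normal = "a\nb"
# In a raw string, r"\n" is two characters: a backslash and the letter n.
raw = r"a\nb"
# Doubling the backslash in a normal string gives the same text as the raw string.
escaped = "a\\nb"

print(len(normal))      # 3
print(len(raw))         # 4
print(raw == escaped)   # True
```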

### Python built-in module for regular expressions: ***re***

In [3]:
import re

In [13]:
my_string = "a given string"
m = re.search("given", my_string)

m is a match object

In [14]:
print(m)

<re.Match object; span=(2, 7), match='given'>


`.group()` -> returns the substring that matched the pattern

In [15]:
m.group()

'given'

### Building regular expressions

#### Metacharacters

Most metacharacters are written with a leading backslash `\`; the dot `.` is a metacharacter on its own.
* `\n` -> match a newline
* `\t` -> match a tab
* `\s` -> match white space
* `\w` -> match a "word character"
* `\d` -> match a digit
* `.`  -> match any character

Match a white space

In [20]:
# match white space \s
m = re.search(r"\s", my_string)
print(my_string)
print(m)
m.group()

a given string
<re.Match object; span=(1, 2), match=' '>


' '

Match a white space followed by five "word" characters

In [21]:
m = re.search(r"\s\w\w\w\w\w", my_string)
print(m)
m.group()

<re.Match object; span=(1, 7), match=' given'>


' given'

Note that re.search returns only the first instance.

#### Sets
Sets --> `[]`, provide several options or a range of characters to match

In [23]:
my_string = "sunflowers are described on page 89"
# match lowercase or uppercase "s" followed by two word characters
m = re.search(r"[sS]\w\w", my_string)
print(m)
m.group()

<re.Match object; span=(0, 3), match='sun'>


'sun'

In [25]:
# search for a number with two digits
m = re.search(r"[0-9][0-9]", my_string)
print(m)
m.group()

<re.Match object; span=(33, 35), match='89'>


'89'

A caret `^` as the first symbol within a set (`[^...]`) --> match any character not listed in the set

In [26]:
# Match a character not in s-z followed by 6 word characters
m = re.search(r"[^s-z]\w\w\w\w\w\w", my_string)
m.group()

'nflower'

#### Quantifiers

Quantifiers: determine how many characters of a certain type to expect.
* `?` -> match zero or one time
* `*` -> match zero or more times
* `+` -> match one or more times
* `{n}` -> match exactly n times.
* `{n,}` -> match at least n times.
* `{n,m}` -> match at least n but not more than m times (write it without a space after the comma, otherwise Python treats the braces as literal text).
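
The quantifiers not demonstrated in the cells below can be tried on short strings. This is a small sketch with made-up example strings; it uses `re.findall`, which returns every match as a list:

```python
import re

# "colou?r": the "u" is optional (zero or one time)
optional = re.findall(r"colou?r", "color colour")
print(optional)        # ['color', 'colour']

# "\d{2,3}": at least 2 but not more than 3 digits; quantifiers take as much as they can
ranged = re.findall(r"\d{2,3}", "7 42 123 12345")
print(ranged)          # ['42', '123', '123', '45']

# "AT*": an A followed by zero or more T's
zero_or_more = re.findall(r"AT*", "ATTTGT AT")
print(zero_or_more)    # ['ATTT', 'AT']
```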

Match one or more letters A, C, G or T

In [39]:
motif = "ATTTGT AT"

In [41]:
# match one or more of the letters A, C, G or T
n = re.search(r"[ATCG]+", motif)
print(n)

<re.Match object; span=(0, 6), match='ATTTGT'>


Match any sequence of 3 or more letters, each of which must be A, C, G or T

In [42]:
n = re.search(r"[ACGT]{3,}", motif)
print(n)

<re.Match object; span=(0, 6), match='ATTTGT'>


Combining `*` and `?`: a `?` placed directly after another quantifier makes it non-greedy.

Example:
* `.*\s` -> match as many characters as possible (greedy), followed by a white space
* `.?\s` -> match zero or one character, followed by a white space
* `.*?\s` -> match as few characters as possible (non-greedy), followed by a white space

In [43]:
my_string = "once upon a time"
# match any number of characters
# followed by a white space
m = re.search(r".*\s", my_string)
m.group()

'once upon a '

In [44]:
m = re.search(r".?\s", my_string)
m.group()

'e '

In [45]:
m = re.search(r".*?\s", my_string)
m.group()

'once '

In [52]:
m = re.search(r"\s\w\s", my_string)
m

<re.Match object; span=(9, 12), match=' a '>

#### Anchors

* `^` -> match the beginning of a string     
* `$` -> match the end of a string

In [None]:
my_string = "ATATA"
# searching at the beginning of the string
m = re.search(r"^ATA", my_string)

In [None]:
# searching at the end of the string
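
One way to see what `$` does is to compare an anchored and an unanchored search on the same string. A minimal sketch; the pattern `TA` is chosen only for illustration:

```python
import re

my_string = "ATATA"

# "TA$" matches only the "TA" that ends the string, not the first "TA"
end = re.search(r"TA$", my_string)
print(end.span())      # (3, 5)

# without the anchor, the first "TA" is returned
first = re.search(r"TA", my_string)
print(first.span())    # (1, 3)

# "^TA" finds nothing: the string does not begin with "TA"
no_match = re.search(r"^TA", my_string)
print(no_match)        # None
```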


##### Intermezzo 5.1
Describe in words what each of the following regular expressions matches, then check your answer by running the code.

In [53]:
# a. 
re.search(r"\d", "it takes 2 to tango").group()

'2'

In [54]:
# b
re.search(r"\w*\s\d.*\d", "take 2 grames of H2O").group()

'take 2 grames of H2'

In [55]:
# c. 
re.search(r"\s\w*\s", "once upon at time").group()

' upon '

In [56]:
# Note:
re.search(r"\s\w.*\s", "once upon at time").group()

' upon at '

In [57]:
# d. 
re.search(r"\s\w{1,3}\s", "once upon at time").group()

' at '

In [58]:
# e. 
re.search(r"\s\w*$", "once upon at time").group()

' time'

#### Alternations
* use pipe symbol `|` to provide alternations.  

In [None]:
my_string = "I found my cat!"
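
A possible way to try alternation on this string is sketched below. The alternatives `cat`, `mouse` and `dog` are made up for illustration:

```python
import re

my_string = "I found my cat!"

# "cat|mouse" matches either "cat" or "mouse"
m = re.search(r"cat|mouse", my_string)
print(m.group())    # cat

# parentheses limit the alternation to one part of the pattern
m2 = re.search(r"my (cat|dog)!", "I found my dog!")
print(m2.group())   # my dog!
```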


#### Regular expression with raw string

What if we want to search for a literal `?`, `*` or `+` in the data? -> Put a `\` in front of it, e.g. `\*`

Example: count the lines that contain the `*` sign in the file Marra2014_BLAST_data.txt

In [73]:
# open the BLAST results file (plain text) and read it line by line
with open("../../regex/data/Marra2014_BLAST_data.txt") as f:
    counter = 0
    # search for pattern in each line 
    for line in f:
        m = re.search(r"\*", line)
        # only if a match was found, increase counter and print the line
        if m:
            counter += 1
            print(line)
print("The pattern was matched in {0} lines".format(counter))

contig01987	2b14_human ame: full=hla class ii histocompatibility drb1-4 beta chain ame: full=mhc class ii antigen drb1*4 short=dr-4 short=dr4 flags: precursor	112	5	5.07E-10	78.60%

contig05816	1c08_human ame: full=hla class i histocompatibility cw-8 alpha chain ame: full=mhc class i antigen cw*8 flags: precursor	825	5	1.06E-97	67.80%

contig05821	1b46_human ame: full=hla class i histocompatibility b-46 alpha chain ame: full=bw-46 ame: full=mhc class i antigen b*46 flags: precursor	606	5	9.86E-09	67.60%

contig22120	2b11_human ame: full=hla class ii histocompatibility drb1-1 beta chain ame: full=mhc class ii antigen drb1*1 short=dr-1 short=dr1 flags: precursor	137	5	5.05E-11	76.80%

contig23154	mmrn1_human ame: full=multimerin-1 ame: full=emilin-4 ame: full=elastin microfibril interface located protein 4 short=elastin microfibril interfacer 4 ame: full=endothelial cell multimerin contains: ame: full=platelet glycoprotein ia* contains: ame: full=155 kda platelet multimerin short=p-155 s

Note: `if m:` checks whether the search found a match.
* no match -> `re.search` returns `None`, which evaluates to `False`
* a match -> `re.search` returns a match object, which evaluates to `True`
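
The two cases can be checked directly. A small sketch with made-up input strings:

```python
import re

hit = re.search(r"\*", "antigen drb1*4")
miss = re.search(r"\*", "no asterisk here")

print(hit is None)    # False
print(miss is None)   # True

# calling .group() on None raises an AttributeError, so test the result first
if miss:
    result = miss.group()
else:
    result = "no match"
print(result)         # no match
```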

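The code above only searches for a literal `*`. A more specific pattern for these lines can be built from the following pieces:
* `mhc` -> the literal text `mhc`
* `[\s\w*]+` -> followed by one or more characters, each of which is a white space, a word character or a literal `*` (inside a set, `*` is not a quantifier)
* `\*` -> followed by the `*` sign
* `\w*` -> followed by zero or more word characters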
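
Assuming the four pieces listed above are concatenated in that order, the full pattern is `mhc[\s\w*]+\*\w*`. The sketch below applies it to a fragment of the first line printed above; the fragment is shortened by hand for readability:

```python
import re

# fragment of the contig01987 line shown in the output above
line = "ame: full=mhc class ii antigen drb1*4 short=dr-4"

m = re.search(r"mhc[\s\w*]+\*\w*", line)
print(m.group())   # mhc class ii antigen drb1*4
```

The set `[\s\w*]+` first runs on past the `*`, then backtracks until `\*` can match the asterisk, so the match ends right after the allele number.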

### Functions of the re module

* `re.findall(reg, target_string)` -> return a list of all the matches
* `re.finditer(reg, target_string)` -> return an iterator of match objects (useful in for loops to step through the matches one by one)
* `re.compile(reg)` -> store a pattern for repeated use
* `re.split(reg, target_string)` -> split the text by the pattern
* `re.sub(reg, repl, target_string)` -> substitute each nonoverlapping occurrence of the match with the text in `repl`
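
The functions listed above can be tried on a short made-up string before moving to real sequence data. A minimal sketch; `text` and `dashes` are illustrative names:

```python
import re

text = "ATG-CCA--TTG"

# re.compile stores the pattern; the compiled object offers the same functions as methods
dashes = re.compile(r"-+")

parts = dashes.split(text)
print(parts)       # ['ATG', 'CCA', 'TTG']

joined = dashes.sub("", text)
print(joined)      # ATGCCATTG

runs = re.findall(r"T+", text)
print(runs)        # ['T', 'TT']

starts = [m.start() for m in re.finditer(r"T+", text)]
print(starts)      # [1, 9]
```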

In [1]:
import re
import pyfaidx

If the module is not installed, install it from within the notebook with     
`%pip install pyfaidx`

The **pyfaidx** module --> facilitates working with FASTA files.    

**pyfaidx.Fasta** object behaves like a dictionary.       
* keys -> the sequence names        
* values -> FastaRecord
* Slicing FastaRecord gives pyfaidx.Sequence object
* pyfaidx.Sequence object has an attribute `.seq` that contains the sequence string.

In [7]:
file_path = "../../regex/data/Ecoli.fasta"
genes = pyfaidx.Fasta(file_path)

In [8]:
genes.items()

odict_items([('gi|556503834|ref|NC_000913.3|:1978338-2028069', FastaRecord("gi|556503834|ref|NC_000913.3|:1978338-2028069")), ('gi|556503834|ref|NC_000913.3|:4035299-4037302', FastaRecord("gi|556503834|ref|NC_000913.3|:4035299-4037302"))])

In [9]:
records = list(genes.keys())
records

['gi|556503834|ref|NC_000913.3|:1978338-2028069',
 'gi|556503834|ref|NC_000913.3|:4035299-4037302']

In [10]:
# extract first sequence from genes dictionary
# get sequence start to finish
seq1 = genes[records[0]][:]

In [11]:
type(seq1)

pyfaidx.Sequence

In [12]:
# End coordinate of the slice; for a full-length slice this equals the sequence length
seq1.end

49732

In [13]:
# Get the sequence as a string
seq1_str = seq1.seq

In [14]:
# print the first 40 nucleotides
seq1_str[:40]

'AATATGTCCTTACAAATAGAAATGGGTCTTTACACTTATC'

In [15]:
# search for the pattern "GATC" with re
m = re.search(r"GATC", seq1_str)
m.group()

'GATC'

In [None]:
# extract the start and end positions of the match
m.start()

In [None]:
m.end()

To find all matches, use `re.findall()`

In [18]:
# find all matches in seq1_str for the patterns
# AACNNNNNNGTGC and GCACNNNNNNGTT (N = any nucleotide)
m = re.findall(r"AAC[ATCG]{6}GTGC|GCAC[ATCG]{6}GTT", seq1_str)
if m:
    print("There are", len(m), "items")
    print(m)

There are 6 items
['AACAGCATCGTGC', 'AACTGGCGGGTGC', 'GCACCACCGCGTT', 'GCACAACAAGGTT', 'GCACCGCTGGGTT', 'AACCTGCCGGTGC']


To find all matches and get their positions, use `re.finditer()`

In [19]:
hits = re.finditer(r"AAC[ATCG]{6}GTGC|GCAC[ATCG]{6}GTT", seq1_str)
for item in hits:
    print(item.start() + 1, item.group())

18452 AACAGCATCGTGC
18750 AACTGGCGGGTGC
25767 GCACCACCGCGTT
35183 GCACAACAAGGTT
40745 GCACCGCTGGGTT
42032 AACCTGCCGGTGC


In [27]:
# match a G, followed by 2 T's


### Groups in regular expressions
* use `()` to structure the regular expression
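
Grouping changes what a quantifier applies to. The sketch below contrasts "a G followed by two T's" with "GT twice" on a made-up toy sequence:

```python
import re

dna = "AGTTGTGTC"   # hypothetical toy sequence

# without a group, {2} applies only to the T: a G followed by two T's
no_group = re.search(r"GT{2}", dna)
print(no_group.group())     # GTT

# with a group, {2} applies to the whole "GT" unit: GT twice
with_group = re.search(r"(GT){2}", dna)
print(with_group.group())   # GTGT
```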

In [None]:
# match GT twice


We can use groups to search for an unknown sequence flanked by known sequences    
`(known_seq_1)(variable_seq)(known_seq_2)`

In [None]:
# get the second sequence from the Ecoli fasta file 

# compile regex pattern


To return the matches:
* Full match: `m.group()` or `m.group(0)`
* match for group1: `m.group(1)`
* match for group2: `m.group(2)`
* match for group3: `m.group(3)`
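
As a sketch of how the group numbers map onto a pattern, the example below uses a short made-up sequence rather than the E. coli data. The flanking motifs `AAC` and `GTGC` are borrowed from the search pattern used earlier:

```python
import re

# hypothetical toy sequence, not taken from the FASTA file
seq = "TTAACGGGCATGTGCAA"

# three groups: known start, variable middle (6 nucleotides), known end
pattern = re.compile(r"(AAC)([ACGT]{6})(GTGC)")
m = pattern.search(seq)

print(m.group(0))   # AACGGGCATGTGC
print(m.group(1))   # AAC
print(m.group(2))   # GGGCAT
print(m.group(3))   # GTGC
print(m.groups())   # ('AAC', 'GGGCAT', 'GTGC')
```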

In [None]:
# Now we can individually return the matched groups
# group(0) returns the entire match
# we look at only the first 40 nucleotides


In [None]:
# group(1) returns the first group


In [None]:
# group(2) is the variable middle part


In [None]:
# we can perform further analysis on this group


In [None]:
# group(3) returns the last group


### Verbose Regular Expressions

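`re.VERBOSE` is a flag in the `re` module that lets you spread a regular expression over several lines and annotate it. With this flag, whitespace and line breaks in the pattern are ignored (except inside a set `[]` or when escaped with `\`), and everything from `#` to the end of the line is treated as a comment.

Example: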
```python
import re

target_string = "room B204 is open"  # hypothetical example text
regpat = r"""
        \s     # white space
        \w*    # zero or more word characters
        \d{3}  # 3 digits
        """
re.search(regpat, target_string, re.VERBOSE).group()
```

In [None]:
# pattern to match a zip code

