## Regular expressions for string operations
- Note that regular expressions operate on strings, such as long texts or the string elements inside a collection, but not on collections (list, tuple, dictionary, set) directly.

**fullmatch** checks whether the entire string in its second argument matches the pattern in its first argument. We start with it because it is the simplest to reason about, although it is less useful in practice than the search functions.

In [None]:
import re
pattern = "02215"
"Match" if re.fullmatch(pattern, "02215") else "No match"

Regular expressions typically contain various special symbols called metacharacters:

- Regular expression metacharacters
[] {} () \ * + ^ $ ? . |

- **\ metacharacter** begins each predefined character class

In [None]:
# In \d{5}, \d is a character class representing a digit (0–9)
# {5} repeats \d five times to match five consecutive digits
# The r prefix makes a raw string, so Python passes the backslash to re unchanged

"Valid" if re.fullmatch(r'\d{5}', "02215") else "Invalid"

In [None]:
"Valid" if re.fullmatch(r'\d{5}', "9877") else "Invalid"

The **?** quantifier matches **zero or one** occurrence of a subexpression
-  compare with *, which matches zero or more occurrences, and +, which matches one or more

In [None]:
'Match' if re.fullmatch('labell?ed', 'labeled') else 'No match'
# l? means the second l is optional: zero or one l may appear before the remaining literal characters ed
# try *, +, *?, +?, . in place of ?
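
The suggestion to try the other quantifiers can be sketched as follows. The word list and the variable names `greedy` and `lazy` are chosen here only for illustration; the second part contrasts a greedy quantifier (`.+`) with its lazy form (`.+?`).

```python
import re

words = ['labeled', 'labelled', 'labellled']

# fullmatch tests each pattern against the whole word.
for pattern in ['labell?ed', 'labell*ed', 'labell+ed']:
    matched = [word for word in words if re.fullmatch(pattern, word)]
    print(f'{pattern:<10} matches {matched}')

# Greedy .+ takes as many characters as possible; lazy .+? takes as few as possible.
greedy = re.search('l.+l', 'labelled').group()
lazy = re.search('l.+?l', 'labelled').group()
print(greedy, lazy)
```

Here `?` accepts the first two words, `*` accepts all three, and `+` rejects only 'labeled'.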

Match at least n occurrences of a subexpression with the {n,} quantifier; {n,m} matches between n and m occurrences

In [None]:
'Match' if re.fullmatch(r'\d{3,}', '1245548') else 'No match'

In [None]:
# seven digits exceed the upper bound of {3,6}, so fullmatch fails
'Match' if re.fullmatch(r'\d{3,6}', '1245548') else 'No match'

To match any metacharacter as its literal value, precede it with a backslash (\\). A backslash followed by one of the letters below denotes a predefined character class or a zero-width assertion instead: \
    **\d     Any digit** \
    \D     Any character that is not a digit \
    \s     Any whitespace character \
    \S     Any character that is not whitespace \
    **\w     Any word character (letter, digit, or underscore)** \
    \W     Any character that is not a word character \
    \b     Matches the empty string at the beginning or end of a word \
    \B     Matches the empty string not at the beginning or end of a word

**By default, each character class matches a single character; add a quantifier to match more.**
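
A small sketch of how these classes behave in practice. The sample sentence and variable names are made up for illustration.

```python
import re

sample = 'Order 66: ship 12 crates_of tea by 9am!'

single_digits = re.findall(r'\d', sample)    # one digit per match
digit_runs = re.findall(r'\d+', sample)      # + extends the class to runs of digits
word_runs = re.findall(r'\w+', sample)       # letters, digits, and underscore
whole_words = re.findall(r'\bship\b', 'ship shipment ship')  # \b keeps 'shipment' out

print(single_digits)
print(digit_runs)
print(word_runs)
print(whole_words)
```

Without a quantifier, `\d` returns each digit separately; with `+` the digits are grouped into numbers.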
   

 **For any given pattern, there are usually several ways to describe it with a regular expression.**

**Custom character classes** - https://docs.python.org/3/library/re.html
- Square brackets, [], define a custom character class that matches a **single** character
- [A-Z] matches an uppercase letter
- [a-z] matches a lowercase letter
- [a-zA-Z] matches any lowercase or uppercase letter
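
A brief sketch of custom character classes. The pattern name `name_pattern` and the test strings are assumptions made for this example.

```python
import re

# One uppercase letter followed by zero or more lowercase letters.
name_pattern = r'[A-Z][a-z]*'
capitalized = bool(re.fullmatch(name_pattern, 'Wally'))
lowercase = bool(re.fullmatch(name_pattern, 'wally'))

# A ^ placed first inside the brackets negates the class.
non_bases = re.findall(r'[^ACGT]', 'ACGTNACGX')

# Inside the brackets, most metacharacters are treated as literal characters.
literals = re.findall(r'[*+$]', 'a*b+c$d')

print(capitalized, lowercase, non_bases, literals)
```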

**Comparison:**
- re.search: stops after the first match and returns a match object (or None). Use result.start() or result.end() for the location, and result.group() or result[0] for the matched text
- re.findall: returns all non-overlapping matches of the pattern in the string as a list of strings
- re.finditer: returns an iterator of match objects, which include the locations of the matches
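
The three functions can be compared side by side on a short made-up string (`dna` is an illustrative name, not the EGFR sequence used below):

```python
import re

dna = 'ATGCATGCATG'

# search: first match only, returned as a match object
first = re.search('ATG', dna)
print(first.group(), first[0], first.start(), first.end())

# findall: every non-overlapping match, as plain strings
all_matches = re.findall('ATG', dna)
print(all_matches)

# finditer: an iterator of match objects, so locations are available
spans = [m.span() for m in re.finditer('ATG', dna)]
print(spans)
```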

In [None]:
EGFR_1 = " 1 agacgtccgg gcagcccccg gcgcagcgcg gccgcagcag cctccgcccc ccgcacggtg\
       61 tgagcgcccg acgcggccga ggcggccgga gtcccgagct agccccggcg gccgccgccg\
      121 cccagaccgg acgacaggcc acctcgtcgg cgtccgcccg agtccccgcc tcgccgccaa\
      181 cgccacaacc accgcgcacg gccccctgac tccgtccagt attgatcggg agagccggag\
      241 cgagctcttc ggggagcagc gatgcgaccc tccgggacgg ccggggcagc gctcctggcg\
      301 ctgctggctg cgctctgccc ggcgagtcgg gctctggagg aaaagaaagg taagggcgtg\
      361 tctcgccggc tcccgcgccg cccccggatc gcgccccgga ccccgcagcc cgcccaaccg\
      421 cgcaccggcg caccggctcg gcgcccgcgc ccccgcccgt cctttcctgt ttccttgaga\
      481 tcagctgcgc cgccgaccgg gaccgcggga ggaacgggac gtttcgttct tcggccggga\
      541 gagtctgggg cgggcggagg aggagacgcg tgggacaccg ggctgcaggc caggcgggga\
      601 acggccgccg ggacctccgg cgccccgaac cgctcccaac tttcttccct cactttcccc\
      661 gcccagctgc gcaggatcgg cgtcagtggg cgaaagccgg gtgctggtgg gcgcctgggg\
      721 ccggggtccc gcacgtgcgc cccgcgctgt cttcccaggg cgcgacgggg tcctggcgcg\
      781 cacccgaggg gcgggcgctg cccacccgcc gagactgcac tgtttaggga agctgaggaa\
      841 ggaacccaaa aatacagcct cccctcggac cccgcgggac aggcggcttt ctgagaggac\
      901 ctccccgcct ccgccctccg cgcaggtctc aaactgaagc cggcgcccgc cagcctggcc\
      961 ccggcccctc tccaggtccc cgcgatcctc gttccccagt gtggagtcgc agcctcgacc\
     1021 tgggagctgg gagaactcgt ctaccaccac ctgcggctcc cggggagggg tggtgctggc\
     1081 ggcggttagt ttcctcgttg gcaaaaggca ggtggggtcc gacccgcccc ttgggcgcag\
     1141 accccggccg ctcgcctcgc ccggtgcgcc ctcgtcttgc ctatccaaga gtgcccccca\
     1201 cctcccgggg accccagctc cctcctgggc gcccgcgccg aaagccccag gctctccttc\
     1261 gatggccgcc tcgcggagac gtccgggtct gctccacctg cagcccttcg gtcgcgcctg\
     1321 ggcttcgcgg tggagcggga cgcggctgtc cggccactgc agggggggat cgcgggactc\
     1381 ttgagcggaa gccccggaag cagagctcat cctggccaac accatggtgt ttcaaaatgg\
     1441 ggctcacagc aaacttctcc tcaaaacccg gagactttct ttcttggatg tctctttttg\
     1501 ctgtttgaag aatttgagcc aaccaaaata ttaaacctgt cttacacaca cacacacaca\
     1561 cacacacaca cacacaccgg attgctgtcc ctggttcaag tgtgccaagt gtgcagacag\
     1621 aacatgagcg agtctggctt cgtgactacc gaccataaac ccacttgaca ggggaaacat"
# keep only the bases: drop the spaces and the position numbers
EGFR_2 = [base for base in EGFR_1 if base != " " and not base.isdigit()]
new_seq = "".join(EGFR_2).upper()
new_seq

In [None]:
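# use new_seq (the EGFR sequence above) as the example for ALTERNATION
# A|T means either A or T; inside square brackets | is a literal character, so write (A|T) or [AT] rather than [A|T]
import re
if re.search("GG(A|T)CC", new_seq):
    print("restriction site found!")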

In [None]:
if re.search("GG[ATGC]CC", new_seq):   # character class: any one of A, T, G, C in the middle position
    print("restriction site found!")

#### Question: match any sequence of DNA starting with ATG and terminated by a stop codon (TAG, TAA, or TGA)

In [None]:

result_TAG = re.findall("ATG.+?TAG", new_seq)  
result_TAA = re.findall("ATG.+?TAA", new_seq)  
result_TGA = re.findall("ATG.+?TGA", new_seq)  

result = result_TAG, result_TAA, result_TGA
result

# What is wrong?

In [None]:
# What is wrong?
# match any sequence of DNA starting with ATG and terminated by a stop codon (TAG, TAA, or TGA)
# As the target string is scanned, REs separated by '|' are tried from left to right. Once the left alternative matches, the remaining ones are not tried
result = re.findall("ATG.+?T[A|G][G|A]", new_seq)  
result    

In [None]:
result = re.findall("(ATG[A-Z]{1,})", new_seq) 
# Adding groups to a pattern lets you isolate parts of the matching text, expanding those capabilities to create a parser. Groups are defined by enclosing patterns in parentheses (( and ))
result    

In [None]:
# Wrong?
# match any sequence of DNA starting with ATG and terminated by a stop codon (TAG, TAA, or TGA)
result = re.findall("ATG[A-Z]{1,}T[AA|AG|GA]",new_seq) 
result      # with limited span, otherwise too aggressive. but try to maximize it

In [None]:
# Wrong?
# match any sequence of DNA starting with ATG and terminated by a stop codon (TAG, TAA, or TGA)
result = re.findall("ATG[A-Z]T[AA|AG|GA]",new_seq) 
result      # with limited span, otherwise too aggressive. but try to maximize it

In [None]:
result = re.findall("ATG[A-Z]{1,10000}[TAA|TAG|TGA]",new_seq) 
result      # with limited span, otherwise too aggressive  
# copy ATG.. to each |

In [None]:
 # match any sequence of DNA starting with ATG and terminated by a stop codon (TAG, TAA, or TGA). Only show the seq before TAG
# (?=...)Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion.
result = re.findall("ATG.+?(?=TAG)", new_seq) 
result   

In [None]:
 # match any sequence of DNA starting with ATG and terminated by a stop codon (TAG, TAA, or TGA)
result = re.findall("ATG.+?(?=TAG|TAA|TGA)", new_seq) 
result   # see the cutoff?

In [None]:
 # match any sequence of DNA starting with ATG and terminated by a stop codon (TAG, TAA, or TGA)
result = re.findall(r"ATG.+?(?=TAG|TAA|TGA)\w\w\w", new_seq) 
result   

In [None]:
result = re.finditer(r"ATG.+?(?=TAG|TAA|TGA)\w\w\w", new_seq)
result_1 = [item.group() for item in result]   # .group() gives the matched sequence; item.start() and item.end() give its location
result_1  
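
The difference between a character class and alternation, which the "What is wrong?" cells above explore, can be sketched on a toy sequence. `toy_seq` is a short made-up string, and the non-capturing group `(?:...)` is one possible way to write the pattern, not necessarily the intended answer. Note that this sketch does not check the reading frame.

```python
import re

# A short made-up sequence, used only for illustration.
toy_seq = 'CCATGAAACCCTGGTTTTAGCC'

# [TAA|TAG|TGA] is a character class: it matches ONE character out of T, A, G, or |.
class_hits = re.findall('[TAA|TAG|TGA]', 'T|C')

# T[AG][GA] accepts TAG, TAA, and TGA, but also TGG, which is not a stop codon.
accepts_tgg = re.fullmatch('T[AG][GA]', 'TGG') is not None

# (?:TAG|TAA|TGA) is alternation inside a non-capturing group: one of three whole codons.
orfs = re.findall('ATG.+?(?:TAG|TAA|TGA)', toy_seq)

print(class_hits, accepts_tgg, orfs)
```

Because the group is non-capturing, findall returns the whole match, including the stop codon.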

**Useful functions**

search: looks in a string for the **first occurrence** of a substring that matches a regular expression and returns a match object that contains the matching substring

In [None]:
# use $ to anchor the pattern at the end of a string (^ anchors it at the beginning)
result = re.search('fun$', 'Python is fun')
result.group() if result else 'not found'

In [None]:
result = re.search('is$', 'Python is fun')
result.group() if result else 'not found'

### **findall**: find every matching substring in a string

In [None]:
contact = 'Wally White, Home: 555-555-1234, Work: 555-555-4321'
re.findall(r'\d{3}-\d{3}-\d{4}', contact)   # optionally add \b at the front to require a word boundary before the number

**Capturing substrings in a Match**
- Use parentheses metacharacters ( and ) to capture substrings in a match

In [None]:
text = 'Pieter-Jan Kwant | Chief of Health Product Department | email: kwantP@ge.com | Office +01 (525)374-8546'

In [None]:
pattern = r'email: (\w+@\w+\.\w{2,})'
# \w+ means one or more word characters. \. means a literal dot; an unescaped dot matches any character.
# the last \w{2,} means two or more word characters.
result = re.findall(pattern, text)
result
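
When a pattern contains more than one group, findall returns tuples. The sketch below splits the address into user and domain; the shortened `text_line` and the second group are assumptions added for illustration.

```python
import re

text_line = 'Pieter-Jan Kwant | email: kwantP@ge.com | Office +01 (525)374-8546'
two_groups = r'email: (\w+)@(\w+\.\w{2,})'

# Two groups: findall returns a list of tuples, one element per group.
pairs = re.findall(two_groups, text_line)
print(pairs)

# With search, group(0) is the whole match; group(1) and group(2) are the captures.
m = re.search(two_groups, text_line)
print(m.group(0), m.group(1), m.group(2), sep=' | ')
```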

In [None]:
email_text = '"From: jun.wang.5" <jun.wang.5@stonybrook.edu>; \
Date: Fri, Nov 29, 2019 08:57 AM \
To: "Sam Goody"<**yl@foxmail.com>; \
Subject: Re: Application for PhD program'

y = re.findall("@([^ ]*)", email_text)
y

In [None]:
#email_text = '"From: jun.wang.5" <jun.wang.5@stonybrook.edu>;'
y = re.findall(r"(\w+@[^ ]*)", email_text)
y

**Q: how to extract the whole email jun.wang.5@stonybrook.edu ?**

Answer:

In [None]:
import re
y = re.findall("<(.*?)>", email_text)
y

In [None]:
y = re.findall(r"(\w{3}\.\w{4}\.\w+@[^ ]*)", email_text)
y

### Example: analysis of characters through ranking the occurrence frequency of names

- using .split() alone does not remove punctuation marks
- New strategy: replace non-alphanumeric characters with spaces

In [None]:
# Let's read a novel Gone with the Wind. You can find the free book here: http://gutenberg.net.au/ebooks02/0200161.txt
# download and save the file under the current working path of notebook
with open('GoneWithTheWind.txt', mode='r', encoding='utf-8') as novel:
    data = novel.readlines()
data

In [None]:
type(data)

In [None]:
# we need to flatten the whole list into one string before regex searching

one_string = ''.join(data)

In [None]:
one_string

### Q: How to extract only the words?

In [None]:
# Option 1: to remove the non-alphanumeric characters first, and leave only words separated by space. Then use split method of string


In [None]:

word_counts = {}   # maps each word to the number of times it occurs

for word in processed:
    if word in word_counts:
        word_counts[word] += 1  # update existing key-value pair
    else:
        word_counts[word] = 1  # insert new key-value pair

print(f'{"WORD": <12} COUNT')

for word, count in sorted(word_counts.items()):
    print(f'{word: <12} {count}')
      


In [None]:
word_counts.items()

In [None]:

print('\nNumber of unique words:', len(word_counts))   
      
sorted_by_values = sorted(word_counts.items(), key=lambda x:x[1], reverse=True)
# sorted_by_values 

print(f'{"WORD": <12} COUNT') 
for key, value in sorted_by_values:
    print(f'{key: <12} {value}')    
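
As an alternative to the manual counting loop and the sort by value, the standard library's `collections.Counter` does both. The short `words` list is a stand-in for the list of words extracted from the novel.

```python
from collections import Counter

# Stand-in for the list of words extracted from the novel.
words = ['war', 'Scarlett', 'war', 'Rhett', 'Scarlett', 'war']

counts = Counter(words)          # counts every distinct word
top_two = counts.most_common(2)  # sorted by count, highest first

print(len(counts), top_two)
```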

In [None]:
re.finditer?

In [None]:
# find the location of each mention of Scarlett in the novel
iterable = re.finditer("Scarlett", one_string)
Scarlett_indices = [m.start(0) for m in iterable]

In [None]:
Scarlett_indices

In [None]:
iterable = re.finditer("Melanie", one_string)
Melanie_indices = [m.start(0) for m in iterable]

In [None]:
len(Melanie_indices)

In [None]:
iterable = re.finditer("Ashley", one_string)
Ashley_indices = [m.start(0) for m in iterable]
len(Ashley_indices)

In [None]:
iterable = re.finditer("Rhett", one_string)
Rhett_indices = [m.start(0) for m in iterable]
len(Rhett_indices)

In [None]:
iterable = re.finditer("Suellen", one_string)
Suellen_indices = [m.start(0) for m in iterable]
len(Suellen_indices)
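
The repeated cells above follow one pattern, which could be wrapped in a small helper. `name_positions` is a hypothetical function name, and `sample` is a made-up sentence standing in for the novel text.

```python
import re

def name_positions(name, text):
    """Return the start index of every occurrence of name in text."""
    # re.escape guards against names that contain regex metacharacters.
    return [m.start() for m in re.finditer(re.escape(name), text)]

sample = 'Scarlett met Rhett. Rhett laughed; Scarlett did not.'
positions = {name: name_positions(name, sample)
             for name in ['Scarlett', 'Rhett', 'Ashley']}
print(positions)
```

A name that never occurs simply yields an empty list.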


In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=[16,5])

plt.hist(Scarlett_indices, 1000, facecolor='blue', alpha=0.5)

#plt.hist(Rhett_indices, 1000, facecolor='red', alpha=0.5)

#plt.hist(Ashley_indices, 1000, facecolor='yellow', alpha=0.5)

#plt.hist(Melanie_indices, 1000, facecolor='green', alpha=0.5)

plt.hist(Suellen_indices, 1000, facecolor='purple', alpha=0.5)

plt.show()



In [None]:
# cumulative curve: the fraction of each character's mentions seen up to a given position
plt.figure(figsize=[10,5])

plt.hist(Scarlett_indices, 1000, color='blue', alpha=0.5, density=True, histtype='step',
                           cumulative=True, label='Scarlett')

plt.hist(Rhett_indices, 1000, color='red', alpha=0.5, density=True, histtype='step',
                           cumulative=True, label='Rhett')

plt.hist(Ashley_indices, 1000, color='yellow', alpha=0.5, density=True, histtype='step',
                           cumulative=True, label='Ashley')

plt.hist(Melanie_indices, 1000, color='green', alpha=0.5, density=True, histtype='step',
                           cumulative=True, label='Melanie')

plt.hist(Suellen_indices, 1000, color='purple', alpha=0.5, density=True, histtype='step',
                           cumulative=True, label='Suellen')

plt.legend(loc='upper left')

plt.show()


## Dictionaries



**Basic dictionary operation**

In [None]:
# Dictionaries keep insertion order in Python 3.7+, but they are not sorted.
# Avoid writing code that depends on the order of the key–value pairs.
# The keys are unique. Values, of course, do not have to be unique

roman_numerals = {'I': 1, 'II': 2, 'III': 3, 'V': 5, 'X': 100}

In [None]:
# type roman_numerals. and press Tab to see which methods are available for a dictionary

In [None]:
# accessing the value by referring to the key
roman_numerals['II']

In [None]:
# update the value of a key
roman_numerals["X"] = 10

In [None]:
# or use method to update
roman_numerals.update({"II": 55})
# roman_numerals.update(II=55)   # update also accepts keyword arguments (note that II is not quoted) or an iterable of (key, value) tuples
roman_numerals

In [None]:
# Adding a new key-value pair
roman_numerals["L"] = 50
roman_numerals

In [None]:
# Delete a key-value pair
del roman_numerals['III']
roman_numerals

In [None]:
# Or, use pop to remove one pair
roman_numerals.pop("I")

In [None]:
# Method get returns the value for the key given as its first argument, or None if the key is not found
# If a second argument is given, get returns it instead of None when the key is NOT found

roman_numerals.get("II")    # return the value of that key

In [None]:
roman_numerals.get("III")
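
A short sketch of get with and without a default value, using a separate dictionary named `numerals` so the cells above are unaffected. The last part shows a common counting idiom built on get.

```python
numerals = {'I': 1, 'V': 5, 'X': 10}

found = numerals.get('V')            # key present: its value is returned
missing = numerals.get('L')          # key absent, no default: None
with_default = numerals.get('L', 0)  # key absent: the second argument is returned
print(found, missing, with_default)

# get with a default of 0 removes the need for an if/else when counting.
letter_counts = {}
for letter in 'banana':
    letter_counts[letter] = letter_counts.get(letter, 0) + 1
print(letter_counts)
```

Unlike `numerals['L']`, which raises a KeyError, get never fails on a missing key.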

In [None]:
# Either way, you can use in to test whether a key is present. It does not test values!
# This differs from a list, where in tests the elements themselves
"III" in roman_numerals

**Methods associated with dictionary**

In [None]:
# .keys() gives all the keys
roman_numerals.keys()

In [None]:
# .values gives all the values
roman_numerals.values()

In [None]:
# .items() returns each key and its value together as a tuple
roman_numerals.items()

In [None]:
#Iteration of dictionary
#Dictionary method items returns each key–value pair as a tuple

for keys, values in roman_numerals.items():
    print(f"{keys}: {values}")

In [None]:
# A dictionary is not kept in sorted order, and usually does not need to be

# the keys can be sorted
sorted(roman_numerals.keys())   
# But this returns a list of keys without the values
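
To keep the values while sorting, one option is to sort the items and rebuild a dictionary. This sketch assumes Python 3.7 or later, where dictionaries preserve insertion order; `numerals`, `by_key`, and `by_value` are illustrative names.

```python
numerals = {'X': 10, 'I': 1, 'V': 5}

# Sorting the (key, value) tuples orders them by key.
by_key = dict(sorted(numerals.items()))

# A key function selects the value instead; reverse=True puts the largest first.
by_value = dict(sorted(numerals.items(), key=lambda pair: pair[1], reverse=True))

print(by_key)
print(by_value)
```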


**Dictionary comprehension:** A dictionary comprehension can also map a dictionary’s values to new values. \
The expression to the left of the **for** clause specifies a **key–value pair of the form key: value**.

In [None]:
grades = {'Sue': [98, 87, 94], 'Bob': [84, 95, 91]}

grades2 = {key: sum(value)/len(value) for key, value in grades.items()}

In [None]:
grades2

In [None]:
# Alternatively, use map with a lambda function

grades2 = dict(map(lambda x: (x[0], sum(x[1])/len(x[1])), grades.items()))

In [None]:
grades2

## Sets
- mutable, unordered collections of unique elements

In [None]:
# create a set
colors = {'red', 'orange', 'yellow', 'green', 'red', 'blue'}
colors

In [None]:
# creating an empty set with a = {} is tricky

a = {}  # This actually creates an empty dictionary
b = set(a)  # converting it gives an empty set; set() alone does the same
type(b)

In [None]:
# type colors. and press Tab to see which methods are available for a set

In [None]:
'red' in colors

In [None]:
# we often use list(), set(), tuple() etc. to convert between collection types
numbers = list(range(10)) + list(range(5))
numbers

In [None]:
numbers_2 = set(numbers)
numbers_2    # the duplicate items are gone

In [None]:
# union of two sets
{1, 3, 5} | {1, 2, 3, 4, 5, 6}

# or {1, 3, 5}.union({1, 2, 3, 4, 5, 6}) # note set, list,and tuple are acceptable in union()

In [None]:
# Intersection: extract items in common
{1, 3, 5} & {1, 2, 3, 4, 5, 6}
# or {1, 3, 5}.intersection({1, 2, 3, 4, 5, 6})

In [None]:
# Difference: items in the first set that are not in the second

{1, 2, 3, 4, 5, 6} - {1, 3, 5} 
# or {1, 2, 3, 4, 5, 6}.difference({1, 3, 5})

# try it: do lists support this operation?


In [None]:
{1, 2, 3, 4, 5, 6}.difference({1, 3, 5})
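
A few related operations, sketched with illustrative names (`odds`, `all_numbers`). The order of the operands matters for difference, and lists do not support the `-` operator at all.

```python
odds = {1, 3, 5}
all_numbers = {1, 2, 3, 4, 5, 6}

forward = all_numbers - odds    # items only in all_numbers
backward = odds - all_numbers   # nothing is only in odds, so this is empty
symmetric = odds ^ {5, 7}       # items in exactly one of the two sets
is_subset = odds <= all_numbers

print(forward, backward, symmetric, is_subset)

# Lists have no difference operator; the attempt raises TypeError.
try:
    [1, 2, 3] - [1]
    list_supports_minus = True
except TypeError:
    list_supports_minus = False
print(list_supports_minus)
```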

In [None]:
# Add elements to an existing set
numbers = {1, 2, 3, 4, 5, 6}
numbers.add(23)
numbers

In [None]:
# remove elements
numbers.remove(4)
numbers
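
Note that remove raises a KeyError when the element is absent, while discard does nothing in that case. A minimal sketch, using a separate set named `small` for illustration:

```python
small = {1, 2, 3}

small.discard(99)   # 99 is not in the set: no error, set unchanged

try:
    small.remove(99)   # remove insists that the element exists
    remove_failed = False
except KeyError:
    remove_failed = True

print(small, remove_failed)
```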

In [None]:
# Set comprehension
numbers = [1, 2, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 10]
evens = {item for item in numbers if item % 2 == 0}
evens