In [None]:
import pandas as pd
import re
from IPython.display import clear_output

* Write regular expressions to extract gene mutations and protein changes

A protein change is written as the single-letter amino acid code for the original residue, followed by the numeric codon position, followed by the single-letter code for the changed residue.

fs->frameshift
X stop Stop count
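
The notation above can be sketched as a single pattern with named groups. This is only an illustration: the example strings (`R882H`, `W288fs`, `R693X`) are made up rather than taken from the table, and the group names `ref`, `pos`, and `alt` are hypothetical.

```python
import re

# ref = original residue, pos = codon position, alt = new residue, "fs", or "X"
protein_change = re.compile(r"\b(?P<ref>[A-Z])(?P<pos>\d+)(?P<alt>fs|[A-Z])")

def parse_protein_changes(text):
    """Return (ref, position, alt) tuples for every protein change found."""
    return [(m.group("ref"), int(m.group("pos")), m.group("alt"))
            for m in protein_change.finditer(text)]

# Made-up examples, not values from the table above
parsed = parse_protein_changes("R882H, W288fs, R693X")
for ref, pos, alt in parsed:
    print(ref, pos, alt)
```

The real table may use other conventions (for example `*` for stop), so the pattern would need adjusting after inspecting the column.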


In [None]:
bcj201559t1 = pd.read_html("""http://www.nature.com/bcj/journal/v5/n7/fig_tab/bcj201559t1.html#figure-title""", 
                           skiprows=0)[0]
bcj201559t1.head(10)

In [None]:
bcj201559t1.iloc[:,4]

### Amino Acid Codes

In [None]:
aac = pd.read_html("http://130.88.97.239/bioactivity/aacodefrm.html")[0]
aac

In [None]:
aacs = list(aac[0])+list(aac[13])

# In-class Exercises
* Write a regular expression to extract the sequence ID from a fasta file.

>A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:

~~~~
>P01013 GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
~~~~

In [None]:
with open("../Resources/ex_ref.fasta") as f0:
    fseq = f0.read()
fseq[:1000]

In [None]:
seqid = re.compile(r"""(>(?P<id>[A-Z0-9]+))""")
for r in seqid.finditer(fseq):
    print(r.group("id"))
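
A possible variation, shown as a sketch rather than the in-class answer: anchoring the pattern to the start of a line with `re.MULTILINE` and taking everything up to the first whitespace avoids assumptions about which characters an ID may contain. The second record below is made up for illustration.

```python
import re

# Small in-memory FASTA text; the second record is hypothetical
fasta_text = (
    ">P01013 GENE X PROTEIN (OVALBUMIN-RELATED)\n"
    "QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE\n"
    ">Q9XYZ1 hypothetical second record\n"
    "MKILELPFASGDLSML\n"
)

# ^ with re.MULTILINE matches at the start of every line, so only deflines match
defline_id = re.compile(r"^>(?P<id>\S+)", re.MULTILINE)

ids = [m.group("id") for m in defline_id.finditer(fasta_text)]
print(ids)
```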

* Write a regular expression to extract the sequence ID from a fastq file.

>Files from various platforms employing this format are acceptable:

~~~~
@<identifier and other information>
<sequence>
+<identifier and other information OR empty string>
<quality>
~~~~

In [None]:
with open("../Resources/ex_test.fastq") as f0:
    qseq = f0.read()
print(qseq[:1000])

In [None]:
id_line=re.compile(r'''@(?P<id>[A-Z0-9]+):(\d+):(\d+):(\d+):(\d+#\d/\d)''')
for seq in id_line.finditer(qseq):
    print(seq.group('id',5))
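
One caveat worth sketching: `@` is also a legal quality character, so a quality line can start with `@` and look like a header. Assuming every record is exactly four lines (as in the layout above), the header lines can be selected by position first and matched afterwards. The records below are made up for illustration.

```python
import re

# Two hypothetical 4-line records; the second quality line starts with "@"
fastq_text = "\n".join([
    "@SEQ01:6:73:941:1973#0/1",
    "GATTTGGG",
    "+",
    "!''*((((",
    "@SEQ02:6:73:942:1974#0/1",
    "ACGTACGT",
    "+",
    "@IIIIIII",
])

lines = fastq_text.splitlines()
headers = lines[0::4]          # every 4th line, starting with the first

# The ID is taken to be everything between "@" and the first ":" or whitespace
header_re = re.compile(r"@(?P<id>[^:\s]+)")
fastq_ids = [header_re.match(h).group("id") for h in headers]
print(fastq_ids)
```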
    

In [None]:
with open("../Resources/obits.txt") as f0:
    obits = f0.read()
print(obits[:1000])

* Write a regular expression to extract date of death from obits.txt

In [None]:
dod = re.compile(r"""(died (?P<month>[a-zA-Z.]+) (?P<day>\d{1,2}))""")
for r in dod.finditer(obits):
    print(r.group("month"),r.group("day"))

* Write a regular expression to extract place of residence from obits.txt

In [None]:
residence = re.compile(r"""(of (?P<residence>[a-zA-Z ]+))""")
for r in residence.finditer(obits):
    print(r.group("residence"))

* Write regular expressions to extract %stenosis from us.txt

In [None]:
with open("../Resources/us.txt") as f0:
    us = f0.read()
print(us[:2000])

In [None]:
# Extract %stenosis from us.txt

sten = re.compile(r"""(((?P<low>[0-9]+)-)?(?P<high>\d{1,3})%)|occluded""", re.I)
# (?P<name>...) immediately following an open paren defines a named group
# ? after a close paren makes that group optional
# \d matches any digit (for ASCII text, equivalent to the character class [0-9])
# re.I makes the match case-insensitive

for r in sten.finditer(us):
    print(r.group(0), r.group("low"), r.group("high"))
# r.group(0) is the whole matched string; "low" and "high" are None when they did not take part in the match
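
As a follow-up sketch, the named groups can be turned into numeric ranges. The sample sentence is made up, the `occluded` group name is hypothetical, and treating "occluded" as 100% is an assumption made only for this illustration.

```python
import re

# Same idea as sten, with "occluded" captured in its own named group
sten_ranges = re.compile(
    r"(?:(?P<low>\d{1,3})-)?(?P<high>\d{1,3})%|(?P<occluded>occluded)", re.I)

def stenosis_ranges(text):
    """Return (low, high) percent tuples for each stenosis mention."""
    out = []
    for m in sten_ranges.finditer(text):
        if m.group("occluded"):
            out.append((100, 100))   # assumption: occluded means 100%
        else:
            high = int(m.group("high"))
            low = int(m.group("low")) if m.group("low") else high
            out.append((low, high))
    return out

# Made-up report text
sample = "Right ICA: 50-69% stenosis. Left ICA: 40% stenosis. Left vertebral: Occluded."
ranges = stenosis_ranges(sample)
print(ranges)
```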


* Use regular expressions to modify reports queried from MIMIC-2. Find the de-identified name patterns (such as those shown below) and replace them with randomly selected first and last names:

~~~
DR. [**First Name4 (NamePattern1) **] 

[**Last Name (NamePattern1) **]

[**First Name8 (NamePattern2) **] 

[**First Name4 (NamePattern1) 6465**] 

[**Last Name (NamePattern1) **]

[**Last Name (NamePattern1) 2054**]
~~~

* Use regular expressions to modify the modified reports queried from MIMIC-2
    * Identify ages and replace them with `[**Age in XXs**]`
```
48-year-old
patient's father is 82 years old
mother is healthy at age 83
```

In [None]:
import pymysql
import pandas as pd
import getpass

conn = pymysql.connect(host="mysql",
                       port=3306,user="jovyan",
                       passwd=getpass.getpass("Enter MySQL passwd for jovyan"),db='mimic2')
cursor = conn.cursor()

reports = pd.read_sql("""SELECT text 
                         FROM noteevents 
                         WHERE category='DISCHARGE_SUMMARY' LIMIT 300""",
                      conn)
print(reports.shape)
reports.head(10)

### Use a `join` to put all the report texts into a single string

In [None]:
report_txt = " ".join(reports["text"])

In class we were able to come up with a single regular expression that seems to capture most of the age patterns that we saw (`age`). However, we did end up with the numeric age in three different groups, so it wasn't obvious how we would use this in an automated manner to get the age out.

In [None]:
age = re.compile(r"""([0-9]+(-|\s)year(s)?(-|\s)old)|([0-9]+ y\.o\.)|(\bage [0-9]+)""")
age.findall(report_txt)
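
One possible workaround for the "three different groups" problem, sketched on a made-up sentence: ignore the groups, take the whole match with `finditer`, and pull the digits out of it with a second, trivial search. This is not what we did in class, only an illustration.

```python
import re

age = re.compile(r"""([0-9]+(-|\s)year(s)?(-|\s)old)|([0-9]+ y\.o\.)|(\bage [0-9]+)""")

# Made-up text covering the three alternatives
sample = "48-year-old man; 73 y.o. woman; mother is healthy at age 83"

# findall returns one tuple per match, with "" for groups that did not match
print(age.findall(sample))

# m.group(0) is the full match regardless of which alternative fired
ages = [int(re.search(r"\d+", m.group(0)).group()) for m in age.finditer(sample)]
print(ages)
```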

We then decided to use two distinct regular expressions: one to match patterns like "48-year-old" and another to match patterns like "at age 83".

`age2` is our regex to match the first pattern.

**NOTE:** Compared to what we did in class, I had to add a `?` to correctly capture the "73 y.o." pattern.

```Python
age2 = re.compile(r"""(?P<age>[0-9]+)(-|\s)y(ear(s)?|\.)(-|\s)?o(ld|\.)""")
```

In [None]:
age2 = re.compile(r"""(?P<age>[0-9]+)(-|\s)y(ear(s)?|\.)(-|\s)?o(ld|\.)""")
for m in age2.finditer(report_txt):
    print(m.group(0, "age"))

In [None]:
age3 = re.compile(r"""\bage(d)? (?P<age>[0-9]+)""")
for m in age3.finditer(report_txt):
    print(m.group(0, "age"))

At this point we have two regular expressions to capture ages. Since we want to change the text, we wrote a function, `age_in_decades`, to take an age string and return a decades string. As we originally wrote it in class, `age_in_decades` took a string as an argument, used `age2` to find the numeric age, and returned a decades string. We then tried to use the `age2.sub()` method, passing the function in place of a string. This did not work.

However, after class I found that the `re` module has a function `sub` that can take a function instead of a string for the replacement; the function is called with the match object for each match. [This stackoverflow discussion](https://stackoverflow.com/questions/18737863/passing-a-function-to-re-sub-in-python) helped me figure this out. I needed to modify the `age_in_decades` function to take a RegEx match object as an argument:

In [None]:
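def age_in_decades(m):
    """Take a match object with an "age" group and return a decades string."""
    age = int(m.group("age"))
    # integer division rounds the age down to its decade, e.g. 74 -> 70
    return "[** Age in %ss**]" % (age // 10 * 10,)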

age_in_decades(next(age2.finditer("74-year-old")))

#### Call `re.sub` with `age3` and then pass the results of this to `re.sub` with `age2`

In [None]:
tmp = re.sub(age2, age_in_decades, re.sub(age3, age_in_decades, report_txt))
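
To see the two-step substitution in isolation, here is a small self-contained sketch on a made-up sentence. It uses the compiled patterns' own `sub` method, which also accepts a function as the replacement.

```python
import re

age2 = re.compile(r"""(?P<age>[0-9]+)(-|\s)y(ear(s)?|\.)(-|\s)?o(ld|\.)""")
age3 = re.compile(r"""\bage(d)? (?P<age>[0-9]+)""")

def age_in_decades(m):
    # m is the match object passed in by sub() for each match
    age = int(m.group("age"))
    return "[** Age in %ss**]" % (age // 10 * 10,)

# Made-up text, not from MIMIC-2
sample = "48-year-old man; mother is healthy at age 83; 73 y.o. woman"

# Apply age3 first, then age2, as in the cell above
masked = age2.sub(age_in_decades, age3.sub(age_in_decades, sample))
print(masked)
```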

In [None]:
"[** Age in 70s**]" in tmp

### Look at the lines that were changed

In [None]:
tmp_array = tmp.split("\n")
report_txt_array = report_txt.split("\n")
compare = zip(tmp_array, report_txt_array)

for l in compare:
    if l[0] != l[1]:
        print(l[0])
        print(l[1])
        print()
    

## De-identify de-identified names

We wrote a regular expression `name` to identify the de-identified last name pattern.

In [None]:

name = re.compile(r"""\[\*\*Last Name \(NamePattern\d*\) \d*\*\*\]""")

matched_names = [r.group(0) for r in name.finditer(report_txt)]
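
The exercise also lists first-name patterns. A sketch of how `name` might be generalized to cover both kinds, using the example patterns from the exercise statement (the pattern name `name_any` and the group name `kind` are hypothetical):

```python
import re

# "First Name4", "First Name8" and "Last Name" all fit "<kind> Name<optional digits>"
name_any = re.compile(
    r"\[\*\*(?P<kind>First|Last) Name\d* \(NamePattern\d*\) \d*\*\*\]")

sample = ("DR. [**First Name4 (NamePattern1) **] [**Last Name (NamePattern1) **] and "
          "[**First Name8 (NamePattern2) **] [**Last Name (NamePattern1) 2054**]")

# The "kind" group says which list (first names or surnames) to draw from
kinds = [m.group("kind") for m in name_any.finditer(sample)]
print(kinds)
```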

### Read in csv file containing 2010 census surnames

In [None]:
surnames = pd.read_csv("../Resources/surnames.csv")
surnames.head()

### Write a function to randomly select a name from the DataFrame

In [None]:
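import random
def get_lastname2(surnames, seed=None):
    """Return the name in the first row whose cumulative_probability is >= a random draw in [0, 1)."""
    # NOTE: passing a fixed seed makes every call return the same name
    random.seed(seed)
    v = random.random()
    return surnames[surnames["cumulative_probability"] >= v].iloc[0]["name"]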
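
The selection rule in `get_lastname2` (take the first row whose cumulative probability reaches the random draw) can be sketched without pandas using `bisect`. The two lists below stand in for the `name` and `cumulative_probability` columns; their values are made up, and `surname_for` is a hypothetical helper.

```python
import bisect
import random

# Toy stand-ins for the DataFrame columns (values are made up)
names = ["SMITH", "JOHNSON", "WILLIAMS"]
cumulative = [0.5, 0.8, 1.0]

def surname_for(v):
    """Return the first name whose cumulative probability is >= v."""
    i = bisect.bisect_left(cumulative, v)
    # guard against v exceeding the last cumulative value (e.g. rounding in the data)
    return names[min(i, len(names) - 1)]

rng = random.Random(0)   # a separate generator, seeded once rather than on every call
picks = [surname_for(rng.random()) for _ in range(5)]
print(picks)
```

Seeding a single generator once, instead of calling `random.seed` inside the function, keeps a run reproducible while still giving different names on successive calls.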


We used a `set` to get the unique name patterns, then a dictionary comprehension to create a mapping between each de-identified name pattern and a randomly selected last name.

In [None]:
name_mapping = {n:get_lastname2(surnames) for n in set(matched_names)}

In [None]:
name_mapping

In [None]:
tmp2 = report_txt[:]
for key, value in name_mapping.items():
    tmp2 = tmp2.replace(key,value)
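
An alternative to looping over `str.replace`, sketched here with a made-up mapping and sentence: since a replacement function receives the match object, a single `sub` call can look each matched pattern up in the dictionary.

```python
import re

name = re.compile(r"""\[\*\*Last Name \(NamePattern\d*\) \d*\*\*\]""")

# Hypothetical mapping; in the notebook this comes from get_lastname2
example_mapping = {
    "[**Last Name (NamePattern1) **]": "GARCIA",
    "[**Last Name (NamePattern1) 2054**]": "NGUYEN",
}

def replace_name(m):
    # Leave the text unchanged if a pattern is missing from the mapping
    return example_mapping.get(m.group(0), m.group(0))

sample = ("Seen by Dr. [**Last Name (NamePattern1) **] and "
          "Dr. [**Last Name (NamePattern1) 2054**].")
replaced = name.sub(replace_name, sample)
print(replaced)
```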

In [None]:
'[**Last Name (NamePattern1) **]' in tmp2

In [None]:
tmp2_array = tmp2.split("\n")
report_txt_array = report_txt.split("\n")
compare = zip(tmp2_array, report_txt_array)

for l in compare:
    if l[0] != l[1]:
        print(l[0])
        print("-"*42)
        print(l[1])
        print()