## Homework 8:

__Exercise 1.__ Go to PROSITE and find the Gamma-glutamyl phosphate reductase signature. Write a script to detect all yeast proteins that have this signature. Your script should output a dataframe with the following information about each protein: accession number, match to the motif, span of the motif, and the protein's description.

**NOTE:** You can copy the file with all the yeast proteins from last class into the current directory; you don't need to download it again.

```
[VA]-x(5)-A-[LIVAMTCK]-x-[HWFY]-[IM]-x(2)-[HYWNRFT]-[GSNT]-[STAG]-x(0,1)-H-[ST]-[DE]-x(1,2)-I
```
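The PROSITE notation above maps onto regular-expression syntax almost element by element: hyphens separate positions, `x` means any residue, `(n)` or `(n,m)` is a repeat count, and `{...}` lists forbidden residues. The helper below is a minimal sketch of that translation. The name `prosite_to_regex` is hypothetical, writing `x` as `[A-Z]` (so the stop symbol `*` is never matched) is an assumption, and the `<` / `>` terminus anchors are not handled.

```python
import re

def prosite_to_regex(prosite: str) -> str:
    """Translate a simple PROSITE pattern into a Python regex string (sketch)."""
    parts = []
    for element in prosite.strip().rstrip(".").split("-"):
        # Separate the residue part from an optional repeat such as (5) or (0,1)
        m = re.fullmatch(r"(?P<residue>[^()]+)(?:\((?P<rep>\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        residue = m.group("residue")
        if residue == "x":
            residue = "[A-Z]"                      # assumption: any residue, no '*'
        elif residue.startswith("{") and residue.endswith("}"):
            residue = "[^" + residue[1:-1] + "]"   # {AB} means anything except A or B
        if m.group("rep"):
            residue += "{" + m.group("rep") + "}"  # (0,1) becomes {0,1}
        parts.append(residue)
    return "".join(parts)

signature = ("[VA]-x(5)-A-[LIVAMTCK]-x-[HWFY]-[IM]-x(2)-[HYWNRFT]-"
             "[GSNT]-[STAG]-x(0,1)-H-[ST]-[DE]-x(1,2)-I")
print(prosite_to_regex(signature))
```

The resulting string can be passed to `re.search` in the same way as a hand-written pattern.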

In [3]:
import gzip
import shutil
import urllib.request as urreq

# Download the gzipped yeast protein (translated ORF) sequences
url = "https://downloads.yeastgenome.org/sequence/S288C_reference/orf_protein/orf_trans.fasta.gz"
urreq.urlretrieve(url, 'orf_trans.fasta.gz')
protFile = 'orf_trans.fasta'

# Download the gzipped yeast ORF coding (DNA) sequences
url = "https://downloads.yeastgenome.org/sequence/S288C_reference/orf_dna/orf_coding.fasta.gz"
urreq.urlretrieve(url, 'orf_coding.fasta.gz')
orfFile = 'orf_coding.fasta'

# Decompress each archive into a plain FASTA file in the current directory
for fasta in (protFile, orfFile):
    with gzip.open(fasta + '.gz', 'rb') as fin, open(fasta, 'wb') as fout:
        shutil.copyfileobj(fin, fout)

In [4]:
import re
import pandas as pd
from Bio import SeqIO

# Raw string so that \w reaches the regex engine unchanged; PROSITE 'x' is written as \w
pattern1 = r"[VA]\w{5}A[LIVAMTCK]\w[HWFY][IM]\w{2}[HYWNRFT][GNST][STAG]\w{0,1}H[ST][DE]\w{1,2}I"

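def findinseqfile(pattern, filein):
    """Return a dataframe with the first match of pattern in each FASTA record."""
    information = []
    for seq_record in SeqIO.parse(filein, "fasta"):
        # re.search reports only the first occurrence of the motif in the sequence
        result = re.search(pattern, str(seq_record.seq))
        if result:
            information.append([seq_record.name, result.group(), result.span(), str(seq_record.description)])

    # The 'seq' column holds the FASTA description line of the matching protein
    return pd.DataFrame(information, columns=['acc','match','start_end','seq'])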

findinseqfile(pattern1, protFile)

| | acc | match | start_end | seq |
|---|---|---|---|---|
| 0 | YOR323C | VTSTESAIQHINTHSSRHTDAI | (338, 360) | YOR323C PRO2 SGDID:S000005850, Chr XV from 922... |


__Exercise 2.__ Now do the same for the Hexapeptide-repeat containing-transferases signature.

```
[LIV]-[GAED]-x(2)-[STAV]-x-[LIV]-x(3)-[LIVAC]-x-[LIV]-[GAED]-x(2)-[STAVR]-x-[LIV]-[GAED]-x(2)-[STAV]-x-[LIV]-x(3)-[LIV]
```

In [30]:
pattern2 = r"[LIV][GAED]\w{2}[STAV]\w[LIV]\w{3}[LIVAC]\w[LIV][GAED]\w{2}[STAVR]\w[LIV][GAED]\w{2}[STAV]\w[LIV]\w{3}[LIV]"

findinseqfile(pattern2, protFile)

| | acc | match | start_end | seq |
|---|---|---|---|---|
| 0 | YDL055C | IDPTAKISSTAKIGPDVVIGPNVTIGDGV | (256, 285) | YDL055C PSA1 SGDID:S000002213, Chr IV from 356... |
| 1 | YJL218W | IGGGVSIIPGVNIGKNSVIAAGSVVIRDI | (138, 167) | YJL218W YJL218W SGDID:S000003754, Chr X from 2... |


__Exercise 3.__ Now find the 14-3-3 protein signatures. The 14-3-3 proteins seem to have multiple biological activities and play a key role in signal transduction pathways and the cell cycle. The PROSITE database uses two motifs to determine members of this family.

Write a script to search for proteins in yeast that have both domains in either order. You should find two proteins.

Your script should show a dataframe with the following for each protein: accession number, match to the first motif, span of the first motif, match to the second motif, span of the second motif, and the protein's description.

Your regex doesn't need to match the domains in reverse order to identify both yeast proteins, but for the purpose of this exercise, write a regex that would also identify such a case.

```
[RA]-N-L-[LIV]-S-[VG]-[GA]-Y-[KN]-N-[IVA]
```

and

```
Y-K-[DE]-[SG]-T-L-I-[IML]-Q-L-[LF]-[RHC]-D-N-[LF]-T-[LS]-W-[TANS]-[SAD]
```

In [19]:
pattern3a = "[RA]NL[LIV]S[VG][GA]Y[KN]N[IVA]"
pattern3b = "YK[DE][SG]TLI[IML]QL[LF][RHC]DN[LF]T[LS]W[TANS][SAD]"
# pattern3 = f"({pattern3a}|{pattern3b})"

df1 = findinseqfile(pattern3a, protFile)
df2 = findinseqfile(pattern3b, protFile)
result = pd.concat([df1, df2])
result

| | acc | match | start_end | seq |
|---|---|---|---|---|
| 0 | YDR099W | RNLLSVAYKNV | (42, 53) | YDR099W BMH2 SGDID:S000002506, Chr IV from 653... |
| 1 | YER177W | RNLLSVAYKNV | (42, 53) | YER177W BMH1 SGDID:S000000979, Chr V from 5456... |
| 0 | YDR099W | YKDSTLIMQLLRDNLTLWTS | (215, 235) | YDR099W BMH2 SGDID:S000002506, Chr IV from 653... |
| 1 | YER177W | YKDSTLIMQLLRDNLTLWTS | (215, 235) | YER177W BMH1 SGDID:S000000979, Chr V from 5456... |
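The exercise also asks for a single regex that finds both motifs in either order, which the two separate searches above do not show. One possible approach, sketched below, is an alternation of "first motif, gap, second motif" and "second motif, gap, first motif". Python requires unique group names, so each motif gets two names (`a1`/`a2`, `b1`/`b2`). The helper name `find_both` and the toy sequences are hypothetical; the toy sequences are built from the matched strings in the output above, not taken from the yeast file.

```python
import re

motif_a = "[RA]NL[LIV]S[VG][GA]Y[KN]N[IVA]"
motif_b = "YK[DE][SG]TLI[IML]QL[LF][RHC]DN[LF]T[LS]W[TANS][SAD]"

# Either order: motif A ... motif B, or motif B ... motif A
either_order = re.compile(
    f"(?P<a1>{motif_a}).*?(?P<b1>{motif_b})"
    f"|(?P<b2>{motif_b}).*?(?P<a2>{motif_a})"
)

def find_both(seq):
    """Return match and span of both motifs, or None if either is missing."""
    m = either_order.search(seq)
    if m is None:
        return None
    # Only one branch of the alternation matched; pick its group names
    a = "a1" if m.group("a1") is not None else "a2"
    b = "b1" if m.group("b1") is not None else "b2"
    return {"match_a": m.group(a), "span_a": m.span(a),
            "match_b": m.group(b), "span_b": m.span(b)}

# Toy sequences (hypothetical) with the motifs in forward and reverse order
forward = "MM" + "RNLLSVAYKNV" + "GGG" + "YKDSTLIMQLLRDNLTLWTS"
reverse = "YKDSTLIMQLLRDNLTLWTS" + "GG" + "RNLLSVAYKNV"
print(find_both(forward))
print(find_both(reverse))
```

Calling `find_both(str(seq_record.seq))` inside the `SeqIO.parse` loop and appending the accession and description would give one row per protein with both matches and both spans.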


__Exercise 4.__ Parsing and extracting data from a URL:

This is from the tutorial that you should have completed.

When working with files and resources over a network, you will often come across URIs and URLs which can be parsed and worked with directly. Most standard libraries will have classes to parse and construct these kinds of identifiers, but if you need to match them in logs or a larger corpus of text, you can use regular expressions to pull out information from their structured format quite easily.

URIs, or Uniform Resource Identifiers, are a representation of a resource that is generally composed of a scheme, host, port (optional), and resource path. In the example below, the scheme is `http`, the host is `regexone.com`, the port is `80`, and the path is `/page`.

http://regexone.com:80/page

The scheme describes the protocol to communicate with, the host and port describe the source of the resource, and the full path describes the location at the source for the resource.
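For comparison with the regex approach, Python's standard library already provides a parser for these identifiers in `urllib.parse`. The short sketch below applies `urlsplit` to the example URL to show the same four components.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://regexone.com:80/page")

print(parts.scheme)    # http
print(parts.hostname)  # regexone.com
print(parts.port)      # 80 (an int, or None when the URL has no port)
print(parts.path)      # /page
```

This works when each URL is already isolated; a regular expression is still the practical choice for pulling URLs out of a larger body of text.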

In the exercise below, try to extract the protocol, host and port of all the resources listed in this string.

```
ftp://file_server.com:21/top_secret/life_changing_plans.pdf
https://regexone.com/lesson/introduction#section
file://localhost:4040/zip_file
https://s3cur3-server.com:9999/
market://search/angry%20birds
```

You can work interactively here: https://regexone.com/problem/extracting_url_data to find the right regular expression, then use re.finditer to create a dataframe with columns protocol, host and port for each of the matches in the string.

In [32]:
string = ["ftp://file_server.com:21/top_secret/life_changing_plans.pdf",
          "https://regexone.com/lesson/introduction#section",
          "file://localhost:4040/zip_file",
          "https://s3cur3-server.com:9999/",
          "market://search/angry%20birds"]
pattern = r"(\w+)://([\w\-\.]+):?(\d+)?"


information = []
for link in string:
    for match in re.finditer(pattern, link):
        information.append([link, match.group(1), match.group(2), match.group(3) if match.group(3) else ''])

pd.DataFrame(information, columns=['resource', 'protocol', 'host', 'port'])

| | resource | protocol | host | port |
|---|---|---|---|---|
| 0 | ftp://file_server.com:21/top_secret/life_chang... | ftp | file_server.com | 21 |
| 1 | https://regexone.com/lesson/introduction#section | https | regexone.com | |
| 2 | file://localhost:4040/zip_file | file | localhost | 4040 |
| 3 | https://s3cur3-server.com:9999/ | https | s3cur3-server.com | 9999 |
| 4 | market://search/angry%20birds | market | search | |
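The exercise describes the resources as one string, so a variant that runs `re.finditer` once over the whole multi-line text may also be useful. The sketch below uses named groups and `Match.groupdict(default="")` so that a missing port becomes an empty string. The names `text`, `url_re` and `rows` are illustrative.

```python
import re

text = """ftp://file_server.com:21/top_secret/life_changing_plans.pdf
https://regexone.com/lesson/introduction#section
file://localhost:4040/zip_file
https://s3cur3-server.com:9999/
market://search/angry%20birds"""

# The port group is optional, and the colon is only consumed together with it
url_re = re.compile(r"(?P<protocol>\w+)://(?P<host>[\w.-]+)(?::(?P<port>\d+))?")

# One dict per match; groups that did not participate are filled with ""
rows = [m.groupdict(default="") for m in url_re.finditer(text)]
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` should give the protocol, host and port columns directly, since each dict key becomes a column name.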
