Dan Shea  
2021-06-03  

#### Problem
To allow for the presence of its varying forms, a protein motif is represented by a shorthand as follows: $[XY]$ means "either X or Y" and $\{X\}$ means "any amino acid except X." For example, the N-glycosylation motif is written as $N\{P\}[ST]\{P\}$.
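The shorthand maps almost directly onto regular-expression syntax, which is what makes a regex search a natural fit here. A minimal sketch, assuming motifs use only uppercase residue letters; the helper name `shorthand_to_regex` is hypothetical and not part of the problem statement:

```python
import re

def shorthand_to_regex(motif: str) -> str:
    """Translate the [XY] / {X} shorthand into Python regex syntax."""
    # {X} ("any amino acid except X") becomes the negated class [^X];
    # [XY] and plain letters already mean the same thing in a regex.
    return re.sub(r'\{([A-Z]+)\}', r'[^\1]', motif)

pattern = shorthand_to_regex('N{P}[ST]{P}')
print(pattern)  # N[^P][ST][^P]
```

The resulting pattern is the same one used as the default `motif` argument in the solution code.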

You can see the complete description and features of a particular protein by its access ID "uniprot_id" in the UniProt database, by inserting the ID number into:

http://www.uniprot.org/uniprot/uniprot_id

Alternatively, you can obtain a protein sequence in FASTA format by following:

http://www.uniprot.org/uniprot/uniprot_id.fasta

For example, the data for protein B5ZC00 can be found at http://www.uniprot.org/uniprot/B5ZC00.fasta
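As a rough illustration of the two steps involved (building the URL, then reducing a FASTA record to a bare sequence), here is a sketch that runs without network access. The function names and the FASTA record are made up for the example; only the URL pattern comes from the text above.

```python
def build_fasta_url(uniprot_id: str) -> str:
    """Insert an access ID into the FASTA URL pattern given above."""
    return f'http://www.uniprot.org/uniprot/{uniprot_id}.fasta'

def fasta_to_sequence(fasta_text: str) -> str:
    """Join the sequence lines of a single FASTA record, dropping the '>' header."""
    lines = fasta_text.strip().splitlines()
    return ''.join(line.strip() for line in lines if not line.startswith('>'))

# Hypothetical record, not real UniProt data.
example = ">sp|X00000|EXAMPLE made-up record\nMKNASTP\nQNGTA\n"

print(build_fasta_url('B5ZC00'))   # http://www.uniprot.org/uniprot/B5ZC00.fasta
print(fasta_to_sequence(example))  # MKNASTPQNGTA
```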

__Given:__ At most 15 UniProt Protein Database access IDs.

__Return:__ For each protein possessing the N-glycosylation motif, output its given access ID followed by a list of locations in the protein string where the motif can be found.

##### Sample Dataset
```
A2Z669
B5ZC00
P07204_TRBM_HUMAN
P20840_SAG1_YEAST
```
##### Sample Output
```
B5ZC00
85 118 142 306 395
P07204_TRBM_HUMAN
47 115 116 382 409
P20840_SAG1_YEAST
79 109 135 248 306 348 364 402 485 501 614
```
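Note that the sample output for P07204_TRBM_HUMAN lists positions 115 and 116: the motif is four residues long, so these two occurrences overlap, and a plain left-to-right regex scan would report only the first. The sketch below shows one way to catch overlaps with the standard-library `re` module by wrapping the motif in a lookahead; this is an alternative to the `overlapped=True` option of the third-party `regex` package, and the function name and toy sequence are illustrative only.

```python
import re

def find_motif_positions(seq, motif=r'N[^P][ST][^P]'):
    """Return 1-based start positions of all motif occurrences, overlaps included."""
    # A zero-width lookahead consumes no characters, so the scanner tries
    # every position in turn and overlapping occurrences are all reported.
    lookahead = re.compile(f'(?={motif})')
    return [m.start() + 1 for m in lookahead.finditer(seq)]

toy = 'NNSSA'  # made-up sequence with two overlapping occurrences

print(find_motif_positions(toy))                                     # [1, 2]
print([m.start() + 1 for m in re.finditer(r'N[^P][ST][^P]', toy)])   # [1]
```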

In [14]:
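# Third-party `regex` package (pip install regex); unlike the standard-library
# `re`, its finditer accepts overlapped=True.
import regex as re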
import requests
from time import sleep

In [15]:
uri = 'http://www.uniprot.org/uniprot/'
ext = '.fasta'

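def get_protein(prot_id):
    """Fetch the FASTA record for a UniProt access ID."""
    url = f'{uri}{prot_id}{ext}'
    # A timeout keeps a stalled connection from hanging the whole run.
    r = requests.get(url, timeout=30)
    return r

def parse_request(req):
    """Return the protein sequence from a FASTA response, dropping the header line."""
    lines = req.text.splitlines()
    # lines[0] is the '>' header; the remaining lines hold the sequence.
    sequence = ''.join(lines[1:])
    return sequence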

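def find_motif(seq, motif=r'N[^P][ST][^P]'):
    """Return the 1-based start positions of every motif occurrence in seq."""
    # overlapped=True (regex package) also reports occurrences that share residues.
    return [m.start() + 1 for m in re.finditer(motif, seq, overlapped=True)]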


In [16]:
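def parse_input_print_ans(filename):
    """Print each access ID that has the motif, followed by its match positions."""
    with open(filename, 'r') as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines instead of requesting an empty ID
            result = find_motif(parse_request(get_protein(line)))
            if result:
                print(line)
                print(' '.join(map(str, result)))
            sleep(1)  # pause between requests to avoid hammering UniProt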

In [17]:
parse_input_print_ans('sample.txt')

B5ZC00
85 118 142 306 395
P07204_TRBM_HUMAN
47 115 116 382 409
P20840_SAG1_YEAST
79 109 135 248 306 348 364 402 485 501 614


In [19]:
parse_input_print_ans('rosalind_mprt.txt')

A2A2Y4
90 359 407
Q9QSP4
196 250 326 443
P19827_ITH1_HUMAN
285 588 750
P01042_KNH_HUMAN
48 169 205 294
P81824_PABJ_BOTJA
25
P02974_FMM1_NEIGO
67 68 121
P01045_KNH2_BOVIN
47 87 168 169 197 204 280
P04233_HG2A_HUMAN
130 136 256 270
P47002
35 552 608
Q50228
55 228
