# Problem

To allow for the presence of its varying forms, a protein motif is represented by a shorthand as follows: [XY] means "either X or Y" and {X} means "any amino acid except X." For example, the N-glycosylation motif is written as N{P}[ST]{P}.
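The shorthand maps almost directly onto a regular expression, which may make it easier to read: `[ST]` stays a character class and `{P}` becomes the negated class `[^P]`. The sketch below is one possible translation, not the approach used in the cells further down. The names `MOTIF_RE` and `motif_positions` are illustrative, and the input string is a made-up toy sequence. The pattern is wrapped in a lookahead so that overlapping occurrences are all reported.

```python
import re

# N{P}[ST]{P} as a regex: {P} -> [^P], [ST] stays as is.
# The zero-width lookahead (?=...) lets overlapping matches be found.
MOTIF_RE = re.compile(r"(?=N[^P][ST][^P])")


def motif_positions(protein):
    """Return the 1-based start positions of the motif in `protein`."""
    return [m.start() + 1 for m in MOTIF_RE.finditer(protein)]


# Toy sequence (not a real protein): the motif starts at positions 2 and 3,
# and the two occurrences overlap.
print(motif_positions("ANNTSA"))  # [2, 3]
```

Without the lookahead, `finditer` would consume the first match and skip any occurrence that starts inside it.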

You can see the complete description and features of a particular protein by its access ID "uniprot_id" in the UniProt database, by inserting the ID number into

http://www.uniprot.org/uniprot/uniprot_id

Alternatively, you can obtain a protein sequence in FASTA format by following

http://www.uniprot.org/uniprot/uniprot_id.fasta

For example, the data for protein B5ZC00 can be found at http://www.uniprot.org/uniprot/B5ZC00.
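As a minimal sketch of the two steps described above, the helpers below build the FASTA URL for an access ID and reduce FASTA text to a bare sequence, without touching the network. The function names `fasta_url` and `parse_fasta` are hypothetical, the FASTA record is a made-up toy example, and the code assumes that for IDs such as `P07204_TRBM_HUMAN` only the part before the first underscore goes into the URL.

```python
def fasta_url(access_id):
    """Build the FASTA URL for an access ID (hypothetical helper)."""
    # Assumption: only the part before the first underscore is the accession.
    accession = access_id.split("_")[0]
    return f"http://www.uniprot.org/uniprot/{accession}.fasta"


def parse_fasta(text):
    """Join the sequence lines of a single-record FASTA string."""
    lines = text.strip().splitlines()
    # Header lines start with '>'; everything else is sequence.
    return "".join(line.strip() for line in lines if not line.startswith(">"))


# Made-up record, used only to show the shape of the input.
toy_record = ">sp|TOY|made-up header\nMKN\nKFK\n"
print(fasta_url("P07204_TRBM_HUMAN"))  # http://www.uniprot.org/uniprot/P07204.fasta
print(parse_fasta(toy_record))         # MKNKFK
```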

**Given:** At most 15 UniProt Protein Database access IDs.

**Return:** For each protein possessing the N-glycosylation motif, output its given access ID followed by a list of locations in the protein string where the motif can be found.

#### Sample Dataset
```
A2Z669
B5ZC00
P07204_TRBM_HUMAN
P20840_SAG1_YEAST
```

#### Sample Output

```
B5ZC00
85 118 142 306 395
P07204_TRBM_HUMAN
47 115 116 382 409
P20840_SAG1_YEAST
79 109 135 248 306 348 364 402 485 501 614
```

In [1]:
import urllib3  # the top-level urllib3.request() helper needs urllib3 >= 2.0

In [2]:
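def uniprot_seq(uniprot_id):
    """Download the FASTA record for `uniprot_id` and return the bare sequence."""
    url = f"http://www.uniprot.org/uniprot/{uniprot_id}.fasta"
    resp = urllib3.request("GET", url)
    text = resp.data.decode("utf-8")
    # The first line is the FASTA header; the remaining lines hold the sequence
    seq = text.splitlines()[1:]
    return "".join(seq)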

In [3]:
uniprot_list = [
    "A2Z669",
    "B5ZC00",
    "P07204_TRBM_HUMAN",
    "P20840_SAG1_YEAST",
]

valid_uniprot_id = [s.split("_")[0] for s in uniprot_list]

sequences = {}
for seqid, valid in zip(uniprot_list, valid_uniprot_id):
    sequences[seqid] = uniprot_seq(valid)

In [4]:
sequences

{'A2Z669': 'MRASRPVVHPVEAPPPAALAVAAAAVAVEAGVGAGGGAAAHGGENAQPRGVRMKDPPGAPGTPGGLGLRLVQAFFAAAALAVMASTDDFPSVSAFCYLVAAAILQCLWSLSLAVVDIYALLVKRSLRNPQAVCIFTIGDGITGTLTLGAACASAGITVLIGNDLNICANNHCASFETATAMAFISWFALAPSCVLNFWSMASR',
 'B5ZC00': 'MKNKFKTQEELVNHLKTVGFVFANSEIYNGLANAWDYGPLGVLLKNNLKNLWWKEFVTKQKDVVGLDSAIILNPLVWKASGHLDNFSDPLIDCKNCKARYRADKLIESFDENIHIAENSSNEEFAKVLNDYEISCPTCKQFNWTEIRHFNLMFKTYQGVIEDAKNVVYLRPETAQGIFVNFKNVQRSMRLHLPFGIAQIGKSFRNEITPGNFIFRTREFEQMEIEFFLKEESAYDIFDKYLNQIENWLVSACGLSLNNLRKHEHPKEELSHYSKKTIDFEYNFLHGFSELYGIAYRTNYDLSVHMNLSKKDLTYFDEQTKEKYVPHVIEPSVGVERLLYAILTEATFIEKLENDDERILMDLKYDLAPYKIAVMPLVNKLKDKAEEIYGKILDLNISATFDNSGSIGKRYRRQDAIGTIYCLTIDFDSLDDQQDPSFTIRERNSMAQKRIKLSELPLYLNQKAHEDFQRQCQK',
 'P07204_TRBM_HUMAN': 'MLGVLVLGALALAGLGFPAPAEPQPGGSQCVEHDCFALYPGPATFLNASQICDGLRGHLMTVRSSVAADVISLLLNGDGGVGRRRLWIGLQLPPGCGDPKRLGPLRGFQWVTGDNNTSYSRWARLDLNGAPLCGPLCVAVSAAEATVPSEPIWEEQQCEVKADGFLCEFHFPATCRPLAVEPGAAAAAVSITYGTPFAARGADFQALPVGSSAAVAPLGLQLMCTAPPGAVQGHWAREAPGAWDCSVENGGCEHACNAIPGAPRCQCPAGA

In [5]:
def glycosylation_motif(seq):
    """Return True if the 4-residue window `seq` matches N{P}[ST]{P}."""
    # N, then anything but P, then S or T, then anything but P
    return (
        len(seq) == 4
        and seq[0] == "N"
        and seq[1] != "P"
        and seq[2] in ("S", "T")
        and seq[3] != "P"
    )

In [6]:
def check_glycosylation(sequences):
    for prot_id, seq in sequences.items():
        motifs = []
        # Slide a 4-residue window over every start position, including the last one
        for i in range(len(seq) - 3):
            if glycosylation_motif(seq[i:i+4]):
                motifs.append(str(i+1))
        if len(motifs) > 0:
            print(prot_id)
            print(" ".join(motifs))

In [7]:
check_glycosylation(sequences)

B5ZC00
85 118 142 306 395
P07204_TRBM_HUMAN
47 115 116 382 409
P20840_SAG1_YEAST
79 109 135 248 306 348 364 402 485 501 614


In [8]:
with open("/Users/npapadop/Downloads/rosalind_mprt.txt", "r") as rosalind:
    # One access ID per line; skip blank lines such as a trailing newline
    uniprot_list = [line.strip() for line in rosalind if line.strip()]

In [9]:
valid_uniprot_id = [s.split("_")[0] for s in uniprot_list]

sequences = {}
for seqid, valid in zip(uniprot_list, valid_uniprot_id):
    sequences[seqid] = uniprot_seq(valid)

check_glycosylation(sequences)

P04921_GLPC_HUMAN
8
Q4FZD7
528
P24592_IBP6_HUMAN
229
P01008_ANT3_HUMAN
128 167 187 224
P07204_TRBM_HUMAN
47 115 116 382 409
A0QQ98
44
P98119_URT1_DESRO
153 398
Q47A87
310 616
P01589_IL2A_HUMAN
70 89
