# Biopython - Exercise 2

Take as input an entry in `embl` format for an mRNA nucleotide sequence and, without knowing the protein actually annotated in the file but using only the nucleotide sequence of the transcript, find all proteins longer than 1000 amino acids that the transcript could express.

In this exercise, a protein potentially expressible by the transcript is defined as the protein obtained by translating any substring that starts with the start codon `atg` and ends with one of the stop codons, with no stop codon in between (codons are read in frame from the `atg`).

Install Biopython (for example with `pip install biopython`), which provides the `Bio` package.

Import the `Bio` package.

In [1]:
import Bio

Import the `SeqIO` module.

In [2]:
from Bio import SeqIO

---

### Read the record from the input file

In [3]:
embl_record = SeqIO.read('./M10051.txt', 'embl')

In [4]:
embl_record

SeqRecord(seq=Seq('GGGGGGCTGCGCGGCCGGGTCGGTGCGCACACGAGAAGGACGCGCGGCCCCCAG...AAA', IUPACAmbiguousDNA()), id='M10051.1', name='M10051', description='Human insulin receptor mRNA, complete cds.', dbxrefs=['MD5:e4e6ebf2e723a500c1dd62385c279351', 'Ensembl-Gn:ENSG00000171105', 'Ensembl-Tr:ENST00000302850', 'Ensembl-Tr:ENST00000341500', 'EuropePMC:PMC2739203', 'EuropePMC:PMC3164640', 'EuropePMC:PMC452597'])

Build the list of the three reading frames of the sequence (as `SeqRecord` objects).

**First reading frame**: the longest prefix whose length is a multiple of 3.

**Second reading frame**: the longest substring starting at position 1 (0-based) whose length is a multiple of 3.

**Third reading frame**: the longest substring starting at position 2 (0-based) whose length is a multiple of 3.

In [5]:
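# Shift the start by 0, 1 and 2 nucleotides to get the three forward frames
readings = [embl_record[f:] for f in [0,1,2]]
# Drop the trailing 1 or 2 nucleotides so that each frame is a whole number of codons
readings = [r[:len(r)-len(r)%3] for r in readings]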

In [6]:
readings

[SeqRecord(seq=Seq('GGGGGGCTGCGCGGCCGGGTCGGTGCGCACACGAGAAGGACGCGCGGCCCCCAG...CAA', IUPACAmbiguousDNA()), id='M10051.1', name='M10051', description='Human insulin receptor mRNA, complete cds.', dbxrefs=[]),
 SeqRecord(seq=Seq('GGGGGCTGCGCGGCCGGGTCGGTGCGCACACGAGAAGGACGCGCGGCCCCCAGC...AAA', IUPACAmbiguousDNA()), id='M10051.1', name='M10051', description='Human insulin receptor mRNA, complete cds.', dbxrefs=[]),
 SeqRecord(seq=Seq('GGGGCTGCGCGGCCGGGTCGGTGCGCACACGAGAAGGACGCGCGGCCCCCAGCG...CCA', IUPACAmbiguousDNA()), id='M10051.1', name='M10051', description='Human insulin receptor mRNA, complete cds.', dbxrefs=[])]
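The frame construction above can be tried on a plain string, since slicing works the same way on `str` and on `SeqRecord`. The following is a minimal sketch; the helper name `reading_frames` and the toy sequence are made up for illustration and are not part of the exercise.

```python
def reading_frames(seq):
    """Return the three forward reading frames of seq, each trimmed to a multiple of 3."""
    frames = []
    for offset in (0, 1, 2):
        frame = seq[offset:]                      # shift the start by 0, 1 or 2 nucleotides
        frames.append(frame[:len(frame) - len(frame) % 3])  # drop the incomplete last codon
    return frames


toy = "ATGGCCTAAGG"  # 11 nucleotides, invented for this example
for frame in reading_frames(toy):
    print(frame, len(frame))
```

Each printed frame has a length divisible by 3, so it can be split into whole codons.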

Build the list of the three translations of the reading frames (as `str` objects).

In [7]:
reading_translations = [str(r.translate().seq) for r in readings]

In [8]:
reading_translations

['GGLRGRVGAHTRRTRGPQRSWGPPRSMTPAGQRRAPDPRRPRAPAAMGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLTRLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFPNLTVIRGSRLFFNYALVIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDWSRILDSVEDNHIVLNKDDNEECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTICKSHGCTAEGLCCHSECLGNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCVNFSFCQDLHHKCKNSRRQGCHQYVIHNNKCIPECPSGYTMNSSNLLCTPCLGPCPKVCHLLEGEKTIDSVTSAQELRGCTVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRRSYALVSLSFFRKLRLIRGETLEIGNYSFYALDNQNLRQLWDWSKHNLTTTQGKLFFHYNPKLCLSEIHKMEEVSGTKGRQERNDIALKTNGDKASCENELLKFSYIRTSFDKILLRWEPYWPPDFRDLLGFMLFYKEAPYQNVTEFDGQDACGSNSWTVVDIDPPLRSNDPKSQNHPGWLMRGLKPWTQYAIFVKTLVTFSDERRTYGAKSDIIYVQTDATNPSVPLDPISVSNSSSQIILKWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRTWSPPFESEDSQKHNQSEYEDSAGECCSCPKTDSQILKELEESSFRKTFEDYLHNVVFVPRKTSSGTGAEDPRPSRKRRSLGDVGNVTVAVPTVAAFPNTSSTSVPTSPEEHRPFEKVVNKESLVISGLRHFTGYRIELQACNQDTPEERCSVAAYVSARTMPEAKADDIVGPVTHEIFENNVVHLMWQEPKEPNGLIVLYEVSYRRYGDEELHLCVSRKHFALERGCRLRGLSPGNYSVRIRATSLAGNGSWTEPTYFYVTDYLDVPS
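To make explicit what the translation step does, here is a minimal pure-Python sketch of codon-by-codon translation with the standard genetic code. It is only an illustration, not Biopython's implementation: the names `CODON_TABLE` and `translate` are hypothetical, and the sketch assumes the sequence contains only `A`, `C`, `G` and `T` (no ambiguity codes).

```python
from itertools import product

# Standard genetic code in the compact form: codons enumerated with the
# bases in the order T, C, A, G; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(frame):
    """Translate a DNA string codon by codon, ignoring an incomplete last codon."""
    frame = frame.upper()
    usable = len(frame) - len(frame) % 3
    return "".join(CODON_TABLE[frame[i:i + 3]] for i in range(0, usable, 3))


print(translate("ATGGCCTAAGGG"))  # ATG GCC TAA GGG
```

The stop codon shows up as `*` in the translated string, which is what the next step relies on.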

Split each of the three translations on the `*` symbol, which represents a stop codon, and produce the list of lists of the resulting pieces, called **stop chunks** from here on.

In [9]:
stop_chunk_list = [t.split('*') for t in reading_translations]

In [10]:
stop_chunk_list

[['GGLRGRVGAHTRRTRGPQRSWGPPRSMTPAGQRRAPDPRRPRAPAAMGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLTRLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFPNLTVIRGSRLFFNYALVIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDWSRILDSVEDNHIVLNKDDNEECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTICKSHGCTAEGLCCHSECLGNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCVNFSFCQDLHHKCKNSRRQGCHQYVIHNNKCIPECPSGYTMNSSNLLCTPCLGPCPKVCHLLEGEKTIDSVTSAQELRGCTVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRRSYALVSLSFFRKLRLIRGETLEIGNYSFYALDNQNLRQLWDWSKHNLTTTQGKLFFHYNPKLCLSEIHKMEEVSGTKGRQERNDIALKTNGDKASCENELLKFSYIRTSFDKILLRWEPYWPPDFRDLLGFMLFYKEAPYQNVTEFDGQDACGSNSWTVVDIDPPLRSNDPKSQNHPGWLMRGLKPWTQYAIFVKTLVTFSDERRTYGAKSDIIYVQTDATNPSVPLDPISVSNSSSQIILKWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRTWSPPFESEDSQKHNQSEYEDSAGECCSCPKTDSQILKELEESSFRKTFEDYLHNVVFVPRKTSSGTGAEDPRPSRKRRSLGDVGNVTVAVPTVAAFPNTSSTSVPTSPEEHRPFEKVVNKESLVISGLRHFTGYRIELQACNQDTPEERCSVAAYVSARTMPEAKADDIVGPVTHEIFENNVVHLMWQEPKEPNGLIVLYEVSYRRYGDEELHLCVSRKHFALERGCRLRGLSPGNYSVRIRATSLAGNGSWTEPTYFYVTDYLDVP

Flatten the previous list from a list of lists of strings into a list of strings (a list of stop chunks).

In [11]:
stop_chunk_list = [chunk for list_of_list in stop_chunk_list for chunk in list_of_list]

In [12]:
stop_chunk_list

['GGLRGRVGAHTRRTRGPQRSWGPPRSMTPAGQRRAPDPRRPRAPAAMGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLTRLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFPNLTVIRGSRLFFNYALVIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDWSRILDSVEDNHIVLNKDDNEECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTICKSHGCTAEGLCCHSECLGNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCVNFSFCQDLHHKCKNSRRQGCHQYVIHNNKCIPECPSGYTMNSSNLLCTPCLGPCPKVCHLLEGEKTIDSVTSAQELRGCTVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRRSYALVSLSFFRKLRLIRGETLEIGNYSFYALDNQNLRQLWDWSKHNLTTTQGKLFFHYNPKLCLSEIHKMEEVSGTKGRQERNDIALKTNGDKASCENELLKFSYIRTSFDKILLRWEPYWPPDFRDLLGFMLFYKEAPYQNVTEFDGQDACGSNSWTVVDIDPPLRSNDPKSQNHPGWLMRGLKPWTQYAIFVKTLVTFSDERRTYGAKSDIIYVQTDATNPSVPLDPISVSNSSSQIILKWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRTWSPPFESEDSQKHNQSEYEDSAGECCSCPKTDSQILKELEESSFRKTFEDYLHNVVFVPRKTSSGTGAEDPRPSRKRRSLGDVGNVTVAVPTVAAFPNTSSTSVPTSPEEHRPFEKVVNKESLVISGLRHFTGYRIELQACNQDTPEERCSVAAYVSARTMPEAKADDIVGPVTHEIFENNVVHLMWQEPKEPNGLIVLYEVSYRRYGDEELHLCVSRKHFALERGCRLRGLSPGNYSVRIRATSLAGNGSWTEPTYFYVTDYLDVPS
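The split and flatten steps can be checked on a few invented translations. One detail worth noticing in this sketch: the last piece returned by `split('*')` is never followed by a stop codon, and it is empty when the translation ends with `*`. The variant that drops it (`terminated` below) is an assumption about how one might follow the definition strictly, not something the exercise does.

```python
# Toy translations, invented for illustration ('*' marks a stop codon)
translations = ["MA*GMK*", "QM*MMR", "AAA"]

# One list of stop chunks per translation
nested = [t.split("*") for t in translations]

# Flatten the list of lists into a single list of stop chunks
flat = [chunk for chunks in nested for chunk in chunks]
print(flat)

# Optional variant: keep only the chunks that are actually followed by a stop codon
terminated = [chunk for t in translations for chunk in t.split("*")[:-1]]
print(terminated)
```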

From each stop chunk in the previous list, extract the consecutive substrings that start with `M` (and end one position before the next `M`), which we will call **M chunks**.

Produce a list of lists, each containing the **M chunks** of one stop chunk.

Example of a stop chunk with a space inserted before each `M`:

    QCLPWRGRAGVPI MAFLWFESLWK MQDSHDST MSSGVQRSFLY MSVHLKVDSFGYQFN
    
**M chunks** of the example stop chunk:

    MAFLWFESLWK MQDSHDST MSSGVQRSFLY MSVHLKVDSFGYQFN

In [13]:
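import re

# M[^M]* rather than M[^M]+: an M immediately followed by another M (or a final M) must be kept as its own chunk
m_chunk_list = [re.findall(r'M[^M]*', stop_chunk) for stop_chunk in stop_chunk_list]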

In [14]:
m_chunk_list

[['MTPAGQRRAPDPRRPRAPAA',
  'MGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPG',
  'MDIRNNLTRLHELENCSVIEGHLQILL',
  'MFKTRPEDFRDLSFPKLI',
  'MITDYLLLFRVYGLESLKDLFPNLTVIRGSRLFFNYALVIFE',
  'MVHLKELGLYNL',
  'MNITRGSVRIEKNNELCYLATIDWSRILDSVEDNHIVLNKDDNEECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTICKSHGCTAEGLCCHSECLGNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCVNFSFCQDLHHKCKNSRRQGCHQYVIHNNKCIPECPSGYT',
  'MNSSNLLCTPCLGPCPKVCHLLEGEKTIDSVTSAQELRGCTVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRRSYALVSLSFFRKLRLIRGETLEIGNYSFYALDNQNLRQLWDWSKHNLTTTQGKLFFHYNPKLCLSEIHK',
  'MEEVSGTKGRQERNDIALKTNGDKASCENELLKFSYIRTSFDKILLRWEPYWPPDFRDLLGF',
  'MLFYKEAPYQNVTEFDGQDACGSNSWTVVDIDPPLRSNDPKSQNHPGWL',
  'MRGLKPWTQYAIFVKTLVTFSDERRTYGAKSDIIYVQTDATNPSVPLDPISVSNSSSQIILKWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRTWSPPFESEDSQKHNQSEYEDSAGECCSCPKTDSQILKELEESSFRKTFEDYLHNVVFVPRKTSSGTGAEDPRPSRKRRSLGDVGNVTVAVPTVAAFPNTSSTSVPTSPEEHRPFEKVVNKESLVISGLRHFTGYRIELQACNQDTPEERCSVAAYVSART',
  'MPEAKADDIVGPVTHEIFENNVVHL',
  'MWQEPKEPNGLIVLYEVSYRRYGDEELHLCVSRKHF
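The extraction can be verified on the example stop chunk given earlier. The sketch below also compares the quantifiers `+` and `*` after `[^M]` on two small invented strings: with `+`, an `M` that is immediately followed by another `M`, or that ends the chunk, produces no match, so a possible start position would be missed.

```python
import re

# Example stop chunk from the text, without the spaces
stop_chunk = "QCLPWRGRAGVPIMAFLWFESLWKMQDSHDSTMSSGVQRSFLYMSVHLKVDSFGYQFN"
m_chunks = re.findall(r"M[^M]*", stop_chunk)
print(m_chunks)

# Toy strings with adjacent and trailing M, invented for illustration
for toy in ("AMMK", "AKM"):
    with_plus = re.findall(r"M[^M]+", toy)   # needs at least one non-M after the M
    with_star = re.findall(r"M[^M]*", toy)   # accepts a lone M as a chunk
    print(toy, with_plus, with_star)
```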

For each list of **M chunks**, concatenate the **M chunks**: first concatenate all strings from the first to the last, then all strings from the second to the last, and so on, to obtain the proteins potentially expressible by the transcript.

For example, for this list of **M chunks** of a stop chunk:

    MAFLWFESLWK MQDSHDST MSSGVQRSFLY MSVHLKVDSFGYQFN
    
four concatenations are needed, which give four potential proteins:

1. Concatenation of:

        MAFLWFESLWK MQDSHDST MSSGVQRSFLY MSVHLKVDSFGYQFN
1. Concatenation of:

        MQDSHDST MSSGVQRSFLY MSVHLKVDSFGYQFN
1. Concatenation of:

        MSSGVQRSFLY MSVHLKVDSFGYQFN
1. Concatenation of:

        MSVHLKVDSFGYQFN
        
Then produce the list of proteins for all stop chunks.

In [15]:
protein_list = [''.join(m_chunks[i:]) for m_chunks in m_chunk_list for i in range(len(m_chunks))]

In [16]:
protein_list

['MTPAGQRRAPDPRRPRAPAAMGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLTRLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFPNLTVIRGSRLFFNYALVIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDWSRILDSVEDNHIVLNKDDNEECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTICKSHGCTAEGLCCHSECLGNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCVNFSFCQDLHHKCKNSRRQGCHQYVIHNNKCIPECPSGYTMNSSNLLCTPCLGPCPKVCHLLEGEKTIDSVTSAQELRGCTVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRRSYALVSLSFFRKLRLIRGETLEIGNYSFYALDNQNLRQLWDWSKHNLTTTQGKLFFHYNPKLCLSEIHKMEEVSGTKGRQERNDIALKTNGDKASCENELLKFSYIRTSFDKILLRWEPYWPPDFRDLLGFMLFYKEAPYQNVTEFDGQDACGSNSWTVVDIDPPLRSNDPKSQNHPGWLMRGLKPWTQYAIFVKTLVTFSDERRTYGAKSDIIYVQTDATNPSVPLDPISVSNSSSQIILKWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRTWSPPFESEDSQKHNQSEYEDSAGECCSCPKTDSQILKELEESSFRKTFEDYLHNVVFVPRKTSSGTGAEDPRPSRKRRSLGDVGNVTVAVPTVAAFPNTSSTSVPTSPEEHRPFEKVVNKESLVISGLRHFTGYRIELQACNQDTPEERCSVAAYVSARTMPEAKADDIVGPVTHEIFENNVVHLMWQEPKEPNGLIVLYEVSYRRYGDEELHLCVSRKHFALERGCRLRGLSPGNYSVRIRATSLAGNGSWTEPTYFYVTDYLDVPSNIAKIIIGPLIFVFLFSVVIGSIYLF
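The concatenation step can be checked on the four example **M chunks**. The sketch also shows an equivalent way to get the same strings, as an observation rather than the method used in the exercise: each protein is the suffix of the stop chunk that starts at one of its `M` positions.

```python
# The four M chunks of the example stop chunk
m_chunks = ["MAFLWFESLWK", "MQDSHDST", "MSSGVQRSFLY", "MSVHLKVDSFGYQFN"]

# Concatenate from the i-th chunk to the last, for every i
proteins = ["".join(m_chunks[i:]) for i in range(len(m_chunks))]
for protein in proteins:
    print(len(protein), protein)

# Equivalent view: every suffix of the stop chunk that starts at an M
stop_chunk = "QCLPWRGRAGVPI" + "".join(m_chunks)
suffixes = [stop_chunk[i:] for i, aa in enumerate(stop_chunk) if aa == "M"]
print(suffixes == proteins)
```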

Discard the proteins whose length does not exceed 1000 amino acids.

In [17]:
protein_list = [protein for protein in protein_list if len(protein) > 1000]

In [18]:
protein_list

['MTPAGQRRAPDPRRPRAPAAMGTGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLTRLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFPNLTVIRGSRLFFNYALVIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDWSRILDSVEDNHIVLNKDDNEECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTICKSHGCTAEGLCCHSECLGNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCVNFSFCQDLHHKCKNSRRQGCHQYVIHNNKCIPECPSGYTMNSSNLLCTPCLGPCPKVCHLLEGEKTIDSVTSAQELRGCTVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRRSYALVSLSFFRKLRLIRGETLEIGNYSFYALDNQNLRQLWDWSKHNLTTTQGKLFFHYNPKLCLSEIHKMEEVSGTKGRQERNDIALKTNGDKASCENELLKFSYIRTSFDKILLRWEPYWPPDFRDLLGFMLFYKEAPYQNVTEFDGQDACGSNSWTVVDIDPPLRSNDPKSQNHPGWLMRGLKPWTQYAIFVKTLVTFSDERRTYGAKSDIIYVQTDATNPSVPLDPISVSNSSSQIILKWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRTWSPPFESEDSQKHNQSEYEDSAGECCSCPKTDSQILKELEESSFRKTFEDYLHNVVFVPRKTSSGTGAEDPRPSRKRRSLGDVGNVTVAVPTVAAFPNTSSTSVPTSPEEHRPFEKVVNKESLVISGLRHFTGYRIELQACNQDTPEERCSVAAYVSARTMPEAKADDIVGPVTHEIFENNVVHLMWQEPKEPNGLIVLYEVSYRRYGDEELHLCVSRKHFALERGCRLRGLSPGNYSVRIRATSLAGNGSWTEPTYFYVTDYLDVPSNIAKIIIGPLIFVFLFSVVIGSIYLF
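The steps from the translations to the filtered list can be gathered into one function. This is a minimal sketch under stated assumptions: the name `candidate_proteins` and its `min_length` parameter are hypothetical, the input is a list of already translated frames, and the toy strings and the threshold of 2 are invented so that the result is easy to check by hand.

```python
import re


def candidate_proteins(translations, min_length):
    """Return every M-initiated stretch up to the next stop that is longer than min_length."""
    proteins = []
    for translation in translations:
        for stop_chunk in translation.split("*"):
            # Pieces starting at each M and ending just before the next M
            m_chunks = re.findall(r"M[^M]*", stop_chunk)
            # One candidate protein per M: from that M to the end of the stop chunk
            proteins.extend("".join(m_chunks[i:]) for i in range(len(m_chunks)))
    return [p for p in proteins if len(p) > min_length]


toy_translations = ["GMAKMR*MK", "AA*MMQLV*"]  # invented for illustration
print(candidate_proteins(toy_translations, 2))
```

For the exercise itself the call would take the three frame translations and a threshold of 1000.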