# Biopython - Exercise 1

Take as input a `FASTA` file of EST (Expressed Sequence Tag) sequences and
split the sequences into two groups:

- A: EST sequences with a coding sequence
- B: EST sequences without a coding sequence

For each sequence in group A, extract the coding sequence and translate it into a protein. Write all the resulting proteins to a file in `FASTA` format.

From group B, discard the sequences shorter than 500 bases and write the remaining ones to a file in `FASTA` format, sorted by increasing length, after reverse-complementing those whose `clone_end` is `5'`.

---

**Hints**

Group A sequences have a `FASTA` header containing the `/cds` tag, which gives the 1-based start and end positions of the coding sequence. Sequences whose header lacks this tag belong to group B.

The `/clone_end` tag of group B sequences indicates whether the sequence has direction `5'3'` or `3'5'`. `/clone_end=5'` indicates a `3'5'` direction, so the sequence must be reverse-complemented to make it agree with the transcription strand of its gene of origin.

The GenBank ID is given by the `/gb` tag.

Example `FASTA` header for a group A sequence:

    >gnl|UG|Hs#S3219558 Homo sapiens Rho GTPase activating protein 4 (ARHGAP4), transcript variant 2, mRNA /cds=p(59,2899) /gb=NM_001666 /gi=258613905 /ug=Hs.701324 /len=3285
    
Example `FASTA` header for a group B sequence:

    >gnl|UG|Hs#S1027289 os53f09.s1 NCI_CGAP_Br2 Homo sapiens cDNA clone IMAGE:1609097 3', mRNA sequence /clone=IMAGE:1609097 /clone_end=3' /gb=AI000530 /gi=3191084 /ug=Hs.701324 /len=458
    
---

**Output requirements**
    
The `FASTA` header in the output file of group A translations must contain only the GenBank ID, while the header in the output file of group B sequences must contain the sequence length in addition to the GenBank ID.

---

Install Biopython's `Bio` package (distributed on PyPI as `biopython`).

Import the `Bio` package.

In [1]:
import Bio

Import the `SeqIO` package.

In [2]:
from Bio import SeqIO

---

### Get the list of EST sequence records
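
A minimal sketch of this step, assuming the records are read with `SeqIO.parse`. The exercise does not name the input file, so the example parses an in-memory `FASTA` text with shortened headers and made-up sequences; `load_records` and `SAMPLE_FASTA` are illustrative names, not part of the original notebook.

```python
from io import StringIO

from Bio import SeqIO

# Illustrative stand-in for the input file (shortened headers, made-up sequences).
SAMPLE_FASTA = """\
>gnl|UG|Hs#S3219558 Homo sapiens mRNA /cds=p(3,11) /gb=NM_001666 /len=14
GGATGGCCTGAAAA
>gnl|UG|Hs#S1027289 Homo sapiens cDNA clone /clone_end=3' /gb=AI000530 /len=8
AATCAACA
"""


def load_records(handle):
    """Return all FASTA records from a file name or an open handle as a list."""
    return list(SeqIO.parse(handle, "fasta"))


est_list = load_records(StringIO(SAMPLE_FASTA))
for record in est_list:
    # record.description holds the whole header line, tags included.
    print(record.id, len(record.seq))
```

With a real file, the same helper would be called with the file name instead of the `StringIO` handle.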

### Split the sequences into two lists (group A and group B)

Get the list of `/cds` tags from each of the input `FASTA` headers.
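
One possible way to do this is to run `re.findall` on each header, which yields an empty list when the tag is absent. The regular expression and the names `CDS_TAG` and `cds_tag_list` are assumptions made for this sketch; the two headers are the examples given in the hints.

```python
import re

# Matches the value of a /cds tag, e.g. "p(59,2899)".
CDS_TAG = re.compile(r"/cds=(\S+)")

# Example headers from the hints (group A first, then group B).
headers = [
    "gnl|UG|Hs#S3219558 Homo sapiens Rho GTPase activating protein 4 (ARHGAP4), "
    "transcript variant 2, mRNA /cds=p(59,2899) /gb=NM_001666 /gi=258613905 "
    "/ug=Hs.701324 /len=3285",
    "gnl|UG|Hs#S1027289 os53f09.s1 NCI_CGAP_Br2 Homo sapiens cDNA clone "
    "IMAGE:1609097 3', mRNA sequence /clone=IMAGE:1609097 /clone_end=3' "
    "/gb=AI000530 /gi=3191084 /ug=Hs.701324 /len=458",
]

# One list per header: the tag value if present, nothing otherwise.
cds_tag_list = [CDS_TAG.findall(header) for header in headers]
print(cds_tag_list)  # [['p(59,2899)'], []]
```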

In [6]:
cds_start_end_list

[[],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 ['p(59,3019)'],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 ['p(59,2899)'],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 

Get the list of group A records (objects of type `SeqRecord`).
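
The split can be sketched as a single pass over the records, pairing each group A record with its `/cds` tag (the output below shows such pairs). The records, IDs and the `CDS_TAG` pattern here are made up for illustration.

```python
import re

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

CDS_TAG = re.compile(r"/cds=(\S+)")

# Made-up records standing in for the parsed input.
est_list = [
    SeqRecord(Seq("GGATGGCCTGAAAA"), id="est1",
              description="est1 mRNA /cds=p(3,11) /gb=X1"),
    SeqRecord(Seq("AATCAACA"), id="est2",
              description="est2 cDNA /clone_end=3' /gb=X2"),
]

est_a_list = []  # (record, cds tag) pairs
est_b_list = []  # records without a /cds tag
for record in est_list:
    match = CDS_TAG.search(record.description)
    if match:
        est_a_list.append((record, match.group(1)))
    else:
        est_b_list.append(record)

print(len(est_a_list), len(est_b_list))
```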

In [8]:
est_a_list

[(SeqRecord(seq=Seq('GGCCCGCTCACGGCCGCGTGGGAGCAGTGGGGTTCGACGGCGCGGCCGCGAGGC...AAA', SingleLetterAlphabet()), id='gnl|UG|Hs#S15631921', name='gnl|UG|Hs#S15631921', description='gnl|UG|Hs#S15631921 Homo sapiens Rho GTPase activating protein 4, mRNA (cDNA clone MGC:59737 IMAGE:6379390), complete cds /cds=p(59,3019) /gb=BC052303 /gi=30353955 /ug=Hs.701324 /len=3387', dbxrefs=[]),
  'p(59,3019)'),
 (SeqRecord(seq=Seq('GGCCCGCTCACGGCCGCGTGGGAGCAGTGGGGTTCGACGGCGCGGCCGCGAGGC...AAA', SingleLetterAlphabet()), id='gnl|UG|Hs#S3219558', name='gnl|UG|Hs#S3219558', description='gnl|UG|Hs#S3219558 Homo sapiens Rho GTPase activating protein 4 (ARHGAP4), transcript variant 2, mRNA /cds=p(59,2899) /gb=NM_001666 /gi=258613905 /ug=Hs.701324 /len=3285', dbxrefs=[]),
  'p(59,2899)'),
 (SeqRecord(seq=Seq('CGGGAAGCTGCGGCGGGAGCGGGGGCTGCAGGCTGAGTATGAGACGCAAGTCAA...CCC', SingleLetterAlphabet()), id='gnl|UG|Hs#S417706', name='gnl|UG|Hs#S417706', description='gnl|UG|Hs#S417706 Homo sapiens mRNA for KIAA0131 gene, p

Get the list of group B records (objects of type `SeqRecord`).

In [10]:
est_b_list

[SeqRecord(seq=Seq('AATCAACACGAGGTCTTTATGAATCGCCACCCAGCCCTGCCAGGCATCTGAGCA...GCC', SingleLetterAlphabet()), id='gnl|UG|Hs#S1027289', name='gnl|UG|Hs#S1027289', description="gnl|UG|Hs#S1027289 os53f09.s1 NCI_CGAP_Br2 Homo sapiens cDNA clone IMAGE:1609097 3', mRNA sequence /clone=IMAGE:1609097 /clone_end=3' /gb=AI000530 /gi=3191084 /ug=Hs.701324 /len=458", dbxrefs=[]),
 SeqRecord(seq=Seq('TTTTTTTTTTTTGGGAATCAACACGAGGTCTTTATGAATCGCCACCCAGCCCTG...CAA', SingleLetterAlphabet()), id='gnl|UG|Hs#S1072854', name='gnl|UG|Hs#S1072854', description="gnl|UG|Hs#S1072854 oz13f05.x1 Soares_fetal_liver_spleen_1NFLS_S1 Homo sapiens cDNA clone IMAGE:1675233 3' similar to SW:RGC1_HUMAN P98171 RHO-GAP HEMATOPOIETIC PROTEIN C1 ;, mRNA sequence /clone=IMAGE:1675233 /clone_end=3' /gb=AI078474 /gi=3412882 /ug=Hs.701324 /len=461", dbxrefs=[]),
 SeqRecord(seq=Seq('TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCGGGGAATCAACACGAGGTC...GGC', SingleLetterAlphabet()), id='gnl|UG|Hs#S1140573', name='gnl|UG|Hs#S1140573', description

### Write the translations of the group A coding sequences

a) Extract the list of start and end positions of the CDSs of the group A sequences.
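
A small sketch of the parsing, assuming every tag has the plain form `p(start,end)` seen in the examples; tags with any other layout would raise an error here. `parse_cds_tag` and `CDS_COORDS` are hypothetical names.

```python
import re

# Captures the two integers inside a tag such as "p(59,3019)".
CDS_COORDS = re.compile(r"\((\d+),(\d+)\)")


def parse_cds_tag(tag):
    """Return [start, end] (1-based) from a /cds tag value."""
    match = CDS_COORDS.search(tag)
    if match is None:
        raise ValueError(f"unrecognised /cds tag: {tag!r}")
    return [int(match.group(1)), int(match.group(2))]


cds_start_end_list = [parse_cds_tag(tag) for tag in ["p(59,3019)", "p(59,2899)"]]
print(cds_start_end_list)  # [[59, 3019], [59, 2899]]
```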

In [13]:
cds_start_end_list

[[59, 3019],
 [59, 2899],
 [1, 2830],
 [53, 2824],
 [59, 3019],
 [43, 2883],
 [1, 1073]]

b) Discard the coding sequences whose length is not a multiple of 3 and update the list of group A sequences accordingly.

Build a list of booleans whose i-th value is `True` if the i-th sequence has a coding sequence whose length is a multiple of 3.
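
As a sketch, treating the start and end positions as 1-based and inclusive (an assumption that is consistent with the outputs shown here), the length of a coding sequence is `end - start + 1`:

```python
cds_start_end_list = [
    [59, 3019], [59, 2899], [1, 2830], [53, 2824],
    [59, 3019], [43, 2883], [1, 1073],
]

# With 1-based inclusive coordinates the CDS length is end - start + 1.
is_multiple_3 = [(end - start + 1) % 3 == 0 for start, end in cds_start_end_list]
print(is_multiple_3)
# [True, True, False, True, True, True, False]
```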

In [15]:
is_multiple_3

[True, True, False, True, True, True, False]

Use this list to update the list of start and end positions.
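
One way to apply the boolean list is to zip it with the list being filtered; the same mask can then be applied to the parallel list of records. The strings in `est_a_list` below are placeholders standing in for the group A records.

```python
cds_start_end_list = [[59, 3019], [59, 2899], [1, 2830], [53, 2824]]
is_multiple_3 = [True, True, False, True]
# Placeholder strings stand in for the group A records.
est_a_list = ["rec0", "rec1", "rec2", "rec3"]

# Keep only the entries whose flag is True, preserving the original order.
cds_start_end_list = [
    pos for pos, keep in zip(cds_start_end_list, is_multiple_3) if keep
]
est_a_list = [rec for rec, keep in zip(est_a_list, is_multiple_3) if keep]

print(cds_start_end_list)  # [[59, 3019], [59, 2899], [53, 2824]]
print(est_a_list)          # ['rec0', 'rec1', 'rec3']
```

Filtering both lists with the same mask keeps them aligned, so the i-th coordinates still belong to the i-th record.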

In [17]:
cds_start_end_list

[[59, 3019], [59, 2899], [53, 2824], [59, 3019], [43, 2883]]

Update the list of sequences as well.

In [19]:
est_a_list

[(SeqRecord(seq=Seq('GGCCCGCTCACGGCCGCGTGGGAGCAGTGGGGTTCGACGGCGCGGCCGCGAGGC...AAA', SingleLetterAlphabet()), id='gnl|UG|Hs#S15631921', name='gnl|UG|Hs#S15631921', description='gnl|UG|Hs#S15631921 Homo sapiens Rho GTPase activating protein 4, mRNA (cDNA clone MGC:59737 IMAGE:6379390), complete cds /cds=p(59,3019) /gb=BC052303 /gi=30353955 /ug=Hs.701324 /len=3387', dbxrefs=[]),
  'p(59,3019)'),
 (SeqRecord(seq=Seq('GGCCCGCTCACGGCCGCGTGGGAGCAGTGGGGTTCGACGGCGCGGCCGCGAGGC...AAA', SingleLetterAlphabet()), id='gnl|UG|Hs#S3219558', name='gnl|UG|Hs#S3219558', description='gnl|UG|Hs#S3219558 Homo sapiens Rho GTPase activating protein 4 (ARHGAP4), transcript variant 2, mRNA /cds=p(59,2899) /gb=NM_001666 /gi=258613905 /ug=Hs.701324 /len=3285', dbxrefs=[]),
  'p(59,2899)'),
 (SeqRecord(seq=Seq('ATTCAGAGGCTTTCCAGGAACGAGGGAGAGACAGGAAGCCAAACTGAAAGAGAT...GCT', SingleLetterAlphabet()), id='gnl|UG|Hs#S50215240', name='gnl|UG|Hs#S50215240', description='gnl|UG|Hs#S50215240 Homo sapiens cDNA FLJ54515 compl

Get the list of the corresponding coding sequences.
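
A sketch of the extraction, assuming 1-based inclusive coordinates: the positions `[start, end]` map to the Python slice `[start - 1:end]`. Slicing a `SeqRecord` keeps its `id` and `description`, as in the output below. The record is made up for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Made-up record: 2 leading bases, a 9-base CDS at 1-based positions 3..11,
# then 3 trailing bases.
record = SeqRecord(Seq("GGATGGCCTGAAAA"), id="est1",
                   description="est1 /cds=p(3,11)")
start, end = 3, 11

# 1-based inclusive [start, end] corresponds to the slice [start - 1:end].
coding_record = record[start - 1:end]
print(coding_record.seq)  # ATGGCCTGA
```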

In [21]:
coding_sequence_list

[SeqRecord(seq=Seq('ATGGCCGCTCACGGGAAGCTGCGGCGGGAGCGGGGGCTGCAGGCTGAGTATGAG...TGA', SingleLetterAlphabet()), id='gnl|UG|Hs#S15631921', name='gnl|UG|Hs#S15631921', description='gnl|UG|Hs#S15631921 Homo sapiens Rho GTPase activating protein 4, mRNA (cDNA clone MGC:59737 IMAGE:6379390), complete cds /cds=p(59,3019) /gb=BC052303 /gi=30353955 /ug=Hs.701324 /len=3387', dbxrefs=[]),
 SeqRecord(seq=Seq('ATGGCCGCTCACGGGAAGCTGCGGCGGGAGCGGGGGCTGCAGGCTGAGTATGAG...TGA', SingleLetterAlphabet()), id='gnl|UG|Hs#S3219558', name='gnl|UG|Hs#S3219558', description='gnl|UG|Hs#S3219558 Homo sapiens Rho GTPase activating protein 4 (ARHGAP4), transcript variant 2, mRNA /cds=p(59,2899) /gb=NM_001666 /gi=258613905 /ug=Hs.701324 /len=3285', dbxrefs=[]),
 SeqRecord(seq=Seq('ATGCGCTGGCAGCTGAGCGAGCAGCTGCGCTGCCTGGAGCTGCAGGGCGAGCTG...TGA', SingleLetterAlphabet()), id='gnl|UG|Hs#S50215240', name='gnl|UG|Hs#S50215240', description='gnl|UG|Hs#S50215240 Homo sapiens cDNA FLJ54515 complete cds, highly similar to Rho-GTPase

Get the list of translations of the coding sequences.
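
A minimal sketch using `Seq.translate` with its default (standard) codon table, wrapping each protein in a new `SeqRecord` so that it can be given an `id` later. The two short coding sequences are made up.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Made-up coding sequences, each a whole number of codons ending in a stop codon.
coding_sequence_list = [
    SeqRecord(Seq("ATGGCCGCTCACTGA"), id="cds1"),
    SeqRecord(Seq("ATGCGCTGGTGA"), id="cds2"),
]

# translate() keeps the stop codon as "*" unless to_stop=True is passed.
translation_list = [SeqRecord(cds.seq.translate()) for cds in coding_sequence_list]
for protein in translation_list:
    print(protein.seq)
```

The new records carry no identifier yet, which matches the `<unknown id>` placeholders in the output below.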

In [23]:
translation_list

[SeqRecord(seq=Seq('MAAHGKLRRERGLQAEYETQVKEMRWQLSEQLRCLELQGELRRELLQELAEFMR...PH*', HasStopCodon(ExtendedIUPACProtein(), '*')), id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=[]),
 SeqRecord(seq=Seq('MAAHGKLRRERGLQAEYETQVKEMRWQLSEQLRCLELQGELRRELLQELAEFMR...PH*', HasStopCodon(ExtendedIUPACProtein(), '*')), id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=[]),
 SeqRecord(seq=Seq('MRWQLSEQLRCLELQGELRRELLQELAEFMRRRAEVELEYSRGLEKLAERFSSR...PH*', HasStopCodon(ExtendedIUPACProtein(), '*')), id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=[]),
 SeqRecord(seq=Seq('MAAHGKLRRERGLQAEYETQVKEMRWQLSEQLRCLELQGELRRELLQELAEFMR...PH*', HasStopCodon(ExtendedIUPACProtein(), '*')), id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=[]),
 SeqRecord(seq=Seq('MAAHGKLRRERGLQAEYETQVKEMRWQLSEQLRCLELQGELRRELLQELAEFMR...PH*', HasStopCodon(ExtendedIUPACProtein(), '*')

Assign to each translation the GenBank ID as the `id` attribute of its record and the empty string as the `description` attribute.
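
A sketch of this step, assuming the translations are in the same order as the group A records they come from. The GenBank ID is read from the `/gb` tag of the source header; `GB_TAG` is a hypothetical name, and for simplicity `est_a_list` here holds plain records with a shortened header and a made-up sequence.

```python
import re

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Matches the value of a /gb tag, e.g. "NM_001666".
GB_TAG = re.compile(r"/gb=(\S+)")

est_a_list = [
    SeqRecord(Seq("GGATGGCCGCTCACTGAAA"), id="gnl|UG|Hs#S3219558",
              description="gnl|UG|Hs#S3219558 Homo sapiens mRNA "
                          "/cds=p(3,17) /gb=NM_001666"),
]
translation_list = [SeqRecord(Seq("MAAH*"))]

for source, protein in zip(est_a_list, translation_list):
    match = GB_TAG.search(source.description)
    if match is None:
        raise ValueError(f"no /gb tag in header of {source.id}")
    protein.id = match.group(1)
    protein.description = ""  # the header will contain only the GenBank ID

print(translation_list[0].id)  # NM_001666
```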

In [25]:
translation_list

[SeqRecord(seq=Seq('MAAHGKLRRERGLQAEYETQVKEMRWQLSEQLRCLELQGELRRELLQELAEFMR...PH*', HasStopCodon(ExtendedIUPACProtein(), '*')), id='BC052303', name='<unknown name>', description='', dbxrefs=[]),
 SeqRecord(seq=Seq('MAAHGKLRRERGLQAEYETQVKEMRWQLSEQLRCLELQGELRRELLQELAEFMR...PH*', HasStopCodon(ExtendedIUPACProtein(), '*')), id='NM_001666', name='<unknown name>', description='', dbxrefs=[]),
 SeqRecord(seq=Seq('MRWQLSEQLRCLELQGELRRELLQELAEFMRRRAEVELEYSRGLEKLAERFSSR...PH*', HasStopCodon(ExtendedIUPACProtein(), '*')), id='AK294562', name='<unknown name>', description='', dbxrefs=[]),
 SeqRecord(seq=Seq('MAAHGKLRRERGLQAEYETQVKEMRWQLSEQLRCLELQGELRRELLQELAEFMR...PH*', HasStopCodon(ExtendedIUPACProtein(), '*')), id='NM_001164741', name='<unknown name>', description='', dbxrefs=[]),
 SeqRecord(seq=Seq('MAAHGKLRRERGLQAEYETQVKEMRWQLSEQLRCLELQGELRRELLQELAEFMR...PH*', HasStopCodon(ExtendedIUPACProtein(), '*')), id='X78817', name='<unknown name>', description='', dbxrefs=[])]

Write the translations to the output file `translation.fa`.
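
Writing can be sketched with `SeqIO.write`, which accepts either a file name or an open handle. The example writes to an in-memory handle so the resulting text can be inspected; the protein sequences are made up.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

translation_list = [
    SeqRecord(Seq("MAAH*"), id="BC052303", description=""),
    SeqRecord(Seq("MRW*"), id="AK294562", description=""),
]

# A StringIO handle lets us look at the FASTA text without touching the disk.
buffer = StringIO()
written = SeqIO.write(translation_list, buffer, "fasta")
fasta_text = buffer.getvalue()
print(fasta_text)

# For the exercise itself the target would be the file named in the text:
# SeqIO.write(translation_list, "translation.fa", "fasta")
```

Because `description` is empty, each header line consists of the GenBank ID alone.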

### Write the group B sequences

Discard the sequences shorter than 500 bases.
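
A minimal sketch of the filter, measuring the actual sequence rather than trusting the `/len` tag. `MIN_LENGTH` is a hypothetical constant and the records are made up.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

MIN_LENGTH = 500

est_b_list = [
    SeqRecord(Seq("A" * 458), id="short"),
    SeqRecord(Seq("C" * 500), id="borderline"),
    SeqRecord(Seq("G" * 516), id="long"),
]

# "Shorter than 500" is discarded, so a sequence of exactly 500 bases is kept.
est_b_list = [record for record in est_b_list if len(record.seq) >= MIN_LENGTH]
print([record.id for record in est_b_list])  # ['borderline', 'long']
```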

In [28]:
est_b_list

[SeqRecord(seq=Seq('TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCGGGGAATCAACACGAGGTC...GGC', SingleLetterAlphabet()), id='gnl|UG|Hs#S1140573', name='gnl|UG|Hs#S1140573', description="gnl|UG|Hs#S1140573 qx09d10.x1 NCI_CGAP_Lym12 Homo sapiens cDNA clone IMAGE:2000851 3' similar to SW:RGC1_HUMAN P98171 RHO-GAP HEMATOPOIETIC PROTEIN C1 ;, mRNA sequence /clone=IMAGE:2000851 /clone_end=3' /gb=AI249937 /gi=3846466 /ug=Hs.701324 /len=516", dbxrefs=[]),
 SeqRecord(seq=Seq('TTTTCGGGGAATCAACACGAGGTCTTTATGAATCGCCACCCAGCCCTGCCAGGC...GTC', SingleLetterAlphabet()), id='gnl|UG|Hs#S1185526', name='gnl|UG|Hs#S1185526', description="gnl|UG|Hs#S1185526 qh70a03.x1 Soares_fetal_liver_spleen_1NFLS_S1 Homo sapiens cDNA clone IMAGE:1849996 3' similar to SW:RGC1_HUMAN P98171 RHO-GAP HEMATOPOIETIC PROTEIN C1 ;, mRNA sequence /clone=IMAGE:1849996 /clone_end=3' /gb=AI248207 /gi=3843604 /ug=Hs.701324 /len=578", dbxrefs=[]),
 SeqRecord(seq=Seq('GAATCAACACGAGGTCTTTATGAATCGCCACCCAGCCCTGCCAGGCATCTGAGC...GCG', SingleLetterAlphabe

Reverse-complement the sequences whose `clone_end` is `5'`.
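
A sketch following the rule stated in the hints: the `/clone_end` tag is read from the header and the sequence is replaced in place by its reverse complement when the value is `5'`. Records without the tag are left untouched here, which is an assumption; the records and `CLONE_END_TAG` are illustrative.

```python
import re

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Matches the value of a /clone_end tag, i.e. "5'" or "3'".
CLONE_END_TAG = re.compile(r"/clone_end=(\S+)")

est_b_list = [
    SeqRecord(Seq("AACG"), id="est5", description="est5 /clone_end=5' /gb=X5"),
    SeqRecord(Seq("AACG"), id="est3", description="est3 /clone_end=3' /gb=X3"),
]

for record in est_b_list:
    match = CLONE_END_TAG.search(record.description)
    if match and match.group(1) == "5'":
        # Replacing only .seq keeps id, name and description unchanged.
        record.seq = record.seq.reverse_complement()

print([str(record.seq) for record in est_b_list])  # ['CGTT', 'AACG']
```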

In [30]:
est_b_list

[SeqRecord(seq=Seq('TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCGGGGAATCAACACGAGGTC...GGC', SingleLetterAlphabet()), id='gnl|UG|Hs#S1140573', name='gnl|UG|Hs#S1140573', description="gnl|UG|Hs#S1140573 qx09d10.x1 NCI_CGAP_Lym12 Homo sapiens cDNA clone IMAGE:2000851 3' similar to SW:RGC1_HUMAN P98171 RHO-GAP HEMATOPOIETIC PROTEIN C1 ;, mRNA sequence /clone=IMAGE:2000851 /clone_end=3' /gb=AI249937 /gi=3846466 /ug=Hs.701324 /len=516", dbxrefs=[]),
 SeqRecord(seq=Seq('TTTTCGGGGAATCAACACGAGGTCTTTATGAATCGCCACCCAGCCCTGCCAGGC...GTC', SingleLetterAlphabet()), id='gnl|UG|Hs#S1185526', name='gnl|UG|Hs#S1185526', description="gnl|UG|Hs#S1185526 qh70a03.x1 Soares_fetal_liver_spleen_1NFLS_S1 Homo sapiens cDNA clone IMAGE:1849996 3' similar to SW:RGC1_HUMAN P98171 RHO-GAP HEMATOPOIETIC PROTEIN C1 ;, mRNA sequence /clone=IMAGE:1849996 /clone_end=3' /gb=AI248207 /gi=3843604 /ug=Hs.701324 /len=578", dbxrefs=[]),
 SeqRecord(seq=Seq('GAATCAACACGAGGTCTTTATGAATCGCCACCCAGCCCTGCCAGGCATCTGAGC...GCG', SingleLetterAlphabe

Assign to each sequence its GenBank ID as the `id` attribute and its length as the `description` attribute.

In [32]:
est_b_list

[SeqRecord(seq=Seq('TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCGGGGAATCAACACGAGGTC...GGC', SingleLetterAlphabet()), id='AI249937', name='gnl|UG|Hs#S1140573', description='516', dbxrefs=[]),
 SeqRecord(seq=Seq('TTTTCGGGGAATCAACACGAGGTCTTTATGAATCGCCACCCAGCCCTGCCAGGC...GTC', SingleLetterAlphabet()), id='AI248207', name='gnl|UG|Hs#S1185526', description='578', dbxrefs=[]),
 SeqRecord(seq=Seq('GAATCAACACGAGGTCTTTATGAATCGCCACCCAGCCCTGCCAGGCATCTGAGC...GCG', SingleLetterAlphabet()), id='AI445322', name='gnl|UG|Hs#S1285115', description='524', dbxrefs=[]),
 SeqRecord(seq=Seq('TTTGGGGAATCAACACGAGGTCTTTATGAATCGCCACCCAGCCCTGCCAGGCAT...AAG', SingleLetterAlphabet()), id='AI590391', name='gnl|UG|Hs#S1347553', description='549', dbxrefs=[]),
 SeqRecord(seq=Seq('TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCGGGGAATAAACACGAGGTTT...CCC', SingleLetterAlphabet()), id='AI762509', name='gnl|UG|Hs#S1445435', description='556', dbxrefs=[]),
 SeqRecord(seq=Seq('TTTCGGGGAATCAACACGAGGTCTTTATGAATCGCCACCCAGCCCTGCCAGGCA...CCC', SingleL

Sort the sequences by increasing length.

Write the sequences to the output file `ests-500.fa`.
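
Sorting can be sketched with `sorted` and a length key. The IDs and lengths below are taken from the output above, while the sequences themselves are placeholders.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# IDs and lengths as in the output above; the bases are placeholders.
est_b_list = [
    SeqRecord(Seq("A" * 516), id="AI249937", description="516"),
    SeqRecord(Seq("A" * 578), id="AI248207", description="578"),
    SeqRecord(Seq("A" * 524), id="AI445322", description="524"),
]

# sorted() is stable, so records of equal length keep their relative order.
est_b_list = sorted(est_b_list, key=lambda record: len(record.seq))
print([record.id for record in est_b_list])
# ['AI249937', 'AI445322', 'AI248207']
```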