## 3 SeqIO

The Seq and SeqRecord classes are great for storing sequences and their annotations, but the data needed to populate these objects usually lives in files. One way to populate them would be to open a file with Python's built-in open() function and extract the information we need, a process called parsing. Depending on the file format, this can be quite laborious. Fortunately, Biopython's Bio.SeqIO module ships with parsers for many file formats: SeqIO.parse() returns a generator-like iterator that yields one SeqRecord per iteration of a for loop. The supported formats are listed below; the Read, Write and Index columns give the Biopython version in which that capability was introduced.


| Format name           | Read | Write       | Index | Notes                                                                                                                                                                                                                                                                                                                                                                                       |
|-----------------------|------|-------------|-------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| abi                   | 1.58 | No          | N/A   | Reads the ABI “Sanger” capillary sequence trace files, including the PHRED quality scores for the base calls. This allows ABI to FASTQ conversion. Note each ABI file contains one and only one sequence (so there is no point in indexing the file). |
| abi-trim              | 1.71 | No          | N/A   | Same as “abi” but with quality trimming with Mott’s algorithm.                                                                                                                                                                                                                                                                                                                              |
| ace                   | 1.47 | No          | 1.52  | Reads the contig sequences from an ACE assembly file. Uses Bio.Sequencing.Ace internally. |
| cif-atom              | 1.73 | No          | No    | Uses Bio.PDB.MMCIFParser to determine the (partial) protein sequence as it appears in the structure based on the atomic coordinates.                                                                                                                                                                                                                                                        |
| cif-seqres            | 1.73 | No          | No    | Reads a macromolecular Crystallographic Information File (mmCIF) file to determine the complete protein sequence as defined by the _pdbx_poly_seq_scheme records.                                                                                                                                                                                                                           |
| clustal               | 1.43 | 1.43        | No    | The alignment format of Clustal X and Clustal W.                                                                                                                                                                                                                                                                                                                                            |
| embl                  | 1.43 | 1.54        | 1.52  | The EMBL flat file format. Uses Bio.GenBank internally.                                                                                                                                                                                                                                                                                                                                     |
| fasta                 | 1.43 | 1.43        | 1.52  | This refers to the input FASTA file format introduced for Bill Pearson’s FASTA tool, where each record starts with a “>” line. Resulting sequences have a generic alphabet by default.                                                                                                                                                                                                      |
| fasta-2line           | 1.71 | 1.71        | No    | FASTA format variant with no line wrapping and exactly two lines per record.                                                                                                                                                                                                                                                                                                                |
| fastq-sanger or fastq | 1.50 | 1.50        | 1.52  | FASTQ files are a bit like FASTA files but also include sequencing qualities. In Biopython, “fastq” (or the alias “fastq-sanger”) refers to Sanger style FASTQ files which encode PHRED qualities using an ASCII offset of 33. See also the incompatible “fastq-solexa” and “fastq-illumina” variants used in early Solexa/Illumina pipelines. Illumina pipeline 1.8 produces Sanger FASTQ. |
| fastq-solexa          | 1.50 | 1.50        | 1.52  | In Biopython, “fastq-solexa” refers to the original Solexa/Illumina style FASTQ files which encode Solexa qualities using an ASCII offset of 64. See also what we call the “fastq-illumina” format.                                                                                                                                                                                         |
| fastq-illumina        | 1.51 | 1.51        | 1.52  | In Biopython, “fastq-illumina” refers to early Solexa/Illumina style FASTQ files (from pipeline version 1.3 to 1.7) which encode PHRED qualities using an ASCII offset of 64. For good quality reads, PHRED and Solexa scores are approximately equal, so the “fastq-solexa” and “fastq-illumina” variants are almost equivalent.                                                           |
| genbank or gb         | 1.43 | 1.48 / 1.51 | 1.52  | The GenBank or GenPept flat file format. Uses Bio.GenBank internally for parsing. Biopython 1.48 to 1.50 wrote basic GenBank files with only minimal annotation, while 1.51 onwards will also write the features table.                                                                                                                                                                     |
| ig                    | 1.47 | No          | 1.52  | This refers to the IntelliGenetics file format, apparently the same as the MASE alignment format.                                                                                                                                                                                                                                                                                           |
| imgt                  | 1.56 | 1.56        | 1.56  | This refers to the IMGT variant of the EMBL plain text file format.                                                                                                                                                                                                                                                                                                                         |
| nexus                 | 1.43 | 1.48        | No    | The NEXUS multiple alignment format, also known as PAUP format. Uses Bio.Nexus internally.                                                                                                                                                                                                                                                                                                  |
| pdb-seqres            | 1.61 | No          | No    | Reads a Protein Data Bank (PDB) file to determine the complete protein sequence as it appears in the header (no dependency on Bio.PDB and NumPy).                                                                                                                                                                                                                                           |
| pdb-atom              | 1.61 | No          | No    | Uses Bio.PDB to determine the (partial) protein sequence as it appears in the structure based on the atom coordinate section of the file (requires NumPy).                                                                                                                                                                                                                                  |
| phd                   | 1.46 | 1.52        | 1.52  | PHD files are output from PHRED, used by PHRAP and CONSED for input. Uses Bio.Sequencing.Phd internally.                                                                                                                                                                                                                                                                                    |
| phylip                | 1.43 | 1.43        | No    | PHYLIP files. Truncates names at 10 characters.                                                                                                                                                                                                                                                                                                                                             |
| pir                   | 1.48 | 1.71        | 1.52  | A “FASTA like” format introduced by the National Biomedical Research Foundation (NBRF) for the Protein Information Resource (PIR) database, now part of UniProt.                                                                                                                                                                                                                            |
| seqxml                | 1.58 | 1.58        | No    | Simple sequence XML file format.                                                                                                                                                                                                                                                                                                                                                            |
| sff                   | 1.54 | 1.54        | 1.54  | Standard Flowgram Format (SFF) binary files produced by Roche 454 and IonTorrent/IonProton sequencing machines.                                                                                                                                                                                                                                                                             |
| sff-trim              | 1.54 | No          | 1.54  | Standard Flowgram Format applying the trimming listed in the file.                                                                                                                                                                                                                                                                                                                          |
| stockholm             | 1.43 | 1.43        | No    | The Stockholm alignment format is also known as PFAM format.                                                                                                                                                                                                                                                                                                                                |
| swiss                 | 1.43 | No          | 1.52  | Swiss-Prot aka UniProt format. Uses Bio.SwissProt internally. See also the UniProt XML format. |
| tab                   | 1.48 | 1.48        | 1.52  | Simple two column tab separated sequence files, where each line holds a record’s identifier and sequence. For example, this is used by Agilent’s eArray software when saving microarray probes in a minimal tab delimited text file. |
| qual                  | 1.50 | 1.50        | 1.52  | Qual files are a bit like FASTA files but instead of the sequence, record space separated integer sequencing values as PHRED quality scores. A matched pair of FASTA and QUAL files is often used as an alternative to a single FASTQ file. |
| uniprot-xml           | 1.56 | No          | 1.56  | UniProt XML format, successor to the plain text Swiss-Prot format.                                                                                                                                                                                                                                                                                                                          |
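
To make the idea of parsing by hand concrete, here is a minimal sketch of a FASTA parser written with plain Python. The function name `parse_fasta` and the two toy records are made up for illustration, and the sketch ignores most of the edge cases a real parser has to handle; it is not how Biopython implements its own parser.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted text handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are collected until the next header.
            chunks.append(line)
    # Emit the last record once the input ends.
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record FASTA text, used here instead of a real file.
example = io.StringIO(
    ">seq1 first toy record\nACGT\nTTGA\n>seq2 second toy record\nGGCC\n"
)

parsed = list(parse_fasta(example))
for header, sequence in parsed:
    print(header, sequence, len(sequence))
```

Even for a format as simple as FASTA, the code has to track state across lines; formats such as GenBank need far more of this bookkeeping, which is the work SeqIO takes off our hands.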

In the example below we open a file called "ls_orchid.fasta", which is in FASTA format.

In [1]:
from Bio import SeqIO
records = SeqIO.parse("ls_orchid.fasta", "fasta")
for seq_record in records:
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
gi|2765657|emb|Z78532.1|CCZ78532
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC', SingleLetterAlphabet())
753
gi|2765656|emb|Z78531.1|CFZ78531
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA', SingleLetterAlphabet())
748
gi|2765655|emb|Z78530.1|CMZ78530
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAAACAACAT...CAT', SingleLetterAlphabet())
744
gi|2765654|emb|Z78529.1|CLZ78529
Seq('ACGGCGAGCTGCCGAAGGACATTGTTGAGACAGCAGAATATACGATTGAGTGAA...AAA', SingleLetterAlphabet())
733
gi|2765652|emb|Z78527.1|CYZ78527
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...CCC', SingleLetterAlphabet())
718
gi|2765651|emb|Z78526.1|CGZ78526
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...TGT', SingleLetterAlphabet())
730
gi|2765650|emb|Z78525.1|CAZ78525
Seq('TGTTGAGATAGCAGAATATACATCGAGTGAATCCGGAGGACCTGTGGTTATTCG...GC

### 3.1 Loading SeqRecords into a list

Because parse() returns an iterator, each iteration yields one SeqRecord, but you cannot go back to an earlier sequence or jump straight to the last one. One way around this is to read all the sequences and store them in a list, for example with a list comprehension. The advantage is that we can traverse the list in any order and access any sequence at any time. The drawback is that the whole file is loaded into RAM. Since the files used in bioinformatics are rarely small, this can become a serious problem.

In [2]:
records = [seq_record for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta")]

for record in records:
    print(record)

ID: gi|2765658|emb|Z78533.1|CIZ78533
Name: gi|2765658|emb|Z78533.1|CIZ78533
Description: gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
Number of features: 0
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
ID: gi|2765657|emb|Z78532.1|CCZ78532
Name: gi|2765657|emb|Z78532.1|CCZ78532
Description: gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
Number of features: 0
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC', SingleLetterAlphabet())
ID: gi|2765656|emb|Z78531.1|CFZ78531
Name: gi|2765656|emb|Z78531.1|CFZ78531
Description: gi|2765656|emb|Z78531.1|CFZ78531 C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA
Number of features: 0
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA', SingleLetterAlphabet())
ID: gi|2765655|emb|Z78530.1|CMZ78530
Name: gi|2765655|emb|Z78530.1|CMZ78530
Description: gi|2765655|emb|Z78530.1|CMZ78530 C.margaritaceum 5

Another way to build a list of SeqRecords is to pass the iterator to list().

In [3]:
from Bio import SeqIO
records = list(SeqIO.parse("ls_orchid.fasta", "fasta"))
print("%s\n" % type(records))

print("Found %i records" % len(records))

print("The last record")
last_record = records[-1] #using Python's list tricks
print(last_record.id)
print(last_record.seq)
print(len(last_record))

print("The first record")
first_record = records[0] #remember, Python counts from zero
print(first_record.id)
print(first_record.seq)
print(len(first_record))

<class 'list'>

Found 94 records
The last record
gi|2765564|emb|Z78439.1|PBZ78439
CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTCACCCATGGGCATTTGCTGTTGAAGTGACCTAGATTTGCCATCGAGCCTCCTTGGGAGCTTTCTTGTTGGCGAGATCTAAACCCCTGCCCGGCGGAGTTGGGCGCCAAGTCATATGACACATAATTGGTGAAGGGGGTGGTAATCCTGCCCTGACCCTCCCCAAATTATTTTTTTAACAACTCTCAGCAACGGATATCTCGGCTCTTGCATCGATGAAGAACGCAGCGAAATGCGATAATGGTGTGAATTGCAGAATCCCGTGAACATCGAGTCTTTGAACGCAAGTTGCGCCCGAGGCCATCAGGCCAAGGGCACGCCTGCCTGGGCATTGCGAGTCATATCTCTCCCTTAATGAGGCTGTCCATACATACTGTTCAGCCGGTGCGGATGTGAGTTTGGCCCCTTGTTCTTTGGTACGGGGGGTCTAAGAGCTGCATGGGCTTTGGATGGTCCTAAATACGGAAAGAGGTGGACGAACTATGCTACAACAAAATTGTTGTGCAAATGCCCCGGTTGGCCGTTTAGTTGGGCC
592
The first record
gi|2765658|emb|Z78533.1|CIZ78533
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGGCCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAAAGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGT
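
The one-pass behaviour described at the start of this section can be illustrated with an ordinary Python generator. `toy_records` is a hypothetical stand-in for a parser, not part of Biopython; the iterator returned by SeqIO.parse() is expected to be consumed in the same way.

```python
def toy_records():
    """Hypothetical stand-in for a parser: yields one record name at a time."""
    for name in ["rec1", "rec2", "rec3"]:
        yield name


records = toy_records()
first_pass = [name for name in records]
second_pass = [name for name in records]  # the generator is already exhausted

print(first_pass)
print(second_pass)

# Materializing the records in a list allows random access and repeated passes,
# at the cost of keeping every record in memory.
as_list = list(toy_records())
print(as_list[-1], as_list[0])
```

The second pass comes back empty, which is why the list versions above call SeqIO.parse() again instead of reusing an iterator that has already been looped over.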

### 3.3 Parser for the GenBank format

Although FASTA is one of the most practical formats in bioinformatics, it is not necessarily the best one. Because it is so simple, it lacks many of the annotations that NCBI's GenBank format carries. Let's take a look at GenBank now.

In [4]:
records = SeqIO.parse("ls_orchid.gbk", "genbank")
for record in records:
    print("Id: %s" % record.id)
    print("Name: %s" % record.name)
    print("Description: %s" % record.description)
    for feature in record.features:
        print(feature)
    for annotation in record.annotations:
        print("%s: %s" % (annotation, record.annotations[annotation]))
        
    print("-"*100+"\n")

Id: Z78533.1
Name: Z78533
Description: C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
type: source
location: [0:740](+)
qualifiers:
    Key: db_xref, Value: ['taxon:49711']
    Key: mol_type, Value: ['genomic DNA']
    Key: organism, Value: ['Cypripedium irapeanum']

type: misc_feature
location: [0:380](+)
qualifiers:
    Key: note, Value: ['internal transcribed spacer 1']

type: gene
location: [380:550](+)
qualifiers:
    Key: gene, Value: ['5.8S rRNA']

type: rRNA
location: [380:550](+)
qualifiers:
    Key: gene, Value: ['5.8S rRNA']
    Key: product, Value: ['5.8S ribosomal RNA']

type: misc_feature
location: [550:740](+)
qualifiers:
    Key: note, Value: ['internal transcribed spacer 2']

molecule_type: DNA
topology: linear
data_file_division: PLN
date: 30-NOV-2006
accessions: ['Z78533']
sequence_version: 1
gi: 2765658
keywords: ['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2']
source: Cypripedium irapeanum
organism: Cypripedium irapeanum
tax

accessions: ['Z78514']
sequence_version: 1
gi: 2765639
keywords: ['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2']
source: Phragmipedium schlimii
organism: Phragmipedium schlimii
taxonomy: ['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'Liliopsida', 'Asparagales', 'Orchidaceae', 'Cypripedioideae', 'Phragmipedium']
references: [Reference(title='Phylogenetics of the slipper orchids (Cypripedioideae: Orchidaceae): nuclear rDNA ITS sequences', ...), Reference(title='Direct Submission', ...)]
----------------------------------------------------------------------------------------------------

Id: Z78513.1
Name: Z78513
Description: P.besseae 5.8S rRNA gene and ITS1 and ITS2 DNA
type: source
location: [0:742](+)
qualifiers:
    Key: db_xref, Value: ['taxon:53125']
    Key: mol_type, Value: ['genomic DNA']
    Key: organism, Value: ['Phragmipedium besseae']

type: misc_feature
location: [0:38

data_file_division: PLN
date: 30-NOV-2006
accessions: ['Z78481']
sequence_version: 1
gi: 2765606
keywords: ['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2']
source: Paphiopedilum insigne
organism: Paphiopedilum insigne
taxonomy: ['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'Liliopsida', 'Asparagales', 'Orchidaceae', 'Cypripedioideae', 'Paphiopedilum']
references: [Reference(title='Phylogenetics of the slipper orchids (Cypripedioideae: Orchidaceae): nuclear rDNA ITS sequences', ...), Reference(title='Direct Submission', ...)]
----------------------------------------------------------------------------------------------------

Id: Z78480.1
Name: Z78480
Description: P.gratrixianum 5.8S rRNA gene and ITS1 and ITS2 DNA
type: source
location: [0:587](+)
qualifiers:
    Key: db_xref, Value: ['taxon:53090']
    Key: mol_type, Value: ['genomic DNA']
    Key: organism, Value: ['Paphiopedilum 


type: rRNA
location: [380:550](+)
qualifiers:
    Key: gene, Value: ['5.8S rRNA']
    Key: product, Value: ['5.8S ribosomal RNA']

type: misc_feature
location: [550:739](+)
qualifiers:
    Key: note, Value: ['internal transcribed spacer 2']

molecule_type: DNA
topology: linear
data_file_division: PLN
date: 30-NOV-2006
accessions: ['Z78457']
sequence_version: 1
gi: 2765582
keywords: ['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2']
source: Paphiopedilum callosum
organism: Paphiopedilum callosum
taxonomy: ['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'Liliopsida', 'Asparagales', 'Orchidaceae', 'Cypripedioideae', 'Paphiopedilum']
references: [Reference(title='Phylogenetics of the slipper orchids (Cypripedioideae: Orchidaceae): nuclear rDNA ITS sequences', ...), Reference(title='Direct Submission', ...)]
-----------------------------------------------------------------------------------

With all the information split into attributes, it becomes much easier to access it and, for example, build a list of the organisms present in the sequences:

In [5]:
records = SeqIO.parse("ls_orchid.gbk", "genbank")
all_species = [record.annotations["organism"] for record in records]
print(all_species)

['Cypripedium irapeanum', 'Cypripedium californicum', 'Cypripedium fasciculatum', 'Cypripedium margaritaceum', 'Cypripedium lichiangense', 'Cypripedium yatabeanum', 'Cypripedium guttatum', 'Cypripedium acaule', 'Cypripedium formosanum', 'Cypripedium himalaicum', 'Cypripedium macranthon', 'Cypripedium calceolus', 'Cypripedium segawai', 'Cypripedium parviflorum var. pubescens', 'Cypripedium reginae', 'Cypripedium flavum', 'Cypripedium passerinum', 'Mexipedium xerophyticum', 'Phragmipedium schlimii', 'Phragmipedium besseae', 'Phragmipedium wallisii', 'Phragmipedium exstaminodium', 'Phragmipedium caricinum', 'Phragmipedium pearcei', 'Phragmipedium longifolium', 'Phragmipedium lindenii', 'Phragmipedium lindleyanum', 'Phragmipedium sargentianum', 'Phragmipedium kaiteurum', 'Phragmipedium czerwiakowianum', 'Phragmipedium boissierianum', 'Phragmipedium caudatum', 'Phragmipedium warszewiczianum', 'Paphiopedilum micranthum', 'Paphiopedilum malipoense', 'Paphiopedilum delenatii', 'Paphiopedilum a

### 3.4 SeqIO to Dict

An alternative to turning the iterator into a list is the SeqIO.to_dict() function. Instead of a list it builds a dictionary, by default using the id of each SeqRecord as the key. This makes direct access to the sequences of interest much easier, but, like a list, it holds every record in memory, so it should be used with care.

In [6]:
orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.gbk", "genbank"))
len(orchid_dict)

94

In [16]:
list(orchid_dict.keys())

['Z78533.1',
 'Z78532.1',
 'Z78531.1',
 'Z78530.1',
 'Z78529.1',
 'Z78527.1',
 'Z78526.1',
 'Z78525.1',
 'Z78524.1',
 'Z78523.1',
 'Z78522.1',
 'Z78521.1',
 'Z78520.1',
 'Z78519.1',
 'Z78518.1',
 'Z78517.1',
 'Z78516.1',
 'Z78515.1',
 'Z78514.1',
 'Z78513.1',
 'Z78512.1',
 'Z78511.1',
 'Z78510.1',
 'Z78509.1',
 'Z78508.1',
 'Z78507.1',
 'Z78506.1',
 'Z78505.1',
 'Z78504.1',
 'Z78503.1',
 'Z78502.1',
 'Z78501.1',
 'Z78500.1',
 'Z78499.1',
 'Z78498.1',
 'Z78497.1',
 'Z78496.1',
 'Z78495.1',
 'Z78494.1',
 'Z78493.1',
 'Z78492.1',
 'Z78491.1',
 'Z78490.1',
 'Z78489.1',
 'Z78488.1',
 'Z78487.1',
 'Z78486.1',
 'Z78485.1',
 'Z78484.1',
 'Z78483.1',
 'Z78482.1',
 'Z78481.1',
 'Z78480.1',
 'Z78479.1',
 'Z78478.1',
 'Z78477.1',
 'Z78476.1',
 'Z78475.1',
 'Z78474.1',
 'Z78473.1',
 'Z78472.1',
 'Z78471.1',
 'Z78470.1',
 'Z78469.1',
 'Z78468.1',
 'Z78467.1',
 'Z78466.1',
 'Z78465.1',
 'Z78464.1',
 'Z78463.1',
 'Z78462.1',
 'Z78461.1',
 'Z78460.1',
 'Z78459.1',
 'Z78458.1',
 'Z78457.1',
 'Z78456.1',

In [8]:
list(orchid_dict.values())[:5]

[SeqRecord(seq=Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA()), id='Z78533.1', name='Z78533', description='C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA', dbxrefs=[]),
 SeqRecord(seq=Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC', IUPACAmbiguousDNA()), id='Z78532.1', name='Z78532', description='C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA', dbxrefs=[]),
 SeqRecord(seq=Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA', IUPACAmbiguousDNA()), id='Z78531.1', name='Z78531', description='C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA', dbxrefs=[]),
 SeqRecord(seq=Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAAACAACAT...CAT', IUPACAmbiguousDNA()), id='Z78530.1', name='Z78530', description='C.margaritaceum 5.8S rRNA gene and ITS1 and ITS2 DNA', dbxrefs=[]),
 SeqRecord(seq=Seq('ACGGCGAGCTGCCGAAGGACATTGTTGAGACAGCAGAATATACGATTGAGTGAA...AAA', IUPACAmbiguousDNA()), id='Z78529.1', name='Z78529', descrip

In [9]:
record = orchid_dict["Z78475.1"]
print(record.description)
print(repr(record.seq))

P.supardii 5.8S rRNA gene and ITS1 and ITS2 DNA
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT', IUPACAmbiguousDNA())
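
When a file has long NCBI-style identifiers, as FASTA files often do, the default keys can be awkward to type. The sketch below assumes Biopython is installed and uses the `key_function` argument of SeqIO.to_dict() to key the dictionary by accession instead. The toy FASTA text and the helper name `accession` are invented for this example.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical toy FASTA text with NCBI-style headers, standing in for a real file.
fasta_text = (
    ">gi|1|emb|TOY001.1|AAA toy record one\nACGTACGT\n"
    ">gi|2|emb|TOY002.1|BBB toy record two\nGGCCTTAA\n"
)


def accession(record):
    """Return the fourth '|'-separated field of the record id (the accession)."""
    return record.id.split("|")[3]


toy_dict = SeqIO.to_dict(
    SeqIO.parse(StringIO(fasta_text), "fasta"),
    key_function=accession,
)

print(sorted(toy_dict))
print(toy_dict["TOY001.1"].seq)
```

The helper relies on every id having at least four fields separated by "|"; an id in a different layout would raise an IndexError, so a real script would need to check for that.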


### 3.5 Writing SeqRecords to a file

Just as SeqIO lets us read a file, it also lets us write our SeqRecords to one. To illustrate, we create three SeqRecords and collect them in a list. We then use SeqIO.write() to write the sequences to a file in FASTA format.

In [10]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

rec1 = SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD" \
                    +"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK" \
                    +"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM" \
                    +"SSAC", generic_protein),
                 id="gi|14150838|gb|AAK54648.1|AF376133_1",
                 description="chalcone synthase [Cucumis sativus]")

rec2 = SeqRecord(Seq("YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ" \
                    +"DMVVVEIPKLGKEAAVKAIKEWGQ", generic_protein),
                 id="gi|13919613|gb|AAK33142.1|",
                 description="chalcone synthase [Fragaria vesca subsp. bracteata]")

rec3 = SeqRecord(Seq("MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC" \
                    +"EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP" \
                    +"KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN" \
                    +"NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV" \
                    +"SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW" \
                    +"IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT" \
                    +"TGEGLEWGVLFGFGPGLTVETVVLHSVAT", generic_protein),
                 id="gi|13925890|gb|AAK49457.1|",
                 description="chalcone synthase [Nicotiana tabacum]")

my_records = [rec1, rec2, rec3]

In [11]:
from Bio import SeqIO
SeqIO.write(my_records, "minhas_proteinas.fa", "fasta")

3

Next we open that file and check that the sequences were written correctly.

In [12]:
for record in SeqIO.parse("minhas_proteinas.fa", "fasta"):
    print(record.id)
    print(record.seq)
    print(len(record))

gi|14150838|gb|AAK54648.1|AF376133_1
MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGDGAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNMSSAC
184
gi|13919613|gb|AAK33142.1|
YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQDMVVVEIPKLGKEAAVKAIKEWGQ
84
gi|13925890|gb|AAK49457.1|
MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMCEKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQPKSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAENNKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELVSAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFWIAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGTTGEGLEWGVLFGFGPGLTVETVVLHSVAT
389
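
SeqIO.write() also accepts an open handle in place of a file name, which makes it easy to inspect the text it produces. The following is a minimal sketch that assumes Biopython is installed; the two short peptides and their ids are invented. It leaves out the alphabet argument used above, because the Bio.Alphabet module was removed in Biopython 1.78 and the Seq constructor works without it.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical toy records, much shorter than real proteins.
toy_records = [
    SeqRecord(Seq("MKV"), id="toy1", description="hypothetical peptide one"),
    SeqRecord(Seq("MLA"), id="toy2", description="hypothetical peptide two"),
]

# Write to an in-memory text buffer instead of a file on disk.
buffer = StringIO()
count = SeqIO.write(toy_records, buffer, "fasta")
fasta_text = buffer.getvalue()

print(count)  # number of records written
print(fasta_text)

# Read the text back to confirm that the records survive the round trip.
round_trip_ids = [rec.id for rec in SeqIO.parse(StringIO(fasta_text), "fasta")]
print(round_trip_ids)
```

The return value is the number of records written, which is the `3` shown after the write cell above.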


## 4 Retrieving sequences directly from NCBI

If the sequence you are working with is in NCBI, you can retrieve it directly from the web by its id. After importing the Entrez module, set the email address that will accompany your requests. This address only identifies who is making the requests, so that one person can be kept from sending too many at once. With efetch() you choose the database to query, the return format and the id of the sequence, and it returns a handle to that sequence. You can then pass the handle to SeqIO to parse it.

In [13]:
from Bio import Entrez

Entrez.email = "osvaldor@unicamp.br"
with Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text", id="NM_015565.2") as handle:
    record = SeqIO.read(handle, "fasta")
    print(">%s\n%s...%s" % (record.id, record.seq[:70], record.seq[-10:]))

>NM_015565.2
AGTGGGGCAGCAAATGGACAGGGTGGGTGGCGGAAAAGGGCCCGGGGGAAGTTATTACAGGGTGTCCTCT...TTGTACTTCT


In the same way, you can also download the sequence in GenBank format.

In [14]:
with Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id="NM_015565.2") as handle:
    record = SeqIO.read(handle, "gb")
    print(">%s\n%s...%s" % (record.id, record.seq[:70],record.seq[-10:]))
    for feature in record.features:
        print(feature)

>NM_015565.2
AGTGGGGCAGCAAATGGACAGGGTGGGTGGCGGAAAAGGGCCCGGGGGAAGTTATTACAGGGTGTCCTCT...TTGTACTTCT
type: source
location: [0:7756](+)
qualifiers:
    Key: chromosome, Value: ['21']
    Key: db_xref, Value: ['taxon:9606']
    Key: map, Value: ['21q21.3']
    Key: mol_type, Value: ['mRNA']
    Key: organism, Value: ['Homo sapiens']

type: gene
location: [0:7756](+)
qualifiers:
    Key: db_xref, Value: ['GeneID:26046', 'HGNC:HGNC:13082', 'MIM:613083']
    Key: gene, Value: ['LTN1']
    Key: gene_synonym, Value: ['C21orf10; C21orf98; RNF160; ZNF294']
    Key: note, Value: ['listerin E3 ubiquitin protein ligase 1']

type: exon
location: [0:193](+)
qualifiers:
    Key: gene, Value: ['LTN1']
    Key: gene_synonym, Value: ['C21orf10; C21orf98; RNF160; ZNF294']
    Key: inference, Value: ['alignment:Splign:2.1.0']

type: CDS
location: [13:5452](+)
qualifiers:
    Key: EC_number, Value: ['2.3.2.27']
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['CCDS:CCDS33527.2', 'GeneID:26046', 'H

It is also possible to fetch a comma-separated list of ids.

In [15]:
with Entrez.efetch(db="nucleotide", rettype="gb", retmode="text",
                   id="6273291,6273290,6273289") as handle:
    for record in SeqIO.parse(handle, "gb"):
        print("%s %s..." % (record.id, record.description[:50]))
    

AF191665.1 Opuntia marenae rpl16 gene; chloroplast gene for c...
AF191664.1 Opuntia clavata rpl16 gene; chloroplast gene for c...
AF191663.1 Opuntia bradtiana rpl16 gene; chloroplast gene for...
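
The examples above use SeqIO.read() when a single id is requested and SeqIO.parse() when several are. The sketch below shows why, without touching the network: in-memory text stands in for the handle that Entrez.efetch() would return, and the record names are invented. It assumes Biopython is installed.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical FASTA text standing in for the handle returned by Entrez.efetch().
single = ">TOY_1 hypothetical single record\nACGTACGTAC\n"
double = single + ">TOY_2 hypothetical second record\nTTGGCCAA\n"

# SeqIO.read() expects exactly one record in the handle.
record = SeqIO.read(StringIO(single), "fasta")
print(record.id, len(record))

# With more than one record, SeqIO.read() raises a ValueError.
try:
    SeqIO.read(StringIO(double), "fasta")
    read_failed = False
except ValueError as error:
    read_failed = True
    print("SeqIO.read refused two records:", error)

# SeqIO.parse() handles any number of records.
ids = [rec.id for rec in SeqIO.parse(StringIO(double), "fasta")]
print(ids)
```

Using SeqIO.read() for single-id requests doubles as a sanity check: if the response unexpectedly contains zero or several records, the script fails early instead of silently using the wrong one.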


### Exercise 2

Open the 'ls_orchid.fasta' file from the previous examples, select only the sequences with the ids below and compute their reverse complement. Finally, save everything to a file in FASTA format.

Z78533.1

Z78532.1

Z78531.1

Z78440.1

Z78439.1