# SeqIO Biopython

In order to use the SeqIO module, first we need to import it:

In [2]:
from Bio import SeqIO

### Excellent material on the SeqRecord objects that SeqIO returns:
https://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html

`SeqIO.parse` returns a generator that yields one record (a `SeqRecord` object) per entry in the file:

In [75]:
myseq = SeqIO.parse("Cebus_albifrons_NC_002763.1.gb", "genbank")
print(myseq)

<generator object parse at 0x7fd7441b24f8>


So, in order to use it, we need to iterate over its values. Doing so in a for loop and printing each record returns a summary of the GenBank file:

In [172]:
for i in SeqIO.parse("Cebus_albifrons_NC_002763.1.gb", "genbank"):
    print(i)

ID: NC_002763.1
Name: NC_002763
Description: Cebus albifrons mitochondrion, complete genome
Database cross-references: Project:11945, BioProject:PRJNA11945
Number of features: 55
/molecule_type=DNA
/topology=circular
/data_file_division=PRI
/date=01-FEB-2010
/accessions=['NC_002763']
/sequence_version=1
/keywords=['RefSeq']
/source=mitochondrion Cebus albifrons (white-fronted capuchin)
/organism=Cebus albifrons
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Platyrrhini', 'Cebidae', 'Cebinae', 'Cebus']
/references=[Reference(title='Molecular estimates of primate divergences and new hypotheses for primate dispersal and the origin of modern humans', ...), Reference(title='Direct Submission', ...), Reference(title='Direct Submission', ...)]
/comment=REVIEWED REFSEQ: This record has been curated by NCBI staff. The
reference sequence was derived from AJ309866.
COMPLETENESS: full 
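
Because `SeqIO.parse` hands back a generator, it can be walked only once, which is presumably why the cells in this notebook call `SeqIO.parse` again instead of reusing `myseq`. The sketch below uses a plain Python generator, `fake_parse` (a hypothetical stand-in, not part of Biopython), to show that behaviour without needing the GenBank file:

```python
def fake_parse(record_ids):
    """Hypothetical stand-in for SeqIO.parse: yields one item per record."""
    for record_id in record_ids:
        yield record_id


records = fake_parse(["NC_002763.1"])

first_pass = list(records)   # consumes the generator
second_pass = list(records)  # nothing is left to yield

print(first_pass)   # ['NC_002763.1']
print(second_pass)  # []
```

If the records are needed more than once, wrapping the parser in `list(...)` keeps them in memory, at the cost of loading the whole file.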

If we look at the attributes of the record object, we will see several attributes and methods (e.g. id, features, annotations, seq, format, etc.) that provide direct access to information from the GenBank file:

In [70]:
for i in SeqIO.parse("Cebus_albifrons_NC_002763.1.gb", "genbank"):
    print(dir(i))

['__add__', '__bool__', '__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__le___', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_per_letter_annotations', '_seq', '_set_per_letter_annotations', '_set_seq', 'annotations', 'dbxrefs', 'description', 'features', 'format', 'id', 'letter_annotations', 'lower', 'name', 'reverse_complement', 'seq', 'translate', 'upper']


For instance, if we print the "annotations" attribute, we will get a dictionary that can be used to obtain specific information:

In [111]:
for i in SeqIO.parse("Cebus_albifrons_NC_002763.1.gb", "genbank"):
    print(i.annotations)
    print(type(i.annotations))

{'molecule_type': 'DNA', 'topology': 'circular', 'data_file_division': 'PRI', 'date': '01-FEB-2010', 'accessions': ['NC_002763'], 'sequence_version': 1, 'keywords': ['RefSeq'], 'source': 'mitochondrion Cebus albifrons (white-fronted capuchin)', 'organism': 'Cebus albifrons', 'taxonomy': ['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Platyrrhini', 'Cebidae', 'Cebinae', 'Cebus'], 'references': [Reference(title='Molecular estimates of primate divergences and new hypotheses for primate dispersal and the origin of modern humans', ...), Reference(title='Direct Submission', ...), Reference(title='Direct Submission', ...)], 'comment': 'REVIEWED REFSEQ: This record has been curated by NCBI staff. The\nreference sequence was derived from AJ309866.\nCOMPLETENESS: full length.'}
<class 'dict'>


Suppose we want to get the species name. "organism" is a key in that dictionary, and its value is the name of the species. So, we can extract this information with the dictionary's "get" method:

In [60]:
for i in SeqIO.parse("Cebus_albifrons_NC_002763.1.gb", "genbank"):
    print(i.annotations.get("organism"))

Cebus albifrons
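
The advantage of `get` over square brackets is that a missing key returns `None` (or a default of our choosing) instead of raising `KeyError`, which can help when looping over records whose annotations differ. A small sketch with a plain dictionary holding a few entries copied from the output above; the `"strain"` key is a hypothetical example of a key this record does not have:

```python
# A few entries copied from the annotations dictionary printed earlier
annotations = {
    "organism": "Cebus albifrons",
    "topology": "circular",
    "taxonomy": ["Eukaryota", "Metazoa", "Chordata"],  # shortened for the example
}

organism = annotations.get("organism")
# "strain" is a hypothetical key that is absent here, so the default is returned
strain = annotations.get("strain", "not available")

print(organism)
print(strain)
```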


We can also use the "format" method to convert a GenBank record into FASTA text:

In [87]:
for i in SeqIO.parse("Cebus_albifrons_NC_002763.1.gb", "genbank"):
    print(i.format("fasta"))

>NC_002763.1 Cebus albifrons mitochondrion, complete genome
GTTAATGTAGCTTAATACTCAAAGCAAGGCACTGAAAATGCCTAGACGGGTATTTACTAC
CCCATAAACACACAGGTTTGGTCCTAGCCTTTCTATTAGCCCTCAGTGAGATTACACATG
CAAGCATCTACTATCCTGTGAGAATGCCCTCTAGAACACCAAATTATGAGGAGCGAGTAT
CAAGCACGCATATATGCAGCTCAAAACACTTTGCTTAGCCACACCCCCACGGGAAACAGC
AGTGACAAACTTTTAGCAATAAACGAAAGTTTAACTAAGCTACACTGACAATAGAGTTGG
TCAATTTCGTGCCAGCCACCGCGGCCATACGATTAACTCAAGTTAATAAAGTCCGGCGTA
AAGAGTGTTTAAGGCCCCACCCTCAATAAAGCTAACCTATAACTAAGTTGTGGAAAACTC
CAGTTATAGTGAAATACCCTACGAAAGTGGCTTTAATATTCCTGAATACACCATAGCTAA
GACACAAACTGGGATTAGATACCCCACTATGCCTAGCCCTAAACTCCAATAACTCTACCA
ACAAAATTACTCGCCAGAACACTACAAGCAATAGCTTGAAACTCAAAGGACCTGGCGGTG
CTTTACATCCGTCTAGAGGAGCCTGTTCTGTAATCGATATACCCCGATAAACCTTACCAC
CTCTTGCCCCCAGCCTGTATACCGCCATCCTCAGCAAACTCCCTAAAGATCGTAAAGTAA
GCAAAAGTATTACCATAAAAACGTTAGGTCAAGGTGCAGCCAATGAAGTGGAAAGAAATG
GGCTACATTTTCTAATCTAGAAAATTACACGATAGCCTTTATGAAATTTAAAGGCCCAAG
GTGGATTTAGCAGTAAATCAAGAATAGAGAGCTTGATTGAAGCAAGGCCATTAAGCACGC
ACACACCGCCCGTCACCCTCCTCAA
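
The FASTA text above has a simple layout: a header line starting with `>` that joins the record id and description, followed by the sequence in 60-character lines. The function below is a plain-Python sketch of that layout, not Biopython's implementation; `to_fasta` and its parameters are names made up for this example:

```python
def to_fasta(record_id, description, sequence, width=60):
    """Build FASTA text: a '>' header line, then the sequence in fixed-width lines."""
    header = ">{} {}".format(record_id, description).rstrip()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


# First 70 bases of the record (the tRNA-Phe region extracted later in this notebook)
seq = "GTTAATGTAGCTTAATACTCAAAGCAAGGCACTGAAAATGCCTAGACGGGTATTTACTACCCCATAAACA"
fasta_text = to_fasta("NC_002763.1", "Cebus albifrons mitochondrion, complete genome", seq)
print(fasta_text)
```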

We can also get the reverse complement of a sequence. Calling reverse_complement with no arguments returns a new record for the other strand, whose id, name, and description are left unset:

In [92]:
for i in SeqIO.parse("Cebus_albifrons_NC_002763.1.gb", "genbank"):
    print(i.reverse_complement())

ID: <unknown id>
Name: <unknown name>
Description: <unknown description>
Number of features: 55
Seq('TTAATTAGGGCCCAGTATAAGGATATAGCAGTAGGTTATATAGATTTGAGGTCA...AAC', IUPACAmbiguousDNA())
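
Note that the `Seq(...)` line ends in `AAC`, which is the reverse complement of the `GTT` that starts the forward strand. What the operation does to the letters can be sketched in plain Python; this is an illustration for unambiguous DNA only, not Biopython's code, and the function name is chosen just for this example:

```python
# Translation table mapping each base to its complement (unambiguous DNA only)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


start_of_record = "GTTAATGTAG"  # first 10 bases of the forward strand
result = reverse_complement(start_of_record)
print(result)  # CTACATTAAC
```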


But if we look at the attributes of the record returned by reverse_complement, we will see the same attributes and methods as before, which can be used to extract more specific data:

In [91]:
for i in SeqIO.parse("Cebus_albifrons_NC_002763.1.gb", "genbank"):
    print(dir(i.reverse_complement()))

['__add__', '__bool__', '__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__le___', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_per_letter_annotations', '_seq', '_set_per_letter_annotations', '_set_seq', 'annotations', 'dbxrefs', 'description', 'features', 'format', 'id', 'letter_annotations', 'lower', 'name', 'reverse_complement', 'seq', 'translate', 'upper']


In [113]:
for i in SeqIO.parse("Cebus_albifrons_NC_002763.1.gb", "genbank"):
    print(i.seq)

GTTAATGTAGCTTAATACTCAAAGCAAGGCACTGAAAATGCCTAGACGGGTATTTACTACCCCATAAACACACAGGTTTGGTCCTAGCCTTTCTATTAGCCCTCAGTGAGATTACACATGCAAGCATCTACTATCCTGTGAGAATGCCCTCTAGAACACCAAATTATGAGGAGCGAGTATCAAGCACGCATATATGCAGCTCAAAACACTTTGCTTAGCCACACCCCCACGGGAAACAGCAGTGACAAACTTTTAGCAATAAACGAAAGTTTAACTAAGCTACACTGACAATAGAGTTGGTCAATTTCGTGCCAGCCACCGCGGCCATACGATTAACTCAAGTTAATAAAGTCCGGCGTAAAGAGTGTTTAAGGCCCCACCCTCAATAAAGCTAACCTATAACTAAGTTGTGGAAAACTCCAGTTATAGTGAAATACCCTACGAAAGTGGCTTTAATATTCCTGAATACACCATAGCTAAGACACAAACTGGGATTAGATACCCCACTATGCCTAGCCCTAAACTCCAATAACTCTACCAACAAAATTACTCGCCAGAACACTACAAGCAATAGCTTGAAACTCAAAGGACCTGGCGGTGCTTTACATCCGTCTAGAGGAGCCTGTTCTGTAATCGATATACCCCGATAAACCTTACCACCTCTTGCCCCCAGCCTGTATACCGCCATCCTCAGCAAACTCCCTAAAGATCGTAAAGTAAGCAAAAGTATTACCATAAAAACGTTAGGTCAAGGTGCAGCCAATGAAGTGGAAAGAAATGGGCTACATTTTCTAATCTAGAAAATTACACGATAGCCTTTATGAAATTTAAAGGCCCAAGGTGGATTTAGCAGTAAATCAAGAATAGAGAGCTTGATTGAAGCAAGGCCATTAAGCACGCACACACCGCCCGTCACCCTCCTCAACCATCATGTAGAAAATATATTAACAATTTAATCCGCTTCATATTGATGCAGAGGGGATAAGTCGTAACATGGTAA

In [123]:
for i in SeqIO.parse("Cebus_albifrons_NC_002763.1.gb", "genbank"):
    header = i.description
    # Use the description as the new id; leave name and description empty
    print(i.reverse_complement(id=header, name="", description="").format("fasta"))

>Cebus albifrons mitochondrion, complete genome
TTAATTAGGGCCCAGTATAAGGATATAGCAGTAGGTTATATAGATTTGAGGTCAGGGCGA
TTATGTGGTATAGATCTGTTTTCTGAGGGATAGAGATTGAGGTATGCGGTAGGTAATTGT
ATCTAGTATTATTTTAATAGGTGTGTACTATGGGGTGTAAAATTCTATTTAATTTGTTGA
TTTATAAAATTTAGTTGCTAATGGTAAAATATCTCTTTACGACGCTCGCCCCAGAAAATA
TGGGATAAGGTTGAAGGGTTGGTGTGTTAAATTTTTGGGTATAGTGCTGTAATTATCTAT
TATGTCCTGAGACCATTGACTGAATAACACCTAGTGGGCGGGTTGGGGGCCAAGACATTC
AATTTAGGTGCGATGATAGCATAAGGCATGGCAGTACACAGACAGGTCCCACAGGACGGG
GGGCGGGACTTCTACCGGAGCCTACGGCAATGCTGAGGATTTCCCTCCCCCCCCCTTTCC
CCACCCAGGCACCCTATGCATCCAGTGACGCGGTTAAGAGGGTGATAGCGCCACACCATC
GTGATGTCTTATTTAAGAGGAACGTGGACACGGTCTTAGTGAGATGGCCCTGAGGTAGGA
ACCAAATGCCAGGTATAGTTTCAGTATAACCAAGCCCTGTCTATATGGGCCCGGAGCGAG
AAGAGCCGCAGAAGTGGGCGGGTTGGTGGTTTCACGGAGGTTGGTAGATTAAGAGACCAA
ATATACAAGGGGATATCCATGGTTAGAAGGACTGAAGAAATAGGATGCGCTATGTCCGGT
TAATCATTTGATGTGCAGATGTACTATCAATGATTCTATGGGCTGTACGATATCCGTGGT
GACTGTAATTGGAGCGTGGGTTTGAGTTTAATGCGCTATAGTCCGCTGTGAAATAATAGT
TCCTGCTTGTAAGCATGTTCTGTGACGTAGTATGCAC

In [130]:
for i in SeqIO.parse("Cebus_albifrons_NC_002763.1.gb", "genbank"):
    print(type(i.features))    
    print(i.features)

<class 'list'>
[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(16554), strand=1), type='source'), SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(70), strand=1), type='tRNA'), SeqFeature(FeatureLocation(ExactPosition(70), ExactPosition(1028), strand=1), type='rRNA'), SeqFeature(FeatureLocation(ExactPosition(1028), ExactPosition(1086), strand=1), type='tRNA'), SeqFeature(FeatureLocation(ExactPosition(1086), ExactPosition(2651), strand=1), type='rRNA'), SeqFeature(FeatureLocation(ExactPosition(2230), ExactPosition(2366), strand=1), type='STS'), SeqFeature(FeatureLocation(ExactPosition(2327), ExactPosition(2520), strand=1), type='STS'), SeqFeature(FeatureLocation(ExactPosition(2651), ExactPosition(2726), strand=1), type='tRNA'), SeqFeature(FeatureLocation(ExactPosition(2728), ExactPosition(3685), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(2728), ExactPosition(3685), strand=1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(3684), ExactP
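
The positions in each `FeatureLocation` are 0-based and exclude the end position, like Python slices, whereas the GenBank flat file writes the same feature with 1-based, inclusive coordinates. A short sketch of the conversion, using the start and end of the first rRNA feature in the list above:

```python
# Start and end of the first rRNA feature, as printed in the feature list
start, end = 70, 1028

length = end - start                             # half-open interval: end is excluded
genbank_style = "{}..{}".format(start + 1, end)  # 1-based, inclusive notation

print(length)
print(genbank_style)
```

This is also why consecutive features such as `[0:70]` and `[70:1028]` do not overlap: position 70 belongs only to the second one.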

We can also filter the features by type. For instance, we can print only the rRNA features, or generate a multi-FASTA containing all tRNA features from the GenBank file:

In [15]:
for i in SeqIO.parse("Cebus_albifrons_NC_002763.1.gb", "genbank"):
    for feature in i.features:
        if feature.type == "rRNA":
            print(feature)
        #    header = feature.qualifiers.get('product')[0]
        #    print(header)

type: rRNA
location: [70:1028](+)
qualifiers:
    Key: product, Value: ['12S ribosomal RNA']

type: rRNA
location: [1086:2651](+)
qualifiers:
    Key: product, Value: ['16S ribosomal RNA']
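
As the output shows, each qualifier value is a list, which is why the commented-out lines take element `[0]` of `feature.qualifiers.get('product')`. For a feature that has no `product` qualifier, `get` would return `None` and the `[0]` would fail. A defensive variant can be sketched with plain dictionaries standing in for the qualifiers; `product_name` is a hypothetical helper, and the contents of `source_qualifiers` are assumed for illustration:

```python
# Qualifiers behave like a dictionary whose values are lists of strings
rrna_qualifiers = {"product": ["12S ribosomal RNA"]}
source_qualifiers = {"organism": ["Cebus albifrons"]}  # assumed: no 'product' key


def product_name(qualifiers, default="unknown"):
    """Return the first 'product' value, or a default when the key is absent."""
    return qualifiers.get("product", [default])[0]


print(product_name(rrna_qualifiers))
print(product_name(source_qualifiers))
```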



In [176]:
for i in SeqIO.parse("Cebus_albifrons_NC_002763.1.gb", "genbank"):
    for feature in i.features:
        if feature.type == "tRNA":
            header = feature.qualifiers.get('product')[0]
            seq = feature.location.extract(i.seq)
            print(">{}\n{}".format(header, seq))

>tRNA-Phe
GTTAATGTAGCTTAATACTCAAAGCAAGGCACTGAAAATGCCTAGACGGGTATTTACTACCCCATAAACA
>tRNA-Val
CAAAGTGTAGCTTAAATTAAAGCATCTGGCCTACACCCAGAAGATCTCACAACAACCG
>tRNA-Leu
GTTAAGATGGCAGAGCCCGGCAATTGCATAAAACTTAAAACTTTACAATCAGAGGTTCAACTCCTCTTCTTAACA
>tRNA-Ile
AGAAATATGTCTGACAAAAGAATTACTTTGATAGAGTAAACTATAGAGGTTTAAATCCTCTTATTTCTA
>tRNA-Gln
TAGAGTATGGTGTAATAGGTAGCACGGAGGATTTTGAGTTCTTAGGAATAGGTTCGAGTCCTATAATTCTAG
>tRNA-Met
AGTAAGGTCAGCTAAATAAGCTATCGGGCCCATACCCCGAAAATGTTGGTCCAATCCTTCCCGTACTA
>tRNA-Trp
AGAAATTTAGGTTAATAAGACCAAGAGCCTTCAAAGCCCCTAGTAAGTAAATTTTACTTAATTTCTG
>tRNA-Ala
GAGGGCTTAGCTTAATTAAAGTAGTTGATTTGCGTTCAATTGATGCAAGGTATAGTTTGCAGTCCTTA
>tRNA-Asn
TAGATTGAAGCCAGTTGATTAGGTTATTTAGCTGTTAACTAAATTTTTGTGGGTTAAAGTCCCATCAGTCTAG
>tRNA-Cys
AGCCCTGAGGTGAACTGTCATGTTGAACTGCAAATTCAAAGAAGCAGCTTCAATGCTGCCGGGGCTT
>tRNA-Tyr
GGTAAAATGGCTGAGTAAGCATTAGACTGTAAATCTAAAGACAGAGAAGTGACCCCCTCTTTTTACCA
>tRNA-Ser
GAAAAAGTCATAGGGTCTATGGGATTGGCTTGAAACCAATTTTTGGGGGTTCAAATCCTTCCTTTTTCG
>tRNA-Asp
GAGATATTAGTAAAATAAATTACATAACTTTGTC

# Handling fasta files:

FASTA files carry far less information than GenBank files, but one useful attribute is the record's description, which holds the whole header line.
***(FINISH THIS LATER...)***
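
As a small illustration of the header handling: when SeqIO reads a FASTA file, the record's `id` is the first whitespace-delimited word of the header, while `description` keeps the whole line without the leading `>`. The sketch below imitates that split in plain Python; `split_header` is a hypothetical helper, not a Biopython function:

```python
def split_header(header_line):
    """Split a FASTA header line into (id, description)."""
    # Drop the leading '>' and any surrounding whitespace
    if header_line.startswith(">"):
        header_line = header_line[1:]
    description = header_line.strip()
    # The id is the first word; an empty header gives an empty id
    record_id = description.split(None, 1)[0] if description else ""
    return record_id, description


record_id, description = split_header(
    ">NC_002763.1 Cebus albifrons mitochondrion, complete genome"
)
print(record_id)
print(description)
```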

# Handling ace files:

Ace files can be treated as sequences or as alignments, as can be seen [here](https://biopython.org/wiki/Ace_contig_class)

These are only a few examples of what SeqIO (and the Biopython package in general) can do in only a few lines of code. Live long and prosper and happy parsing!