# Parsing GenBank sequences using SeqIO

The *SeqIO.parse()* function can also be used to parse sequence records stored in the GenBank flat file format.

We parse GenBank sequences the same way we parse FASTA sequences, to yield an *iterator* of *SeqRecord* objects. However, each *SeqRecord* object will now contain data for the *features* of the sequence.

In this Notebook we will analyze *ls_orchid.gbk*, which was downloaded from the following page:
https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.gbk

In [None]:
from Bio import SeqIO # to parse sequence data

with open('ls_orchid.gbk') as handle:
    sequences = SeqIO.parse(handle, "genbank")
    seq_record = next(sequences) # get the first sequence record

seq_record
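Because *SeqIO.parse()* returns a lazy iterator, records are only read as the iterator is consumed, so this has to happen while the handle is still open. The sketch below illustrates looping over every record rather than taking only the first one. So that it runs without *ls_orchid.gbk*, it first writes two made-up records (`toy1` and `toy2`, which are not part of the original data) to GenBank text in memory and then parses that text back.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Two toy records (hypothetical data, not from ls_orchid.gbk).
# The GenBank writer needs a molecule_type annotation.
toy_records = [
    SeqRecord(Seq("ATGAAACCCGGG"), id="toy1", name="toy1",
              description="toy record 1",
              annotations={"molecule_type": "DNA"}),
    SeqRecord(Seq("ATGTTTGGGCCCAAA"), id="toy2", name="toy2",
              description="toy record 2",
              annotations={"molecule_type": "DNA"}),
]

# Write the records as GenBank text into an in-memory handle
buffer = StringIO()
SeqIO.write(toy_records, buffer, "genbank")
buffer.seek(0)

# Consume the iterator while the handle is open, one record at a time
lengths = [len(record) for record in SeqIO.parse(buffer, "genbank")]
print(lengths)
```

With a real file, the same loop would go inside the `with open(...)` block in place of the `next(sequences)` call.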

Just like FASTA records, we can get the ID, length, sequence, etc. from the sequence record.

In [None]:
print()
print("ID = ", seq_record.id)
print("seq length = ", len(seq_record))
print() 
print(seq_record.seq)

Each sequence record contains a list of *SeqFeature* objects, stored in *seq_record.features*.

In [None]:
print(seq_record.features)

Let's get the third feature:

In [None]:
thirdFeature = seq_record.features[2]
thirdFeature

Each feature has a *type* and a *location*.

In [None]:
print("type = ", thirdFeature.type)
print("location = ", thirdFeature.location)
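Since every feature carries a *type*, one common pattern is to select features by type instead of by list index. The following is a minimal sketch on a hand-built record; the record `toy_record`, its sequence, and its three features are invented for illustration and do not come from *ls_orchid.gbk*.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation

# Hand-built toy record with three features of different types
toy_record = SeqRecord(
    Seq("ATGAAACCCGGGTTTTAA"),
    id="toy1",
    features=[
        SeqFeature(FeatureLocation(0, 18, strand=1), type="source"),
        SeqFeature(FeatureLocation(0, 18, strand=1), type="gene"),
        SeqFeature(FeatureLocation(0, 18, strand=1), type="CDS"),
    ],
)

# Keep only the features whose type is "CDS"
cds_features = [f for f in toy_record.features if f.type == "CDS"]

for f in cds_features:
    print(f.type, int(f.location.start), int(f.location.end), f.location.strand)
```

The same list comprehension should work on `seq_record.features` from the parsed file.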

The *feature location* is a special object that has a *start* and an *end* position, as well as information about the strand the feature is on (+1 for the given strand, -1 for its complement). Start and end positions may be *fuzzy* (e.g., <98..110). They are adjusted so that slicing with the *start* and *end* values returns the correct sequence, i.e., the sequence corresponding to the feature can be extracted using `seq[start:end]`.

Feature locations may also consist of multiple regions (e.g., across multiple exons). 

In [None]:
start = thirdFeature.location.start
end = thirdFeature.location.end
print("sequence of third feature:", seq_record.seq[start:end])
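Slicing with *start* and *end* always reads from the given strand and spans everything between the outermost positions. For features on the -1 strand, or features made of several regions, the `extract()` method of *SeqFeature* is one way to get the feature's own sequence. The sketch below uses a made-up sequence and made-up locations (none of them taken from *ls_orchid.gbk*) to show the difference.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation

# Toy sequence (hypothetical, not from ls_orchid.gbk)
toy_seq = Seq("AAACCCGGGTTT")

# A feature on the complementary strand covering positions [0:6]
minus_feature = SeqFeature(FeatureLocation(0, 6, strand=-1), type="gene")
plain_slice = toy_seq[0:6]                  # read from the given strand
minus_seq = minus_feature.extract(toy_seq)  # reverse complement of that slice

# A feature made of two regions, [0:3] and [6:9], on the forward strand;
# adding two locations gives a compound (multi-region) location
joined = FeatureLocation(0, 3, strand=1) + FeatureLocation(6, 9, strand=1)
joined_feature = SeqFeature(joined, type="CDS")

# The outer start/end slice also includes the gap between the two regions
outer_slice = toy_seq[int(joined.start):int(joined.end)]
# extract() concatenates only the regions that belong to the feature
spliced = joined_feature.extract(toy_seq)

print("plain slice:", plain_slice, "| extracted (-1 strand):", minus_seq)
print("outer slice:", outer_slice, "| extracted (joined):", spliced)
```

Here the plain slice is `AAACCC` while the extracted -1 strand sequence is `GGGTTT`, and the joined feature yields `AAAGGG` rather than the nine bases of the outer slice.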

Feature *qualifiers* (e.g., /organism="Cypripedium irapeanum") are stored in a dictionary. The qualifiers available will vary from one feature to another. Note that in the *qualifiers* dictionary, the values are always stored in lists.

In [None]:
firstFeature = seq_record.features[0]
firstFeature.qualifiers

Look up the organism for this feature. Note that the lookup returns a list, even when the qualifier has only one value.

In [None]:
print(firstFeature.qualifiers['organism'])

To get the organism, extract the first element of the list.

In [None]:
print(firstFeature.qualifiers['organism'][0])
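Because the available qualifiers vary between features, indexing with a key that is absent raises a `KeyError`. A small helper can return a default instead. This is only a sketch: `first_qualifier` is a hypothetical helper name, not part of Biopython, and `toy_source` is a hand-built feature used so the example runs on its own.

```python
from Bio.SeqFeature import SeqFeature, FeatureLocation

def first_qualifier(feature, key, default=None):
    """Return the first value of a qualifier, or default if it is missing."""
    values = feature.qualifiers.get(key)
    return values[0] if values else default

# Hand-built feature with a single qualifier (values are stored in lists)
toy_source = SeqFeature(
    FeatureLocation(0, 10),
    type="source",
    qualifiers={"organism": ["Cypripedium irapeanum"]},
)

print(first_qualifier(toy_source, "organism"))
print(first_qualifier(toy_source, "gene", default="n/a"))
```

The helper can be called the same way on `firstFeature` or on any other feature in `seq_record.features`.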