# Biopython
Biopython is a Python package for working with biological data, primarily protein and nucleic acid sequences.
<br/>
You can find documentation and tutorials here: https://biopython.org/

# Installing Biopython
You only need to run this cell if you do not already have Biopython installed.

In [None]:
%pip install biopython
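
If you are unsure whether Biopython is already available, a quick check along these lines may help. This is a minimal sketch that uses only the standard library for the lookup; `Bio` is the import name of the Biopython package, and the variable name `biopython_installed` is just illustrative.

```python
import importlib.util

# Biopython is imported as "Bio"; find_spec returns None when the package is not installed
biopython_installed = importlib.util.find_spec("Bio") is not None

if biopython_installed:
    import Bio
    print("Biopython version:", Bio.__version__)
else:
    print("Biopython not found; run the install cell above")
```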

# Reading FASTA files
There are two main file types for storing sequence data that we will encounter in this class; the first is the FASTA file. Each record in a FASTA file has a header line starting with ">" that carries information about the sequence, followed by one or more sequence lines that start with no special character and hold the sequence itself. Biopython provides code to read FASTA files.
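
To make the format concrete, the sketch below parses a small FASTA text held in memory instead of a file on disk. The two records are made up for illustration and are not part of the course data; `StringIO` simply stands in for an open file.

```python
from io import StringIO
from Bio import SeqIO

# Two made-up records in FASTA format (illustrative data, not from the course files)
fasta_text = """>seq1 first example sequence
ATGGCCATTGTA
ATGGGCCGC
>seq2 second example sequence
TTGACCGGT
"""

# SeqIO.parse accepts a file name or an open handle; StringIO stands in for a file here
example_records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

for rec in example_records:
    # Sequence lines that belong to one header are joined into a single sequence
    print(rec.id, "|", rec.description, "|", rec.seq, "|", len(rec.seq))
```

Note that the ID is the first word of the header line, while the description holds the whole header line without the ">".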

In [29]:
from Bio import SeqIO
from Bio.Seq import Seq

#to iterate through records in a fasta file use the following for loop
#you can then get information from the record variable
#we can also store the records in a list to use later
records = []
for record in SeqIO.parse("dna_sequences.fasta", "fasta"):
    print("ID:", record.id)
    print("Description:", record.description)
    print("Sequence:", str(record.seq))
    print("Length:", len(record.seq))
    records.append(record)
    
#we can also translate a DNA sequence to a protein sequence
for record in records:
    #record.seq is already a Biopython Seq object, so it can be translated directly
    #a plain Python string would first need to be wrapped, e.g. Seq("ATGGCC")
    sequence = record.seq
    protein_seq = sequence.translate()
    print(record.id)
    print(protein_seq)
    print()

ID: J01609.1
Description: J01609.1 Escherichia coli dihydrofolate reductase (folA) gene, complete cds
Sequence: ATGCGGCGAGTCCAGGGAGAGAGCGTGGACTCGCCAGCAGAATATAAAATTTTCCTCAACATCATCCTCGCACCAGTCGACGACGGTTTACGCTTTACGTATAGTGGCGACAATTTTTTTTATCGGGAAATCTCAATGATCAGTCTGATTGCGGCGTTAGCGGTAGATCGCGTTATCGGCATGGAAAACGCCATGCCGTGGAACCTGCCTGCCGATCTCGCCTGGTTTAAACGCAACACCTTAAATAAACCCGTGATTATGGGCCGCCATACCTGGGAATCAATCGGTCGTCCGTTGCCAGGACGCAAAAATATTATCCTCAGCAGTCAACCGGGTACGGACGATCGCGTAACGTGGGTGAAGTCGGTGGATGAAGCCATCGCGGCGTGTGGTGACGTACCAGAAATCATGGTGATTGGCGGCGGTCGCGTTTATGAACAGTTCTTGCCAAAAGCGCAAAAACTGTATCTGACGCATATCGACGCAGAAGTGGAAGGCGACACCCATTTCCCGGATTACGAGCCGGATGACTGGGAATCGGTATTCAGCGAATTCCACGATGCTGATGCGCAGAACTCTCACAGCTATTGCTTTGAGATTCTGGAGCGGCGGTAATTTTGTATAGAATTTACGGCTAGCGCCGGATGCGACGCCGGTCGCGTCTTATCCGGCCTTCCTATATCAGGCTGTGTTTAAGACGCCGCCGCTTCGGCCAAATCCTTATGCCGGTTCGACGGCTGGACAAAATACTGTTTATCTTCCCAGCGCAGGCAGGTTA
Length: 778
ID: M10922.1
Description: M10922.1 Lactobacillus casei dihydrofolate reductase gene (DHFR), compl
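
The translation loop in the cell above calls `translate()` with its defaults, which means the standard genetic code, reading from the first base, with stop codons written as `*`. The sketch below shows two optional arguments that may be useful for bacterial genes like these. The short sequence is made up for illustration, and whether NCBI table 11 is the right choice for your data is an assumption to check.

```python
from Bio.Seq import Seq

# Short made-up coding sequence (its length is a multiple of three)
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Default: standard genetic code, stop codons shown as "*"
full_translation = coding_dna.translate()
print(full_translation)

# to_stop=True ends the protein at the first in-frame stop codon
to_first_stop = coding_dna.translate(to_stop=True)
print(to_first_stop)

# table=11 selects the NCBI bacterial, archaeal and plant plastid code
bacterial_translation = coding_dna.translate(table=11, to_stop=True)
print(bacterial_translation)
```

Because translation starts at the first base of whatever sequence it is given, a record that includes flanking non-coding DNA will generally not translate to the annotated protein unless the coding region is sliced out first.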

# Reading GenBank files
GenBank files carry more information than FASTA files and are usually meant to provide annotations for a genome or a region of a genome. Annotations include coding regions (CDS) and sometimes domain annotations such as Pfam domains. GenBank files use the file extension .gbk, .gb, or .gbff.

In [33]:
#use SeqIO.read if the GenBank file has a single record
#a record is an annotated DNA segment; a single record is typical for complete genomes or single regions of genomes
#multiple records may occur in cases of incomplete genomes
#use SeqIO.parse("example.gbk", "genbank") if your file contains multiple records
record = SeqIO.read("plasmid.gb", "genbank")

#read overall information about the genbank record
print("ID:", record.id)
print("Name:", record.name)
print("Description:", record.description)
print("Sequence length:", len(record.seq))
print("Organism:", record.annotations.get("organism"))

print("DNA seq:", record.seq)

#we can iterate over the annotations of the sequence, which are called features
for feature in record.features:
    #feature.type stores information about what type of feature it is (e.g. CDS for coding region)
    print(feature.type)
    #feature.location gives the start, end, and strand (+ or -) of the feature
    print(feature.location)
    #feature.qualifiers has all other information about the feature as a dictionary
    print(feature.qualifiers)

ID: X51450.1
Name: X51450
Description: Artificial cloning vector plasmid BD64
Sequence length: 4780
Organism: Cloning vector pBD64
DNA seq: CCGGGATAGACTGTAACATTCTCACGCATAAAATCCCCTTTCATTTTCTAATGTAAATCTATTACCTTATTATTAATTCAATTCGCTCATAATTAATCCTTTTTCTTATTACGCAAAATGGCCCGATTTAAGCACACCCTTTATTCCGTTAATGCGCCATGACAGCCATGATAATTACTAATACTAGGAGAAGTTAATAAATACGTAACCAACATGATTAACAATTATTAGAGGTCATCGTTCAAAATGGTATGCGTTTTGACACATCCACTATATATCCGTGTCGTTCTGTCCACTCCTGAATCCCATTCCAGAAATTCTCTAGCGATTCCAGAAGTTTCTCAGAGTCGGAAAGTTGACCAGACATTACGAACTGGCACAGATGGTCATAACCTGAAGGAAGATCTGATTGCTTAACTGCTTCAGTTAAGACCGAAGCGCTCGTCGTATAACAGATGCGATGATGCAGACCAATCAACATGGCACCTGCCATTGCTACCTGTACAGTCAAGGATGGTAGAAATGTTGTCGGTCCTTGCACACGAATATTACGCCATTTGCCTGCATATTCAAACAGCTCTTCTACGATAAGGGCACAAATCGCATCGTGGAACGTTTGGGCTTCTACCGATTTAGCAGTTTGATACACTTTCTCTAAGTATCCACCTGAATCATAAATCGGCAAAATAGAGAAAAATTGACCATGTGTAAGCGGCCAATCTGATTCCACCTGAGATGCATAATCTAGTAGAATCTCTTCGCTATCAAAATTCACTTCCACCTTCCACTCACCGGTTGTCCATTCATGGCTGAACTCTGCTTCCTCTGTTGACATGACACACATCATCTCAATATCCGAA
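
A common next step is to pull out only the CDS features and translate the DNA each one covers. The sketch below shows one way this might look. To stay runnable without `plasmid.gb`, it builds a tiny stand-in record by hand; the sequence, the feature coordinates, and the gene name `demoA` are all made up, and a record read with `SeqIO.read` could be used in place of `demo_record`.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Hand-built stand-in for a parsed GenBank record (made-up sequence and annotation)
demo_record = SeqRecord(Seq("CCCATGGCCATTTAACCC"), id="demo")
demo_record.features.append(
    SeqFeature(
        FeatureLocation(3, 15, strand=1),  # 0-based start, end-exclusive
        type="CDS",
        qualifiers={"gene": ["demoA"]},
    )
)

cds_proteins = {}
for feature in demo_record.features:
    if feature.type != "CDS":
        continue
    # Qualifier values are lists, so take the first entry if the key is present
    gene = feature.qualifiers.get("gene", ["unknown"])[0]
    # extract() slices the parent sequence and reverse-complements minus-strand features
    cds_dna = feature.extract(demo_record.seq)
    cds_proteins[gene] = str(cds_dna.translate(to_stop=True))
    print(gene, int(feature.location.start), int(feature.location.end), cds_proteins[gene])
```

Using `feature.extract()` instead of slicing by hand keeps the strand handling in one place, which matters for features annotated on the minus strand.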