# Welcome to Day 4! 

## Extracting and storing sequence data, working with GenBank files

### Section 1: Extracting and storing information from sequence data

### Section 2: Parsing and manipulating GenBank (.gbk/.gbff) files

---

## Session summary


Day 4 has two components. First, we will go over how to extract sequence information, such as GC content, length, number of methionine codons (ATG), or whether a motif is present, and then save it to a file. We will also extract information from the headers alone and save that to a file. Afterwards, we move into uncharted territory by learning how to work with GenBank files. Although GenBank files may look intimidating, we will apply what we learned from working with fasta files and see that working with them can be straightforward.

---
## For Google colab users only

Run the following commands

In [None]:
pip install Biopython

In [None]:
!wget https://raw.githubusercontent.com/agmcfarland/biopython_workshop/master/antibiotic_resistance_genes_short.fasta
!wget https://raw.githubusercontent.com/agmcfarland/biopython_workshop/master/mixed_args.fasta
!wget https://raw.githubusercontent.com/agmcfarland/biopython_workshop/master/short_reads.fastq
!wget https://raw.githubusercontent.com/agmcfarland/biopython_workshop/master/fabv_small.gbk
!wget https://raw.githubusercontent.com/agmcfarland/biopython_workshop/master/salmonella_genome.gbk
!wget https://raw.githubusercontent.com/agmcfarland/biopython_workshop/master/salmonella_genome_2.gbk
!wget https://raw.githubusercontent.com/agmcfarland/biopython_workshop/master/salmonella_genome_3.gbk

---

# Section 1: Extracting and storing information from sequence data

Sometimes we might want to loop through sequences in a file and obtain sequence information, such as the GC content or the length of the sequence. We obtained sequence information like this in Day 3 in order to filter for sequences, but now we will learn how to store that information instead.

Here are some Biopython functions that can be used to obtain sequence information.

In [None]:
# import Biopython modules we will be using
from Bio import SeqIO
from Bio.Seq import Seq

In [None]:
# Calculate sequence length
for record in SeqIO.parse('antibiotic_resistance_genes_short.fasta','fasta'):
    print(len(record.seq))

In [None]:
# Count occurrences of ATG (the DNA form of the AUG start codon)
for record in SeqIO.parse('antibiotic_resistance_genes_short.fasta','fasta'):
    print(record.seq.count('ATG'))
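
Note that `count('ATG')` tallies the triplet wherever it occurs, not only where it would be read as a codon. The sketch below contrasts that with an in-frame count. It uses a made-up plain-string sequence rather than the workshop file, and the helper name `count_in_frame` is hypothetical.

```python
def count_in_frame(sequence, codon="ATG"):
    """Count how many codons in reading frame 1 match `codon`."""
    # cut the sequence into consecutive triplets starting at position 0
    codons = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    return codons.count(codon)

example_seq = "ATGAATGCCATG"  # invented sequence for illustration

any_frame = example_seq.count("ATG")    # every non-overlapping occurrence
in_frame = count_in_frame(example_seq)  # only occurrences read as codons
print(any_frame, in_frame)
```

The two numbers differ whenever an `ATG` straddles two codons, so it is worth deciding which one a downstream analysis needs.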

In [None]:
# calculate GC content manually
for record in SeqIO.parse('antibiotic_resistance_genes_short.fasta','fasta'):
    c_count = record.seq.count('C')
    g_count = record.seq.count('G')
    seq_length = len(record.seq)

    print(100 * (g_count+c_count)/seq_length)
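
The manual calculation above can be wrapped in a small reusable function. This is a minimal sketch that works on a plain Python string, so it runs without any input file. The function name `gc_percent` and the handling of lowercase letters and empty sequences are assumptions, not part of the workshop code.

```python
def gc_percent(sequence):
    """Return GC content as a percentage; 0.0 for an empty sequence."""
    sequence = sequence.upper()  # treat 'g' and 'G' the same
    if len(sequence) == 0:
        return 0.0               # avoid dividing by zero
    gc_count = sequence.count("G") + sequence.count("C")
    return 100 * gc_count / len(sequence)

print(gc_percent("ATGCGC"))
print(gc_percent("atgc"))
```

Inside a `SeqIO.parse()` loop, it could be called as `gc_percent(str(record.seq))`.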

We can save these data to a list or to a new file, depending on what we want to do. We can use the record.id so that each data point extracted has a unique identifier.

In [None]:
# make empty list
store_data = []

for record in SeqIO.parse('antibiotic_resistance_genes_short.fasta','fasta'):
    # each data point contains the record.id and the length in a list
    store_data.append([record.id, len(record.seq)])
# view data
store_data

This list-of-lists storage approach can then be converted into a NumPy array or a pandas `pd.DataFrame()`.

In [None]:
import pandas as pd

df = pd.DataFrame(store_data, columns=['gene_id','length'])
df
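
The session summary mentions saving the extracted information to a file. One way to do that, sketched here with Python's built-in `csv` module, is to write each inner list as one row. The rows and the file name are invented for illustration, and the file goes into a temporary folder so nothing in the working directory is overwritten.

```python
import csv
import os
import tempfile

# invented rows in the same [identifier, length] shape as store_data
example_rows = [["gene_a", 735], ["gene_b", 861]]

# temporary folder for the output file
output_path = os.path.join(tempfile.mkdtemp(), "sequence_lengths.csv")

with open(output_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["gene_id", "length"])  # header row
    writer.writerows(example_rows)          # one row per sequence

# read the file back to confirm what was saved (values come back as text)
with open(output_path, newline="") as handle:
    saved_rows = list(csv.reader(handle))

print(saved_rows)
```

If the data are already in a pandas dataframe, `df.to_csv('sequence_lengths.csv', index=False)` writes a comparable file.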

--- 
### Exercise 1a

We want to build a dataframe that contains information on whether specific protein motifs are located in each sequence. We will assign a 1 if it is there, and a 0 if it is not.

Use `SeqIO.parse()` to parse the fasta file and `translate()` the `record.seq`. `print()` each `record` in the loop.

In [None]:
for ___ in ___.___('mixed_args.fasta', 'fasta'):
    record.seq = record.___.___
    print(___)

### Exercise 1b

To start, make `store_data` an empty list. 

Now let's search each `record` for the first motif `'QRALGL'` in our sequences using `find()`. Assign a 1 if it is found and a 0 if it is not. 

`append()` `sequence_identifier` and `motif_1` to `store_data`. 

Afterwards, `print()` `store_data`

In [None]:
store_data = ___ 

for ___ in ___.___('mixed_args.fasta', 'fasta'):
    record.seq = record.___.___

    sequence_identifier = record.id

    if record.___.find('QRALGL') > -1:
        motif_1 = ___
    else:
        motif_1 = ___

    store_data.append([___, ___])

print(store_data)

### Exercise 1c 

Now let's test for the presence of two additional motifs, using the same basic code structure as above. Use `find()` to search for the other two motifs. Afterwards, `append()` `sequence_identifier`, `motif_1`, `motif_2`, and `motif_3` to `store_data`.

In [None]:
store_data = ___ 

for ___ in ___.___('mixed_args.fasta', 'fasta'):
    record.seq = record.___.____

    sequence_identifier = record.id

    if ___.___.find('QRALGL') > -1:
        motif_1 = ___
    else:
        motif_1 = ___

    if ___.___.find('GAKR') > -1:
        motif_2 = ___
    else:
        motif_2 = ___

    if ___.___.find('DGTA') > -1:
        motif_3 = ___
    else:
        motif_3 = ___


    store_data.append([___, ___, ___, ___])

store_data

### Exercise 1d

Convert `store_data` into a pandas `pd.DataFrame()`, `df`, with `columns` `'gene_id'`, `'motif1'`, `'motif2'`, `'motif3'`.

In [None]:
df = pd.___(___, columns=['gene_id','motif1','motif2','___'])
df

Storing data in a pandas dataframe allows for further downstream processing of sequence data using non-sequence techniques. 


For example, we can mimic a hierarchical clustering of motif detection by sorting the dataframe by presence/absence of all three motifs.

In [None]:
df = df.sort_values(['motif1','motif2','motif3'], ascending=(True,True,True))
df

---

Sometimes it is necessary to extract data from the sequence headers themselves. An example would be taxonomy data that tells you which species the sequence belongs to. The base Python method `split()` is very useful here. If you know how the headers are organized, you can find common characters that can split the header, which is a string, into items in a list. 

As an example:



In [None]:
record.description = 'ErmB AAF86219.1 [Enterococcus faecium]'
# Splits record.description into four items in a list
record.description.split(' ')

In [None]:
record.description = 'ErmB AAF86219.1 [Enterococcus faecium]'
# Splits record.description into two items in a list
record.description.split('[')

We can use `split()` to convert the header into a list that we can then extract specific information from.

In [None]:
record.description = 'ErmB AAF86219.1 [Enterococcus faecium]'
header_data = record.description.split(' ') # header_data is now a list
genus = header_data[2].strip('[]') # retrieving the third item in the header_data list and removing the bracket
species = header_data[3].strip('[]') # retrieving the fourth item in the header_data list and removing the bracket

print(genus)
print(species)
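
Splitting on spaces relies on the organism always sitting at positions 2 and 3 of the list. A variant, sketched below, takes whatever sits inside the square brackets instead. The function name `split_organism` is hypothetical, and the sketch assumes the header ends in `[Genus species]` with at least two words inside the brackets.

```python
def split_organism(description):
    """Return (genus, species) from a header ending in '[Genus species]'."""
    # keep the text after the last '[' and drop the closing ']'
    organism = description.split("[")[-1].rstrip("]")
    # split once, so anything after the genus stays together
    genus, species = organism.split(" ", 1)
    return genus, species

genus, species = split_organism("ErmB AAF86219.1 [Enterococcus faecium]")
print(genus)
print(species)
```

Because the split happens only once, longer names such as a subspecies stay together in the second item.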


We can extract sequence information along with header information in a single pass.

In [None]:
# make empty list
store_data = []

for record in SeqIO.parse('antibiotic_resistance_genes_short.fasta','fasta'):

    # get individual counts of each nucleotide
    c_count = record.seq.count('C')
    g_count = record.seq.count('G')
    a_count = record.seq.count('A')
    t_count = record.seq.count('T')
    # get the length of the sequence
    seq_length = len(record.seq)
    # calculate the gc content of the sequence
    gc_content = 100 * (g_count+c_count)/seq_length

    # split the header by whitespace and store the resulting list in header_data
    header_data = record.description.split(' ')
    # get the genus of the sequence by accessing header_data, removing the square bracket
    genus = header_data[2].strip('[]')
    # get the species of the sequence by accessing header_data, removing the square bracket
    species = header_data[3].strip('[]')
     
    # store all of the data in store_data 
    store_data.append([record.id, seq_length, c_count, g_count, a_count, t_count, gc_content, genus, species])

In [None]:
df = pd.DataFrame(store_data, columns = ['record_id', 'seq_length', 'c_count', 'g_count', 'a_count', 't_count', 'gc_content', 'genus', 'species'])
df

---
### Exercise 2a

Among other things, we want to extract the genus and species from each sequence in `mixed_args.fasta`.

Before we try to extract data from the headers, we want to examine how the headers are organized to see where we can `split()` them.

Use `SeqIO.parse` to loop through each `record` in `mixed_args.fasta`.

`print()` each `record.description`

In [None]:
for record in SeqIO.___('mixed_args.fasta', 'fasta'):
    print( ___ )

### Exercise 2b

The headers (`record.description`) are more complicated than the previous example. 

However, a good place to `split()` the `record.description` would be the `=` signs.

Store the resulting list in the variable `header_data`. 

`print()` `header_data` to see how `record.description` was split.




In [None]:
for record in SeqIO.parse('mixed_args.fasta', 'fasta'):
    header_data = record.description.___('___')
    print(header_data)
    print('\n')

### Exercise 2c

Using the first `split()` `record.description` as a reference, here is the structure of the split list.

`['ERMA_STAAR Q6GKQ0 rRNA adenine N-6-methyltransferase OS', 'Staphylococcus aureus (strain MRSA252) OX', '282458 GN', 'ermA1 PE', '3 SV', '1']`

We can see that the taxonomic data is the second item in the list (index 1, since Python starts counting at 0). 

For each split `record.description`, `append()` both `record.id` and the `taxonomic_info` to the list `taxonomy_store`.


In [None]:
taxonomy_store = ___

for record in SeqIO.parse('mixed_args.fasta', 'fasta'):
    header_data = record.description.___('___')
    taxonomic_info = header_data[___]
    ___.append([___.id,___])

taxonomy_store

### Exercise 2d

Convert the list `taxonomy_store` into a `pd.DataFrame()` with `columns` `'gene_id'` and `'taxonomy'`.

In [None]:
df = ___.DataFrame(___, ___ = ['gene_id', 'taxonomy'])
df
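
Splitting on `=` leaves the two-letter code of the next field (for example ` OX`) attached to the end of each item. As an alternative sketch, a regular expression can pull out just the text that belongs to one field. The header below is rebuilt from the split list shown in Exercise 2c; the function name `extract_field` and the assumption that every field starts with a two-letter uppercase code followed by `=` are ours.

```python
import re

# header rebuilt from the split list shown in Exercise 2c
description = ("ERMA_STAAR Q6GKQ0 rRNA adenine N-6-methyltransferase "
               "OS=Staphylococcus aureus (strain MRSA252) OX=282458 GN=ermA1 PE=3 SV=1")

def extract_field(description, code):
    """Return the text after 'CODE=' up to the next ' XX=' tag or the end of the header."""
    match = re.search(code + r"=(.*?)(?= [A-Z]{2}=|$)", description)
    return match.group(1) if match else None

organism = extract_field(description, "OS")
gene_name = extract_field(description, "GN")
print(organism)
print(gene_name)
```

A field that is absent from the header gives `None` rather than an error, which makes it easier to loop over headers that are not all organized the same way.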

---

We can use `SeqIO` for more than just fasta files. For example, we can examine the reads in a `.fastq` file using almost the same code as for a fasta file.

In [None]:
for record in SeqIO.parse('short_reads.fastq', 'fastq'):
    print(record)
    print('\n') # for clarity

In [None]:
dir(record)

In [None]:
record._set_seq

---
### Exercise 3a



`print()` the `read_length` of each read in `short_reads.fastq` using `len()`.

In [None]:
for record in SeqIO.parse('short_reads.fastq', '___'):
    read_length = ___(record.___)
    print(read_length)

### Exercise 3b
Count the **total** number of reads in `short_reads.fastq`. `print()` the total number of reads.

In [None]:
total_count = 0

for record in SeqIO.parse('short_reads.fastq', '___'):
    total_count = total_count + ___

print('total reads:', ___)

### Exercise 3c

`print()` the `record.id` and `read_length` of all reads that have fewer than or equal to 270 bases.

In [None]:
for record in SeqIO.parse('short_reads.fastq', '___'):
    read_length = ___(record.___)
    if read_length <= ___:
        print('ID:', ___.id)
        print('length:', ___)
        print('\n')

---



## Interlude

Before we get to the next section, we need to review dictionaries.

Dictionaries are similar to lists in that they are a way to hold data. A dictionary is different because it is composed of 'key' and 'value' pairs. For example:


In [None]:
my_dict = { 'animal': ['dog','cat'],
            'person': ['taylor','simone'],
            'vehicle': ['car','plane']}

my_dict

You can access the **elements** stored under a key by using the key:

In [None]:
# access all elements in key 'animal'
my_dict['animal']

In [None]:
# access the first element of key 'animal'
my_dict['animal'][0]
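
A few more dictionary operations come up repeatedly in the next section, so here is a short sketch of them using the same `my_dict`. The `'plant'` key and the `nested` dictionary are made up for illustration.

```python
my_dict = {'animal': ['dog', 'cat'],
           'person': ['taylor', 'simone'],
           'vehicle': ['car', 'plane']}

# loop over every key together with its value
for key, value in my_dict.items():
    print(key, value)

# check whether a key exists before using it
has_plant = 'plant' in my_dict

# .get() returns a fallback value instead of raising a KeyError
plants = my_dict.get('plant', [])

# dictionaries can hold other dictionaries; chain the keys to reach inner values
nested = {'outer': {'inner': 'value'}}
inner_value = nested['outer']['inner']

print(has_plant, plants, inner_value)
```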

# Section 2: Parsing and manipulating GenBank (.gbk/.gbff) files

Until now we have been exclusively working with sequences in `fasta` format. As we saw from our most recent example, however, `SeqIO` works with [`fastq` files](https://www.researchgate.net/profile/Morteza-Hosseini-6/publication/309134977/figure/fig2/AS:417452136648711@1476539753452/A-sample-of-the-FASTQ-file_W640.jpg). As we're finding out, `SeqIO` also works with other file types, including `GenBank` files. 

Compared to [`fasta` files](https://www.researchgate.net/profile/Morteza-Hosseini-6/publication/309134977/figure/fig1/AS:417452136648705@1476539753111/A-sample-of-the-Multi-FASTA-file_W640.jpg), [`GenBank` files](https://archaea.i2bc.paris-saclay.fr/baget/Images/hup.png) contain more information.

We will go over how to open and explore the data found in a `GenBank` file as well as how to extract features of interest and save them to a new file.

We will start by opening a `.gbk` file that only has one gene in it called fabV. We can see that we can access the same types of information as we would a `fasta` file.

In [None]:
for record in SeqIO.parse('fabv_small.gbk', 'genbank'):
    # record id
    print(record.id)
    print('\n')
    print(record.description)
    print('\n')
    print(record.seq)

But there are additional things we can access, since GenBank files are more information-rich.

The most important new thing we can access is called `features`. You would call upon it by inputting `record.features`.

The data stored in `record.features` is a list of feature objects, each of which keeps its details in a dictionary called `qualifiers`. We can loop through `record.features` to examine its contents:

In [None]:
for feature in record.features:
    print(feature)
    print('\n') # for clarity

From the above, we can see that each loop returns three basic things: type, location, and qualifiers. 

In [None]:
# what is the specific feature for?
for feature in record.features:
    print(feature.type)

In [None]:
# where in the genome/sequence is the feature of interest located?
for feature in record.features:
    print(feature.location)

In [None]:
# the special data associated with the feature
for feature in record.features:
    print(feature.qualifiers)

We can access specific information found in `record.features`. **But** we have to make sure we specify the correct `feature.type` first.

In [None]:
# the organism that the sequence belongs to is stored under feature type called 'source'
for feature in record.features:
    if feature.type == 'source':
        print(feature.qualifiers['organism'])

In [None]:
# the molecular weight is stored under the feature type 'Protein'
for feature in record.features:
    if feature.type == 'Protein':
        print(feature.qualifiers['calculated_mol_wt'])
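
Not every feature carries every qualifier, and asking a dictionary for a missing key raises a `KeyError`. The sketch below shows one defensive pattern. It uses plain-Python stand-ins (`SimpleNamespace` objects with invented values) in place of real features so that it runs without a GenBank file, and the helper name `first_qualifier` is hypothetical.

```python
from types import SimpleNamespace

# plain-Python stand-ins for features; the values are invented
features = [
    SimpleNamespace(type='source', qualifiers={'organism': ['Example organism']}),
    SimpleNamespace(type='Protein', qualifiers={'product': ['example protein']}),
]

def first_qualifier(feature, key, default=None):
    """Return the first value stored under `key`, or `default` if the key is absent."""
    values = feature.qualifiers.get(key)  # qualifier values are stored as lists
    return values[0] if values else default

for feature in features:
    if feature.type == 'Protein':
        # this stand-in has no 'calculated_mol_wt', so the default is returned
        mol_wt = first_qualifier(feature, 'calculated_mol_wt', default='not recorded')
        product = first_qualifier(feature, 'product')
        print(product, mol_wt)
```

The same helper should work on features read with `SeqIO.parse()`, since it only relies on `feature.qualifiers` behaving like a dictionary.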

---
### Exercise 4a

`print()` each `feature` found in `record.features` in `'fabv_small.gbk'`.

In [None]:
for record in SeqIO.parse('fabv_small.gbk', '___'):
    for feature in record.___:
        print(feature)

### Exercise 4b

`print()` all the `feature.qualifiers` found in `feature.type` `'Region'`

In [None]:
for record in SeqIO.parse('fabv_small.gbk', '___'):
    for feature in record.___:
        if feature.___ == 'Region':
            print(feature.___)

### Exercise 4c

`print()` the database cross reference (`'db_xref'`) identifier stored in `feature.qualifiers` that belongs to `feature.type` `'Region'`

In [None]:
for record in SeqIO.parse('fabv_small.gbk', '___'):
    for feature in record.___:
        if feature.___ == 'Region':
            print(feature.___['___'])

---

After working with fasta files for so long, it might seem that GenBank files are unnecessarily complex, even if their structure can hold more information. And it is true that for just one gene/protein, we perhaps wouldn't want to access the sequence in GenBank format. 

Where the GenBank format truly shines is for storing assembled whole genomes that are annotated with predicted coding regions, tRNA, rRNA, and other features.


In [None]:
# record-level data will give you the name of the genome and also the nucleotide sequence for the whole assembly. Each contig is a new record.
for record in SeqIO.parse('salmonella_genome.gbk','genbank'):
    print(record.description)
    print(len(record.seq))
    print('\n') # for clarity

From the above output, we can see that we have a really well-assembled genome. It only has two contigs. The first is the complete chromosomal genome and the second is a circular plasmid. When we examine the lengths, we notice there is a big discrepancy in the number of basepairs each has.


We can easily extract some useful metadata about the genome's features. We start by inputting `record.annotations`. This will produce a wall of text, but don't be afraid. It is simply a big dictionary that has a couple of dictionaries within it.

In [None]:
for record in SeqIO.parse('salmonella_genome.gbk','genbank'):
    rec_annot = record.annotations

rec_annot

The real meat containing the summary of the overall genomic features is found under `record.annotations['structured_comment']['Genome-Annotation-Data']`

In [None]:
for record in SeqIO.parse('salmonella_genome.gbk','genbank'):
    rec_annot_metadata = record.annotations['structured_comment']['Genome-Annotation-Data']

rec_annot_metadata

We can extract specific items from this dictionary using keywords.

In [None]:
# number of coding sequences
for record in SeqIO.parse('salmonella_genome.gbk','genbank'):
    n_coding_sequences = record.annotations['structured_comment']['Genome-Annotation-Data']['CDSs (total)']

n_coding_sequences

In [None]:
# number of CRISPR arrays
for record in SeqIO.parse('salmonella_genome.gbk','genbank'):
    n_crispr_arrays = record.annotations['structured_comment']['Genome-Annotation-Data']['CRISPR Arrays']

n_crispr_arrays

We can also access other parts of `record.annotations`


In [None]:
# full taxonomic breakdown
for record in SeqIO.parse('salmonella_genome.gbk','genbank'):
    genome_taxonomy = record.annotations['taxonomy']

genome_taxonomy
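
Chained lookups like the ones above fail with a `KeyError` if any level is missing, and the counts may come back as text rather than numbers. The sketch below handles both cases. It uses an invented stand-in for `record.annotations`, assumes counts are stored as text that may contain thousands separators (check what your own file holds), and the helper name `annotation_count` is hypothetical.

```python
# stand-in for record.annotations; keys mirror those used above, values are invented
annotations = {
    'taxonomy': ['Bacteria', 'Example'],
    'structured_comment': {
        'Genome-Annotation-Data': {
            'CDSs (total)': '4,321',
            'CRISPR Arrays': '2',
        }
    },
}

def annotation_count(annotations, key):
    """Look up a count in Genome-Annotation-Data; return an int, or None if it is missing."""
    # .get() with an empty-dict fallback keeps the chain from failing part-way
    data = annotations.get('structured_comment', {}).get('Genome-Annotation-Data', {})
    raw_value = data.get(key)
    if raw_value is None:
        return None
    # remove thousands separators before converting the text to a number
    return int(str(raw_value).replace(',', ''))

n_cds = annotation_count(annotations, 'CDSs (total)')
n_trna = annotation_count(annotations, 'tRNAs')  # not present in the stand-in
print(n_cds, n_trna)
```

Converting to `int` matters once the values are stored in a dataframe, because text columns cannot be summed or sorted numerically.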

----

### Exercise 5a:

Retrieve the list of database cross-reference identifiers stored in `record.dbxrefs`. 


In [None]:
for record in SeqIO.___('salmonella_genome.___','genbank'):
    genome_assembly_ids = ___.dbxrefs

genome_assembly_ids

### Exercise 5b:

From `genome_assembly_ids`, select the `'Assembly'` ID. This is the whole-genome identifier for the genome in NCBI.

In [None]:
___[2]

### Exercise 5c:

Retrieve the number of pseudogenes stored in `record.annotations['structured_comment']['Genome-Annotation-Data']` using the keyword `'Pseudo Genes (total)'`


In [None]:
for record in SeqIO.parse('salmonella_genome.gbk','genbank'):
    n_pseudo_genes = record.___['structured_comment']['Genome-Annotation-Data']['___']

n_pseudo_genes

### Exercise 5d:

Use `len()` to retrieve the length of the **whole** genome. 

Remember that each `record` object provides information for each contig. Our genome has **two** contigs, so we will need to add up the lengths of `record.seq`

In [None]:
total_genome_length = ___

for record in SeqIO.parse('salmonella_genome.gbk','genbank'):
    contig_length = len(___.___)
    total_genome_length = total_genome_length + ___

total_genome_length

### Exercise 5e:

Being able to quickly go through a genome and extract relevant metadata is a common bioinformatic task. 

We will put everything we learned in the previous steps of exercise 5 together into a single step. Look at the comments for hints

Extract the following metadata: 

1. NCBI whole-genome id 
2. number of genes
3. number of pseudogenes
4. the length of the genome.

`append()` the extracted metadata to store_data in that order.

In [None]:
store_data = ___

total_genome_length = ___

for record in SeqIO.parse('salmonella_genome.gbk','genbank'):

    # ncbi whole-genome id stored in record
    genome_assembly_ids = ___.dbxrefs
    ncbi_whole_genome_id = ___[2]

    # number of genes stored in record.annotations under keyword 'Genes (coding)'
    n_genes = record.___['structured_comment']['Genome-Annotation-Data']['Genes (coding)']

    # number of pseudogenes stored in record.annotations under keyword 'Pseudo Genes (total)'
    n_pseudo_genes = record.___['structured_comment']['Genome-Annotation-Data']['___']

    # length of the genome from getting the len() of record.seq
    contig_length = len(record.___)
    total_genome_length = total_genome_length + ___

    # append items to store_data
    store_data.append([ncbi_whole_genome_id, n_genes, n_pseudo_genes, total_genome_length])

store_data

### Exercise 5f

Extract data from three different `genbank` format salmonella genomes in the `current_directory` using `os.listdir()`. 

Use the above code as a guide

In [None]:
import os

current_directory = os.getcwd()

store_data = []

for file in ___.___(current_directory):
    if ___.startswith('salmonella_genome'):
        if ___.endswith('.gbk'):

            ## use Biopython to work with each individual GenBank file that passes the above two conditions
            total_genome_length = 0
            for record in SeqIO.parse(file,'genbank'):

                # ncbi whole-genome id stored in record
                genome_assembly_ids = record.dbxrefs
                ncbi_whole_genome_id = genome_assembly_ids[2]

                # number of genes stored in record.annotations under keyword 'Genes (coding)'
                n_genes = record.annotations['structured_comment']['Genome-Annotation-Data']['Genes (coding)']

                # number of pseudogenes stored in record.annotations under keyword 'Pseudo Genes (total)'
                n_pseudo_genes = record.annotations['structured_comment']['Genome-Annotation-Data']['Pseudo Genes (total)']

                # length of the genome from getting the len() of record.seq
                contig_length = len(record.seq)
                total_genome_length = total_genome_length + contig_length

            # for each genbank file, store the extracted data in the list store_data
            store_data.append([ncbi_whole_genome_id, n_genes, n_pseudo_genes, total_genome_length])

# convert store_data to a pandas dataframe
df = ___.DataFrame(___, columns = ['ncbi_whole_genome_id', 'gene_count', 'pseudogene_count', 'whole_genome_length'])
df

---

One last thing to go over is the `record.features` in a large GenBank file. 

Let's say we wanted to manually count the number of coding sequences.



In [None]:
# count the CDS features across every record. Each contig is a new record.
cds_count = 0

for record in SeqIO.parse('salmonella_genome.gbk','genbank'):
    for feature in record.features:
        if feature.type == 'CDS':
            cds_count = cds_count + 1

cds_count
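
The same idea extends to counting every feature type in one pass. A possible sketch uses `collections.Counter` from the standard library. The feature list here is an invented stand-in in which only the `.type` attribute matters, so the example runs without the genome file.

```python
from collections import Counter
from types import SimpleNamespace

# stand-ins for record.features; only the .type attribute matters here
features = [SimpleNamespace(type=t) for t in ['source', 'gene', 'CDS', 'gene', 'CDS', 'tRNA']]

# tally how often each feature type appears
type_counts = Counter(feature.type for feature in features)

print(type_counts['CDS'])
print(type_counts.most_common())
```

A `Counter` returns 0 for a type that never appeared, so no `if` check is needed before looking one up.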

Or, let's say we want to make a list of all the gene product descriptions in our genome for further analysis.

In [None]:
product_list = []

for record in SeqIO.parse('salmonella_genome.gbk','genbank'):
    for feature in record.features:
        if feature.type == 'CDS':
            product_list.append(feature.qualifiers['product'][0])

product_list

There are a lot of products here. What if we wanted to find only the ones that are involved in conjugal transfer? Also, are these genes located in the chromosomal or plasmid DNA, or both?

In [None]:
conjugal_transfer_list = []

for record in SeqIO.parse('salmonella_genome.gbk','genbank'):
    for feature in record.features:
        if feature.type == 'CDS':
            if feature.qualifiers['product'][0].lower().find('conjugal') > -1:
                conjugal_transfer_list.append([record.description,feature.qualifiers['product'][0]])



df = pd.DataFrame(conjugal_transfer_list, columns = ['record_description','product'])
df
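
The `find() > -1` test can also be written with Python's `in` operator, which some find easier to read. The sketch below applies it to a handful of invented product descriptions standing in for `feature.qualifiers['product'][0]`; the helper name `matches_keyword` is hypothetical.

```python
# invented product descriptions standing in for feature.qualifiers['product'][0]
products = [
    'Conjugal transfer protein TraG',
    'DNA polymerase III subunit alpha',
    'conjugal transfer pilus assembly protein',
    'hypothetical protein',
]

def matches_keyword(text, keyword):
    """Case-insensitive test for whether `keyword` appears anywhere in `text`."""
    return keyword.lower() in text.lower()

# keep only the descriptions that mention the keyword
conjugal_products = [p for p in products if matches_keyword(p, 'conjugal')]
print(conjugal_products)
```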

Surprisingly, conjugal transfer genes were only found on the chromosome. In the next exercise, let's get all the gene products found in the plasmid.

---

### Exercise 6a:

As has been stated before, in a whole-genome GenBank file, each record is a contig in the genome. In our genome .gbk, there are only two contigs, chromosomal and plasmid contigs. 

`print()` the `record.id` of each contig.



In [None]:
for record in SeqIO.parse('salmonella_genome.gbk','genbank'):
    print(___)

### Exercise 6b:

The second contig (the plasmid contig) has the record.id `NZ_CP059062.1`



Let's set our loop so that when the `record.id` is `NZ_CP059062.1`, a variable called `product_description` is created from each `feature.qualifiers['product'][0]` when `feature.type` is `CDS`.  

`append()` the `product_description` to the list `plasmid_products`

In [None]:
plasmid_products = ___

for record in SeqIO.parse('salmonella_genome.gbk','genbank'):
    if record.id == '___':
        for feature in record.features:
            if feature.type == 'CDS':
                product_description = feature.qualifiers['___'][0]
                ___.append(product_description)

plasmid_products

### Exercise 6c:

Do the same as in exercise 6b, but also extract the `ncbi_accession`, which is the gene ID, that is stored in `feature.qualifiers` under the keyword `'locus_tag'`.

`append()` both the `ncbi_accession` and `product_description` to `plasmid_products`

Convert the list to a `pd.DataFrame()` with `columns` `'ncbi_accession'` and `'plasmid_products'`

In [None]:
plasmid_products = ___

for record in SeqIO.parse('salmonella_genome.gbk','genbank'):
    if record.id == '___':
        for feature in record.features:
            if feature.type == 'CDS':
                product_description = feature.qualifiers['___'][0]
                ncbi_accession = feature.___['locus_tag'][0]
                
                plasmid_products.append([___, ___])

df = ___.DataFrame(___, columns = ['ncbi_accession','plasmid_products'])
df

---

That's it for Day 4! 