# Lab 8: Summarizing the *SARS-CoV-2* genome

In this lab we will use Biopython to summarize the genome of *Severe acute respiratory syndrome coronavirus 2* (SARS-CoV-2).

The GenBank entry was downloaded from https://www.ncbi.nlm.nih.gov/nuccore/NC_045512 and saved as a text file, which is available on the course page.

In [None]:
from Bio import SeqIO # to parse sequence data

with open('SARS-COV-2.gbk') as handle:
    sequences = SeqIO.parse(handle, "genbank")
    seq_record = next(sequences) # get the first sequence record

seq_record

### Find all protein products from CDS features

For each CDS feature, output the gene name, protein id, and product as one row of a table. Use *f-strings* (see the aside below) so that each column is 20 characters wide.
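
As a starting point, the filtering and formatting steps can be sketched as below. This is only a sketch under stated assumptions: `feature_rows` is a hypothetical helper name, and `toy_features` are made-up stand-ins so the cell runs without the GenBank file. It relies on the fact that Biopython features expose a `.type` string and a `.qualifiers` dictionary whose values are lists of strings; with the real data you would pass `seq_record.features` instead.

```python
from types import SimpleNamespace


def feature_rows(features, feature_type="CDS"):
    """Return one fixed-width table row per feature of the given type."""
    rows = []
    for feature in features:
        if feature.type != feature_type:
            continue
        # Qualifier values are lists of strings; use "-" when a key is absent.
        gene = feature.qualifiers.get("gene", ["-"])[0]
        protein_id = feature.qualifiers.get("protein_id", ["-"])[0]
        product = feature.qualifiers.get("product", ["-"])[0]
        rows.append(f"{gene:20}{protein_id:20}{product:20}")
    return rows


# Hypothetical stand-in features (not real genome annotations).
toy_features = [
    SimpleNamespace(type="gene", qualifiers={"gene": ["geneA"]}),
    SimpleNamespace(
        type="CDS",
        qualifiers={
            "gene": ["geneA"],
            "protein_id": ["ID_0001"],
            "product": ["example protein"],
        },
    ),
]

table = feature_rows(toy_features)
print(f'{"gene":20}{"protein id":20}{"product":20}')
for row in table:
    print(row)
```

The width in an f-string is a minimum, so a product name longer than 20 characters is printed in full and pushes past its column rather than being truncated.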

### Find all mature protein products of *orf1ab*

The gene *orf1ab* encodes a *polyprotein*: a single long protein that is translated from one gene and then cleaved (cut) into multiple smaller proteins, which are referred to as *mature peptides*. For each of the mature peptides produced by *orf1ab*, output the gene name, protein id, and product in a table using the same format as the previous problem. Also use Python to count and output the number of mature peptides produced by *orf1ab*.

### Extracting viral RNA sequences for coronavirus testing

Current tests for *SARS-CoV-2* detect the presence of viral RNA molecules. For example, "the Stanford test screens first for the presence of viral RNA encoding a protein called an envelope protein, which is found in the membrane that surrounds the virus and plays an important role in the viral life cycle, including budding from an infected host cells. It then confirms the positive result by testing for a gene encoding a second protein called RNA-dependent RNA polymerase."

Source: http://med.stanford.edu/news/all-news/2020/03/stanford-medicine-COVID-19-test-now-in-use.html

**Question**: Output the RNA sequences for the *envelope protein* CDS and for the *RNA-dependent RNA polymerase*, which is a *mature peptide*. Note: extract the sequences from the `seq_record` entry. GenBank stores these sequences as DNA; the viral RNA is identical except that it has uracil (U) in place of thymine (T).
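
The thymine-to-uracil substitution described in the note can be illustrated on a plain string. This is a minimal sketch: `dna_to_rna` is a hypothetical helper and `toy_dna` is a made-up sequence, not part of the genome. With Biopython, `feature.extract(seq_record.seq)` returns a feature's sequence as a `Seq` object, and calling `.transcribe()` on it performs the same substitution.

```python
def dna_to_rna(dna):
    """Return the RNA form of a coding-strand DNA string (T replaced by U)."""
    return dna.upper().replace("T", "U")


toy_dna = "ATGGCTTAA"  # made-up example sequence
toy_rna = dna_to_rna(toy_dna)
print(toy_rna)
```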


### Aside: f-strings

An *f-string* is a formatted string literal in Python. Its format specification can set the width of the output or the number of decimal places shown for a number. We will only worry about setting the output width here.

The basic syntax of a formatted string is
```
f'text {value:format} more text'
```

where *value* is a string or the name of a variable, and *format* can be

- a number, to set the minimum width (e.g., 10); strings are left-aligned by default
- a number preceded by '<' for left-alignment, '^' for center-alignment, or '>' for right-alignment
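
The default alignment depends on the type of the value: strings are left-aligned and numbers are right-aligned when only a width is given. The short sketch below illustrates this. The variable names are made up for the example, and the trailing `|` is there only to make the padding visible.

```python
word = "hi"
count = 42

string_default = f'{word:10}|'   # strings pad on the right (left-aligned)
number_default = f'{count:10}|'  # numbers pad on the left (right-aligned)
number_left = f'{count:<10}|'    # an explicit '<' overrides the default

print(string_default)
print(number_default)
print(number_left)
```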

The cell below demonstrates how to display 'hi' inside a block of 10 characters, where 'hi' is left-aligned, centered, or right-aligned.

In [None]:
print(f'{"hi":<10}') # left-align
print(f'{"hi":^10}') # center-align
print(f'{"hi":>10}') # right-align

We can also use *f-strings* to display the values of variables, as in the example below.

In [None]:
fName = 'Jane'
lName = 'Doe'
print(f'Hello {fName} {lName}!')
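
Variables and widths can be combined in the same f-string. As a small sketch, reusing the names from the cell above, the two variables are placed in adjacent 10-character columns; the trailing `|` only marks the end of the second column.

```python
fName = 'Jane'
lName = 'Doe'

# First column left-aligned, second column right-aligned, each 10 wide.
line = f'{fName:<10}{lName:>10}|'
print(line)
```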