## Reading from and writing to files

In [1]:
### open function and file objects
?open

In [None]:
### reading everything from a file as a single string
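
A minimal sketch of reading a whole file into one string. The file name `example.txt` and its contents are made up for illustration; the file is first written to a temporary directory so that the example runs on its own.

```python
import os
import tempfile

# Create a small throwaway file so there is something to read
# (the name "example.txt" and its contents are illustrative only).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "example.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\n")

# read() returns the entire contents of the file as a single string,
# including the newline characters.
with open(path, "r") as f:
    contents = f.read()

print(repr(contents))
```

Opening the file in a `with` block means it is closed automatically once the block ends.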

In [None]:
### reading a file line by line
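
A minimal sketch of reading a file one line at a time, again using a made-up temporary file (`example.txt` is an assumption, not a course file). Iterating over the file object directly avoids loading the whole file into memory at once.

```python
import os
import tempfile

# Create a small throwaway file (name and contents are illustrative only).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "example.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\n")

lines = []
with open(path, "r") as f:
    for line in f:                       # a file object yields one line per iteration
        lines.append(line.rstrip("\n"))  # drop the trailing newline character

print(lines)
```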

### Write a function to parse a FASTA file

### Introducing the FASTA file format for sequence data

The most widely used file format for representing nucleotide and protein sequence data is the FASTA format. Wikipedia has a good [overview of the FASTA format](https://en.wikipedia.org/wiki/FASTA_format). We'll illustrate this format with an example: the reference genome of SARS-CoV-2 (the virus that causes COVID-19), which can be found on [GenBank](https://www.ncbi.nlm.nih.gov/genbank/) via [this link](https://www.ncbi.nlm.nih.gov/nuccore/NC_045512).

Summary of FASTA format:
 
 * Each file can hold one or more sequence records
 
 * The beginning of each record is delimited by a line called a header, which has a `>` character at the beginning, followed by the name associated with that record (and an optional description). For example `>seq1 Involved in...` would indicate the beginning of a record with the name `seq1` and the description "Involved in...".
 
 * One or more sequence lines follow each header line. These lines are usually wrapped to a length of at most 80 characters, but this is not required.
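
The header rules summarized above can be sketched on a small made-up example. The record names and sequences below are invented for illustration and do not come from the course data files.

```python
# A made-up two-record FASTA text (names and sequences are illustrative only).
fasta_text = """>seq1 Involved in...
ATGGCC
TTAGGA
>seq2
MKVLAA
"""

headers = []
for line in fasta_text.splitlines():
    if line.startswith(">"):                  # header lines begin with ">"
        parts = line[1:].split(maxsplit=1)    # separate the name from the optional description
        name = parts[0]
        description = parts[1] if len(parts) > 1 else ""
        headers.append((name, description))

print(headers)
```

The first record has both a name and a description; the second has only a name, so its description is an empty string.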



In [None]:
def parse_FASTA(fname):
    """Parse a FASTA file and return a dict mapping record names to sequences."""
    record_dict = {}

    recname = ""           # name of the record currently being built
    seq = ""               # sequence string of the record currently being built
    active_record = False  # indicates whether we are currently building a record

    with open(fname, 'r') as f:  # the with-block closes the file when we're done
        for line in f:           # iterate over the file one line at a time

            line = line.strip()  # strip any whitespace at beginning/end of line

            if line == "":       # empty line
                continue         # go to next iteration of for loop

            if line[0] == ">":                  # are we dealing with a new record?
                if active_record:               # did we already have an active record?
                    record_dict[recname] = seq  # if so, add the old record to the dict
                                                # so we can begin a new one

                recname = line[1:].split()[0]   # name of new record
                seq = ""                        # reset variable holding the sequence
                active_record = True            # set flag to indicate we now have an active record
                continue                        # go to the next iteration, as there's nothing else to do

            seq += line                         # sequence line: append it to the current record

    if active_record:               # once the lines are exhausted, we might still have an active record
        record_dict[recname] = seq  # if so, add it to the dict

    return record_dict
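
One way to sanity-check a parser of this kind is to run it on a tiny hand-made file where the expected result is obvious. The sketch below is self-contained, so it uses a compact stand-in parser under the hypothetical name `parse_fasta_sketch` that returns the same kind of dictionary; the file `toy.fsa` and its contents are made up.

```python
import os
import tempfile

def parse_fasta_sketch(fname):
    """Compact stand-in parser (hypothetical name): dict of record name -> sequence."""
    records = {}
    name = None
    with open(fname) as f:
        for line in f:
            line = line.strip()
            if not line:                      # skip empty lines
                continue
            if line.startswith(">"):          # header line starts a new record
                name = line[1:].split()[0]
                records[name] = ""
            elif name is not None:            # sequence line belonging to the current record
                records[name] += line
    return records

# Write a tiny, made-up FASTA file to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy.fsa")
with open(path, "w") as f:
    f.write(">seq1 first toy record\nATGG\nCCTA\n\n>seq2\nGGGTTT\n")

records = parse_fasta_sketch(path)
print(records)
```

The wrapped sequence lines of `seq1` are joined into a single string, and the description after the record name is discarded.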
        
            

### Using our parse_FASTA function

Download the following files to your computer:

* [`covid-ref.fsa`](https://github.com/bio208fs-class/bio208fs-lecture/raw/master/data/covid-ref.fsa)
* [`covid-S-and-E.fsa`](https://github.com/bio208fs-class/bio208fs-lecture/raw/master/data/covid-S-and-E.fsa)

# Assignment 01

Using code blocks, solve the following problems:

a) Show how to use the `parse_FASTA` function to read the `covid-S-and-E.fsa` file. (1 pt)

In [None]:
# code for 1a

b) Show how to use a list comprehension to get the names of each of the records in `covid-S-and-E.fsa` (1 pt)

In [6]:
# code for 1b

c) Show how to get the sequence corresponding to the gene with the name `YP_009724390.1` (1 pt)

In [7]:
# code for 1c

d) Show how to use a single for-loop to create two lists: one containing the name of each protein in `covid-S-and-E.fsa`, and a corresponding one giving the length of each of those proteins (2 pts)

In [8]:
# code for 1d

# Assignment 02

Using code blocks, solve the following problems:
 
a) Refer to the [GenBank record page for the COVID-19 reference genome](https://www.ncbi.nlm.nih.gov/nuccore/NC_045512).  What are the genome coordinates for the coding sequence for the "S" (spike) gene?  What are the genome coordinates for the coding sequence of the "E" (envelope) gene? Assign the start and stop coordinates for each of these genes to variables with appropriate names (1 pt)

In [None]:
# code for 2a

b) Using the DNA nucleotide sequence you loaded from the `covid-ref.fsa` file, show how to retrieve the nucleotide sequences corresponding to the "S" and "E" genes. Remember that Python strings are 0-indexed, whereas GenBank uses 1-indexed coordinates (3 pts)
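
As a hint on the indexing difference, here is a small sketch on a toy sequence. The sequence and the coordinates are invented for illustration; they are not the real gene coordinates. It assumes the coordinates are 1-indexed and inclusive of both ends.

```python
# Toy sequence; in 1-indexed terms its positions run from 1 to 10 (illustrative only).
seq = "ACGTACGTAC"

# Hypothetical 1-indexed, inclusive coordinates in the GenBank style.
start, stop = 3, 6

# Python slices are 0-indexed and exclude the end position, so the
# 1-indexed inclusive range start..stop corresponds to seq[start - 1:stop].
subseq = seq[start - 1:stop]

print(subseq)
```

The resulting substring has `stop - start + 1` characters, which is a quick check that no position was dropped at either end.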

In [None]:
# code for 2b

# Assignment 03

a) Write a translation function, `translate`, that takes as an input a string representing a DNA coding sequence and returns a string representing the corresponding protein sequence (5 pts)

In [12]:
# code for 3a

def translate(seq):
    pass # replace with your code

b) Test your `translate` function by applying it to the coding sequence of the "S" and "E" genes, and comparing your results to the protein sequences from the `covid-S-and-E.fsa` file provided above.

In [3]:
# code for 3b

# Assignment 04

The protein-coding regions of many genes are encoded not as single continuous blocks of the genome, but instead in regions called "exons" that are separated by non-coding regions called "introns". Following transcription, intronic sequences are "spliced out" of messenger RNA (mRNA) by an RNA-protein complex called the "spliceosome". The end product of this splicing process is the sequence that will actually be translated by ribosomes.

a) Write a splicing function that takes as input two arguments:

1.  a string representing the genomic DNA sequence of the gene
2.  a list of lists (or tuples), where each sublist (tuple) contains a pair of integer coordinates giving the start and stop coordinates (1-indexed, relative to the beginning of the sequence) of the exons of the gene.

The output should be a string representing the spliced DNA sequence of the gene (i.e. the exons concatenated in the correct order) (5 pts)

In [10]:
# code for 4a

def splice(seq, exons):
    pass  # replace w/your code

b) Test your `splice` function. Look up the exon information for the yeast gene [ACT1](https://www.yeastgenome.org/locus/S000001855) at the [Saccharomyces Genome Database](https://yeastgenome.org), use that information to splice the sequence in the file [`ACT1-genomic.fsa`](https://github.com/bio208fs-class/bio208fs-lecture/raw/master/data/ACT1-genomic.fsa), and then compare your result to the spliced version of ACT1 in the file [`ACT1-coding.fsa`](https://github.com/bio208fs-class/bio208fs-lecture/raw/master/data/ACT1-coding.fsa) (2 pts)

In [None]:
# code for 4b
