## Reading from and writing to files

The `open` function is the standard way to gain access to a file on your filesystem. Files can be opened for reading only, for writing, or for both reading and writing. By default, files are opened in read-only mode. The primary argument to the `open` function is a string giving the path and name of the file you want to open:

In [1]:
### open function and file objects
f = open("/Users/pmagwene/Downloads/covid-ref.fsa")

The simplest way to work with a file is to read everything in it at once. The `read` method returns the entire contents of the file as a single string:

In [2]:
### reading everything from a file as a single string
s = f.read()

Once you've read what you need from the file, it's good practice to close it. Failing to close a file can tie up system resources in some contexts, though it's usually not a problem in an interactive environment like a Jupyter notebook.

In [3]:
f.close()

An alternative way to read a file is within a `with` statement, as illustrated below. The advantage of the `with` statement is that it ensures your file is closed when the block ends, even if an error occurs (i.e. you don't need to explicitly call the `close` method).

In [4]:
with open("/Users/pmagwene/Downloads/covid-ref.fsa") as f:
    s = f.read()

Once we've read the file into a string we can apply all the standard string methods and operators to it:

In [5]:
len(s)  ## how many characters were in the file?

30429

In [6]:
s[:50]  ## first 50 chars in the file

'>NC_045512.2 Severe acute respiratory syndrome cor'
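The examples above only read from a file. As a minimal sketch of the writing side, the second argument to `open` selects the mode: `"w"` for writing, `"a"` for appending, and `"r"` (the default) for reading. The file name `example.txt`, the temporary directory, and the record contents below are made up for illustration.

```python
import os
import tempfile

# Hypothetical file in a fresh temporary directory, so nothing important is overwritten
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# "w" opens the file for writing, creating it or replacing any existing contents
with open(path, "w") as f:
    f.write(">seq1 a made-up record\n")
    f.write("ATGCATGC\n")

# "a" opens the file for appending; new text is added at the end
with open(path, "a") as f:
    f.write("GGCC\n")

# The default mode, "r", opens the file for reading
with open(path) as f:
    contents = f.read()

print(contents)
```

Note that `write` does not add a newline for you; each `"\n"` has to be included explicitly.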

## Reading a file by lines

Sometimes it's more convenient or more efficient to get the information in a file one line at a time. The `readlines` method of file objects reads all the lines at once and returns them as a list of strings:

In [7]:
## return a list of the lines in the file
with open("/Users/pmagwene/Downloads/covid-ref.fsa") as f:
    lines = f.readlines()

In [8]:
len(lines)

430

In [9]:
lines[0]

'>NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome\n'

In [10]:
lines[1]

'ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA\n'
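Notice that each string returned by `readlines` still ends with the newline character `\n`. A small sketch of one way to remove it, using a made-up list `example_lines` in place of lines read from a real file:

```python
# Made-up lines, shaped like the strings that readlines() returns
example_lines = [">seq1 a made-up header\n", "ATTAAAGGTT\n", "TATACC\n"]

# rstrip("\n") removes the trailing newline and leaves everything else untouched
stripped = [line.rstrip("\n") for line in example_lines]

print(stripped)
print([len(line) for line in stripped])  # lengths no longer include the newline
```

Calling `strip()` with no arguments would also work, but it removes all leading and trailing whitespace rather than just the newline.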

## Reading a file one line at a time

The `readlines` method illustrated above reads all the lines at once. That works well if your file has a modest number of lines, but for a file with millions of lines (or very long single lines) `readlines` might exhaust the memory of your computer. One way to work around this is to process the file line by line, holding only one line in memory at a time. This can be done with a for loop applied directly to the file object:

In [11]:
with open("/Users/pmagwene/Downloads/covid-ref.fsa") as f:
    nchars = []
    for line in f:
        nchars.append(len(line)) # let's pretend our task was to count line lengths for files w/millions of lines

nchars[:10]  # lengths of the first 10 lines

[97, 71, 71, 71, 71, 71, 71, 71, 71, 71]

For such a simple computation we'd typically use a list comprehension:

In [12]:
with open("/Users/pmagwene/Downloads/covid-ref.fsa") as f:
    nchars = [len(line) for line in f]

nchars[:10] 

[97, 71, 71, 71, 71, 71, 71, 71, 71, 71]
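When only a summary value is needed, a generator expression avoids building a list at all, so memory use stays small regardless of file size. The sketch below assumes a tiny made-up file, `toy.fsa`, written to a temporary directory so the example can run on its own.

```python
import os
import tempfile

# Write a tiny made-up FASTA file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "toy.fsa")
with open(path, "w") as f:
    f.write(">seq1 toy record\nATGC\nGGA\n")

# Sum the lengths of the sequence lines without storing the lines themselves
with open(path) as f:
    total = sum(len(line.rstrip("\n")) for line in f if not line.startswith(">"))

print(total)
```

The generator expression is consumed one line at a time by `sum`, so the `with` block must still be open while the sum is computed.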

# The FASTA file format for sequence data

The FASTA format is the most commonly used file format for representing nucleotide and protein sequence data. Wikipedia has a good [overview of the FASTA format](https://en.wikipedia.org/wiki/FASTA_format).

Summary of FASTA format:
 
 * Each file can hold one or more sequence records
 
 * The beginning of each record is delimited by a line called a header, which has a `>` character at the beginning, followed by the name associated with that record (and an optional description). For example `>seq1 Involved in...` would indicate the beginning of a record with the name `seq1` and the description "Involved in...".
 
 * One or more sequence lines follow each header line. These lines are usually wrapped to at most 80 characters, but this is not required.
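As a toy illustration of these rules, the snippet below holds a made-up two-record FASTA file in a string (`fasta_text`, with invented names and sequences) and picks out the header lines and record names.

```python
# A made-up FASTA file with two records, stored as a multi-line string
fasta_text = """>seq1 Involved in a made-up process
ATGCATGCAT
GGCC
>seq2
TTAACC
"""

# Header lines are the ones that start with ">"
headers = [line for line in fasta_text.splitlines() if line.startswith(">")]

# The record name is the first word after the ">"; any description follows it
names = [header[1:].split()[0] for header in headers]

print(headers)
print(names)
```

Here `seq1` has a description and a sequence wrapped over two lines, while `seq2` has no description and a single sequence line; both are valid.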



We'll illustrate this format with an example: the COVID-19 reference genome. The [GenBank](https://www.ncbi.nlm.nih.gov/genbank/) record for the COVID-19 reference genome can be accessed via [this link](https://www.ncbi.nlm.nih.gov/nuccore/NC_045512), and the FASTA-formatted DNA sequence is available directly [here](https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta&log$=seqview&format=text).

I've also posted a copy of the COVID-19 reference genome on the course GitHub site; the raw FASTA file is available here: [covid-ref.fsa](https://github.com/bio208fs-class/bio208fs-lecture/raw/master/data/covid-ref.fsa).


In [13]:
def parse_FASTA(fname):
    """Parse a FASTA file into a dict mapping record names to sequence strings."""
    record_dict = {}

    recname = ""           # will hold the name of the current record
    seq = ""               # will hold the sequence string of the current record
    active_record = False  # indicates whether we are currently building a record

    with open(fname, 'r') as f:  # the with statement closes the file for us
        for line in f:           # read one line at a time rather than all at once

            line = line.strip()  # strip any whitespace at beginning/end of line

            if line == "":       # empty line
                continue         # go to next iteration of for loop

            if line[0] == ">":                 # are we dealing with a new record?
                if active_record:              # did we already have an active record?
                    record_dict[recname] = seq # if so, add the old record to the dict
                                               # so we can begin a new one

                recname = line[1:].split()[0]  # name of new record (first word after ">")
                seq = ""                       # reset variable holding the sequence
                active_record = True           # set flag to indicate we now have an active record
                continue                       # nothing else to do for a header line

            seq += line                        # sequence line: append to the current record

    if active_record:               # after the last line, we might still have an active record
        record_dict[recname] = seq  # if so, add it to the dict

    return record_dict
        
            

To test our `parse_FASTA` function, download the `covid-ref.fsa` file linked above to your computer and modify the path in the line below so that it loads and parses the sequence records contained in that file.

In [14]:
recs = parse_FASTA("/Users/pmagwene/Downloads/covid-ref.fsa")

In [15]:
type(recs)

dict

In [16]:
len(recs)

1

In [17]:
list(recs.keys())  # get a list of the keys in recs

['NC_045512.2']

In [18]:
recs["NC_045512.2"][:10]  # get the first 10 characters in the sequence for the record with this name

'ATTAAAGGTT'

### Using our parse_FASTA function

Download the following files to your computer:

* [`covid-ref.fsa`](https://github.com/bio208fs-class/bio208fs-lecture/raw/master/data/covid-ref.fsa)
* [`covid-S-and-E.fsa`](https://github.com/bio208fs-class/bio208fs-lecture/raw/master/data/covid-S-and-E.fsa)

# Assignment 01

Using code blocks, solve the following problems:

a) Show how to use the `parse_FASTA` function to read the file `covid-S-and-E.fsa`. (1 pt)

In [19]:
# code for 1a

b) Show how to use a list comprehension to get the gene names of each of the records in `covid-S-and-E.fsa` (1 pt)

In [20]:
# code for 1b

c) Show how to get the sequence corresponding to the gene with the name `YP_009724390.1` (1 pt)

In [21]:
# code for 1c

d) Show how to use a single for-loop to create two lists: one containing the name of each protein in `covid-S-and-E.fsa`, and a corresponding list giving the length of each of those proteins (2 pts)

In [22]:
# code for 1d

# Assignment 02

Using code blocks, solve the following problems:
 
a) Refer to the [GenBank record page for the COVID-19 reference genome](https://www.ncbi.nlm.nih.gov/nuccore/NC_045512).  What are the genome coordinates for the coding sequence for the "S" (spike) gene?  What are the genome coordinates for the coding sequence of the "E" (envelope) gene? Assign the start and stop coordinates for each of these genes to variables with appropriate names (1 pt)

In [23]:
# code for 2a

b) Using the DNA nucleotide sequence you loaded from the `covid-ref.fsa` file, show how to retrieve the nucleotide sequences corresponding to the "S" and "E" genes. Remember that Python strings are 0-indexed, whereas GenBank uses 1-indexed coordinates (3 pts)
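As a reminder of how the two indexing conventions relate, here is a toy sketch on a made-up ten-base sequence. It assumes the 1-indexed coordinates are inclusive at both ends, which is the GenBank convention for feature locations.

```python
seq = "ACGTACGTAC"   # made-up sequence; 1-indexed positions run from 1 to 10

# 1-indexed, inclusive coordinates: bases 3, 4, 5 and 6
start, stop = 3, 6

# Python slices are 0-indexed and exclude the end position,
# so subtract 1 from the start and leave the stop unchanged
subseq = seq[start - 1:stop]

print(subseq)
print(len(subseq) == stop - start + 1)
```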

In [24]:
# code for 2b

# Assignment 03

a) Write a translation function, `translate`, that takes as an input a string representing a DNA coding sequence and returns a string representing the corresponding protein sequence (5 pts)

In [25]:
# code for 3a

def translate(seq):
    pass # replace with your code

b) Test your `translate` function by applying it to the coding sequence of the "S" and "E" genes, and comparing your results to the protein sequences from the `covid-S-and-E.fsa` file provided above.

In [26]:
# code for 3b

# Assignment 04

The protein coding regions of many genes are encoded not as single continuous blocks of the genome, but instead in regions called "exons" that are separated by non-coding regions called "introns".  Following transcription, intronic sequences are "spliced out" of messenger RNA (mRNA) by an RNA-protein complex called the "spliceosome".  The end product of this splicing process is the sequence that will actually be translated by ribosomes.

a) Write a splicing function that takes as input two arguments:

1.  a string representing the genomic DNA sequence of the gene
2.  a list of lists (or tuples), where each sublist (tuple) contains a pair of integer coordinates giving the start and stop coordinates (1-indexed, relative to the beginning of the sequence) of one exon of the gene.

The output should be a string representing the spliced DNA sequence of the gene (i.e. the exons concatenated in the correct order) (5 pts)

In [27]:
# code for 4a

def splice(seq, exons):
    pass  # replace w/your code

b) Test your `splice` function on the yeast gene [ACT1](https://www.yeastgenome.org/locus/S000001855). Look up the exon information for ACT1 at the [Saccharomyces Genome Database](https://yeastgenome.org), use that information to splice the sequence in the file [`ACT1-genomic.fsa`](https://github.com/bio208fs-class/bio208fs-lecture/raw/master/data/ACT1-genomic.fsa), and then compare your result to the spliced version of ACT1 in the file [`ACT1-coding.fsa`](https://github.com/bio208fs-class/bio208fs-lecture/raw/master/data/ACT1-coding.fsa) (2 pts)

In [28]:
# code for 4b
