# Working with files

You will often use Python to work with data stored in a separate file. A few important points to consider:

* do I need to read the entire dataset/file into memory?
* are there concurrency issues?

A simple syntax for working with files is as follows:

```python
stream = open(filename, mode)

for line in stream:
    # do something with each line
    pass

stream.close()
```



Another way to read a file line by line is the `readline()` method:

```python
stream = open(filename,mode)

firstLine = stream.readline()
secondLine = stream.readline()

stream.close()
```

You can also use the `read()` method to extract a specific number of characters (or bytes, if the file was opened in binary mode). For example:

```python
stream = open(filename,mode)

twoBytes = stream.read(2)

stream.close()
```
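To see how `readline()` and `read()` advance through a file, here is a small self-contained sketch. The file name `demo.txt` and its contents are made up for illustration, and the file is written to a temporary directory so nothing in your working directory is touched.

```python
import os
import tempfile

# Hypothetical demo file, created in a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")

stream = open(demo_path, "w")
stream.write("ACGT\nTTGA\nCCAT\n")
stream.close()

stream = open(demo_path, "r")
first_line = stream.readline()   # 'ACGT\n' (the newline is kept)
two_chars = stream.read(2)       # 'TT': reading resumes where readline() stopped
rest = stream.read()             # no argument: everything that is left
stream.close()

print(repr(first_line), repr(two_chars), repr(rest))
```

Each call picks up where the previous one left off, so mixing `readline()` and `read()` on the same stream is fine as long as you keep track of the position.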

**It is very important to close your file after you've opened it**. Until a file is closed, data you have written may still be sitting in a buffer rather than on disk, and every open file ties up an operating system file handle.

A more "Pythonic", contemporary style:

```python
with open(filename, mode) as FILEHANDLE:

    for line in FILEHANDLE:

        # do something with each line
        pass
```

Note here that there is no explicit call to close. This elegant `with` construction takes care of it for you!
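One way to convince yourself that the file really is closed is to check the `closed` attribute of the file object. A minimal sketch, using a throwaway file (the name `example.txt` is an assumption for illustration):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.txt")  # hypothetical file

with open(path, "w") as handle:
    handle.write("hello\n")
    inside = handle.closed    # False: still open inside the block

outside = handle.closed       # True: closed automatically on leaving the block

print(inside, outside)
```

The file is also closed if an exception is raised inside the `with` block, which is hard to guarantee with a manual `close()` call.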

**What syntax does one use to open a file for reading? For writing?**

**What if the file doesn't already exist?**

https://docs.python.org/3/tutorial/inputoutput.html

The "modes" for `open()` are essentially the same as those for `fopen()` in C:

http://www.manpagez.com/man/3/fopen/
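As a rough illustration of the most common modes, the sketch below uses a throwaway file in a temporary directory (the name `notes.txt` is made up). It shows that `'r'` fails if the file is missing, `'w'` creates or truncates the file, and `'a'` appends to it.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "notes.txt")  # hypothetical file

# 'r' (read): the file must already exist
try:
    open(path, "r")
    missing = False
except FileNotFoundError:
    missing = True

# 'w' (write): creates the file, or empties it if it already exists
with open(path, "w") as out:
    out.write("first\n")

# 'a' (append): adds to the end, creating the file if needed
with open(path, "a") as out:
    out.write("second\n")

with open(path, "r") as handle:
    contents = handle.read()

print(missing, repr(contents))
```

Be careful with `'w'`: opening an existing file in this mode discards its previous contents immediately.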

You can open more than one file object concurrently, which is useful for reading and writing to separate files:

```python
with open('input.txt', 'r') as IN, open('output.txt', 'w') as OUT:

    dosomething()
```
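For instance, a sketch that reads one file and writes a modified copy to another. The file names and the upper-casing step are placeholders chosen for illustration, and the input file is created first so the example runs on its own:

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "input.txt")    # hypothetical file names
out_path = os.path.join(workdir, "output.txt")

with open(in_path, "w") as handle:
    handle.write("acgt\nttga\n")

# Both files are open inside the block, and both are closed on exit
with open(in_path, "r") as IN, open(out_path, "w") as OUT:
    for line in IN:
        OUT.write(line.upper())

with open(out_path, "r") as handle:
    result = handle.read()

print(repr(result))
```

Because the input is processed one line at a time, this pattern works even when the input file is too large to fit in memory.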

# Exercise 10

Using the tab-delimited annotation file "dictyBase_17022017_mod.gff", extract only the first 10 features and print these to a new file. 

Before you begin, be aware that printing many lines of text will consume a lot of memory and may cause your browser to stall or crash. While you are building and debugging your code, you can exit a loop early using the `break` statement:

```python
for line in IN:

    print(line)
    break
```

**Which method can be used to remove the newline character from a line read from a file?**

# Exercise 11

Using the same Dictyostelium GFF, report all of the feature entries for exons on chromosome 1 between coordinates 40000 and 50000.

* Column 1: chromosome
* Column 2: data source
* Column 3: feature type
* Column 4: feature start coordinate
* Column 5: feature end coordinate

Hint: use `split()`
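As a sketch of how `split()` turns one tab-delimited line into columns, consider the made-up line below (it is not taken from the dictyBase file). Note that list indices start at 0, so column 4 is `fields[3]`, and that every field is a string until you convert it.

```python
# A made-up GFF-style line, for illustration only
line = "chrX\texample_source\tgene\t1200\t1850"

fields = line.split("\t")

chromosome = fields[0]
feature_type = fields[2]
start = int(fields[3])   # convert to int before comparing coordinates
end = int(fields[4])

print(chromosome, feature_type, start, end)
```

Comparing coordinates as strings gives misleading results (for example, `"9" > "10"` is `True`), which is why the conversion to `int` matters.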

# Exercise 12

Read the tab-delimited annotation file "GRCm38_chr1.gtf". It has the following 9-column format describing one feature entry per line (http://www.ensembl.org/info/website/upload/gff.html):

1"\t"havana"\t"transcript"\t"3073253"\t"3074322"\t"."\t"+"\t"."\t"gene_biotype "TEC"; gene_id "ENSMUSG00000102693"; gene_name "4933401J01Rik"; gene_source "havana"; gene_version "1"; havana_gene "OTTMUSG00000049935"; havana_gene_version "1"; havana_transcript "OTTMUST00000127109"; havana_transcript_version "1"; tag "basic"; transcript_biotype "TEC"; transcript_id "ENSMUST00000193812"; transcript_name "4933401J01Rik-001"; transcript_source "havana"; transcript_support_level "NA"; transcript_version "1"; tss_id "TSS81250";

Extract the sequence for every UTR interval between coordinates 166000000 and 166100000 using the FASTA file "GRCm38_chr1.fa". The UTR sequences should be returned in FASTA format, using the start and end coordinates and UTR length as the FASTA header.

**Consider writing pseudo-code before you begin writing actual code**

# Exercise 13

Sometimes FASTA files have the sequence data separated by some number of characters per line, for example "GRCm38_chr1_100pl.fa". Repeat Exercise 12 using this line-separated FASTA file.

# A Brief Intro to Biopython

Via Wikipedia (https://en.wikipedia.org/wiki/Biopython):

"The Biopython Project is an open-source collection of non-commercial Python tools for computational biology and bioinformatics, created by an international association of developers. It contains classes to represent biological sequences and sequence annotations, and it is able to read and write to a variety of file formats. It also allows for a programmatic means of accessing online databases of biological information, such as those at NCBI. Separate modules extend Biopython's capabilities to sequence alignment, protein structure, population genetics, phylogenetics, sequence motifs, and machine learning. Biopython is one of a number of Bio projects designed to reduce code duplication in computational biology."

You can install Biopython into the desired environment using the Anaconda Navigator GUI. Alternatively, you can install from the command line using something like the following:

```
conda info --envs

conda activate <py-env>  # older conda versions: source activate <py-env> (Windows: activate <py-env>)

conda install -c anaconda biopython

# more info on conda-managed environments here: https://conda.io/docs/using/envs.html
```

Have a look at the Biopython documentation here:

http://biopython.org/wiki/Documentation

Many Python modules include much more code than you will actually use in any one script. This is particularly true for huge packages like Biopython. Usually, you only need to import a small section of the code. The syntax looks something like:

```python
from Bio import SeqIO
```

Bio is the name for the Biopython module in its entirety. SeqIO is one of the "names" within the module (you can read more about namespaces and scope here: ). So in the above syntax, you are only importing the code in the namespace of SeqIO. Its functions are then accessed using dot (.) notation:

```python
from Bio import SeqIO

with open('GRCm38_chr1_100pl.fa', 'r') as FASTA:

    for record in SeqIO.parse(FASTA, "fasta"):

        # do something with record.id, record.seq, etc.
        pass
```

The documentation for SeqIO is (currently) here: 


You can load multiple sections of a module simultaneously:

```python 
from Bio.Phylo.PAML import codeml, baseml, yn00
```
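The same import patterns can be tried with the standard library. The sketch below uses `os.path` rather than Biopython so that it runs without any extra installs; the path `data/GRCm38_chr1.fa` is only a string and does not need to exist.

```python
# One name from a package, used with dot notation
from os import path
joined = path.join("data", "GRCm38_chr1.fa")

# Several names at once, used directly without a prefix
from os.path import basename, splitext
name = basename(joined)
stem, extension = splitext(name)

print(name, stem, extension)
```

Importing only the names you need keeps your own namespace small and makes it clear to a reader where each function comes from.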

Much of the Biopython functionality is covered in the "cookbook" (http://biopython.org/DIST/docs/tutorial/Tutorial.html) or in domain-specific documentation, e.g.:

* http://biopython.org/wiki/SeqIO
* http://biopython.org/wiki/PAML

Another very interesting project to keep an eye on is [scikit-bio](http://scikit-bio.org/). Parsing a FASTA file with scikit-bio might look something like the following:

```python
import skbio.io

for seq in skbio.io.read("myseqs.fasta", format="fasta"):

    print(seq.metadata["id"])
    print(str(len(seq)))
    print(seq)
```

# Exercise 14

Repeat Exercises 12 and 13 using Biopython or scikit-bio to parse your FASTA file.

# Exercise 15

The goal of this exercise is to reconstruct transcripts from a GTF file. Read the tab-delimited annotation file "GRCm38_chr1.gtf". Determine all of the exons for each transcript (by transcript ID), and then concatenate the exon sequences, extracted from the FASTA file "GRCm38_chr1.fa", to build the transcript model sequence. For simplicity, limit this exercise to + strand transcript models.

Hint: if you are using multidimensional dicts and running into key errors (uninitialized keys), consider using the `try` and `except` construction:

```python
try:

    # try to populate your multi-dimensional dict
    pass

except KeyError:

    # initialize the missing key, then populate your multi-dimensional dict
    pass
```

You can read more about exceptions in Python here: https://docs.python.org/3/library/exceptions.html
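A concrete sketch of this pattern, using made-up transcript IDs and coordinates rather than real GTF records. The dict layout (`exons[transcript_id][exon_number]`) is only one possible design:

```python
# Made-up records: (transcript_id, exon_number, start, end)
records = [
    ("tx1", 1, 100, 200),
    ("tx1", 2, 300, 450),
    ("tx2", 1, 500, 650),
]

exons = {}

for transcript_id, exon_number, start, end in records:
    try:
        # works only if exons[transcript_id] already exists
        exons[transcript_id][exon_number] = (start, end)
    except KeyError:
        # first time we see this transcript: create the inner dict
        exons[transcript_id] = {exon_number: (start, end)}

print(exons)
```

The built-in `dict.setdefault()` method and `collections.defaultdict` are other common ways to handle uninitialized keys.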

Alternatively, you can consider using the following custom class of dict (uses "autovivification"):

```python
class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()
        return value
```