## Notebook 2.3: Files I/O

This notebook will correspond with chapter 7 in the official Python tutorial https://docs.python.org/3/tutorial/.  


### Learning objectives: 

By the end of this exercise you should:

1. Understand how to import libraries.
2. Read and write data to files. 
3. Be able to load fastq genomic data from a file to a Python object.

### Importing a package
Python is a very *atomic* language, meaning that much of its functionality is organized into individual packages that must be loaded before their utilities can be used. This keeps Python lightweight, since the base language does not load these extra utilities unless we ask it to. To load a package that is installed on our system we use the `import` statement, as below. Here we also import a package that is not part of the standard library but was installed separately, called `requests`, which is used to download data from the web.

In [None]:
import os
import gzip
import requests
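
As a minimal sketch of how imported names are used, the cell below shows three common forms of the `import` statement. The alias `gz` is an arbitrary name chosen only for illustration.

```python
# import a whole module, then reach its contents with dot notation
import os
print(os.getcwd())          # the current working directory

# import a module under a shorter alias (the alias name is our choice)
import gzip as gz
print(gz.open)              # the same function as gzip.open

# import a single name from a module
from os.path import basename
print(basename("/home/user/file.txt"))
```

Whichever form is used, the module is only loaded once; the forms differ only in the name you type to reach it.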

### Download data files for this notebook
Run the bash script below to create a new folder and download two files that we will use in this notebook into that folder. This code should look familiar: we used very similar bash commands in the notebooks from session 1.

In [None]:
%%bash
mkdir -p datafiles/
wget http://eaton-lab.org/data/40578.fastq.gz -q -O datafiles/40578.fastq.gz
wget http://eaton-lab.org/data/iris-data-dirty.csv -q -O datafiles/iris-data-dirty.csv

We can perform the same task using Python. Here we will name the directory for the files "datafiles2" to differentiate it. In this case the Python version of the code looks quite a bit more complicated than the bash script. This isn't always the case; indeed, Python code is often much simpler to read. By the end of this notebook you should be able to understand the code below.

In [None]:
# make a new directory
os.makedirs("datafiles2", exist_ok=True)

# download files to that directory
url1 = "http://eaton-lab.org/data/40578.fastq.gz"
with open("./datafiles2/40578.fastq.gz", 'wb') as ffile:
    ffile.write(requests.get(url1).content)

url2 = "http://eaton-lab.org/data/iris-data-dirty.csv"
with open("./datafiles2/iris-data-dirty.csv", 'wb') as ffile:
    ffile.write(requests.get(url2).content)

### List directories
Another common tool that we used in the bash terminal is the `ls` command to look at the files in a given location in the filesystem. Below is the `ls` command as well as a Python equivalent. The `os.listdir()` function in Python returns the contents as a `list`. 

In [None]:
%%bash
ls datafiles/

In [None]:
os.listdir("datafiles2/")
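
Because `os.listdir()` returns a regular list, we can sort and filter it like any other list. The sketch below assumes nothing about your downloaded files: it builds a throwaway directory with three made-up file names so that the result is predictable.

```python
import os
import tempfile

# build a throwaway directory holding three empty files (hypothetical names)
with tempfile.TemporaryDirectory() as tmpdir:
    for name in ["b.csv", "a.fastq.gz", "c.csv"]:
        open(os.path.join(tmpdir, name), "w").close()

    # os.listdir returns names in arbitrary order, so sort for stable output
    all_files = sorted(os.listdir(tmpdir))

    # keep only the names that end in ".csv"
    csv_files = [name for name in all_files if name.endswith(".csv")]

print(all_files)
print(csv_files)
```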

### Using packages
The `os` package has many functions, but we will be using just a small part of it today, primarily the `path` submodule. Just like everything else in Python, packages are also objects, and so we can see all of the functions in this package using tab completion. Put your cursor after the period in the cell below and press `<tab>` to see available options in `os`. There are many!

In [None]:
## use tab-completion after the '.' to see available options in os
os.

### Filepath operations with the `os` package
A type of string that is often difficult to format properly when writing code is a filepath. If the string representation of a filepath is incorrect by even a single typo then the path will not be found. This becomes extra tricky when a program needs to access filepaths on different types of computers, since filepaths look different on a Mac and a PC, for example. Here, understanding the filesystem hierarchy that we learned in lesson 1 becomes important. Fortunately the `os.path` module makes this easy.

### Using `os.path`
The `os.path` submodule is used to format filepaths. We can expand shortened path names, join together multiple paths, and look up special directories such as `$HOME` or the current directory. Essentially, the module gives results similar to the commands we learned in bash scripting last week, such as `pwd` to show your current directory, or `~` as a shorthand for your home directory. Here we can access those filepaths as string variables and work with them very easily.

NB: The goal here is not for you to master the `os` package, but to understand that many such packages exist in the Python standard library and that you can use tab-completion, google search, and other sources to find them and how to use them.

In [None]:
# return my $HOME directory
os.path.expanduser("~")

In [None]:
# convert relative path to a full path
os.path.abspath('./')

<div class="alert alert-success">
    <b>Action:</b> Write a relative path to the iris-data-dirty.csv file that we downloaded earlier and expand it to a full path using the `os.path.abspath()` function.
</div>

### Operations on filepaths

In [None]:
# assign my current dir to a variable
curdir = os.path.abspath('.')
curdir

In [None]:
# get the lowest level directory in curdir
os.path.basename(curdir)

In [None]:
# get the directory structure above curdir
os.path.dirname(curdir)
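
A few more `os.path` functions follow the same pattern. This is a small sketch rather than a full tour; the path used here is only a string, so the file does not need to exist for the splitting operations to work.

```python
import os

# a hypothetical path; it does not need to exist for these string operations
fpath = os.path.join("datafiles", "iris-data-dirty.csv")

# split into (directory, filename)
dirname, filename = os.path.split(fpath)

# split the filename into (root, extension)
root, ext = os.path.splitext(filename)

print(dirname, filename, root, ext)

# check whether a path exists before trying to read from it
print(os.path.exists(fpath))
```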

### Joining filepaths
Because it can be hard to keep track of the "/" characters between directories and filepaths, it is useful to use the `join` function of the `os.path` module to join together path names. Here we will create a string variable with a new pathname for a file that doesn't yet exist in our current directory. You can see in the three examples below that it doesn't matter whether we include a "/" after a directory name or not; the `join` function figures it out for us.

In [None]:
# see how os.path.join handles '/' characters in path names
print(os.path.join("/home/user/fakeuser", "folder1/", "folder2", "newfile.txt"))
print(os.path.join("/home/user/fakeuser", "folder1", "folder2", "newfile.txt"))
print(os.path.join("/home/user/fakeuser/", "folder1/", "folder2/", "newfile.txt"))

In [None]:
# get the full path name to a newfile in our current directory
newfile = os.path.join(curdir, "newfile.txt")
newfile
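
One behavior of `os.path.join` is worth knowing about: a piece that *starts* with the separator is treated as a new root. The sketch below uses `os.sep`, the separator for the operating system running the code, and made-up folder names.

```python
import os

# relative pieces are joined with the separator used by this operating system
joined = os.path.join("folder1", "folder2", "newfile.txt")
print(joined)
print(joined == os.sep.join(["folder1", "folder2", "newfile.txt"]))

# caution: a piece that starts with the separator is treated as a new root,
# so everything before it is discarded
rooted = os.path.join("folder1", os.sep + "folder2", "newfile.txt")
print(rooted)
```

So a trailing "/" on a directory name is harmless, but a leading "/" on a later piece changes the result.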

### Writing files

The function `open` is used to create file objects, which let us read from or write to files. The format for this is `open(filename, mode)`, where mode describes what you plan to do with the file. The main modes are `w` for write, `r` for read, and `a` for append. Below we will use `w`, which creates a new file or overwrites an existing file of the same name.

In [None]:
# get an open file object
ofile = open("./datafiles/helloworld.txt", 'w')

# see the file object
ofile

#### File objects
As with other objects, `ofile` has attributes and functions that we can access and see by using tab-completion. Move your cursor to the end of the object below after the period and use tab to see some of the options. 

In [None]:
## use tab to see options associated with open file objects
ofile.

Use the `.write()` function to write a string to the file. 

In [None]:
# write a string to the file. 
# It returns the number of characters written, which we can ignore for now.
ofile.write("Hello world")

In [None]:
# when we are done writing to the file use .close()
ofile.close()
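
To see how the `w` and `a` modes differ, here is a small sketch that writes to the same file three times. It uses a temporary directory and a made-up file name so that none of your real files are touched.

```python
import os
import tempfile

# write to a throwaway directory so that no real files are touched
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "modes.txt")

    # 'w' creates the file, or erases it first if it already exists
    ofile = open(path, "w")
    ofile.write("first\n")
    ofile.close()

    # opening with 'w' again discards "first"
    ofile = open(path, "w")
    ofile.write("second\n")
    ofile.close()

    # 'a' keeps the existing contents and adds to the end
    ofile = open(path, "a")
    ofile.write("third\n")
    ofile.close()

    # 'r' reads the contents back
    ifile = open(path, "r")
    contents = ifile.read()
    ifile.close()

print(contents)
```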

### Reading files
To read the data from a file we use a similar format as to write, but with the mode flag `r`. When we show the representation of the file object below you can see that this also returns an open file object, but this time in read mode. We can now access a different set of functions from this object to retrieve data from the file. We will use the `.read()` function to read and return all contents from the file as a string object and store it as the variable `idata`. 

In [None]:
ifile = open("./datafiles/iris-data-dirty.csv", 'r')
ifile

In [None]:
## read returns all of the contents as a string
idata = ifile.read()

In [None]:
## show the first 50 characters
idata[:50]

In [None]:
## close the file handle
ifile.close()

### Gzip compressed files
Gzip compression, as well as many other forms of compression, is easily handled in Python using the standard library. The `gzip` module has an `open()` function that acts just like the regular `open` to create a file object. Let's try it out on the compressed fastq file we just downloaded.

Let's also practice using `os.path` to find the full filepath of the `40578.fastq.gz` file. 

Then, as in the last example, we simply use `.read()` to read the full contents and store it in a variable. Because we open the file in binary mode (`'rb'`), `.read()` returns a bytes object, so we also add `.decode()` to convert it to a string (the default encoding is `utf-8`).

In [None]:
## get full path to the file in our current directory
gzfile = os.path.abspath("./datafiles/40578.fastq.gz")
gzfile

In [None]:
## read compressed byte data from this file
ffile = gzip.open(gzfile, 'rb')
fdata = ffile.read().decode()
ffile.close()

In [None]:
## show some data from the file
print(fdata[:200])
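
An alternative to calling `.decode()` is to open the file in text mode (`'rt'`), in which case `gzip.open` does the decoding for us. The sketch below illustrates both modes on a tiny made-up gzip file written to a temporary directory, rather than on the downloaded data.

```python
import gzip
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.txt.gz")

    # 'wt' writes text; gzip compresses it on the way to disk
    with gzip.open(path, "wt", encoding="utf-8") as out:
        out.write("@read1\nACGT\n+\nIIII\n")

    # 'rb' returns bytes, which must be decoded to get a string
    with gzip.open(path, "rb") as infile:
        as_bytes = infile.read()

    # 'rt' decodes for us and returns a string directly
    with gzip.open(path, "rt", encoding="utf-8") as infile:
        as_text = infile.read()

print(type(as_bytes), type(as_text))
print(as_text)
```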

### Reading data with the `read()` function
The `read()` function is nice for reading in a large chunk of text, but it then requires us to parse that text using string processing, like we learned in our earlier notebook. Let's use string processing to split the contents of the file into a list. Perhaps instead of separating contents on every line, as we did for this file when we analyzed it from a bash terminal, we instead would like to chunk it up so that it is split into elements that cover four lines. We can do this by using our own "split" separator. From looking at the text above we can see that each four-line element is separated by the string `"\n@"`, so we'll use that.

In [None]:
## split the fdata string on each occurrence of "\n@"
freads = fdata.split("\n@")

## print the first element in the list
print("The first read: \n{}".format(freads[0]))

## print the last element in the list
print("\nThe last read: \n{}".format(freads[-1]))

## print the number of reads in the file
print("\nN reads in the file = {}".format(len(freads)))
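
Splitting on `"\n@"` works for the file above, but it is not safe for fastq data in general, because "@" is also a valid quality character and so a quality line can begin with it. A more cautious sketch is to group the lines in fours. This assumes every read occupies exactly four lines, and the data here is made up for illustration.

```python
# a tiny made-up fastq string with two reads; the second quality line
# starts with "@", which would confuse a split on "\n@"
fastq = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\n@III\n"

# split into lines, then step through the list four lines at a time
lines = fastq.strip().split("\n")
reads = []
for idx in range(0, len(lines), 4):
    header, sequence, plus, quality = lines[idx:idx + 4]
    reads.append((header, sequence, quality))

print(len(reads))
print(reads[1])

# compare with the naive split, which finds one element too many here
print(len(fastq.strip().split("\n@")))
```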

## The fastq file format
Read details of the [fastq file format here](https://en.wikipedia.org/wiki/FASTQ_format). This is a file format for next-generation sequence data that we will use frequently throughout this course. 

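### Phred quality scores
The fastq sequence format stores sequence reads together with a quality score for each base. These are called Phred quality scores, and each one is encoded as a single character in the fourth line of a read.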
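
As a rough sketch of how quality characters map to numbers, the cell below converts a made-up quality string to Phred scores. It assumes the common Phred+33 encoding, in which the score is the ASCII code of the character minus 33; other offsets exist, so check which one your data uses.

```python
# a made-up quality string, one character per base
quality = "II5!"

# assuming Phred+33 encoding: score = ASCII code of the character minus 33
scores = [ord(char) - 33 for char in quality]
print(scores)

# a Phred score Q corresponds to an error probability of 10 ** (-Q / 10)
error_probs = [10 ** (-q / 10) for q in scores]
print(error_probs)
```

A higher score means a lower probability that the base was called incorrectly.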

### Using context to automatically open & close files

In Python there is a special keyword called `with` that wraps a block of statements in a context. Everything indented inside the `with` statement has access to the object that was created when the statement was entered. This is often used to open a file object: file objects opened using `with` automatically close themselves when the block ends, even if an error occurs inside it. See an example below. This is a much more compact way of opening and closing files than what we were using before.

In [None]:
## infile will automatically close when finished.
with open("./datafiles/iris-data-dirty.csv", 'r') as infile:
    data = infile.readlines()

In [None]:
data[:10]
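
Inside a `with` block we can also loop over the file object directly, which yields one line at a time. The sketch below uses a small made-up CSV file in a temporary directory, and checks that the file really is closed once the block ends.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "small.csv")

    # write a small made-up CSV file; it is closed when the block ends
    with open(path, "w") as outfile:
        outfile.write("5.1,3.5,setosa\n6.2,2.9,versicolor\n")

    # loop over the open file one line at a time
    rows = []
    with open(path, "r") as infile:
        for line in infile:
            # strip the trailing newline, then split on commas
            rows.append(line.strip().split(","))

# the file object still exists but is now closed
print(infile.closed)
print(rows)
```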

## Downloading data from the web in Python

The standard way to use the `requests` library is to make a GET request to a URL, which is a request to read the data at that address. This returns a `response` object that we can then query for information. The `response` object will contain an error status code (such as 404) if the page cannot be found or access is blocked, and it will contain the contents of the page, such as the HTML text of a webpage, if the request is successful.

In [None]:
# store urls as strings
url1 = "http://eaton-lab.org/data/40578.fastq.gz"
url2 = "http://eaton-lab.org/data/iris-data-dirty.csv"

The new variable 'response' here is a Python object just like the other object types we've learned about. We can access functions of this object using tab completion. 

In [None]:
# see the response object (200 means successful GET)
response = requests.get(url2)
response

In [None]:
# show the first 50 characters of data
response.text[:50]

In [None]:
# split the string of text on each newline character and keep the first 10 lines
lines = response.text.split("\n")[:10]
lines

It is often useful to split a string into separate elements as a list, and then operate on those list elements. When finished, we then wish to join the list elements back together into a string object. This can be done using the `.join()` function, which is a function of string objects. The object calling join is the string that you want to be placed in between each element of the list being joined. Some examples below. 

In [None]:
# join together lines with no separator
"".join(lines)

In [None]:
# join on newline characters
"\n".join(lines)

In [None]:
# remember newlines are only rendered when you print
print("\n".join(lines))
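
Putting split and join together, here is a minimal sketch of the full round trip: split a string into a list, change each element, and join the list back into a string. The text is made up for illustration.

```python
# a made-up string with three comma-separated lines
text = "a,1\nb,2\nc,3"

# split into a list of lines
parts = text.split("\n")

# operate on each element: here, swap commas for tabs
parts = [part.replace(",", "\t") for part in parts]

# join the list back into a single string
result = "\n".join(parts)
print(result)
```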

In [None]:
# join on an arbitrary phrase
"Helloworld".join(lines)

### Challenges
Your challenge is to perform similar tasks to those we did in the first bash assignment, but using Python. We'll focus on filtering and counting the Iris data set. This will use the skills you learned for operating on strings and lists, as well as reading and writing files. 

<div class="alert alert-success">
    <b>Action:</b> 
    This challenge builds on the last challenge from the last notebook. You can reuse your function from the last notebook to generate random sequence data. Write code below to combine a fasta header (e.g., "> sequence name") and random sequence data to create valid fasta data. Then write the data to a file and save it as "datafiles/sequence.fasta". 
</div>

<div class="alert alert-success">
    <b>Action:</b> 
    You have now learned about two sequence file formats, fasta and fastq. If you do not remember the details of fasta then use google or look back at your notebooks from session 1. Fastq contains more information than fasta since it also stores quality information for each base. Your challenge here is to write a function to convert fastq to fasta. All of the code you need appears in snippets in the examples above. Feel free to use google or the chatroom to seek further help if needed. Your answer must: (1) Write a function; (2) The function must read the 'datafiles/40578.fastq.gz' file from disk; (3) It must convert the data to fasta format; and (4) It must write the result to a file "datafiles/40578.fasta".
    
Be sure you look at your fasta file after you write it to check that it looks how you expect. If not, modify your code and try again. 
</div>

<div class="alert alert-success">
    <b>Question:</b> 
   Describe each step of your function above verbally, in other words, explain how and why it works. Describe any parts that gave you trouble and how you found a solution. Enter your answer below using Markdown. 
   </div>

<div class="alert alert-success">
    <b>Action:</b> 
    Save your notebook and download as HTML to upload to courseworks.
</div>