# Reading and Writing Files

Now that we can process text, all we need is... more text. And odds are, that text is going to come in the form of a file, so it's high time that we start using them.

## Opening Filehandles

A filehandle is an object that controls the stream of information between your program and a file stored somewhere on the computer. Filehandles are not filenames, and they are not the files themselves. Like variables, filehandles contain the address of the file on the hard drive or other storage media. But unlike variables, filehandles also keep track of your current read position in the file. Imagine your file is like a book in a library. The filehandle tells Python where that book is, and keeps a bookmark in the book for where you currently are. Because filehandles are not the files themselves, deleting a filehandle in your script using the **del** statement does nothing to the file that handle refers to.

We create filehandles in the simplest sense with the **open()** command:

```python
fh = open('some_file')
```

where some_file is the path to a file (i.e. the filename) on your filesystem. In general, it is good practice to use absolute path nomenclature (e.g. /Users/aaron/some_file or /home/aaron/some_file), but you can be lazy if you know the file you want is going to be in the same directory as your program.
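If you are unsure what absolute path a short filename corresponds to, Python can tell you. The following is a minimal sketch using the standard `os` module; `some_file` is just the placeholder name from above, and it does not need to exist because nothing is opened here. The parenthesized `print` is used so the snippet runs under both Python 2 and Python 3.

```python
import os

# 'some_file' is the placeholder name from the text; no file is opened,
# so it does not have to exist for this to work.
relative_name = 'some_file'

# abspath() joins the name onto the current working directory.
absolute_name = os.path.abspath(relative_name)

print(os.getcwd())      # the directory Python treats as "here"
print(absolute_name)    # that directory with 'some_file' on the end
```

Passing `absolute_name` to **open()** refers to the same file as passing `relative_name`, as long as the working directory has not changed in between.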

In [1]:
fh = open('hello.txt')
contents = fh.read()
print contents
fh.close()

THIS IS A TEXT FILE
THAT I AM USING AS AN EXAMPLE
AND WE ARE CURRENTLY
READING FROM IT.


As you can see, the **read()** method of the filehandle just sucks in the whole file in a single string, newlines and all! This is quick and easy, for sure, but it's not necessarily the most orderly way to deal with the contents of a file.

### *readline()*, *readlines()*, and *strip()*
Using any text editor, copy the contents of the following snippet to a text file in your directory for this session, and save the file as gff_head.

```
##gff-version 3   
# feature-ontology so.obo   
# attribute-ontology gff3_attributes.obo   
##sequence-region NcraOR74A_Chr21 1 64840   
##sequence-region NcraOR74A_LGI 1 9798893   
```

Then try the following:

In [2]:
filename = 'gff_head'
fh = open(filename, 'r')
# the 'r' is for 'read-only', which will keep us from being able to alter
# this file with the filehandle we just created

print fh.readline()
print fh.readline()

lines = fh.readlines()

fh.close()

print lines

##gff-version 3   

# feature-ontology so.obo   

['# attribute-ontology gff3_attributes.obo   \n', '##sequence-region NcraOR74A_Chr21 1 64840   \n', '##sequence-region NcraOR74A_LGI 1 9798893']


While this is a bit of a mess, a few things should become apparent:
1. **readline()** takes in one line, newline included (and since **print** also supplies a newline, we've got an extra linebreak after each of the first two **print** statements).
2. **readlines()** (plural!) takes the rest of the file, from the current read position all the way to the end, giving back a list of lines (again, with newlines intact).
3. This file has a bunch of whitespace cluttering things up at the end of each line.

All of these complications are easily resolved with the use of the **strip()** method whenever we actually make use of the lines we read:

In [5]:
uglystring = ' \t what a mess!\t         \n   '
print uglystring
print uglystring.strip()

 	 what a mess!	         
   
what a mess!


In [4]:
filename = 'gff_head'
fh = open(filename, 'r')
 
print fh.readline().strip()
print fh.readline().strip()
 
lines = fh.readlines()
 
fh.close()
 
lines[0] = lines[0].strip()
 
print lines

##gff-version 3
# feature-ontology so.obo
['# attribute-ontology gff3_attributes.obo', '##sequence-region NcraOR74A_Chr21 1 64840   \n', '##sequence-region NcraOR74A_LGI 1 9798893']


Now the spaces and newlines are gone from the first two lines, and from the 0th element of the list I printed in the last print statement (since I only bothered to **strip()** and put back the 0th element).
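To clean up every element rather than just the 0th, one option is to loop over the list and collect the stripped lines. This is only a sketch: the list below is typed in by hand to mirror a few lines of gff_head so that the snippet does not depend on the file being present, and `clean_lines` is a name made up for this example.

```python
# Hand-typed stand-in for what readlines() gave us above.
lines = ['##gff-version 3   \n',
         '# feature-ontology so.obo   \n',
         '##sequence-region NcraOR74A_LGI 1 9798893']

# Build a new list holding the stripped version of each line.
clean_lines = []
for line in lines:
    clean_lines.append(line.strip())

print(clean_lines)
```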

One crucially important concept of file input in Python is that each time you read something by any of the three methods I've described, you advance your position in the file, which means that you never get the same character or characters twice (unless of course they're in the file twice!).

This is why reading from the filehandle with **readline()** twice in a row gave two different values; as soon as the line is read, the filehandle has moved to the next line, awaiting another read request. This behavior is closely tied to the fact that filehandles are **iterable**.

We first introduced **iterables** as objects that can be looped over with **for**, as they contain or produce other objects. Filehandles are of this second type: they know how to produce a string and advance themselves in anticipation of the next request. That means that to get back to the beginning of the file, you must either close the file with the **close()** method of the filehandle and reopen it, or use the **seek()** method of the filehandle (which we don't have time to go into -- Google is your friend!).
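Although **seek()** is not covered in any depth here, a minimal sketch of the most common use, jumping back to the start of a file with `seek(0)`, looks something like this. The file `seek_demo.txt` is a throwaway created in a temporary directory purely for illustration, and the parenthesized `print` is used so the snippet runs under both Python 2 and Python 3.

```python
import os
import tempfile

# Build a small throwaway file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'seek_demo.txt')
fh = open(path, 'w')
fh.write('first line\nsecond line\n')
fh.close()

fh = open(path, 'r')
first_read = fh.readline()    # advances the read position past line one
second_read = fh.readline()   # picks up where the previous read stopped

fh.seek(0)                    # move the "bookmark" back to the start
after_seek = fh.readline()    # line one again
fh.close()

print(first_read.strip())
print(second_read.strip())
print(after_seek.strip())

# Tidy up the temporary file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```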

Aside from this potentially odd behavior, the **iterable** quality of filehandles also means that they can be treated logically like a sequence of lines, as we will see below.

### Reading Files In a Loop
Certainly one of the most common contexts in which you'll encounter **for** loops is in working your way through a file. You can just put together two things we've already seen to get to where we need to be:

In [9]:
fh = open('gff_head')
lines = fh.readlines()
for line in lines:
    line = line.strip('#').strip()
    fields = line.split()
    print fields

['gff-version', '3']
['feature-ontology', 'so.obo']
['attribute-ontology', 'gff3_attributes.obo']
['sequence-region', 'NcraOR74A_Chr21', '1', '64840']
['sequence-region', 'NcraOR74A_LGI', '1', '9798893']


This is starting to get a little fancier, but we're only doing things you've seen before: read all the lines in a file into a list, then iterate over the list of lines. For each line, strip off the leading hash symbols, strip off leading and trailing whitespace, split the line into a list, then print the resulting list.

We can simplify this one more step using the fact that filehandles are *iterable*, and know what's being asked of them. So we can replace this:

```python
lines = fh.readlines()
for line in lines:
    ...
    ...
```
with:
```python
for line in fh:
    ...
    ...
```
to exactly the same end.
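If you would like to convince yourself that the two forms really do produce the same lines, the sketch below runs both against a small throwaway file and compares the results. The file name `loop_demo.txt` and the variable names are made up for this example.

```python
import os
import tempfile

# Create a small throwaway file to read from.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'loop_demo.txt')
fh = open(path, 'w')
fh.write('##one 1\n##two 2\n##three 3\n')
fh.close()

# Version 1: read everything into a list, then loop over the list.
fh = open(path)
from_list = []
for line in fh.readlines():
    from_list.append(line.strip('#').strip())
fh.close()

# Version 2: loop over the filehandle directly.
fh = open(path)
from_handle = []
for line in fh:
    from_handle.append(line.strip('#').strip())
fh.close()

print(from_list == from_handle)

# Tidy up the temporary file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```

One practical difference: looping over the filehandle hands you one line at a time, whereas **readlines()** builds the whole list in memory first, which matters for very large files.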

### Writing to Files
Writing output is sorta like doing the dishes. You just did all this work to cook up a fancy program and analyze some data, and the last thing you want to do is put all your answers away into clean little output files. Fortunately, we'll learn about pickle files later, but for now, we'd best make sure you know how to write output to a file.

By default, **open()** opens the file supplied in read mode. However, by giving an additional argument, you can either overwrite the specified file entirely, or add lines to the bottom of it:

In [10]:
filename = 'bands.txt'
# 'w' flag means "writeable"
fh = open(filename, 'w')


# note that we have to add the '\n' if we want it at the end of the line;
# this is in contrast to the print command's behavior.
fh.write('The Beatles were')
fh.write(' the best band.\n')

    
fh.close()

In [12]:
# we are reopening the file as "writeable", so we will OVERWRITE it
fh = open(filename, 'w')

fh.write('The Grateful Dead were')
fh.write(' the best band.\n')
 
fh.close()

In [13]:
# 'a' flag means "append"
fh = open(filename, 'a')

fh.write("The Beatles weren't even close.\n")
 
fh.close()

While this script doesn't print anything to the screen, if you run it and look at the contents of *bands.txt* you should see only the last two sentences. Remember how filehandles track your position in the file? Well, when you open a file with the 'w' argument, any existing contents are discarded and the position starts at the beginning of the now-empty file, so you will *overwrite* the file. If you open a file with the 'a' argument, the position starts at the end of the file, so whatever you write will be appended to the end.
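The difference between the two modes can be checked by reading the file back after each kind of write. This is a sketch on a throwaway file (`modes_demo.txt`, a name invented for this example) so that it does not disturb *bands.txt*.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'modes_demo.txt')

# First write: creates the file.
fh = open(path, 'w')
fh.write('first version\n')
fh.close()

# Second write in 'w' mode: the old contents are thrown away.
fh = open(path, 'w')
fh.write('second version\n')
fh.close()

# Write in 'a' mode: the new text lands after the existing contents.
fh = open(path, 'a')
fh.write('appended line\n')
fh.close()

# Read the result back to see what survived.
fh = open(path)
contents = fh.read()
fh.close()
print(contents)

# Tidy up the temporary file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```

Only the second version and the appended line are left; the first version is gone.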

When reading files, the **close()** method is a good thing to keep in mind, but if you forget it, Python will close the file at the end of the program's execution. With writing files, however, Python may not make the changes you stipulate right away (writes are typically held in a buffer first), so if you plan to evaluate the contents of the file you're writing in the same script (or, for instance, use that file for something else during the run of that script), it is wise to close the filehandle to ensure that all the write operations you've requested are performed.

However, while we're on the subject, it is almost never a good idea to write to a file and then read from it in the same script. When your data is in the form of Python objects, those objects are stored in memory, and accessing data stored in *memory is 6 to 100,000 times faster than accessing a hard disk*.

While Python has no writeline() method, the other two read methods are mirrored for writing to files. The first, **write()**, you've already seen. It takes a string and puts it in a file. The only difference between this and **writelines()** is that **writelines()** takes a list of strings and writes them all. (But beware! If you want those strings to appear on separate lines, they had best all end with a '\n'!)

In [10]:
#!/usr/bin/env python
 
filename = 'bands.txt'
fh = open(filename, 'a') # Appending to our previous file.
 
lines = ['To be fair, both the Beatles and \n',
         'the Grateful Dead made fantastic music.\n'
         ]
lines.extend(['Everyone has their own musical taste,\n',
              "no sense in fighting about it.\n"
              ])
 
fh.writelines(lines)
 
fh.close()

And check out the contents of *bands.txt* to see your many-line-writing machine in action!
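The earlier warning about newlines is easy to see for yourself. The sketch below writes the same list twice to a throwaway file (`writelines_demo.txt`, a name invented for this example), once as-is and once with a '\n' added to each string, and then reads the result back.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'writelines_demo.txt')

words = ['alpha', 'beta', 'gamma']

fh = open(path, 'w')
fh.writelines(words)     # no newlines supplied, so none are written
fh.write('\n')

# Add the newline to each string ourselves before writing.
words_with_newlines = []
for word in words:
    words_with_newlines.append(word + '\n')
fh.writelines(words_with_newlines)
fh.close()

fh = open(path)
contents = fh.read()
fh.close()
print(contents)

# Tidy up the temporary file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```

The first call runs the three words together on a single line; the second puts each word on its own line.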