<div align=right>
<img src="img/logosmall.png" width="100px" align=right>
</div>

# Reading and writing files

<div class="alert alert-warning">
Parts of this section were adapted from copyrighted material in *Jones, M: Python for Biologists: A complete programming course for beginners (2013)*.

**Please do not distribute it!**

</div>

## Why are we interested in working with files?

We're again doing things in a slightly different order compared to most programming texts.  Often, introductory programming books or courses will only consider working with files much later on, because working with files can be … fiddly.

So why are we introducing files this early on?

As in the previous section, the answer lies in what we're using Python for.  The data we work with as biologists is most often stored in files, so if we're going to write useful programs we need to learn how to get data out of files.  Similarly, we often need to write our results *back into* files if they're to be of any use to anyone but ourselves.

We're lucky in a couple of ways:

* Most of the types of data we work with in biology are easily stored in *text* files (Think of sequence data!), which are in turn very easy to process using Python.

* (Modern) Python itself makes working with files fairly easy compared to many other languages.

## Text files and binary files

When we talk about *text files*, we are not necessarily talking about something that is meant for humans to read. Rather, we are talking about a file that contains only lines of text characters. Examples of text files which you might have encountered include:

* FASTA files of DNA or protein sequences
* files containing output from command-line programs (e.g. BLAST)
* FASTQ files containing DNA sequencing reads
* HTML files
* source code files, such as Python scripts

In contrast, most files that you encounter day-to-day will be *binary files*:  ones which are not made up of characters and lines, but of a sequence of bytes. Examples include:

* image files (like `.jpg` or `.png`)
* audio and video files
* compressed files (e.g. `.zip` or `.gz` archives)
* compiled, executable programs

In this course we'll only work with text files, since those are by far the most useful in our field.  But rest assured that Python offers powerful facilities for processing binary files as well.

## Working with text files in Jupyter

Before we start looking at how to read or write text files in Python, we should take a quick look at how we can manipulate text files using the Jupyter user interface.

### Writing and reading files using magic commands

Recall from our brief overview of Jupyter that *cell magic* commands start with “`%%`”, and *line magic* commands start with “`%`”.

You can use the `%%file` cell magic to write the contents of a code cell to a file.  The following code cell will, when executed, write three lines of DNA sequence to a file called `dna.txt` in a subdirectory of the current directory, called `files`:

In [11]:
%%file files/dna.txt
ACTGTACGTGCACTGATC
CTGGCATAGTCTTATTTT
CAGGGCGGCGGATCTCTT

Overwriting files/dna.txt


To read the contents of a file, we can use the operating system shell — recall that adding a “`!`” before a line passes that line to the underlying shell (on Unix) or command interpreter (on Windows).

To list the contents of the file we've just written, one could use the `cat` command on Unix:

In [6]:
!cat files/dna.txt

ACTGTACGTGCACTGATC
CTGGCATAGTCTTATTTT
CAGGGCGGCGGATCTCTT

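… or the `type` command on Windows (I haven't tested this; the output below comes from a Unix shell, where `type` is an unrelated shell built-in, hence the error):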

In [4]:
!type files/dna.txt

/bin/sh: line 0: type: files/dna.txt: not found


To load the contents of the file *into the code cell*, one can use the line magic command `%load`:

In [None]:
# %load files/dna.txt
ACTGTACGTGCACTGATC
CTGGCATAGTCTTATTTT
CAGGGCGGCGGATCTCTT

(This should work on all platforms.)

### Using Jupyter's built-in text editor

Jupyter comes with a fully-featured built-in text editor.  To edit the file we've just written using the editor:

* go back to the Jupyter *Dashboard* (that should be the tab labeled "Home")
* locate the `files` directory and click on it to open it (like in any file manager)
* locate the file `dna.txt` and click on it

`dna.txt` should now open in a separate text editor tab.  Have a browse through the editor's menu to familiarise yourself with its facilities.

>Text editor nerds:  You can configure the Jupyter editor to use Emacs or Vim keybindings.

## Working with files in Python

Enough about the Jupyter UI;  let's look at how we'd use text files in our Python code.

First, let's use the magic command `%cd` to change the current directory of our Jupyter session to the `files` subdirectory, where we wrote the file `dna.txt` in the previous section:

In [15]:
%cd files

/Users/sabineurban/University/PhD/2017/03_EVOP/Lectures/5&6 - Python (Johann)/Python Tutorial/files


### Using `open()` to read a file

In Python, as in the physical world, we have to open a file before we can read what’s inside it. The Python function that carries out the job of opening a file is called `open()`.

We'll give `open()` two arguments:
* A string which contains the name of the file.
* A short string that acts as a 'flag', indicating the *mode* in which the file should be opened.  For reading a text file, the flag is “`'r'`”.

`open()` returns a whole new type of data — a *file object*:

In [16]:
my_file = open("dna.txt", 'r')

A *file object* is a little more abstract than the string and number types that we saw before. A file object represents something a bit less tangible than a string or a number:  a file on your computer's drive.

Note that we can assign a variable name to the file object, just like with other types of data.

Let's tell Jupyter to evaluate the file object, using the variable name to refer to it:

In [5]:
my_file

<_io.TextIOWrapper name='dna.txt' mode='r' encoding='UTF-8'>

The result of the evaluation is a bit different from what we've seen when we evaluated strings and numbers, since there's no obvious way to map the concept of a file to something that's human-readable.  Instead, Python gives us a textual representation of the file object in angle brackets (“`<>`”).  It's not particularly useful to us, but at least we can learn some things when reading through it, such as the name of the file it refers to (`dna.txt`) and the mode it was opened in (`r`).
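The pieces of information shown in that representation can also be asked for one at a time.  The following is a minimal sketch rather than part of the original course material:  it assumes a throwaway file (the name `example_dna.txt` and the temporary folder are hypothetical), and its first few lines only create that file, using techniques covered later in this section.

```python
import os
import tempfile

# Set-up only: create a small throwaway file to open.
# (Hypothetical file name; writing files is covered later in this section.)
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example_dna.txt")
setup_file = open(path, 'w')
setup_file.write("ACTG\n")
setup_file.close()

# Open the file for reading and inspect the file object's attributes.
example_file = open(path, 'r')
print(example_file.name)   # the file name (here a full path) given to open()
print(example_file.mode)   # r

example_file.close()
```

The `name` and `mode` attributes hold the same values that appear inside the angle brackets when the file object is evaluated.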

The way that we use file objects is a bit different to strings and numbers as well. If you glance back at the examples from the previous chapter you’ll see that most of the time when we want to use a variable containing a string or number we just use the variable name:

In [6]:
my_string = 'abcdefg'
print(my_string)
my_number = 42
print(my_number + 1)

abcdefg
43


In contrast, when we’re working with file objects most of our interaction will be through their *methods*.  Trying to print the file object using `print()` just prints the textual representation of it which we got when we evaluated it within Jupyter:

In [7]:
print(my_file)

<_io.TextIOWrapper name='dna.txt' mode='r' encoding='UTF-8'>


`print()`-ing a file does **not** print out the contents of that file.  When a file has been opened with `open()`, *its contents have not yet been read from disk*.  The file is sitting there, open, ready to be read from, but Python has to be told explicitly to read data from the file.

To read the contents of a file, we use the file object's `read()` method. Called without any arguments, it returns a string containing all the contents of the text file. Once we’ve read the file contents into a variable, we can treat them just like any other string; for example, we can print them:

In [8]:
file_contents = my_file.read()
print(file_contents)

ACTGTACGTGCACTGATC
CTGGCATAGTCTTATTTT
CAGGGCGGCGGATCTCTT


>When we call the `read()` method of a file object, we read that entire file into memory.  `dna.txt` happens to be a tiny file so that's OK.  But if we try to `read()` a 2GB file we'll probably bring our computer to a standstill.  Soon we'll see how to handle the really big files we often find in our field.

## Closing files

When you open a file using the `open()` function, a whole lot of things happen behind the scenes — things that involve both the Python interpreter and even the operating system, which serves as gatekeeper to the filesystem.

If you open a file and just leave it hanging like that after using it, all that infrastructure remains in place.  Not only is this a waste of resources, but it can lead to some subtle bugs in your code.

Hence, you should *always* close any files you've opened after you're done with them.  Like so:

In [27]:
my_file.close()

Note that `open()` is a built-in function in the core Python language, but `close()` is a method of the file object.  (This asymmetry may seem counter-intuitive.)

Closing a file doesn't destroy the file object.  If there's a variable that references the file object, it still hangs around…  but all the underlying file-handling machinery has been shut down.  Let's try to `read()` again from our now-closed file object:

In [21]:
file_contents = my_file.read()
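
Reading from a closed file object produces an error rather than the file contents.  Here is a minimal sketch of what to expect.  It assumes a throwaway file (hypothetical name), and it borrows the `try`/`except` construct from the later material on error handling, purely so that the example runs to completion.

```python
import os
import tempfile

# Set-up only: create a small throwaway file (hypothetical name).
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example_dna.txt")
setup_file = open(path, 'w')
setup_file.write("ACTG\n")
setup_file.close()

# Open the file, then close it again straight away.
example_file = open(path, 'r')
example_file.close()

# The file object still exists and knows that it has been closed ...
print(example_file.closed)   # True

# ... but reading from it is no longer possible.
try:
    example_file.read()
except ValueError as error:
    message = str(error)
    print("ValueError:", message)   # ValueError: I/O operation on closed file.
```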

Closing files you've opened is a matter of good programming hygiene, and we'll always do it in this course.  Think of it as the programming equivalent of brushing your teeth!

## Files, contents and file names

Beginning programmers sometimes get confused between a *file object*, a *file name*, and the *contents* of a file. Take a look at the following bit of code:

In [23]:
my_file_name = "dna.txt"

my_file = open(my_file_name, 'r')

In [24]:
my_file_contents = my_file.read()

In [25]:
my_file.close()

In [26]:
print(my_file_contents)

ACTGTACGTGCACTGATC
CTGGCATAGTCTTATTTT
CAGGGCGGCGGATCTCTT


What’s going on here?

* On line 1, we store the string `dna.txt` in the variable `my_file_name`. 


* On line 2, we use the variable `my_file_name` as the argument to the `open()` function, and store the *file object* returned by `open()` in the variable `my_file`.

  For historical reasons, a variable that references a file object is sometimes called a *filehandle*.


* On line 3, we call the `read()` method on the variable `my_file`, and store the returned string in the variable `my_file_contents`.


* On line 4, we brush our teeth.

The important thing to understand about this code is that there are three separate variables which have different types and which are storing three very different things. `my_file_name` is a string, and it stores the name of a file on disk. `my_file` is a file object, and it represents the file itself. `my_file_contents` is a string, and it stores the text that is in the file.
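
One way to convince ourselves of this is to ask Python for the type of each variable.  The sketch below is an illustration under assumptions:  it uses a throwaway file with a hypothetical name, and the built-in `type()` function, which may not have been introduced in the course yet.

```python
import os
import tempfile

# Set-up only: create a small throwaway file (hypothetical name).
folder = tempfile.mkdtemp()
my_file_name = os.path.join(folder, "example_dna.txt")
setup_file = open(my_file_name, 'w')
setup_file.write("ACTG\n")
setup_file.close()

# The same three steps as above: open, read, close.
my_file = open(my_file_name, 'r')
my_file_contents = my_file.read()
my_file.close()

# Three variables, but only two of them are strings.
print(type(my_file_name))       # <class 'str'>
print(type(my_file))            # <class '_io.TextIOWrapper'>
print(type(my_file_contents))   # <class 'str'>
```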

Remember that variable names are arbitrary – the computer doesn’t care what you call your variables. But it's good to use descriptive variable names as we did here.

It's purely for instructional purposes that we assigned three variables to three objects of different types above.  We could as easily have said:

In [10]:
my_file = open("dna.txt", 'r')
my_file_contents = my_file.read()
print(my_file_contents)
my_file.close()

ACTGTACGTGCACTGATC
CTGGCATAGTCTTATTTT
CAGGGCGGCGGATCTCTT

Or even:

In [24]:
my_file_contents = open("dna.txt", 'r').read()
print(my_file_contents)

ACTGTACGTGCACTGATC
CTGGCATAGTCTTATTTT
CAGGGCGGCGGATCTCTT


Or *even:*

In [25]:
print(open("dna.txt", 'r').read())

ACTGTACGTGCACTGATC
CTGGCATAGTCTTATTTT
CAGGGCGGCGGATCTCTT


While that last line is certainly succinct, it's not necessarily very easy to read.  Adding some "unnecessary" intermediate variables can often make code more readable.  On the other hand, using more succinct expressions can *also* make your code more readable, especially to an experienced programmer.  The choice is yours!

One thing, though… we never closed any file objects in those last two code boxes.  Did we forget to brush our teeth?

No, it turns out we never assigned any variable to the file object in either of those cases.  In both cases, Python created a file object when we used the `open()` function, then immediately used its `read()` method to read the file contents.

Since no variable was assigned to the file object itself, it then immediately became *unreferenced*, and would soon be reclaimed by Python's *garbage collection*.  This is the cool bit:  Python's file object is smart enough to close the underlying file automatically when it gets destroyed, so no open files were left "hanging" in these two examples.  (Exactly when that clean-up happens depends on the Python implementation, so closing files explicitly remains the more dependable habit.)

## Dealing with newlines

Let’s take a look at the output we get when we try to print some information from a file. We will again use the `dna.txt` file we wrote earlier. Recall that this file contains three lines with a short DNA sequence in each.  Let's have a look at it again:

In [31]:
# %load dna.txt
ACTGTACGTGCACTGATC
CTGGCATAGTCTTATTTT
CAGGGCGGCGGATCTCTT


We're going to write a simple program to read all the DNA data from the file and print it out along with its length. Putting together the file functions and methods from this section, and the material we saw in the previous section, we get the following code:

In [30]:
# open the file
my_file = open("dna.txt", 'r')

# read the contents
my_dna = my_file.read()

# calculate the length
dna_length = len(my_dna)

# print the output
print("sequence is", my_dna, "and its length is", dna_length)

# close the file
my_file.close()

sequence is ACTGTACGTGCACTGATC
CTGGCATAGTCTTATTTT
CAGGGCGGCGGATCTCTT and its length is 56


If we look at the output we'll see that the program probably didn't do what we intended.  The DNA data wasn't printed as one contiguous sequence, but as three separate lines, the way it was laid out in the file.  The reported length is off too:  56, although the file holds only 3 × 18 = 54 bases.

The lines in a text file are separated by newline characters, and `read()` returns those newline characters as part of the string, along with everything else. The string `my_dna` therefore still contains them, as we can clearly see when we evaluate it:

In [32]:
my_dna

'ACTGTACGTGCACTGATC\nCTGGCATAGTCTTATTTT\nCAGGGCGGCGGATCTCTT'
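
A quick check shows where the length of 56 comes from.  This is a small sketch that works on a copy of the string shown above, so it needs no file; it uses the string methods `count()` and `replace()`.

```python
# The string exactly as it was read from dna.txt, newlines included.
my_dna = 'ACTGTACGTGCACTGATC\nCTGGCATAGTCTTATTTT\nCAGGGCGGCGGATCTCTT'

# Count the newline characters, then remove them.
newline_count = my_dna.count('\n')
bases_only = my_dna.replace('\n', '')

print(len(my_dna))       # 56
print(newline_count)     # 2
print(len(bases_only))   # 54
```

The two newline characters account for the difference between 56 and the 54 bases actually present.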

>Many text editors would also have written a newline character at the end of the last line in the file, but as one can see here, writing a file with Jupyter's `%%file` magic command doesn't.

Clearly we need a way to process the file on a line-by-line basis.  File objects have a method `readline()` that reads in only one line from an open file.  Let's try it:

In [34]:
# open the file
my_file = open("dna.txt", 'r')

# read a line of data from the open file
my_line1 = my_file.readline()

Let's evaluate the line we've just read in:

In [35]:
my_line1

'ACTGTACGTGCACTGATC\n'

The string still contains that pesky newline; it would be good to be able to strip that off.  We'll get to that in a second, but first let's see what happens if we call `readline()` a second time, assign a variable name to the result, and then evaluate that variable:

In [36]:
my_line2 = my_file.readline()
my_line2

'CTGGCATAGTCTTATTTT\n'

Note that the open file object "remembered" where we were when we last read from it.  When we called `readline()` a second time we got the second line of the file, not the first line again.

Let's try it a third time:

In [37]:
my_line3 = my_file.readline()
my_line3

'CAGGGCGGCGGATCTCTT'

And a fourth time?

In [38]:
my_line4 = my_file.readline()
my_line4

''

Once we've "fallen off" the end of the file, `readline()` just keeps returning the empty string.
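
An empty string from `readline()` should not be confused with an empty *line* in the file.  A minimal sketch, assuming a throwaway file (hypothetical name) whose second line is blank:

```python
import os
import tempfile

# Set-up only: a file with a blank second line and no final newline.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "blank_line.txt")
setup_file = open(path, 'w')
setup_file.write("ACTG\n\nTTGA")
setup_file.close()

example_file = open(path, 'r')
first = example_file.readline()    # 'ACTG\n'
second = example_file.readline()   # '\n'   (a blank line is still a newline)
third = example_file.readline()    # 'TTGA' (last line, no trailing newline)
fourth = example_file.readline()   # ''     (end of file)
example_file.close()

print([first, second, third, fourth])
```

A blank line comes back as `'\n'`, so only a truly empty string signals that the end of the file has been reached.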

In [39]:
my_file.close()

Right, back to the issue of getting rid of those pesky newlines.  One way of doing so would be to use string slicing to lop off the last character:

In [40]:
my_line1[:-1]

'ACTGTACGTGCACTGATC'

This works, but it has some pitfalls.  Using a slice like this will *always* lop off the last character of a string, no matter what it is.  As we've seen, the third line of data from our file (now stored in a string referenced by the variable `my_line3`) does *not* contain a trailing newline character, so slicing it would cut off a real base instead.
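
To see the pitfall in action, here is a short sketch that applies the same slice to copies of the first and third lines as they were read from the file:

```python
# Copies of the strings returned by readline() earlier.
my_line1 = 'ACTGTACGTGCACTGATC\n'
my_line3 = 'CAGGGCGGCGGATCTCTT'

sliced1 = my_line1[:-1]
sliced3 = my_line3[:-1]

print(sliced1)   # ACTGTACGTGCACTGATC  (newline removed, as intended)
print(sliced3)   # CAGGGCGGCGGATCTCT   (the final T has been lost)
```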

What we really need is a way to lop off the last character if (and *only* if) it's a newline.  Because this is such a common problem, strings have a method for doing just that. It's called `rstrip()` (right-strip), and if you call it without any arguments it returns a copy of the string with *all* whitespace stripped off its right-hand side.

("Whitespace" is what we call spaces, newline characters, tabs, and all such non-printing characters.)

In [41]:
my_file = open("dna.txt", 'r')

my_line1 = my_file.readline()
my_dna1 = my_line1.rstrip()
print("sequence is", my_dna1, "and its length is", len(my_dna1), "bases")

my_line2 = my_file.readline()
my_dna2 = my_line2.rstrip()
print("sequence is", my_dna2, "and its length is", len(my_dna2), "bases")

my_line3 = my_file.readline()
my_dna3 = my_line3.rstrip()
print("sequence is", my_dna3, "and its length is", len(my_dna3), "bases")

my_file.close()

sequence is ACTGTACGTGCACTGATC and its length is 18 bases
sequence is CTGGCATAGTCTTATTTT and its length is 18 bases
sequence is CAGGGCGGCGGATCTCTT and its length is 18 bases


…and now the output looks reasonable.

>You might have noticed that we repeat the same piece of code for every line we read from the file.  This is OK if our file has three lines, but what if it has a hundred?  Or a million.  Wouldn't it be nice if we could tell Python:

>*for each line in the file, do “something”*?

>Yes.  Yes it would.  But we're getting ahead of ourselves now.

Another thing to note is that in the code above, we first read the file contents and then removed the newline, in two separate steps:

```python
my_line1 = my_file.readline()
my_dna1 = my_line1.rstrip()
```

It would be more succinct to read the contents and remove the newline all in one go, like this:

```python
my_dna1 = my_file.readline().rstrip()
```

Here we use two different methods in the same statement. Read it from left to right:  We take the `my_file` variable and use the `readline()` method on it, then we take the string value returned by that method and use the `rstrip()` method on it.  We then assign the variable `my_dna1` to the string value returned by `rstrip()`.

## Aside: String methods for stripping strings

While we're discussing `rstrip()`, we may as well mention its cousins `lstrip()` and `strip()`.

`lstrip()`, as you may have guessed, strips characters off the left-hand side of a string.  `strip()` strips both ends (left and right) at the same time.

`lstrip()`, `rstrip()` and `strip()` can all take a parameter.  In the simplest case, this is the character that they will then proceed to strip off the string (instead of whitespace).  For example:

In [42]:
fasta_header = ">gi1032|some sequence data"
header = fasta_header.lstrip('>')
print(header)

gi1032|some sequence data


In [43]:
quoted_dna = '"CGGAGC"'
print("Original:\t", quoted_dna)
print("Stripped:\t", quoted_dna.strip('"'))

Original:	 "CGGAGC"
Stripped:	 CGGAGC
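
Beyond the simplest case, two details are worth knowing:  the stripping only happens at the ends of the string, and an argument of several characters is treated as a *set* of characters to strip, in any order.  A small sketch with made-up strings:

```python
# Only the ends are stripped; the dash in the middle stays put.
gapped = "--ACTG-AC--"
trimmed = gapped.strip('-')
print(trimmed)   # ACTG-AC

# Both '>' and '|' are stripped from the left, whatever their order.
messy_header = ">>|gi1032"
clean_header = messy_header.lstrip('>|')
print(clean_header)   # gi1032
```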


## Missing files

What happens if we try to read a file that doesn’t exist?

In [44]:
my_file = open("nonexistent.txt", 'r')

FileNotFoundError: [Errno 2] No such file or directory: 'nonexistent.txt'

We get a new type of error that we’ve not seen before, with the obvious name of `FileNotFoundError`.

Ideally we'd like to deal with missing file errors when they occur — we'll learn how to do that later on in the course when we discuss error handling.

## Writing text to files

All the example programs that we’ve seen so far in this course have produced output straight to the screen. That’s great for exploring new features or data interactively, and when working on programs, because it allows you to see the effect of changes to the code right away.

If you want to store data over a longer period, use it with a different application, or share it with someone else, you probably want to write it to a file.

### Opening files for writing

We've seen how to open a file using the `open()` function and *read* its contents. We can also use the same `open()` function to open a file and *write* some data to it. However, we need to use a different *flag* to open the file in write mode.

Here are some of the values that the mode flag can take on:

* `'r'` for reading
* `'w'` for writing
* `'a'` for appending

(If we leave out the second (flag) argument to `open()`, Python defaults to `'r'` for reading.)

The difference between `'w'` and `'a'` is subtle but important:

* If we open a file that already exists using the mode `'w'` we will **overwrite the current contents** with whatever data we write to it.


* If we open an existing file with the mode `'a'`, it will add (append) new data onto the end of the file and will **not** overwrite any existing content.


*  If a file with the specified filename does not already exist, then `'w'` and `'a'` behave identically:  They will both create a new file and write to it.
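
The three rules above can be tried out with a minimal sketch.  It assumes a throwaway file (the name `modes.txt` is hypothetical) and uses the `write()` method, which is introduced a little further on.

```python
import os
import tempfile

# A path to a file that does not exist yet (hypothetical name).
folder = tempfile.mkdtemp()
path = os.path.join(folder, "modes.txt")

# 'w' on a file that does not exist yet: the file is created.
example_file = open(path, 'w')
example_file.write("first\n")
example_file.close()

# 'a' on an existing file: new data is added at the end.
example_file = open(path, 'a')
example_file.write("second\n")
example_file.close()

example_file = open(path, 'r')
after_append = example_file.read()
example_file.close()
print(repr(after_append))      # 'first\nsecond\n'

# 'w' on an existing file: the previous contents are lost.
example_file = open(path, 'w')
example_file.write("third\n")
example_file.close()

example_file = open(path, 'r')
after_overwrite = example_file.read()
example_file.close()
print(repr(after_overwrite))   # 'third\n'
```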

### Aside:  Getting help

A lot of Python functions and methods have optional arguments beyond those we'll cover in this course. If you want to see all the optional arguments for a particular method or function, look at its help.

There are multiple places you can find help:

#### Getting help from the official documentation

Looking something up in the official documentation at <http://docs.python.org> takes some practice, because there is so much of it!  The documentation for the `open()` function, for instance, is here:

<https://docs.python.org/3/library/functions.html#open>

#### Asking Jupyter for help

Jupyter provides you with two ways of asking for help.  You can use a magic command starting with “`?`” to get help on any Python command or type.  The help pops up in a separate window, which you can close by clicking the ‘`x`’ icon.  You can also pop out the help into a separate tab by using the arrow icon next to the ‘`x`’ icon.

In [45]:
?open

Perhaps more usefully, you can get help in real time while writing code by pressing `Shift-Tab`.  Put the cursor on the word `open` in the code cell below, and press `Shift-Tab`:

In [None]:
open

Pressing `Shift-Tab` multiple times toggles between different ways of displaying the help, some with more detail than others.  Try it.

#### Asking Python for help

You can also ask the Python interpreter itself for help by using the built-in `help()` function, for instance:

In [2]:
#help(open)

Enough about help;  let's get back to writing files…

### Writing to an opened file

Once we’ve opened a file for writing, we can use the file object's `write()` method to write some text to it.  `write()` works a bit like `print()` (it takes a single string argument), but instead of printing the string to the screen it writes it to the file.

Here's how we use `open()` to open a file for writing, and write a single line of text to it:

In [47]:
my_file = open("out.txt", 'w')
my_file.write("Hello world")
my_file.close()

Because the output is being written to the file in this example, you won’t see any output on the screen if you run it. To check that the code has worked, you’ll have to run it, then open up the file `out.txt` in your text editor and check that its contents are what you expect.

You can use Jupyter's built-in text editor to check whether `out.txt` now exists, and whether it contains the data you'd expect.  (Click `out.txt` on Jupyter's *Home* tab.)  Or you can inspect it here using the `%load` magic:

In [None]:
# %load out.txt
Hello world

With the `write()` method, just like with the `print()` function, we can use any string as the argument. This also means that we can use any method or function that returns a string. The following are all perfectly OK:

In [49]:
my_file = open("out.txt", 'w')

# write "abc def"
my_file.write("abc" + " " + "def")

# write "8"
my_file.write(str(len('AGTGCTAG')))

# write "TTGC"
my_file.write("ATGC".replace('A', 'T'))

# write "atgc"
my_file.write("ATGC".lower())

# write contents of my_variable
my_variable = ">gi1234|some sequence data"

my_file.write(my_variable)

26

Huh?  What was that `26`?  It turns out the `write` method does return something.  What can it be?  Let's look it up:

In [8]:
?my_file.write

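Object `my_file.write` not found.

(That message means `my_file` was not defined in the session that produced this output.  With the file open, the help explains that `write()` returns the number of characters written; `26` is the length of the string in `my_variable`.)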


OK, we still haven't closed `my_file`, so let's write something else to it:

In [50]:
my_file.write(len("ACTGCTAG"))

TypeError: write() argument must be str, not int

What happened?!  Well, the Python error is clear: `write() argument must be str`.  Unlike the built-in `print()` function, the `write()` method of a file object does not do any implicit conversion to a textual representation.  Hence, we *have* to use `str()` to convert any numbers to strings when using `write()`.
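
Both points can be checked with a minimal sketch:  the number that `write()` returns, and the need for `str()`.  It assumes a throwaway file (hypothetical name) so that `out.txt` is left alone.

```python
import os
import tempfile

# A throwaway output file (hypothetical name).
folder = tempfile.mkdtemp()
path = os.path.join(folder, "write_demo.txt")
example_file = open(path, 'w')

# write() returns the number of characters it has written.
header = ">gi1234|some sequence data"
chars_written = example_file.write(header)
print(chars_written)                  # 26
print(chars_written == len(header))   # True

# Numbers have to be converted to strings before they can be written.
digits_written = example_file.write(str(len("ACTGCTAG")))
print(digits_written)                 # 1  (the single character "8")

example_file.close()
```

In a Jupyter cell, only the value of the last expression is displayed, which is why just one number appeared after the earlier cell with several `write()` calls.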

Time to close `my_file`:

In [51]:
my_file.close()

OK, let's look at the contents of `out.txt`, the file that `my_file` referenced:

In [None]:
# %load out.txt
abc def8TTGCatgc>gi1234|some sequence data

Everything was written on one line!  That *may* not quite have been what we had in mind.

It turns out `write` really doesn't know *any* of `print`'s clever tricks.  It just writes the string you give it as an argument to the file, *exactly as it is*.  It doesn't convert non-strings to a textual representation, and it doesn't sneakily add a newline character.  If we want newlines, we have to add them explicitly:

In [44]:
# open the file again in mode 'w', overwriting any previous contents
my_file = open("out.txt", 'w')

# write "abc def"
my_file.write("abc" + " " +  "def" + '\n')

# write "8"
my_file.write(str(len('AGTGCTAG')) + '\n')

# write "TTGC"
my_file.write("ATGC".replace('A', 'T') + '\n')

# write "atgc"
my_file.write("ATGC".lower() + '\n')

# write contents of my_variable
my_variable = ">gi1234|some sequence data"
my_file.write(my_variable + '\n')

# close the file
my_file.close()

Let's inspect the file's contents again:

In [None]:
# %load out.txt
abc def
8
TTGC
atgc
>gi1234|some sequence data


That's more like what we had in mind!  But adding all those “`+ '\n'`” bits was tedious.  Wouldn't it be great if there were a function like `print()` that could write to an open file and do all the smart things `print()` does?

Well, it turns out there is, and it's called… `print()`.  Let's check out the help for `print`: 

In [55]:
?print

You'll see the summary of `print()`'s syntax looks like this:

`print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)`

`print()` has a bunch of optional *keyword arguments*.  It's time to use one of those: `file`.  Calling a function with a keyword argument looks a little bit different, as you'll see below.  But now we can use all of `print()`'s cleverness when writing to a file.

In [56]:
# open the file again in mode 'w', overwriting any previous contents
my_file = open("out.txt", 'w')

# print "abc def" to the file
print("abc", "def", file=my_file)

# print "8" to the file
print(len('AGTGCTAG'), file=my_file)

# print "TTGC" to the file
print("ATGC".replace('A', 'T'), file=my_file)

# print "atgc" to the file
print("ATGC".lower(), file=my_file)

# print contents of my_variable to the file
my_variable = ">gi1234|some sequence data"
print(my_variable, file=my_file)

# close the file
my_file.close()

Let's check the contents of the file one final time:

In [None]:
%load out.txt
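
The other keyword arguments in `print()`'s summary, `sep` and `end`, can be combined with `file` in the same way.  The following is a minimal sketch that assumes a throwaway file (hypothetical name) rather than `out.txt`:

```python
import os
import tempfile

# A throwaway output file (hypothetical name).
folder = tempfile.mkdtemp()
path = os.path.join(folder, "print_demo.txt")
example_file = open(path, 'w')

# sep replaces the space that print() normally puts between values.
print("ABC123", "ATCG", sep='\t', file=example_file)

# end replaces the newline that print() normally adds at the end.
print("no newline here", end='', file=example_file)

example_file.close()

# Read the file back to see exactly what was written.
example_file = open(path, 'r')
contents = example_file.read()
example_file.close()
print(repr(contents))   # 'ABC123\tATCG\nno newline here'
```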

>Note that `write()` is a method of the file object, but `print()` is (as before) just a standard built-in function.

## Aside: remember `help()`

Just to reiterate:  You can get extensive help on Python's built-in functions and objects using the built-in function `help()`… even for things that you can't easily get help for using Jupyter's `Shift-Tab` shortcut.

To see the documentation for the string type's `rstrip()` method, call the `help()` function with the `rstrip()` method of an arbitrary string (e.g. an empty string):

In [None]:
help("".rstrip)

To see help for the entire string type, including a list of all its methods:

In [None]:
help(str)

## Paths and folders

So far, we have only dealt with opening files in the current working directory of our Jupyter kernel.  (We explicitly used `%cd` to change to the directory where our file was located.) What if we want to open a file from a different part of the file system?

The `open()` function is quite happy to deal with files from anywhere on your computer, as long as you give the full path. Just give a *file path* as the argument rather than a *file name*. The format of the file path looks different depending on your operating system. If you’re on Linux or a Mac, it could look something like this:

```python
my_file = open("/home/martin/myfolder/myfile.txt")
my_file = open("/Users/martin/Desktop/myfolder/myfile.txt")
```

If you’re on Windows, it might look like this:

```python
my_file = open(r"c:\windows\Desktop\myfolder\myfile.txt")
```

Note that there's a single letter `r` before the opening quote of the string that contains the path and filename, for the Windows example.  This is because a Windows path is likely to contain backslashes, and Python regards a backslash in a string as part of a special character.  (Remember `\n` and `\t`?)  That 'r' tells Python to treat this as a *raw string*, i.e. it does not interpret any backslash as part of a special character.
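
The effect of that `r` is easy to check without opening any file.  The sketch below uses a made-up Windows path, chosen because it contains the troublesome `\t` and `\n` combinations:

```python
# Three ways of writing the same (hypothetical) Windows path.
raw = r"c:\temp\new.txt"         # raw string: backslashes are kept as they are
escaped = "c:\\temp\\new.txt"    # each backslash is written twice
mangled = "c:\temp\new.txt"      # \t becomes a tab and \n becomes a newline

print(raw == escaped)            # True
print(raw == mangled)            # False
print(len(raw), len(mangled))    # 15 13
```

Doubling every backslash works just as well as a raw string, but it is easier to get wrong when typing long paths.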

---

## Exercises

### 1. Splitting genomic DNA

In the `files` subdirectory there's a file called `genomic_dna.txt` – it contains the same piece of genomic DNA that we were using in the final exercise from the previous section.

>If you've been working through this Notebook step by step, you should already have changed to the `files` subdirectory using `%cd` earlier.

As before, it comprises two exons and an intron. The first exon runs from the start of the sequence to the sixty-third character, and the second exon runs from the ninety-first character to the end of the sequence. Write a program that will print just the coding regions of the DNA sequence.

Write a program that will split the genomic DNA into coding and non-coding parts, and write these sequences to two separate files.




In [83]:
# Exercise 1

# read the genomic DNA from the file, stripping the trailing newline (if any)
dna_file = open("genomic_dna.txt", 'r')
genomic_dna = dna_file.read().rstrip()
dna_file.close()

# coding bases:  exon 1 is characters 1 to 63, exon 2 is character 91 to the end
exon1 = genomic_dna[0:63]
exon2 = genomic_dna[90:]

exons = open("genomic_dna_coding.txt", 'w')
exons.write(exon1 + "\n")
exons.write(exon2)
exons.close()

# noncoding bases:  the intron is characters 64 to 90
intron = genomic_dna[63:90]

introns = open("genomic_dna_noncoding.txt", 'w')
introns.write(intron)
introns.close()

In [None]:
# %load genomic_dna_coding.txt
ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGA
ATCATCGATCGATATCGATGCATCGACTACTAT

In [None]:
# %load genomic_dna_noncoding.txt
TCGATCGATCGATCGATCGATCATGCT

### 2. Writing a FASTA file

FASTA file format is a commonly-used DNA and protein sequence file format. A single sequence in FASTA format looks like this:

```
>sequence_name
ATCGACTGATCGATCGTACGAT
```

Where sequence_name is a header that describes the sequence (the greater-than symbol indicates the start of the header line). Often, the header contains an accession number that relates to the record for the sequence in a public sequence database. A single FASTA file can contain multiple sequences, like this:

```
>sequence_one
ATCGATCGATCGATCGAT
>sequence_two
ACTAGCTAGCTAGCATCG
>sequence_three
ACTGCATCGATCGTACCT
```

Write a program that will create a FASTA file for the following three sequences – make sure that all sequences are in upper case and only contain the bases A, T, G and C.

| SEQUENCE HEADER | DNA SEQUENCE                       |
|-----------------|------------------------------------|
| `ABC123`        | `ATCGTACGATCGATCGATCGCTAGACGTATCG` |
| `DEF456`        | `actgatcgacgatcgatcgatcacgact`     |
| `HIJ789`        | `ACTGAC-ACTGT--ACTGTA----CATGTG`   |

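In [33]:
# Exercise 2

dna_seq = open("dna_sequences.txt", 'w')

# each record is a header line starting with '>', followed by the sequence line
dna_seq.write(">ABC123\n" + "ATCGTACGATCGATCGATCGCTAGACGTATCG" + "\n")

# convert the second sequence to upper case
dna_seq.write(">DEF456\n" + "actgatcgacgatcgatcgatcacgact".upper() + "\n")

# remove the gap characters from the third sequence
dna_seq.write(">HIJ789\n" + "ACTGAC-ACTGT--ACTGTA----CATGTG".replace('-', '') + "\n")

dna_seq.close()

In [None]:
# %load dna_sequences.txt
>ABC123
ATCGTACGATCGATCGATCGCTAGACGTATCG
>DEF456
ACTGATCGACGATCGATCGATCACGACT
>HIJ789
ACTGACACTGTACTGTACATGTG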

### 3. Writing multiple FASTA files

Use the data from the previous exercise, but instead of creating a single FASTA file, create three new FASTA files – one per sequence. The names of the FASTA files should be the same as the sequence header names, with the extension `.fasta`.

In [38]:
# Exercise 3
# each file holds one record:  a header line starting with '>', then the sequence
fasta_ABC123 = open("ABC123.fasta", 'w')
fasta_ABC123.write(">ABC123\n" + "ATCGTACGATCGATCGATCGCTAGACGTATCG" + "\n")
fasta_ABC123.close()
#----------------
fasta_DEF456 = open("DEF456.fasta", 'w')
fasta_DEF456.write(">DEF456\n" + "actgatcgacgatcgatcgatcacgact".upper() + "\n")
fasta_DEF456.close()
#----------------
fasta_HIJ789 = open("HIJ789.fasta", 'w')
fasta_HIJ789.write(">HIJ789\n" + "ACTGAC-ACTGT--ACTGTA----CATGTG".replace('-', '') + "\n")
fasta_HIJ789.close()
