# Working with Files

You're going to be handling a lot of data files during the AI/ML workshop, most of which will be data tables like the one we handled in the Bash session.  Most of that work will be done with a library called [pandas](https://pandas.pydata.org), whose data frames will feel familiar if you have used R.  If you want to check out pandas before the workshop, [this Software Carpentry lesson on data frames](http://swcarpentry.github.io/python-novice-gapminder/08-data-frames/index.html) is a great resource.

Here, I'm actually going to show you how to read in and "parse" a different kind of file referred to as a `fasta` file.  If you ever do any sequencing, you will end up with `fasta` files.  They are plain text files with a very basic format:

```
>header comment
SEQUENCE
```

They can get a little more complex, but that's essentially what they look like.

Our goal in this lesson will be to read in the small file `fruit.fasta` and run some statistics on the sequences.  `fruit.fasta` should be in a folder on your Desktop.

## Reading a file

There are a couple of ways to read in a file: one reads the whole file into memory, and one streams it a line at a time.  We'll start with streaming; a version that reads every line into memory comes up in the parsing section.

```python
with open("fruit.fasta", "r") as fruit_seqs_file:
    for line in fruit_seqs_file:
        print(line)
```

```
>apple

AGTCTATTGATCCTCAGAT

>banana

TGATTTCTGTAATCCGCCA

>blueberry

ATGAGTCTAGCTAGCGATT

>kiwi

CGAATTGCCGACTATAGTT
```



```python
print(fruit_seqs_file)
```

```
<_io.TextIOWrapper name='fruit.fasta' mode='r' encoding='UTF-8'>
```


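As you can see, we've opened the file and given the resulting file object the name `fruit_seqs_file`.  The `"r"` at the end tells `open()` that we want to open the file for reading.  `"r+"` indicates reading and writing, and `"w"` indicates writing (which erases whatever was already in the file).  Looping over the file object gives us its contents one line at a time.  The output is double-spaced because each line still ends with a newline character and `print()` adds another one.  Also know that `with` automatically closes the file when the indented block ends, in addition to opening it.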
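
To put the two reading styles side by side, here is a minimal sketch.  So that it runs anywhere, it first writes a tiny stand-in file into a scratch folder; the names `demo_path`, `streamed`, and `all_lines` are made up for this example, and in the lesson you would simply open `fruit.fasta`.

```python
import os
import tempfile

# Stand-in data (an assumption for this sketch, not the real fruit.fasta).
demo_text = ">apple\nAGTCTATTGATCCTCAGAT\n>banana\nTGATTTCTGTAATCCGCCA\n"
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fasta")
with open(demo_path, "w") as handle:
    handle.write(demo_text)

# Way 1: stream the file, handling one line at a time.
streamed = []
with open(demo_path, "r") as handle:
    for line in handle:
        streamed.append(line.rstrip("\n"))  # drop the trailing newline

# Way 2: read the whole file into memory in one go.
with open(demo_path, "r") as handle:
    all_lines = handle.read().splitlines()  # list of lines, newlines removed

print(streamed)
print(all_lines)
print(handle.closed)  # the file is closed once the with block ends
```

Both approaches give the same list of lines here.  Streaming matters once files get large, because only the current line has to be held in memory at any moment.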

## Parsing a fasta file

When we talk about parsing a file, we simply mean going through it and pulling its contents apart into pieces we can work with.  For fasta files, that usually means separating the headers from the sequences while keeping each header linked to its sequence...I'm sensing a dictionary.

```python
fruit_seqs_dict = {}

# parse fasta into dict with header as key and sequence as value
with open("fruit.fasta", "r") as fruit_seqs_file:
    for line in fruit_seqs_file:
        line = line.rstrip()  # strip the trailing newline
        if line.startswith('>'):
            seq_name = line  # header line: remember it as the current key
        else:
            # sequence line: store it under the most recent header
            # (this assumes each sequence sits on a single line)
            fruit_seqs_dict[seq_name] = line

print(fruit_seqs_dict)
```

```
{'>apple': 'AGTCTATTGATCCTCAGAT', '>banana': 'TGATTTCTGTAATCCGCCA', '>blueberry': 'ATGAGTCTAGCTAGCGATT', '>kiwi': 'CGAATTGCCGACTATAGTT'}
```


There is another way to write the first bit of that for loop that you may see in the workshop and in other Python code: a list comprehension.

```python
fruit_seqs_dict = {}

# parse fasta into dict with header as key and sequence as value
with open("fruit.fasta", "r") as fruit_seqs_file:
    lines = [ line.rstrip() for line in fruit_seqs_file ]
    print(lines)
    for line in lines:
        if line.startswith('>'):
            seq_name = line
        else:
            fruit_seqs_dict[seq_name] = line

print(fruit_seqs_dict)
```

```
['>apple', 'AGTCTATTGATCCTCAGAT', '>banana', 'TGATTTCTGTAATCCGCCA', '>blueberry', 'ATGAGTCTAGCTAGCGATT', '>kiwi', 'CGAATTGCCGACTATAGTT']
{'>apple': 'AGTCTATTGATCCTCAGAT', '>banana': 'TGATTTCTGTAATCCGCCA', '>blueberry': 'ATGAGTCTAGCTAGCGATT', '>kiwi': 'CGAATTGCCGACTATAGTT'}
```


In this case, the list comprehension doesn't buy us anything, because we still need the loop with the conditional.  It also costs us something: all of the lines are now held in memory at once, whereas before we were streaming them one at a time.

Even the first version of the parser is a little clunky, though.  In real life, we would use the `SeqIO` module from Biopython to parse our fasta file.  Notice that it even removes the `>` for us.

```python
from Bio import SeqIO

# first pass: print each record's ID (the header without the '>')
for record in SeqIO.parse("fruit.fasta", "fasta"):
    print(record.id)
# second pass: print each record's sequence
for record in SeqIO.parse("fruit.fasta", "fasta"):
    print(record.seq)
```

```
apple
banana
blueberry
kiwi
AGTCTATTGATCCTCAGAT
TGATTTCTGTAATCCGCCA
ATGAGTCTAGCTAGCGATT
CGAATTGCCGACTATAGTT
```
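
If Biopython isn't installed, or you just want to see what a parser has to deal with, here is a minimal plain-Python sketch.  It assumes, as is common in real fasta files, that one sequence may be wrapped across several lines.  Since our goal was to run some statistics on the sequences, it also computes two simple ones: length and GC content.  The names `parse_fasta`, `gc_content`, and `demo_lines` are my own for this example and are not part of Biopython, and the apple sequence is split across two lines on purpose.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of fasta lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)  # finish the previous record
            name, chunks = line[1:], []  # drop the '>' and start a new record
        elif name is not None:
            chunks.append(line)  # one piece of a (possibly wrapped) sequence
    if name is not None:
        yield name, "".join(chunks)  # don't forget the last record


def gc_content(seq):
    """Return the fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# Stand-in data; with a real file you could pass the open file object instead.
demo_lines = [">apple", "AGTCTATTGA", "TCCTCAGAT", ">kiwi", "CGAATTGCCGACTATAGTT"]

records = dict(parse_fasta(demo_lines))
for name, seq in records.items():
    print(name, len(seq), round(gc_content(seq), 2))
```

Because `parse_fasta` only loops over whatever it is given, it still streams when you hand it an open file, so the whole file never has to sit in memory.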


## Appending to a file

Let's say that we want to add one more sequence to the file.  The easiest way would be to append it to the end.

```python
new_seq_name = '>guava\n'
new_seq = 'GCGTAGTACAG\n'

with open("fruit.fasta", "a") as file: # "a" stands for append
    file.write(new_seq_name)
    file.write(new_seq)
```

I did cheat a little by putting those newline characters on the end, but this works well enough on a small scale, and it's enough for now.
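
One thing appending quietly relies on is that the file already ends with a newline; if it doesn't, the new header gets glued onto the last sequence.  Here is a hedged sketch of a helper that checks for this first.  `append_record` and `demo_path` are hypothetical names for this example, and the stand-in file is written without a trailing newline on purpose.

```python
import os
import tempfile


def append_record(path, name, seq):
    """Append one fasta record, adding a newline first if the file lacks one."""
    needs_newline = False
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "rb") as handle:  # binary mode lets us seek from the end
            handle.seek(-1, os.SEEK_END)
            needs_newline = handle.read(1) != b"\n"
    with open(path, "a") as handle:  # "a" stands for append
        if needs_newline:
            handle.write("\n")
        handle.write(f">{name}\n{seq}\n")


# Stand-in file (an assumption for this sketch), missing its final newline.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fasta")
with open(demo_path, "w") as handle:
    handle.write(">kiwi\nCGAATTGCCGACTATAGTT")

append_record(demo_path, "guava", "GCGTAGTACAG")

with open(demo_path, "r") as handle:
    contents = handle.read()
print(contents)
```

The helper also adds the `>` and the newlines itself, so the caller only has to supply the name and the sequence.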

## Writing to a file

Writing to a file is different from appending because it overwrites any data already in the file.  However, the syntax is essentially the same as for appending.

The next example will make a new file called `vegetable.fasta` and add data to it.

```python
veg = ['>cucumber\n', 'CGATGACG\n', '>pepper\n', 'AGCTGCAT\n', '>green bean\n', 'TAGCAGATTACGATA\n']

with open("vegetable.fasta", "w") as file:
    file.writelines(veg)
```

You should now have a new fasta file!
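
If your sequences live in a dictionary, like the one we built while parsing, writing them out could look something like the sketch below.  It also shows the overwriting behavior of `"w"`: the second call replaces what the first one wrote.  `write_fasta`, `veg_seqs`, and `demo_path` are names I made up for this example, and the file goes into a scratch folder so nothing on your Desktop gets clobbered.

```python
import os
import tempfile


def write_fasta(path, records):
    """Write a dict of {name: sequence} to path, replacing any existing file."""
    with open(path, "w") as handle:  # "w" starts from an empty file every time
        for name, seq in records.items():
            handle.write(f">{name}\n{seq}\n")


veg_seqs = {"cucumber": "CGATGACG", "pepper": "AGCTGCAT"}
demo_path = os.path.join(tempfile.mkdtemp(), "demo_vegetable.fasta")

write_fasta(demo_path, veg_seqs)                # file now holds two records
write_fasta(demo_path, {"pepper": "AGCTGCAT"})  # overwrites: one record left

with open(demo_path, "r") as handle:
    contents = handle.read()
print(contents)
```

If you wanted to keep the cucumber record, you would open the file with `"a"` instead of `"w"` for the second write.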