# Reading and writing files

## Index

* [Different ways of opening a file](read_write.ipynb#Different-ways-of-opening-a-file)
* [Read](read_write.ipynb#Read)
* [Write](read_write.ipynb#Write)
* [Exercise 9](read_write.ipynb#Exercise-9)
* [Write a script](read_write.ipynb#Write-a-script)
* [Exercise 10](read_write.ipynb#Exercise-10)

## Different ways of opening a file
[back to top](read_write.ipynb#Index)

Python accesses files using the built-in function `open`. Files can be read or written depending on the mode argument we pass to `open`. Once we are done with a file, it has to be closed with the method `close()`, unless we use the `with` statement (recommended), which closes the file automatically.  
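The difference between the two styles can be sketched as follows. This is only an illustration: the file `example.txt` and the temporary folder it lives in are made up for this example and are not part of the course data.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# Manual style: we must remember to call close(), even if an error occurs
fd = open(path, 'w')
try:
    fd.write('first line\n')
finally:
    fd.close()

# Recommended style: the file is closed automatically at the end of the block
with open(path, 'r') as fd:
    content = fd.read()

print(content.strip())
print(fd.closed)   # True: `with` closed the file for us
```

With `with`, the file is closed even if an exception is raised inside the block, which is why it is usually the safer choice.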

Arguments available to open a file:  
```
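    'r'  -->  read only (default)
    'w'  -->  write only (creates the file, or overwrites an existing file with the same name)
    'a'  -->  append to the end of the file (do not overwrite; the file is created if it does not exist)
    'r+' -->  open an existing file in both read and write mode
    'b'  -->  binary mode (combined with another mode, e.g. 'rb')
    't'  -->  text mode (default)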
```
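As a minimal sketch of how some of these modes behave, the example below writes, appends to, and re-reads a small file. The file name `modes.txt` and the temporary folder are assumptions made for this example only.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration
path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

with open(path, 'w') as fd:      # 'w' creates the file (or overwrites it)
    fd.write('ACGT\n')

with open(path, 'a') as fd:      # 'a' adds to the end, keeping the old content
    fd.write('TTGA\n')

with open(path, 'r') as fd:      # 'r' is text mode by default: we get str
    text_lines = fd.readlines()

with open(path, 'rb') as fd:     # adding 'b' gives raw bytes instead of str
    raw = fd.read()

print(text_lines)
print(type(raw))
```

Opening the same file again with `'w'` instead of `'a'` would have discarded the first line.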

## Read 
[back to top](read_write.ipynb#Index)

Let's read an example fasta file with Python.

In [2]:
# Read a file (store all the file in memory)
with open('../data/input.fa', 'r') as fd:
    whole = fd.readlines()

print(whole)

['>sequence 1\n', 'GAGTGAAATTAAGGCTATTT\n', '>sequence 2\n', 'CGCTTTCTTTAAGCCCTAAC\n', '>sequence 3\n', 'AATACCATTTAAGGAGTCAA\n', '>sequence 4\n', 'GAAAATATTTAATGATGTCA\n', '>sequence 5\n', 'AGCCTCAATTAAAAAAGTAT\n']


In [3]:
# Better print without the newline character
for line in whole:
    print(line.strip())   # strip() removes surrounding whitespace; print(line, end='') gives similar output

>sequence 1
GAGTGAAATTAAGGCTATTT
>sequence 2
CGCTTTCTTTAAGCCCTAAC
>sequence 3
AATACCATTTAAGGAGTCAA
>sequence 4
GAAAATATTTAATGATGTCA
>sequence 5
AGCCTCAATTAAAAAAGTAT


Storing a whole file in memory can be dangerous, especially if we have to read a big file. A safer option is to read the file line by line:

In [5]:
# Read a file line by line (memory safe)
with open('../data/input.fa', 'r') as fd:
    for line in fd:
        # Here we write the instructions
        print(line.strip())

>sequence 1
GAGTGAAATTAAGGCTATTT
>sequence 2
CGCTTTCTTTAAGCCCTAAC
>sequence 3
AATACCATTTAAGGAGTCAA
>sequence 4
GAAAATATTTAATGATGTCA
>sequence 5
AGCCTCAATTAAAAAAGTAT


## Write
[back to top](read_write.ipynb#Index)

In [8]:
with open('../data/output.txt', 'w') as fd:   # Pay attention, with this instruction you will overwrite
    fd.write('Hello world!!\n')       # an existing file with the same name without any warning!

In [12]:
# Let's check if the file has been successfully written
with open('../data/output.txt', 'r') as fd:
    for line in fd:
        print(line)

Hello world!!
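
Note that `write` does not add a newline character by itself, so we have to include `\n` explicitly when writing several lines. The sketch below shows this, together with `print(..., file=fd)`, which does add the newline. The output file is created in a temporary folder and is an assumption of this example, not the `../data/output.txt` file used above.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

sequences = ['GAGTGAAATTAAGGCTATTT', 'CGCTTTCTTTAAGCCCTAAC']

with open(path, 'w') as fd:
    for seq in sequences:
        fd.write(seq + '\n')     # write() does not add the newline for us

with open(path, 'a') as fd:      # 'a' keeps what is already in the file
    print('AATACCATTTAAGGAGTCAA', file=fd)   # print() adds the newline

# Read the file back to check its content
with open(path, 'r') as fd:
    written = [line.strip() for line in fd]

print(written)
```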



## Exercise 9

Try to write code that reads the `../data/input.fa` file and writes only the DNA sequences to an output file (without the header lines that start with `>`).

[Solution](solutions.ipynb#Exercise-9)

## Write a script
[back to top](read_write.ipynb#Index)

Python can be used interactively (for instance in a terminal). This is fine for short tasks or for debugging a piece of code, but it is not recommended for sophisticated tasks.  
In such cases it is much easier to write a script that we can then execute. 

The following code calculates the number of sequences in a fasta file. Write the code in a text editor and then save it with the extension `.py` (for example `my_script.py`).

In [13]:
# This script calculates the number of sequences in a fasta file
count = 0

with open('../data/input.fa', 'r') as fd:
    for line in fd:
        if line.startswith('>'):
            count += 1

print("The file contains", count, "sequences")

The file contains 5 sequences


Save the script somewhere, for instance in the folder `scripts` with the name `my_script.py`.  
Now execute the script as follows:
```
    python ../scripts/my_script.py
```
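A script becomes more reusable if the file name is not written inside the code. One possible variation, shown here only as a sketch, reads the path from the command line with `sys.argv`. The function name `count_sequences` and the usage message are choices made for this example and are not part of the original script.

```python
import sys


def count_sequences(path):
    """Count the header lines (starting with '>') in a fasta file."""
    count = 0
    with open(path, 'r') as fd:
        for line in fd:
            if line.startswith('>'):
                count += 1
    return count


if __name__ == '__main__':
    # sys.argv[0] is the script name; the first real argument is sys.argv[1]
    if len(sys.argv) == 2:
        print("The file contains", count_sequences(sys.argv[1]), "sequences")
    else:
        print("Usage: python my_script.py <fasta file>")
```

Assuming the script is saved as `my_script.py` in the `scripts` folder, it could then be run as `python ../scripts/my_script.py ../data/input.fa`.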

## Exercise 10
[back to top](read_write.ipynb#Index)

Repeat `Exercise 9` by writing a script.

[Solution](solutions.ipynb#Exercise-10)