<h1 id="toctitle">Working with files</h1>
<ul id="toc"/>

Files are important in bioinformatics. We have many text-based file formats:

- FASTA
- GenBank
- FASTQ
- VCF
- BLAST output
- SAM
- etc.

Often, we need to take a file and tweak its format for existing tools (e.g. fussy FASTA headers).

Other times we need to write a program that will either create input for, or accept output from, other tools.

Today we will talk about:

- reading text
- processing lines in a file
- creating new files
- appending and writing data to files

Later we will talk about:

- renaming
- moving
- copying
- deleting
- doing stuff to each file in a folder

Note: Python can be used to edit text files instead of R.

## Getting data from a file

### Opening a file

Getting data out of a file is a **two step** process: open then read.

In [None]:
my_file = open("dna.txt")

In its simplest form, `open()` is a function that takes one string argument - the name of the file - and returns a __File object__, which we save into a variable.

File objects are a new type of data that represent a file on disk. Like strings, they have useful methods, but unlike strings we can't simply `print()` them to see the contents:

In [None]:
my_file  # shows a description of the file object (name, mode, etc.), not the file's contents

### Reading file contents

`read()` is a File object method that returns the contents of the file as a string. Called with no arguments, as here, it returns everything in the file.

In [1]:
my_file = open("dna.txt")
my_file.read()

'ACTGTACGTGCACTGATC\n'

In [2]:
# usually we want to store the contents in a variable
my_file = open("dna.txt")
my_file_contents = my_file.read()
my_file_contents

'ACTGTACGTGCACTGATC\n'

In [4]:
# or in one step...
my_file_contents = open("dna.txt").read()  # read the contents straight into the variable
my_file_contents

'ACTGTACGTGCACTGATC\n'

Remember the special character `\n`. Each line of a file ends with this newline character, and it is part of the string that `read()` returns. If you use the `len()` function on the string, for example, the newline will be counted. Remove it with the `rstrip()` method:

In [None]:
my_file = open("dna.txt")
my_file_contents = my_file.read()

# remove the newline from the end of the file contents
my_dna = my_file_contents.rstrip("\n")
my_dna

Now the variable contains only the DNA string.
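To see the effect of the newline on `len()`, here is a minimal sketch. It assumes nothing about your own folder: it first writes a throwaway copy of `dna.txt` into a temporary directory (that setup, and the names `folder`, `path` and `setup_file`, are only for illustration).

```python
import os
import tempfile

# Setup for illustration only: make a throwaway dna.txt in a temporary folder
folder = tempfile.mkdtemp()
path = os.path.join(folder, "dna.txt")
setup_file = open(path, "w")
setup_file.write("ACTGTACGTGCACTGATC\n")
setup_file.close()

# Open, then read, as in the lesson
my_file = open(path)
my_file_contents = my_file.read()
my_file.close()

# Remove the newline from the end of the file contents
my_dna = my_file_contents.rstrip("\n")

length_with_newline = len(my_file_contents)  # counts the "\n" as one character
length_without_newline = len(my_dna)         # counts only the bases
print(length_with_newline, length_without_newline)
```

The two lengths differ by exactly one, which is the newline character.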


Important: files are **exhaustible**. This means that once we have read the contents of a file, doing so again will give nothing:

In [6]:
my_file = open("dna.txt")
my_file_contents1 = my_file.read()
my_file_contents2 = my_file.read()  # empty: the file has already been read above

(my_file_contents1, my_file_contents2)

('ACTGTACGTGCACTGATC\n', '')

This can easily cause confusion, so remember that a file object only gives you its contents once. Rather than opening the file a second time, we can store the contents of the file in a variable and then use that variable as many times as we like.
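The "read once, reuse the variable" advice might look like the sketch below. The temporary copy of `dna.txt` is an assumption made so the example runs anywhere, and `a_count` and `dna_length` are just illustrative names.

```python
import os
import tempfile

# Setup for illustration only: make a throwaway dna.txt in a temporary folder
folder = tempfile.mkdtemp()
path = os.path.join(folder, "dna.txt")
setup_file = open(path, "w")
setup_file.write("ACTGTACGTGCACTGATC\n")
setup_file.close()

my_file = open(path)
my_dna = my_file.read().rstrip("\n")  # read once and keep the result
second_read = my_file.read()          # the file is exhausted, so this is empty
my_file.close()

# The stored string can be used as often as we like
a_count = my_dna.count("A")
dna_length = len(my_dna)
print(a_count, dna_length)
```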

### Finding files

If Python says it cannot find your file, it may be looking in the wrong directory. When given a bare file name, Python looks in the *current working directory*. In simple setups this is the folder where the Python code file is saved. You can also set this directory manually. We'll look at this later.
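As a quick diagnostic, the standard `os` module can report the current working directory and whether a file is visible from it. This is only a sketch: the `os` module is not covered in this session, and `dna.txt` may or may not exist on your machine.

```python
import os

# The folder Python searches when given a bare file name such as "dna.txt"
current_folder = os.getcwd()
print(current_folder)

# Check whether the file is there before trying to open it
file_name = "dna.txt"
if os.path.exists(file_name):
    print("Found " + file_name)
else:
    print(file_name + " is not in " + current_folder)
```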


## Writing to files

To write to a file we have to give `open()` a second argument:

In [None]:
my_file = open("out.txt", "w")  # the second argument, "w", is the mode: open this file for writing

`w` stands for write. Opening a file in this mode creates it if it doesn't exist and replaces its contents if it does. Once we have opened a file for writing, we can use the `write()` method. `write()` does not add a newline for us, so we end each line with a "\n" newline character ourselves:

In [None]:
my_file.write("Hello world\n")

How can we tell if this has worked? See if the file is now visible in the directory listing.

## Closing files

Once we've finished writing data to a file, we have to close it. Closing makes sure that everything we wrote is actually saved to disk:

In [None]:
my_file = open("out.txt", "w")
my_file.write("Hello world\n")

# remember to close the file
my_file.close()
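The lesson closes files by hand. Python also offers a `with` block, which is not covered in this session but closes the file automatically when the indented block ends. A minimal sketch, writing into a temporary folder (an assumption, so that no real file is overwritten):

```python
import os
import tempfile

# Setup for illustration only: write into a temporary folder
folder = tempfile.mkdtemp()
path = os.path.join(folder, "out.txt")

# The file is closed automatically at the end of the indented block
with open(path, "w") as my_file:
    my_file.write("Hello world\n")

closed_after_block = my_file.closed  # True once the block has finished

# Read the file back to confirm what was written
with open(path) as check_file:
    written_text = check_file.read()
print(closed_after_block, repr(written_text))
```

Either style works; the explicit `close()` version makes the steps easier to see while learning.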

## Writing to files with print()

It's also possible to write into files using the `print()` function. In this case, you don't have to worry about adding the "\n", as `print()` will do it for you. It looks like this:

In [None]:
my_file = open("out_2_lines.txt", "w")
print("First line", file=my_file)
print("Second line", file=my_file)

# You still need to close the file.
my_file.close()
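The outline at the top mentions appending. Opening with the mode `"a"` instead of `"w"` adds to the end of an existing file rather than replacing its contents. The sketch below assumes a temporary folder so it can be run safely; the file name matches the example above.

```python
import os
import tempfile

# Setup for illustration only: work in a temporary folder
folder = tempfile.mkdtemp()
path = os.path.join(folder, "out_2_lines.txt")

# "w" creates the file (or replaces its contents if it already exists)
my_file = open(path, "w")
print("First line", file=my_file)
print("Second line", file=my_file)
my_file.close()

# "a" keeps what is already there and adds to the end
my_file = open(path, "a")
print("Third line", file=my_file)
my_file.close()

# Read the file back to see all three lines
check_file = open(path)
contents = check_file.read()
check_file.close()
print(contents)
```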

## Getting user input

Another way to get information into our programs is by using interactive input. Ask the user with the `input()` function:

In [None]:
name = input("What is your name, traveller?\n")  # immediate input to the program, e.g. asking how many individuals to simulate
print("Greetings, " + name)

## Summary of all things!

| __Name__ | __Job__ | __Arguments__ | __Returns__ | __Type__ |
|---|---|---|---|---|
| `open()` | opens a file for reading or writing | filename, optional mode (both strings) | File object | built-in function |
| `read()` | reads the contents of a file | none needed | string | method of File objects |
| `rstrip()` | removes characters from end of string (usually newline) | characters to remove | string | method of string objects |
| `write()` | writes text to a file | string to write | number of characters written (usually ignored) | method of File objects |
| `close()` | closes a file | none | nothing | method of File objects |



## Exercises

You'll need to use the string manipulation material from the previous session, so have it open somewhere. One complication when doing exercises with file output is that you have to check your output file to see if your program has worked. There's no need to write a separate bit of code just to print the contents of your output file! Your output file will be plain text, so open it in a text editor (in Jupyter, just click on the file to view it in the browser).

### Splitting genomic DNA

Look at the file called *genomic_dna.txt* – it contains the same piece of genomic DNA that we were using in the final exercise from the previous session. The first exon runs from the start of the sequence to the sixty-third character, and the second exon runs from the ninety-first character to the end of the sequence. 

Write a program that will split the genomic DNA into coding and non-coding parts, and write these sequences to two separate files. Hint: use your solution to the last exercise from the previous session as a starting point.
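Turning "the Nth character" into slice positions is where this exercise usually goes wrong, because Python counts from 0 and a slice stops just before its end position. A toy sketch with a made-up nine-letter string (not the real genomic data):

```python
# Made-up sequence: characters 1-3 are A, 4-6 are T, 7-9 are C
toy = "AAATTTCCC"

first_part = toy[:3]    # from the start to the 3rd character
middle_part = toy[3:6]  # from the 4th to the 6th character
last_part = toy[6:]     # from the 7th character to the end

print(first_part, middle_part, last_part)
```

By the same reasoning, "from the start to the sixty-third character" is the slice `[:63]`, and "from the ninety-first character to the end" is `[90:]`.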

In [6]:
dna_file = open("genomic_dna.txt")  # open the file first
dna_cont = dna_file.read().rstrip('\n')  # read the contents and remove the trailing newline
dna_file.close()
dna_cont  # check it

# Python counts from 0, so characters 1-63 are [:63] and character 91 onwards is [90:]
exon1 = dna_cont[:63]
intron = dna_cont[63:90]
exon2 = dna_cont[90:]

exon_file = open('genomic_dna_exons.txt', 'w')  # open two new text files to write to
intron_file = open('genomic_dna_introns.txt', 'w')

print(exon1, file=exon_file)  # print into the newly created files
print(exon2, file=exon_file)
print(intron, file=intron_file)

exon_file.close()
intron_file.close()  # finally, close both files to finish

In [8]:
dna_cont  # empty here: the input file was somehow deleted (whoops)

''

### Writing a FASTA file

A FASTA file stores sequence data and looks like this:

```
>sequence_one
cgatcgatcatcgatgcattgtagctatcg
>sequence_two
acagtagctacgtgtgtcgta
```

Write a program that will create a FASTA file for the following three sequences – make sure that all sequences are in upper case and only contain the bases A, T, G and C.

| __Sequence header__ | __Sequence__ |
|---------------------|---------------|
| ABC123 | ATCGTACGATCGATCGATCGCTAGACGTATCG |
| DEF456 | actgatcgacgatcgatcgatcacgact |
| HIJ789 | ACTGAC-ACTGT--ACTGTA----CATGTG |

### Writing multiple FASTA files

Use the data from the previous exercise, but instead of creating a __single__ FASTA file, create __three__ new FASTA files – one per sequence. The names of the FASTA files should be the same as the sequence header names, with the extension .fasta.

In [None]:
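# Single-file version: all three sequences go into one FASTA file
seq1_hed = 'ABC123'
seq2_hed = 'DEF456'
seq3_hed = 'HIJ789'

seq1 = 'ATCGTACGATCGATCGATCGCTAGACGTATCG'
seq2 = 'actgatcgacgatcgatcgatcacgact'
seq3 = 'ACTGAC-ACTGT--ACTGTA----CATGTG'

# Open (create) a file to write to
fasta = open('fasta_file.txt', 'w')

# print() automatically adds the newline character
print('>' + seq1_hed, file=fasta)
print(seq1, file=fasta)
print('>' + seq2_hed, file=fasta)
print(seq2.upper(), file=fasta)  # convert to upper case
print('>' + seq3_hed, file=fasta)
print(seq3.replace('-', ''), file=fasta)  # remove the gap characters

fasta.close()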
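For the three-file version of the exercise, one possible sketch is shown below. It is not the official answer. It builds each file name from the header plus `.fasta`, and writes into a temporary folder (an assumption made so the example can be run without cluttering your working directory). The variable names are illustrative.

```python
import os
import tempfile

# Setup for illustration only: write into a temporary folder
folder = tempfile.mkdtemp()

header1 = "ABC123"
header2 = "DEF456"
header3 = "HIJ789"

# Clean the sequences: upper case only, no gap characters
seq1 = "ATCGTACGATCGATCGATCGCTAGACGTATCG"
seq2 = "actgatcgacgatcgatcgatcacgact".upper()
seq3 = "ACTGAC-ACTGT--ACTGTA----CATGTG".replace("-", "")

# One file per sequence, named after its header
file1 = open(os.path.join(folder, header1 + ".fasta"), "w")
file1.write(">" + header1 + "\n" + seq1 + "\n")
file1.close()

file2 = open(os.path.join(folder, header2 + ".fasta"), "w")
file2.write(">" + header2 + "\n" + seq2 + "\n")
file2.close()

file3 = open(os.path.join(folder, header3 + ".fasta"), "w")
file3.write(">" + header3 + "\n" + seq3 + "\n")
file3.close()

print(sorted(os.listdir(folder)))
```

The three near-identical blocks are deliberate: they use only tools from this session. A loop would remove the repetition.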

