<h1 id="toctitle">Working with the filesystem</h1>
<ul id="toc"/>

## Working with files

### About paths

__File paths__ are the strings that we use to describe to Python where to find the files that we want. 

On Windows we are used to writing paths like this:

`c:\path\to\a\file`

but in a Python string, a backslash followed by certain characters has a special meaning (for example, `\n` is a newline and `\t` is a tab). To avoid this, use / instead:

`c:/path/to/a/file`

Python on Windows accepts either kind of slash.

On Mac/Linux/Unix, use the / character as normal:

`/home/martin/path/to/a/file`
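
To see why backslashes cause trouble, here is a small sketch comparing three ways of writing the same Windows path. The path `c:\temp\notes.txt` is made up for illustration and does not need to exist.

```python
# A made-up Windows path written three ways
plain = "c:\temp\notes.txt"    # \t becomes a tab and \n becomes a newline
raw = r"c:\temp\notes.txt"     # raw string: backslashes are kept literally
forward = "c:/temp/notes.txt"  # forward slashes need no special handling

print(len(plain))    # 15, because two two-character escapes collapsed
print(len(raw))      # 17
print(len(forward))  # 17
```

A raw string (the `r` prefix) is an alternative when a path has to contain backslashes.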

### Basic file manipulation

**Note:** These code snippets are not provided as 'live' cells because they refer to hypothetical files which you probably don't have on your system.

Functions for file manipulation live in the `os` module. Renaming files is straightforward:

```python
import os
os.rename("old.txt", "new.txt")
os.rename("biology/old.txt", "biology/new.txt")
os.rename("old_folder", "new_folder")  # folders are renamed in the same way as files
```

Moving files is the same as renaming them:

```python
os.rename("biology/old.txt", "python/old.txt")
```

We can create a folder:

```python
os.mkdir("c:/martin/python")
```
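
The snippets above can be tried safely inside a temporary folder. This is a minimal sketch, assuming nothing about your own files: the folder and file names are borrowed from the examples above and are created from scratch by `tempfile`.

```python
import os
import tempfile

# Work inside a throwaway folder that is deleted automatically at the end
with tempfile.TemporaryDirectory() as base:
    biology = os.path.join(base, "biology")
    python_dir = os.path.join(base, "python")
    os.mkdir(biology)
    os.mkdir(python_dir)

    # Create a small file to experiment with
    old_path = os.path.join(biology, "old.txt")
    with open(old_path, "w") as f:
        f.write("ACGT\n")

    # Rename within the same folder
    new_path = os.path.join(biology, "new.txt")
    os.rename(old_path, new_path)

    # Move to a different folder (still just a rename)
    os.rename(new_path, os.path.join(python_dir, "new.txt"))

    biology_contents = os.listdir(biology)
    python_contents = os.listdir(python_dir)

print(biology_contents)  # []
print(python_contents)   # ['new.txt']
```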

### Copying and trees

For more advanced stuff, use the `shutil` module. 

Copying is different for a file:

```python
import shutil
shutil.copy("original.txt", "copy.txt")
```

vs a folder:

```python
shutil.copytree("original_folder", "copy_folder")
```

We can check if a file or folder exists:

```python
if os.path.exists("c:/martin/email.txt"):
    pass  # do something with the file here
```
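
Here is a self-contained sketch of copying and checking for existence, run inside a temporary folder so that no real files are touched. The file names are reused from the snippets above, and `missing.txt` is a made-up name for a file that is never created.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as base:
    original = os.path.join(base, "original.txt")
    with open(original, "w") as f:
        f.write("ACGT\n")

    # Copy a single file
    copy_path = os.path.join(base, "copy.txt")
    shutil.copy(original, copy_path)

    # Copy a folder together with everything inside it
    folder = os.path.join(base, "original_folder")
    os.mkdir(folder)
    shutil.copy(original, folder)  # destination is a folder, so the file name is kept
    folder_copy = os.path.join(base, "copy_folder")
    shutil.copytree(folder, folder_copy)

    copy_exists = os.path.exists(copy_path)
    copied_names = os.listdir(folder_copy)
    missing_exists = os.path.exists(os.path.join(base, "missing.txt"))

print(copy_exists)     # True
print(copied_names)    # ['original.txt']
print(missing_exists)  # False
```

Note that `shutil.copytree` expects the destination folder not to exist yet.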

### Deleting stuff

Deleting files from Python is dangerous: there is no recycle bin and no undo. The functions below are listed in increasing order of danger.

Deleting a file:
```python
os.remove("c:/martin/unwanted_file.txt")
```

Deleting an empty folder:
```python
os.rmdir("c:/martin/empty")
```

Deleting a folder and all its contents:
```python
shutil.rmtree("c:/martin/full")
```
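
The difference between the three functions can be seen in a small sketch. It works only inside a temporary folder created for the purpose, and the names `full` and `unwanted_file.txt` are reused from the snippets above.

```python
import os
import shutil
import tempfile

# Build a throwaway folder that contains one file
base = tempfile.mkdtemp()
full = os.path.join(base, "full")
os.mkdir(full)
unwanted = os.path.join(full, "unwanted_file.txt")
with open(unwanted, "w") as f:
    f.write("delete me\n")

# os.rmdir refuses to delete a folder that still has something in it
try:
    os.rmdir(full)
    rmdir_refused = False
except OSError:
    rmdir_refused = True

# Removing the file first leaves the folder empty, so os.rmdir now works
os.remove(unwanted)
os.rmdir(full)

# shutil.rmtree deletes whatever is left, contents included
shutil.rmtree(base)

print(rmdir_refused)         # True
print(os.path.exists(base))  # False
```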

### Listing folder contents

With the `os` module we can list files and folders in the current working directory:

```python
for file_name in os.listdir("."):
    print("one file name is " + file_name)
```

or in a different directory:

```python
for file_name in os.listdir("c:/martin"):
    print("one file name is " + file_name)
```



In [2]:
import os
sorted(os.listdir('.'))  # list of files in the current directory
sorted(os.listdir('dna_files'))  # list of files in the dna_files folder

['xaa.dna',
 'xab.dna',
 'xac.dna',
 'xad.dna',
 'xae.dna',
 'xaf.dna',
 'xag.dna',
 'xah.dna',
 'xai.dna',
 'xaj.dna']
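
`os.listdir` returns every name in the folder, including subfolders and files of other types. The sketch below shows one way to keep only the `.dna` files. It builds its own temporary folder, so the file names other than the `.dna` ones are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create a few empty-ish files and one subfolder to list
    for name in ["xab.dna", "xaa.dna", "notes.txt"]:
        with open(os.path.join(folder, name), "w") as f:
            f.write("ACGT\n")
    os.mkdir(os.path.join(folder, "old_results"))

    # Keep only real files whose names end in .dna
    dna_names = sorted(
        name for name in os.listdir(folder)
        if name.endswith(".dna") and os.path.isfile(os.path.join(folder, name))
    )

print(dna_names)  # ['xaa.dna', 'xab.dna']
```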

## Running external programs

Sometimes it's helpful to be able to run an existing program (e.g. an analysis tool such as BLAST) from within a Python program.

To run a program and display the output on the terminal:

```python
import subprocess

# run the standard Linux date program to print the current month
subprocess.run("/bin/date +%B", shell=True)
```

Note: This won't work inside a Jupyter notebook, even on Linux. The program runs, but its terminal output does not appear in the notebook.

To run a program and capture the output into a string, use the `stdout` and `universal_newlines` parameters (from Python 3.7 onwards, `text=True` is an equivalent spelling of `universal_newlines=True`):

```python
res = subprocess.run("/bin/date +%B", shell=True, stdout=subprocess.PIPE, universal_newlines=True)
month = res.stdout
# month is, e.g., 'September\n'
```

Note that the captured stdout comes with newline characters, just as if it had been read from a file with `read()`. Launching long-running programs or programs that produce a lot of output gets tricky, so beware.
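
Here is a sketch that should run on any operating system, because the program it launches is Python itself (found via `sys.executable`) rather than `/bin/date`. Passing the command as a list instead of using `shell=True` is a choice made for this example, not something the notes above require.

```python
import subprocess
import sys

# Run a second Python interpreter as the external program
res = subprocess.run(
    [sys.executable, "-c", "print('hello from a child process')"],
    stdout=subprocess.PIPE,
    universal_newlines=True,
)

# A return code of 0 conventionally means the program finished without error
print(res.returncode)

# strip() removes the trailing newline from the captured output
message = res.stdout.strip()
print(message)
```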

## Getting command line input

Arguments typed after the script name on the command line are available in the list `sys.argv`:



```python
# e.g. python3 myscript.py apple banana
import sys
print(sys.argv)      # ['myscript.py', 'apple', 'banana']
first = sys.argv[1]  # apple
second = sys.argv[2] # banana
```

Yes, that is "argv" not "args"! It comes from C programming and is short for "argument vector". This is only useful if you're working on the command line, not in Jupyter. 
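
Indexing `sys.argv` raises an `IndexError` when too few arguments are given, so it can help to check the length first. The function name `get_two_args` below is hypothetical, and it takes the argument list as a parameter so that the sketch can be tried without a command line.

```python
def get_two_args(argv):
    """Return the first two command line arguments, or None if too few were given."""
    if len(argv) < 3:  # argv[0] is the script name, so two arguments means length 3
        print("usage: python3 myscript.py first second")
        return None
    return argv[1], argv[2]

# In a real script you would import sys and call get_two_args(sys.argv)
print(get_two_args(["myscript.py", "apple", "banana"]))  # ('apple', 'banana')
print(get_two_args(["myscript.py"]))                     # prints the usage line, then None
```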

## Exercises

### Binning DNA sequences

Inside the __dna_files__ folder is a collection of files that end in `.dna`. Each file holds a collection of DNA sequences, one per line.

Write a program which creates 9 new files – one for sequences between 100 and 199 bases long, one for sequences between 200 and 299 bases long, etc. Write out each DNA sequence in the input files to the correct output file. You will have to:

 - get a list of all files in the folder
 - process the files one by one
 - process each file line by line
 - calculate the length of each line
 - figure out the correct output file for each line
 - create the output files in the right place
 - write the lines to the correct output file
 
There's lots to think about for this exercise:

 - how will you make sure that you don't overwrite the output files with each new line?
 - how can you generate the bin sizes without lots of code?
 - how will you make sure you only process the right input files?
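
One possible approach to the first two questions is sketched below. It is an illustration, not the intended solution: the helper `bin_start`, the output file names, and the three example sequences are all made up. The output files are opened once, before any line is processed, so nothing gets overwritten part-way through.

```python
import os
import tempfile

def bin_start(length, bin_size=100):
    """Return the lower bound of the bin that a sequence length falls into."""
    return (length // bin_size) * bin_size

with tempfile.TemporaryDirectory() as out_dir:
    # Open all nine output files once, keyed by the lower bound of their bin
    out_files = {}
    for start in range(100, 1000, 100):
        name = "seq_{}_{}.txt".format(start, start + 99)
        out_files[start] = open(os.path.join(out_dir, name), "w")

    # Made-up sequences of length 120, 250 and 130
    for seq in ["A" * 120, "C" * 250, "G" * 130]:
        start = bin_start(len(seq))
        if start in out_files:  # skip lengths outside 100-999
            print(seq, file=out_files[start])

    for handle in out_files.values():
        handle.close()

    # Read one of the output files back to check what was written
    with open(os.path.join(out_dir, "seq_100_199.txt")) as f:
        lines_100_199 = f.read().splitlines()

print(len(lines_100_199))  # 2
```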
 

In [69]:
file = open("dna_files/xaa.dna")  # a file object: read it line by line, not all at once
# file.rstrip('\n') fails because a file object has no rstrip(); strip each line instead

for line in file:
    line = line.rstrip('\n')
    print(len(line))  # print() can write somewhere else via its file= argument
file.close()


333
283
380
115
753
764
234
117
906
160


In [71]:
# open each file in the folder in turn
for file_name in os.listdir("dna_files"):
    name = "dna_files/" + file_name
    with open(name) as file:  # the file is closed automatically afterwards
        for line in file:
            line = line.rstrip('\n')
            print(len(line))

333
283
380
115
753
764
234
117
906
160
833
390
355
968
999
257
909
236
943
703
665
165
677
454
426
600
177
888
535
974
949
453
988
420
767
221
316
573
231
409
600
707
853
971
279
533
452
332
990
899
242
714
291
313
926
738
516
218
558
465
432
818
604
879
619
500
119
341
303
469
575
987
141
590
833
200
539
625
363
779
317
382
324
747
353
878
692
806
118
556
121
442
520
866
969
672
138
922
652
749


In [None]:
def len_to_file(seq, length):
    # Write seq to the output file for its 100-base bin, e.g. lengths
    # 100-199 go to seq_100_199.txt (these file names are an assumption).
    # Lengths outside 100-999 are skipped.
    if 100 <= length <= 999:
        lower = (length // 100) * 100
        file_name = "seq_{}_{}.txt".format(lower, lower + 99)
        with open(file_name, "a") as out_file:  # append mode keeps earlier lines
            print(seq, file=out_file)

### Kmer counting 

Write a program that counts every kmer of a given length across all DNA sequences in the input files and displays just the ones that occur more than a given number of times. Your program should take two interactive arguments – the kmer length, and the cutoff number.
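
The counting step could look something like the sketch below. It is only one possible approach: the function name `count_kmers` and the two short example sequences are made up, and reading the input files and asking for the two arguments are left out.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count every overlapping substring of length k in a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts

# Two made-up sequences, kmer length 3, cutoff 1
counts = count_kmers(["ATGATG", "GATG"], 3)
cutoff = 1
frequent = {kmer: n for kmer, n in counts.items() if n > cutoff}

print(frequent)  # {'ATG': 3, 'GAT': 2}
```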
