# Lecture 11 - Files 

Today:
* Files.
  * Writing Files
  * Reading files
  * With keyword
  * For loops on file handles
  * Binary files 
  * Large worked example (Fasta file processing)
  * Directories
  * Web file retrieval



# Motivation

* Up to now we've mainly used input() to get input to the programs we've written.

* In the real world, this is not the way we generally get data into a program. Rather we read and write data to and from files.

* Files (e.g. documents, spreadsheets or images) are just semantically organized sequences of bits. 

* They can exist on a disk, be pulled from the internet or be created transiently as we pipe inputs and outputs from our programs. 

* They can be small, and they can be very large. 

Anyway, let's dive in... (today will use larger code samples than we've seen previously - you're leveling up)

# Writing Files

Let's first look at creating a file. You'll often want to do this when writing out the results of some analysis.

In [1]:
my_file = open("test.txt", "w") 
# Opens a file test.txt in the working directory of the Python program.
# The "w" (short for "write"), says you intend to write to the file. 

# If test.txt does not exist before this point it is created as an empty file.

# If test.txt existed before this point its contents are now overwritten - 
# so be careful opening files for writing

# my_file (the return value from open) is a "file handle", 
# you use it to handle the communication with the file

# The write method of the file_handle object is a lot like print(), 
# except (1) the string arguments are written to the file, instead of the screen, 
# and (2) write does not add newlines to the end of the string 
# (so you add them explicitly, as in this example)

my_file.write("My first file written from Python\n") 
my_file.write("---------------------------------\n")
my_file.write("Hello, world!\n")

my_file.close() 
# This closes the file. 
# You can't keep using my_file after this as the file connection is closed. 
# It is important to close the file when 
# you're done to ensure the file writing is completed successfully (although Python
# will clean things up automatically on exit, you shouldn't rely on this in cases
# where your program gets terminated prematurely)

Note: (Do matched demo from terminal to show contents of file)


**To append to the end of an existing file:** 

In [2]:
my_file = open("test.txt", "a") 
# "a" (short for append) opens an existing file to append to it 
# If test.txt does not exist then 'a' creates the file, just like 'w'

my_file.write("Hello, again!\n") 

my_file.close() # Again, always close the file

Note: (Continue matched demo from terminal to show contents of appended to file)
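
To make the difference between "w" and "a" concrete, here is a small sketch. It writes to a throwaway file in a temporary directory (the `tempfile` module and the name `modes_demo.txt` are my own choices for illustration, not part of the lecture demo) so that the real test.txt is left alone.

```python
import os
import tempfile

# Work in a throwaway directory so we don't clobber the real test.txt
demo_path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

fh = open(demo_path, "w")   # "w" creates the file (or empties it if it exists)
fh.write("first\n")
fh.close()

fh = open(demo_path, "a")   # "a" keeps what is there and adds to the end
fh.write("second\n")
fh.close()

fh = open(demo_path, "r")
after_append = fh.read()
fh.close()

fh = open(demo_path, "w")   # Opening with "w" again wipes the previous contents
fh.write("third\n")
fh.close()

fh = open(demo_path, "r")
after_rewrite = fh.read()
fh.close()

print(repr(after_append))   # 'first\nsecond\n'
print(repr(after_rewrite))  # 'third\n'
```

The second "w" open is the one to watch out for: the earlier two lines are gone as soon as the file is opened, even before anything is written.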

# Reading Files

In [3]:
mynewhandle = open("test.txt", "r") 
# This opens the file, "r" argument means "read"

# The following loop walks through the file line by line

while True:                            # Keep reading forever
  
    theline = mynewhandle.readline()   # Try to read next line using readline()
    
    if theline == "":                  # When done, readline returns the empty string ''
        # allowing us to detect the end of the file and to leave the loop    
        break                          #     leave the loop

    # Now process the line we've just read
    print(theline, end="")

mynewhandle.close() 

My first file written from Python
---------------------------------
Hello, world!
Hello, again!


What is nice about the above loop is that, regardless of the size of the file, Python only holds one line in memory at a time, allowing it to process arbitrarily large files.
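
As a rough illustration of this, the sketch below counts the lines in a file and tracks the longest one without ever holding more than one line at a time. The 10,000-line file is just a stand-in for a "large" file, and the temporary path is an assumption made for the example.

```python
import os
import tempfile

big_path = os.path.join(tempfile.mkdtemp(), "big_demo.txt")

# Create a stand-in for a large file: 10,000 short lines
fh = open(big_path, "w")
for i in range(10000):
    fh.write("line number " + str(i) + "\n")
fh.close()

line_count = 0   # How many lines we've seen so far
longest = 0      # Length (including the newline) of the longest line so far

fh = open(big_path, "r")
while True:
    theline = fh.readline()
    if theline == "":        # Empty string means end of file
        break
    line_count += 1
    longest = max(longest, len(theline))
fh.close()

print(line_count, longest)   # 10000 17
```

Only `theline`, `line_count` and `longest` are kept between iterations, so the memory used does not grow with the size of the file.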

**Python will complain if you try to open a file that doesn't exist...**

In [4]:
mynewhandle = open("wharrah.txt", "r")

FileNotFoundError: [Errno 2] No such file or directory: 'wharrah.txt'
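
If you'd rather not have the program stop, one option is to check for the file first or to catch the error. This is only a preview sketch (try/except belongs to the exceptions lecture), and it points at a path inside a fresh temporary directory so the file is guaranteed to be missing.

```python
import os
import tempfile

# A path inside a brand-new empty directory, so the file cannot exist yet
missing_path = os.path.join(tempfile.mkdtemp(), "wharrah.txt")

print(os.path.exists(missing_path))  # False

try:
    fh = open(missing_path, "r")
except FileNotFoundError:
    # We end up here because the file isn't there
    opened = False
    print("Could not find", missing_path)
else:
    # This branch only runs if open() succeeded
    opened = True
    fh.close()
```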

**To read contents of file into memory in one go use read()**

In [11]:
fh = open("test.txt", "r") # I often use fh to denote a file handle, for some reason

content = fh.read() # Read the contents of the file

print(type(content)) # Yup, content is a string

print(content)

fh.close()

<class 'str'>
My first file written from Python
---------------------------------
Hello, world!
Hello, again!



**readlines will read all the lines in a file:**

In [12]:
fh = open("test.txt") # I often use fh to denote a file handle, for some reason
# If you don't give the second argument it defaults to read only

content = fh.readlines() # Read the contents of the file

print(type(content)) # Yup, content is now a list of strings

print(content)

fh.close()

<class 'list'>
['My first file written from Python\n', '---------------------------------\n', 'Hello, world!\n', 'Hello, again!\n']
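
Notice that every string returned by readlines() still ends with its "\n". A minimal sketch of stripping those off, using a small temporary file of my own rather than test.txt:

```python
import os
import tempfile

lines_demo_path = os.path.join(tempfile.mkdtemp(), "lines_demo.txt")

fh = open(lines_demo_path, "w")
fh.write("Hello, world!\n")
fh.write("Hello, again!\n")
fh.close()

fh = open(lines_demo_path)
raw_lines = fh.readlines()   # Each entry keeps its trailing newline
fh.close()

clean_lines = []
for line in raw_lines:
    clean_lines.append(line.rstrip("\n"))  # Remove the newline from the right-hand end

print(raw_lines)    # ['Hello, world!\n', 'Hello, again!\n']
print(clean_lines)  # ['Hello, world!', 'Hello, again!']
```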


# Challenge 1

In [26]:
l = list(range(100))

## Write the numbers in l to a file "out.txt"
## Read the numbers from out.txt into a new list, "l2"
## check that l == l2




# The with Keyword

Good programming is often about anticipating errors and dealing with them.

The 'with' keyword is both a useful short-hand and good defensive programming. It closes the file handle for you, even if something goes wrong while you are using it.

(see https://docs.python.org/2.5/whatsnew/pep-343.html for the details...)

In [8]:
with open("test.txt") as fh: 
    # With removes the need for the "fh.close()" statement
    # and is better because fh.close() is guaranteed to be run even if there is an error
    # in processing the file
  
    assert False # This causes an error, but fh will still be closed
    content = fh.readlines() 
    print(content)
  
  
# For files the syntax is:
# with OPEN_EXPRESSION as FILE_HANDLE:
#   STATEMENTS THAT USE FILE_HANDLE

AssertionError: 
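
You can check the "still closed" claim for yourself. The sketch below raises an error inside a with block, catches it, and then looks at the handle's `closed` attribute. It also shows roughly what 'with' saves you from writing by hand (a try/finally). The temporary file and the ValueError are stand-ins chosen for the example.

```python
import os
import tempfile

with_demo_path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")

with open(with_demo_path, "w") as fh:
    fh.write("some text\n")

try:
    with open(with_demo_path) as fh:
        raise ValueError("simulated problem while processing the file")
except ValueError:
    print("An error happened inside the with block")

print(fh.closed)   # True - the with statement closed the file despite the error

# Roughly the same thing written out by hand, without 'with'
fh2 = open(with_demo_path)
try:
    content = fh2.read()
finally:
    fh2.close()    # Runs whether or not the read succeeded

print(fh2.closed)  # True
```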

# For loop works on file handles 

In [9]:
# If you run a for loop on a file handle open for reading 
# then it iterates through each line in the file

with open("test.txt") as fh:
    for line in fh:
        print(line, end="") # end="" stops print() from adding a newline to the end

My first file written from Python
---------------------------------
Hello, world!
Hello, again!
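
A common variation is to number the lines as you go. This sketch assumes you are happy to use the built-in enumerate() function, and it uses a small temporary file of its own rather than test.txt.

```python
import os
import tempfile

numbered_path = os.path.join(tempfile.mkdtemp(), "numbered_demo.txt")

with open(numbered_path, "w") as fh:
    fh.write("alpha\n")
    fh.write("beta\n")
    fh.write("gamma\n")

numbered = []
with open(numbered_path) as fh:
    # enumerate pairs each line with a counter, here starting from 1
    for line_number, line in enumerate(fh, start=1):
        numbered.append(str(line_number) + ": " + line.rstrip("\n"))

for entry in numbered:
    print(entry)
```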


# Challenge 2

In [1]:
zhivago = """On they went, singing " Rest Eternal, " and whenever they 
stopped, their feet, the horses, and the gusts of wind seemed to
carry on their singing. 

Passers-by made way for the procession, counted the wreaths, 
and crossed themselves. Some joined in out of curiosity and 
asked: " Who is being buried? "  " Zhivago, " they were told. 
- " Oh, I see. That's what it is. " - "It isn't him. It's his wife.
" - " Well, it comes to the same thing. May her soul rest in 
peace. It's a fine funeral. """

# Write the zhivago string to the file "zhivago.txt". 
# Now read the contents of zhivago.txt, 
# using a for loop to iterate over the lines, 
# printing each to the screen. 
# Use 'with' to cleanup all file handles.




# Binary Files

All the files we've seen so far are assumed to be text files (i.e. composed of human readable stuff). However, Python can also happily process binary files, which are just files with arbitrary bits in them. 

On the left we've opened a binary file (a jpeg) - the bits aren't organized so that they can be decoded as text. We can't assume that the bits in such a binary file are organized into bytes that represent meaningful text characters with white space, etc. On the right is the jpeg when properly decoded as an image.


<img src="https://raw.githubusercontent.com/cormacflanagan/intro_python/main/lecture_notebooks/figures/binary%20files.jpg" width=1000 height=500 />


In [2]:
# Reading and writing binary data

with open("test.txt", "rb") as f: 
    # "rb" means read as a binary file
    # Any file can be treated as a binary file
  
    with open("test2.txt", "wb") as g: 
        # "wb" means write a binary file

        while True:
            buf = f.read(10) 
            # The argument to read tells it to read a set number of bytes into "buf"
            # A Byte is 8 bits, representing a number 0 <= n < 256
            # Each Bit is 0 or 1
            
            print(buf)
            #print(type(buf))
      
            if len(buf) == 0: # We're at the end
                break
          
            g.write(buf)

b'My first f'
b'ile writte'
b'n from Pyt'
b'hon\n------'
b'----------'
b'----------'
b'-------\nHe'
b'llo, world'
b'!\nHello, a'
b'gain!\n'
b''
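
Each `buf` printed above is a bytes object, not a str. A short sketch of what that means, using a bytes literal instead of a file so it runs anywhere:

```python
data = b"Hello\n"          # A bytes literal: note the b prefix

print(type(data))          # <class 'bytes'>
print(data[0])             # 72 - indexing gives the number stored in that byte
print(list(data))          # [72, 101, 108, 108, 111, 10]

# Bytes only become text once you decode them with a character encoding
text = data.decode("ascii")
print(repr(text))          # 'Hello\n'

# Encoding goes the other way, from str back to bytes
round_trip = text.encode("ascii")
print(round_trip == data)  # True
```

For a jpeg there is no sensible text decoding, which is why we leave such files as bytes.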


Now we can see that test2.txt is just a copy of the original file:

In [7]:
with open("test2.txt") as fh:
    for line in fh:
        print(line, end="") 

My first file written from Python
---------------------------------
Hello, world!
Hello, again!


# Directories

In most systems the operating system has access to a file system in which files are kept. Getting into details
is beyond the scope of this course, because it requires us getting into operating system stuff - but let's cover the very basics.

For more see: https://docs.python.org/3/library/os.html and https://docs.python.org/3/library/os.path.html

By default, when you open a file using a plain file name (like "test.txt"), Python looks for it in the current working directory.

In [1]:
import os # This module provides all sorts of useful functions 
# for working with the operating system

os.getcwd() # getcwd() tells us what working directory we're in. 
# (If you're using Google Colaboratory, this is some directory on
# some (virtual) machine in Google Cloud Platform)

'/private/var/mobile/Library/Mobile Documents/com~apple~CloudDocs/Documents/w/teach/20-f23/intro_python/lecture_notebooks'

In [2]:
os.listdir(os.getcwd()) # listdir() tells us what files are in the working directory

['L03 More Types.ipynb',
 'L15 More Functions and Recursion.ipynb',
 'L11 Files.ipynb',
 'L09 Tuples, Lists and Dictionaries.ipynb',
 'myadder.py',
 'out.txt',
 'unbitly',
 'L07 Functions Continued before lecture.ipynb',
 'test2.txt',
 'L07 Functions Continued.ipynb',
 'fasta.fa',
 'L14 Inheritance.ipynb',
 'Mortality',
 'L18 Data Science .ipynb',
 'L06 Functions.ipynb',
 'node_modules',
 'new_directory',
 'alice_in_wonderland.txt',
 'L16 Exceptions and Unit Testing.ipynb',
 'Final exam review.ipynb',
 'While.ipynb',
 '__pycache__',
 'L02 Variables and Expressions.ipynb',
 'L08 Strings-after-lecture.ipynb',
 'temp.txt',
 'package-lock.json',
 'package.json',
 'Me.ipynb',
 'figures',
 'L10 Modules.ipynb',
 'Untitled 2.ipynb',
 'L08 Strings.ipynb',
 'Review.ipynb',
 'test.txt',
 '.ipynb_checkpoints',
 'sum.py',
 'BinPacking.ipynb',
 'zhivago.txt',
 'L01 Intro.ipynb',
 'data',
 'Untitled.ipynb',
 'L17 Search Algorithms.ipynb',
 'L13 Classes and Polymorphism.ipynb',
 'L06 Functions before-

In [16]:
# We can create a directory like so:

os.mkdir("new_directory")

os.listdir(os.getcwd() + "/new_directory")
# Now "new_directory" is a directory in our working directory

[]

In [19]:
# We can write a file in our new directory like so:

# Python uses strings to indicate file paths, generally we can indicate file
# paths with the / symbol, so new_directory/test.txt is a file "test.txt" in the 
# directory "new_directory"

with open("new_directory/test.txt", "w") as fh:
    fh.write("Hello again!")
    
os.listdir(os.getcwd() + "/new_directory") 

['test.txt']
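
Gluing paths together with "/" works on most systems, but the os.path module can build paths for you. Here is a sketch of the same steps using os.path.join, run inside a temporary directory (my choice, so it doesn't touch your working directory):

```python
import os
import tempfile

base_dir = tempfile.mkdtemp()                      # A fresh, empty scratch directory

new_dir = os.path.join(base_dir, "new_directory")  # Joins using the right separator
os.mkdir(new_dir)

file_path = os.path.join(new_dir, "test.txt")
with open(file_path, "w") as fh:
    fh.write("Hello again!")

print(os.listdir(new_dir))           # ['test.txt']
print(os.path.basename(file_path))   # test.txt
print(os.path.isdir(new_dir))        # True
print(os.path.isfile(file_path))     # True
```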

It is worth mentioning how to delete a file, because your Python programs/scripts should try not to leave lots of messy files behind:

In [18]:
os.remove("new_directory/test.txt") # This removes the file test.txt from new_directory

os.listdir(os.getcwd() + "/new_directory") # Now the directory is empty

[]

In [18]:
# Similarly you can remove empty directories
os.rmdir("new_directory") # This will error out if the directory is not empty

os.listdir(os.getcwd()) # new_directory is gone

['L03 More Types.ipynb',
 'L11 Files.ipynb',
 '.DS_Store',
 'L18 Data Science .ipynb',
 'unbitly',
 'test2.txt',
 'L06 Functions.ipynb',
 'L10 Modules.ipynb',
 'L08 Strings.ipynb',
 'L05 Loops.ipynb',
 'L04 Conditionals and Branching.ipynb',
 'L13 Classes and Polymorphism.ipynb',
 'L02 Variables and Expressions.ipynb',
 'L07 Functions Continued.ipynb',
 'L16 Exceptions and Unit Testing.ipynb',
 'L-- Syllabus.ipynb',
 'figures',
 'L15 More Functions and Recursion.ipynb',
 'L14 Inheritance.ipynb',
 'test.txt',
 '.ipynb_checkpoints',
 'L17 Search Algorithms.ipynb',
 'data',
 'L12 Classes and Objects.ipynb',
 'tmp',
 'L09 Tuples, Lists and Dictionaries.ipynb',
 'L01 Intro.ipynb']
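
To see the "errors out if the directory is not empty" behaviour without risking your own files, here is a sketch that runs in a temporary directory and catches the error (an OSError) rather than letting it stop the program:

```python
import os
import tempfile

base_dir = tempfile.mkdtemp()
messy_dir = os.path.join(base_dir, "new_directory")
os.mkdir(messy_dir)

file_path = os.path.join(messy_dir, "test.txt")
with open(file_path, "w") as fh:
    fh.write("Hello again!")

try:
    os.rmdir(messy_dir)            # Fails: the directory still contains test.txt
    removed_first_try = True
except OSError:
    removed_first_try = False
    print("Directory not empty, so rmdir refused")

os.remove(file_path)               # Delete the file first...
os.rmdir(messy_dir)                # ...and now the directory can be removed

print(os.path.exists(messy_dir))   # False
```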

# Challenge 3

In [1]:
# Iterate through the files in your current working directory and print their names





# Getting Data from the Internet

In [2]:
import urllib.request

url="https://raw.githubusercontent.com/DataBiosphere/toil/master/src/toil/test/wdl/test.csv"
local_copy = "local.txt"

urllib.request.urlretrieve(url, local_copy) 
# This function copies the thing the url points at into a local file copy

with open(local_copy) as fh: # Print the file
    for line in fh:
        print(line, end="")

1,2,3
4,5,6
7,8,9
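
Once the data is in a local file it is just text, so the string methods we already know can pick it apart. The sketch below assumes the file contains exactly the three lines shown above; to avoid needing a network connection it writes those lines to a temporary file itself instead of downloading them.

```python
import os
import tempfile

local_copy = os.path.join(tempfile.mkdtemp(), "local.txt")

# Stand-in for the download step: write the same three lines locally
with open(local_copy, "w") as fh:
    fh.write("1,2,3\n4,5,6\n7,8,9\n")

rows = []
with open(local_copy) as fh:
    for line in fh:
        fields = line.strip().split(",")   # "1,2,3\n" -> ['1', '2', '3']
        row = []
        for field in fields:
            row.append(int(field))         # Convert each piece of text to an int
        rows.append(row)

print(rows)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```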


# Challenge 4: File Processing Example

Here's a fairly complete worked example of doing file writing and parsing using the simplest bioinformatics format, the fasta file.

A fasta file is a file that stores nucleotide (such as DNA or RNA) and amino-acid (protein) sequences.

This is a code comprehension exercise, i.e. to complete the task you need to read the code and understand what it is doing. 

In [27]:
"""
Example creating read and write methods for fasta files.

Fasta format is a dead simple text (ascii) file:

>HEADER_1
SEQUENCE_LINE_1
SEQUENCE_LINE_2
...
SEQUENCE_LINE_N
>HEADER_2
SEQUENCE_LINE_1
SEQUENCE_LINE_2
...
SEQUENCE_LINE_N

Where HEADER_ lines give a string descriptor of the sequence
and the SEQUENCE_LINE_ lines are concatenated together to form the actual 
amino-acid or nucleotide sequence.
"""

def writeFasta(fileHandle, header, sequence, sequenceLineWidth=100):
    """ Function writes a fasta header/sequence combination to the given file handle.
    """
    fileHandle.write('>' + header + "\n") # Write the header line

    # Now write the sequence
    for i in range(0,len(sequence),sequenceLineWidth): 
        # Step through sequence sequenceLineWidth characters at a time

        # Write the next sequenceLineWidth chars of the sequence
        fileHandle.write(sequence[i:i+sequenceLineWidth] + "\n") 
    
def readFasta(fileHandle):
    """ A generator that returns the header/sequence pairs from a fasta file.
    """
    while True:
        # Read the header line, skipping any lines not beginning with a '>'
        while True:  # Keep looping until we break
            l = fileHandle.readline()

            if len(l) == 0:  # If we have reached the end of the file, terminate
                return None  # Returning from a generator ends it, so the
                # caller's loop over the header/sequence pairs stops here

            if l[0] == '>':  # Is a valid header line
                header = l[1:-1]  # Get the  header between the '>' and the '\n'
                break

        # Now read the sequence
        substrings = []
        while True:
            i = fileHandle.tell()  
            # tell() gives the index that the file handle is at in the file

            l = fileHandle.readline()

            if len(l) == 0:  # We have reached the end of the file
                break

            if l[0] == '>':  # We have encountered the start of another sequence
                fileHandle.seek(i)  
                # Roll back the file handle to the point before we called readline()
                # this is like "undoing" the readline() call
                break

            substrings.append(l[:-1])  # Add the line, minus the newline to substrings

        yield header, "".join(substrings)

  
# A few test sequences
dna_sequences = {
    "a_dummy_dna_string":"GATTACA",
    "gi 556503834 ref NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome":"AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTTGACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTTGCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGC",
"another_dummy_dna_string":"CGGCGCGCGTCTTTGCAGCGATGTCACGCGCCCGTATTTCCGTGGTGCTGATTACGCAATCATCTTCCGAATACAGCATCAGTTTCTGCGTTCCACAAAGCGACTGTGTGCGAGCTGAACGGGCAATGCAGGAAGAGTTC"
}

# Challenge: 
# (1) Write the sequences in "dna_sequences" into a file "fasta.fa" using the writeFasta function, 
# for each pair in the dictionary using the key as the header and the value as the sequence

pass

# (2) Make an empty dictionary called dna_sequences_copy
dna_sequences_copy = {}

# (3) Read the sequences in "fasta.fa" using the readFasta function, 
# putting the header:sequence pairs into dna_sequences_copy

pass

# We can check it all worked
assert dna_sequences == dna_sequences_copy
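
If the tell()/seek() step in readFasta is hard to follow, it may help to see it in isolation. This is a small sketch of "peeking" at the next line and then undoing the read. The tiny two-record file and its path are made up for the example, and it is not a solution to the challenge.

```python
import os
import tempfile

peek_path = os.path.join(tempfile.mkdtemp(), "peek_demo.fa")

with open(peek_path, "w") as fh:
    fh.write(">header_1\nACGT\n>header_2\nTTTT\n")

with open(peek_path) as fh:
    first = fh.readline()     # '>header_1\n'
    second = fh.readline()    # 'ACGT\n'

    position = fh.tell()      # Remember where we are before reading further
    peeked = fh.readline()    # Look at the next line...
    fh.seek(position)         # ...then jump back, "undoing" that readline()

    again = fh.readline()     # Reading now gives the same line a second time

print(repr(peeked))     # '>header_2\n'
print(peeked == again)  # True
```

Note this uses readline() rather than a for loop over the handle, as readFasta does. Python does not allow tell() while a for loop is iterating over a text file handle.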


# Homework

* ZyBook Reading 11
* Open book Chapter 13 on files: http://openbookproject.net/thinkcs/python/english3e/files.html


