# Lecture 11 - Files (https://bit.ly/intro_python_11)

Today:
* Files.
  * Writing Files
  * Reading files
  * With keyword
  * For loops on file handles
  * Binary files 
  * Large worked example (Fasta file processing)
  * Directories
  * Web file retrieval



# Motivation

* Up to now we've mainly used input() to get input to the programs we've written.

* In the real world, this is not the way we generally get data into a program. Rather we read and write data to and from files.

* Files, such as documents, spreadsheets or images, are just semantically organized arrays of bits. 

* They can exist on a disk, be pulled from the internet or be created transiently as we pipe inputs and outputs from our programs. 

* They can be small, and they can be very large. Typically the data associated with a scientific analysis will be large relative to the size of the program. For this reason we have to think about processing the data in stages using streaming. 

Anyway, let's dive in... (as an aside, the examples today will show larger code samples than we've seen previously - you're leveling up)

# Writing Files

Let's first look at creating a file. You'll often want to do this when writing out the results of some analysis.

In [3]:
my_file = open("test.txt", "w") # Call to open() opens a file
# test.txt in the working directory of the Python program.
# The "w" string argument (short for "write"), tells open()
# that you intend to write to the file. 

# If test.txt does not exist before this point it is created as an empty file.

# If test.txt existed before this point its contents are now overwritten - 
# so be careful opening files for writing

# my_file (the return value from open) is a "file handle", that is you use it
# to handle the communication with the file



# The write method of the file_handle object is a lot like print(), except
# the contents of the string arguments are written to the file, instead of the
# screen, and write does not add newlines to the end of the string (so you add them
# explicitly, as in this example)
my_file.write("My first file written from Python\n") 
my_file.write("---------------------------------\n")
my_file.write("Hello, world!\n")

my_file.close() # This closes the file. You can't keep using my_file after this
# as the file connection is closed. It is important to close the file when 
# you're done to ensure the file writing is completed successfully (although Python
# will clean things up automatically on exit, you shouldn't rely on this in cases
# where your program gets terminated prematurely)

Note: (Do matched demo from terminal to show contents of file)
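To make the point about newlines concrete, here is a minimal sketch. The file name newline_demo.txt is a placeholder chosen for this illustration (it is not used elsewhere in the lecture), and the file is deleted at the end with os.remove(), which is covered in the Directories section.

```python
import os

demo_path = "newline_demo.txt"  # hypothetical file name used only for this sketch

fh = open(demo_path, "w")
fh.write("no newline here")                      # write() adds nothing to the end...
fh.write(" so this continues the same line\n")   # ...so this lands on the same line
count = fh.write("second line\n")  # write() also returns the number of characters written
fh.close()

fh = open(demo_path, "r")
text = fh.read()
fh.close()

print(count)       # 12 (11 characters plus the newline)
print(text, end="")

os.remove(demo_path)  # Tidy up the demo file
```

If you forget the "\n", consecutive calls to write() simply run together on one line.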


**To append to the end of an existing file:** 

In [4]:
my_file = open("test.txt", "a") # The 'a' argument (short for append) opens an 
# existing file to append to it 
# If test.txt does not exist then 'a' creates the file, just like 'w'

my_file.write("Hello, again!\n") 

my_file.close() # Again, always close the file

Note: (Continue matched demo from terminal to show contents of appended to file)
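The difference between "w" and "a" can be checked directly. The following is a small sketch under the assumption that a scratch file called modes_demo.txt (a name made up for this example) can be created in the working directory; it is removed again at the end.

```python
import os

demo_path = "modes_demo.txt"  # hypothetical file name used only for this sketch

fh = open(demo_path, "w")   # "w" creates the file (or empties an existing one)
fh.write("first\n")
fh.close()

fh = open(demo_path, "a")   # "a" keeps the existing contents and adds to the end
fh.write("second\n")
fh.close()

fh = open(demo_path, "r")
after_append = fh.read()
fh.close()

fh = open(demo_path, "w")   # Opening with "w" again throws the old contents away
fh.write("third\n")
fh.close()

fh = open(demo_path, "r")
after_overwrite = fh.read()
fh.close()

print(repr(after_append))     # 'first\nsecond\n'
print(repr(after_overwrite))  # 'third\n'

os.remove(demo_path)  # Tidy up the demo file
```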

# Reading Files

In [6]:
mynewhandle = open("test.txt", "r") # This opens the file, "r" argument means "read"

# The following loop walks through
# the file line by line

while True:                            # Keep reading forever
  
    theline = mynewhandle.readline()   # Try to read next line using readline()
    
    if theline == "":                  # When done, readline returns the empty string '',
                                       # allowing us to detect the end of the file
        break                          # and leave the loop

    # Now process the line we've just read
    print(theline, end="")

mynewhandle.close() # This might seem less important, as you're just reading 
# from the file, however most underlying operating systems have limits on
# the number of active file handles, so it is good to clean up. It is also 
# generally a bad idea to have multiple file handles open on one file (unless they are
# all read only file handles)

My first file written from Python
---------------------------------
Hello, world!
Hello, again!


What is nice about the above loop is that regardless of the size of the file, Python only stores one line in memory at a time, allowing it to process arbitrarily large files.
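As an illustration of this streaming style, the sketch below finds the longest line in a file while only ever holding the current line and the longest line seen so far. The file name stream_demo.txt and its contents are invented for the example.

```python
import os

demo_path = "stream_demo.txt"  # hypothetical file name used only for this sketch

# Make a small file to read back
fh = open(demo_path, "w")
fh.write("short\n")
fh.write("a much longer line\n")
fh.write("mid length\n")
fh.close()

longest = ""                 # The only line we keep between iterations
fh = open(demo_path, "r")
while True:
    theline = fh.readline()
    if theline == "":        # End of file
        break
    if len(theline) > len(longest):
        longest = theline    # Remember this line, forget the previous best
fh.close()

print(longest, end="")       # a much longer line

os.remove(demo_path)         # Tidy up the demo file
```

The same loop would work unchanged on a file far too large to fit in memory, because nothing accumulates as we go.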

**Python will complain if you try to open a file that doesn't exist...**

In [16]:
mynewhandle = open("wharrah.txt", "r")

FileNotFoundError: [Errno 2] No such file or directory: 'wharrah.txt'
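If you would rather not have the program stop, there are two common ways to cope, sketched below. This assumes wharrah.txt really does not exist in your working directory. The try/except syntax is only previewed here (exceptions get their own lecture later in the course), and os.path.exists() comes from the os module introduced in the Directories section.

```python
import os

missing_path = "wharrah.txt"  # assumed not to exist, as in the cell above

# Option 1: check before opening
print(os.path.exists(missing_path))

# Option 2: try to open, and deal with the error if it happens
try:
    fh = open(missing_path, "r")
except FileNotFoundError as err:
    message = str(err)
    print("Could not open the file:", message)
else:
    # Only runs if open() succeeded
    fh.close()
    message = None
```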

**To read contents of file into memory in one go use read()**

In [17]:
fh = open("test.txt", "r") # I often use fh to denote a file handle, for some reason

content = fh.read() # Read the contents of the file

print(type(content)) # Yup, content is a string

print(content)

fh.close()

<class 'str'>
My first file written from Python
---------------------------------
Hello, world!
Hello, again!



**readlines will read all the lines in a file:**

In [18]:
fh = open("test.txt") # I often use fh to denote a file handle, for some reason
# If you don't give the second argument it defaults to read only

content = fh.readlines() # Read the contents of the file

print(type(content)) # Yup, content is now a list of strings

print(content)

fh.close()

<class 'list'>
['My first file written from Python\n', '---------------------------------\n', 'Hello, world!\n', 'Hello, again!\n']
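Notice that each string returned by readlines() still ends with its newline character. A minimal sketch of removing them with the string method rstrip() follows; the file name readlines_demo.txt is a placeholder for this example.

```python
import os

demo_path = "readlines_demo.txt"  # hypothetical file name used only for this sketch

fh = open(demo_path, "w")
fh.write("alpha\nbeta\n")
fh.close()

fh = open(demo_path)
raw_lines = fh.readlines()   # Each element keeps its trailing "\n"
fh.close()

clean_lines = []
for line in raw_lines:
    clean_lines.append(line.rstrip("\n"))  # Drop the newline from the right-hand end

print(raw_lines)    # ['alpha\n', 'beta\n']
print(clean_lines)  # ['alpha', 'beta']

os.remove(demo_path)  # Tidy up the demo file
```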


# Challenge 1

In [8]:
# Complete the following code.

l = list(range(100))

## Write the numbers in l to a file "out.txt"

## Read the numbers from out.txt into a new list, "l2"

## check that l == l2


The original list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
The new list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]


# The with Keyword

Good programming is often about anticipating errors and dealing with them.

The 'with' keyword is both a useful shorthand and good defensive programming. We don't need to know the details (for the curious: https://docs.python.org/2.5/whatsnew/pep-343.html), but basically it cleans up the file handle for you.

In [19]:
with open("test.txt") as fh: # With removes the need for the "fh.close()" statement
  # and is better because fh.close() is guaranteed to be run even if there is an error
  # in processing the file
  
  assert False # This causes an error, but fh will still be closed
  
  content = fh.readlines() 
  print(content)
  
  
# For files the syntax is:
# with OPEN_EXPRESSION as FILE_HANDLE:
#   STATEMENTS THAT USE FILE_HANDLE

AssertionError: 
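We can check the claim that the file gets closed by looking at the handle's closed attribute. This is a sketch only: the file name with_demo.txt is made up, and the try/except around the second block is there just so the deliberate error doesn't stop the cell (exceptions are covered in a later lecture).

```python
import os

demo_path = "with_demo.txt"  # hypothetical file name used only for this sketch

with open(demo_path, "w") as fh:
    fh.write("some text\n")

closed_after_block = fh.closed  # True: the with block closed the handle for us

try:
    with open(demo_path) as fh:
        assert False  # Deliberately cause an error inside the with block
except AssertionError:
    # The error was raised inside the with block, yet the handle is closed
    closed_after_error = fh.closed

print(closed_after_block, closed_after_error)  # True True

os.remove(demo_path)  # Tidy up the demo file
```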

# For loop on file:

In [20]:
# If you run a for loop on a file handle open for reading then
# it defaults to reading lines, this is compact and memory efficient (because
# we only read a line at a time from the file)

with open("test.txt", "r") as fh:
  for line in fh:
    print(line, end="") # end="" stops print() from adding a newline to the end

My first file written from Python
---------------------------------
Hello, world!
Hello, again!
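One thing to be aware of: a file handle remembers how far through the file it has got, so a second for loop over the same handle finds nothing left to read. A small sketch, using an invented file name loop_demo.txt:

```python
import os

demo_path = "loop_demo.txt"  # hypothetical file name used only for this sketch

with open(demo_path, "w") as fh:
    fh.write("one\ntwo\n")

with open(demo_path, "r") as fh:
    first_pass = []
    for line in fh:          # Reads every line, leaving the handle at the end of the file
        first_pass.append(line)

    second_pass = []
    for line in fh:          # Nothing left to read, so the loop body never runs
        second_pass.append(line)

print(first_pass)   # ['one\n', 'two\n']
print(second_pass)  # []

os.remove(demo_path)  # Tidy up the demo file
```

To read the file a second time, the simplest approach is to open it again.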


# Challenge 2

In [16]:
zhivago = """On they went, singing " Rest Eternal, " and whenever they 
stopped, their feet, the horses, and the gusts of wind seemed to
carry on their singing. 

Passers-by made way for the procession, counted the wreaths, 
and crossed themselves. Some joined in out of curiosity and 
asked: " Who is being buried? " — " Zhivago, " they were told. 
— " Oh, I see. That's what it is. " — "It isn't him. It's his wife.

" — " Well, it comes to the same thing. May her soul rest in 
peace. It ' s a fine funeral. """

# Write the zhivago string to the file "zhivago.txt". Use 'with' to clean up the file handle.

# Now read the contents of zhivago.txt, using a for loop to iterate over the lines, printing each to the screen. 
# When reading the file again use with.



# Binary Files

All the files we've seen so far are assumed to be text files (i.e. composed of human readable stuff). However, Python can also happily process binary files, which are just files with arbitrary bits in them. 

On the left we've opened a binary file (a jpeg) - the bits aren't organized so that they can be decoded as text. We can't assume that the bits in such a binary file are organized into bytes that represent meaningful text characters with white space, etc. On the right is the jpeg when properly decoded as an image.


<img src="https://raw.githubusercontent.com/benedictpaten/intro_python/main/lecture_notebooks/figures/binary%20files.jpg" width=1000 height=500 />


In [21]:
# Reading and writing binary data

with open("test.txt", "rb") as f: # Here we treat test.txt as a binary file - any file
  # can be considered just a collection of bits
  
  with open("test2.txt", "wb") as g: # Note the nested with statements

    while True:
      buf = f.read(1024) # The argument to read tells it to read up to that number of
      # bytes (each byte is 8 bits, i.e. 8 0's or 1's) into the "buf" object,
      # which is a bytes object (not a str) because the file was opened in binary mode

      #print(buf)
      
      if len(buf) == 0: # We're at the end
         break
          
      g.write(buf)

Now we can see that test2.txt is just a copy of the original file:

In [22]:
with open("test2.txt") as fh:
  for line in fh:
    print(line, end="") # end="" stops print() from adding a newline to the end

My first file written from Python
---------------------------------
Hello, world!
Hello, again!
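To see what a binary read actually hands back, here is a minimal sketch. The file name bytes_demo.bin is a placeholder, and the ASCII decoding at the end is only valid because we chose to write plain ASCII text into the file.

```python
import os

demo_path = "bytes_demo.bin"  # hypothetical file name used only for this sketch

with open(demo_path, "wb") as g:
    g.write(b"Hi!\n")        # The b prefix makes a bytes literal rather than a str

with open(demo_path, "rb") as f:
    buf = f.read()

print(type(buf))             # <class 'bytes'>
print(buf)                   # b'Hi!\n'
print(list(buf))             # [72, 105, 33, 10] - each byte is a number from 0 to 255

text = buf.decode("ascii")   # Turn the bytes back into a str, assuming they are ASCII text
print(type(text))            # <class 'str'>

os.remove(demo_path)         # Tidy up the demo file
```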


# Directories

In most systems the operating system has access to a file system in which files are kept. Getting into details
is beyond the scope of this course, because it requires us to get into operating system stuff - but let's cover the very basics.

For more see: https://docs.python.org/3/library/os.html and https://docs.python.org/3/library/os.path.html

By default, when you open a file for reading or writing, Python looks for it in the current working directory.

In [33]:
import os # This module provides all sorts of useful functions 
# for working with the operating system

os.getcwd() # getcwd() tells us what working directory we're in. 
# (If you're using Google Colaboratory, this is some directory on
# some (virtual) machine in Google Cloud Platform)

'/Users/benedictpaten/PycharmProjects/intro_python/lecture_notebooks'

In [34]:
os.listdir(os.getcwd()) # listdir() tells us what files are in the working directory

['L03 More Types.ipynb',
 'L11 Files.ipynb',
 '.DS_Store',
 'L18 Data Science .ipynb',
 'test2.txt',
 'L06 Functions.ipynb',
 'L10 Modules.ipynb',
 'L08 Strings.ipynb',
 'L05 Loops.ipynb',
 'L04 Conditionals and Branching.ipynb',
 'L13 Classes and Polymorphism.ipynb',
 'L02 Variables and Expressions.ipynb',
 'L07 Functions Continued.ipynb',
 'L16 Exceptions and Unit Testing.ipynb',
 'L-- Syllabus.ipynb',
 'figures',
 'L15 More Functions and Recursion.ipynb',
 'L14 Inheritance.ipynb',
 'test.txt',
 '.ipynb_checkpoints',
 'L17 Search Algorithms.ipynb',
 'data',
 'L12 Classes and Objects.ipynb',
 'cancer_data.csv',
 'L09 Tuples, Lists and Dictionaries.ipynb',
 'L01 Intro.ipynb',
 'local.txt']

In [35]:
# We can create a directory like so:

os.mkdir("new_directory")  # This creates a new directory in the current working directory

os.listdir(os.getcwd() + "/new_directory") # Now "new_directory" is a directory in our working directory

[]

In [36]:
# We can write a file in our new directory like so:

# Python uses strings to indicate file paths, generally we can indicate file
# paths with the / symbol, so new_directory/test.txt is a file "test.txt" in the 
# directory "new_directory"

with open("new_directory/test.txt", "w") as fh:
  fh.write("Hello again!")
    
os.listdir(os.getcwd() + "/new_directory") # Now "new_directory" contains the file test.txt

['test.txt']
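Gluing paths together with "/" works, but the os.path module (linked above) also provides os.path.join(), which inserts the right separator for whatever operating system the program is running on. A sketch, assuming a directory called join_demo_dir (a name made up here) does not already exist:

```python
import os

dir_name = "join_demo_dir"   # hypothetical directory name used only for this sketch
os.mkdir(dir_name)           # This would error out if the directory already existed

file_path = os.path.join(dir_name, "test.txt")  # Builds the path to test.txt inside dir_name
print(file_path)

with open(file_path, "w") as fh:
    fh.write("Hello again!")

exists = os.path.exists(file_path)  # True now that the file has been written
contents = os.listdir(dir_name)
print(contents)              # ['test.txt']

# Tidy up: remove the file, then the (now empty) directory
os.remove(file_path)
os.rmdir(dir_name)
```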

It is worth mentioning how to delete a file, because your Python programs/scripts should try not to make lots of messy files:

In [27]:
os.remove("new_directory/test.txt") # This removes the file test.txt from new_directory

os.listdir(os.getcwd() + "/new_directory") # Now the directory is empty

[]

In [28]:
# Similarly you can remove empty directories
os.rmdir("new_directory") # This will error out if the directory is not empty

os.listdir(os.getcwd()) # new_directory is gone

['L03 More Types.ipynb',
 'L11 Files.ipynb',
 '.DS_Store',
 'L18 Data Science .ipynb',
 'test2.txt',
 'L06 Functions.ipynb',
 'L10 Modules.ipynb',
 'L08 Strings.ipynb',
 'L05 Loops.ipynb',
 'L04 Conditionals and Branching.ipynb',
 'L13 Classes and Polymorphism.ipynb',
 'L02 Variables and Expressions.ipynb',
 'L07 Functions Continued.ipynb',
 'L16 Exceptions and Unit Testing.ipynb',
 'L-- Syllabus.ipynb',
 'figures',
 'L15 More Functions and Recursion.ipynb',
 'L14 Inheritance.ipynb',
 'test.txt',
 '.ipynb_checkpoints',
 'L17 Search Algorithms.ipynb',
 'data',
 'L12 Classes and Objects.ipynb',
 'cancer_data.csv',
 'L09 Tuples, Lists and Dictionaries.ipynb',
 'L01 Intro.ipynb']
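The comment above says rmdir() errors out on a non-empty directory. Here is a small sketch of that behaviour, using an invented directory name rmdir_demo_dir; the try/except is just there to catch the error so the cell can carry on (exceptions are covered in a later lecture).

```python
import os

dir_name = "rmdir_demo_dir"  # hypothetical directory name used only for this sketch
os.mkdir(dir_name)

file_path = dir_name + "/note.txt"
with open(file_path, "w") as fh:
    fh.write("something\n")

try:
    os.rmdir(dir_name)           # Fails: the directory still contains note.txt
    removed_first_try = True
except OSError:
    removed_first_try = False

print(removed_first_try)         # False

os.remove(file_path)             # Empty the directory first...
os.rmdir(dir_name)               # ...and now rmdir() succeeds

still_there = os.path.exists(dir_name)
print(still_there)               # False
```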

# Challenge 3

In [11]:
# Iterate through the files in your current working directory and print their names


Current working directory /Users/benedictpaten/PycharmProjects/intro_python/lecture_notebooks
File L03 More Types.ipynb
File out.txt
File revision_session.ipynb
File L11 Files.ipynb
File .DS_Store
File L18 Data Science .ipynb
File test2.txt
File L06 Functions.ipynb
File L10 Modules.ipynb
File new_directory
File L08 Strings.ipynb
File L05 Loops.ipynb
File L04 Conditionals and Branching.ipynb
File alice_in_wonderland.txt
File L13 Classes and Polymorphism.ipynb
File L02 Variables and Expressions.ipynb
File L07 Functions Continued.ipynb
File L16 Exceptions and Unit Testing.ipynb
File out.csv
File temp.txt
File L-- Syllabus.ipynb
File figures
File L15 More Functions and Recursion.ipynb
File L14 Inheritance.ipynb
File test.txt
File .ipynb_checkpoints
File zhivago.txt
File L17 Search Algorithms.ipynb
File Revision Session Fall 2022.ipynb
File data
File L12 Classes and Objects.ipynb
File cancer_data.csv
File L09 Tuples, Lists and Dictionaries.ipynb
File L01 Intro.ipynb
File local.txt


# Getting Data from the Internet

In [1]:
import urllib.request

url = "https://raw.githubusercontent.com/DataBiosphere/toil/master/src/toil/test/wdl/test.csv"
local_copy = "local.txt"

urllib.request.urlretrieve(url, local_copy) # This function copies the thing the url points at into
# a local file copy

with open(local_copy) as fh: # Print the file
  for line in fh:
    print(line, end="")

1,2,3
4,5,6
7,8,9
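If you don't need to keep a local copy, urllib.request.urlopen() gives you something that behaves much like a file handle opened in binary mode, so you can loop over it line by line. The sketch below is written to run without an internet connection: as an assumption for the demo, it points urlopen() at a file:// URL for a small local file (url_demo.csv, a made-up name), built with the pathlib module. With a real https:// URL the loop would look the same.

```python
import os
import pathlib
import urllib.request

demo_path = "url_demo.csv"  # hypothetical file name used only for this sketch
with open(demo_path, "w") as fh:
    fh.write("1,2,3\n4,5,6\n")

# Turn the local path into a file:// URL so the example works offline
url = pathlib.Path(demo_path).resolve().as_uri()

rows = []
with urllib.request.urlopen(url) as response:
    for raw_line in response:             # Lines arrive as bytes, not str
        line = raw_line.decode("utf-8")   # Assumes the data is UTF-8 text
        rows.append(line.strip().split(","))

print(rows)  # [['1', '2', '3'], ['4', '5', '6']]

os.remove(demo_path)  # Tidy up the demo file
```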


# Challenge 4: File Processing Example

Here's a fairly complete worked example of doing file writing and parsing using the simplest bioinformatics format, the fasta file.

A fasta file is a file that stores nucleotide (DNA or RNA) and amino-acid (protein) sequences.

This is a code comprehension exercise, i.e. to complete the task you need to read the code and understand what it is doing. 

In [14]:
"""
Example creating read and write methods for fasta files.

Fasta format is a dead simple text (ascii) file:

>HEADER_1
SEQUENCE_LINE_1
SEQUENCE_LINE_2
...
SEQUENCE_LINE_N
>HEADER_2
SEQUENCE_LINE_1
SEQUENCE_LINE_2
...
SEQUENCE_LINE_N

Where HEADER_ lines give a string descriptor of the sequence
and the SEQUENCE_LINE_ lines are concatenated together to form the actual 
amino-acid or nucleotide sequence.
"""

def writeFasta(fileHandle, header, sequence, sequenceLineWidth=100):
  """ Function writes a fasta file to the given file handle.
  """
  fileHandle.write('>' + header + "\n") # Write the header line
  
  # Now write the sequence
  for i in range(0,len(sequence),sequenceLineWidth): # Step through sequence sequenceLineWidth characters
    # at a time
    
    # Write the next sequenceLineWidth chars of the sequence
    fileHandle.write(sequence[i:i+sequenceLineWidth] + "\n") 
    
def readFasta(fileHandle):
    """ Read sequences from a fasta file, yielding each (header, sequence) pair in turn.
    """
    while True:
        # Read the header line, skipping any lines not beginning with a '>'
        while True:  # Keep looping until we break
            l = fileHandle.readline()

            if len(l) == 0:  # If we have reached the end of the file, terminate
                return None  # Returning ends the function; None is Python's standard
                # "no value" object, used here to indicate termination

            if l[0] == '>':  # Is a valid header line
                header = l[1:-1]  # Get the  header between the '>' and the '\n'
                break

        # Now read the sequence
        substrings = []
        while True:
            i = fileHandle.tell()  # tell() gives the index that the file handle is at in the file

            l = fileHandle.readline()

            if len(l) == 0:  # We have reached the end of the file
                break

            if l[0] == '>':  # We have encountered the start of another sequence
                fileHandle.seek(i)  # Roll back the file handle to the point before we called readline()
                # this is like "undoing" the readline() call
                break

            substrings.append(l[:-1])  # Add the line, minus the newline to substrings

        yield header, "".join(substrings)

  
# A few test sequences
dna_sequences = {
    "a_dummy_dna_string":"GATTACA",
    "gi 556503834 ref NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome":"AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTTGACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTTGCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGC",
    "another_dummy_dna_string":"CGGCGCGCGTCTTTGCAGCGATGTCACGCGCCCGTATTTCCGTGGTGCTGATTACGCAATCATCTTCCGAATACAGCATCAGTTTCTGCGTTCCACAAAGCGACTGTGTGCGAGCTGAACGGGCAATGCAGGAAGAGTTC"
}

# Challenge: 
# (1) Write the sequences in "dna_sequences" into a file "fasta.fa" using the writeFasta function, 
# for each pair in the dictionary using the key as the header and the value as the sequence

# (2) Make an empty dictionary called dna_sequences_copy

# (3) Read the sequences in "fasta.fa" using the readFasta function, putting the header:sequence pairs
# into dna_sequences_copy


# We can check it all worked
assert dna_sequences == dna_sequences_copy
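The one unfamiliar trick in readFasta is the tell()/seek() pair used to "undo" a readline(). Here is that idea on its own, as a minimal sketch on an invented two-line file (seek_demo.txt is a placeholder name):

```python
import os

demo_path = "seek_demo.txt"  # hypothetical file name used only for this sketch

with open(demo_path, "w") as fh:
    fh.write("line one\nline two\n")

with open(demo_path) as fh:
    first = fh.readline()          # Reads "line one\n"
    position = fh.tell()           # Remember where we are: the start of line two
    second = fh.readline()         # Reads "line two\n"
    fh.seek(position)              # Jump back to the remembered position
    second_again = fh.readline()   # Reads "line two\n" a second time

print(second == second_again)      # True

os.remove(demo_path)               # Tidy up the demo file
```

Note that this relies on reading with readline(): Python disables tell() while a for loop is iterating over a text file handle.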


# Reading

* Open book Chapter 13 on files: http://openbookproject.net/thinkcs/python/english3e/files.html

# Homework

* Go to Canvas and complete the lecture quiz, which involves completing each challenge problem
* ZyBook Reading 11



# Practice Problems

In [None]:
# Problem 1: Basic File Writing
# Write a function that takes a string and writes it to a file named "output.txt".
# The function should add a newline character after the string.

def write_string_to_file(content):
    pass # Code to write

# Test
test_string = "Hello, World!"
write_string_to_file(test_string)
with open("output.txt", "r") as f:
    assert f.read().strip() == test_string

In [None]:
# Problem 2: Line Counter
# Write a function, count_lines, that counts the number of lines in a text file.
# Empty lines should be included in the count.

# Code to write

# Test
with open("test.txt", "w") as f:
    f.write("Line 1\nLine 2\n\nLine 4")
assert count_lines("test.txt") == 4

In [None]:
# Problem 3: Word Counter
# Write a function, count_words, that reads a text file and returns a dictionary where:
# - keys are words (converted to lowercase)
# - values are the number of times each word appears
# Words are separated by whitespace

# Code to write

# Test
with open("words.txt", "w") as f:
    f.write("The cat and the dog and THE CAT")
expected = {"the": 3, "cat": 2, "and": 2, "dog": 1}
assert count_words("words.txt") == expected

In [None]:
# Problem 4: File Appender with Line Numbers
# Write a function, append_numbered_lines, that appends numbered lines to a file.
# If the file exists, read it first to find the number of lines and continue numbering from the last line number.
# The function should create the file if it doesn't exist.
# Return the total number of lines after appending.

# Write your code here


# Test
test_lines = ["First line", "Second line"]
total_lines = append_numbered_lines("numbered.txt", test_lines)
assert total_lines == 2
more_lines = ["Third line"]
total_lines = append_numbered_lines("numbered.txt", more_lines)
assert total_lines == 3
with open("numbered.txt", "r") as f:
    assert f.read() == "1 First line\n2 Second line\n3 Third line\n"

In [None]:
# Problem 5: Binary File Copier
# Write a function, copy_binary_file, that copies a binary file in chunks and returns the total bytes copied.
# Use a chunk size of 1024 bytes.

# Code to write

# Test
# Create a test binary file
with open("source.bin", "wb") as f:
    f.write(b"Binary content" * 100)  # Create some binary content

bytes_copied = copy_binary_file("source.bin", "destination.bin")
assert bytes_copied > 0  # Should return positive number of bytes copied

# Verify files are identical
with open("source.bin", "rb") as f1, open("destination.bin", "rb") as f2:
    assert f1.read() == f2.read()  # Content should match