# Lecture 11 - Files 

Today:
* Files.
  * Writing Files
  * Reading files
  * With keyword
  * For loops on file handles
  * Binary files 
  * Directories
  * Web file retrieval



# Motivation

* Up to now we've mainly used input() to get input to the programs we've written.

* In the real world, this is not the way we generally get data into a program. Rather we read and write data to and from files.

* Files (e.g. documents, spreadsheets or images) are just semantically organized sequences of bits.

* They can exist on a disk, be pulled from the internet or be created transiently as we pipe inputs and outputs from our programs. 

* They can be small, and they can be very large. 

Anyway, let's dive in... (today will use larger code samples than we've seen previously - you're leveling up)

# Writing Files

Let's first look at creating a file. You'll often want to do this when writing out the results of some analysis.

In [2]:
fh = open("test.txt", "w") 
# Opens a file test.txt in the working directory of the Python program.
# The "w" (short for "write"), says you intend to write to the file. 

# If test.txt does not exist before this point it is created as an empty file.

# If test.txt existed before this point its contents are now overwritten - 
# so be careful opening files for writing

# fh (the return value from open) is a "file handle", 
# you use it to handle the communication with the file

# The write method of the file_handle object is a lot like print(), 
# except (1) the string arguments are written to the file, instead of the screen, 
# and (2) write does not add newlines to the end of the string 
# (so you add them explicitly, as in this example)

fh.write("My first file written from Python\n") 
fh.write("---------------------------------\n")
fh.write("Hello, world!\n")

fh.close() 
# This closes the file. 
# You can't keep using fh after this as the file connection is closed. 
# It is important to close the file when you're done 

Note: (Do matched demo from terminal to show contents of file)
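
To make the overwrite warning above concrete, here is a minimal sketch. The file name `demo.txt` and the scratch directory from `tempfile` are assumptions made for illustration (they are not part of the lecture files), chosen so that nothing real gets overwritten.

```python
import os
import tempfile

# Assumption: we work in a throwaway directory that is deleted at the end,
# so this sketch cannot clobber any real file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    fh = open(path, "w")
    n_written = fh.write("first version\n")  # write() returns the number of characters written
    fh.close()

    fh = open(path, "w")  # opening with "w" again empties the file
    fh.write("second version\n")
    fh.close()

    fh = open(path, "r")
    contents = fh.read()
    fh.close()

print(n_written)         # 14
print(contents, end="")  # second version
```

Only the second string survives, because the second `open(..., "w")` discarded whatever the file held before.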


**To append to the end of an existing file:** 

In [4]:
fh = open("test.txt", "a") 
# "a" (short for append) opens an existing file to append to it 
# If test.txt does not exist then 'a' creates the file, just like 'w'

fh.write("Hello, again!\n") 

fh.close() # Again, always close the file

import os
print(os.getcwd())

/private/var/mobile/Library/Mobile Documents/com~apple~CloudDocs/Documents/w/teach/20-f24/intro_python/lecture_notebooks


Note: (Continue matched demo from terminal to show contents of appended to file)
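
As a rough sketch of how "a" behaves, the loop below opens the same file three times in append mode. The file name `log.txt` and the scratch directory are made up for this example.

```python
import os
import tempfile

# Assumption: a scratch directory keeps this example away from test.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "log.txt")  # hypothetical file name

    for message in ["one", "two", "three"]:
        fh = open(path, "a")      # first pass creates the file, later passes add to the end
        fh.write(message + "\n")
        fh.close()

    fh = open(path)
    lines = fh.readlines()
    fh.close()

print(lines)  # ['one\n', 'two\n', 'three\n']
```

Had the loop used "w" instead of "a", only the last message would remain in the file.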

# Reading Files

In [6]:
fh = open("test.txt", "r") 
# This opens the file, "r" argument means "read"

print(fh.readline(), end="") # readline returns a line with the \n at the end
print(fh.readline(), end="")
print(fh.readline(), end="")
print(fh.readline(), end="")

fh.close() # Always close the file when you're done


My first file written from Python
---------------------------------
Hello, world!
Hello, again!


In [3]:
fh = open("test.txt", "r") 
# This opens the file, "r" argument means "read"

# The following loop walks through the file line by line
while True:                # Keep reading forever
    line = fh.readline()   # read next line, returns a string
    if line == "":         # At end-of-file, readline returns the empty string ''
        break              # leave the loop
    # Print the line we've just read
    print(line, end="")

fh.close() 

My first file written from Python
---------------------------------
Hello, world!


What is nice about the above loop is that regardless of the size of the file, Python only stores one line in memory at a time, allowing it to process arbitrarily large files.
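
Here is a small sketch of the same readline loop doing some work on each line instead of printing it. The file `numbers.txt` is created just for this example (an assumption, not a lecture file); the point is that only a running count and total are kept, never the whole file.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:      # scratch directory (assumption)
    path = os.path.join(tmp_dir, "numbers.txt")     # hypothetical file name

    # Create a file with one number per line
    fh = open(path, "w")
    for i in range(1000):
        fh.write(str(i) + "\n")
    fh.close()

    # Walk through it one line at a time, keeping only two numbers in memory
    line_count = 0
    total = 0
    fh = open(path)
    while True:
        line = fh.readline()
        if line == "":        # end of file
            break
        line_count += 1
        total += int(line)    # int() ignores the trailing newline
    fh.close()

print(line_count, total)  # 1000 499500
```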

**Python will complain if you try to open a file that doesn't exist...**

In [4]:
fh = open("wharrah.txt", "r")

FileNotFoundError: [Errno 2] No such file or directory: 'wharrah.txt'
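
One possible way to cope with a missing file is sketched below. It uses `os.path.exists` to check first, and `try`/`except` to catch the error; both are standard Python, but using them here is an addition to the lecture. The path points into an empty scratch directory so we can be sure the file is missing.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:           # empty scratch directory (assumption)
    missing_path = os.path.join(tmp_dir, "wharrah.txt")  # nothing has been created here

    print(os.path.exists(missing_path))  # False

    try:
        fh = open(missing_path, "r")
    except FileNotFoundError as err:
        caught = True
        print("Could not open the file:", err.filename)
    else:
        # Only runs if open() succeeded
        caught = False
        fh.close()
```

Checking with `os.path.exists` is the simpler habit; `try`/`except` is the more robust one, since a file can disappear between the check and the `open` call.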

**To read contents of file into memory in one go use read()**

In [3]:
fh = open("test.txt", "r") 

content = fh.read() # Read the contents of the file

print(type(content)) # Yup, content is a string
print(len(content))
print(content.split('\n'))
print(content)

fh.close()

<class 'str'>
110
['My first file written from Python', '---------------------------------', 'Hello, world!', 'Hello, again!', 'Hello, again!', '']
My first file written from Python
---------------------------------
Hello, world!
Hello, again!
Hello, again!



**readlines will read all the lines in a file:**

In [6]:
fh = open("test.txt") 
# If you don't give the second argument it defaults to read only

content = fh.readlines() # Read the contents of the file

print(type(content)) # Yup, content is now a list of strings

print(content)

fh.close()

<class 'list'>
['My first file written from Python\n', '---------------------------------\n', 'Hello, world!\n']
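
Note that readlines keeps the "\n" at the end of each line, while `split('\n')` on the result of read() leaves an extra empty string at the end. The sketch below shows two ways to get clean lines; the file name `words.txt` is made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:   # scratch directory (assumption)
    path = os.path.join(tmp_dir, "words.txt")    # hypothetical file name

    fh = open(path, "w")
    fh.write("alpha\nbeta\ngamma\n")
    fh.close()

    fh = open(path)
    raw = fh.readlines()                 # each string still ends in "\n"
    fh.close()

    stripped = [line.rstrip("\n") for line in raw]   # drop the trailing newline

    fh = open(path)
    split_lines = fh.read().splitlines() # splitlines() drops the newlines for us
    fh.close()

print(raw)          # ['alpha\n', 'beta\n', 'gamma\n']
print(stripped)     # ['alpha', 'beta', 'gamma']
print(split_lines)  # ['alpha', 'beta', 'gamma']
```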


# Challenge 1

In [6]:
l = list(range(100))

## Write the numbers in l to a file "out.txt"

fh = open("out.txt","w")
for i in l:
    fh.write( str(i)+"\n" )
fh.close()

## Read the numbers from out.txt into a new list, "l2"
fh2 = open("out.txt")
l2 = []
while True:
    line = fh2.readline()
    if line == "":
        break
    l2.append( int(line) )  # int() ignores the trailing "\n"
fh2.close()
print(l2)

## check that l == l2
assert l == l2

# Remember: fh.write(...), fh.readline(), fh.readlines(), or fh.read()

# The with Keyword

Good programming is often about anticipating errors and dealing with them.

The 'with' keyword is both a useful shorthand and good defensive programming. It cleans up the file handle for you.

(see https://docs.python.org/2.5/whatsnew/pep-343.html for the details...)

In [8]:
with open("test.txt") as fh: 
    # With removes the need for the "fh.close()" statement
    # and is better because fh.close() is guaranteed to be run even if there is an error
    # in processing the file
  
    assert False # This causes an error, but fh will still be closed
    content = fh.readlines() 
    print(content)
  
  
# For files the syntax is:
# with OPEN_EXPRESSION as FILE_HANDLE:
#   STATEMENTS THAT USE FILE_HANDLE

AssertionError: 
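
To check the claim that the file is closed either way, the sketch below looks at the `closed` attribute of the file handle after a normal `with` block and after one that raises an error. The file name and the `ValueError` are made up for the example, and the `try`/`except` is only there so the sketch keeps running after the error.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:   # scratch directory (assumption)
    path = os.path.join(tmp_dir, "demo.txt")     # hypothetical file name

    with open(path, "w") as fh:
        fh.write("some text\n")
    closed_after_normal = fh.closed              # the with block has ended, so fh is closed

    try:
        with open(path) as fh2:
            raise ValueError("something went wrong while processing")
    except ValueError:
        pass                                     # swallow the error so we can inspect fh2
    closed_after_error = fh2.closed

print(closed_after_normal)  # True
print(closed_after_error)   # True
```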

# For loop works on file handles 

In [9]:
# If you run a for loop on a file handle open for reading 
# then it iterates through each line in the file

with open("test.txt") as fh:
    for line in fh:
        print(line, end="") # end="" stops print() from adding a newline to the end

My first file written from Python
---------------------------------
Hello, world!
Hello, again!
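
Because a file handle can be used anywhere a for loop expects a sequence, it also works with helpers such as `enumerate`. Here is a minimal sketch that numbers the lines of a file; `colors.txt` is a made-up file created only for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:   # scratch directory (assumption)
    path = os.path.join(tmp_dir, "colors.txt")   # hypothetical file name

    with open(path, "w") as fh:
        fh.write("red\ngreen\nblue\n")

    numbered = []
    with open(path) as fh:
        # enumerate pairs each line with a counter, starting from 1
        for number, line in enumerate(fh, start=1):
            numbered.append(f"{number}: {line.rstrip()}")

print(numbered)  # ['1: red', '2: green', '3: blue']
```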


# Challenge 2

In [2]:
zhivago = """On they went, singing " Rest Eternal, " and whenever they
stopped, their feet, the horses, and the gusts of wind seemed to
carry on their singing. 

Passers-by made way for the procession, counted the wreaths, 
and crossed themselves. Some joined in out of curiosity and 
asked: " Who is being buried? "  " Zhivago, " they were told. 
- " Oh, I see. That's what it is. " - "It isn't him. It's his wife.
" - " Well, it comes to the same thing. May her soul rest in 
peace. It's a fine funeral. """

# Write the zhivago string to the file "zhivago.txt". 
# Now read the contents of zhivago.txt, 
# using a for loop to iterate over the lines, 
# printing each to the screen. 
# Use 'with' to cleanup all file handles.

with open("zhivago.txt","w") as fh:
    fh.write(zhivago)
    
with open("zhivago.txt") as fh:
    for line in fh:
        print(line, end="")


On they went, singing " Rest Eternal, " and whenever they
stopped, their feet, the horses, and the gusts of wind seemed to
carry on their singing. 

Passers-by made way for the procession, counted the wreaths, 
and crossed themselves. Some joined in out of curiosity and 
asked: " Who is being buried? "  " Zhivago, " they were told. 
- " Oh, I see. That's what it is. " - "It isn't him. It's his wife.
" - " Well, it comes to the same thing. May her soul rest in 
peace. It's a fine funeral. 

# Binary Files

All the files we've seen so far are assumed to be text files (i.e. composed of human readable stuff). However, Python can also happily process binary files, which are just files with arbitrary bits in them. 

On the left we've opened a binary file (a jpeg) - the bits aren't organized so that they can be decoded as text. We can't assume that the bits in such a binary file are organized into bytes that represent meaningful text characters with white space, etc. On the right is the jpeg when properly decoded as an image.


<img src="https://raw.githubusercontent.com/cormacflanagan/intro_python/main/lecture_notebooks/figures/binary%20files.jpg" width=1000 height=500 />


In [2]:
# Reading and writing binary data

with open("test.txt", "rb") as f: 
    # "rb" means read as a binary file
    # Any file can be treated as a binary file
  
    with open("test2.txt", "wb") as g: 
        # "wb" means write a binary file

        while True:
            buf = f.read(10) 
            # The argument to read tells it to read a set number of bytes into "buf"
            # A Byte is 8 bits, representing a number 0 <= n < 256
            # Each Bit is 0 or 1
            
            print(buf)
            #print(type(buf))
      
            if len(buf) == 0: # We're at the end
                break
          
            g.write(buf)

b'My first f'
b'ile writte'
b'n from Pyt'
b'hon\n------'
b'----------'
b'----------'
b'-------\nHe'
b'llo, world'
b'!\nHello, a'
b'gain!\n'
b''


Now we can see that test2.txt is just a copy of the original file:

In [7]:
with open("test2.txt") as fh:
    for line in fh:
        print(line, end="") 

My first file written from Python
---------------------------------
Hello, world!
Hello, again!
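
To see what "a byte is a number 0 <= n < 256" means in practice, the sketch below writes four bytes to a file and reads them back. The file name `tiny.bin` is an assumption for illustration; the four numbers happen to be the ASCII codes for "Hi!" and a newline.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:   # scratch directory (assumption)
    path = os.path.join(tmp_dir, "tiny.bin")     # hypothetical file name

    with open(path, "wb") as g:
        g.write(bytes([72, 105, 33, 10]))        # four bytes, given as numbers

    with open(path, "rb") as f:
        buf = f.read()                           # no argument: read all the bytes

print(buf)                  # b'Hi!\n'
print(buf[0])               # 72  (indexing a bytes object gives a number)
print(list(buf))            # [72, 105, 33, 10]
print(buf.decode("ascii"))  # Hi! followed by a newline
```

Decoding only works here because these particular bytes are valid text; the bytes of a jpeg generally are not.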


# Directories

In most systems the operating system has access to a file system in which files are kept. Getting into details
is beyond the scope of this course, because it requires us getting into operating system stuff - but let's cover the very basics.

For more see: https://docs.python.org/3/library/os.html and https://docs.python.org/3/library/os.path.html

By default, when you open a file using just its name, Python looks for it (or creates it) in the current working directory.

In [2]:
import os # This module provides all sorts of useful functions 
# for working with the operating system

os.getcwd() # getcwd() tells us what working directory we're in. 
# (If you're using Google Colaboratory, this is some directory on
# some (virtual) machine in Google Cloud Platform)

'/private/var/mobile/Library/Mobile Documents/com~apple~CloudDocs/Documents/w/teach/20-f24/intro_python/lecture_notebooks'

In [4]:
os.listdir()  # listdir() tells us what files are in the working directory

['L03 More Types.ipynb',
 'L15 More Functions and Recursion.ipynb',
 'L11 Files.ipynb',
 'L09 Tuples, Lists and Dictionaries.ipynb',
 'myadder.py',
 'out.txt',
 'unbitly',
 'test2.txt',
 'L07 Functions Continued.ipynb',
 'fasta.fa',
 'L14 Inheritance.ipynb',
 'Mortality',
 'L18 Data Science .ipynb',
 'L06 Functions.ipynb',
 'node_modules',
 'alice_in_wonderland.txt',
 'L16 Exceptions and Unit Testing.ipynb',
 '__pycache__',
 'L02 Variables and Expressions.ipynb',
 'L05 Loops.ipynb',
 'temp.txt',
 'package-lock.json',
 'package.json',
 'before-lecture',
 'figures',
 'L10 Modules.ipynb',
 'L08 Strings.ipynb',
 'test.txt',
 '.ipynb_checkpoints',
 'sum.py',
 'Lzz Review.ipynb',
 'zhivago.txt',
 'L01 Intro.ipynb',
 'L04 Conditionals and Branching.ipynb',
 'data',
 'L17 Search Algorithms.ipynb',
 'L13 Classes and Polymorphism.ipynb',
 'cancer_data.csv',
 'L12 Classes and Objects.ipynb',
 'local.txt']

In [4]:
os.listdir(os.getcwd()) # Same result: listdir() with no argument lists the working directory

['L03 More Types.ipynb',
 'L15 More Functions and Recursion.ipynb',
 'L11 Files.ipynb',
 'L09 Tuples, Lists and Dictionaries.ipynb',
 'myadder.py',
 'out.txt',
 'unbitly',
 'test2.txt',
 'L07 Functions Continued.ipynb',
 'fasta.fa',
 'L14 Inheritance.ipynb',
 'Mortality',
 'L18 Data Science .ipynb',
 'L06 Functions.ipynb',
 'node_modules',
 'alice_in_wonderland.txt',
 'L16 Exceptions and Unit Testing.ipynb',
 '__pycache__',
 'L02 Variables and Expressions.ipynb',
 'L05 Loops.ipynb',
 'temp.txt',
 'package-lock.json',
 'package.json',
 'before-lecture',
 'figures',
 'L10 Modules.ipynb',
 'L08 Strings.ipynb',
 'test.txt',
 '.ipynb_checkpoints',
 'sum.py',
 'Lzz Review.ipynb',
 'zhivago.txt',
 'L01 Intro.ipynb',
 'L04 Conditionals and Branching.ipynb',
 'data',
 'L17 Search Algorithms.ipynb',
 'L13 Classes and Polymorphism.ipynb',
 'cancer_data.csv',
 'L12 Classes and Objects.ipynb',
 'local.txt']

In [1]:
# We can create a directory like so:
os.mkdir("new_directory")

os.listdir("new_directory")
# Now "new_directory" is a directory in our working directory

[]

In [9]:
# We can write a file in our new directory like so:

# Python uses strings to indicate file paths, generally we can indicate file
# paths with the / symbol, so new_directory/test.txt is a file "test.txt" in the 
# directory "new_directory"

with open("new_directory/test.txt", "w") as fh:
    fh.write("Hello again!")
    
os.listdir("new_directory") 

['test.txt']
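
Writing "/" by hand works in most cases, but the `os.path` module (linked above) can build and take apart paths for you using the right separator for the current operating system. A minimal sketch, which only manipulates strings and does not touch any files:

```python
import os

# Build a path from its parts using the separator for this operating system
path = os.path.join("new_directory", "test.txt")
print(path)  # new_directory/test.txt on macOS and Linux

# Take the path apart again
print(os.path.dirname(path))   # new_directory
print(os.path.basename(path))  # test.txt
```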

It is worth mentioning how to delete a file, because your Python programs/scripts should try not to leave lots of messy files behind:

In [10]:
os.remove("new_directory/test.txt") # This removes the file test.txt from new_directory

os.listdir("new_directory") # Now the directory is empty

[]

In [11]:
# Similarly you can remove empty directories
os.rmdir("new_directory") # This will error out if the directory is not empty

os.listdir(os.getcwd()) # new_directory is gone

['L03 More Types.ipynb',
 'L15 More Functions and Recursion.ipynb',
 'L11 Files.ipynb',
 'L09 Tuples, Lists and Dictionaries.ipynb',
 'myadder.py',
 'out.txt',
 'unbitly',
 'test2.txt',
 'L07 Functions Continued.ipynb',
 'fasta.fa',
 'L14 Inheritance.ipynb',
 'Mortality',
 'L18 Data Science .ipynb',
 'L06 Functions.ipynb',
 'node_modules',
 'alice_in_wonderland.txt',
 'L16 Exceptions and Unit Testing.ipynb',
 '__pycache__',
 'L02 Variables and Expressions.ipynb',
 'L05 Loops.ipynb',
 'temp.txt',
 'package-lock.json',
 'package.json',
 'before-lecture',
 'figures',
 'L10 Modules.ipynb',
 'L08 Strings.ipynb',
 'test.txt',
 '.ipynb_checkpoints',
 'sum.py',
 'Lzz Review.ipynb',
 'zhivago.txt',
 'L01 Intro.ipynb',
 'L04 Conditionals and Branching.ipynb',
 'data',
 'L17 Search Algorithms.ipynb',
 'L13 Classes and Polymorphism.ipynb',
 'cancer_data.csv',
 'L12 Classes and Objects.ipynb',
 'local.txt']

# Challenge 3

In [5]:
# Iterate through the files in your current working directory and print their names
import os
for filename in os.listdir():
    print(filename)
    with open(filename) as fh: print(fh.readline()[:40])

L03 More Types.ipynb
{

L15 More Functions and Recursion.ipynb
{"cells":[{"metadata":{"id":"f28Xd6z9-lp
L11 Files.ipynb
{

L09 Tuples, Lists and Dictionaries.ipynb
{

myadder.py
def add(c):

out.txt

unbitly
foreach f (*.ipynb)

test2.txt
My first file written from Python

L07 Functions Continued.ipynb
{

fasta.fa
>a_dummy_dna_string

L14 Inheritance.ipynb
{"cells":[{"metadata":{"id":"5UyFJYXOx-W
Mortality


IsADirectoryError: [Errno 21] Is a directory: 'Mortality'
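
The error above happens because os.listdir() returns directories as well as files, and open() cannot read a directory. One possible fix, sketched here, is to test each name with `os.path.isfile` before opening it. The directory contents are made up so the example is self-contained; note that this sketch still assumes every file is a text file, so a binary file could raise a different error.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:   # scratch directory (assumption)
    # Set up one sub-directory and one text file (both names are hypothetical)
    os.mkdir(os.path.join(tmp_dir, "a_directory"))
    with open(os.path.join(tmp_dir, "a_file.txt"), "w") as fh:
        fh.write("first line\nsecond line\n")

    first_lines = {}
    skipped = []
    for name in sorted(os.listdir(tmp_dir)):
        full_path = os.path.join(tmp_dir, name)
        if not os.path.isfile(full_path):        # directories fail this test
            skipped.append(name)
            continue
        with open(full_path) as fh:
            first_lines[name] = fh.readline()[:40]

print(skipped)      # ['a_directory']
print(first_lines)  # {'a_file.txt': 'first line\n'}
```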

# Getting Data from the Internet

In [4]:
import urllib.request

url="https://raw.githubusercontent.com/DataBiosphere/toil/master/src/toil/test/wdl/test.csv"
local_copy = "local.txt"

urllib.request.urlretrieve(url, local_copy) 
# This function copies the thing the url points at into a local file copy

with open(local_copy) as fh: # Print the file
    print(fh.readlines())
    
r = []
with open(local_copy) as fh: # Parse the file
    for line in fh: # line is '1,2,3\n' etc
        print(f"line is {line}")
        line    = line.strip()               # line    is '1,2,3'
        print(f"line is {line}")
        strings = line.split(',')            # strings is [ '1', '2', '3' ]
        print(f"strings is {strings}")
        ints    = [int(s) for s in strings]  # ints    is [1,2,3]
        print(f"ints is {ints}")
        r.append(ints)
 
r
#[ [ 1,2,3], [4,5,6], [7,8,9]  ]


['1,2,3\n', '4,5,6\n', '7,8,9\n']
line is 1,2,3

line is 1,2,3
strings is ['1', '2', '3']
ints is [1, 2, 3]
line is 4,5,6

line is 4,5,6
strings is ['4', '5', '6']
ints is [4, 5, 6]
line is 7,8,9

line is 7,8,9
strings is ['7', '8', '9']
ints is [7, 8, 9]


[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Challenge 4: File Processing Example

Read the data in local.txt

and then *parse* it into a list of lists of ints.

You should get: [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

Note: string.strip() removes whitespace (e.g. newlines) from both ends of a string

In [9]:
import urllib.request
url="https://raw.githubusercontent.com/DataBiosphere/toil/master/src/toil/test/wdl/test.csv"
local_copy = "local.txt"
urllib.request.urlretrieve(url, local_copy) 

lst = []
with open(local_copy) as fh:
    for line in fh:
        print(line)
        print(line.strip())
        print(line.split(","))
        ...
        lst.append(...)
print(lst)

1,2,3

1,2,3
['1', '2', '3\n']
4,5,6

4,5,6
['4', '5', '6\n']
7,8,9

7,8,9
['7', '8', '9\n']
[Ellipsis, Ellipsis, Ellipsis]


In [3]:
# Pretty complex terse code
# HARD CHALLENGE: Can you understand what's going on here?
with open(local_copy) as fh:
    lst = [  [int(s) for s in line.strip().split(",")] for line in fh ]
lst

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
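
One way to read the terse version is to unroll it into ordinary nested loops: the outer comprehension becomes the loop over lines, and the inner one becomes the loop over the strings in each line. The sketch below does this on a small stand-in file with the same three lines as local.txt (created locally here as an assumption, so no download is needed) and checks that both versions agree.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:   # scratch directory (assumption)
    path = os.path.join(tmp_dir, "local.txt")    # stand-in for the downloaded file
    with open(path, "w") as fh:
        fh.write("1,2,3\n4,5,6\n7,8,9\n")

    # Unrolled version: one explicit loop per "for" in the comprehension
    lst = []
    with open(path) as fh:
        for line in fh:                          # outer "for line in fh"
            row = []
            for s in line.strip().split(","):    # inner "for s in ..."
                row.append(int(s))
            lst.append(row)

    # Terse version, as in the cell above
    with open(path) as fh:
        terse = [[int(s) for s in line.strip().split(",")] for line in fh]

print(lst)           # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(lst == terse)  # True
```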

# Homework

* ZyBook Reading 11
* Open book Chapter 13 on files: http://openbookproject.net/thinkcs/python/english3e/files.html


