## Chapter 14 - Files
An outline notebook for class, to accompany https://greenteapress.com/thinkpython2/html/thinkpython2015.html  
  
So far, we have designed algorithms that produce answers and output, but we have not yet dealt with how to store that output in files or databases.  We will now learn how to do this in Python.  


14.1 Persistence  

Programs we have written so far have not retained their output after the Jupyter notebook stops running.  That is, the output data aren't persistent.  

Operating systems (OS) are one example of programs that do produce persistent data.  In fact, your Jupyter notebook output persists even after you power down your computer thanks to the operating system.  

But what if we don't want to leave it to the OS?  
One simple way is to read and write text files.  (An alternative is a database file; we will see this later in the chapter.)

14.2 Reading/writing
A text file is a sequence of characters (each stored as a chunk of 0s and 1s), kept on a permanent medium like a hard drive (or punch card if you're old school...or a stone tablet if you're -really- old school).

We have done a little with opening/reading a file in chapter 9.  

To write a file, you have to open it with mode 'w' as a second parameter:


In [1]:
fout = open('output.txt','w')
#Careful - if the file already exists, opening
#it in write mode clears out the existing content.
#if it doesn't already exist, it creates
# a new file.

line1 = "Hello world...again.\n"
fout.write(line1)
#the return value is the number of characters
#that were written

21

In [2]:
#The file object keeps track of where it had
#been writing, so if you call write again,
#it adds to the end of the file 
line2 = 'and again, and again...\n'
fout.write(line2)

24

In [3]:
#when you're done writing, you 
#should close the file - if you don't do it
#manually, it will become closed when your 
#program ends

fout.close()
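
A common alternative to calling close yourself is the with statement, which closes the file automatically when the block ends.  The sketch below is not from the text: it assumes a hypothetical file name, demo_output.txt, placed in a temporary directory so nothing in your current directory gets overwritten.  It also shows append mode 'a', which adds to an existing file instead of clearing it.

```python
import os
import tempfile

# hypothetical file name, placed in a temporary directory
# so nothing in the current directory is overwritten
path = os.path.join(tempfile.mkdtemp(), 'demo_output.txt')

# 'w' creates (or clears) the file; the with block
# closes it automatically when the block ends
with open(path, 'w') as fout:
    fout.write('Hello world...again.\n')

# 'a' (append) keeps the existing content and adds to the end
with open(path, 'a') as fout:
    fout.write('and again, and again...\n')

# read the file back to check what was stored
with open(path) as fin:
    lines = fin.readlines()

print(lines)
```

Because the file is closed as soon as each with block finishes, the data is safely on disk before the next block opens the file again.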

14.3 Format Operator
The argument of write has to be a string.  If we want to write other data types, we have to convert them to strings first; usually casting with str() will do.

In [4]:
fout = open('output2.txt','w')
x = 42
fout.write(str(x))

2

In [5]:
#Another, more Pythonic way to do this is
#with the format operator, %.
##EH NOTE: The % operator (which is also the
##modulus operator when used on integers) was
##the main approach when this text was written,
##but now {} is preferred.  I will show both ways.
#The first operand is the format string,
#which contains one or more format sequences
#that specify how the second operand is
#formatted as a string.
#e.g., %d means the second operand should be
#formatted as a decimal integer
dolphins = 42
'%d' % dolphins
#the result is the string 42

'42'

In [6]:
#The format sequence can appear anywhere in 
#the string.  So you can embed a value in a 
#sentence.
'I have seen %d dolphins' % dolphins

'I have seen 42 dolphins'

In [7]:
#If there's more than one format sequence in 
#the string, the second arg has to be a tuple
"In %d years I've seen %g %s." % (3, 0.1, 'dolphins')

"In 3 years I've seen 0.1 dolphins."

For more information on the format operator, see https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting. A more powerful alternative is the string format method, which you can read about at https://docs.python.org/3/library/stdtypes.html#str.format.

In [8]:
#The new way, with {}s
txt1 = "My name is {fname}, I'm {age}".format(fname = "John", age = 36)
txt2 = "My name is {0}, I'm {1}".format("John",36)
txt3 = "My name is {}, I'm {}".format("John",36)
print(txt1)

My name is John, I'm 36


In [9]:
#with this way, we can also use formatting commands
print("It costs {price:.2f} dollars.".format(price = 7.2))
print("I saw {dolphins:d} dolphins.".format(dolphins = 25))

It costs 7.20 dollars.
I saw 25 dolphins.
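
A third option, not covered in the text, is the f-string (available in Python 3.6 and later), which puts the variable names directly inside the {}s.  This is a minimal sketch that reuses the dolphins example; the variable price is just an illustrative name.

```python
# f-strings need Python 3.6 or later
dolphins = 42
price = 7.2

# the expression inside {} is evaluated and inserted
sentence = f'I have seen {dolphins} dolphins'

# the same formatting commands work after a colon
cost = f'It costs {price:.2f} dollars.'
count = f'I saw {dolphins:d} dolphins.'

print(sentence)
print(cost)
print(count)
```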


# Filenames and Paths  
Files are organized into directories aka folders.  
Every running program has a "current directory," which is the default directory for most operations.  For example, when you open a file for reading, Python looks for it in the current directory.  
The os (operating system) module provides functions for working with files and directories.  

In [10]:
#import the os module
import os
#get the current working directory
cwd = os.getcwd()
cwd

'/Users/erichansen/Documents/cs2020-2021/thPyCh14'

A simple filename on its own is technically a path too, but it's a relative path because it is interpreted relative to the current directory.  
A path that begins with / doesn't depend on the current directory; it's called an absolute path.  To find the absolute path to a file, you can use os.path.abspath:

In [11]:
os.path.abspath('output.txt')

'/Users/erichansen/Documents/cs2020-2021/thPyCh14/output.txt'
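
To make the relative/absolute distinction concrete, here is a small sketch using a few other os.path functions (isabs, split).  Note that abspath only builds the path from the current directory; it doesn't check that the file actually exists, so this runs even without an output.txt.

```python
import os

relative = 'output.txt'              # relative to the current directory
absolute = os.path.abspath(relative)

# isabs reports whether a path is absolute
print(os.path.isabs(relative))
print(os.path.isabs(absolute))

# split separates the directory part from the filename part
directory, filename = os.path.split(absolute)
print(directory)
print(filename)
```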

In [12]:
#os.path also provides other functions:
#checks whether a path exists (a bare filename
#is looked up in the CWD)
print(os.path.exists('output.txt'))
print(os.path.exists('no_such_file.txt'))

#checks whether it's a directory:
print(os.path.isdir('output.txt'))
print(os.path.isdir(cwd))

#checks whether it's a file:
print(os.path.isfile('output.txt'))


True
False
False
True
True


In [13]:
#return a list of the files and other 
#directories in the given directory:
os.listdir(cwd)

['examples.db',
 'output2.txt',
 'wc.py',
 '__pycache__',
 'README.md',
 '.ipynb_checkpoints',
 '.git',
 'output.txt',
 'thPyCh14.ipynb']

In [14]:
#a neat little recursive function that 
#'walks' through a directory, prints all the
#files, then recursively walks through
#all subdirectories:
def walk(dirname):
    for name in os.listdir(dirname):
        path = os.path.join(dirname,name)
        
        if os.path.isfile(path):
            print(path)
        else:
            walk(path)
walk(cwd)
#careful, because git creates some hidden
#folders, this may produce some long output.

/Users/erichansen/Documents/cs2020-2021/thPyCh14/examples.db
/Users/erichansen/Documents/cs2020-2021/thPyCh14/output2.txt
/Users/erichansen/Documents/cs2020-2021/thPyCh14/wc.py
/Users/erichansen/Documents/cs2020-2021/thPyCh14/__pycache__/wc.cpython-38.pyc
/Users/erichansen/Documents/cs2020-2021/thPyCh14/README.md
/Users/erichansen/Documents/cs2020-2021/thPyCh14/.ipynb_checkpoints/thPyCh14-checkpoint.ipynb
/Users/erichansen/Documents/cs2020-2021/thPyCh14/.git/config
/Users/erichansen/Documents/cs2020-2021/thPyCh14/.git/objects/57/4c9bdcc84c8430b3b95c9b1dc38f0b40b1b23e
/Users/erichansen/Documents/cs2020-2021/thPyCh14/.git/objects/3d/176908357431033366d8f0715bb3df306b1da4
/Users/erichansen/Documents/cs2020-2021/thPyCh14/.git/objects/fc/7ca77208de801eb2a1ff1189fab028ed2bb67c
/Users/erichansen/Documents/cs2020-2021/thPyCh14/.git/HEAD
/Users/erichansen/Documents/cs2020-2021/thPyCh14/.git/info/exclude
/Users/erichansen/Documents/cs2020-2021/thPyCh14/.git/logs/HEAD
/Users/erichansen/Documents/
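
The os module also has a built-in function, os.walk, that does the same kind of traversal without explicit recursion.  The sketch below is one way to use it; the function name walk_files and the small temporary directory tree are made up for illustration.

```python
import os
import tempfile

def walk_files(dirname):
    """Return the paths of all files under dirname, including subdirectories."""
    paths = []
    # os.walk yields one (directory, subdirectories, files) triple
    # for dirname and for every directory below it
    for root, dirs, files in os.walk(dirname):
        for name in files:
            paths.append(os.path.join(root, name))
    return paths

# build a small hypothetical directory tree to walk
top = tempfile.mkdtemp()
os.mkdir(os.path.join(top, 'sub'))
for relpath in ['a.txt', os.path.join('sub', 'b.txt')]:
    with open(os.path.join(top, relpath), 'w') as fout:
        fout.write('demo\n')

for path in sorted(walk_files(top)):
    print(path)
```

Returning a list instead of printing makes the function easier to reuse, for example when filtering paths by suffix.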

# Exceptions  
Lots of things can go wrong when you try to read/write files.  If you try to open a file that isn't there, you get a FileNotFoundError:

In [15]:
fin = open('no such file')

FileNotFoundError: [Errno 2] No such file or directory: 'no such file'

In [16]:
#if you don't have permission to access a 
#file, you get a PermissionError. 

#if you try to open a directory for reading,
#you get IsADirectoryError

#to avoid these, you could use functions
#like os.path.exists and os.path.isfile
#incessantly...but that would take lots of time
#and code.

#So, Python takes a different approach,
#just Try it, and then deal with problems if
#they happen.
#it's much like an if..else block.

try:
    #this is all the things you want to try
    fin = open('no such file')
except:
    #this is what you should do if the
    #try code produces errors
    #this is called 'catching' an exception
    print('something went wrong')

something went wrong


This is a very Pythonic philosophy to coding.  It's referred to as "Easier to ask forgiveness than permission" or EAFP (or AFNP for 'ask forgiveness not permission').

The alternative would be "Look before you leap" (LBYL): testing each possibility up front with os.path.exists and os.path.isfile, as described above.  Some programming languages prefer this philosophy!
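
A bare except catches every kind of error, including ones you didn't anticipate.  If you want to respond differently to each problem, you can name the exception types.  This is a sketch, not from the text; read_text is a hypothetical helper, and the missing file lives in a temporary directory so we know it really isn't there.

```python
import os
import tempfile

def read_text(filename):
    """Return the contents of filename, or None if it can't be read."""
    try:
        with open(filename) as fin:
            return fin.read()
    except FileNotFoundError:
        print('no such file:', filename)
    except IsADirectoryError:
        print('that is a directory:', filename)
    except PermissionError:
        print('permission denied:', filename)
    return None

# a path that is guaranteed not to exist
missing = os.path.join(tempfile.mkdtemp(), 'no_such_file.txt')
result = read_text(missing)
print(result)
```

Any exception that isn't listed (a typo in a variable name, say) still stops the program, which is usually what you want while debugging.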

# Databases  
A database is a file that is organized for storing data.  Many databases are organized like a dictionary, in the sense that they map from keys to values.

The biggest difference is that a database is typically stored on permanent storage.

The module dbm provides an interface for creating/updating database files.

In [17]:
import dbm
db = dbm.open('examples','c')
#'examples' is the name of the database,
#'c' means it should be created if it doesn't
#already exist.

#When you create a new item, dbm updates the
#database file.
db['sunrise.png'] = 'Photo of an sunrise'
#When you access an item, dbm reads it.
db['sunrise.png']
#note the result is a "bytes" object which is
#why it begins with b. This is a lot like a string.

b'Photo of an sunrise'

In [18]:
#.values and .items don't work for db objects,
#but iteration does:
for key in db.keys():
    print(key, db[key])

b'sunrise.png' b'Photo of an sunrise'


In [19]:
#it's good practice to close a database when done
db.close()
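
To see that the data really persists, we can close the database and open it again.  The sketch below assumes a hypothetical database named captions in a temporary directory.  It uses a with block (supported by dbm objects in Python 3.4 and later) so the database is closed automatically, and it decodes the bytes results back into ordinary strings.

```python
import dbm
import os
import tempfile

# hypothetical database name inside a temporary directory
name = os.path.join(tempfile.mkdtemp(), 'captions')

# 'c' creates the database if needed; the with block closes it
with dbm.open(name, 'c') as db:
    db['sunrise.png'] = 'Photo of a sunrise'
    db['sunset.png'] = 'Photo of a sunset'

# reopen read-only ('r') to show that the data persisted
with dbm.open(name, 'r') as db:
    caption = db['sunrise.png'].decode('utf-8')   # bytes -> str
    keys = sorted(key.decode('utf-8') for key in db.keys())

print(caption)
print(keys)
```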

# Pickling
dbm databases are limited because the keys and values have to be strings or bytes.  If you use any other type, it causes an error.
The pickle module can help.  It translates nearly any type of object into a bytes object suitable for storage in a database, and translates those bytes back into objects.

pickle.dumps (for "dump string") takes an object as a parameter and returns its representation as a bytes object (the name dates from Python 2, where the result was a string).

In [20]:
import pickle
t = [1,2,3] #remember, this is a list
pickle.dumps(t)
#this creates a format that isn't very
#human readable, but pickle can read it
#using pickle.loads() ('load string')

b'\x80\x04\x95\x0b\x00\x00\x00\x00\x00\x00\x00]\x94(K\x01K\x02K\x03e.'

In [21]:
t1 = [1,2,3] #our starting list
ps = pickle.dumps(t1) #turn the list into a pickle string
t2 = pickle.loads(ps)# turn the pickle string back to a list
t2 #investigate the result


[1, 2, 3]

In [22]:
#note that while they have the same value,
#they aren't the same object
print(t1 == t2)
print(t1 is t2)

True
False
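
Putting the two ideas together, pickle lets us store a list in a dbm database even though dbm itself only accepts strings and bytes.  This is a minimal sketch with a hypothetical database name, pickled, in a temporary directory.  (The standard library's shelve module wraps up this same combination for you.)

```python
import dbm
import os
import pickle
import tempfile

# hypothetical database name inside a temporary directory
name = os.path.join(tempfile.mkdtemp(), 'pickled')

with dbm.open(name, 'c') as db:
    # a list can't be stored directly, so pickle it first
    db['primes'] = pickle.dumps([2, 3, 5, 7])

with dbm.open(name, 'r') as db:
    # the stored value is bytes; loads turns it back into a list
    primes = pickle.loads(db['primes'])

print(primes)
```

One caution: only unpickle data that you trust, since loading a pickle can run arbitrary code.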


# Pipes!
Most operating systems have a command-line interface (e.g. terminal, command prompt), also known as a shell.  
We have used this fairly regularly in this course.  
Any program that you can launch from the shell can also be launched from Python using a pipe object, which represents a running program.  The pipe "connects" Python with the other program.

In Python, you can use os.popen (simple, and in Python 3 itself implemented on top of subprocess) or the subprocess module (a little more awkward, but more flexible; note that it is a module of its own, not part of os).


In [23]:
cmd = 'ls -l'
fp = os.popen(cmd) #this behaves like an open file
#you can read line by line with readline
#or the whole thing at once with read
res = fp.read()
res

'total 104\n-rw-r--r--  1 erichansen  staff     47 Mar  9 14:02 README.md\ndrwxr-xr-x  3 erichansen  staff     96 Mar 23 11:52 __pycache__\n-rw-r--r--  1 erichansen  staff  16384 Mar 23 11:53 examples.db\n-rw-r--r--  1 erichansen  staff     45 Mar 23 11:53 output.txt\n-rw-r--r--  1 erichansen  staff      0 Mar 23 11:53 output2.txt\n-rw-r--r--  1 erichansen  staff  23755 Mar 23 11:52 thPyCh14.ipynb\n-rw-r--r--  1 erichansen  staff    134 Mar 23 11:53 wc.py\n'

In [24]:
#it's also good practice to close your pipes
stat = fp.close()
print(stat)
#None means it closed normally with no errors

None


In [25]:
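#example using subprocess instead
import subprocess
command = ["ls", "-l"]
#call runs the command and returns its exit
#status (0 = success); the output is not captured
status = subprocess.call(command)
#check_output returns the command's output as bytes
output = subprocess.check_output(['ls', '-1'])
print('Have %d bytes in output' % len(output))
print(output)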

Have 78 bytes in output
b'README.md\n__pycache__\nexamples.db\noutput.txt\noutput2.txt\nthPyCh14.ipynb\nwc.py\n'
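
In Python 3.7 and later, subprocess.run can do the job of both call and check_output in one step.  The sketch below avoids ls, which isn't available on every system, and instead launches a second copy of Python as the child program; the message it prints is just an example.

```python
import subprocess
import sys

# run a tiny Python program as a child process;
# sys.executable is the path of the running interpreter
command = [sys.executable, '-c', 'print("hello from a child process")']

# capture_output collects what the child prints;
# text=True decodes it from bytes to str
result = subprocess.run(command, capture_output=True, text=True)

print(result.returncode)   # 0 means the command succeeded
print(result.stdout)
```

Passing the command as a list of separate arguments, rather than one long string, means no shell is involved in splitting it up.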


One useful application of pipes on Unix is running md5sum to compute a checksum of a file.

You can read about MD5 at http://en.wikipedia.org/wiki/Md5.


In [26]:
#note, this may not run on your machine 
#if you don't have md5sum installed; in that
#case res is empty and stat is nonzero,
#as in the output shown below
filename = 'output.txt'
cmd = 'md5sum ' + filename
fp = os.popen(cmd)
res = fp.read()
stat = fp.close()
print('res:',res)

print('stat:',stat)

res: 
stat: 32512
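
If md5sum isn't installed, the standard library's hashlib module can compute the same kind of checksum without a pipe.  This is a sketch, not the approach the text uses; md5_of_file is a hypothetical helper, and the demo file is written to a temporary directory.

```python
import hashlib
import os
import tempfile

def md5_of_file(filename):
    """Return the MD5 checksum of a file as a hex string."""
    md5 = hashlib.md5()
    with open(filename, 'rb') as fin:    # 'rb' reads raw bytes
        while True:
            chunk = fin.read(8192)       # read a piece at a time
            if not chunk:
                break
            md5.update(chunk)
    return md5.hexdigest()

# write a small demo file (as bytes, so the content
# is identical on every operating system)
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'wb') as fout:
    fout.write(b'Hello world...again.\n')

print(md5_of_file(path))
```

Reading in chunks keeps memory use small even for large files such as MP3s.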


# Writing Modules  
Any file that contains Python code (like a .py file) can be imported as a module.
For example, the wc.py file contains the code:

def linecount(filename):
    count = 0
    for line in open(filename):
        count += 1
    return count

print(linecount('wc.py'))

As in the past, we import it with import wc.  We can then run functions from it.

In [27]:
import wc
wc.linecount('wc.py')

7


7

The only problem with this example is that when you import the module it runs the test code at the bottom. Normally when you import a module, it defines new functions but it doesn’t run them.

Programs that will be imported as modules often use the following idiom:

if __name__ == '__main__':
    print(linecount('wc.py'))

__name__ is a built-in variable that is set when the program starts. If the program is running as a script, __name__ has the value '__main__'; in that case, the test code runs. Otherwise, if the module is being imported, the test code is skipped.
As an exercise, type this example into a file named wc.py and run it as a script. Then run the Python interpreter and import wc. What is the value of __name__ when the module is being imported?
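
Here is a sketch of what a whole module using this idiom might look like.  It is a variation on wc.py, not the file from the text: the test code is moved into a hypothetical function, demo, which counts the lines of a small temporary file so the example works no matter what the file is named.

```python
import os
import tempfile

def linecount(filename):
    """Return the number of lines in a text file."""
    count = 0
    with open(filename) as fin:   # closes the file when done
        for line in fin:
            count += 1
    return count

def demo():
    """Test code: count the lines of a small temporary file."""
    path = os.path.join(tempfile.mkdtemp(), 'three_lines.txt')
    with open(path, 'w') as fout:
        fout.write('one\ntwo\nthree\n')
    return linecount(path)

if __name__ == '__main__':
    # runs only when this file is executed as a script,
    # not when it is imported
    print(demo())
```

When this file is imported, linecount and demo are defined but nothing is printed; you can still call demo() yourself to run the test.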

Warning: If you import a module that has already been imported, Python does nothing. It does not re-read the file, even if it has changed.

If you want to reload a module, you can use the function importlib.reload (in Python 3, reload is no longer a built-in), but it can be tricky, so the safest thing to do is restart the interpreter and then import the module again.

Exercise 1  
Write a function called sed that takes as arguments a pattern string, a replacement string, and two filenames; it should read the first file and write the contents into the second file (creating it if necessary). If the pattern string appears anywhere in the file, it should be replaced with the replacement string.
If an error occurs while opening, reading, writing or closing files, your program should catch the exception, print an error message, and exit. Solution: http://thinkpython2.com/code/sed.py.

Exercise 2  
If you download my solution to Exercise 2 from http://thinkpython2.com/code/anagram_sets.py, you’ll see that it creates a dictionary that maps from a sorted string of letters to the list of words that can be spelled with those letters. For example, 'opst' maps to the list ['opts', 'post', 'pots', 'spot', 'stop', 'tops'].
Write a module that imports anagram_sets and provides two new functions: store_anagrams should store the anagram dictionary in a “shelf”; read_anagrams should look up a word and return a list of its anagrams. Solution: http://thinkpython2.com/code/anagram_db.py.

Exercise 3  
In a large collection of MP3 files, there may be more than one copy of the same song, stored in different directories or with different file names. The goal of this exercise is to search for duplicates.
Write a program that searches a directory and all of its subdirectories, recursively, and returns a list of complete paths for all files with a given suffix (like .mp3). Hint: os.path provides several useful functions for manipulating file and path names.
To recognize duplicates, you can use md5sum to compute a “checksum” for each file. If two files have the same checksum, they probably have the same contents.
To double-check, you can use the Unix command diff.
Solution: http://thinkpython2.com/code/find_duplicates.py.