# Chapter 14: Files

This chapter introduces the idea of 'persistent' programs that keep data in permanent storage, and shows how to use different kinds of permanent storage, such as files and databases.

## Persistence
Most programs we have seen so far are transient in the sense that they run for a short time and produce some output, but when they end, their data disappears. Running the program again means starting with a clean slate.

Other programs are **persistent**: they run for a long time (or all the time); they keep at least some of their data in permanent storage (a hard drive, for example); and if they shut down and restart, they pick up where they left off.

One of the simplest ways for programs to maintain their data is by reading and writing text files. An alternative is to store the state of the program in a database.

## Reading and writing
A text file is a sequence of characters stored on a permanent medium like a hard drive, flash memory, etc.

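With no mode argument, *open* opens a file for reading:

In [2]:
file = open("exercises/words.txt")

To write to a file, you have to open it with write mode, 'w', as the second argument. Opening a file in write mode clears out the existing data and starts fresh, so be careful! If the file does not exist, it is created.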

In [3]:
file = open("exercises/words2.txt", 'w')

*open* returns a file object that provides methods for working with the file. The *write* method puts data into the file:

In [4]:
file.write("This here's the wattle,\n")

24

The return value is the number of characters that were written to the file. The file object keeps track of where it is, so if you call *write* again, it adds the new data to the end of the file.

In [5]:
file.write("the emblem of our land.\n")

24

When you are done writing, you should close the file. If you don't, the file is closed when the program ends, but until then some of the data may still be sitting in a buffer rather than on disk.

In [6]:
file.close()

In [9]:
%%bash
cat exercises/words2.txt

This here's the wattle,
the emblem of our land.
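Forgetting to call *close* is easy, especially when an exception interrupts the program. A common alternative is the *with* statement, which closes the file automatically when its block ends. The following is a minimal sketch; the temporary directory and the filename `wattle.txt` are assumptions, chosen so the example does not touch any of your own files.

```python
import os
import tempfile

# Work in a throwaway directory so no real file is overwritten.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "wattle.txt")  # hypothetical filename

    # 'w' creates the file (or clears it if it exists); the file is
    # closed automatically when the with block ends.
    with open(path, "w") as fout:
        written = fout.write("This here's the wattle,\n")
        written += fout.write("the emblem of our land.\n")

    # Reopen in the default read mode and read the lines back.
    with open(path) as fin:
        lines = fin.readlines()

print(written)  # 48, the total number of characters written
print(lines)
```

Because both *with* blocks close their file on exit, there is no *close* call to forget.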


## Format operator
The argument of write has to be a string, so if we want to put other values in a file, we have to convert them to strings.

The easiest way to convert to a string is to use *str*:

In [10]:
file = open("exercises/words2.txt", 'w')
x = 52
file.write(str(x))
file.close()

In [11]:
%%bash
cat exercises/words2.txt

52

An alternative is to use the **format operator, %**. When applied to integers, % is the modulus operator, but when applied to strings, it is the format operator.

The first operand is the **format string**, which contains one or more **format sequences** that specify how the second operand is formatted.

For example, the format sequence '%d' means that the second operand should be formatted as a decimal integer.

In [12]:
camels = 42
'%d' % camels

'42'

A format sequence can appear anywhere in the string, so you can embed a value in a sentence:

In [13]:
'I have spotted %d camels' % camels

'I have spotted 42 camels'

Other format sequences include '%g' to format a floating-point number, and '%s' to format a string.

In [16]:
'In %d years I have spotted %g %s' % (42, 0.1, 'camels')

'In 42 years I have spotted 0.1 camels'

Multiple insertions can be made, as long as the second operand is a tuple and the number of format sequences matches the number of elements in the tuple.
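The matching rule can be checked directly. In this short sketch, the names `line` and `error_message` are illustrative; the second half shows that supplying too few elements raises a TypeError.

```python
# Three format sequences, three elements in the tuple.
line = 'In %d years I have spotted %g %s.' % (3, 0.1, 'camels')
print(line)  # In 3 years I have spotted 0.1 camels.

# Three format sequences but only two elements: Python raises TypeError.
try:
    '%d %d %d' % (1, 2)
except TypeError as err:
    error_message = str(err)
    print(error_message)
```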

## Filenames and paths
Files are organized into **directories**, also called folders. Every running program has a **current directory**, which is the default directory for most operations. 

A string that identifies a file or a directory is called a **path**.

The *os* module provides functions for working with files and directories.

In [17]:
import os
cwd = os.getcwd()
cwd

'C:\\Users\\dhrun.lauwers\\git\\knowledge-base\\data-science\\think-python'

A path that starts at the root directory is called an **absolute path**, whereas a path that starts at the current directory is a **relative path**.

To find the absolute path for a file, you can use *os.path.abspath*:

In [18]:
os.path.abspath('words2.txt')

'C:\\Users\\dhrun.lauwers\\git\\knowledge-base\\data-science\\think-python\\words2.txt'

Other functions for working with paths include:

| function | example | description |
| -------- | ------- | ----------- |
| os.path.abspath | os.path.abspath(path) | returns the absolute path, whether or not the file exists |
| os.path.exists | os.path.exists(path) | returns True if the file or directory exists, otherwise False |
| os.path.isdir | os.path.isdir(path) | returns True if the path is an existing directory, otherwise False |
| os.listdir | os.listdir(path) | returns a list of the names of the files and directories in the given directory |
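The functions in the table can be tried out safely in a temporary directory. This is only a sketch: the filenames `memo.txt` and `no_such_file.txt` are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "memo.txt")  # hypothetical filename
    with open(path, "w") as fout:
        fout.write("remember the camels\n")

    # abspath only builds a path; the file does not need to exist.
    is_absolute = os.path.isabs(os.path.abspath("memo.txt"))

    file_exists = os.path.exists(path)
    missing_exists = os.path.exists(os.path.join(tmpdir, "no_such_file.txt"))
    dir_is_dir = os.path.isdir(tmpdir)
    file_is_dir = os.path.isdir(path)
    names = os.listdir(tmpdir)

print(is_absolute, file_exists, missing_exists, dir_is_dir, file_is_dir)
# True True False True False
print(names)  # ['memo.txt']
```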

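For example, the following function walks through a directory, prints the path of every file in it, and calls itself recursively on every subdirectory.

In [20]:
def walk(dirname):
    """Print the path of every file under dirname, recursively."""
    for name in os.listdir(dirname):
        # join builds a path from the directory name and the entry name
        path = os.path.join(dirname, name)

        if os.path.isfile(path):
            print(path)
        else:
            # anything that is not a file is assumed to be a directory
            walk(path)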

In [21]:
walk('c:/users/dhrun.lauwers/git/')

c:/users/dhrun.lauwers/git/knowledge-base\.git\COMMIT_EDITMSG
c:/users/dhrun.lauwers/git/knowledge-base\.git\config
c:/users/dhrun.lauwers/git/knowledge-base\.git\description
c:/users/dhrun.lauwers/git/knowledge-base\.git\FETCH_HEAD
c:/users/dhrun.lauwers/git/knowledge-base\.git\HEAD
c:/users/dhrun.lauwers/git/knowledge-base\.git\hooks\applypatch-msg.sample
c:/users/dhrun.lauwers/git/knowledge-base\.git\hooks\commit-msg.sample
c:/users/dhrun.lauwers/git/knowledge-base\.git\hooks\fsmonitor-watchman.sample
c:/users/dhrun.lauwers/git/knowledge-base\.git\hooks\post-update.sample
c:/users/dhrun.lauwers/git/knowledge-base\.git\hooks\pre-applypatch.sample
c:/users/dhrun.lauwers/git/knowledge-base\.git\hooks\pre-commit.sample
c:/users/dhrun.lauwers/git/knowledge-base\.git\hooks\pre-push.sample
c:/users/dhrun.lauwers/git/knowledge-base\.git\hooks\pre-rebase.sample
c:/users/dhrun.lauwers/git/knowledge-base\.git\hooks\pre-receive.sample
c:/users/dhrun.lauwers/git/knowledge-base\.git\hooks\prepare

c:/users/dhrun.lauwers/git/rep-b\.git\COMMIT_EDITMSG
c:/users/dhrun.lauwers/git/rep-b\.git\config
c:/users/dhrun.lauwers/git/rep-b\.git\description
c:/users/dhrun.lauwers/git/rep-b\.git\HEAD
c:/users/dhrun.lauwers/git/rep-b\.git\hooks\applypatch-msg.sample
c:/users/dhrun.lauwers/git/rep-b\.git\hooks\commit-msg.sample
c:/users/dhrun.lauwers/git/rep-b\.git\hooks\fsmonitor-watchman.sample
c:/users/dhrun.lauwers/git/rep-b\.git\hooks\post-update.sample
c:/users/dhrun.lauwers/git/rep-b\.git\hooks\pre-applypatch.sample
c:/users/dhrun.lauwers/git/rep-b\.git\hooks\pre-commit.sample
c:/users/dhrun.lauwers/git/rep-b\.git\hooks\pre-push.sample
c:/users/dhrun.lauwers/git/rep-b\.git\hooks\pre-rebase.sample
c:/users/dhrun.lauwers/git/rep-b\.git\hooks\pre-receive.sample
c:/users/dhrun.lauwers/git/rep-b\.git\hooks\prepare-commit-msg.sample
c:/users/dhrun.lauwers/git/rep-b\.git\hooks\update.sample
c:/users/dhrun.lauwers/git/rep-b\.git\index
c:/users/dhrun.lauwers/git/rep-b\.git\info\exclude
c:/users/dhr

c:/users/dhrun.lauwers/git/size-matters\.git\refs\heads\master
c:/users/dhrun.lauwers/git/size-matters\.git\refs\remotes\origin\develop
c:/users/dhrun.lauwers/git/size-matters\.git\refs\remotes\origin\master
c:/users/dhrun.lauwers/git/size-matters\.gitignore
c:/users/dhrun.lauwers/git/size-matters\.idea\.gitignore
c:/users/dhrun.lauwers/git/size-matters\.idea\inspectionProfiles\profiles_settings.xml
c:/users/dhrun.lauwers/git/size-matters\.idea\misc.xml
c:/users/dhrun.lauwers/git/size-matters\.idea\modules.xml
c:/users/dhrun.lauwers/git/size-matters\.idea\size-matters.iml
c:/users/dhrun.lauwers/git/size-matters\.idea\vcs.xml
c:/users/dhrun.lauwers/git/size-matters\.idea\workspace.xml
c:/users/dhrun.lauwers/git/size-matters\benefits-analysis\notebooks\.ipynb_checkpoints\Size Matters Benefits Analysis-checkpoint.ipynb
c:/users/dhrun.lauwers/git/size-matters\benefits-analysis\notebooks\Size Matters Benefits Analysis.ipynb
c:/users/dhrun.lauwers/git/size-matters\benefits-analysis\readme.md
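The standard library also offers *os.walk*, which performs a similar traversal without explicit recursion. The sketch below is an alternative to the *walk* function defined earlier, not a replacement for it; the function name `list_files` and the tiny directory tree it runs on are assumptions made for illustration.

```python
import os
import tempfile

def list_files(dirname):
    """Return the paths of all files under dirname, in sorted order."""
    paths = []
    # os.walk yields one (directory, subdirectories, filenames) tuple
    # for every directory in the tree.
    for dirpath, dirnames, filenames in os.walk(dirname):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)

with tempfile.TemporaryDirectory() as tmpdir:
    # Build a small tree: a.txt at the top level and sub/b.txt below it.
    os.makedirs(os.path.join(tmpdir, "sub"))
    for relative in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(tmpdir, relative), "w") as fout:
            fout.write("x\n")

    # Report paths relative to the temporary directory.
    found = [os.path.relpath(p, tmpdir) for p in list_files(tmpdir)]

print(found)
```

Returning a list instead of printing makes the result easier to filter or count afterwards.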

## Catching exceptions

To avoid errors when working with files, it's best to go ahead and try, and deal with the problems if they happen. This is exactly what the *try* statement does:

Python starts by executing the *try* clause. 
* If all goes well, it skips the *except* clause and proceeds.
* If an exception occurs, it jumps out of the *try* clause and runs the *except* clause.

Handling an exception with a *try* statement is called *catching* an exception.

In [22]:
try:
    fin = open('bad_file.txt')
except:
    print('Something went wrong.')

Something went wrong.
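A bare *except* clause catches every exception, including ones that have nothing to do with the file. A more targeted variant names the exception it expects. The sketch below assumes a hypothetical helper, `read_or_default`, and looks for the file in a fresh temporary directory so that it is certain to be missing.

```python
import os
import tempfile

def read_or_default(path, default=''):
    """Return the contents of path, or default if the file is missing."""
    try:
        with open(path) as fin:
            return fin.read()
    except FileNotFoundError:
        # Only a missing file is handled here; other errors propagate.
        return default

with tempfile.TemporaryDirectory() as tmpdir:
    missing = os.path.join(tmpdir, 'bad_file.txt')
    result = read_or_default(missing, default='Something went wrong.')

print(result)  # Something went wrong.
```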


## Databases
A **database** is a file that is organized for storing data. Many databases are organized like a dictionary in the sense that they map from keys to values. The biggest difference between a database and a dictionary is that the database is on disk, so it persists after the program ends.

The module *dbm* provides an interface for creating and updating database files.

Opening a database is similar to opening other files. The mode 'c' means that the database should be created if it doesn't already exist. The result is a database object that can be used, for most operations, like a dictionary.

When you create a new item, *dbm* updates the database file.

When you access one of the items, *dbm* reads the file.

The result is a **bytes object**, which is why it begins with a 'b'. A bytes object is similar to a string in many ways.

In [24]:
import dbm
db = dbm.open('captions', 'c')

In [25]:
db['cleese.png'] = 'Photo of John Cleese.'
db['cleese.png']

b'Photo of John Cleese.'

If you make another assignment to an existing key, *dbm* replaces the existing value.

In [29]:
db['cleese.png'] = 'Photo of John Cleese doing a silly walk.'
db['cleese.png']

b'Photo of John Cleese doing a silly walk.'

Some dictionary methods, like *items*, may not work with database objects, depending on which database implementation *dbm* uses on your system. But iteration with a for loop works here:

In [30]:
for key in db:
    print(key, db[key])

b'cleese.png' b'Photo of John Cleese doing a silly walk.'
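To see that the data really persists, the database can be closed and opened again. The following sketch keeps the database in a temporary directory (an assumption made so the example cleans up after itself) and uses *with* to close it; it also decodes the bytes value back into a string.

```python
import dbm
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'captions')

    # First "run": create the database and store one caption.
    with dbm.open(path, 'c') as db:
        db['cleese.png'] = 'Photo of John Cleese.'

    # Second "run": reopen the same database; the data is still there.
    with dbm.open(path, 'c') as db:
        raw = db['cleese.png']         # a bytes object
        caption = raw.decode('utf-8')  # converted back to a str

print(raw)      # b'Photo of John Cleese.'
print(caption)  # Photo of John Cleese.
```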


## Pickling

A limitation of *dbm* is that the keys and values have to be strings or bytes. If you try to use any other type, you get an error.

The *pickle* module can help. It translates almost any type of object into a bytes object suitable for storage in a database, and then translates the bytes back into objects.

*pickle.dumps* takes an object as a parameter and returns its representation as a bytes object. *pickle.loads* reconstitutes the object.

In [33]:
import pickle
t = [1, 2, 3]
s = pickle.dumps(t)
s

b'\x80\x03]q\x00(K\x01K\x02K\x03e.'

In [34]:
pickle.loads(s)

[1, 2, 3]
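Combining the two modules gets around the limitation of *dbm*. The sketch below stores a list under a hypothetical key, `camels`, in a database kept in a temporary directory; both names are assumptions made for illustration.

```python
import dbm
import os
import pickle
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    with dbm.open(os.path.join(tmpdir, 'scores'), 'c') as db:
        # A list cannot be stored directly, so pickle it first.
        db['camels'] = pickle.dumps([1, 2, 3])

        # Reading returns the pickled bytes; loads rebuilds the list.
        restored = pickle.loads(db['camels'])

print(restored)  # [1, 2, 3]
```

Only unpickle data you trust: *pickle.loads* can run arbitrary code while it rebuilds an object.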

## Pipes
Most operating systems provide a command line interface, also known as a **shell**. Shells usually provide commands to navigate the file system and launch applications.

Any program that you can launch from the shell can also be launched from Python using a **pipe object**, which represents a running program.

For example, the Unix command *ls -l* normally displays the contents of the current directory in long format. You can launch *ls* with *os.popen*:

In [36]:
cmd = 'ls -l'
fp = os.popen(cmd)

In [38]:
res = fp.read()
print(res)
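Because *ls* exists only on Unix-like systems, the sketch below launches the Python interpreter itself instead, which is available wherever the example runs. It uses the standard library's *subprocess* module rather than *os.popen*; the child command is an arbitrary choice for illustration.

```python
import subprocess
import sys

# Run a child Python process and capture what it prints.
completed = subprocess.run(
    [sys.executable, '-c', "print('hello from a child process')"],
    capture_output=True,  # collect stdout and stderr
    text=True,            # decode the output to str
)
output = completed.stdout.strip()

print(output)                # hello from a child process
print(completed.returncode)  # 0 means the child exited normally
```

Passing the command as a list avoids any quoting differences between shells.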




## Writing modules
Any file that contains Python code can be imported as a module. For example, suppose that you have a file named *wc.py* with some code, stored in a subdirectory of the current directory called 'exercises'. You can import the module as follows:

In [44]:
import exercises.wc as wc

This creates a module object, *wc*, which provides the function *linecount*.

In [46]:
wc

<module 'exercises.wc' from 'C:\\Users\\dhrun.lauwers\\git\\knowledge-base\\data-science\\think-python\\exercises\\wc.py'>

In [48]:
wc.linecount('exercises/wc.py')

5

Programs that are imported as modules often use the following idiom:

In [49]:
if __name__ == '__main__':
    print(linecount('exercises/wc.py'))

5


*\_\_name__* is a built-in variable that is set when the program starts.
* If the program is running as a script, *\_\_name__* has the value '\_\_main__', which triggers the code in the body of the statement to run. Usually, this contains some test code so you can test the module separately.
* If the module is being imported, the code in the body of the statement is skipped.
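The contents of *wc.py* are not shown in these notes. The following is a sketch of what such a module could look like; the body of `linecount` and the use of command-line arguments in the guarded block are assumptions, not the actual file.

```python
import sys

def linecount(filename):
    """Return the number of lines in the file called filename."""
    count = 0
    with open(filename) as fin:
        for _ in fin:
            count += 1
    return count

if __name__ == '__main__':
    # Runs only when the file is executed as a script, for example
    # `python wc.py some_file.txt`; it is skipped on import.
    for filename in sys.argv[1:]:
        print(filename, linecount(filename))
```

With the guard in place, importing the module defines `linecount` without printing anything.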