# [Chapter 5. Files and I/O](http://chimera.labs.oreilly.com/books/1230000000393/ch05.html)

All programs need to perform input and output.  
This chapter covers common idioms for working with different kinds of files, including text and binary files, file encodings, and other related matters.  
Techniques for manipulating filenames and directories are also covered.

## [Reading and Writing Text Data](http://chimera.labs.oreilly.com/books/1230000000393/ch05.html#_reading_and_writing_text_data)

### Problem 

You need to read or write text data, possibly in different text encodings such as ASCII, UTF-8, or UTF-16.

### Solution

Use the `open()` function with mode `rt` to read a text file.

In [62]:
# Read the entire file as a single string:
with open('python_ipsum.txt', 'rt') as f:
    data = f.read()
print(data)

Python Ipsum: Your source for Python-flavored placeholder text.
http://pythonipsum.com/

Lambda raspberrypi beautiful test script. Kwargs integration itertools dict reduce egg import cython.

Django integration functools unit object kwargs functools dictionary cython. Cython integration exception. Lambda integration diversity bdfl. Return integration exception self dunder. Python integration mercurial bdfl python lambda generator. Kwargs raspberrypi decorator unit cython import. Cython raspberrypi exception unit future klass exception. Python integration community. Object raspberrypi community bdfl cython import method.

Method raspberrypi diversity 2to3 return yield unit yield guido. Method integration mercurial unit import python exception dictionary. Django raspberrypi functools self import. Python integration mercurial dict return klass. Lambda integration mercurial 2to3 cython zen.

Import raspberrypi community pypi reduce dunder pyladies functools. Lambda raspberrypi decorator bd

In [63]:
# Iterate over the lines of the file:
with open('python_ipsum.txt', 'rt') as f:
    for line in f:
        print(line)

Python Ipsum: Your source for Python-flavored placeholder text.

http://pythonipsum.com/



Lambda raspberrypi beautiful test script. Kwargs integration itertools dict reduce egg import cython.



Django integration functools unit object kwargs functools dictionary cython. Cython integration exception. Lambda integration diversity bdfl. Return integration exception self dunder. Python integration mercurial bdfl python lambda generator. Kwargs raspberrypi decorator unit cython import. Cython raspberrypi exception unit future klass exception. Python integration community. Object raspberrypi community bdfl cython import method.



Method raspberrypi diversity 2to3 return yield unit yield guido. Method integration mercurial unit import python exception dictionary. Django raspberrypi functools self import. Python integration mercurial dict return klass. Lambda integration mercurial 2to3 cython zen.



Import raspberrypi community pypi reduce dunder pyladies functools. Lambda raspberrypi dec

Similarly, use `open()` with mode `wt` to write a text file. This clears and overwrites the previous contents, if there were any.

In [64]:
# Write chunks of text data:
with open('text_data.txt', 'wt') as f:
    f.write("This is the first sentence.\n")
    f.write("This is the second sentence.\n")
    f.write("Spicy Jalapeño")

In [65]:
with open('text_data.txt', 'rt') as f:
    text_data = f.read()
print(text_data)

This is the first sentence.
This is the second sentence.
Spicy Jalapeño


Let's try a redirected `print()` call:

In [66]:
with open('print_statement.txt', 'wt') as f:
    print("This is the first sentence.", file=f)
    print("This is the second sentence.", file=f)

In [67]:
with open('print_statement.txt', 'rt') as f:
    print_statement = f.read()
print(print_statement)

This is the first sentence.
This is the second sentence.



To append to the end of an existing file, use `open()` with mode `at`.  
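By default, files are read and written using the platform's preferred text encoding, which you can check with `locale.getpreferredencoding(False)` (`sys.getdefaultencoding()` reports the default for `str.encode()` and `bytes.decode()`, which `open()` does not use).  
On most Unix-like machines, this is `utf-8`, but on Windows it is often a legacy code page such as `cp1252`.  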
If you know that the text you are reading or writing is in a different encoding, supply the optional encoding parameter to `open()`.
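
As a minimal sketch of both points, the snippet below appends to a file with mode `at` and states the encoding explicitly. The scratch directory and the file name `append_demo.txt` are assumptions made for illustration, not part of the recipe.

```python
import os
import tempfile

# Hypothetical scratch file, so the example leaves your own files untouched.
path = os.path.join(tempfile.mkdtemp(), 'append_demo.txt')

# Create the file, naming the encoding instead of relying on the default.
with open(path, 'wt', encoding='utf-8') as f:
    f.write('First line.\n')

# Mode 'at' keeps the existing contents and adds to the end.
with open(path, 'at', encoding='utf-8') as f:
    f.write('Spicy Jalapeño\n')

# Read everything back with the same encoding.
with open(path, 'rt', encoding='utf-8') as f:
    appended = f.read()
print(appended)
```

Naming the encoding on every call keeps the result the same across machines whose defaults differ.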

Python understands several hundred possible text encodings.  
However, some of the more common encodings are `ascii`, `latin-1`, `utf-8`, and `utf-16`.  
`UTF-8` is usually a safe bet if working with web applications.  
`ascii` corresponds to the 7-bit characters in the range `U+0000` to `U+007F`.  
`latin-1` is a direct mapping of bytes 0-255 to Unicode characters `U+0000` to `U+00FF`.  
`latin-1` encoding is notable in that it will never produce a decoding error when reading text of a possibly unknown encoding.  
Reading a file as `latin-1` might not produce a completely correct text decoding, but it still might be enough to extract useful data out of it.   
Also, if you later write the data back out using the same `latin-1` encoding, the original input bytes will be preserved.

In [68]:
with open('text_data.txt', 'rt', encoding='latin-1') as f:
    text_data = f.read()
print(text_data)

This is the first sentence.
This is the second sentence.
Spicy JalapeÃ±o


### Discussion

Reading and writing text files is typically very straightforward.  
However, there are a number of subtle aspects to keep in mind.  
First, the use of the `with` statement in the examples establishes a context in which the file will be used.  
<b>When control leaves the `with` block, the file will be closed automatically</b>.  
You don’t need to use the `with` statement, but if you don’t use it, make sure you remember to close the file, like so:

In [69]:
f = open('text_data.txt', 'rt')
text_data = f.read()
f.close()
print(text_data)

This is the first sentence.
This is the second sentence.
Spicy Jalapeño


Another minor complication concerns the recognition of newlines, which are different on Unix and Windows (i.e., `\n` versus `\r\n`).  
By default, Python operates in what’s known as "universal newline" mode.  
In this mode, all common newline conventions are recognized, and newline characters are converted to a single `\n` character while reading.  
Similarly, the newline character `\n` is converted to the system default newline character on output.  
If you don’t want this translation, supply the `newline=''` argument to `open()`, like this:

In [70]:
# Read a file with a disabled newline translation:
with open('text_data.txt', 'rt', newline='') as f:
    text_data = f.read()
print(text_data)

This is the first sentence.
This is the second sentence.
Spicy Jalapeño


To illustrate the difference, here’s what you will see on a Unix machine if you read the contents of a Windows-encoded text file containing the raw data `hello world!\r\n`:
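
The comparison can be sketched as follows, assuming a hypothetical scratch file `hello.txt` that is created on the fly from the raw bytes. `repr()` is used so that the line endings are visible.

```python
import os
import tempfile

# Hypothetical scratch file holding a Windows-style line ending.
path = os.path.join(tempfile.mkdtemp(), 'hello.txt')

# Write raw bytes so the '\r\n' is stored exactly as given.
with open(path, 'wb') as f:
    f.write(b'hello world!\r\n')

# Newline translation enabled (the default): '\r\n' is read as '\n'.
with open(path, 'rt') as f:
    translated = f.read()

# Newline translation disabled: the '\r' is kept.
with open(path, 'rt', newline='') as f:
    untranslated = f.read()

print(repr(translated))    # 'hello world!\n'
print(repr(untranslated))  # 'hello world!\r\n'
```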

A final issue concerns possible encoding errors in text files.  
When reading or writing a text file, you might encounter an encoding or decoding error.
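
A small sketch of how such an error can arise: the file below is written as UTF-8 and then read back as ASCII. The file name `sample.txt` and the scratch directory are assumptions made for illustration.

```python
import os
import tempfile

# Hypothetical scratch file containing one non-ASCII character.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'wt', encoding='utf-8') as f:
    f.write('Spicy Jalapeño')

error = None
try:
    # 'ñ' is stored as two bytes that are not valid ASCII.
    with open(path, 'rt', encoding='ascii') as f:
        f.read()
except UnicodeDecodeError as exc:
    error = exc
    print('Decoding failed:', exc)
```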

If you get this error, it usually means that you’re not reading the file in the correct encoding.  
You should carefully read the specification of whatever it is that you’re reading and check that you’re doing it right (e.g., reading data as `UTF-8` instead of `Latin-1` or whatever it needs to be).  
If encoding errors are still a possibility, you can supply an optional `errors` argument to `open()` to deal with the errors.  
Here are two samples of common error handling schemes:

In [71]:
# Replace bad characters with Unicode U+fffd replacement characters:
with open('text_data.txt', 'rt', encoding='ascii', errors='replace') as f:
    text_data = f.read()
print(text_data)

This is the first sentence.
This is the second sentence.
Spicy Jalape��o


In [72]:
# Ignore bad characters entirely:
f = open('text_data.txt', 'rt', encoding='ascii', errors='ignore')
text_data = f.read()
f.close()
print(text_data)

This is the first sentence.
This is the second sentence.
Spicy Jalapeo


If you’re constantly fiddling with the `encoding` and `errors` arguments to `open()` and doing lots of hacks, you’re probably making life more difficult than it needs to be.  
The number one rule with text is that you simply need to make sure you’re always using the proper text encoding.  
When in doubt, use the default setting (typically UTF-8).

## [Printing to a File](http://chimera.labs.oreilly.com/books/1230000000393/ch05.html#_printing_to_a_file)

### Problem

You want to redirect the output of the `print()` function to a file.

### Solution

Use the [`file` keyword argument](https://docs.python.org/3/library/functions.html#print) to `print()`, like this:
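
A minimal sketch, assuming a hypothetical output file `somefile.txt` in a scratch directory:

```python
import os
import tempfile

# Hypothetical scratch file to receive the printed output.
path = os.path.join(tempfile.mkdtemp(), 'somefile.txt')

# The file must be opened in text mode for print() to write to it.
with open(path, 'wt') as f:
    print('Hello World!', file=f)

with open(path, 'rt') as f:
    printed = f.read()
print(repr(printed))  # 'Hello World!\n'
```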

### Discussion

There's not much more to printing to a file other than this.  
However, make sure that the file is opened in text mode.  
Printing will fail if the underlying file is in binary mode.

## Printing with a Different Separator or Line Ending

### Problem

You want to output data using `print()`, but you also want to change the separator character or the line ending.

### Solution

Use the `sep` and `end` keyword arguments to `print()` to change the output as you see fit.

In [73]:
print('ACME', 50, 91.5)
print('ACME', 50, 91.5, sep=',')
print('ACME', 50, 91.5, sep='!')
print('ACME', 50, 91.5, sep=',', end ='!!\n')
print('ACME', 50, 91.5, sep='!!\n', end =',')
# The previous call ended with ',' rather than a newline,
# so the output of the next call continues on the same line.
print('ACME', 50, 91.5, sep='!', end='!')

ACME 50 91.5
ACME,50,91.5
ACME!50!91.5
ACME,50,91.5!!
ACME!!
50!!
91.5,ACME!50!91.5!

Use of the `end` argument is also how you suppress the output of newlines.

In [74]:
for i in range(5):
    print(i)

0
1
2
3
4


In [75]:
for i in range(5):
    print(i, end='')

01234

### Discussion

Using `print()` with a different item separator is often the easiest way to output data when you need something other than a space separating the items.  
Sometimes you'll see programmers using `str.join()` to accomplish the same thing.

The problem with `str.join()` is that it only works with strings.  
You may have to resort to some creative means to achieve your end.
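
To make the limitation concrete, here is a short sketch using a hypothetical `row` tuple that mixes strings and numbers. It shows the failure and the usual workaround.

```python
row = ('ACME', 50, 91.5)

# str.join() raises TypeError because 50 and 91.5 are not strings.
try:
    ','.join(row)
except TypeError as exc:
    print('join failed:', exc)

# The usual workaround: convert every item to str first.
joined = ','.join(str(x) for x in row)
print(joined)  # ACME,50,91.5
```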

Instead, you will have better luck with the following:
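
A minimal sketch of that simpler approach: unpack the items into `print()` and let `sep` do the joining. The `row` tuple is a made-up example, and the second call writes to an in-memory buffer only so that the result can be inspected.

```python
import io

row = ('ACME', 50, 91.5)

# print() converts each item to a string itself, so no manual str() is needed.
print(*row, sep=',')

# The same call, captured in an in-memory text buffer.
out = io.StringIO()
print(*row, sep=',', file=out)
print(repr(out.getvalue()))  # 'ACME,50,91.5\n'
```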

## Reading and Writing Binary Data

### Problem

You need to read or write [binary data](https://en.wikipedia.org/wiki/Binary_data#In_computer_science), like the kind found in image or sound files.

### Solution

Use the `open()` function with mode set to `rb` or `wb` to read or write binary data.

In [77]:
# Read the entire file as a single byte string:
with open('text_data.txt', 'rb') as f:
    data = f.read()
    
# Write binary data to a file:
with open('binary.bin', 'wb') as f:
    f.write(b'Python Cookbook')

When reading binary, it is important to stress that all data returned will be in the form of [byte strings](https://docs.python.org/3/library/stdtypes.html#bytes), not text strings.  
Similarly, when writing, you must supply data in the form of objects that expose data as bytes (e.g., byte strings, bytearray objects, etc.).

### Discussion

When reading binary data, the subtle semantic differences between byte strings and text strings pose a potential gotcha.  
In particular, be aware that indexing and iteration return integer byte values instead of byte strings.

In [78]:
# Text String:
t = 'Hello World'
t[0]

'H'

In [79]:
for c in t:
    print(c)

H
e
l
l
o
 
W
o
r
l
d


In [80]:
# Byte String:
b = b'Hello World'
b[0]

72

In [81]:
for c in b:
    print(c)

72
101
108
108
111
32
87
111
114
108
100


If you ever need to read or write text from a binary-mode file, remember to decode or encode it.

In [82]:
with open('text_data.txt', 'rb') as f:
    data = f.read(17)
    text = data.decode('utf-8')
    print(text)
    
with open('binary.bin', 'wb') as f:
    text = 'Python Cookbook'
    f.write(text.encode('utf-8'))

This is the first


A lesser-known aspect of binary I/O is that objects such as arrays and C structures can be used for writing without any kind of intermediate conversion to a `bytes` object.

In [83]:
import array

nums = array.array('i', [1, 2, 3, 4])
with open('array_data.bin', 'wb') as f:
    f.write(nums)

This applies to any object that implements the so-called ["buffer interface"](https://docs.python.org/3/c-api/buffer.html), which exposes an underlying memory buffer to [operations](https://docs.python.org/3/c-api/buffer.html#buffer-related-functions) that can work with it.  
Writing binary data is one such operation.  
Many Python objects also allow binary data to be directly read into their underlying memory using the `readinto()` method of files.

In [84]:
import array

a = array.array('i', [0, 0, 0, 0, 0, 0, 0, 0])
with open('array_data.bin', 'rb') as f:
    print(f.readinto(a))

16


In [85]:
a

array('i', [1, 2, 3, 4, 0, 0, 0, 0])

Use caution when using this technique, because it is often platform specific and may depend on such things as the word size and [byte ordering](https://docs.python.org/3/library/struct.html#struct-alignment).  
See Recipe 5.9 for another example of reading binary data into a mutable buffer.

## Writing to a File That Doesn’t Already Exist

### Problem

You want to write data to a file, but only if it doesn't already exist on the filesystem.

### Solution

This problem can be handled by using the `x` mode in the function call to `open()` instead of `w` mode.

If the file is binary, use mode `xb` instead of `xt`.
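
A short sketch of the behaviour, assuming a hypothetical file name `somefile` in a fresh scratch directory. The first `open()` succeeds because the file does not exist yet; the second raises `FileExistsError`.

```python
import os
import tempfile

# Hypothetical path that is guaranteed not to exist yet.
path = os.path.join(tempfile.mkdtemp(), 'somefile')

# First attempt: the file is created and written.
with open(path, 'xt') as f:
    f.write('Hello\n')

# Second attempt: 'x' mode refuses to overwrite the existing file.
already_there = False
try:
    with open(path, 'xt') as f:
        f.write('Hello again\n')
except FileExistsError:
    already_there = True
    print('File already exists, left untouched.')

with open(path, 'rt') as f:
    contents = f.read()
```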

### Discussion

An alternative solution is to first test for the existence of the file like so:

In [86]:
import os

if not os.path.exists('write_something.txt'):
    with open('write_something.txt', 'wt') as f:
        f.write('Word Up!\n')
else:
    print("That file already exists, sucka.")

That file already exists, sucka.


Note that the `x` mode of `open()` is only available in Python 3 (version 3.3 and later).  
No such mode exists in earlier Python versions or in the C libraries used in Python's implementation.

## Performing I/O Operations on a String

### Problem

You want to feed a text or binary string to code that's been written to operate on file-like objects instead.

### Solution

Use the `io.StringIO()` and `io.BytesIO()` classes to create file-like objects that operate on string data.

In [87]:
import io

s = io.StringIO()
s.write('Hello World\n')

12

In [88]:
print('This is a test', file=s)
# Get all of the data written so far: 
s.getvalue()

'Hello World\nThis is a test\n'

In [89]:
# Wrap a file interface around an existing string:
s = io.StringIO('Hello\nWorld\n')
s.read(4)

'Hell'

In [90]:
s.read()

'o\nWorld\n'

The `io.StringIO` class should only be used for text.  
If you are using binary data, use the `io.BytesIO` class instead.

In [91]:
s = io.BytesIO()
s.write(b'binary data')
s.getvalue()

b'binary data'

### Discussion

The `StringIO` and `BytesIO` classes are most useful in scenarios where you need to mimic a normal file for some reason.  
For example, in unit tests, you might use `StringIO` to create a file-like object containing test data that’s fed into a function that would otherwise work with a normal file.  
Be aware that `StringIO` and `BytesIO` instances don’t have a proper integer file-descriptor.  
Thus, they do not work with code that requires the use of a real system-level file such as a file, pipe, or socket.
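
The unit-test scenario might look like the sketch below. `count_words()` is a hypothetical function under test, invented here for illustration; the last part shows the missing file descriptor.

```python
import io

def count_words(f):
    """Hypothetical function under test: count words in a text file-like object."""
    return sum(len(line.split()) for line in f)

# Feed the function in-memory test data instead of a real file.
fake_file = io.StringIO('Hello World\nThis is a test\n')
word_count = count_words(fake_file)
assert word_count == 6

# There is no system-level descriptor behind a StringIO object.
has_descriptor = True
try:
    fake_file.fileno()
except io.UnsupportedOperation:
    has_descriptor = False
print(word_count, has_descriptor)
```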

## Reading and Writing Compressed Datafiles

### Problem

You need to read or write data in a file with gzip or bz2 compression.

### Solution

Well, by golly, Python has both `gzip` and `bz2` modules that you can use.  
Both modules provide an alternative implementation of `open()` that can be used.
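
A minimal sketch of text-mode reading and writing with both modules. The file names `somefile.gz` and `somefile.bz2` and the scratch directory are assumptions made for illustration.

```python
import bz2
import gzip
import os
import tempfile

# Hypothetical scratch files for the two compression formats.
tmp = tempfile.mkdtemp()
gz_path = os.path.join(tmp, 'somefile.gz')
bz_path = os.path.join(tmp, 'somefile.bz2')
text = 'Spicy Jalapeño\n'

# Write compressed text ('wt' selects text mode).
with gzip.open(gz_path, 'wt', encoding='utf-8') as f:
    f.write(text)
with bz2.open(bz_path, 'wt', encoding='utf-8') as f:
    f.write(text)

# Read it back ('rt' selects text mode).
with gzip.open(gz_path, 'rt', encoding='utf-8') as f:
    gz_text = f.read()
with bz2.open(bz_path, 'rt', encoding='utf-8') as f:
    bz_text = f.read()
print(gz_text, bz_text, sep='')
```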

As shown, all I/O will use text and perform Unicode encoding and decoding.  
If you would rather use binary data, use the file modes `rb` or `wb`.

### Discussion

For the most part, reading or writing compressed data is straightforward.  
However, be aware that choosing the correct file mode is critically important.  
If you don’t specify a mode, the default mode is binary, which will break programs that expect to receive text.  
Both `gzip.open()` and `bz2.open()` accept the same parameters as the built-in `open()` function, including encoding, errors, newline, and so forth.  
When writing compressed data, the compression level can be optionally specified using the `compresslevel` keyword argument.
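
For instance, the following sketch writes with `compresslevel=5`. The file name and the sample text are assumptions made for illustration.

```python
import gzip
import os
import tempfile

# Hypothetical scratch file.
path = os.path.join(tempfile.mkdtemp(), 'somefile.gz')
payload = 'some text\n' * 1000

# A mid-range level trades some compression for speed.
with gzip.open(path, 'wt', compresslevel=5) as f:
    f.write(payload)

# Reading does not need to know which level was used.
with gzip.open(path, 'rt') as f:
    round_trip = f.read()
print(len(round_trip), os.path.getsize(path))
```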

The default level is 9, which provides the highest level of compression.  
Lower levels offer better performance, but not as much compression.  
Finally, `gzip.open()` and `bz2.open()` can be layered on top of an existing file opened in binary mode.
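
A sketch of the layering, using an in-memory `io.BytesIO` object as a stand-in for any file-like object opened in binary mode (that stand-in is an assumption chosen to keep the example self-contained):

```python
import gzip
import io

# Stand-in for an existing binary file, pipe, or socket file object.
raw = io.BytesIO()

# Layer gzip text-mode writing on top of the existing object.
with gzip.open(raw, 'wt', encoding='utf-8') as g:
    g.write('Hello World\n')

# Rewind the underlying object and layer a reader on top of it.
raw.seek(0)
with gzip.open(raw, 'rt', encoding='utf-8') as g:
    layered_text = g.read()
print(repr(layered_text))  # 'Hello World\n'
```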

Using that neat little trick can come in handy when you work with various file-like objects including sockets, pipes, and in-memory files.

## Iterating Over Fixed-Sized Records

### Problem

Instead of iterating over a file by lines, you want to iterate over a collection of fixed-size records.

### Solution

Use the `iter()` function and `functools.partial()` for this little trick:
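
A minimal sketch, assuming a record size of 32 bytes and a hypothetical 80-byte file `somefile.data`, chosen so that the last record comes up short:

```python
from functools import partial
import os
import tempfile

RECORD_SIZE = 32

# Hypothetical 80-byte data file: two full records and one partial record.
path = os.path.join(tempfile.mkdtemp(), 'somefile.data')
with open(path, 'wb') as f:
    f.write(bytes(range(80)))

with open(path, 'rb') as f:
    # Call f.read(RECORD_SIZE) repeatedly until it returns b'' at end of file.
    records = iter(partial(f.read, RECORD_SIZE), b'')
    sizes = [len(r) for r in records]
print(sizes)  # [32, 32, 16]
```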

The `records` object in this example is an iterable that will produce fixed-size chunks until the end of the file is reached.  
However, be aware that the last item may have fewer bytes than expected if the file size is not an exact multiple of the record size.

### Discussion

A little-known feature of the `iter()` function is that it can create an iterator if you pass it a callable and a sentinel value.  
The resulting iterator simply calls the supplied callable over and over again until it returns the sentinel, at which point iteration stops.
In the solution, the `functools.partial` is used to create a callable that reads a fixed number of bytes from a file each time it’s called.  
The sentinel of `b''` is what gets returned when a file is read but the end of file has been reached.  
Last, but not least, the solution shows the file being opened in binary mode.  
For reading fixed-sized records, this would probably be the most common case.  
For text files, reading line by line (the default iteration behavior) is more common.

## Reading Binary Data into a Mutable Buffer

### Problem

You want to read binary data directly into a mutable buffer without any intermediate copying.  
Perhaps you want to mutate the data in-place and write it back out to a file?  

### Solution

To read data into a mutable array, use the `readinto()` method.

In [92]:
import os.path

def read_into_buffer(filename):
    buf = bytearray(os.path.getsize(filename))
    with open(filename, 'rb') as f:
        f.readinto(buf)
    return buf

Let's see what we can do with this:

In [93]:
with open('write_to_buffer.bin', 'wb') as f:
    f.write(b'Hello World')

In [94]:
a_buffer_object = read_into_buffer('write_to_buffer.bin')

In [95]:
a_buffer_object

bytearray(b'Hello World')

In [96]:
a_buffer_object[0:5] = b'Haloo'

In [97]:
a_buffer_object

bytearray(b'Haloo World')

In [98]:
with open('new_buffer_object.bin', 'wb') as f:
    f.write(a_buffer_object)

Feel free to continue the above example in any way that amuses you.

### Discussion

The `readinto()` method of files can be used to fill any preallocated array with data.  
This even includes arrays created from the array module or libraries such as `numpy`.  
Unlike the normal `read()` method, `readinto()` fills the contents of an existing buffer rather than allocating new objects and returning them.  
Thus, you might be able to use it to avoid making extra memory allocations.  
For example, if you are reading a binary file consisting of equally sized records, you can write code like this:

In [99]:
record_size = 32

buf = bytearray(record_size)
with open('example.bin', 'rb') as f:
    while True:
        n = f.readinto(buf)
        if n < record_size:
            break
        # Use the contents of buf

Another interesting feature to check out is [memoryview](https://docs.python.org/3/library/stdtypes.html#memoryview), which you can use to make zero-copy slices of an existing buffer and even change its contents.

In [100]:
buf

bytearray(b'Hello World\n\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')

In [101]:
m1 = memoryview(buf)
m1

<memory at 0x11043e1c8>

In [102]:
m2 = m1[-5:]
m2

<memory at 0x11043e288>

In [103]:
m2[:] = b'WORLD'

In [104]:
buf

bytearray(b'Hello World\n\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00WORLD')

The output you see above is different from what your interpreter will return, but you can get the idea from the example.

One caution with using `f.readinto()` is that you must always make sure to check its return value, which is the number of bytes actually read.  
If the number of bytes is smaller than the size of the supplied buffer, it might indicate truncated or corrupted data (e.g., if you were expecting an exact number of bytes to be read).  
Finally, be on the lookout for other "into" related functions in various library modules (like `recv_into()` and `pack_into()`).  
Many other parts of Python have support for direct I/O or data access that can be used to fill or alter the contents of arrays and buffers.  
See Recipe 6.12 for a significantly more advanced example of interpreting binary structures and usage of memoryviews.

## Memory Mapping Binary Files

### Problem

You want to memory map a binary file into a mutable byte array, possibly for random access to its contents or to make in-place modifications.

### Solution

Use the `mmap` module to memory map files.  
Here is a utility function that shows how to open a file and memory map it in a portable manner:

In [105]:
import os
import mmap

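def memory_map(filename, access=mmap.ACCESS_WRITE):
    size = os.path.getsize(filename)
    fd = os.open(filename, os.O_RDWR)
    # access must be passed by keyword: on Unix the third positional
    # parameter of mmap.mmap() is flags, not access. The zero bytes in the
    # recorded outputs below likely come from an earlier positional call.
    return mmap.mmap(fd, size, access=access)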

To use this function, you would need to have a file already created and filled with data.  
Here is an example of how you could initially create a file and expand it to a desired size:

In [106]:
# This creates a 1,000,000-byte (roughly 1 MB) file of zero bytes.

size = 1000000
with open('some_data.bin', 'wb') as f:
    f.seek(size-1)
    f.write(b'\x00')

Here is an example of memory mapping the contents using the `memory_map()` function:

In [107]:
m = memory_map('some_data.bin')
len(m)

1000000

In [108]:
m[0:10]

b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

In [109]:
m[0]

0

Let's see what else we can do:

In [110]:
# Reassign a slice:
m[0:11] = b'Hello World'
m.close()

In [111]:
# Let's see what we changed and how:
with open('some_data.bin', 'rb') as f:
    print(f.read(11))

b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'


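Well, that failed completely, and the Jupyter Notebook is probably not to blame.  
The likely culprit is the call `mmap.mmap(fd, size, access)` used when this output was recorded: on Unix, the third positional parameter of `mmap.mmap()` is `flags`, not `access`, and `mmap.ACCESS_WRITE` happens to share its numeric value with `mmap.MAP_PRIVATE` on Linux and macOS.  
The result is a private copy-on-write mapping whose changes never reach the file; passing `access=access` by keyword should make the read return `b'Hello World'`.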
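
A minimal sketch of the same round trip done in one piece, on a fresh hypothetical scratch file and with `access` passed by keyword. Read back after the mapping is closed, the bytes should be the ones written through the map.

```python
import mmap
import os
import tempfile

# Hypothetical scratch file filled with zero bytes.
path = os.path.join(tempfile.mkdtemp(), 'round_trip.bin')
with open(path, 'wb') as f:
    f.write(b'\x00' * 4096)

# Map the whole file (length 0) for writing and change the first 11 bytes.
with open(path, 'r+b') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as m:
        m[0:11] = b'Hello World'
        m.flush()

# Read the file normally to confirm the change was written through.
with open(path, 'rb') as f:
    first_bytes = f.read(11)
print(first_bytes)
```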

The `mmap` object returned by `mmap()` can also be used as a context manager, in which case the mapping is closed automatically when the block exits.

In [112]:
with memory_map('some_data.bin') as m:
    print(len(m))
    print(m[0:10])

1000000
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'


In [113]:
m.closed

True

By default, the `memory_map()` function shown opens a file for both reading and writing.  
Any modifications made to the data are copied back to the original file.  
If read-only access is needed instead, use `mmap.ACCESS_READ` for the `access` argument.

If you intend to modify the data locally, but don't want those changes written back to the original file, use `mmap.ACCESS_COPY`:
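
A sketch of copy-on-write access, using a hypothetical scratch file and calling `mmap.mmap()` directly so the example is self-contained. The change is visible through the map but the file on disk keeps its original contents.

```python
import mmap
import os
import tempfile

# Hypothetical scratch file filled with zero bytes.
path = os.path.join(tempfile.mkdtemp(), 'copy_demo.bin')
with open(path, 'wb') as f:
    f.write(b'\x00' * 4096)

with open(path, 'r+b') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as m:
        m[0:11] = b'Hello World'   # modifies the private copy only
        in_memory = m[0:11]

# The file itself is unchanged.
with open(path, 'rb') as f:
    on_disk = f.read(11)
print(in_memory, on_disk)
```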

### Discussion

Using `mmap` to map files into memory can be an efficient and elegant means for randomly accessing the contents of a file.  
For example, instead of opening a file and performing various combinations of `seek()`, `read()`, and `write()` calls, you can simply map the file and access the data using slicing operations.  
Normally, the memory exposed by `mmap()` looks like a `bytearray` object.  
However, you can interpret the data differently using a memoryview.

In [114]:
m = memory_map('some_data.bin')
# Memoryview of unsigned integers:
v = memoryview(m).cast('I')
v[0] = 7
m[0:4]

b'\x07\x00\x00\x00'

That seemed to work okay.  
Moving on ...

In [115]:
m[0:4] = b'\x07\x01\x00\x00'
v[0]

263

It should be emphasized that memory mapping a file does not cause the entire file to be read into memory.  
That is, it’s not copied into some kind of memory buffer or array.  
Instead, the operating system merely reserves a section of virtual memory for the file contents.  
As you access different regions, those portions of the file will be read and mapped into the memory region as needed.  
However, parts of the file that are never accessed simply stay on disk.  
This all happens transparently, behind the scenes.  

If more than one Python interpreter memory maps the same file, the resulting `mmap` object can be used to exchange data between interpreters.  
That is, all interpreters can read/write data simultaneously, and changes made to the data in one interpreter will automatically appear in the others.  
Obviously, some extra care is required to synchronize things, but this kind of approach is sometimes used as an alternative to transmitting data in messages over pipes or sockets.  
As shown, this recipe has been written to be as general purpose as possible, working on both Unix and Windows. Be aware that there are some platform differences concerning the use of the `mmap()` call hidden behind the scenes.  
In addition, there are options to create anonymously mapped memory regions.  
For more information, [here is a link to the Python docs](https://docs.python.org/3/library/mmap.html).

## Manipulating Pathnames

### Problem

You need to manipulate pathnames in order to find the base filename, directory name, absolute path, and so on.

### Solution

To manipulate pathnames, use the functions in the `os.path` module.  

In [116]:
import os

path = 'Users/benjamingrove/Desktop/punchcode_syllabus.pdf'

Get the last component of the path:

In [117]:
os.path.basename(path)

'punchcode_syllabus.pdf'

Get the directory name:

In [118]:
os.path.dirname(path)

'Users/benjamingrove/Desktop'

Join path components together:

In [119]:
os.path.join('tmp', 'data', os.path.basename(path))

'tmp/data/punchcode_syllabus.pdf'

I wonder if we can expand the user's home directory?

In [120]:
path = '~/Data/data.csv'
os.path.expanduser(path)

'/Users/benjamingrove/Data/data.csv'

We can even split the file extension:

In [121]:
os.path.splitext(path)

('~/Data/data', '.csv')

### Discussion

For any manipulation of filenames, you should use the `os.path` module instead of trying to cook up your own code using the standard string operations.  
In part, this is for portability.  
The `os.path` module knows about differences between Unix and Windows and can reliably deal with filenames such as `Data/data.csv` and `Data\data.csv`.  
Second, you really shouldn’t spend your time reinventing the wheel.  
It’s usually best to use the functionality that’s already provided for you.
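
To see the platform difference without switching machines, the sketch below calls the two implementations behind `os.path` directly: `posixpath` (Unix rules) and `ntpath` (Windows rules). Using these modules by name is a choice made for illustration; normal code should keep using `os.path`.

```python
import ntpath
import posixpath

# Each implementation understands its own separator.
unix_base = posixpath.basename('Data/data.csv')
windows_base = ntpath.basename('Data\\data.csv')
print(unix_base, windows_base)  # data.csv data.csv

# Windows rules also accept forward slashes.
mixed_base = ntpath.basename('Data/data.csv')
print(mixed_base)  # data.csv

# Unix rules treat a backslash as an ordinary filename character.
literal_base = posixpath.basename('Data\\data.csv')
print(literal_base)  # Data\data.csv
```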

## Testing for the Existence of a File