## Binary Bytes Files

As hinted when we met strings earlier, Python 3.X draws a sharp distinction between
text and binary data in files: text files represent content as normal str strings and perform
Unicode encoding and decoding automatically when writing and reading data,
while binary files represent content as a special bytes string and allow you to access file
content unaltered.
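As a minimal sketch of that distinction (the temporary directory and the `demo.txt` name are assumptions made for illustration), the same file yields a `str` in text mode and a `bytes` object in binary mode, and a binary-mode file rejects a `str` outright:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')     # hypothetical file name

    with open(path, 'w', encoding='utf-8') as f:
        f.write('spam')                      # text mode accepts str

    with open(path, 'r', encoding='utf-8') as f:
        as_text = f.read()                   # decoded to str
    with open(path, 'rb') as f:
        as_bytes = f.read()                  # raw bytes, not decoded

    with open(path, 'wb') as f:
        try:
            f.write('spam')                  # str is not allowed in binary mode
            mixed_ok = True
        except TypeError:
            mixed_ok = False

print(type(as_text).__name__, type(as_bytes).__name__, mixed_ok)
```

The two modes do not convert between the types silently; moving from one to the other requires an explicit encode or decode step.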

To illustrate, Python’s struct module can both create and unpack packed binary
data (raw bytes that record values that are not Python objects) to be written
to a file in binary mode.

In [2]:
import struct
packed = struct.pack('>i4sh', 7, b'spam', 8) # Create packed binary data

In [3]:
packed # 10 bytes, not objects or text

b'\x00\x00\x00\x07spam\x00\x08'

In [4]:
file = open('data.bin', 'wb') # Open binary output file

In [5]:
file.write(packed) # Write packed binary data

10

In [6]:
file.close()

In [8]:
# Reading binary data back is essentially symmetric; not all programs need to tread so
# deeply into the low-level realm of bytes, but binary files make this easy in Python:

data = open('data.bin', 'rb').read() # Open/read binary data file

In [9]:
data # 10 bytes, unaltered

b'\x00\x00\x00\x07spam\x00\x08'

In [10]:
data[4:8] # Slice bytes in the middle

b'spam'

In [11]:
list(data) # A sequence of 8-bit bytes

[0, 0, 0, 7, 115, 112, 97, 109, 0, 8]

In [12]:
struct.unpack('>i4sh', data) # Unpack into objects again

(7, b'spam', 8)
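The same round trip can also be written with `with` blocks, which close each file automatically. The sketch below is an illustration rather than part of the session above: the `roundtrip` helper and the temporary directory are assumptions, and `struct.calcsize` is used only to confirm the record size implied by the `'>i4sh'` format (a big-endian 4-byte integer, a 4-byte string, and a 2-byte short).

```python
import os
import struct
import tempfile

FORMAT = '>i4sh'   # big-endian: 4-byte int, 4-byte string, 2-byte short

def roundtrip(values, path):
    """Pack values, write them in binary mode, then read and unpack them."""
    with open(path, 'wb') as f:              # closed automatically on exit
        f.write(struct.pack(FORMAT, *values))
    with open(path, 'rb') as f:
        data = f.read()
    return data, struct.unpack(FORMAT, data)

with tempfile.TemporaryDirectory() as tmp:
    data, values = roundtrip((7, b'spam', 8), os.path.join(tmp, 'data.bin'))

record_size = struct.calcsize(FORMAT)        # bytes per packed record
print(record_size, len(data), values)
```

Because the format string starts with `>`, the sizes are the standard ones with no padding, so the record size equals the number of bytes written.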

## Unicode Text Files

To access files containing non-ASCII Unicode
text of the sort introduced earlier in this chapter, we simply pass in an encoding name
if the text in the file doesn’t match the default encoding for our platform. In this mode,
Python text files automatically encode on writes and decode on reads per the encoding
scheme name you provide. In Python 3.X:

In [13]:
S = 'sp\xc4m' # Non-ASCII Unicode text

In [14]:
S

'spÄm'

In [15]:
S[2]

'Ä'

In [16]:
file = open('unidata.txt', 'w', encoding='utf-8') # Write/encode UTF-8 text

In [17]:
file.write(S) # 4 characters written

4

In [18]:
file.close()

In [19]:
text = open('unidata.txt', encoding='utf-8').read() # Read/decode UTF-8 text

In [20]:
text

'spÄm'

In [21]:
len(text)

4

Because files handle encoding and decoding on transfers, you may process text
in memory as a simple string of characters without concern for its
Unicode-encoded origins. If needed, though, you can also see what’s truly
stored in your file by stepping into binary mode:

In [22]:
raw = open('unidata.txt', 'rb').read() # Read raw encoded bytes

In [23]:
raw

b'sp\xc3\x84m'

In [24]:
len(raw)

5
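The extra byte is why the encoding name matters on reads as well as writes. As a hedged illustration (the temporary file and the choice of Latin-1 as the mismatched encoding are assumptions, not part of the session above), reading the same UTF-8 file under a different encoding produces different characters:

```python
import os
import tempfile

S = 'sp\xc4m'                                # same 4-character string as above

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'unidata.txt')

    with open(path, 'w', encoding='utf-8') as f:
        f.write(S)                           # stored as 5 bytes on disk

    with open(path, encoding='utf-8') as f:
        right = f.read()                     # matching encoding: original text
    with open(path, encoding='latin-1') as f:
        wrong = f.read()                     # mismatch: each byte becomes a character

print(right == S, len(right))
print(wrong == S, len(wrong))
```

Latin-1 maps every byte to one character, so the mismatched read raises no error; it simply returns a 5-character string that differs from the text that was written.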

In [25]:
# You can also encode and decode manually if you get Unicode data from a source other
# than a file—parsed from an email message or fetched over a network connection, for
# example:
text.encode('utf-8') # Manual encode to bytes

b'sp\xc3\x84m'

In [26]:
raw.decode('utf-8') # Manual decode to str

'spÄm'

In [27]:
text.encode('latin-1') # Bytes differ under other encodings

b'sp\xc4m'

In [28]:
text.encode('utf-16')

b'\xff\xfes\x00p\x00\xc4\x00m\x00'

In [29]:
len(text.encode('latin-1')), len(text.encode('utf-16'))

(4, 10)

In [30]:
b'\xff\xfes\x00p\x00\xc4\x00m\x00'.decode('utf-16') # But decodes to the same string

'spÄm'
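Manual decoding only works when the codec matches the bytes. The following sketch (an illustration under assumed variable names, not part of the session above) decodes Latin-1 bytes as UTF-8 to show the failure, then uses the explicit `utf-16-le` and `utf-16-be` codecs, which fix the byte order and omit the leading byte order mark seen in the plain `utf-16` result:

```python
text = 'sp\xc4m'

latin = text.encode('latin-1')               # one byte per character
try:
    latin.decode('utf-8')                    # 0xC4 must be followed by a continuation byte
    failed = False
except UnicodeDecodeError:
    failed = True

# errors='replace' substitutes U+FFFD for undecodable bytes instead of raising
replaced = latin.decode('utf-8', errors='replace')

le = text.encode('utf-16-le')                # little-endian, no byte order mark
be = text.encode('utf-16-be')                # big-endian, no byte order mark

print(failed, ascii(replaced))
print(le, be)
print(le.decode('utf-16-le') == be.decode('utf-16-be') == text)
```

The plain `utf-16` codec writes a byte order mark and uses the platform's native byte order, so its output can differ between machines; naming the byte order explicitly makes the bytes predictable.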