# Chapter 6  
# Data Encoding and Processing

The main focus of this chapter is using Python to process data presented in different kinds of common encodings, such as CSV files, JSON, XML, and binary packed records.  
Unlike the chapter on data structures, this chapter is not focused on specific algorithms, but instead on the problem of getting data in and out of a program.

## 6.1 Reading and Writing CSV Data

If you want to read or write data encoded as a CSV file, you can use Python's `csv` library.  
We will use some stock market data from a CSV file for this example.

You can read the data as a sequence of tuples:

In [1]:
import csv

with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)
    for row in f_csv:
        # Process row
        # ... and so forth
        pass

In the preceding code, `row` will be a list of strings.  
To access certain fields, you need to use indexing, such as `row[0]` (Symbol) and `row[4]` (Change).  
Since such indexing can often be confusing, this is one place where you might want to consider the use of named tuples.

In [2]:
from collections import namedtuple
with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headings = next(f_csv)
    Row = namedtuple('Row', headings)
    for r in f_csv:
        row = Row(*r)
        # Process row
        # ... and so forth
        pass

This would allow you to use the column headers such as `row.Symbol` and `row.Change` instead of indices.  
It should be noted that this only works if the column headers are valid Python identifiers.  
If not, you might have to massage the initial headings (e.g., replacing nonidentifier characters with underscores or similar).  
Another approach allows you to read the data as a sequence of dictionaries instead.

In [3]:
import csv

with open('stocks.csv') as f:
    f_csv = csv.DictReader(f)
    for row in f_csv:
        # Do something ...
        pass

In this version, you would access the elements of each row using the column headers, for example `row['Symbol']` or `row['Change']`.  
To write CSV data, you also use the `csv` module, but you create a writer object.

In [4]:
headers = ['Symbol','Price','Date','Time','Change','Volume']
rows = [('AA', 39.48, '6/11/2007', '9:36am', -0.18, 181800),
            ('AIG', 71.38, '6/11/2007', '9:36am', -0.15, 195500),
            ('AXP', 62.58, '6/11/2007', '9:36am', -0.46, 935000),]

In [5]:
with open('stocks.csv', 'w', newline='') as f:  # newline='' lets csv control line endings
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(rows)

If you have the data as a sequence of dictionaries, like so:

In [6]:
headers = ['Symbol', 'Price', 'Date', 'Time', 'Change', 'Volume']
rows = [{'Symbol':'AA', 'Price':39.48, 'Date':'6/11/2007',
         'Time':'9:36am', 'Change':-0.18, 'Volume':181800},
        {'Symbol':'AIG', 'Price': 71.38, 'Date':'6/11/2007',
         'Time':'9:36am', 'Change':-0.15, 'Volume': 195500},
        {'Symbol':'AXP', 'Price': 62.58, 'Date':'6/11/2007',
         'Time':'9:36am', 'Change':-0.46, 'Volume': 935000},]

In [7]:
with open('stocks.csv', 'w', newline='') as f:  # newline='' lets csv control line endings
    f_csv = csv.DictWriter(f, headers)
    f_csv.writeheader()
    f_csv.writerows(rows)

### Discussion

Using Python's `csv` module can save you quite a bit of time over parsing, splitting, and cleaning the data yourself.  
For example, you might be inclined to write code like this instead:

In [8]:
with open('stocks.csv') as f:
    for line in f:
        row = line.split(',')
        # Do something ...
        pass

The problem with this approach is that you’ll still need to deal with some nasty details.  
For example, if any of the fields are surrounded by quotes, you’ll have to strip the quotes.  
In addition, if a quoted field happens to contain a comma, the code will break by producing a row with the wrong size.  
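
The quoting problem is easy to reproduce. The sketch below uses a made-up record (not taken from `stocks.csv`) whose first field is quoted and contains a comma, and compares naive splitting with `csv.reader`:

```python
import csv
import io

# A hypothetical record whose first field is quoted and contains a comma
line = '"Smith, John",42,"New York"\n'

# Naive splitting: the embedded comma splits the field and the quotes remain
naive = line.rstrip('\n').split(',')

# csv.reader applies the CSV quoting rules
parsed = next(csv.reader(io.StringIO(line)))

print(naive)    # four pieces, quotes still attached
print(parsed)   # three clean fields
```
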
By default, the `csv` library is programmed to understand CSV encoding rules used by Microsoft Excel.  
This is probably the most common variant, and will likely give you the best compatibility.  
However, if you consult the documentation for `csv`, you’ll see a few ways to tweak the encoding to different formats (e.g., changing the separator character).  
For example, if you want to read tab-delimited data instead, use this:

In [9]:
with open('stocks.csv') as f:
    f_tsv = csv.reader(f, delimiter='\t')
    for row in f_tsv:
        # Do something ...
        pass

If you're reading CSV data and converting it into named tuples, be careful to validate the column headers.  
For example, a CSV file could have a header line containing characters that are not valid in identifiers, like this:

`Street Address,Num-Premises,Latitude,Longitude`  
`5412 N CLARK,10,41.980262,-87.668452`  

This will cause the creation of a `namedtuple` to fail with a `ValueError` exception.  
To work around this, you might have to scrub the headers first, for instance by carrying out a regex substitution on the invalid characters like this:

In [10]:
import re

with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headers = [ re.sub('[^a-zA-Z_]', '_', h) for h in next(f_csv) ]
    Row = namedtuple('Row', headers)
    for r in f_csv:
        row = Row(*r)
        # do something
        pass

It's important to note that `csv` does not try to interpret the data or convert it to a type other than a string.  
The following example performs extra type conversions on CSV data:

In [11]:
col_types = [str, float, str, str, float, int]
with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)
    for row in f_csv:
        # Apply conversions to the row items
        row = tuple(convert(value) for convert, value in zip(col_types, row))
        # And so forth ...
        pass

You can also convert selected fields of dictionaries:

In [12]:
print('Reading as dicts with type conversion')
field_types = [ ('Price', float),
                ('Change', float),
                ('Volume', int) ]

with open('stocks.csv') as f:
    for row in csv.DictReader(f):
        row.update((key, conversion(row[key])) for key, conversion in field_types)
        print(row)

Reading as dicts with type conversion
OrderedDict([('Symbol', 'AA'), ('Price', 39.48), ('Date', '6/11/2007'), ('Time', '9:36am'), ('Change', -0.18), ('Volume', 181800)])
OrderedDict([('Symbol', 'AIG'), ('Price', 71.38), ('Date', '6/11/2007'), ('Time', '9:36am'), ('Change', -0.15), ('Volume', 195500)])
OrderedDict([('Symbol', 'AXP'), ('Price', 62.58), ('Date', '6/11/2007'), ('Time', '9:36am'), ('Change', -0.46), ('Volume', 935000)])


In general, you’ll probably want to be a bit careful with such conversions, though.  
In the real world, it’s common for CSV files to have missing values, corrupted data, and other issues that would break type conversions.  
So, unless your data is guaranteed to be error free, that’s something you’ll need to consider (you might need to add suitable exception handling).  
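
One possible shape for that exception handling is sketched below. The helper name `convert_row`, the sample data, and the choice to substitute `None` for bad values are all assumptions made for illustration; logging or skipping the row may suit your data better.

```python
import csv
import io

# Hypothetical data: the second row has a missing price and a corrupted volume
raw = io.StringIO(
    'Symbol,Price,Volume\n'
    'AA,39.48,181800\n'
    'AIG,,N/A\n'
)

field_types = [('Price', float), ('Volume', int)]

def convert_row(row, field_types):
    """Return a copy of row with conversions applied; bad values become None."""
    out = dict(row)
    for key, conversion in field_types:
        try:
            out[key] = conversion(row[key])
        except (ValueError, TypeError):
            # ValueError: unparsable text; TypeError: field missing entirely
            out[key] = None
    return out

rows = [convert_row(row, field_types) for row in csv.DictReader(raw)]
for row in rows:
    print(row)
```
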
Finally, if your goal in reading CSV data is to perform data analysis and statistics, you might want to look at the `pandas` package.  
`pandas` includes a convenient `pandas.read_csv()` function that will load CSV data into a `DataFrame` object.  
From there, you can generate various summary statistics, filter the data, and perform other kinds of high-level operations.

## 6.2 Reading and Writing JSON Data

### Problem  
You want to read or write data encoded as JavaScript Object Notation (JSON).

### Solution  
The `json` module provides an easy way to encode and decode data in JSON.  
The two main functions are `json.dumps()` and `json.loads()`, mirroring the interface used in other serialization libraries, such as `pickle`.  
Here is how you turn a Python data structure into JSON:

In [13]:
import json

data = {
    'name' : 'ACME',
    'shares' : 100,
    'price': 542.23
}

json_str = json.dumps(data)
json_str

'{"name": "ACME", "shares": 100, "price": 542.23}'

In [14]:
type(json_str)

str

Now we can turn the JSON-encoded string back into a Python data structure:

In [15]:
data = json.loads(json_str); data

{'name': 'ACME', 'shares': 100, 'price': 542.23}

In [16]:
type(data)

dict

If you are working with files instead of strings, you can also use `json.dump()` and `json.load()` to encode and decode JSON data.

In [17]:
# Write the data
with open ('data.json', 'w') as f:
    json.dump(data, f)
    
# Read data back
with open('data.json', 'r') as f:
    data = json.load(f)
    
data

{'name': 'ACME', 'shares': 100, 'price': 542.23}

### Discussion  
JSON encoding supports the basic types of `None`, `bool`, `int`, `float`, and `str`, as well as lists, tuples, and dictionaries containing those types.  
For dictionaries, keys are assumed to be strings (any non-string keys in a dictionary are converted to strings during encoding).  
To be compliant with the JSON specification, you should only encode Python lists and dictionaries.  
Note that in web applications, it is also conventional for the top-level object to be a dictionary.  
The format of JSON encoding is almost identical to Python syntax except for a few minor changes.  
For instance, `True` is mapped to `true`, `False` is mapped to `false`, and `None` is mapped to `null`.

In [18]:
json.dumps(False)

'false'

In [19]:
d = {
    'a' : True,
    'b' : 'Hello',
    'c': None
}

json.dumps(d)

'{"a": true, "b": "Hello", "c": null}'
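
Two other points from the discussion above, the coercion of non-string keys and the handling of tuples, can be checked the same way. This is a small sketch with made-up values:

```python
import json

# Non-string keys are converted to strings during encoding
encoded = json.dumps({1: 'one', None: 'nothing'})
print(encoded)        # {"1": "one", "null": "nothing"}

# Tuples are written as JSON arrays, so they come back as lists
round_trip = json.loads(json.dumps((1, 2, 3)))
print(round_trip)     # [1, 2, 3]
```

Because of this, a round trip through JSON does not always return the exact structure you started with.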

If you are trying to examine data you have decoded from JSON, it can often be hard to ascertain its structure simply by printing it out, especially if the data contains a deep level of nested structures or a lot of fields.  
To assist with this, consider using the `pprint()` function in the `pprint` module.  
This will alphabetize the keys and output a dictionary in a more readable layout.  
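
As a brief illustration, the sketch below decodes a small, invented JSON document and formats it with `pformat()`, the string-returning counterpart of `pprint()`. The `width=40` setting is only there to force line wrapping on such a short example.

```python
import json
from pprint import pformat

# Hypothetical nested JSON document
text = '{"user": {"name": "ACME", "tags": ["a", "b"]}, "count": 2, "active": true}'
decoded = json.loads(text)

# pformat() returns the text that pprint() would print
text_out = pformat(decoded, width=40)
print(text_out)
```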

Normally, JSON decoding will create dicts or lists from the supplied data.  
If you want to create different kinds of objects, supply the `object_pairs_hook` or `object_hook` to `json.loads()`.  
Here is one way you can decode JSON data that preserves its order in an `OrderedDict`:

In [20]:
s = '{"name": "ACME", "shares": 50, "price": 490.1}'

from collections import OrderedDict

data = json.loads(s, object_pairs_hook=OrderedDict); data

OrderedDict([('name', 'ACME'), ('shares', 50), ('price', 490.1)])

You can also turn a JSON dictionary into a Python object:

In [21]:
class JSONObject:
    def __init__(self, d):
        self.__dict__ = d
        
        
data = json.loads(s, object_hook=JSONObject)
data.name, data.shares, data.price

('ACME', 50, 490.1)

In this last example, the dictionary created by decoding the JSON data is passed as a single argument to `__init__()`.  
From there, you can use it directly as the instance dictionary of the object.

There are a few options that can be useful for encoding JSON.  
If you would like the output to be nicely formatted, you can use the `indent` argument to `json.dumps()`.  
This causes the output to be pretty printed in a format similar to that with the `pprint()` function.  

In [22]:
with open('data.json', 'r') as f:
    data = json.load(f)
    
print(json.dumps(data))
print(json.dumps(data, indent=4))

{"name": "ACME", "shares": 100, "price": 542.23}
{
    "name": "ACME",
    "shares": 100,
    "price": 542.23
}


You can use the `sort_keys` argument to sort the keys alphabetically on output:

In [23]:
print(json.dumps(data, sort_keys=True))

{"name": "ACME", "price": 542.23, "shares": 100}


Instances of user-defined classes are not normally serializable as JSON.  
For example, passing the following `Point` instance to `json.dumps()` raises a `TypeError`:

In [24]:
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
p = Point(2, 3)

If you want to serialize instances, you can supply a function that takes an instance as input and returns a dictionary that can be serialized.

In [25]:
def serialize_instance(obj):
    d = { '__classname__' : type(obj).__name__ }
    d.update(vars(obj))
    return d

If you want to get an instance back, you could do this:

In [26]:
# Dictionary mapping names to known classes
classes = { 'Point' : Point }

def unserialize_object(d):
    clsname = d.pop('__classname__', None)
    if clsname:
        cls = classes[clsname]
        obj = cls.__new__(cls)  # Creates an instance without calling the __init__() method
        for key, value in d.items():
            setattr(obj, key, value)
        return obj  # Return only after every attribute has been set
    else:
        return d

In [27]:
p = Point(2,3)
s = json.dumps(p, default=serialize_instance); s

'{"__classname__": "Point", "x": 2, "y": 3}'

In [28]:
a = json.loads(s, object_hook=unserialize_object); a

<__main__.Point at 0x10825e748>

In [29]:
a.x

2

The `json` module has a variety of other options for controlling the low-level interpretation of numbers, special values such as `NaN`, and more.  
[The JavaScript Object Notation (JSON) Data Interchange Format](https://tools.ietf.org/html/rfc8259)  
[`json` — JSON encoder and decoder](https://docs.python.org/3.7/library/json.html)
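
Two of those options are sketched below: `parse_float` for controlling how numbers are decoded, and `allow_nan` for rejecting `NaN` on output. The sample values are invented for illustration.

```python
import json
from decimal import Decimal

# Decode JSON floats as Decimal instead of float
d = json.loads('{"price": 542.23}', parse_float=Decimal)
print(d['price'])             # 542.23 (a Decimal)

# By default NaN is written as the token NaN, which is not strict JSON
nan_text = json.dumps(float('nan'))
print(nan_text)               # NaN

# With allow_nan=False the encoder refuses instead
try:
    json.dumps(float('nan'), allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False
print(strict_ok)              # False
```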

## 6.3 Parsing Simple XML Data

The `xml.etree.ElementTree` module can be used to extract data from simple XML documents.  
To illustrate, suppose you want to parse and make a summary of the RSS feed on [Planet Python](https://planetpython.org/).  
The following code will do that.

In [30]:
from urllib.request import urlopen
from xml.etree.ElementTree import parse

# Download the RSS feed and parse it:
u = urlopen('https://planet.python.org/rss20.xml')
doc = parse(u); doc

<xml.etree.ElementTree.ElementTree at 0x1082bb198>

Now we can extract and output the tags that interest us:

In [31]:
for item in doc.iterfind('channel/item'):
    title = item.findtext('title')
    date = item.findtext('pubDate')
    link = item.findtext('link')
    print(title)
    print(date)
    print(link)
    print()

Tryton News: Newsletter March 2019
Fri, 01 Mar 2019 07:00:00 +0000
https://discuss.tryton.org/t/newsletter-march-2019/1104

Toshio Kuratomi: Managing vim8 plugins
Fri, 01 Mar 2019 02:02:50 +0000
https://anonbadger.wordpress.com/2019/02/28/managing-vim8-plugins/

Test and Code: 67: Teaching Python in Middle School
Thu, 28 Feb 2019 18:00:00 +0000
https://testandcode.com/67

Stack Abuse: Doubly Linked List with Python Examples
Thu, 28 Feb 2019 17:24:00 +0000
https://stackabuse.com/doubly-linked-list-with-python-examples/

PyCharm: PyCharm 2019.1 EAP 6
Thu, 28 Feb 2019 09:47:33 +0000
http://feedproxy.google.com/~r/Pycharm/~3/23NYxhDg7V4/

gamingdirectional: The mana detection mechanism
Thu, 28 Feb 2019 07:04:26 +0000
http://gamingdirectional.com/blog/2019/02/28/the-mana-detection-mechanism/

codingdirectional: Include the currency name into the forex application
Thu, 28 Feb 2019 04:30:39 +0000
http://codingdirectional.info/2019/02/28/include-the-currency-name-into-the-forex-application/


### Discussion

Working with data encoded as XML is commonplace in many applications.  
Not only is XML widely used as a format for exchanging data on the Internet, it is a common format for storing application data (e.g., word processing, music libraries, etc.).  
The discussion that follows assumes the reader is familiar with XML basics.

In many cases, when XML is simply being used to store data, the document structure is compact and straightforward.  
The `xml.etree.ElementTree.parse()` function parses the entire XML document into a document object.  
From there, you use methods such as `find()`, `iterfind()`, and `findtext()` to search for specific XML elements.  
The arguments to these functions are the names of specific tags, such as `channel/item` or `title`.  
When specifying tags, you need to take the overall document structure into account.  
Each find operation takes place relative to a starting element.  
Likewise, the tag name that you supply to each operation is also relative to the start.  
In the example, the call to `doc.iterfind('channel/item')` looks for all `item` elements under a `channel` element.  
`doc` represents the top of the document (the top-level `rss` element).  
The later calls to `item.findtext()` take place relative to the found `item` elements.  
Each element represented by the `ElementTree` module has a few essential attributes and methods that are useful when parsing.  
The `tag` attribute contains the name of the tag, the `text` attribute contains enclosed text, and the `get()` method can be used to extract attributes (if any).

In [32]:
doc

<xml.etree.ElementTree.ElementTree at 0x1082bb198>

In [33]:
e = doc.find('channel/title'); e

<Element 'title' at 0x1082eaf98>

In [34]:
e.tag

'title'

In [35]:
e.text

'Planet Python'
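
The `get()` method is not exercised by the cells above because the `title` element carries no attributes. A minimal sketch on an invented fragment (the tag and attribute names are assumptions, not taken from the feed):

```python
from xml.etree.ElementTree import fromstring

# Hypothetical fragment; the attribute names are for illustration only
elem = fromstring('<link rel="alternate" href="https://example.com/feed">Feed</link>')

print(elem.tag)                   # link
print(elem.text)                  # Feed
print(elem.get('href'))           # https://example.com/feed
print(elem.get('missing'))        # None
print(elem.get('missing', '-'))   # falls back to the supplied default
```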

It should be noted that `xml.etree.ElementTree` is not the only option for XML parsing.  
For more advanced applications, you might consider `lxml`.  
It uses the same programming interface as `ElementTree`, so the example shown in this recipe works in the same manner.  
You simply need to change the `ElementTree` import to `from lxml.etree import parse`.  
`lxml` provides the benefit of being fully compliant with XML standards.  
It is also extremely fast, and provides support for features such as validation, XSLT, and XPath.

## 6.4 Parsing Huge XML Files Incrementally 

### Problem

You need to extract data from a huge XML document while using as little memory as possible.

### Solution

Any time you are faced with the problem of incremental data processing, you should think of iterators and generators.  
Here is a simple function that can be used to incrementally process huge XML files using a very small memory footprint:  

In [36]:
from xml.etree.ElementTree import iterparse

def parse_and_remove(filename, path):
    path_parts = path.split('/')
    doc = iterparse(filename, ('start', 'end'))
    # Skip the root element:
    next(doc)
    
    tag_stack = []
    elem_stack = []
    for event, elem in doc:
        if event == 'start':
            tag_stack.append(elem.tag)
            elem_stack.append(elem)
        elif event == 'end':
            if tag_stack == path_parts:
                yield elem
                elem_stack[-2].remove(elem)
            try:
                tag_stack.pop()
                elem_stack.pop()
            except IndexError:
                pass

To test the function, you now need to find a large XML file to work with.  
You can often find such files on government and open data websites.  
For example, you can download [Chicago’s pothole database](https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Pot-Holes-Reported/7as2-ds3y) as XML.  
At the time of this writing, the downloaded file consists of more than 100,000 rows of data, which are encoded like this:

You could write a script that ranks ZIP codes by the number of pothole reports:
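
A minimal sketch of such a script follows. It assumes a layout in which each report is a `row` element nested under an outer `row`, with the ZIP code in a `zip` child; that layout, and the tiny inline sample standing in for the downloaded file, are assumptions for illustration and may not match the real export.

```python
import io
from collections import Counter
from xml.etree.ElementTree import parse

# Tiny stand-in for the downloaded file (hypothetical layout and values)
SAMPLE = b"""<response>
  <row>
    <row><zip>60618</zip></row>
    <row><zip>60647</zip></row>
    <row><zip>60618</zip></row>
  </row>
</response>"""

potholes_by_zip = Counter()

doc = parse(io.BytesIO(SAMPLE))          # builds the whole tree in memory
for pothole in doc.iterfind('row/row'):
    potholes_by_zip[pothole.findtext('zip')] += 1

for zipcode, num in potholes_by_zip.most_common():
    print(zipcode, num)
```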

The only problem with this script is that it reads and parses the entire XML file into memory.  
On our machine, it takes about 450 MB of memory to run.  
Using this recipe’s code, the program changes only slightly:
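
A sketch of the incremental variant is shown below. It repeats the `parse_and_remove()` helper so that the block runs on its own, and it uses the same assumed layout (`row/row` elements with a `zip` child) and an inline stand-in for the real file.

```python
import io
from collections import Counter
from xml.etree.ElementTree import iterparse

def parse_and_remove(source, path):
    """Yield elements matching path, detaching each one after it is consumed."""
    path_parts = path.split('/')
    doc = iterparse(source, ('start', 'end'))
    next(doc)                               # skip the root element
    tag_stack, elem_stack = [], []
    for event, elem in doc:
        if event == 'start':
            tag_stack.append(elem.tag)
            elem_stack.append(elem)
        elif event == 'end':
            if tag_stack == path_parts:
                yield elem
                elem_stack[-2].remove(elem)  # drop it from its parent
            try:
                tag_stack.pop()
                elem_stack.pop()
            except IndexError:
                pass

# Hypothetical sample data standing in for the large download
SAMPLE = b"""<response>
  <row>
    <row><zip>60618</zip></row>
    <row><zip>60647</zip></row>
    <row><zip>60618</zip></row>
  </row>
</response>"""

potholes_by_zip = Counter()
for pothole in parse_and_remove(io.BytesIO(SAMPLE), 'row/row'):
    potholes_by_zip[pothole.findtext('zip')] += 1

for zipcode, num in potholes_by_zip.most_common():
    print(zipcode, num)
```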

This version of the program has a memory footprint of only 7MB.

### Discussion

This recipe relies on two core features of the `ElementTree` module.  
First, the `iterparse()` function allows incremental processing of XML documents.  
To use it, you supply the filename along with an event list consisting of one or more of the following: `start`, `end`, `start-ns`, and `end-ns`.  
The iterator created by `iterparse()` produces tuples of the form `(event, elem)`, where `event` is one of the listed events and `elem` is the resulting XML element.
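
To see the order in which these tuples arrive, the sketch below runs `iterparse()` over a three-element document held in memory (the document itself is invented for illustration):

```python
import io
from xml.etree.ElementTree import iterparse

source = io.BytesIO(b'<a><b>text</b><c/></a>')

# Collect (event, tag) pairs in the order iterparse() produces them
events = [(event, elem.tag) for event, elem in iterparse(source, ('start', 'end'))]

for event, tag in events:
    print(event, tag)
# start a
# start b
# end b
# start c
# end c
# end a
```

Each element's `end` event arrives only after all of its children have ended, which is what makes the stack-based bookkeeping in this recipe work.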

`start` events are created when an element is first created but not yet populated with any other data (e.g., child elements).  
`end` events are created when an element is completed.  
Although not shown in this recipe, `start-ns` and `end-ns` events are used to handle XML namespace declarations.  
In this recipe, the start and end events are used to manage stacks of elements and tags.  
The stacks represent the current hierarchical structure of the document as it’s being parsed, and are also used to determine if an element matches the requested path given to the `parse_and_remove()` function.  
If a match is made, `yield` is used to emit it back to the caller.  
The following statement after the yield is the core feature of ElementTree that makes this recipe save memory:  

`elem_stack[-2].remove(elem)`

This statement causes the previously yielded element to be removed from its parent.  
Assuming that no references are left to it anywhere else, the element is destroyed and memory reclaimed.  
The end effect of the iterative parse and the removal of nodes is a highly efficient incremental sweep over the document.  
At no point is a complete document tree ever constructed.  
Yet, it is still possible to write code that processes the XML data in a straightforward manner.  
The primary downside to this recipe is its runtime performance.  
When tested, the version of code that reads the entire document into memory first runs approximately twice as fast as the version that processes it incrementally.  
However, it requires more than 60 times as much memory.  
So, if memory use is a greater concern, the incremental version is a big win.

## 6.5 Turning a Dictionary into XML

### Problem

You want to take the data in a Python dictionary and turn it into XML.

### Solution

Although the `xml.etree.ElementTree` library is commonly used for parsing, it can also be used to create XML documents.

In [37]:
from xml.etree.ElementTree import Element

def dict_to_xml(tag, d):
    """
    Turn a dict into XML
    """
    elem = Element(tag)
    for key, val in d.items():
        child = Element(key)
        child.text = str(val)
        elem.append(child)
    return elem

s = { 'name': 'GOOG', 'shares': 100, 'price':490.1 }
e = dict_to_xml('stock', s)
e

<Element 'stock' at 0x109430b88>

The result of this conversion is an `Element` instance.  
For I/O, it's easy to convert this instance to a byte string using the `tostring()` function in `xml.etree.ElementTree`.

In [38]:
from xml.etree.ElementTree import tostring 

tostring(e)

b'<stock><name>GOOG</name><shares>100</shares><price>490.1</price></stock>'

You can also attach attributes to an element using its `set()` method:

In [39]:
e.set('_id', '1234')
tostring(e)

b'<stock _id="1234"><name>GOOG</name><shares>100</shares><price>490.1</price></stock>'

If the order of the elements matters, you might make an `OrderedDict` instead of a normal dictionary, like in Recipe 1.7.

### Discussion

When creating XML, you might be inclined to just make strings instead:

In [40]:
def dict_to_xml_str(tag, d):
    """
    Turn a simple dict of key/value pairs into XML
    """
    parts = ['<{}>'.format(tag)]
    for key, val in d.items():
        parts.append('<{0}>{1}</{0}>'.format(key, val))
    parts.append('</{}>'.format(tag))
    return ''.join(parts)

However, if you try to do things manually, things can become messy.  
How do you deal with special characters?

In [41]:
d = { 'name' : '<spam>'}
# String creation:
dict_to_xml_str('item', d)

'<item><name><spam></name></item>'

In [42]:
# Proper XML creation:
e = dict_to_xml('item', d)
tostring(e)

b'<item><name>&lt;spam&gt;</name></item>'

Notice how in the latter example, the characters `<` and `>` got replaced with `&lt;` and `&gt;`.  
Just for reference, if you ever need to manually escape or unescape such characters, you can use the `escape()` and `unescape()` functions in `xml.sax.saxutils`.

In [43]:
from xml.sax.saxutils import escape, unescape

escape('<spam>')

'&lt;spam&gt;'

In [44]:
unescape(_)

'<spam>'

Aside from creating correct output, the other reason why it’s a good idea to create `Element` instances instead of strings is that they can be more easily combined together to make a larger document.  
The resulting `Element` instances can also be processed in various ways without ever having to worry about parsing the XML text.  
Essentially, you can do all of the processing of the data in a more high-level form and then output it as a string at the very end.
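
As a sketch of that idea, the block below builds several `stock` elements with `dict_to_xml()` (repeated here so the block is self-contained) and appends them to a wrapper element before serializing once at the end. The `portfolio` tag and the sample holdings are assumptions for illustration.

```python
from xml.etree.ElementTree import Element, tostring

def dict_to_xml(tag, d):
    """Turn a simple dict of key/value pairs into an Element."""
    elem = Element(tag)
    for key, val in d.items():
        child = Element(key)
        child.text = str(val)
        elem.append(child)
    return elem

holdings = [
    {'name': 'GOOG', 'shares': 100},
    {'name': 'AAPL', 'shares': 50},
]

portfolio = Element('portfolio')          # hypothetical wrapper tag
for h in holdings:
    portfolio.append(dict_to_xml('stock', h))

# Serialize only once, at the very end
xml_text = tostring(portfolio, encoding='unicode')
print(xml_text)
```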

## 6.6 Parsing, Modifying, and Rewriting XML

### Problem

You want to read an XML document, make changes to it, and then write it back out as XML.

### Solution

The `xml.etree.ElementTree` module makes it easy to perform such tasks.  
Essentially, you start out by parsing the document in the usual way.  
For example, suppose you have a document named `pred.xml` that looks like this:

We can use `ElementTree` to read it and make changes to the structure.

In [45]:
from xml.etree.ElementTree import parse, Element

doc = parse('pred.xml')
root = doc.getroot()
root

<Element 'stop' at 0x10945c958>

Let's make some changes to our XML file and see what happens:

In [46]:
# Remove a few elements
root.remove(root.find('sri'))
root.remove(root.find('cr'))
# Insert a new element after <nm>...</nm>
list(root).index(root.find('nm'))  # getchildren() was removed in Python 3.9

1

We can create a simple element that will be added to the file.

In [47]:
e = Element('spam')
e.text = 'This is a test'
root.insert(2, e)
# Write it to the file:
doc.write('newpred.xml', xml_declaration=True)

We have created a new XML file that looks like this:
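
The effect of these edits can be reproduced on a small stand-in document. The sketch below assumes a hypothetical `stop` document with `id`, `nm`, `sri`, `cr`, and `pre` children; it is not the actual content of `pred.xml`, only something with the same element names as the cells above.

```python
from xml.etree.ElementTree import Element, fromstring, tostring

# Hypothetical stand-in for pred.xml (values invented for illustration)
root = fromstring(
    '<stop><id>14791</id><nm>Clark and Balmoral</nm>'
    '<sri><rt>22</rt></sri><cr>22</cr>'
    '<pre><pt>5 MIN</pt></pre></stop>'
)

# Remove a few elements
root.remove(root.find('sri'))
root.remove(root.find('cr'))

# Insert a new element right after <nm>...</nm>
position = list(root).index(root.find('nm')) + 1
e = Element('spam')
e.text = 'This is a test'
root.insert(position, e)

result = tostring(root, encoding='unicode')
print(result)
```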

### Discussion

Modifying the structure of an XML document is straightforward, but you must remember that all modifications are generally made to the parent element, treating it as if it were a list.  
For example, if you remove an element, it is removed from its immediate parent using that parent’s `remove()` method.  
If you insert or append new elements, you also use `insert()` and `append()` methods on the parent.  
Elements can also be manipulated using indexing and slicing operations, such as `element[i]` or `element[i:j]`.  
If you need to make new elements, use the `Element` class, as shown in this recipe’s solution.  
A further description is available in Recipe 6.5.
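
The list-like behaviour mentioned above can be sketched on an invented element as follows:

```python
from xml.etree.ElementTree import fromstring, tostring

# Hypothetical element with four children
root = fromstring('<stop><id>1</id><nm>A</nm><pre>x</pre><pre>y</pre></stop>')

count = len(root)                              # number of direct children
first = root[0].tag                            # indexing
middle = [child.tag for child in root[1:3]]    # slicing returns a list of elements

del root[-1]                                   # delete the last child in place
remaining = tostring(root, encoding='unicode')

print(count, first, middle)
print(remaining)
```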

## 6.7 Parsing XML Documents with Namespaces

### Problem

You need to parse an XML document, but it uses XML namespaces.

### Solution

Look at how the following document uses namespaces:
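
A stand-in consistent with the queries in this recipe is given below as a Python string, so it can be parsed directly. The element names and the `author` and `title` text follow the outputs shown in this recipe; the `h1` text and the exact layout are assumptions, and this is not necessarily the content of `namespaces.xml`.

```python
from xml.etree.ElementTree import fromstring

# Hypothetical document: <html> and everything below it live in the XHTML namespace
NAMESPACES_XML = """<top>
  <author>David Beazley</author>
  <content>
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title>Hello World</title>
      </head>
      <body>
        <h1>Hello World!</h1>
      </body>
    </html>
  </content>
</top>"""

root = fromstring(NAMESPACES_XML)

author = root.findtext('author')                  # not in a namespace: found
plain = root.find('content/html')                 # unqualified name: not found
qualified = root.find('content/{http://www.w3.org/1999/xhtml}html')

print(author, plain, qualified.tag)
```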

If you parse this document and try to perform the usual queries, you'll find that it doesn't work so easily:

In [48]:
doc = parse('namespaces.xml')

Let's begin with some queries that actually work:

In [49]:
doc.findtext('author')

'David Beazley'

In [50]:
doc.find('content')

<Element 'content' at 0x109467638>

Now let's try some queries that don't go so well:

In [51]:
# A query involving a namespace:
doc.find('content/html')

Only a fully qualified query will work:

In [52]:
doc.find('content/{http://www.w3.org/1999/xhtml}html')

<Element '{http://www.w3.org/1999/xhtml}html' at 0x109467688>

In [53]:
# This one doesn't work either:
doc.findtext('content/{http://www.w3.org/1999/xhtml}html/head/title')

In [54]:
# Fully qualified:
doc.findtext('content/{http://www.w3.org/1999/xhtml}html/'\
             '{http://www.w3.org/1999/xhtml}head/{http://www.w3.org/1999/xhtml}title')

'Hello World'

One way that you can simplify things is to wrap namespace handling up into a utility class:

In [55]:
class XMLNamespaces:
    def __init__(self, **kwargs):
        self.namespaces = {}
        for name, uri in kwargs.items():
            self.register(name, uri)
    def register(self, name, uri):
        self.namespaces[name] = '{'+uri+'}'
    def __call__(self, path):
        return path.format_map(self.namespaces)

Now let's put our class to work making our lives easier:

In [56]:
ns = XMLNamespaces(html='http://www.w3.org/1999/xhtml')
doc.find(ns('content/{html}html'))

<Element '{http://www.w3.org/1999/xhtml}html' at 0x109467688>

In [57]:
doc.findtext(ns('content/{html}html/{html}head/{html}title'))

'Hello World'
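
As an alternative to a helper class, the `find()`, `findall()`, `findtext()`, and `iterfind()` methods in the standard library also accept a `namespaces` mapping from prefix to URI. The sketch below applies it to an inline document invented for illustration:

```python
from xml.etree.ElementTree import fromstring

# Hypothetical document with a default XHTML namespace on <html>
root = fromstring(
    '<top><content>'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Hello World</title></head>'
    '</html>'
    '</content></top>'
)

# Map a prefix of your choosing to the namespace URI
ns = {'html': 'http://www.w3.org/1999/xhtml'}

title = root.findtext('content/html:html/html:head/html:title', namespaces=ns)
print(title)    # Hello World
```

The prefix used in the query only needs to match the mapping; it does not have to match any prefix used in the document itself.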

### Discussion

Parsing XML documents that contain namespaces can be messy.  
The `XMLNamespaces` class is really just meant to clean it up slightly by allowing you to use the shortened namespace names in subsequent operations as opposed to fully qualified URIs.  
Unfortunately, there is no mechanism in the basic `ElementTree` parser to get further information about namespaces.  
However, you can get a bit more information about the scope of namespace processing if you’re willing to use the `iterparse()` function instead.

In [58]:
from xml.etree.ElementTree import iterparse

for evt, elem in iterparse('namespaces.xml'):
    print(evt, elem)

end <Element 'author' at 0x10946f9a8>
end <Element '{http://www.w3.org/1999/xhtml}title' at 0x10946fae8>
end <Element '{http://www.w3.org/1999/xhtml}head' at 0x10946fa98>
end <Element '{http://www.w3.org/1999/xhtml}h1' at 0x10946fb88>
end <Element '{http://www.w3.org/1999/xhtml}body' at 0x10946fb38>
end <Element '{http://www.w3.org/1999/xhtml}html' at 0x10946fa48>
end <Element 'content' at 0x10946f9f8>
end <Element 'top' at 0x10946f958>


In [59]:
# The top-most element:
elem

<Element 'top' at 0x10946f958>

As a final note, if the text you are parsing makes use of namespaces in addition to other advanced XML features, you’re really better off using the `lxml` library instead of `ElementTree`.  
For instance, `lxml` provides better support for validating documents against a DTD, more complete XPath support, and other advanced XML features.  
This recipe is really just a simple fix to make parsing a little easier.
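
Returning to `iterparse()`: namespace declarations are only reported when the `start-ns` and `end-ns` events are requested explicitly. The sketch below does so on a small inline document (invented for illustration) and keeps just the namespace events:

```python
import io
from xml.etree.ElementTree import iterparse

source = io.BytesIO(
    b'<top><content>'
    b'<html xmlns="http://www.w3.org/1999/xhtml"><head/></html>'
    b'</content></top>'
)

ns_events = []
for event, elem in iterparse(source, ('end', 'start-ns', 'end-ns')):
    if event in ('start-ns', 'end-ns'):
        # For start-ns, elem is a (prefix, uri) tuple rather than an Element
        ns_events.append((event, elem))

for item in ns_events:
    print(item)
```

For a default namespace the prefix is reported as an empty string.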

## 6.8 Interacting with a Relational Database

### Problem

You need to select, insert, or delete rows in a relational database.

### Solution

A common way of representing rows of data in Python is as a sequence of tuples.

In [60]:
stocks = [
        ('GOOG', 100, 490.1),
        ('AAPL', 50, 545.75),
        ('FB', 150, 7.45),
        ('HPQ', 75, 33.2),
]
stocks

[('GOOG', 100, 490.1),
 ('AAPL', 50, 545.75),
 ('FB', 150, 7.45),
 ('HPQ', 75, 33.2)]

Given data in this form, it is relatively straightforward to interact with a relational database using Python’s standard database API, as described in PEP [249](https://www.python.org/dev/peps/pep-0249/).  
If you want to take a closer look at interfacing with databases, check out the [Database Topic Guide](https://wiki.python.org/moin/DatabaseProgramming).  
The gist of the API is that all operations on the database are carried out by SQL queries.  
Each row of input or output data is represented by a tuple.  
To illustrate, you can use the `sqlite3` module that comes with Python.  
If you are using a different database such as MySQL or PostgreSQL, or connecting through ODBC, you’ll have to install a third-party module to support it.  
The first step is to connect to the database.  
You can start by calling the `connect()` function, supplying parameters such as the name of the database, hostname, username, password, and other details as needed.

In [61]:
import sqlite3

db = sqlite3.connect('database.db')

Now you can create a cursor and begin working with the data.  
Let's execute a few SQL queries.

To insert a sequence of rows into the data, use a statement like this:
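
A minimal sketch of both steps, creating the table and inserting the rows, is shown below. It uses an in-memory database so that it runs on its own, and the column names `symbol` and `shares` are assumptions (only `price` appears in the queries of this recipe).

```python
import sqlite3

stocks = [
    ('GOOG', 100, 490.1),
    ('AAPL', 50, 545.75),
    ('FB', 150, 7.45),
    ('HPQ', 75, 33.2),
]

# ':memory:' keeps the example self-contained; use a filename for a real database
db = sqlite3.connect(':memory:')
c = db.cursor()

# Create the table (column names are assumed for illustration)
c.execute('create table portfolio (symbol text, shares integer, price real)')

# Insert every tuple in the sequence, one row per tuple
c.executemany('insert into portfolio values (?,?,?)', stocks)

# Make the changes permanent
db.commit()
```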

To perform a query, use a statement like this:

In [62]:
for row in db.execute('select * from portfolio'):
    print(row)

('GOOG', 100, 490.1)
('AAPL', 50, 545.75)
('FB', 150, 7.45)
('HPQ', 75, 33.2)


If you want to perform queries that accept user-supplied input parameters, make sure you pass the parameters separately using `?` placeholders, like this:

In [63]:
min_price = 100
for row in db.execute('select * from portfolio where price >= ?', (min_price,)):
    print(row)

('GOOG', 100, 490.1)
('AAPL', 50, 545.75)


### Discussion

At a low level, interacting with a database is an extremely straightforward thing to do.  
You simply form SQL statements and feed them to the underlying module to either update the database or retrieve data.  
That said, there are still some tricky details you’ll need to sort out on a case-by-case basis.  
One complication is the mapping of data from the database into Python types.  
For entries such as dates, it is most common to use `datetime` instances from the `datetime` module, or possibly system timestamps, as used in the `time` module.  
For numerical data, especially financial data involving decimals, numbers may be represented as `Decimal` instances from the `decimal` module.  
Unfortunately, the exact mapping varies by database backend so you’ll have to read the associated documentation. 

Another extremely critical complication concerns the formation of SQL statement strings.  
You should never use Python string formatting operators (`%`) or the `.format()` method to create such strings.  
If the values provided to such formatting operators are derived from user input, this opens up your program to an [SQL-injection attack](https://xkcd.com/327/).  
The special `?` wildcard in queries instructs the database backend to use its own string substitution mechanism, which *should* do it safely.
However, there is some inconsistency across database backends with respect to the wildcard.  
Many modules use `?` or `%s`, while others may use a different symbol, such as `:0` or `:1`, to refer to parameters.  
Again, you’ll have to consult the documentation for the database module you’re using.  
The `paramstyle` attribute of a database module also contains information about the quoting style.

For simply pulling data in and out of a database table, using the database API is usually simple enough.  
If you’re doing something more complicated, it may make sense to use a higher-level interface, such as that provided by an object-relational mapper.  
Libraries such as [SQLAlchemy](https://www.sqlalchemy.org/) allow database tables to be described as Python classes and for database operations to be carried out while hiding most of the underlying SQL.
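
To make the earlier point about placeholders concrete, here is a small sketch that contrasts string formatting with the `?` placeholder. It assumes a throwaway in-memory `sqlite3` database, a two-row `portfolio` table, and a made-up hostile input string. The unsafe query is included only for contrast.

```python
import sqlite3

# The DB-API paramstyle attribute names the placeholder convention;
# for sqlite3 it is 'qmark', meaning the ? style
style = sqlite3.paramstyle

# Assumption: a throwaway in-memory database with two sample rows
db = sqlite3.connect(':memory:')
db.execute('create table portfolio (symbol text, shares integer, price real)')
db.executemany('insert into portfolio values (?,?,?)',
               [('GOOG', 100, 490.1), ('FB', 150, 7.45)])

# Made-up hostile input that changes the meaning of a formatted query
user_input = "GOOG' OR '1'='1"

# UNSAFE (for contrast only): the input becomes part of the SQL text,
# so the WHERE clause ends up matching every row
unsafe_sql = "select * from portfolio where symbol = '%s'" % user_input
unsafe_rows = db.execute(unsafe_sql).fetchall()

# SAFE: the value is passed separately and is treated purely as data,
# so no symbol matches the odd-looking string
safe_rows = db.execute('select * from portfolio where symbol = ?',
                       (user_input,)).fetchall()

print(style, len(unsafe_rows), len(safe_rows))
```

With other database modules the placeholder syntax may differ, so check the module's `paramstyle` before copying queries between backends.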

## 6.9. Decoding and Encoding Hexadecimal Digits

### Problem

You need to decode a string of hexadecimal digits into a byte string or encode a byte string as hex.

### Solution

If you just need to decode or encode a raw string of hex digits, use the `binascii` module.

In [64]:
s = b'hello'
# Encode as hex
import binascii
h = binascii.b2a_hex(s)
h

b'68656c6c6f'

In [65]:
# Decode to bytes
binascii.a2b_hex(h)

b'hello'

You can also find similar functionality in the `base64` module.

In [66]:
import base64

h = base64.b16encode(s)
h

b'68656C6C6F'

In [67]:
base64.b16decode(h)

b'hello'

### Discussion

For the most part, converting to and from hex is straightforward using the functions shown.  
The main difference between the two techniques is in case folding.  
The `base64.b16decode()` and `base64.b16encode()` functions operate only with uppercase hexadecimal letters by default (`b16decode()` accepts lowercase input only if you pass `casefold=True`), whereas the functions in `binascii` work with either case.
It’s also important to note that the output produced by the encoding functions is always a byte string.  
To coerce it to Unicode for output, you may need to add an extra decoding step.

In [68]:
h = base64.b16encode(s)
h

b'68656C6C6F'

In [69]:
h.decode('ascii')

'68656C6C6F'

When decoding hex digits, the `b16decode()` and `a2b_hex()` functions accept either bytes or Unicode strings.  
However, those strings must only contain ASCII-encoded hexadecimal digits.
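
The case-folding difference described in this discussion can be checked directly. The following sketch feeds the same lowercase hex digits to both modules. The variable names are chosen only for illustration.

```python
import base64
import binascii

lower = b'68656c6c6f'

# binascii accepts lowercase (and uppercase) hex digits
from_binascii = binascii.a2b_hex(lower)

# base64.b16decode() rejects lowercase digits unless told otherwise
try:
    base64.b16decode(lower)
    rejected = False
except binascii.Error:
    rejected = True

# Passing casefold=True makes b16decode() accept lowercase input
from_base64 = base64.b16decode(lower, casefold=True)

print(from_binascii, rejected, from_base64)
```

If your input may arrive in either case, `binascii.a2b_hex()` or `b16decode(..., casefold=True)` avoids the need to normalize it first.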

## 6.10. Decoding and Encoding Base64

### Problem

You need to decode or encode binary data using Base64 encoding.

### Solution

The `base64` module has two functions -- `b64encode()` and `b64decode()` -- that do exactly what you want.

In [70]:
s = b'experiment'
import base64
a = base64.b64encode(s)
a

b'ZXhwZXJpbWVudA=='

In [71]:
base64.b64decode(a)

b'experiment'

### Discussion

Base64 encoding is only meant to be used on byte-oriented data such as byte strings and byte arrays.  
Moreover, the output of the encoding process is always a byte string.  
If you are mixing Base64-encoded data with Unicode text, you may have to perform an extra decoding step.

In [72]:
a = base64.b64encode(s).decode('ascii')
a

'ZXhwZXJpbWVudA=='

When decoding Base64, both byte strings and Unicode text strings can be supplied.  
However, Unicode strings can only contain ASCII characters.
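
As a brief illustration of that last point, the sketch below decodes a text string and then tries a string containing a non-ASCII character. The second input is invented for the example.

```python
import base64

# A Unicode text string containing only ASCII characters decodes fine
decoded = base64.b64decode('ZXhwZXJpbWVudA==')

# A text string with a non-ASCII character is rejected with ValueError
try:
    base64.b64decode('ZXhwZXJpbWVudA==\u00e9')
    accepted = True
except ValueError:
    accepted = False

print(decoded, accepted)
```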

## 6.11. Reading and Writing Binary Arrays of Structures

### Problem

You want to read or write data encoded as a binary array of uniform structures into Python tuples.

### Solution

If you are working with binary data, use the `struct` module.  
Here is an example that writes a list of Python tuples to a binary file, encoding each tuple as a structure using `struct`.
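
The code for this step is not shown here, so the following is a minimal sketch of what it might look like. The function name `write_records` is an assumption. The sample records and the `data.b` filename match the values that appear in the output later in this recipe.

```python
from struct import Struct

def write_records(records, format, f):
    """
    Write a sequence of tuples to a binary file of structures.
    """
    record_struct = Struct(format)
    for r in records:
        # Each tuple is packed into one fixed-size binary record
        f.write(record_struct.pack(*r))

records = [
    (1, 2.3, 4.5),
    (6, 7.8, 9.0),
    (12, 13.4, 56.7),
]

# '<idd' = little-endian 32-bit int followed by two 64-bit floats
with open('data.b', 'wb') as f:
    write_records(records, '<idd', f)
```

Each record occupies 20 bytes with this format, so the resulting file is 60 bytes long.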

There are several approaches for reading this file back into a list of tuples.  
Here is one way to read the file:
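
The original listing is missing at this point. A sketch of an incremental reader, consistent with the `read_records()` function discussed later in this recipe, might look like this. An in-memory buffer stands in for the file so that the example is self-contained.

```python
import io
from struct import Struct

def read_records(format, f):
    record_struct = Struct(format)
    # iter(callable, sentinel) keeps calling f.read() until it returns b''
    chunks = iter(lambda: f.read(record_struct.size), b'')
    # Lazily unpack each fixed-size chunk into a tuple
    return (record_struct.unpack(chunk) for chunk in chunks)

# Assumption: an in-memory buffer stands in for the file 'data.b'
packer = Struct('<idd')
f = io.BytesIO(packer.pack(1, 2.3, 4.5) + packer.pack(6, 7.8, 9.0))

result = list(read_records('<idd', f))
for rec in result:
    print(rec)
```

With a real file you would pass the object returned by `open('data.b', 'rb')` instead of the `BytesIO` buffer.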

You can also read the file entirely into a byte string with a single read and convert it piece by piece:
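
The listing for this variant is also missing. The sketch below follows the `unpack_records()` approach described in the discussion, using `unpack_from()` with a running byte offset. The byte string is built in memory here and stands in for the result of a single `f.read()`.

```python
from struct import Struct

def unpack_records(format, data):
    record_struct = Struct(format)
    # unpack_from() reads fields at the given byte offset without slicing
    return (record_struct.unpack_from(data, offset)
            for offset in range(0, len(data), record_struct.size))

# Assumption: these bytes stand in for the contents of 'data.b'
packer = Struct('<idd')
data = packer.pack(1, 2.3, 4.5) + packer.pack(6, 7.8, 9.0)

result = list(unpack_records('<idd', data))
for rec in result:
    print(rec)
```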

In both cases the result is an iterable that produces the tuples originally stored when the file was created.

### Discussion

For programs that must encode and decode binary data, it is common to use the `struct` module.  
To declare a new structure, simply create an instance of `Struct` such as:
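
The declaration itself is missing here. Judging from the interactive session later in this discussion, it is presumably something like the following.

```python
from struct import Struct

# '<' = little endian, 'i' = 32-bit integer, 'd' = 64-bit float
record_struct = Struct('<idd')
```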

Structures are always defined using a set of structure codes such as `i`, `d`, `f`, and so forth.  
These codes correspond to specific binary data types such as 32-bit integers, 64-bit floats, 32-bit floats, and so forth.  
The `<` in the first character specifies the byte ordering.  
In this example, it is indicating little endian.  
Change the character to `>` for big endian or `!` for network byte order.  
The resulting `Struct` instance has various attributes and methods for manipulating structures of that type.  
The `size` attribute contains the size of the structure in bytes, which is useful to have in I/O operations.  
`pack()` and `unpack()` methods are used to pack and unpack data.

In [73]:
from struct import Struct
record_struct = Struct('<idd')
record_struct.size

20

In [74]:
record_struct.pack(1, 2.0, 3.0)

b'\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@'

In [75]:
record_struct.unpack(_)

(1, 2.0, 3.0)

Sometimes you’ll see the `pack()` and `unpack()` operations called as module-level functions:

In [76]:
import struct
struct.pack('<idd', 1, 2.0, 3.0)

b'\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@'

In [77]:
struct.unpack('<idd', _)

(1, 2.0, 3.0)

This works, but feels less elegant than creating a single `Struct` instance, especially if the same structure appears in multiple places in your code.  
By creating a `Struct` instance, the format code is specified only once and all of the useful operations are grouped together.  
This makes your code easier to maintain if you need to fiddle with the structure code, because you only have to change it in one place.

The code for reading binary structures involves a number of programming idioms.  
In the `read_records()` function, `iter()` is being used to make an iterator that returns fixed-sized chunks.  
This iterator repeatedly calls a user-supplied callable (e.g., `lambda: f.read(record_struct.size)`) until it returns a specified value (e.g., `b''`), at which point iteration stops.

In [78]:
f = open('data.b', 'rb')
chunks = iter(lambda: f.read(20), b'')
chunks

<callable_iterator at 0x10825eb00>

In [79]:
for chk in chunks:
    print(chk)

b'\x01\x00\x00\x00ffffff\x02@\x00\x00\x00\x00\x00\x00\x12@'
b'\x06\x00\x00\x00333333\x1f@\x00\x00\x00\x00\x00\x00"@'
b'\x0c\x00\x00\x00\xcd\xcc\xcc\xcc\xcc\xcc*@\x9a\x99\x99\x99\x99YL@'


One reason for creating an iterable is that it nicely allows records to be created using a generator expression, as shown in the solution.  
If you didn’t use this approach, the code might look like this:

In [80]:
def read_records(format, f):
    record_struct = Struct(format)
    while True:
        chk = f.read(record_struct.size)
        # An empty read signals the end of the file
        if chk == b'':
            break
        yield record_struct.unpack(chk)

In the `unpack_records()` function, a different approach using the `unpack_from()` method is used.  
`unpack_from()` is a useful method for extracting binary data from a larger binary array, because it does so without making any temporary objects or memory copies.  
You just give it a byte string (or any array) along with a byte offset, and it will unpack fields directly from that location.  
If you used `unpack()` instead of `unpack_from()`, you would need to modify the code to make a lot of small slices and offset calculations.

In [81]:
def unpack_records(format, data):
    record_struct = Struct(format)
    return (record_struct.unpack(data[offset:offset + record_struct.size])
           for offset in range(0, len(data), record_struct.size))

In addition to being more complicated to read, this version also requires a lot more work, as it performs various offset calculations, copies data, and makes small slice objects.  
If you’re going to be unpacking a lot of structures from a large byte string you’ve already read, `unpack_from()` is a more elegant approach.  
Unpacking records is one place where you might want to use `namedtuple` objects from the `collections` module.  
This allows you to set attribute names on the returned tuples.
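
As a rough sketch of that idea, the example below wraps each unpacked tuple in a named tuple. The field names `kind`, `x`, and `y` are hypothetical, and an in-memory buffer stands in for a real binary file.

```python
import io
from collections import namedtuple
from struct import Struct

# Hypothetical field names chosen for illustration
Record = namedtuple('Record', ['kind', 'x', 'y'])

record_struct = Struct('<idd')
# Assumption: in-memory data stands in for the contents of a binary file
f = io.BytesIO(record_struct.pack(1, 2.3, 4.5) +
               record_struct.pack(6, 7.8, 9.0))

# Read fixed-size chunks until an empty read, naming the fields of each record
chunks = iter(lambda: f.read(record_struct.size), b'')
records = [Record(*record_struct.unpack(chunk)) for chunk in chunks]

for r in records:
    print(r.kind, r.x, r.y)
```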

If you’re writing a program that needs to work with a large amount of binary data, you may be better off using a library such as `numpy`.  
For example, instead of reading a binary file into a list of tuples, you could read it into a structured array, like this:

In [82]:
import numpy as np
f = open('data.b', 'rb')
records = np.fromfile(f, dtype='<i,<d,<d')
records

array([( 1,  2.3,  4.5), ( 6,  7.8,  9. ), (12, 13.4, 56.7)],
      dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8')])

Last, but not least, if you’re faced with the task of reading binary data in some known file format (e.g., image formats, shape files, HDF5, etc.), check to see if a Python module already exists for it.  
There’s no reason to reinvent the wheel if you don’t have to.

## 6.12. Reading Nested and Variable-Sized Binary Structures

### Problem

You need to read complicated binary-encoded data that contains a collection of nested and/or variable-sized records.  
Such data might include images, video, shapefiles, and so on.  

### Solution

The `struct` module can be used to decode and encode almost any kind of binary data structure.  
To illustrate the kind of data in question here, suppose you have this Python data structure representing a collection of points that make up a series of polygons:

In [83]:
polys = [
        [ (1.0, 2.5), (3.5, 4.0), (2.5, 1.5) ],
        [ (7.0, 1.2), (5.1, 3.0), (0.5, 7.5), (0.8, 9.0) ],
        [ (3.4, 6.3), (1.2, 0.5), (4.6, 9.2) ],
        ]

Now suppose this data were to be encoded into a binary file that starts with the following header:

| Byte | Type   | Description                        |
|------|--------|------------------------------------|
| 0    | int    | File code (0x1234 little endian)   |
| 4    | double | Minimum x (little endian)          |
| 12   | double | Minimum y (little endian)          |
| 20   | double | Maximum x (little endian)          |
| 28   | double | Maximum y (little endian)          |
| 36   | int    | Number of polygons (little endian) |

The header is followed by a series of polygon records, each encoded as follows:

| Byte | Type   | Description                                             |
|------|--------|---------------------------------------------------------|
| 0    | int    | Record length in bytes (N), including this length field |
| 4-N  | Points | Pairs of (X,Y) coordinates as doubles                   |

Let's see what we can build:

In [84]:
import struct 
import itertools

def write_polys(filename, polys):
    # Determine bounding box
    flattened = list(itertools.chain(*polys))
    min_x = min(x for x, y in flattened)
    max_x = max(x for x, y in flattened)
    min_y = min(y for x, y in flattened)
    max_y = max(y for x, y in flattened)
    
    with open(filename, 'wb') as f:
        f.write(struct.pack('<iddddi',
                           0x1234,
                           min_x, min_y,
                           max_x, max_y,
                           len(polys)))
        for poly in polys:
            size = len(poly) * struct.calcsize('<dd')
            f.write(struct.pack('<i', size+4))
            for pt in poly:
                f.write(struct.pack('<dd', *pt))
                
# Call it with our polygon data:
write_polys('polys.bin', polys)

We can use the `struct.unpack` function to read back the data.  
Basically we reverse the operations performed during writing:

In [85]:
import struct

def read_polys(filename):
    with open(filename, 'rb') as f:
        # Read the header:
        header = f.read(40)
        file_code, min_x, min_y, max_x, max_y, num_polys = \
            struct.unpack('<iddddi', header)
        
        polys = []
        for n in range(num_polys):
            pbytes, = struct.unpack('<i', f.read(4))
            poly = []
            for m in range(pbytes // 16):
                pt = struct.unpack('<dd', f.read(16))
                poly.append(pt)
            polys.append(poly)
    return polys


Although this code works, it’s also a rather messy mix of small reads, struct unpacking, and other details.  
If code like this is used to process a real data file, it can quickly become even worse.  
Thus, it’s an obvious candidate for an alternative solution that might simplify some of the steps and free the programmer to focus on more important matters.  
In the remainder of this recipe, a rather advanced solution for interpreting binary data will be built up in pieces.  
The goal will be to allow a programmer to provide a high-level specification of the file format, and to simply have the details of reading and unpacking all of the data worked out under the covers.  
As a forewarning, the code that follows may be the most advanced example in this entire book, utilizing various object-oriented programming and metaprogramming techniques.  
Be sure to carefully read the discussion section as well as cross-references to other recipes.  

First, when reading binary data, it is common for the file to contain headers and other data structures.  
Although the struct module can unpack this data into a tuple, another way to represent such information is through the use of a class.  
Here’s some code that allows just that:

In [86]:
import struct

class StructField:
    """
    Descriptor that represents a simple structure field
    """
    def __init__(self, format, offset):
        self.format = format
        self.offset = offset
    def __get__(self, instance, cls):
        if instance is None:
            return self
        else:
            r = struct.unpack_from(self.format, instance._buffer, self.offset)
            return r[0] if len(r) == 1 else r

class Structure:
    def __init__(self, bytedata):
        self._buffer = memoryview(bytedata)

This code uses a descriptor to represent each structure field.  
Each descriptor contains a `struct`-compatible format code along with a byte offset into an underlying memory buffer.  
In the `__get__()` method, the `struct.unpack_from()` function is used to unpack a value from the buffer without having to make extra slices or copies.  
The `Structure` class just serves as a base class that accepts some byte data and stores it as the underlying memory buffer used by the `StructField` descriptor.  
The use of a `memoryview()` in this class serves a purpose that will become clear later.  
Using this code, you can now define a structure as a high-level class that mirrors the information found in the tables that described the expected file format.

In [87]:
class PolyHeader(Structure):
    file_code = StructField('<i', 0)
    min_x = StructField('<d', 4)
    min_y = StructField('<d', 12)
    max_x = StructField('<d', 20)
    max_y = StructField('<d', 28)
    num_polys = StructField('<i', 36)

We can use this class to read the header from the polygon data written earlier:

In [88]:
f = open('polys.bin', 'rb')
phead = PolyHeader(f.read(40))

In [89]:
phead.file_code == 0x1234

True

In [90]:
phead.min_x

0.5

In [91]:
phead.min_y

0.5

In [92]:
phead.max_x

7.0

In [93]:
phead.max_y

9.2

In [94]:
phead.num_polys

3

This is interesting, but there are a number of annoyances with this approach.  
For one, even though you get the convenience of a class-like interface, the code is rather verbose and requires the user to specify a lot of low-level detail, with repeated uses of `StructField` and explicit offsets.  
The resulting class is also missing common conveniences such as providing a way to compute the total size of the structure.  
Any time you are faced with class definitions that are overly verbose like this, you might consider the use of a class decorator or metaclass.  
One of the features of a metaclass is that it can be used to fill in a lot of low-level implementation details, taking that burden off of the user.  
As an example, consider this metaclass and slight reformulation of the `Structure` class:

In [95]:
class StructureMeta(type):
    """
    Metaclass that automatically creates StructField descriptors.
    """
    def __init__(self, clsname, bases, clsdict):
        fields = getattr(self, '_fields_', [])
        byte_order = ''
        offset = 0
        for format, fieldname in fields:
            if format.startswith(('<','>','!','@')):
                byte_order = format[0]
                format = format[1:]
            format = byte_order + format
            setattr(self, fieldname, StructField(format, offset))
            offset += struct.calcsize(format)
        setattr(self, 'struct_size', offset)
        
class Structure(metaclass=StructureMeta):
    def __init__(self, bytedata):
        self._buffer = memoryview(bytedata)
        
    @classmethod
    def from_file(cls, f):
        return cls(f.read(cls.struct_size))

Using this new `Structure` class, you can now write a structure definition:

In [96]:
class PolyHeader(Structure):
    _fields_ = [
        ('<i', 'file_code'),
        ('d', 'min_x'),
        ('d', 'min_y'),
        ('d', 'max_x'),
        ('d', 'max_y'),
        ('i', 'num_polys')
    ]

As you can see, the specification is a lot less verbose.  
The added `from_file()` class method also makes it easier to read the data from a file without knowing any details about the size or structure of the data.

In [97]:
f = open('polys.bin', 'rb')
phead = PolyHeader.from_file(f)
phead.file_code == 0x1234

True

In [98]:
phead.min_x

0.5

In [99]:
phead.min_y

0.5

In [100]:
phead.max_x

7.0

In [101]:
phead.num_polys

3

Once you introduce a metaclass into the mix, you can build more intelligence into it. 
For example, suppose you want to support nested binary structures.  
Here's a reformulation of the metaclass along with a new supporting descriptor that allows it:

In [103]:
class NestedStruct:
    """
    Descriptor that represents a nested structure.
    """
    def __init__(self, name, struct_type, offset):
        self.name = name
        self.struct_type = struct_type
        self.offset = offset
    def __get__(self, instance, cls):
        if instance is None:
            return self
        else:
            data = instance._buffer[self.offset: self.offset+self.struct_type.struct_size]
            result = self.struct_type(data)
            # Save the resulting structure for later:
            setattr(instance, self.name, result)
            return result
        
class StructureMeta(type):
    """
    This metaclass automatically creates StructField descriptors.
    """
    def __init__(self, clsname, bases, clsdict):
        fields = getattr(self, '_fields_', [])
        byte_order = ''
        offset = 0
        for format, fieldname in fields:
            if isinstance(format, StructureMeta):
                setattr(self, fieldname, NestedStruct(fieldname, format, offset))
                offset += format.struct_size
            else:
                if format.startswith(('<','>','!','@')):
                    byte_order = format[0]
                    format = format[1:]
                format = byte_order + format
                setattr(self, fieldname, StructField(format, offset))
                offset += struct.calcsize(format)
        setattr(self, 'struct_size', offset)

In this code, the `NestedStruct` descriptor is used to overlay another structure definition over a region of memory.  
It does this by taking a slice of the original memory buffer and using it to instantiate the given structure type.  
Since the underlying memory buffer was initialized as a memoryview, this slicing does not create a copy in a different memory address.  
Instead, it's placed on top of the original memory.  
In order to avoid repeated instantiations, the descriptor then stores the resulting inner structure object on the instance using the same technique described in Recipe 8.10.

Now we can write the code for our `Point` and `PolyHeader` classes:
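
The class definitions are missing at this point, so the following is a sketch of how they might look, with the bounding box expressed as two nested `Point` structures. The field names `min` and `max` are assumptions. The supporting classes are repeated in condensed form so that the example runs on its own, and the header bytes are built in memory instead of being read from `polys.bin`.

```python
import struct

class StructField:
    """Descriptor that represents a simple structure field."""
    def __init__(self, format, offset):
        self.format = format
        self.offset = offset
    def __get__(self, instance, cls):
        if instance is None:
            return self
        r = struct.unpack_from(self.format, instance._buffer, self.offset)
        return r[0] if len(r) == 1 else r

class NestedStruct:
    """Descriptor that represents a nested structure."""
    def __init__(self, name, struct_type, offset):
        self.name = name
        self.struct_type = struct_type
        self.offset = offset
    def __get__(self, instance, cls):
        if instance is None:
            return self
        end = self.offset + self.struct_type.struct_size
        result = self.struct_type(instance._buffer[self.offset:end])
        # Cache the nested structure on the instance for later lookups
        setattr(instance, self.name, result)
        return result

class StructureMeta(type):
    """Metaclass that creates the field descriptors from _fields_."""
    def __init__(self, clsname, bases, clsdict):
        fields = getattr(self, '_fields_', [])
        byte_order = ''
        offset = 0
        for format, fieldname in fields:
            if isinstance(format, StructureMeta):
                setattr(self, fieldname, NestedStruct(fieldname, format, offset))
                offset += format.struct_size
            else:
                if format.startswith(('<', '>', '!', '@')):
                    byte_order = format[0]
                    format = format[1:]
                format = byte_order + format
                setattr(self, fieldname, StructField(format, offset))
                offset += struct.calcsize(format)
        setattr(self, 'struct_size', offset)

class Structure(metaclass=StructureMeta):
    def __init__(self, bytedata):
        self._buffer = memoryview(bytedata)

class Point(Structure):
    _fields_ = [
        ('<d', 'x'),
        ('d', 'y'),
    ]

class PolyHeader(Structure):
    _fields_ = [
        ('<i', 'file_code'),
        (Point, 'min'),        # assumed name for the (min_x, min_y) pair
        (Point, 'max'),        # assumed name for the (max_x, max_y) pair
        ('i', 'num_polys'),
    ]

# Assumption: header bytes built in memory with the layout described earlier
header = struct.pack('<iddddi', 0x1234, 0.5, 0.5, 7.0, 9.2, 3)
phead = PolyHeader(header)
print(phead.file_code == 0x1234, phead.min.x, phead.max.y, phead.num_polys)
```

Because `Point` is itself created by `StructureMeta`, the metaclass recognizes it as a nested structure and advances the offset by `Point.struct_size`, so `PolyHeader.struct_size` still comes out to 40 bytes.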

At this point, a framework for dealing with fixed-sized records has been developed, but what about the variable-sized components?  
For example, the remainder of the polygon file contains sections of variable size.  
One way to handle this is to write a class that simply represents a chunk of binary data along with a utility function for interpreting the contents in different ways.  
This is closely related to the code in Recipe 6.11: