# Chapter 6  
# Data Encoding and Processing

The main focus of this chapter is using Python to process data presented in different kinds of common encodings, such as CSV files, JSON, XML, and binary packed records.  
Unlike the chapter on data structures, this chapter is not focused on specific algorithms, but instead on the problem of getting data in and out of a program.

## 6.1 Reading and Writing CSV Data

If you want to read or write data encoded as a CSV file, you can use Python's `csv` library.  
We will use some stock market data from a CSV file for this example.

You can read the data as a sequence of rows, each one a list of strings:

In [1]:
import csv

with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)
    for row in f_csv:
        # Process row
        # ... and so forth
        pass

In the preceding code, `row` will be a list of strings.  
Thus, to access certain fields, you need to use indexing, such as `row[0]` (Symbol) and `row[4]` (Change).  
Since such indexing can be confusing, this is one place where you might want to consider named tuples.

In [2]:
from collections import namedtuple
with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headings = next(f_csv)
    Row = namedtuple('Row', headings)
    for r in f_csv:
        row = Row(*r)
        # Process row
        # ... and so forth
        pass

This would allow you to use the column headers such as `row.Symbol` and `row.Change` instead of indices.  
This only works if the column headers are valid Python identifiers.  
If they are not, you might have to massage the initial headings (e.g., replacing nonidentifier characters with underscores or similar).  
Another approach allows you to read the data as a sequence of dictionaries instead.

In [3]:
import csv

with open('stocks.csv') as f:
    f_csv = csv.DictReader(f)
    for row in f_csv:
        # Do something ...
        pass

In this version, you access the elements of each row using the column headers.  
For example, `row['Symbol']` or `row['Change']`.  
To write CSV data, you also use the `csv` module, but you create a writer object.

In [4]:
headers = ['Symbol', 'Price', 'Date', 'Time', 'Change', 'Volume']
rows = [('AA', 39.48, '6/11/2007', '9:36am', -0.18, 181800),
        ('AIG', 71.38, '6/11/2007', '9:36am', -0.15, 195500),
        ('AXP', 62.58, '6/11/2007', '9:36am', -0.46, 935000),]

In [5]:
# newline='' lets the csv module control line endings itself
with open('stocks.csv', 'w', newline='') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(rows)

If you have the data as a sequence of dictionaries, like so:

In [6]:
headers = ['Symbol', 'Price', 'Date', 'Time', 'Change', 'Volume']
rows = [{'Symbol':'AA', 'Price':39.48, 'Date':'6/11/2007',
         'Time':'9:36am', 'Change':-0.18, 'Volume':181800},
        {'Symbol':'AIG', 'Price': 71.38, 'Date':'6/11/2007',
         'Time':'9:36am', 'Change':-0.15, 'Volume': 195500},
        {'Symbol':'AXP', 'Price': 62.58, 'Date':'6/11/2007',
         'Time':'9:36am', 'Change':-0.46, 'Volume': 935000},]

In [7]:
# newline='' lets the csv module control line endings itself
with open('stocks.csv', 'w', newline='') as f:
    f_csv = csv.DictWriter(f, headers)
    f_csv.writeheader()
    f_csv.writerows(rows)

### Discussion

Using Python's `csv` module can save you quite a bit of time over parsing, splitting, and cleaning the data yourself.  
For comparison, here is what a manual approach might look like:

In [8]:
with open('stocks.csv') as f:
    for line in f:
        row = line.split(',')
        # Do something ...
        pass

The problem with this approach is that you’ll still need to deal with some nasty details.  
For example, if any of the fields are surrounded by quotes, you’ll have to strip the quotes.  
In addition, if a quoted field happens to contain a comma, the code will break by producing a row with the wrong size.  
By default, the `csv` library is programmed to understand CSV encoding rules used by Microsoft Excel.  
This is probably the most common variant, and will likely give you the best compatibility.  
However, if you consult the documentation for `csv`, you’ll see a few ways to tweak the encoding for different formats (e.g., changing the separator character).  
For example, if you want to read tab-delimited data instead, use this:

In [9]:
with open('stocks.csv') as f:
    f_tsv = csv.reader(f, delimiter='\t')
    for row in f_tsv:
        # Do something ...
        pass
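To make the quoting problem described above concrete, here is a minimal sketch that parses the same text both ways. The sample data (including the `Name` column) is made up for illustration and is read from an in-memory `io.StringIO` rather than from `stocks.csv`.

```python
import csv
import io

# Hypothetical sample: the second field is quoted and contains a comma.
sample = 'Symbol,Name,Price\nAA,"Alcoa, Inc.",39.48\n'

# Naive splitting cuts the quoted field in two and keeps the quote characters.
naive_rows = [line.rstrip('\n').split(',') for line in io.StringIO(sample)]

# csv.reader applies the CSV quoting rules instead.
csv_rows = list(csv.reader(io.StringIO(sample)))

print(naive_rows[1])   # four pieces, quotes still attached
print(csv_rows[1])     # three fields, quotes removed
```

The naive version yields a row of the wrong size, which is exactly the failure mode mentioned earlier.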

If you're reading CSV data and converting it into named tuples, be careful to validate the column headers.  
For example, a CSV file could have a header line containing characters that are not valid in identifiers, like this:

`Street Address,Num-Premises,Latitude,Longitude`  
`5412 N CLARK,10,41.980262,-87.668452`  

This will cause the creation of a `namedtuple` to fail with a `ValueError` exception.  
To work around this, you might have to scrub the headers first.  
For instance, you can carry out a regex substitution on the invalid identifier characters like this:

In [10]:
import re

with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headers = [ re.sub('[^a-zA-Z_]', '_', h) for h in next(f_csv) ]
    Row = namedtuple('Row', headers)
    for r in f_csv:
        row = Row(*r)
        # do something
        pass
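The substitution above replaces invalid characters but does not cover every case: a header could still start with a digit, be empty, or collide with a Python keyword. The following is one possible sketch of a stricter scrubber. The helper name `clean_header`, the `col<index>_` prefix, and the sample headers are all assumptions made for illustration. It also shows the `rename=True` option of `namedtuple`, which swaps invalid names for positional ones.

```python
import keyword
import re
from collections import namedtuple

def clean_header(name, index):
    """Return a valid namedtuple field name for an arbitrary header (hypothetical helper)."""
    # Replace every character outside [0-9a-zA-Z_] with an underscore.
    name = re.sub(r'[^0-9a-zA-Z_]', '_', name.strip())
    # Field names must start with a letter and must not be Python keywords.
    if not name or not name[0].isalpha() or keyword.iskeyword(name):
        name = 'col{}_{}'.format(index, name)
    return name

headers = ['Street Address', 'Num-Premises', '2nd Line', 'class']
fields = [clean_header(h, i) for i, h in enumerate(headers)]
Row = namedtuple('Row', fields)
print(Row._fields)

# Alternative: let namedtuple replace invalid names with _0, _1, ...
AutoRow = namedtuple('AutoRow', headers, rename=True)
print(AutoRow._fields)
```

Neither variant deals with duplicate headers after cleaning, so real data may need an extra uniqueness check.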

It's important to note that `csv` does not try to interpret the data or convert it to a type other than a string.  
The following example performs extra type conversions on CSV data:

In [11]:
col_types = [str, float, str, str, float, int]
with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)
    for row in f_csv:
        # Apply conversions to the row items
        row = tuple(convert(value) for convert, value in zip(col_types, row))
        # And so forth ...
        pass

You can also convert selected fields of dictionaries:

In [12]:
print('Reading as dicts with type conversion')
field_types = [ ('Price', float),
                ('Change', float),
                ('Volume', int) ]

with open('stocks.csv') as f:
    for row in csv.DictReader(f):
        row.update((key, conversion(row[key])) for key, conversion in field_types)
        print(row)

Reading as dicts with type conversion
OrderedDict([('Symbol', 'AA'), ('Price', 39.48), ('Date', '6/11/2007'), ('Time', '9:36am'), ('Change', -0.18), ('Volume', 181800)])
OrderedDict([('Symbol', 'AIG'), ('Price', 71.38), ('Date', '6/11/2007'), ('Time', '9:36am'), ('Change', -0.15), ('Volume', 195500)])
OrderedDict([('Symbol', 'AXP'), ('Price', 62.58), ('Date', '6/11/2007'), ('Time', '9:36am'), ('Change', -0.46), ('Volume', 935000)])


In general, you’ll probably want to be a bit careful with such conversions, though.  
In the real world, it’s common for CSV files to have missing values, corrupted data, and other issues that would break type conversions.  
So, unless your data is guaranteed to be error free, that’s something you’ll need to consider (you might need to add suitable exception handling).  
Finally, if your goal in reading CSV data is to perform data analysis and statistics, you might want to look at the `pandas` package.  
`pandas` includes a convenient `pandas.read_csv()` function that will load CSV data into a `DataFrame` object.  
From there, you can generate various summary statistics, filter the data, and perform other kinds of high-level operations.
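As a rough sketch of the exception handling suggested above, the following converts selected fields and substitutes a default when a value is missing or malformed. The sample data, the `convert_row` helper, and the choice of `None` as the fallback are assumptions for illustration. Depending on the application, you might prefer to skip or log bad rows instead.

```python
import csv
import io

# Hypothetical data: one missing price and one corrupted volume.
sample = (
    'Symbol,Price,Volume\n'
    'AA,39.48,181800\n'
    'AIG,,195500\n'
    'AXP,62.58,N/A\n'
)

field_types = [('Price', float), ('Volume', int)]

def convert_row(row, field_types, default=None):
    """Return a copy of row with conversions applied; bad values become default."""
    converted = dict(row)
    for key, conversion in field_types:
        try:
            converted[key] = conversion(row[key])
        except (ValueError, TypeError, KeyError):
            converted[key] = default
    return converted

rows = [convert_row(r, field_types) for r in csv.DictReader(io.StringIO(sample))]
for r in rows:
    print(r)
```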

## 6.2 Reading and Writing JSON Data

### Problem  
You want to read or write data encoded as JavaScript Object Notation (JSON).

### Solution  
The `json` module provides an easy way to encode and decode data in JSON.  
The two main functions are `json.dumps()` and `json.loads()`, mirroring the interface used in other serialization libraries, such as `pickle`.  
Here is how you turn a Python data structure into JSON:

In [13]:
import json

data = {
    'name' : 'ACME',
    'shares' : 100,
    'price': 542.23
}

json_str = json.dumps(data)
json_str

'{"name": "ACME", "shares": 100, "price": 542.23}'

In [14]:
type(json_str)

str

Now we can turn the JSON-encoded string back into a Python data structure:

In [15]:
data = json.loads(json_str); data

{'name': 'ACME', 'shares': 100, 'price': 542.23}

In [16]:
type(data)

dict

If you are working with files instead of strings, you can also use `json.dump()` and `json.load()` to encode and decode JSON data.

In [17]:
# Write the data
with open('data.json', 'w') as f:
    json.dump(data, f)
    
# Read data back
with open('data.json', 'r') as f:
    data = json.load(f)
    
data

{'name': 'ACME', 'shares': 100, 'price': 542.23}

### Discussion  
JSON encoding supports the basic types `None`, `bool`, `int`, `float`, and `str`, as well as lists, tuples, and dictionaries containing those types.  
For dictionaries, keys are assumed to be strings (keys that are `int`, `float`, `bool`, or `None` are converted to strings during encoding; other key types raise a `TypeError`).  
Older versions of the JSON specification (RFC 4627) required the top-level value to be an object or array, so for maximum interoperability you should only encode Python lists and dictionaries at the top level.  
Note that in web applications, it is also conventional for the top-level object to be a dictionary.  
The format of JSON encoding is almost identical to Python syntax except for a few minor changes.  
For instance, `True` is mapped to `true`, `False` is mapped to `false`, and `None` is mapped to `null`.

In [18]:
json.dumps(False)

'false'

In [19]:
d = {
    'a' : True,
    'b' : 'Hello',
    'c': None
}

json.dumps(d)

'{"a": true, "b": "Hello", "c": null}'
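The key conversion and the treatment of tuples mentioned above mean that a round trip through JSON does not always return the original structure. A small sketch with made-up data:

```python
import json

# Hypothetical data with non-string keys and a tuple value.
original = {1: 'one', None: 'nothing', 'point': (2, 3)}

encoded = json.dumps(original)
decoded = json.loads(encoded)

print(encoded)
print(decoded)               # keys are now strings, the tuple is now a list
print(decoded == original)   # the round trip is not lossless
```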

If you are trying to examine data you have decoded from JSON, it can often be hard to ascertain its structure simply by printing it out, especially if the data contains a deep level of nested structures or a lot of fields.  
To assist with this, consider using the `pprint()` function in the `pprint` module.  
This will alphabetize the keys and output a dictionary in a more readable way.  
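As a quick illustration, assuming a small nested document invented for this example, `pformat()` returns the same text that `pprint()` would print:

```python
import json
from pprint import pformat

# Hypothetical nested document
text = ('{"user": {"name": "ACME", "tags": ["a", "b"], '
        '"address": {"city": "Chicago", "zip": "60601"}}, "id": 7}')
data = json.loads(text)

# width=40 forces line breaks so the nesting becomes visible.
formatted = pformat(data, width=40)
print(formatted)
```

Because keys are sorted, `'id'` is shown before `'user'` even though it comes last in the input.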

Normally, JSON decoding will create dicts or lists from the supplied data.  
If you want to create different kinds of objects, supply the `object_pairs_hook` or `object_hook` to `json.loads()`.  
Here is one way you can decode JSON data that preserves its order in an `OrderedDict`:

In [20]:
s = '{"name": "ACME", "shares": 50, "price": 490.1}'

from collections import OrderedDict

data = json.loads(s, object_pairs_hook=OrderedDict); data

OrderedDict([('name', 'ACME'), ('shares', 50), ('price', 490.1)])

You can also turn a JSON dictionary into a Python object:

In [21]:
class JSONObject:
    def __init__(self, d):
        self.__dict__ = d
        
        
data = json.loads(s, object_hook=JSONObject)
data.name, data.shares, data.price

('ACME', 50, 490.1)

In this last example, the dictionary created by decoding the JSON data is passed as a single argument to `__init__()`.  
From there, you can use it directly as the instance dictionary of the object.
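One detail worth seeing in action is that `object_hook` is invoked for every decoded JSON object, not just the outermost one. The sketch below uses `types.SimpleNamespace` from the standard library in place of the `JSONObject` class, with a nested document made up for illustration.

```python
import json
from types import SimpleNamespace

# Hypothetical nested document
s = '{"name": "ACME", "position": {"shares": 50, "price": 490.1}}'

# The hook is called once per decoded JSON object, innermost first,
# so nested dictionaries become namespaces as well.
data = json.loads(s, object_hook=lambda d: SimpleNamespace(**d))

print(data.name)
print(data.position.shares, data.position.price)
```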

There are a few options that can be useful for encoding JSON.  
If you would like the output to be nicely formatted, you can use the `indent` argument to `json.dumps()`.  
This causes the output to be pretty printed in a format similar to that of the `pprint()` function.  

In [22]:
with open('data.json', 'r') as f:
    data = json.load(f)
    
print(json.dumps(data))
print(json.dumps(data, indent=4))

{"name": "ACME", "shares": 100, "price": 542.23}
{
    "name": "ACME",
    "shares": 100,
    "price": 542.23
}


You can use the `sort_keys` argument to sort the keys alphabetically on output:

In [23]:
print(json.dumps(data, sort_keys=True))

{"name": "ACME", "price": 542.23, "shares": 100}


Instances are not normally serializable as JSON.  
The following code fails with a `TypeError`:

In [24]:
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
p = Point(2, 3)
json.dumps(p)    # Raises TypeError: a Point is not JSON serializable

If you want to serialize instances, you can supply a function that takes an instance as input and returns a dictionary that can be serialized.

In [25]:
def serialize_instance(obj):
    d = { '__classname__' : type(obj).__name__ }
    d.update(vars(obj))
    return d

If you want to get an instance back, you could do this:

In [26]:
# Dictionary mapping names to known classes
classes = { 'Point' : Point }

def unserialize_object(d):
    clsname = d.pop('__classname__', None)
    if clsname:
        cls = classes[clsname]
        obj = cls.__new__(cls)  # Create an instance without calling __init__()
        for key, value in d.items():
            setattr(obj, key, value)
        return obj  # Return only after every attribute has been set
    else:
        return d

In [27]:
p = Point(2,3)
s = json.dumps(p, default=serialize_instance); s

'{"__classname__": "Point", "x": 2, "y": 3}'

In [28]:
a = json.loads(s, object_hook=unserialize_object); a

<__main__.Point at 0x110499d68>

In [29]:
a.x

2

The `json` module has a variety of other options for controlling the low-level interpretation of numbers, special values such as `NaN`, and more.  
[The JavaScript Object Notation (JSON) Data Interchange Format](https://tools.ietf.org/html/rfc8259)  
[`json` — JSON encoder and decoder](https://docs.python.org/3.7/library/json.html)
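As a brief illustration of the number-handling options mentioned above, the sketch below uses two documented arguments: `parse_float` for `json.loads()` and `allow_nan` for `json.dumps()`. The sample values are made up.

```python
import json
from decimal import Decimal

# parse_float selects the type used for JSON real numbers.
data = json.loads('{"price": 542.23}', parse_float=Decimal)
print(type(data['price']).__name__, data['price'])

# By default, NaN is written as the non-standard token NaN.
nan_text = json.dumps(float('nan'))
print(nan_text)

# allow_nan=False rejects such values instead of emitting them.
try:
    json.dumps(float('nan'), allow_nan=False)
    rejected = False
except ValueError as exc:
    rejected = True
    print('rejected:', exc)
```

Decoding prices into `Decimal` avoids binary floating-point rounding, at the cost of slower arithmetic.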

## 6.3 Parsing Simple XML Data

The `xml.etree.ElementTree` module can be used to extract data from simple XML documents.  
To illustrate, suppose you want to parse and make a summary of the RSS feed on [Planet Python](https://planetpython.org/).  
The following code will do that.

In [30]:
from urllib.request import urlopen
from xml.etree.ElementTree import parse

# Download the RSS feed and parse it:
u = urlopen('https://planet.python.org/rss20.xml')
doc = parse(u); doc

<xml.etree.ElementTree.ElementTree at 0x110499a90>

Now we can extract and output the tags that interest us:

In [31]:
for item in doc.iterfind('channel/item'):
    title = item.findtext('title')
    date = item.findtext('pubDate')
    link = item.findtext('link')
    print(title)
    print(date)
    print(link)
    print()

Mike Driscoll: Jupyter Notebook Extension Basics
Tue, 02 Oct 2018 05:05:08 +0000
http://www.blog.pythonlibrary.org/2018/10/02/jupyter-notebook-extension-basics/

Kay Hayen: Nuitka this week #8
Tue, 02 Oct 2018 04:05:00 +0000
http://nuitka.net/posts/nuitka-this-week-8.html

Podcast.__init__: Managing Application Secrets with Brian Kelly
Tue, 02 Oct 2018 02:12:21 +0000
https://www.podcastinit.com/managing-application-secrets-with-brian-kelly-episode-181/

Anarcat: October 2018 report: LTS, Mastodon, Firefox privacy, etc
Mon, 01 Oct 2018 20:28:22 +0000
https://anarc.at/blog/2018-10-01-report/

Bill Ward / AdminTome: Install Python 3.7.0 on Ubuntu 18.04 / Debian 9.5
Mon, 01 Oct 2018 18:37:19 +0000
https://www.admintome.com/blog/install-python-3-7-0-on-ubuntu-18-04/

Bruno Rocha: Hacktoberfest 2018
Mon, 01 Oct 2018 18:20:20 +0000
http://brunorocha.org/hacktoberfest-2018.html

Made With Mu: PyWeek - Make a Game with Mu
Mon, 01 Oct 2018 18:00:00 +0000
https://madewith.mu/mu/games/2018/10/01/p

### Discussion

Working with data encoded as XML is commonplace in many applications.  
Not only is XML widely used as a format for exchanging data on the Internet, it is a common format for storing application data (e.g., word processing, music libraries, etc.).  
The discussion that follows assumes the reader is already familiar with XML basics.

In many cases, when XML is simply being used to store data, the document structure is compact and straightforward.  
The `xml.etree.ElementTree.parse()` function parses the entire XML document into a document object.  
From there, you use methods such as `find()`, `iterfind()`, and `findtext()` to search for specific XML elements.  
The arguments to these functions are tag names or paths, such as `channel/item` or `title`.  
When specifying tags, you need to take the overall document structure into account.  
Each find operation takes place relative to a starting element.  
Likewise, the tag name that you supply to each operation is also relative to the start.  
In the example, the call to `doc.iterfind('channel/item')` looks for all "item" elements under a "channel" element.  
`doc` represents the top of the document (the top-level "rss" element).  
The later calls to `item.findtext()` take place relative to the found "item" elements.  
Each element represented by the `ElementTree` module has a few essential attributes and methods that are useful when parsing.  
The `tag` attribute contains the name of the tag, the `text` attribute contains the enclosed text, and the `get()` method can be used to extract attributes (if any).

In [32]:
doc

<xml.etree.ElementTree.ElementTree at 0x110499a90>

In [33]:
e = doc.find('channel/title'); e

<Element 'title' at 0x1104cfea8>

In [34]:
e.tag

'title'

In [35]:
e.text

'Planet Python'
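The cells above show `tag` and `text` but not `get()`. The following sketch exercises all three, along with relative searches, on a tiny feed embedded in the code so that it runs without network access. The feed contents are made up for illustration.

```python
from xml.etree.ElementTree import fromstring

# Hypothetical miniature feed
xml_text = '''
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First</title><link>http://example.com/1</link></item>
    <item><title>Second</title><link>http://example.com/2</link></item>
  </channel>
</rss>
'''.strip()

root = fromstring(xml_text)          # the top-level "rss" element

print(root.tag)                      # tag name
print(root.get('version'))           # attribute lookup
print(root.get('missing'))           # None when the attribute is absent

# Searches are relative to the element they are called on.
feed_title = root.find('channel/title').text
item_titles = [item.findtext('title') for item in root.iterfind('channel/item')]
print(feed_title, item_titles)
```

Note that `fromstring()` returns the root `Element` directly, whereas `parse()` returns an `ElementTree` wrapper around it.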

It should be noted that `xml.etree.ElementTree` is not the only option for XML parsing.  
For more advanced applications, you might consider `lxml`.  
It uses the same programming interface as `ElementTree`, so the example shown in this recipe works in the same manner.  
You simply need to change the first import to:  
`from lxml.etree import parse`.  
`lxml` provides the benefit of being fully compliant with XML standards.  
It is also extremely fast, and provides support for features such as validation, XSLT, and XPath.

## 6.4 Parsing Huge XML Files Incrementally 