# Data Formats

## Overview

This notebook examines some of the different ways that data is organised within files. We'll call these *data formats*.

First, we're going to rule out a couple of very widely used data formats, namely HTML and PDF. The reason is that we are interested in data that we can manipulate using computer code. To make this process easy, we need the data to be organised into a predictable structure. HTML documents (that is, the raw form of ordinary web pages) are generally not structured in a predictable way, which means that extracting data from them is often complex and time-consuming. Web pages are designed to be easy for humans to understand, rather than for computers to extract data from. PDF documents are similar, in that processing them with code is tricky and hard to do in a general way.

To sum up, HTML and PDF documents fail as *machine-readable* formats, since they cannot be imported straightforwardly into an application or computer program that deals with data.

## <a name="excel">Excel</a>

Microsoft Excel is a spreadsheet application, and the files it creates are in a machine-readable format. Here is an example:
![](../images/messages.pdf)

Excel is frequently used by large organisations for publishing tabular data. However, the data format belongs to Microsoft rather than being in the public domain, and we prefer not to use proprietary data formats. There are also issues with different versions of Excel, and differences depending on the platform on which they run (e.g., Windows vs. macOS).

## <a name="csv">CSV</a>

CSV (short for "Comma Separated Values") is a simple data format for tables that can be read and written by any text editor. Each row of the table is represented as a line in the file, and the values of the cells in the row are separated by a comma (","). The next example shows what happens if we export the Excel data shown above into a CSV file:

In [9]:
%%bash
cat ../data/open_data/messages.csv

To,From,Heading,Body,Date
"Arno, Ewan",James,Reminder,Cycling to Cramond today!,13/10/2015
"James, Ewan",Arno,Re: Reminder,Let's walk instead,14/10/2015

The `pandas` library in Python is designed to make it easy to process tabular data, and we can use it to display the CSV file so that it looks more like a table. In the next example, we import the library (and give it the short name `pd`), and then use its `read_csv()` function to slurp up the CSV file.

In [13]:
import pandas as pd
table = pd.read_csv("../data/formats/messages.csv")
table

|   | To | From | Heading | Body | Date |
|---|----|------|---------|------|------|
| 0 | Arno, Ewan | James | Reminder | Cycling to Cramond today! | 13/10/2015 |
| 1 | James, Ewan | Arno | Re: Reminder | Let's walk instead | 14/10/2015 |


One issue to note is that if a value in the CSV file contains a comma, then we have to wrap that value in double quotes, as in `"James, Ewan"`. Without the quotes, the comma inside the value would be read as a separator between two cells.
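The quoting rule can be tried out without touching any files. The following is a minimal sketch that uses Python's standard `csv` module rather than `pandas`; the `rows` list simply retypes the message data shown above, and the in-memory buffer stands in for a file on disk.

```python
import csv
import io

# The message table from above, retyped as a list of rows.
rows = [
    ["To", "From", "Heading", "Body", "Date"],
    ["Arno, Ewan", "James", "Reminder", "Cycling to Cramond today!", "13/10/2015"],
    ["James, Ewan", "Arno", "Re: Reminder", "Let's walk instead", "14/10/2015"],
]

# Write the rows to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original cells, embedded commas included.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed[1][0])
```

With its default settings the writer adds quotes only around the values that need them, so `Arno, Ewan` is quoted while `James` is not, and the reader strips the quotes again when it parses the text.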

## <a name="xml">XML</a>

XML (short for "eXtensible Markup Language") is a W3C open standard, used widely for representing documents and also arbitrary data structures. XML representations are tree-shaped, in the sense that there is a single node (the root) from which all the branches spring. We can use a text document to give us a simple example.

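XML is considerably more verbose than CSV, particularly when the data is tabular in nature. Each row of the table becomes an element of its own, and each cell in that row becomes a child element wrapped in an opening and a closing tag.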


In [11]:
from lxml.etree import Element, SubElement, tostring
# The root element: every other element hangs off this node.
top = Element('data')
# First message: one <to> element per recipient, then the remaining fields.
msg1 = SubElement(top, 'message')
to11 = SubElement(msg1, 'to')
to11.text = "James"
to12 = SubElement(msg1, 'to')
to12.text = "Ewan"
from1 = SubElement(msg1, 'from')
from1.text = "Arno"
heading1 = SubElement(msg1, 'heading')
heading1.text = "Reminder"
body1 = SubElement(msg1, 'body')
body1.text = "Cycling to Cramond today!"
date1 = SubElement(msg1, 'date')
date1.text = "13/10/2015"
# Second message, built in the same way.
msg2 = SubElement(top, 'message')
to21 = SubElement(msg2, 'to')
to21.text = "Arno"
to22 = SubElement(msg2, 'to')
to22.text = "Ewan"
from2 = SubElement(msg2, 'from')
from2.text = "James"
heading2 = SubElement(msg2, 'heading')
heading2.text = "Re: Reminder"
body2 = SubElement(msg2, 'body')
body2.text = "Let's walk instead"
date2 = SubElement(msg2, 'date')
date2.text = "14/10/2015"
# Serialise the whole tree as indented text.
print(tostring(top, pretty_print=True, encoding="unicode"))

<data>
  <message>
    <to>James</to>
    <to>Ewan</to>
    <from>Arno</from>
    <heading>Reminder</heading>
    <body>Cycling to Cramond today!</body>
    <date>13/10/2015</date>
  </message>
  <message>
    <to>Arno</to>
    <to>Ewan</to>
    <from>James</from>
    <heading>Re: Reminder</heading>
    <body>Let's walk instead</body>
    <date>14/10/2015</date>
  </message>
</data>
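The tree shape of this output can be made explicit by walking it from the root downwards. The sketch below is illustrative only: it uses the standard library's `xml.etree.ElementTree` instead of `lxml`, parses a shortened copy of the first message from a string, and the helper name `outline` is made up for this example.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the first message, held in a string for illustration.
xml_text = """<data>
  <message>
    <to>James</to>
    <to>Ewan</to>
    <from>Arno</from>
    <heading>Reminder</heading>
  </message>
</data>"""

root = ET.fromstring(xml_text)


def outline(element, depth=0):
    """Return one indented line per element, visiting the tree depth-first."""
    text = (element.text or "").strip()
    line = "  " * depth + element.tag + (f": {text}" if text else "")
    lines = [line]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(root)
print("\n".join(tree_lines))
```

The single `data` element sits at depth 0, each `message` is a branch below it, and the elements that carry text are the leaves.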



In [19]:
from lxml import etree
# Parse the XML file from disk, then print the parsed tree (not the `top` element built earlier).
tree = etree.parse("../data/formats/messages.xml")
print(etree.tostring(tree, pretty_print=True, encoding="unicode"))

<data>
  <message>
    <to>James</to>
    <to>Ewan</to>
    <from>Arno</from>
    <heading>Reminder</heading>
    <body>Cycling to Cramond today!</body>
    <date>13/10/2015</date>
  </message>
  <message>
    <to>Arno</to>
    <to>Ewan</to>
    <from>James</from>
    <heading>Re: Reminder</heading>
    <body>Let's walk instead</body>
    <date>14/10/2015</date>
  </message>
</data>
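Once XML has been parsed, its elements can be queried by name and flattened back into table rows like those in the CSV section. The following is a rough sketch under a few assumptions: it uses the standard library's `xml.etree.ElementTree` rather than `lxml`, it parses a string copy of the output above instead of the file, and the capitalised column names in `records` are chosen here to mirror the CSV header.

```python
import xml.etree.ElementTree as ET

# A string copy of the XML shown above, used here in place of the file.
xml_text = """<data>
  <message>
    <to>James</to>
    <to>Ewan</to>
    <from>Arno</from>
    <heading>Reminder</heading>
    <body>Cycling to Cramond today!</body>
    <date>13/10/2015</date>
  </message>
  <message>
    <to>Arno</to>
    <to>Ewan</to>
    <from>James</from>
    <heading>Re: Reminder</heading>
    <body>Let's walk instead</body>
    <date>14/10/2015</date>
  </message>
</data>"""

root = ET.fromstring(xml_text)

records = []
for message in root.findall("message"):
    records.append({
        # A message may have several <to> elements; join them into one cell.
        "To": ", ".join(to.text for to in message.findall("to")),
        "From": message.findtext("from"),
        "Heading": message.findtext("heading"),
        "Body": message.findtext("body"),
        "Date": message.findtext("date"),
    })

for record in records:
    print(record)
```

In the XML each recipient has its own `to` element, whereas a CSV cell has to squeeze all the recipients into a single quoted value such as `"James, Ewan"`.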

