# Data Formats

## Overview

This notebook examines some of the different ways that data is organised within files. We'll call these *data formats*.

First, we're going to rule out two very widely used data formats, namely HTML and PDF. The reason is that we are interested in data that we can manipulate using computer code. To make this process easy, we need the data to be organised into a predictable structure. HTML documents (that is, the raw form of ordinary web pages) are generally not structured in a predictable way, which means that extracting data from them is often complex and time-consuming. Web pages are designed to be easy for humans to understand, rather than for computers to extract data from. PDF documents are similar, in that processing them with code is tricky and hard to do in a general way.

To sum up, HTML and PDF documents fail as *machine-readable* formats, since they cannot be imported straightforwardly into an application or computer program that deals with data.

## <a name="excel">Excel</a>

Microsoft Excel is a spreadsheet application, and the files it creates are in a machine-readable format. Here is an example:
![](../images/excel_example.pdf)

Excel is frequently used by large organisations for publishing tabular data. However, the data format belongs to Microsoft rather than being in the public domain, and we prefer not to use proprietary data formats. There are also issues with different versions of Excel, and differences depending on the platform on which they run (e.g., Windows vs. macOS).

## <a name="csv">CSV</a>



In [12]:
%%bash
cat ../data/open_data/messages.csv

To,From,Heading,Body,Date
"James, Ewan",Arno,Reminder,Cycling to Cramond today!,13/10/2015
Arno,"James, Ewan",Re: Reminder,Let's walk instead,14/10/2015
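
The field `"James, Ewan"` in the output above contains a comma, which is why it is wrapped in double quotes: the quotes tell a CSV parser that this comma is part of the value rather than a separator. The following is a minimal sketch of how Python's standard `csv` module handles this. The file contents are embedded as a string (the variable `raw` is just an illustrative name) so that the example runs without access to `messages.csv`.

```python
import csv
import io

# Contents of messages.csv as displayed above, embedded as a string
# so that the example does not depend on the file being present.
raw = '''To,From,Heading,Body,Date
"James, Ewan",Arno,Reminder,Cycling to Cramond today!,13/10/2015
Arno,"James, Ewan",Re: Reminder,Let's walk instead,14/10/2015
'''

# csv.reader splits on commas but keeps quoted fields intact.
rows = list(csv.reader(io.StringIO(raw)))
header, records = rows[0], rows[1:]

for record in records:
    # Pair each value with its column name.
    print(dict(zip(header, record)))
```

Splitting each line naively with `line.split(",")` would instead break `"James, Ewan"` into two fields and give those rows six values rather than five.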

In [11]:
import pandas as pd
table = pd.read_csv("../data/open_data/excel_example.csv")
table

Unnamed: 0,Messages,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,To,From,Heading,Body,Date
1,James,Arno,Reminder,Cycling to Cramond today!,13/10/2015
2,Arno,James,Re: Reminder,Let's walk instead.,14/10/2015
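
The `Unnamed:` column names in the output above suggest that the first line of the file holds only the title `Messages`, so pandas has treated that line as the header and pushed the real column names (`To`, `From`, and so on) into the first data row. One possible fix, sketched below, is to skip the title line with the `skiprows` parameter of `pd.read_csv`. The file contents here are an assumption reconstructed from the table above, embedded as a string so that the example is self-contained.

```python
import io

import pandas as pd

# Assumed file contents, reconstructed from the table above: a title
# line, then the real header row, then two data rows.
raw = """Messages,,,,
To,From,Heading,Body,Date
James,Arno,Reminder,Cycling to Cramond today!,13/10/2015
Arno,James,Re: Reminder,Let's walk instead.,14/10/2015
"""

# skiprows=1 drops the title line, so the next line becomes the header.
table = pd.read_csv(io.StringIO(raw), skiprows=1)
print(table.columns.tolist())
```

If the real file has the layout assumed here, the same argument should work when reading it directly: `pd.read_csv("../data/open_data/excel_example.csv", skiprows=1)`.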
