## Tidy Data Introduction

Based on [Tidy Data](http://vita.had.co.nz/papers/tidy-data.html), by Hadley Wickham  
*The Journal of Statistical Software, vol. 59, 2014.*

Download: [pre-print](http://vita.had.co.nz/papers/tidy-data.pdf) | [from publisher](http://www.jstatsoft.org/v59/i10/)

---

In [1]:
import pandas as pd

### Data structure

We focus here on tabular data, which are rectangular "spreadsheets" made up of **rows** and **columns**.

[Pandas](https://pandas.pydata.org/) is the Python module that lets you store and manipulate tabular data. There is a lot of great online documentation and many tutorials on Pandas, including the [10 minutes to Pandas tutorial on the official documentation site](https://pandas.pydata.org/pandas-docs/stable/10min.html).

In Pandas this data is stored in a `DataFrame`. There are many ways to create DataFrames. In this workshop we will mostly load data in from comma-separated value (CSV) files, but here I will use one of the many methods to create them manually, just so you've seen an example.
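For comparison, here is a minimal sketch of the CSV route using `pd.read_csv`. It reads from an in-memory string rather than a file on disk, and the CSV text (including the `person` header) is an assumption written for illustration, mirroring the small table built by hand below.

```python
import io
import pandas as pd

# Illustrative CSV text; in practice this would be a file on disk.
csv_text = """person,treatment a,treatment b
John,,2
Jane,16,11
Mary,3,1
"""

# index_col=0 uses the first column ("person") as the row index.
# The empty field in John's row is read as a missing value (NaN).
df_csv = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_csv)
```

Passing a file path instead of the `io.StringIO` object would load the same table from disk.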

All columns have a name, and each row has an Index. Here we start with two columns, "treatment a" and "treatment b", each containing results of measurements that were made during the two treatments. The names of the people treated are used as the row indices.

In [2]:
df = pd.DataFrame({'treatment a':[None, 16, 3],
                   'treatment b':[2, 11, 1]},
                 index=["John", "Jane", "Mary"])
df

|      | treatment a | treatment b |
|------|-------------|-------------|
| John | NaN         | 2           |
| Jane | 16.0        | 11          |
| Mary | 3.0         | 1           |
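To make the column names and row index concrete, the short sketch below rebuilds the same DataFrame and looks up its labels. It uses only standard pandas accessors (`columns`, `index`, `.loc`).

```python
import pandas as pd

df = pd.DataFrame({'treatment a': [None, 16, 3],
                   'treatment b': [2, 11, 1]},
                  index=["John", "Jane", "Mary"])

print(list(df.columns))  # ['treatment a', 'treatment b']
print(list(df.index))    # ['John', 'Jane', 'Mary']

# .loc selects a single value by row label and column name.
print(df.loc["Jane", "treatment a"])  # 16.0
```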


If we transpose the rows and columns, the table still holds the same data, only laid out differently. We need a way to talk about the information contained here, and to understand which structure will lead us to the easiest analysis and visualization.

In [3]:
df.T

|             | John | Jane | Mary |
|-------------|------|------|------|
| treatment a | NaN  | 16.0 | 3.0  |
| treatment b | 2.0  | 11.0 | 1.0  |


### Data semantics

Datasets contain **values**

- Numbers (quantitative, or numerical)
- Strings (qualitative, or categorical)
- Dates + times (quantitative, or parts of dates and times can be categorical)
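As a rough illustration of these three kinds of values, the sketch below builds a tiny table of trips and inspects its column types. The table, its column names, and its contents are invented for this example.

```python
import pandas as pd

# Hypothetical trip records, invented for illustration.
trips = pd.DataFrame({
    "distance_km": [2.5, 10.0, 4.2],                # numbers
    "mode": ["bike", "car", "bike"],                # strings
    "start": pd.to_datetime(["2014-01-06 08:30",
                             "2014-01-07 17:45",
                             "2014-01-11 10:00"]),  # dates + times
})

print(trips.dtypes)

# A part of a date used as a categorical value: the day of the week.
trips["weekday"] = trips["start"].dt.day_name()
print(trips["weekday"].tolist())  # ['Monday', 'Tuesday', 'Saturday']
```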

Think of a **table** as a spreadsheet of **values**, describing the attributes of a type of thing which can be named, like people, corporations, phone numbers, trips people took, recipes, ingredients, experimental measurements, etc.

- The rows are the individual instances of that thing the table is recording
- The columns describe the various attributes of that individual instance of the thing

So, every value belongs to a **variable** (column) and an **observation** (row).

- A **variable** contains all **values** of the same attribute (height, temperature, duration...)
- An **observation** contains all **values** recorded on one instance or unit (one person, one day, one trip, one company, one letter someone wrote...)

---

We will reorganize the previous table to make the **values**, **variables** and **observations** clearer.


In [4]:
df2 = pd.DataFrame({"person":['John', 'Jane', 'Mary', 'John', 'Jane', 'Mary'],
                   "treatment":['a', 'a', 'a', 'b', 'b', 'b'],
                   "result":[None, 16, 3, 2, 11, 1]})
df2

|   | person | treatment | result |
|---|--------|-----------|--------|
| 0 | John   | a         | NaN    |
| 1 | Jane   | a         | 16.0   |
| 2 | Mary   | a         | 3.0    |
| 3 | John   | b         | 2.0    |
| 4 | Jane   | b         | 11.0   |
| 5 | Mary   | b         | 1.0    |
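The table above was typed in by hand, but the same reorganization can be derived from the original wide table. The following is one possible sketch using `DataFrame.melt`. The intermediate name `tidy` and the step that strips the `"treatment "` prefix are choices made for this example, not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({'treatment a': [None, 16, 3],
                   'treatment b': [2, 11, 1]},
                  index=["John", "Jane", "Mary"])

tidy = (
    df.rename_axis("person")   # name the index so it becomes a column
      .reset_index()
      .melt(id_vars="person", var_name="treatment", value_name="result")
)

# Strip the "treatment " prefix so the values are just "a" and "b".
tidy["treatment"] = tidy["treatment"].str.replace("treatment ", "", regex=False)
print(tidy)
```

Here `id_vars` names the column that identifies each row, while the old column headers become values of the new `treatment` column.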


In [5]:
# Sum only the numeric column; the string column "person" is skipped
df2.groupby("treatment").sum(numeric_only=True)

| treatment | result |
|-----------|--------|
| a         | 19.0   |
| b         | 14.0   |


In [6]:
df2.person.unique()

array(['John', 'Jane', 'Mary'], dtype=object)

The dataset contains 18 values representing 3 variables and 6 observations:

1. **person**, with three possible values (John, Mary and Jane)
2. **treatment**, with two possible values (a and b)
3. **result**, with five or six possible values, depending on how you think of the missing value (-, 16, 3, 2, 11, 1)
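These counts can be checked directly. The sketch below rebuilds the tidy table and uses `shape`, `size`, and `nunique` to recover them. Whether the missing value counts as a distinct value is controlled by the `dropna` argument.

```python
import pandas as pd

df2 = pd.DataFrame({"person": ['John', 'Jane', 'Mary', 'John', 'Jane', 'Mary'],
                    "treatment": ['a', 'a', 'a', 'b', 'b', 'b'],
                    "result": [None, 16, 3, 2, 11, 1]})

n_observations, n_variables = df2.shape
print(n_observations, n_variables, df2.size)  # 6 3 18

# Distinct values per variable; missing values are ignored by default.
print(df2.nunique())

# Counting the missing value as its own value gives six results.
print(df2["result"].nunique(dropna=False))  # 6
```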


### Tidy data

Tidy data is a standard way of mapping the meaning of a dataset to its structure. *A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types.*

In **tidy data**

1. Each **variable** forms a **column**
2. Each **observation** forms a **row**
3. Each **type of observational unit** forms a **table**

**Messy data** is any other arrangement of the data.
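The layouts seen earlier hold the same values, and pandas can move between them. As a sketch, `DataFrame.pivot` turns the tidy table back into a wide, per-treatment layout, which is messy by the definition above because the treatment variable is spread across the column headers. The name `wide` is just a label chosen for this example.

```python
import pandas as pd

df2 = pd.DataFrame({"person": ['John', 'Jane', 'Mary', 'John', 'Jane', 'Mary'],
                    "treatment": ['a', 'a', 'a', 'b', 'b', 'b'],
                    "result": [None, 16, 3, 2, 11, 1]})

# One row per person, one column per treatment.
wide = df2.pivot(index="person", columns="treatment", values="result")
print(wide)
```

Note that `pivot` sorts the row labels, so the people appear in alphabetical order rather than in their original order.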

---

*Fixed variables* describe the experimental design and are known in advance. These are often called (as in Tableau) **dimensions**.

*Measured variables* are what we actually measure in the experiment/study. These are called **measures**.