# Tidy Data

**Learning Objective:** Understand the basics of *tidy data*.

## Overview

In our [Theory of Data](../Introduction/03-TheoryofData.ipynb) section, we covered some basic aspects of data:

- **Data types:** ordinal, nominal, quantitative, date/time, geographic
- **Variables:** a single thing that is measured
- **Observations:** multiple variables that are measured for a single entity
- **Dataset:** a set of observations (also called records)

The idea of *tidy data* is this:

1. There are many possible ways one can organize variables and observations into a dataset;
2. However, not all ways are equal; and
3. A particular way of organizing a dataset, called *tidy data*, is particularly useful when working with data.

The idea of tidy data was formalized by Hadley Wickham in his [Tidy Data](https://www.jstatsoft.org/article/view/v059i10) paper, published in 2014. Later in this course we will describe tidy data in more detail. However, a short tidy data detour is useful before diving into data visualization, because our first rule in data visualization is this:

> Start all data visualizations with a tidy dataset.

Thus, if you want to visualize a dataset, your first task will be to put it into a tidy form. For now, we will be working with datasets that are already tidy; the often painful process of tidying a dataset will be covered later.

## Defining tidy data

A tidy dataset has the following properties:

1. Each variable forms a column
2. Each observation forms a row
3. Each type of observational unit forms a table

*Messy data* is any other arrangement of the data.
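
To make the three properties concrete, here is a minimal sketch that contrasts a messy layout with a tidy one. The table and its values are hypothetical toy data invented for illustration (they are not taken from the cars dataset), and `melt` is just one of several Pandas tools that can perform this kind of reshaping.

```python
import pandas as pd

# Hypothetical toy data: one row per car, with a single variable (mpg)
# spread across one column per year. This layout is "messy" because the
# column names encode values of a second variable (the year).
messy = pd.DataFrame({
    "name": ["car_a", "car_b"],
    "mpg_1970": [18.0, 15.0],
    "mpg_1971": [19.0, 16.0],
})

# Reshape so that each variable (name, year, mpg) is a column and each
# observation (one car in one year) is a row.
tidy = messy.melt(id_vars="name", var_name="year", value_name="mpg")

# Strip the "mpg_" prefix so the year column holds only the year itself.
tidy["year"] = tidy["year"].str.replace("mpg_", "", regex=False).astype(int)

print(tidy)
```

In the tidy version, the two cars measured in two years become four rows, one per observation, with three columns, one per variable.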

## Example: cars dataset

The cars dataset, which comes with the Altair visualization library, is an example of a tidy dataset. Let's load that dataset and look at it:

In [2]:
import altair as alt
alt.enable_mime_rendering()

In [3]:
cars = alt.load_dataset('cars')

In [4]:
cars.head()

| | Acceleration | Cylinders | Displacement | Horsepower | Miles_per_Gallon | Name | Origin | Weight_in_lbs | Year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 12.0 | 8 | 307.0 | 130.0 | 18.0 | chevrolet chevelle malibu | USA | 3504 | 1970-01-01 |
| 1 | 11.5 | 8 | 350.0 | 165.0 | 15.0 | buick skylark 320 | USA | 3693 | 1970-01-01 |
| 2 | 11.0 | 8 | 318.0 | 150.0 | 18.0 | plymouth satellite | USA | 3436 | 1970-01-01 |
| 3 | 12.0 | 8 | 304.0 | 150.0 | 16.0 | amc rebel sst | USA | 3433 | 1970-01-01 |
| 4 | 10.5 | 8 | 302.0 | 140.0 | 17.0 | ford torino | USA | 3449 | 1970-01-01 |


The cars dataset above is stored as a table-like object called a `DataFrame`. In Python, the [Pandas](http://pandas.pydata.org/) library provides this data structure:

In [5]:
type(cars)

pandas.core.frame.DataFrame

We will learn more about Pandas and `DataFrame`s later in this course. For now, we cover a few of their commonly used attributes and methods. The `.columns` attribute returns a one-dimensional sequence of the column names. These are the *variables* in the dataset:

In [6]:
cars.columns

Index(['Acceleration', 'Cylinders', 'Displacement', 'Horsepower',
       'Miles_per_Gallon', 'Name', 'Origin', 'Weight_in_lbs', 'Year'],
      dtype='object')

The rows (observations) are labeled by another one-dimensional sequence called the index (`.index`):

In [7]:
cars.index

Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            396, 397, 398, 399, 400, 401, 402, 403, 404, 405],
           dtype='int64', length=406)

The length of the dataset is the number of rows:

In [8]:
len(cars)

406
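
The row count is one half of a `DataFrame`'s dimensions; the `.shape` attribute reports both rows and columns at once. The sketch below uses a small stand-in `DataFrame` (the name `cars_small` is hypothetical) built from a few values in the rows shown earlier, so that it runs without loading the full dataset.

```python
import pandas as pd

# A three-row stand-in for the cars dataset, using values from the
# first rows displayed earlier in this section.
cars_small = pd.DataFrame({
    "Name": ["chevrolet chevelle malibu", "buick skylark 320", "plymouth satellite"],
    "Acceleration": [12.0, 11.5, 11.0],
})

# .shape is a (number of rows, number of columns) tuple.
n_rows, n_cols = cars_small.shape

# len() of the DataFrame matches the row count (observations), and
# len() of .columns matches the column count (variables).
print(n_rows == len(cars_small))
print(n_cols == len(cars_small.columns))
```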

Lastly, the `DataFrame` acts like a specialized dictionary, where the keys are the column names and the values are the columns:

In [9]:
cars['Acceleration'].head()

0    12.0
1    11.5
2    11.0
3    12.0
4    10.5
Name: Acceleration, dtype: float64
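
The dictionary analogy extends beyond square-bracket indexing. As a rough sketch, again on a small stand-in `DataFrame` (the name `cars_small` is hypothetical, and the values are copied from the first rows shown earlier), iteration and membership tests both operate on column names, just as they operate on keys for a `dict`:

```python
import pandas as pd

# A three-row stand-in for the cars dataset, using values from the
# first rows displayed earlier in this section.
cars_small = pd.DataFrame({
    "Name": ["chevrolet chevelle malibu", "buick skylark 320", "plymouth satellite"],
    "Acceleration": [12.0, 11.5, 11.0],
    "Horsepower": [130.0, 165.0, 150.0],
})

# Iterating over a DataFrame yields its column names, like dict keys.
column_names = list(cars_small)
print(column_names)

# Membership tests check column names, not the values stored in them.
print("Horsepower" in cars_small)
print("buick skylark 320" in cars_small)

# Indexing with a column name returns that column as a pandas Series.
horsepower = cars_small["Horsepower"]
print(type(horsepower))
```

One consequence worth noting is that `in` will not tell you whether a value appears somewhere in the data; it only answers whether a column with that name exists.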

We will be using this cars dataset to cover the basics of data visualization with Altair.