### Pandas: From messy to tidy datasets

The Pandas library for Python was built around the dataframe idea taken from R, the statistical programming language. Wes McKinney is the driving force behind the library (O'Reilly book: Python for Data Analysis).

Hadley Wickham is his R counterpart: he works on RStudio, the free programming environment for R, and is the author of some important R libraries.

There are hardly any flame wars between the R and Python communities. McKinney and Wickham sometimes work closely together, and the fruits of that work find their way into both languages. R is really strong in hard-core statistical libraries, has a kind of functional twist to it and, at least for me, a somewhat quirky syntax; Python is the broader programming language, with strong support for scientific programming through its libraries.

Both languages have "notebooks", and Jupyter (the name nods to Julia, Python and R) notebooks can incorporate both Python and R snippets. CSV files are the "lingua franca" between the languages.

In 2014 Hadley Wickham wrote an important article in the Journal of Statistical Software: "Tidy Data".

In it he argued for a certain way of structuring data that makes the data easier and more effective to clean and work with: using consistent data structures and matching tools. These matching tools are now collected in the so-called Tidyverse, a set of R packages.

A tidy structure has the following attributes:

  - Each variable forms a column and contains values
  - Each observation forms a row
  - Each type of observational unit forms a table
  
  where:
  
  - variable is a measurement or an attribute (height, weight, sex, etc.)
  - value is the actual measurement or attribute (152 cm, 80 kg, female, etc.)
  - observation contains all values measured on the same unit (for example, one person) across the variables
  
A dataset that is not tidy is messy.

Why are there messy datasets? Well, life is messy in a way. Datasets often get messy because they are used for presentation purposes, and values of variables tend to creep into column headers. Or, in order to make data entry easier, someone stores multiple variables in one column.
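
As a minimal sketch of the second kind of mess, assume a hypothetical column `sex_age` that stores two variables in one string such as `f-34`. The column names and the data are invented for illustration. Splitting the column gives each variable its own column:

```python
import pandas as pd

# Hypothetical messy data: one column holds two variables (sex and age).
messy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "sex_age": ["f-34", "m-51", "m-27"],
})

# Split the combined column on "-" into two separate columns (labelled 0 and 1).
parts = messy["sex_age"].str.split("-", expand=True)

# Drop the combined column and add one column per variable.
tidy = messy.drop(columns="sex_age").assign(
    sex=parts[0],
    age=parts[1].astype(int),
)
print(tidy)
```

The first kind of mess, values hiding in column headers, is what the `melt` method is for.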

In order to get some working experience with Pandas we will start to struggle a bit with tidy and messy datasets.

Let's start with a tidy dataset. We open the CSV file in our preferred editor, like so:

In [1]:
!aquamacs /Users/peter/Documents/bootcamps/data/cash.csv

In [4]:
import pandas as pd
cash = pd.read_csv('../data/cash.csv', sep = ',')

In [5]:
cash.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 3 columns):
date       12 non-null object
person     12 non-null object
dollars    12 non-null float64
dtypes: float64(1), object(2)
memory usage: 368.0+ bytes
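
The `info()` output lists `date` as `object`, which means the dates were read as plain strings. Here is a minimal sketch of two ways to convert them, using a small in-memory stand-in for `cash.csv` (three rows that appear in this section) rather than the real file. The names `csv_text`, `cash_sample` and `as_strings` are my own:

```python
import io
import pandas as pd

# Stand-in for cash.csv, built from rows shown in this section.
csv_text = """date,person,dollars
2000-01-03,Michael,200.0
2000-01-03,George,500.0
2000-01-04,Michael,180.5
"""

# Option 1: parse_dates asks read_csv to convert the column while reading.
cash_sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(cash_sample.dtypes)

# Option 2: convert a column that was already loaded as strings.
as_strings = pd.read_csv(io.StringIO(csv_text))
as_strings["date"] = pd.to_datetime(as_strings["date"])
print(as_strings.dtypes)
```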


In [6]:
cash.head()

| | date | person | dollars |
|---|---|---|---|
| 0 | 2000-01-03 | Michael | 200.0 |
| 1 | 2000-01-03 | George | 500.0 |
| 2 | 2000-01-03 | Lisa | 450.0 |
| 3 | 2000-01-04 | Michael | 180.5 |
| 4 | 2000-01-04 | George | 450.0 |


A good example of a tidy dataset:

- each variable is a column
- each observation is a row
- each type of observational unit forms a table

This format can be referred to as:

- stacked format (stack of observations)
- record format (each row is a single record)
- long format (the table grows vertically: more observations mean more rows, not more columns)

It seems rather silly to re-format a tidy dataset, but for presentation purposes it can be "better" to change things a bit (and this is precisely the reason that we encounter so many untidy datasets in the wild).

Let's pivot our dataset:

In [8]:
cash.pivot(index='date', columns='person', values='dollars')

| person | George | Lisa | Michael |
|---|---|---|---|
| date | | | |
| 2000-01-03 | 500.0 | 450.0 | 200.0 |
| 2000-01-04 | 450.0 | 448.0 | 180.5 |
| 2000-01-05 | 420.0 | 447.0 | 177.0 |
| 2000-01-06 | 300.0 | 344.6 | 150.0 |


So now we have a wide, unstacked format: messy as data, but fine in a presentation context. I think this is one of those fundamental rules that is overlooked very often: always decouple presentation from data, whether you are writing, coding or preparing data. Concentrate on the content (data, words) and the structure (chapters and the like, or the data structure) and the rest will follow "automatically".
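
To illustrate that the wide view is only a presentation of the same data, here is a sketch that pivots a small stand-in for `cash` and then restores the long format with `reset_index` and `melt`. The four sample rows are copied from the `head()` output above; the variable names `cash_sample`, `wide` and `long_again` are my own.

```python
import pandas as pd

# Small stand-in for the cash dataframe (rows taken from the output above).
cash_sample = pd.DataFrame({
    "date": ["2000-01-03", "2000-01-03", "2000-01-04", "2000-01-04"],
    "person": ["Michael", "George", "Michael", "George"],
    "dollars": [200.0, 500.0, 180.5, 450.0],
})

# Long -> wide: one row per date, one column per person.
wide = cash_sample.pivot(index="date", columns="person", values="dollars")

# Wide -> long: turn the index back into a column, then melt the person columns.
long_again = wide.reset_index().melt(
    id_vars="date", var_name="person", value_name="dollars"
)
print(long_again.sort_values(["date", "person"]))
```

Apart from the row order, `long_again` holds the same observations as `cash_sample`, so nothing is lost by keeping the tidy version as the master copy.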

Apart from pivot there are the stack and unstack methods in the Pandas library.
Let's use our cash dataframe to prepare a view with multiple indices:

In [10]:
cash_multi = cash.set_index(['date', 'person'])
cash_multi

| date | person | dollars |
|---|---|---|
| 2000-01-03 | Michael | 200.0 |
| 2000-01-03 | George | 500.0 |
| 2000-01-03 | Lisa | 450.0 |
| 2000-01-04 | Michael | 180.5 |
| 2000-01-04 | George | 450.0 |
| 2000-01-04 | Lisa | 448.0 |
| 2000-01-05 | Michael | 177.0 |
| 2000-01-05 | George | 420.0 |
| 2000-01-05 | Lisa | 447.0 |
| 2000-01-06 | Michael | 150.0 |


This is stacked data: each row is an observation. For presentation purposes, we can "unstack" the dataframe.

In [11]:
cash_multi.unstack()

| | dollars | dollars | dollars |
|---|---|---|---|
| person | George | Lisa | Michael |
| date | | | |
| 2000-01-03 | 500.0 | 450.0 | 200.0 |
| 2000-01-04 | 450.0 | 448.0 | 180.5 |
| 2000-01-05 | 420.0 | 447.0 | 177.0 |
| 2000-01-06 | 300.0 | 344.6 | 150.0 |
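
The `stack` method mentioned earlier is the inverse operation. Here is a minimal sketch on a four-row stand-in for `cash` (the sample frame and the name `restacked` are my own, not from the original notebook):

```python
import pandas as pd

# Small stand-in for the cash dataframe.
cash_sample = pd.DataFrame({
    "date": ["2000-01-03", "2000-01-03", "2000-01-04", "2000-01-04"],
    "person": ["Michael", "George", "Michael", "George"],
    "dollars": [200.0, 500.0, 180.5, 450.0],
})
cash_multi = cash_sample.set_index(["date", "person"])

# unstack moves the innermost index level ("person") into the columns ...
wide = cash_multi.unstack()
print(wide.columns.tolist())

# ... and stack moves it back into the row index.
restacked = wide.stack()
print(restacked)
```

Depending on the Pandas version, `stack` may emit a `FutureWarning` about its new implementation. For complete data like this the result should hold the same observations either way.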


Transpose is one character away:

In [12]:
cash.T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| date | 2000-01-03 | 2000-01-03 | 2000-01-03 | 2000-01-04 | 2000-01-04 | 2000-01-04 | 2000-01-05 | 2000-01-05 | 2000-01-05 | 2000-01-06 | 2000-01-06 | 2000-01-06 |
| person | Michael | George | Lisa | Michael | George | Lisa | Michael | George | Lisa | Michael | George | Lisa |
| dollars | 200 | 500 | 450 | 180.5 | 450 | 448 | 177 | 420 | 447 | 150 | 300 | 344.6 |


On to practice. We encounter a wide dataset in the wild and want to tidy it using the melt method.

We read in the CSV file.

In [3]:
df = pd.read_csv("./data/treatment.csv", sep=";")
df

| | Unnamed: 0 | Treatment A | Treatment B |
|---|---|---|---|
| 0 | John Smith | - | 2 |
| 1 | Jane Doe | 16 | 11 |
| 2 | Mary Johnson | 3 | 1 |


The first column, which contains the names, has no header, so Pandas labels it "Unnamed: 0"; the other two column headers contain values of a variable (the treatment) rather than variable names. The 5 or 6 values in the cells (depending on how we count the "-") have no proper variable name (header) of their own; they are merely framed by the other values.

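In [2]:
# The name column has no header in the CSV, so Pandas calls it "Unnamed: 0";
# give it a proper name before melting.
df = df.rename(columns={"Unnamed: 0": "Name"})
melted_df = pd.melt(df,
                   id_vars=["Name"],
                   var_name="Treatment",
                   value_name="Result")
melted_df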

| | Name | Treatment | Result |
|---|---|---|---|
| 0 | John Smith | Treatment A | - |
| 1 | Jane Doe | Treatment A | 16 |
| 2 | Mary Johnson | Treatment A | 3 |
| 3 | John Smith | Treatment B | 2 |
| 4 | Jane Doe | Treatment B | 11 |
| 5 | Mary Johnson | Treatment B | 1 |
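
The `Result` column still holds the placeholder "-", so Pandas treats the whole column as text. One possible follow-up step, assuming "-" means "no measurement" (the dataset itself does not say so), is to coerce the column to numbers. The frame below re-creates the melted output by hand; the name `melted_sample` is my own:

```python
import pandas as pd

# Hand-built copy of the melted output shown above, with Result as strings.
melted_sample = pd.DataFrame({
    "Name": ["John Smith", "Jane Doe", "Mary Johnson"] * 2,
    "Treatment": ["Treatment A"] * 3 + ["Treatment B"] * 3,
    "Result": ["-", "16", "3", "2", "11", "1"],
})

# errors="coerce" turns anything that is not a number (here "-") into NaN.
melted_sample["Result"] = pd.to_numeric(melted_sample["Result"], errors="coerce")
print(melted_sample)
```

An alternative under the same assumption is to pass `na_values="-"` to `read_csv`, so that the placeholder becomes a missing value while the file is read.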
