## Tidy data with Pandas

[Pandas](http://pandas.pydata.org/) is the Python module for working efficiently with tabular data: data arranged in columns, where each column has a certain type and each row is a single record holding one value for each of those columns.

Pandas can do many things, but what I end up using it for most often is to load in and manipulate data that I want to transform for another program, or that I want to run machine learning routines on. Those manipulations can be cleaning, joining with other data tables, or changing the form so that it's "tidy" or "tall", which is required for programs like Tableau.

Hadley Wickham, who is very influential in the R world, wrote [this paper](http://vita.had.co.nz/papers/tidy-data.pdf) spelling out what "tidy" data is. There was also a [recent blog post](http://www.jeannicholashould.com/tidy-data-in-python.html) showing how to do many of those examples using Python and Pandas, which taught me a lot. From that post:

#### Defining tidy data

The structure Wickham defines as tidy has the following attributes:

* Each variable forms a column and contains values
* Each observation forms a row
* Each type of observational unit forms a table

A few definitions:

* Variable: A measurement or an attribute. Height, weight, sex, etc.
* Value: The actual measurement or attribute. 152 cm, 80 kg, female, etc.
* Observation: All values measured on the same unit. Each person.
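
To make those terms concrete, here is a minimal sketch of a table that is already tidy. The people and measurements are invented for illustration and are not part of the workbook used later.

```python
import pandas as pd

# Hypothetical data: each column is one variable,
# each row is one observation (a person)
people = pd.DataFrame({
    "name": ["Ana", "Ben"],
    "height_cm": [152, 180],
    "weight_kg": [80, 75],
    "sex": ["female", "male"],
})

# One value sits at each row/column intersection
ana_height = people.loc[people["name"] == "Ana", "height_cm"].iloc[0]
```

Here `height_cm` is a variable, `152` is a value, and the whole first row is the observation for one person.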


In [None]:
# In Python we need to import any modules we're going to use
# and we can give them (standard) shorter names

import pandas as pd

In [None]:
# This is a bash shell command to see what's in our current directory
!ls

### Splitting lists into columns

This first example is jumping right in to a complicated situation, but one I haven't seen documented very many places, and one I run into all the time.

In [None]:
# The data is in a sub-folder called "data"
# read_excel will read the first sheet in the workbook if you don't specify another

ps = pd.read_excel('./data/PeopleStates.xlsx')
ps

In [None]:
# string operations (the .str accessor) are applied to each row
# without expand=True you would end up with a single column of lists

psplit = ps.states.str.split(',', expand=True)
psplit
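
If you don't have the workbook handy, the difference `expand=True` makes can be sketched with a small stand-in table. The `demo` frame and its values below are assumptions made up for this example, not the contents of the Excel file.

```python
import pandas as pd

# Hypothetical stand-in for the PeopleStates sheet
demo = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "states": ["OH,MI,IN", "TX"],
})

# Without expand: one column where each cell holds a Python list
as_lists = demo["states"].str.split(",")

# With expand=True: a DataFrame with one column per piece,
# padded with missing values where a row has fewer pieces
as_columns = demo["states"].str.split(",", expand=True)
```

Rows with fewer states than the longest row get missing values in the extra columns, which is why a `dropna` step is needed later on.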

In [None]:
# concat will use the index to align rows

pexp = pd.concat([ps.name, psplit], axis=1)
pexp
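
Aligning on the index matters when the two pieces are not in the same row order. The following sketch uses made-up Series (not the workbook data) to show that `concat` matches rows by index label rather than by position.

```python
import pandas as pd

# Hypothetical data: the right-hand Series is out of order and missing index 1
left = pd.Series(["Ann", "Bob", "Cy"], name="name")           # index 0, 1, 2
right = pd.Series(["TX", "OH"], index=[2, 0], name="state")   # index 2, 0

# axis=1 places the Series side by side, matching on index labels
aligned = pd.concat([left, right], axis=1)
```

Ann (index 0) is paired with `OH` and Cy (index 2) with `TX`, while Bob gets a missing value because no row with index 1 exists on the right.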

### Un-pivoting into tall format

Tableau calls this pivoting, but many call it un-pivoting, since a pivot table in Excel takes data from the tall format into wide. In Pandas you do a "melt". In `tidyr` this is a "gather". In OpenRefine it's a Transpose Columns into Rows operation.

In [None]:
# id_vars will be repeated and not un-pivoted
# all others will be melted down into a single column (values)
# with the column names as a separate column (variables)

ptidy = pd.melt(pexp, id_vars=['name'], value_name='state')
ptidy.head(8)

In [None]:
# since we didn't specify a var_name for melt(), it defaulted to "variable"
# can specify a list to select only certain columns, dropping others not needed

ptidy = ptidy[['name','state']]
ptidy
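
As a variation, you can name the variable column yourself by passing `var_name` to `melt()`. This is a minimal sketch on invented data shaped like the split table above; the column name `position` is a hypothetical choice.

```python
import pandas as pd

# Hypothetical wide table: one column per split-out state
wide = pd.DataFrame({
    "name": ["Ann", "Bob"],
    0: ["OH", "TX"],
    1: ["MI", None],
})

# var_name labels the column that holds the old column names
tall = pd.melt(wide, id_vars=["name"], var_name="position", value_name="state")
```

The result has one row per name/column pair, so two people and two state columns give four rows, one of which has a missing state.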

In [None]:
# many methods accept an "inplace" argument, which changes the DataFrame
# directly instead of returning a modified copy
# NOTE: you're writing over your data in place!

ptidy.dropna(inplace=True)

In [None]:
ptidy.sort_values(by='name', inplace=True)
ptidy
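
If you'd rather not overwrite your data, the same two steps can be chained without `inplace`, since each method returns a new DataFrame. This is a sketch on made-up data, not the notebook's own table.

```python
import pandas as pd

# Hypothetical tall table with one missing state
tall = pd.DataFrame({
    "name": ["Bob", "Ann", "Ann"],
    "state": ["TX", None, "OH"],
})

# Each call returns a new DataFrame; `tall` itself is left untouched
cleaned = (
    tall.dropna(subset=["state"])
        .sort_values(by="name")
        .reset_index(drop=True)
)
```

The `reset_index(drop=True)` step is optional; it just renumbers the rows after the drop and sort.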

### Merging (joining) two data sets

Here we'll read in a second sheet from that same workbook and join its state-level data with the people/states data we just modified.

In [None]:
# sheet_name picks a sheet by name (older pandas versions spelled it "sheetname")
sp = pd.read_excel('./data/PeopleStates.xlsx', sheet_name='Sheet2')
sp.tail(5)

In [None]:
ppop = pd.merge(ptidy, sp, how='left', on='state')
ppop.sort_values('population_2010', ascending=False, inplace=True)
ppop
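
With a left join, rows that find no match keep their left-hand values and get missing values for the right-hand columns. One way to spot them is the `indicator` argument of `merge`. The tables and population numbers below are invented placeholders, not real census figures.

```python
import pandas as pd

# Hypothetical inputs; "ZZ" has no matching row in the states table
people = pd.DataFrame({"name": ["Ann", "Bob"], "state": ["OH", "ZZ"]})
states = pd.DataFrame({"state": ["OH", "TX"], "population_2010": [100, 200]})

# indicator=True adds a "_merge" column saying where each row came from
merged = people.merge(states, how="left", on="state", indicator=True)

# Rows present only in the left table found no match on the right
unmatched = merged[merged["_merge"] == "left_only"]
```

Checking `unmatched` after a join is a quick way to catch misspelled keys or stray whitespace left over from the split step.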

### Saving table out to a file

Usually we could save to an Excel file, but that would require installing another module, so we'll save as JSON for now. JSON output has multiple "orientations"; see the
[to_json docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html).

In [None]:
# 'records' orientation will make a list of rows, each an object/dictionary

ppop.to_json('./data/PeopleStates_Merged.json', orient='records')
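
To see what the orientation changes, here is a small sketch comparing `'records'` with `'columns'` on a one-row table. The data is made up, and `to_json` is called without a path so that it returns a string instead of writing a file.

```python
import json
import pandas as pd

# Hypothetical one-row table
df = pd.DataFrame({"name": ["Ann"], "state": ["OH"]})

# 'records': a list with one object per row
records = json.loads(df.to_json(orient="records"))

# 'columns': an object per column, keyed by the row index
columns = json.loads(df.to_json(orient="columns"))
```

The `'records'` form drops the index and reads naturally as a list of rows, which is often the easier shape for other programs to consume.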