# Tidy Data in Python
by [Jean-Nicholas Hould](http://www.jeannicholashould.com/)


from the blog post of the same name
[http://www.jeannicholashould.com/tidy-data-in-python.html](http://www.jeannicholashould.com/tidy-data-in-python.html)

## Tidying messy datasets (Intro)

Two very common problems with messy datasets are:

- A column contains multiple-valued lists
- Column headers are values, not variable names

We'll run through how to fix these problems in the examples below. We use the Python module `Pandas` for dealing with tabular data.

We'll also learn how to join two tables, both by concatenation and by the Pandas equivalent of an SQL JOIN statement.

Plus, we'll see how to save tables to CSV and JSON files.


In [None]:
import pandas as pd

---

## Splitting lists into columns

This wasn't part of the original paper, but it's an example I run into all the time and I haven't seen it documented in very many places.

The data is in a sub-folder called `data`. The `read_excel()` function will read the first sheet in the workbook if you don't specify another.

In [None]:
ps = pd.read_excel('./data/PeopleStates.xlsx')
ps

#### Referring to columns with a quoted name inside square brackets

We can refer to a specific column using square brackets with the name in quotes. This form is required if the column name has spaces in it. Each column by itself, taken out of the DataFrame, is not a DataFrame but a "Series".

In [None]:
ps["name"]

#### Referring to columns with dataframe.name

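If the name doesn't have spaces in it, we can use "dot notation": the DataFrame variable name, then ".", then the column name. This also requires that the column name not clash with an existing DataFrame attribute or method.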

In [None]:
ps.name
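
To see the two access styles side by side, here is a minimal sketch on a made-up DataFrame. The `demo` table and its `home state` column are hypothetical and are not part of the PeopleStates workbook.

```python
import pandas as pd

# Hypothetical data, only for illustration.
demo = pd.DataFrame({"name": ["Ann", "Bob"], "home state": ["OH", "TX"]})

by_brackets = demo["name"]       # works for any column name
by_dot = demo.name               # works because "name" is a valid identifier
with_space = demo["home state"]  # brackets are the only option here

is_series = isinstance(by_brackets, pd.Series)  # a single column is a Series
same_values = by_brackets.equals(by_dot)        # both styles give the same data
```

Either style returns the same Series, so the bracket form is the safer habit when column names come from someone else's spreadsheet.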

#### String `.str` operations will be applied to each row

Here we do a "splitting" operation on the column to split what is currently a single string containing commas, into a list of the items between the commas.

*Note, you will end up with a single column of lists if you don't put `expand=True`, which denotes that you're intending to "expand the dimensionality" of the data set.*

*Notice, also, that the DataFrame will expand to enough columns to accommodate the list with the most elements, unless you specify a limit, and lists without enough elements will have `None` in the extra columns.*

In [None]:
psplit = ps.states.str.split(',', expand=True)
psplit
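
The effect of `expand=True` and of a split limit is easier to see on a tiny example. This is a sketch on a hypothetical Series of comma-separated strings; the limit is passed through the `n` argument of `.str.split()`.

```python
import pandas as pd

# Hypothetical sample strings, not the real workbook contents.
states = pd.Series(["OH,TX,CA", "NY", "WA,OR"])

as_lists = states.str.split(",")                   # one column holding lists
as_columns = states.str.split(",", expand=True)    # one column per list item
limited = states.str.split(",", n=1, expand=True)  # at most one split per row

# Rows with fewer items are padded with nulls in the extra columns.
padding_is_null = as_columns.iloc[1, 1:].isna().all()
```

With `n=1` the first row keeps `"TX,CA"` together in the second column, because splitting stops after the first comma.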

#### Concatenation – `concat()`

Pandas will use the Index to align rows of the original `name` Series and the `psplit` DataFrame that are being concatenated.

- `axis=0` is down the rows
- `axis=1` is across the columns.

Let's put the expanded states and the names back together into one table.

In [None]:
pexp = pd.concat([ps.name, psplit], axis=1)
pexp
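
Index alignment is the part of `concat()` that tends to surprise people, so here is a small sketch with made-up data. The second table stores the same index labels in a different order, and the rows still line up by label rather than by position.

```python
import pandas as pd

# Hypothetical data for illustration only.
names = pd.Series(["Ann", "Bob", "Cy"], index=[0, 1, 2], name="name")
# Same index labels, deliberately stored in a different order.
extra = pd.DataFrame({"state": ["CA", "OH", "TX"]}, index=[2, 0, 1])

side_by_side = pd.concat([names, extra], axis=1)  # across columns, matched on index labels
stacked = pd.concat([names, names], axis=0)       # down the rows, simply appended
```

Here `side_by_side.loc[0]` pairs "Ann" with "OH", because both carry the index label 0.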

---

## Column headers are values, not variable names

*One of the more common manipulations*

### Un-pivoting into tall format

Many call this process of going from a wide data set to tall "un-pivoting" since a pivot table in Excel converts data from the tall format into wide. 

You need this when the column headers hold data that you want in a column of its own. You also want the values that are spread across multiple rows and columns to end up in a single measurement column.

- **In Pandas you do a "melt"**
- In `tidyr` this is a "gather"
- In OpenRefine it's a "Transpose->Transpose cells across columns into rows..." operation
- In Tableau this is called a "Pivot"

Let's first define a simple, small data frame:

In [None]:
df = pd.DataFrame({'label':['A','B','C'],
                  'x':[1,2,3],
                  'y':[4,5,6],
                  'z':[7,8,9]})
df

Minimally, you need to specify the DataFrame to "melt", and a list of which columns don't get "un-pivoted". The latter will get repeated.

In [None]:
df2 = pd.melt(df, ['label'])
df2

### Now back to the States dataset

- id_vars will be repeated and not un-pivoted
- all others will be melted down into a single column (values)
- with the column names as a separate column (variables)

When we don't specify a `var_name=` for `melt()`, it will default to "variable"

In [None]:
ptidy = pd.melt(pexp, id_vars=['name'], value_name='state')
ptidy
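
For comparison, here is a sketch that names both output columns explicitly with `var_name=` and `value_name=`. The table and the names `axis` and `reading` are made up for the example.

```python
import pandas as pd

# Hypothetical wide table.
wide = pd.DataFrame({"label": ["A", "B"], "x": [1, 2], "y": [3, 4]})

# "label" is repeated; the headers "x" and "y" become values in "axis",
# and the cell values are collected into "reading".
tall = pd.melt(wide, id_vars=["label"], var_name="axis", value_name="reading")
```

The result has one row per original cell: two labels times two melted columns gives four rows.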

#### Drop columns

In this case we don't need the "variable" column. We can specify a list of column names to select only certain columns, dropping others that aren't needed.


In [None]:
ptidy = ptidy[['name','state']]
ptidy

#### NOTE: "inplace"

- Most functions return a modified copy of the DataFrame instead of changing the original
- Many methods include an "inplace" argument that changes the original DataFrame directly instead of returning a copy
- **Be careful! You're writing over your data in place!**
- `dropna()` defaults to dropping any row that has a null/None in any column. You can specify a subset of columns to look in instead.

#### `dropna()` to drop nulls


In [None]:
ptidy.dropna(inplace=True)
ptidy
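
If you'd rather not overwrite anything, the same clean-up can be written with reassignment. The sketch below uses a made-up table to contrast the default behaviour of `dropna()` with the `subset=` argument.

```python
import pandas as pd

# Hypothetical table with nulls in two different columns.
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "state": ["OH", None, "TX"],
    "note": [None, "x", "y"],
})

any_null = people.dropna()                    # drops rows with a null in any column
state_null = people.dropna(subset=["state"])  # only checks the "state" column
rows_in_original = len(people)                # the original is left untouched
```

Without `inplace=True`, `people` still has all three rows afterwards, and each result is a new DataFrame.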

#### `sort_values()` to sort by values

Again, the default is to make a copy, so you either have to reassign, or change "inplace".

In [None]:
ptidy.sort_values(by='name', inplace=True)
ptidy

---

### Merging (joining) two data sets

Here we'll read in a second sheet out of the same Excel workbook and join this state-level data with the people/states data we just modified. 

**This is the Pandas equivalent of an SQL JOIN command**

We'll start by loading in a table of the US states, their populations, and the number of congressional House seats each state has.

In [None]:
sp = pd.read_excel('./data/PeopleStates.xlsx', sheet_name='Sheet2')
sp.tail(5)

We'll do a LEFT JOIN by using the `merge()` function, specifying which DataFrame is on the "left" and which is on the "right" for the JOIN. It's just the order in which you list them as the first two arguments to `merge()`.

We also need to specify which column contains the ID fields / keys to join on. We put these in the "left_on" and "right_on" arguments.

Then, we'll sort the rows "descending" and "in place" by the population column.

In [None]:
ppop = pd.merge(ptidy, sp, how='left', left_on='state', right_on='state')

ppop.sort_values('population_2010', ascending=False, inplace=True)
ppop
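
A LEFT JOIN keeps every row from the left table even when there is no match on the right. The sketch below shows that on made-up data; the `population` figures and the `ZZ` state code are invented for the example. When the key column has the same name in both tables, `on=` is a shorthand for giving `left_on=` and `right_on=` the same value, and `indicator=True` adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical tables; "ZZ" has no match on the right on purpose.
left = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "state": ["OH", "TX", "ZZ"]})
right = pd.DataFrame({"state": ["OH", "TX"], "population": [100, 200]})

joined = pd.merge(left, right, how="left", on="state", indicator=True)

# Rows that found no partner in the right table have nulls in its columns.
unmatched = joined[joined["_merge"] == "left_only"]
```

Checking `unmatched` after a join is a quick way to spot misspelled keys before they turn into silent nulls.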

---

## Saving table out to a CSV file

We could also save to an Excel file, but that would require installing another module, so we'll save as a CSV file for now, which is a very useful format.

- It's good practice to specify the `encoding`, which is the method used for recording characters beyond the 128-character ASCII set.

- In this case we also don't need to save the `index` column to the file, so we'll turn that option off

In [None]:
ppop.to_csv('./data/PeopleStates_Merged.csv', encoding='utf-8', index=False)
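
To see what the `index` option changes, here is a sketch that writes to a string instead of a file (when no path is given, `to_csv()` returns the CSV text). The table is made up, and it includes a non-ASCII name.

```python
import io
import pandas as pd

# Hypothetical table with one non-ASCII character.
table = pd.DataFrame({"name": ["Ann", "Zoë"], "state": ["OH", "TX"]})

with_index = table.to_csv()                # index written as an unnamed first column
without_index = table.to_csv(index=False)  # only the data columns

# Reading the text back gives the same values.
round_trip = pd.read_csv(io.StringIO(without_index))
```

The header line of `with_index` starts with a comma, which is the unnamed index column that `index=False` leaves out.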

#### Save to JSON

Another option is to save as a JSON file. There are multiple "orientations":
[to_json docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html)

`records` orientation will make a list of rows, each an object/dictionary


In [None]:
ppop.to_json('./data/PeopleStates_Merged.json', orient='records')
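
The difference between orientations is easiest to see by parsing the output back into Python objects. This sketch uses a made-up table and compares `records` with `columns`. When no path is given, `to_json()` returns the JSON text.

```python
import json
import pandas as pd

# Hypothetical table.
table = pd.DataFrame({"name": ["Ann", "Bob"], "state": ["OH", "TX"]})

# "records": a list with one dictionary per row.
records = json.loads(table.to_json(orient="records"))

# "columns": one dictionary per column, keyed by the index label.
columns = json.loads(table.to_json(orient="columns"))
```

Note that `records` drops the index entirely, while `columns` keeps it as string keys.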

---

## Stop here and try the Pew Research Center Dataset exercise!

**Click here to open:** [10_PewExercise.ipynb](10_PewExercise.ipynb)

*Don't look yet, but solutions are in:* [11_PewExerciseSolutions.ipynb](11_PewExerciseSolutions.ipynb)

---