# Tidy Data in Python
by [Jean-Nicholas Hould](http://www.jeannicholashould.com/)


from the blog post of the same name
[http://www.jeannicholashould.com/tidy-data-in-python.html](http://www.jeannicholashould.com/tidy-data-in-python.html)

## Tidying messy datasets (Intro)

Two very common problems with messy datasets are:

- A column contains multiple-valued lists
- Column headers are values, not variable names

We'll run through how to fix these problems in the examples below. We use the Python module `Pandas` for dealing with tabular data.

We'll also learn how to join two tables, both by concatenation and by the Pandas equivalent of an SQL JOIN statement.

Plus, we'll see how to save tables to CSV and JSON files.


In [3]:
import pandas as pd

---

## Splitting lists into columns

This wasn't part of the original paper, but it's an example I run into all the time and I haven't seen it documented in very many places.

The data is in a sub-folder called `data`. The `read_excel()` function **reads the first sheet in the workbook by default if you don't specify another**.

In [4]:
ps = pd.read_excel('./data/PeopleStates.xlsx')
ps

Unnamed: 0,name,states
0,Bobby,"Wyoming,Michigan"
1,Sue,"Wisconsin,Nevada,California"
2,Tamika,"Florida,Washington"
3,Cale,South Dakota
4,Iris,"Washington,Oregon,California"


#### Referring to columns with a quoted name inside square brackets

We can refer to a specific column using square brackets with the name in quotes (this is the necessary form if the column name has spaces in it). Each column by itself, taken out of the DataFrame, is not a DataFrame, but a "Series".

In [5]:
ps["name"]

0     Bobby
1       Sue
2    Tamika
3      Cale
4      Iris
Name: name, dtype: object

#### Referring to columns with dataframe.name

If the name doesn't have spaces in it, we can use the "dot notation", with the dataframe variable name "." column name. 

*Note: Although this works the same as the bracket notation, there are good arguments for sticking consistently with the bracket notation rather than dot. See the [Minimally Sufficient Pandas article](https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428) for more details.*

In [6]:
ps.name

0     Bobby
1       Sue
2    Tamika
3      Cale
4      Iris
Name: name, dtype: object
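
As a small illustration of why bracket notation is the safer habit, here is a sketch using a made-up table (the `people` DataFrame and its column names are hypothetical, not part of the workshop data). One column name contains a space, and another shares its name with a DataFrame method.

```python
import pandas as pd

# Hypothetical toy table: "home state" contains a space, and
# "count" clashes with the built-in DataFrame.count() method.
people = pd.DataFrame({
    "name": ["Bobby", "Sue"],
    "home state": ["Wyoming", "Wisconsin"],
    "count": [2, 3],
})

# Bracket notation works for every column name.
home_states = people["home state"]
counts = people["count"]

# Dot notation returns the DataFrame.count method here, not the column.
dot_count_is_method = callable(people.count)

print(home_states.tolist())
print(counts.tolist())
print(dot_count_is_method)
```

`people.home state` would not even be valid Python, so brackets are the only option for names with spaces.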

#### String `.str` operations will be applied to each row

A simple string operation is making everything lowercase.

In [7]:
ps["name"].str.lower()

0     bobby
1       sue
2    tamika
3      cale
4      iris
Name: name, dtype: object

#### Splitting strings on a delimiter character

Here we do a "splitting" operation on the column to split what is currently a single string containing commas, into a list of the items between the commas.

*Note, you will end up with a single column of lists if you don't pass `expand=True`, which denotes that you're intending to "expand the dimensionality" of the data set.*

*Notice, also, that the DataFrame will expand to enough columns to accommodate the list with the most elements, unless you specify a limit, and lists without enough elements will have `None` in the extra columns.*

In [8]:
split_states = ps["states"].str.split(',', expand=True)
split_states

Unnamed: 0,0,1,2
0,Wyoming,Michigan,
1,Wisconsin,Nevada,California
2,Florida,Washington,
3,South Dakota,,
4,Washington,Oregon,California
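
To make the two notes above concrete, here is a minimal sketch on a small stand-alone Series (the `states` Series is a made-up subset of the data, not read from the Excel file). It shows the list-valued result you get without `expand=True`, and how the `n` argument limits the number of splits.

```python
import pandas as pd

states = pd.Series([
    "Wyoming,Michigan",
    "Wisconsin,Nevada,California",
    "South Dakota",
])

# Without expand=True: a single Series whose values are Python lists.
as_lists = states.str.split(",")

# n=1 allows at most one split per string, so at most two columns result.
first_and_rest = states.str.split(",", n=1, expand=True)

print(as_lists.tolist())
print(first_and_rest)
```

With `n=1`, everything after the first comma stays together in the second column, and rows with no comma get a missing value there.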


#### Concatenation – `concat()`

Pandas will use the Index to align rows of the `ps.name` Series and the `split_states` DataFrame that are being concatenated.

- `axis=0` is down the rows
- `axis=1` is across the columns.

Let's put the expanded states and the names back together into one table.

In [9]:
pexp = pd.concat([ps.name, split_states], axis=1)
pexp

Unnamed: 0,name,0,1,2
0,Bobby,Wyoming,Michigan,
1,Sue,Wisconsin,Nevada,California
2,Tamika,Florida,Washington,
3,Cale,South Dakota,,
4,Iris,Washington,Oregon,California
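
The index alignment is easy to miss when both inputs are already in the same order. The following sketch uses two made-up Series (the names `names` and `first_state` are for illustration only) whose rows are deliberately stored in different orders, to show that `concat()` with `axis=1` matches rows by index label rather than by position.

```python
import pandas as pd

names = pd.Series(["Bobby", "Sue", "Tamika"], index=[0, 1, 2], name="name")

# Same index labels, but deliberately stored in a different row order.
first_state = pd.Series(
    ["Florida", "Wyoming", "Wisconsin"], index=[2, 0, 1], name="first_state"
)

# axis=1 places the inputs side by side, matching rows by index label.
combined = pd.concat([names, first_state], axis=1)
print(combined)
```

Bobby (label 0) is paired with Wyoming even though Wyoming is second in `first_state`.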


---

## Column headers are values, not variable names

*One of the more common manipulations*

### Un-pivoting into tall format

Many call this process of going from a wide data set to tall "un-pivoting" since a pivot table in Excel converts data from the tall format into wide. 

The situation when you need this is that you have data in the column headers that you want in their own column. You also want the values that are spread across the multiple rows and columns to end up in a single measurement column.

- **In Pandas you do a "melt"**
- In `tidyr` this is a "gather"
- In OpenRefine it's a "Transpose->Transpose cells across columns into rows..." operation
- In Tableau this is called a "Pivot"

Let's first define a simple, small data frame:

In [10]:
df = pd.DataFrame({'label':['A','B','C'],
                  'x':[1,2,3],
                  'y':[4,5,6],
                  'z':[7,8,9]})
df

Unnamed: 0,label,x,y,z
0,A,1,4,7
1,B,2,5,8
2,C,3,6,9


Minimally, you need to specify the DataFrame to "melt", and a list of which columns don't get "un-pivoted". The latter will get repeated.

In [11]:
df2 = pd.melt(df, ['label'])
df2

Unnamed: 0,label,variable,value
0,A,x,1
1,B,x,2
2,C,x,3
3,A,y,4
4,B,y,5
5,C,y,6
6,A,z,7
7,B,z,8
8,C,z,9


### Now back to the States dataset

- id_vars will be repeated and not un-pivoted
- all others will be melted down into a single column (values)
- with the column names as a separate column (variables)

When we don't specify a `var_name=` for `melt()`, the name of that column defaults to "variable".

In [12]:
ptidy = pd.melt(pexp, id_vars=['name'], value_name='state')
ptidy

Unnamed: 0,name,variable,state
0,Bobby,0,Wyoming
1,Sue,0,Wisconsin
2,Tamika,0,Florida
3,Cale,0,South Dakota
4,Iris,0,Washington
5,Bobby,1,Michigan
6,Sue,1,Nevada
7,Tamika,1,Washington
8,Cale,1,
9,Iris,1,Oregon
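
If the default "variable" and "value" names aren't descriptive enough, both can be set in the same call. This is a small sketch on a made-up table (the `wide` DataFrame and the names "axis" and "measurement" are chosen only for illustration).

```python
import pandas as pd

wide = pd.DataFrame({"label": ["A", "B"], "x": [1, 2], "y": [3, 4]})

# var_name names the column built from the old headers;
# value_name names the column holding the melted cell values.
tall = pd.melt(wide, id_vars=["label"], var_name="axis", value_name="measurement")
print(tall)
```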


#### Drop columns

In this case we don't need the "variable" column. There are a couple ways we can get rid of, or *drop*, unwanted columns. We can

- Specify a list of column names to select only certain columns to keep, dropping others that aren't needed (we'll cover a strange point about this method in a second)
- Use the `drop()` method

Since the `drop()` method can drop either rows or columns from a DataFrame, we need to either

- tell Pandas what values to drop, plus the axis along which to drop (0=rows, 1=columns)
- or we can explicitly say `columns=` or `index=` (for rows) **<- I think this way is more straightforward**


In [14]:
pnamestate = ptidy.drop(columns=['variable'])
pnamestate.head()

Unnamed: 0,name,state
0,Bobby,Wyoming
1,Sue,Wisconsin
2,Tamika,Florida
3,Cale,South Dakota
4,Iris,Washington
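
For comparison, here is a short sketch of both ways of calling `drop()`, on a made-up two-row table (`tidy` is a stand-in for illustration, not the workshop data). Rows are dropped by index label using `index=`.

```python
import pandas as pd

tidy = pd.DataFrame({
    "name": ["Bobby", "Sue"],
    "variable": [0, 0],
    "state": ["Wyoming", "Wisconsin"],
})

# Labels plus an axis: axis=1 means the labels are column names.
by_axis = tidy.drop(["variable"], axis=1)

# Explicit keyword: no axis argument needed.
by_keyword = tidy.drop(columns=["variable"])

# Rows are dropped by index label with index=.
without_first_row = tidy.drop(index=[0])

print(by_keyword)
print(without_first_row)
```

Both column-dropping forms give the same result, and none of these calls changes `tidy` itself.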


---

### Side note on dropping columns by specifying a subset to keep

A common way to drop columns is to select the subset of columns you'd like to keep and assign that to a new or the same variable:

In [15]:
p_temp = ptidy[['name','state']]
p_temp.head()

Unnamed: 0,name,state
0,Bobby,Wyoming
1,Sue,Wisconsin
2,Tamika,Florida
3,Cale,South Dakota
4,Iris,Washington


An annoying situation arises, though, when you then try to change that new DataFrame: you get a `SettingWithCopyWarning`. *(Note that it actually performs the operation, but the warnings distract me when looking at the notebook because they look like an error.)*

In [16]:
p_temp.loc[0,'name'] = "YinJi"
p_temp.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,name,state
0,YinJi,Wyoming
1,Sue,Wisconsin
2,Tamika,Florida
3,Cale,South Dakota
4,Iris,Washington


The warning exists to flag a problem that can arise unexpectedly: people think they're changing a value in a "view" of their original DataFrame, when really they're changing a value in a *copy* of a piece of their original DataFrame.

So, if you drop columns by taking a subset, **you can get around this warning by explicitly telling Pandas you know you want to make a copy** during that subsetting operation.

In [17]:
p_temp = ptidy[['name','state']].copy()
p_temp.loc[0,'name'] = "YinJi"
p_temp.head()

Unnamed: 0,name,state
0,YinJi,Wyoming
1,Sue,Wisconsin
2,Tamika,Florida
3,Cale,South Dakota
4,Iris,Washington


---

### Back to our names, states example

#### *NOTE: "inplace"*

- Most methods return a modified copy of the DataFrame instead of changing the original
- Many methods accept an `inplace=True` argument, which changes the original instead of returning a copy
- **Be careful! You're writing over your data in place!**
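
A minimal sketch of the difference, using a made-up `scores` table (hypothetical, for illustration only). One detail worth knowing: with `inplace=True` the method returns `None`, so assigning its result to a variable is a common mistake.

```python
import pandas as pd

scores = pd.DataFrame({"name": ["Sue", "Bobby"], "score": [2, 1]})

# Default: a new, sorted DataFrame comes back; `scores` keeps its order.
sorted_copy = scores.sort_values(by="name")
order_after_copy = scores["name"].tolist()

# inplace=True: `scores` itself is reordered and the call returns None.
result = scores.sort_values(by="name", inplace=True)
order_after_inplace = scores["name"].tolist()

print(order_after_copy)
print(result)
print(order_after_inplace)
```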

#### `dropna()` to drop nulls

- Defaults to dropping any row that has a null/None in **any** column
- You can specify a subset of columns to test instead.

In [18]:
pnamestate.dropna(inplace=True)
pnamestate

Unnamed: 0,name,state
0,Bobby,Wyoming
1,Sue,Wisconsin
2,Tamika,Florida
3,Cale,South Dakota
4,Iris,Washington
5,Bobby,Michigan
6,Sue,Nevada
7,Tamika,Washington
9,Iris,Oregon
11,Sue,California
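
The `subset` option matters once a table has other columns that are allowed to be empty. Here is a sketch with a made-up `visits` table (the table and its "note" column are hypothetical) in which every row has a null somewhere.

```python
import pandas as pd

visits = pd.DataFrame({
    "name": ["Bobby", "Sue", "Cale"],
    "state": ["Wyoming", None, "South Dakota"],
    "note": [None, "moved", None],
})

# Default: drop a row if ANY column is null. Every row here has a null.
strict = visits.dropna()

# subset= tests only the listed columns, so nulls in "note" are ignored.
has_state = visits.dropna(subset=["state"])

print(len(strict))
print(has_state)
```

Note that the surviving rows keep their original index labels, which is why the output above skips some index numbers.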


#### `sort_values()` to sort by values

**Again, the default is to make a copy, and just print that out**, so unless you reassign the result or set `inplace=True`, the function won't change the original values!

In [19]:
pnamestate.sort_values(by='name')

Unnamed: 0,name,state
0,Bobby,Wyoming
5,Bobby,Michigan
3,Cale,South Dakota
4,Iris,Washington
9,Iris,Oregon
14,Iris,California
1,Sue,Wisconsin
6,Sue,Nevada
11,Sue,California
2,Tamika,Florida


In [20]:
pnamestate

Unnamed: 0,name,state
0,Bobby,Wyoming
1,Sue,Wisconsin
2,Tamika,Florida
3,Cale,South Dakota
4,Iris,Washington
5,Bobby,Michigan
6,Sue,Nevada
7,Tamika,Washington
9,Iris,Oregon
11,Sue,California


---

## Merging (joining) two data sets

Here we'll read in a second sheet out of the same Excel workbook and join this state-level data with the people/states data we just modified. 

**This is the Pandas equivalent of an SQL JOIN command**

We'll start by loading in a table of the US states, their populations, and the number of congressional house seats they are represented by.

In [21]:
sp = pd.read_excel('./data/PeopleStates.xlsx', sheet_name='Sheet2')
sp.tail(5)

Unnamed: 0,state,population_2010,house_seats
45,South Dakota,814191,1
46,North Dakota,672591,1
47,Alaska,710249,1
48,Vermont,625745,1
49,Wyoming,563767,1


#### LEFT JOIN

We'll do a LEFT JOIN by using the `merge()` function, specifying which DataFrame is on the "left" and which is on the "right" for the JOIN. It's just the order in which you list them as the first two arguments to `merge()`.

We also need to specify which column contains the ID fields / keys to join on. We put these in the "left_on" and "right_on" arguments.

Then, we'll sort the rows "descending" and "in place" by the population column.

In [22]:
ppop = pd.merge(ptidy, sp, how='left', left_on='state', right_on='state')

ppop.sort_values('population_2010', ascending=False, inplace=True)
ppop

Unnamed: 0,name,variable,state,population_2010,house_seats
11,Sue,2,California,37254503.0,53.0
14,Iris,2,California,37254503.0,53.0
2,Tamika,0,Florida,18804623.0,27.0
5,Bobby,1,Michigan,9884129.0,14.0
4,Iris,0,Washington,6724543.0,10.0
7,Tamika,1,Washington,6724543.0,10.0
1,Sue,0,Wisconsin,5687289.0,8.0
9,Iris,1,Oregon,3831073.0,5.0
6,Sue,1,Nevada,2700691.0,4.0
3,Cale,0,South Dakota,814191.0,1.0
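
When the key column has the same name in both tables, `on=` is a shorter alternative to giving `left_on` and `right_on` separately. The sketch below uses two small made-up tables (the person "Ana" and the state "Atlantis" are invented so that one row has no match) and adds `indicator=True`, which records where each row came from.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Bobby", "Cale", "Ana"],
    "state": ["Wyoming", "South Dakota", "Atlantis"],  # "Atlantis" has no match
})
states = pd.DataFrame({
    "state": ["Wyoming", "South Dakota"],
    "population_2010": [563767, 814191],
})

# how="left" keeps every row of `people`; unmatched rows get NaN.
# indicator=True adds a "_merge" column: "both" or "left_only".
joined = pd.merge(people, states, how="left", on="state", indicator=True)
print(joined)
```

Unmatched rows receive `NaN`, which forces an integer column to become float. That is presumably why the populations and seat counts in the output above are printed with a trailing `.0`.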


---

## Saving table out to a CSV file

Normally we could also save to an Excel file, but that requires installing another module, so we'll save as a CSV file for now, which is a very useful format.

- It's good practice to specify the `encoding`, which is the method used for recording characters as bytes. It matters most for characters beyond the 128-character ASCII set.

- In this case we also don't need to save the `index` column to the file, so we'll turn that option off

In [23]:
ppop.to_csv('./data/PeopleStates_Merged.csv', encoding='utf-8', index=False)
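
To see what `index=False` changes without writing any files, the sketch below relies on the fact that `to_csv()` returns the CSV text as a string when no path is given. The `table` DataFrame is a made-up two-row stand-in for the merged data.

```python
import io

import pandas as pd

table = pd.DataFrame(
    {"name": ["Sue", "Iris"], "state": ["Nevada", "Oregon"]},
    index=[6, 9],
)

# With no path argument, to_csv() returns the CSV text as a string.
with_index = table.to_csv()
without_index = table.to_csv(index=False)

# Reading the index-free text back gives a fresh 0..n-1 index.
round_trip = pd.read_csv(io.StringIO(without_index))

print(with_index)
print(without_index)
print(round_trip)
```

With the index included, the header row starts with an empty field, which is where the "Unnamed: 0" column comes from when such a file is read back in.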

#### Save to JSON

Another option is to save as a JSON file. There are multiple "orientations":
[to_json docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html)

`records` orientation will make a list of rows, each an object/dictionary


In [24]:
ppop.to_json('./data/PeopleStates_Merged.json', orient='records')
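
To compare orientations without creating files, this sketch calls `to_json()` with no path, which returns the JSON text as a string, and parses it back with the standard `json` module. The `table` DataFrame is a made-up two-row example.

```python
import json

import pandas as pd

table = pd.DataFrame({"name": ["Sue", "Iris"], "house_seats": [4, 5]})

# "records": a list with one object per row.
records = json.loads(table.to_json(orient="records"))

# "columns" (the DataFrame default): one object per column, keyed by index label.
columns = json.loads(table.to_json(orient="columns"))

print(records)
print(columns)
```

The `records` form drops the index entirely, while the `columns` form keeps it as string keys.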

---

## Stop here and try the Pew Research Center Dataset exercise!

**Click here to open:** [10_PewExercise.ipynb](10_PewExercise.ipynb)

*Don't look yet, but solutions are in:* [11_PewExerciseSolutions.ipynb](11_PewExerciseSolutions.ipynb)

---