# Tidy Data with stack and melt

### Objectives
After this lesson you should be able to...
+ Explain what tidy data is
+ Spot messy data
+ Transform a simple messy dataset into a tidy data set
+ Master the reshaping methods: **`melt, stack`**
+ Know the equivalence of **`stack/melt`**
+ Use the Series rename method to change the **`name`** attribute of a Series
+ Use the **`rename_axis`** method to rename the levels of an index

### Prepare for this lesson by...
+ Reading Hadley Wickham's paper on [tidy data](http://vita.had.co.nz/papers/tidy-data.pdf)
+ Watching Hadley Wickham's talk on [tidy data](https://vimeo.com/33727555)
+ Watching Jeff Leek's video on [tidy data](https://www.youtube.com/watch?v=whDilsFoLVY)
+ Reading the [reshaping pandas documentation page](http://pandas.pydata.org/pandas-docs/stable/reshaping.html)

### Datasets until now
Thus far, we have analyzed several datasets but have not done much work to change their structure or do any preprocessing before computation. We immediately began generating results and answering questions. Producing results is typically not the first step of a data analysis. The vast majority of datasets 'in the wild' will need some amount of inspection and preprocessing. And in some cases, the entire project will just be about cleaning the data so that it can be further processed by someone else. 

For all the work that goes into data preparation for machine learning, there is surprisingly sparse coverage on how to do it. This notebook will use many ideas formulated by Hadley Wickham to **tidy** data before introducing a few more steps in order to prepare it for machine learning and visualization.

An infamous data science saying goes something like this: "data scientists spend 80% of their time cleaning data and the other 20% complaining about cleaning the data."

### The genesis of data
Do you know where and how data is generated? Many introductory courses such as this one will use premade csv files. Loading this data into your workspace is not the genesis of this data. The data from these sources must come from somewhere. It wasn't just magically put in a csv file or on a website or in a database used by an API. 

Some original sources of data might be:
+ While you play a mobile game, your smartphone sends game data to a small SQLite instance on the phone itself and to a large remote Amazon S3 server.
+ You keep track of all your golf scores on paper and copy them to an Excel file after each round.
+ Sensors on industrial equipment continually pour data into an on-premise Hadoop cluster.
+ Facebook quickly writes all its interactions to HBase.
+ City of Houston employees enter personal information in an online web app.

Yes, non-electronic data does exist and is valuable (that was all there was before the 20th century) but for obvious reasons we will only deal with electronic data that can be read by modern computers.

### Tidy Data
Tidy data is a term coined by Hadley Wickham, the creator of many useful R packages, to describe data that is in a form for easy analysis. It is highly recommended that you read [his paper](http://vita.had.co.nz/papers/tidy-data.pdf) to get a fuller understanding of tidy data. The basics will be covered below.

Tidy data is a specific structure of data that makes analysis easier. A dataset is tidy when:
1. Each variable forms a column
2. Each observation forms a row
3. Each type of observational unit forms a table

Any dataset that does not meet this definition is considered messy. This definition is simple but useful and something that will take you a long way in your data exploration analyses. 

### First example of messy data
Messy data can appear deceptively clean and tidy, especially if you have not been exposed to it before. In the table below we have some data about the weight of some fruit owned by some people.

In [2]:
import pandas as pd
import numpy as np

In [3]:
# looks so nice and clean!
df = pd.DataFrame(data=[[12, 10, 40], [9, 7, 12], [0, 14, 190]], 
                  columns=['Apple', 'Orange', 'Banana'],
                  index=['Texas', 'Arizona', 'Florida'])
df

Unnamed: 0,Apple,Orange,Banana
Texas,12,10,40
Arizona,9,7,12
Florida,0,14,190


### What's wrong?
Even though the dataset presents perfectly readable and acceptable information, it is not technically a tidy dataset. Machine learning would be uninteresting with this dataset, but visualization would be made easier if the data were tidy. More on this in the plotting notebooks.

The main issue with the above dataset is that the column names are themselves values of a variable (the type of fruit) rather than variable names. At this point, you might be confused as to what exactly is meant by a 'variable'. A simple definition of a variable is anything that is liable to change.

### What are the variable names?
None of the variable names are actually part of the DataFrame above. You must infer them from the context of the problem. The variables are:
+ States
+ Types of fruit 
+ Weight of fruit

### Actual Tidying
To tidy, we simply need to make sure the three tidy rules are followed. Let's start with forcing each variable into a column. The states already appear to be in a single column, though they are actually in the pandas **`index`**. We will remove it from the index later.

The types of fruit are column names and need to be transposed to a column.

The weight of the fruit is a total mess and comprises a three by three square.

### Stacking
The pandas **`stack`** method restructures the DataFrame by taking every data value (not the column names and not the index) and forcing them into one column of data. The result is a pandas **`Series`** in which each value is labeled with its original index label and its original column name.

In [4]:
# stacking the data into a Series
df.stack()

Texas    Apple      12
         Orange     10
         Banana     40
Arizona  Apple       9
         Orange      7
         Banana     12
Florida  Apple       0
         Orange     14
         Banana    190
dtype: int64
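
As a minimal sketch of what the stacked result looks like from the inside, the block below rebuilds the same fruit DataFrame and inspects the Series that **`stack`** returns. The variable names `stacked` and `texas` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(data=[[12, 10, 40], [9, 7, 12], [0, 14, 190]],
                  columns=['Apple', 'Orange', 'Banana'],
                  index=['Texas', 'Arizona', 'Florida'])

stacked = df.stack()

# The result is a Series, not a DataFrame
print(type(stacked).__name__)            # Series

# Its index has two levels: the old row labels and the old column names
print(stacked.index.nlevels)             # 2

# Select a single value with a (row label, column name) tuple
print(stacked.loc[('Texas', 'Banana')])  # 40

# Select every value that belongs to one outer label
texas = stacked.loc['Texas']
print(texas.to_dict())
```

Selecting with the outer label alone returns a smaller Series indexed by the fruit names.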

### Finish Tidying
With one command, the above data is much closer to being tidy, but the Series index now has two levels (a MultiIndex). The **`reset_index`** method will push all these values back out as normal DataFrame columns.

In [5]:
df_tidy = df.stack().reset_index()
df_tidy

Unnamed: 0,level_0,level_1,0
0,Texas,Apple,12
1,Texas,Orange,10
2,Texas,Banana,40
3,Arizona,Apple,9
4,Arizona,Orange,7
5,Arizona,Banana,12
6,Florida,Apple,0
7,Florida,Orange,14
8,Florida,Banana,190


### Column Names
The 'columns' in the **`index`** are technically called **`levels`** which can have names (more on this later) but do not here. By default they are referenced as integers beginning from 0 on the left. The index can have any number of levels.

Let's rename the columns directly with a list.

In [6]:
df_tidy.columns = ['State', 'Fruit', 'Weight']

df_tidy

Unnamed: 0,State,Fruit,Weight
0,Texas,Apple,12
1,Texas,Orange,10
2,Texas,Banana,40
3,Arizona,Apple,9
4,Arizona,Orange,7
5,Arizona,Banana,12
6,Florida,Apple,0
7,Florida,Orange,14
8,Florida,Banana,190


In [7]:
# All steps together
df_tidy = df.stack().reset_index()
df_tidy.columns = ['State', 'Fruit', 'Weight']
df_tidy

Unnamed: 0,State,Fruit,Weight
0,Texas,Apple,12
1,Texas,Orange,10
2,Texas,Banana,40
3,Arizona,Apple,9
4,Arizona,Orange,7
5,Arizona,Banana,12
6,Florida,Apple,0
7,Florida,Orange,14
8,Florida,Banana,190


### Our first tidy dataset
By ensuring that each variable forms its own column, each observation is also in its own row.
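
One way to see the payoff of the tidy layout, sketched here with the same fruit data: because every variable is its own column, a summary by any variable is a single `groupby`. The names `weight_by_fruit` and `weight_by_state` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(data=[[12, 10, 40], [9, 7, 12], [0, 14, 190]],
                  columns=['Apple', 'Orange', 'Banana'],
                  index=['Texas', 'Arizona', 'Florida'])

# Same tidying steps as above
df_tidy = df.stack().reset_index()
df_tidy.columns = ['State', 'Fruit', 'Weight']

# Each variable is a column, so we can group by either one
weight_by_fruit = df_tidy.groupby('Fruit')['Weight'].sum()
weight_by_state = df_tidy.groupby('State')['Weight'].sum()

print(weight_by_fruit)
print(weight_by_state)
```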

### Alternate way of renaming the levels and the Series before `reset_index`
It's possible to do the tidying and column renaming in a single line of code. When the **`rename_axis`** method is passed a list (or a scalar) it renames the levels. Let's see the result of this step.

In [9]:
df.stack().rename_axis(['State', 'Fruit'])

State    Fruit 
Texas    Apple      12
         Orange     10
         Banana     40
Arizona  Apple       9
         Orange      7
         Banana     12
Florida  Apple       0
         Orange     14
         Banana    190
dtype: int64

Notice the level names directly above each index level. We can give the Series itself a name by passing a string to the **`rename`** method.

In [10]:
df.stack().rename_axis(['State', 'Fruit']).rename('Weight')

State    Fruit 
Texas    Apple      12
         Orange     10
         Banana     40
Arizona  Apple       9
         Orange      7
         Banana     12
Florida  Apple       0
         Orange     14
         Banana    190
Name: Weight, dtype: int64

Now the levels have names and the Series itself has a name. When we call **`reset_index`**, the old level names become column names and the Series name becomes the column name for the Series values.

In [12]:
df.stack()\
  .rename_axis(['State', 'Fruit'])\
  .rename('Weight')\
  .reset_index()

Unnamed: 0,State,Fruit,Weight
0,Texas,Apple,12
1,Texas,Orange,10
2,Texas,Banana,40
3,Arizona,Apple,9
4,Arizona,Orange,7
5,Arizona,Banana,12
6,Florida,Apple,0
7,Florida,Orange,14
8,Florida,Banana,190


This is a pretty cumbersome way of renaming columns but it allows you to do it in one line.
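
A slightly shorter variant, offered as a sketch rather than the canonical approach: the Series **`reset_index`** method accepts a **`name`** argument for the values column, so the separate **`rename`** call can be dropped. Wrapping the chain in parentheses also avoids the trailing backslashes. The variable name `tidy` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame(data=[[12, 10, 40], [9, 7, 12], [0, 14, 190]],
                  columns=['Apple', 'Orange', 'Banana'],
                  index=['Texas', 'Arizona', 'Florida'])

tidy = (
    df.stack()
      .rename_axis(['State', 'Fruit'])   # name the two index levels
      .reset_index(name='Weight')        # name the values column while resetting
)
print(tidy)
```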

# Focus on `melt, stack, pivot, unpivot`
We will shift focus for the moment to mastering **`melt`, `stack`, `pivot`** and **`unpivot`** on this simple dataset. These will be your primary tools for moving from messy to tidy data and back to messy data again. We will return our focus to tidy data after these basic commands are covered.

### Accomplishing the same task with `melt`
Like most large Python libraries, pandas has many different ways to accomplish the same task. A large percentage of the pandas questions on Stack Overflow have multiple answers that produce the same successful output with different commands. The differences are usually in readability and performance.

pandas contains a DataFrame method named **`melt`** which works similarly to the **`stack`** method but gives a bit more flexibility. **`melt`** accepts several parameters, two of which are the most important:
+ **`id_vars`** - a list of column names that you want to keep as columns.
+ **`value_vars`** - a list of column names that you would like to move into one column

This 'moving' into one column is usually referred to as 'melting' or 'stacking'. The **`id_vars`** will stay in the same column they are currently in but repeat to align with all the newly stacked values in the **`value_vars`** columns. 

One other important note: **`melt`** ignores the **`index`**, so any labels stored there are dropped from the result. To keep the states, we first reset the index so that they become a regular column.

In [13]:
df2 = df.reset_index()
df2

Unnamed: 0,index,Apple,Orange,Banana
0,Texas,12,10,40
1,Arizona,9,7,12
2,Florida,0,14,190


In [14]:
# rename that ugly column
df2 = df2.rename(columns={'index':'State'})
df2

Unnamed: 0,State,Apple,Orange,Banana
0,Texas,12,10,40
1,Arizona,9,7,12
2,Florida,0,14,190
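
The reset and rename steps above can also be combined. As a sketch, naming the index with **`rename_axis`** before calling **`reset_index`** produces the `State` column directly, with no intermediate `index` column to clean up.

```python
import pandas as pd

df = pd.DataFrame(data=[[12, 10, 40], [9, 7, 12], [0, 14, 190]],
                  columns=['Apple', 'Orange', 'Banana'],
                  index=['Texas', 'Arizona', 'Florida'])

# Passing a scalar to rename_axis names the (single-level) index;
# reset_index then uses that name for the new column
df2 = df.rename_axis('State').reset_index()
print(df2)
```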


In [15]:
# id_vars are the columns you don't want to stack/melt. 
# value_vars are the columns you do want to stack/melt
df_melt = df2.melt(id_vars='State', 
                   value_vars=['Apple', 'Orange', 'Banana'])
df_melt

Unnamed: 0,State,variable,value
0,Texas,Apple,12
1,Arizona,Apple,9
2,Florida,Apple,0
3,Texas,Orange,10
4,Arizona,Orange,7
5,Florida,Orange,14
6,Texas,Banana,40
7,Arizona,Banana,12
8,Florida,Banana,190


### Renaming with `melt`
**`melt`** contains two other handy-dandy parameters that let you name the melted and value columns.

In [16]:
df_melt = df2.melt(id_vars='State', 
                   value_vars=['Apple', 'Orange', 'Banana'],
                   var_name='Fruit', 
                   value_name='Weight')
df_melt

Unnamed: 0,State,Fruit,Weight
0,Texas,Apple,12
1,Arizona,Apple,9
2,Florida,Apple,0
3,Texas,Orange,10
4,Arizona,Orange,7
5,Florida,Orange,14
6,Texas,Banana,40
7,Arizona,Banana,12
8,Florida,Banana,190


In [18]:
# all in one step
df.reset_index()\
  .rename(columns={'index':'State'})\
  .melt(id_vars='State', 
        value_vars=['Apple', 'Orange', 'Banana'],
        var_name='Fruit', 
        value_name='Weight')

Unnamed: 0,State,Fruit,Weight
0,Texas,Apple,12
1,Arizona,Apple,9
2,Florida,Apple,0
3,Texas,Orange,10
4,Arizona,Orange,7
5,Florida,Orange,14
6,Texas,Banana,40
7,Arizona,Banana,12
8,Florida,Banana,190


By default, all the columns not named in **`id_vars`** will be melted.

In [19]:
df2.melt(id_vars='State', 
         var_name='Fruit', 
         value_name='Weight')

Unnamed: 0,State,Fruit,Weight
0,Texas,Apple,12
1,Arizona,Apple,9
2,Florida,Apple,0
3,Texas,Orange,10
4,Arizona,Orange,7
5,Florida,Orange,14
6,Texas,Banana,40
7,Arizona,Banana,12
8,Florida,Banana,190
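
Conversely, passing only some columns to **`value_vars`** melts just those columns and leaves the rest out of the result. The sketch below rebuilds the same data and keeps only `Apple` and `Banana`; the variable name `subset` is only for illustration.

```python
import pandas as pd

df2 = pd.DataFrame({'State': ['Texas', 'Arizona', 'Florida'],
                    'Apple': [12, 9, 0],
                    'Orange': [10, 7, 14],
                    'Banana': [40, 12, 190]})

# Only the listed value_vars are melted; Orange does not appear in the output
subset = df2.melt(id_vars='State',
                  value_vars=['Apple', 'Banana'],
                  var_name='Fruit',
                  value_name='Weight')
print(subset)
```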


In [20]:
df2

Unnamed: 0,State,Apple,Orange,Banana
0,Texas,12,10,40
1,Arizona,9,7,12
2,Florida,0,14,190


### `stack` vs `melt`
The primary purpose of both **`stack`** and **`melt`** is to take multiple columns and put them in a single column. Think of columns being stacked one on top of another, or columns literally melting their data down into one common place. Each value in this long column will be labeled by its original column name.

The **`stack`** method takes every column of the DataFrame and stacks all the values into a single column. You do not get to choose a subset of columns. The column names also get put into the **`index`** and create a MultiIndex.

The **`melt`** method gives you more control and allows you to choose which columns will be stacked and which ones will remain as labels. Any values in the index must be first reset if they are going to be used with **`melt`**.

**Terminology**: For the sake of brevity, 'stacked' and 'melted' will refer to the same exact data operation. You will also hear this called **unpivoting**.
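
To make the equivalence concrete, here is a small sketch that tidies the fruit data both ways and compares the results. The two methods return rows in a different order (`stack` goes row by row, `melt` goes column by column), so the rows are sorted before comparing. The names `via_stack`, `via_melt`, `rows_stack` and `rows_melt` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(data=[[12, 10, 40], [9, 7, 12], [0, 14, 190]],
                  columns=['Apple', 'Orange', 'Banana'],
                  index=['Texas', 'Arizona', 'Florida'])

via_stack = (
    df.stack()
      .rename_axis(['State', 'Fruit'])
      .rename('Weight')
      .reset_index()
)

via_melt = (
    df.rename_axis('State')
      .reset_index()
      .melt(id_vars='State', var_name='Fruit', value_name='Weight')
)

# Same column names in the same order
print(list(via_stack.columns) == list(via_melt.columns))   # True

# Same rows once the ordering difference is removed
rows_stack = sorted(via_stack.itertuples(index=False, name=None))
rows_melt = sorted(via_melt.itertuples(index=False, name=None))
print(rows_stack == rows_melt)                              # True
```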

### Set the index before using `stack`
When using the **`stack`** method, all the column names get put into the index. The previous index gets 'pushed' one level out. Therefore the current index does not get stacked and it remains as a row identifier.

In order to tidy data without overly stacking your data, you need to put the identifying column(s) into the index. For instance, see the example below. If you have a column like **`State`** that you don't want to stack, put it in the index first.

In [27]:
df3 = pd.DataFrame(data=[['Texas', 12, 10, 40], ['Arizona', 9, 7, 12], ['Florida', 0, 14, 190]], 
                   columns=['State', 'Apple', 'Orange', 'Banana'])

In [28]:
df3

Unnamed: 0,State,Apple,Orange,Banana
0,Texas,12,10,40
1,Arizona,9,7,12
2,Florida,0,14,190


If you don't put **State** in the index then the data becomes 'overly-stacked'

In [29]:
df3.stack()

0  State       Texas
   Apple          12
   Orange         10
   Banana         40
1  State     Arizona
   Apple           9
   Orange          7
   Banana         12
2  State     Florida
   Apple           0
   Orange         14
   Banana        190
dtype: object

Put **State** in the index first and then stack.

In [30]:
df3.set_index('State').stack()

State          
Texas    Apple      12
         Orange     10
         Banana     40
Arizona  Apple       9
         Orange      7
         Banana     12
Florida  Apple       0
         Orange     14
         Banana    190
dtype: int64
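
The same idea extends to more than one identifying column. The sketch below uses a small made-up table (the `sales` data and its `Year` column are hypothetical, not part of the lesson's datasets) and puts both identifiers in the index before stacking.

```python
import pandas as pd

# Hypothetical data with two identifying columns
sales = pd.DataFrame({'State': ['Texas', 'Texas', 'Arizona'],
                      'Year': [2016, 2017, 2016],
                      'Apple': [12, 15, 9],
                      'Orange': [10, 11, 7]})

tidy = (
    sales.set_index(['State', 'Year'])            # protect both identifiers from stacking
         .stack()
         .rename_axis(['State', 'Year', 'Fruit'])  # name all three index levels
         .rename('Weight')
         .reset_index()
)
print(tidy)
```

Only the `Apple` and `Orange` columns are stacked; `State` and `Year` repeat alongside each value.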

# Your Turn

### Problem 1
<span  style="color:green; font-size:16px">Calculate the total weight of all the fruit data by reshaping it first and then summing it up.</span>

In [25]:
fruit = pd.DataFrame(data=[[12, 10, 40], [9, 7, 12], [0, 14, 190]], 
                     columns=['Apple', 'Orange', 'Banana'],
                     index=['Texas', 'Arizona', 'Florida'])
fruit

Unnamed: 0,Apple,Orange,Banana
Texas,12,10,40
Arizona,9,7,12
Florida,0,14,190


In [18]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">There are three columns with actor names in them. Reshape the data so that you may count the frequency of all actors together, regardless of their original column.</span>

In [33]:
movie = pd.read_csv('data/movie.csv')

In [19]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">There are three columns with actor Facebook likes in them. Reshape the data and then sum up all the actor Facebook likes for the entire dataset.</span>

In [20]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">Tidy the dataset in the **`employee_messy1.csv`** file. It contains the count of all employees by race and gender. Do it with **`melt`** and again with **`stack`**.</span>

In [21]:
# your code here

### Problem 5
<span  style="color:green; font-size:16px">Tidy the dataset in the **`employee_messy2.csv`** file. It contains the count of all employees by department, race and gender. Do it with **`melt`** and again with **`stack`**.</span>

In [22]:
# your code here