<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Long and Wide Data

_Authors: Dave Yerrington (SF)_

---

`pandas` is the most popular Python package for managing datasets and is used extensively by data scientists.

### Learning Objectives

- Use **melt** to transform a `wide` dataset into a `long` dataset

---
Consider this sample scenario:

- You have an employee dataset where each row represents one employee's employment attributes.
- That dataset does not include pay, but you need pay for your analysis project, so you extract it separately.
- The pay extract comes in a slightly different format: each month's pay sits in its own column. For your analysis, however, you need 'Month' as one column and 'Pay' as another so that you can merge the pay data with your primary employee dataset.
- This is where you need to *transform* your pay dataset, and the `melt()` function does just that!
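As a preview, here is a minimal sketch of that scenario. The column names (`EmployeeID`, `Dept`, `Jan`, `Feb`, `Mar`) and all of the numbers are made up for illustration; they do not come from a real dataset.

```python
import pandas as pd

# Hypothetical pay extract: one row per employee, one column per month.
pay_wide = pd.DataFrame({
    "EmployeeID": [101, 102],
    "Jan": [4000, 5200],
    "Feb": [4000, 5300],
    "Mar": [4100, 5300],
})

# Keep EmployeeID as the identifier and unpivot the month columns
# into a 'Month' column and a 'Pay' column.
pay_long = pay_wide.melt(id_vars="EmployeeID", var_name="Month", value_name="Pay")

# Hypothetical primary employee dataset to merge the long pay data into.
employees = pd.DataFrame({"EmployeeID": [101, 102], "Dept": ["Sales", "Ops"]})
merged = employees.merge(pay_long, on="EmployeeID", how="left")

print(pay_long)
print(merged)
```

Each employee now appears once per month, which is the shape needed to merge on the employee identifier.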

In [1]:
import pandas as pd

### Going from _wide_ to _long_ format of data.

> _While this is important, you may choose to skip over it if you would like more time or have other priorities for your class.  Almost every student will need this at some point in the course, but we will not have enough immediate use for it to practice with real-world datasets._

Sometimes you'll get data encoded in a specific format that isn't very conducive to modeling.  Ideally, we would like to have each entity described by a row.  There are exceptions depending on how we're modeling.  We might not even care about modeling: we may want to aggregate very specific aspects, or we may simply need the data in a particular format for our own purposes.


Let's ease into this a little bit first.  We'll build a small DataFrame and melt it, just to see what this does.


In [2]:
# creating a dataframe
data = [
    (24, 25, 100, "Pikachu"),
    (55, 55, 120, "Bulbasaur"),
    (33, 35, 100, "Charmander"),
    (22, 25, 105, "Geodude"),
    (12, 15, 90,  "JigglyPuff"),
    (55, 55, 115, "Paul"),
]

columns = ["Power", "Speed", "HP", "Name"]

df = pd.DataFrame(data, columns = columns)

print(df.shape)
df

(6, 4)


Unnamed: 0,Power,Speed,HP,Name
0,24,25,100,Pikachu
1,55,55,120,Bulbasaur
2,33,35,100,Charmander
3,22,25,105,Geodude
4,12,15,90,JigglyPuff
5,55,55,115,Paul


In [3]:
# melting the DataFrame --> if df is a 'pivoted' table, melt transforms it into an 'unpivoted' table,
# going from wide to long by listing the values one row after another
print(pd.melt(df).shape)
pd.melt(df)  # notice how the DataFrame has been transformed, similar to the example scenario covered above

(24, 2)


Unnamed: 0,variable,value
0,Power,24
1,Power,55
2,Power,33
3,Power,22
4,Power,12
5,Power,55
6,Speed,25
7,Speed,55
8,Speed,35
9,Speed,25


See the pandas documentation for the `melt` function [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html).

Key points to note:
- If the entire DataFrame is passed to `melt()` with no other arguments, the resulting columns default to `variable` and `value`.
- The most commonly used arguments give you control over the result:
    - `id_vars`: Column(s) to use as identifier variables
    - `value_vars`: Column(s) to unpivot. If not specified, uses all columns that are not set as `id_vars`
    - `var_name`: Name to use for the `variable` column. If None, it uses `frame.columns.name` or `'variable'`
    - `value_name`: Name to use for the `value` column
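A small sketch of these four arguments used together. The replacement column names `Stat` and `Score` are arbitrary choices for illustration, and the DataFrame is a two-row cut of the sample data.

```python
import pandas as pd

df = pd.DataFrame(
    [(24, 25, 100, "Pikachu"), (55, 55, 120, "Bulbasaur")],
    columns=["Power", "Speed", "HP", "Name"],
)

long_df = pd.melt(
    df,
    id_vars=["Name"],               # kept as an identifier column
    value_vars=["Power", "Speed"],  # only these columns are unpivoted; HP is dropped
    var_name="Stat",                # replaces the default 'variable' column name
    value_name="Score",             # replaces the default 'value' column name
)
print(long_df)
```

Because `HP` is in neither `id_vars` nor `value_vars`, it does not appear in the result at all.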

In [4]:
pd.melt(df, id_vars=['Name'])

Unnamed: 0,Name,variable,value
0,Pikachu,Power,24
1,Bulbasaur,Power,55
2,Charmander,Power,33
3,Geodude,Power,22
4,JigglyPuff,Power,12
5,Paul,Power,55
6,Pikachu,Speed,25
7,Bulbasaur,Speed,55
8,Charmander,Speed,35
9,Geodude,Speed,25


By specifying `id_vars` for _Name_, we're telling the `melt()` function that we wish to keep this column from melting (like putting our ice cream in the freezer).  However, `Power`, `Speed` and `HP` weren't specified, so their names became the entries of the new `variable` column, and their values were moved into a new column called `value`.

- Every time we perform a `melt()`, the columns _variable_ and _value_ are created (unless they are renamed with `var_name=` and `value_name=`).
- Columns not specified by the `id_vars=` parameter are unpacked into the _variable_ and _value_ columns.
- The **name** of the `column` that is being melted becomes the value for each row inside of the `variable` column.
- The **data** of the `column` that is being melted is put inside of the `value` column for each row.
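One way to check these rules is to count rows: every melted column contributes one output row per original row. The sketch below rebuilds the sample DataFrame so that it runs on its own.

```python
import pandas as pd

data = [
    (24, 25, 100, "Pikachu"),
    (55, 55, 120, "Bulbasaur"),
    (33, 35, 100, "Charmander"),
    (22, 25, 105, "Geodude"),
    (12, 15, 90, "JigglyPuff"),
    (55, 55, 115, "Paul"),
]
df = pd.DataFrame(data, columns=["Power", "Speed", "HP", "Name"])

# One identifier column, so the other three columns are melted: 6 rows * 3 columns.
melted_one_id = pd.melt(df, id_vars=["Name"])
print(len(melted_one_id), len(df) * 3)

# Two identifier columns, so only two columns are melted: 6 rows * 2 columns.
melted_two_ids = pd.melt(df, id_vars=["Name", "Speed"])
print(len(melted_two_ids), len(df) * 2)

# The 'variable' column holds the names of the melted columns.
print(melted_one_id["variable"].unique())
```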

In [5]:
# Let's modify id_vars and try again: with 2 identifier columns, only the 2 remaining columns of df are melted
pd.melt(df, id_vars=['Name', 'Speed'])

Unnamed: 0,Name,Speed,variable,value
0,Pikachu,25,Power,24
1,Bulbasaur,55,Power,55
2,Charmander,35,Power,33
3,Geodude,25,Power,22
4,JigglyPuff,15,Power,12
5,Paul,55,Power,55
6,Pikachu,25,HP,100
7,Bulbasaur,55,HP,120
8,Charmander,35,HP,100
9,Geodude,25,HP,105


### Visually explained

Melting can be thought of as pairing each value in a row with its corresponding column name, laid out in a 2D space with one row per original cell.  By default, `melt()` with no parameters will throw everything into a 2-column space (`variable` for the name of the column, and `value` for the corresponding values).

![](https://snag.gy/j7J4tz.jpg)

![](https://snag.gy/myP41e.jpg)
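To make the picture concrete, the default melt can be rebuilt by hand. This is only an illustrative sketch of the idea, not how pandas implements `melt()` internally: for each column, build a two-column block of (column name, value) pairs, then stack the blocks.

```python
import pandas as pd

df = pd.DataFrame(
    [(24, 25, "Pikachu"), (55, 55, "Bulbasaur")],
    columns=["Power", "Speed", "Name"],
)

# One block per original column: the column name repeated, next to that column's values.
pieces = [
    pd.DataFrame({"variable": [col] * len(df), "value": df[col].tolist()})
    for col in df.columns
]

# Stack the blocks on top of each other and renumber the rows.
by_hand = pd.concat(pieces, ignore_index=True)

print(by_hand)
print(pd.melt(df))
```

The two printed tables should list the same (variable, value) pairs in the same order.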

## From _long_ to _wide_ again
We can convert from _long_ back to _wide_ using `pivot()`.  Why? Remember, what we just did with `melt()` was an _unpivot_. Now we'll use `pivot()` to revert those changes.

`pivot()` needs to know the `index`, `columns`, and `values` parameters in order to convert a given _DataFrame_ to a new pivoted one. It helps to relate these to the fields you fill in when creating a pivot table in a spreadsheet:

- `index`: columns passed to 'Rows' in a spreadsheet pivot table
- `columns`: columns passed to 'Columns' in a spreadsheet pivot table
- `values`: columns passed to 'Values' in a spreadsheet pivot table

> _This may be a little confusing at first, but it'll get easier with practice, as does everything in coding, and this certainly isn't the biggest problem we'll face._

Further reading:
- [Reshaping concept](https://pandas.pydata.org/docs/user_guide/reshaping.html)
- [Pivot_table when the transformation involves aggregation](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html)
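A short sketch of when `pivot_table` is needed instead of `pivot`. The long DataFrame below is hypothetical: it contains two `Power` readings for Pikachu, so one (index, column) pair maps to more than one value. `pivot()` cannot reshape that and raises a `ValueError`, while `pivot_table()` combines the duplicates with an aggregation function (here, the mean).

```python
import pandas as pd

# Hypothetical long data with a repeated (Name, variable) pair.
long_df = pd.DataFrame({
    "Name": ["Pikachu", "Pikachu", "Pikachu", "Geodude"],
    "variable": ["Power", "Power", "Speed", "Power"],
    "value": [24, 26, 25, 22],
})

pivot_failed = False
try:
    long_df.pivot(index="Name", columns="variable", values="value")
except ValueError as err:
    # Duplicate entries for (Pikachu, Power) cannot be placed in a single cell.
    pivot_failed = True
    print("pivot failed:", err)

# pivot_table aggregates the duplicates; missing combinations become NaN.
wide = long_df.pivot_table(index="Name", columns="variable", values="value", aggfunc="mean")
print(wide)
```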

In [6]:
# Here's our default "long, flat" melted dataset
melted = df.melt()
melted

Unnamed: 0,variable,value
0,Power,24
1,Power,55
2,Power,33
3,Power,22
4,Power,12
5,Power,55
6,Speed,25
7,Speed,55
8,Speed,35
9,Speed,25


In [8]:
# recapping our original unmelted df for reference to perform pivot()
df

Unnamed: 0,Power,Speed,HP,Name
0,24,25,100,Pikachu
1,55,55,120,Bulbasaur
2,33,35,100,Charmander
3,22,25,105,Geodude
4,12,15,90,JigglyPuff
5,55,55,115,Paul


In [9]:
# pivoting melted dataframe 
melted.pivot(columns = 'variable', values = 'value')

variable,HP,Name,Power,Speed
0,,,24.0,
1,,,55.0,
2,,,33.0,
3,,,22.0,
4,,,12.0,
5,,,55.0,
6,,,,25.0
7,,,,55.0
8,,,,35.0
9,,,,25.0


- Pivoting the flat melted DataFrame does not give us the expected result, i.e. something shaped like `df`.
- Note that `melted` has only 2 columns, so it can supply only 2 of the arguments to `pivot()` (`columns` and `values`).
- Per the [pandas pivot documentation](https://pandas.pydata.org/pandas-docs/version/1.3.1/reference/api/pandas.DataFrame.pivot.html), **if no index is passed to `pivot`, it uses the existing index**. That is why the output of `pivot()` has an index running from 0 to 23, matching the melted DataFrame.
- We need to fix the pivot by specifying an index to serve as the reference for pivoting/reshaping.
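The shape of that result can be checked directly. This sketch rebuilds the sample DataFrame so it runs on its own, then confirms that the index-less pivot keeps all 24 rows and fills exactly one cell per row.

```python
import pandas as pd

data = [
    (24, 25, 100, "Pikachu"),
    (55, 55, 120, "Bulbasaur"),
    (33, 35, 100, "Charmander"),
    (22, 25, 105, "Geodude"),
    (12, 15, 90, "JigglyPuff"),
    (55, 55, 115, "Paul"),
]
df = pd.DataFrame(data, columns=["Power", "Speed", "HP", "Name"])

melted = df.melt()
wide = melted.pivot(columns="variable", values="value")

# Still 24 rows (the melted index 0..23), now spread over 4 columns.
print(wide.shape)

# Number of non-missing cells in each row; every row holds a single value.
filled_per_row = wide.notna().sum(axis=1)
print(filled_per_row.unique())
```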

In [10]:
# create a new melted df, keeping 'Name' as an identifier so that it can serve as the index for pivoting
melted_name = pd.melt(df, id_vars="Name")
melted_name

Unnamed: 0,Name,variable,value
0,Pikachu,Power,24
1,Bulbasaur,Power,55
2,Charmander,Power,33
3,Geodude,Power,22
4,JigglyPuff,Power,12
5,Paul,Power,55
6,Pikachu,Speed,25
7,Bulbasaur,Speed,55
8,Charmander,Speed,35
9,Geodude,Speed,25


In [11]:
# pivot with Name as the index, then reset_index so that Name becomes a regular df column again
melted_pivot = melted_name.pivot(index = 'Name', columns = 'variable', values = 'value').reset_index()
melted_pivot

variable,Name,HP,Power,Speed
0,Bulbasaur,120,55,55
1,Charmander,100,33,35
2,Geodude,105,22,25
3,JigglyPuff,90,12,15
4,Paul,115,55,55
5,Pikachu,100,24,25


In [12]:
melted_pivot.columns

Index(['Name', 'HP', 'Power', 'Speed'], dtype='object', name='variable')

In [13]:
# removing the name of the column index ('variable') left over from the melt --> pivot operation
melted_pivot.columns.name = None
melted_pivot

Unnamed: 0,Name,HP,Power,Speed
0,Bulbasaur,120,55,55
1,Charmander,100,33,35
2,Geodude,105,22,25
3,JigglyPuff,90,12,15
4,Paul,115,55,55
5,Pikachu,100,24,25
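As a final check, the round trip can be compared against the original DataFrame. This is a sketch under one assumption worth stating: `pivot()` returns rows sorted by `Name` and columns sorted alphabetically, so both frames are put into a common order before comparing.

```python
import pandas as pd

data = [
    (24, 25, 100, "Pikachu"),
    (55, 55, 120, "Bulbasaur"),
    (33, 35, 100, "Charmander"),
    (22, 25, 105, "Geodude"),
    (12, 15, 90, "JigglyPuff"),
    (55, 55, 115, "Paul"),
]
df = pd.DataFrame(data, columns=["Power", "Speed", "HP", "Name"])

# melt --> pivot, exactly as in the cells above
restored = (
    pd.melt(df, id_vars="Name")
    .pivot(index="Name", columns="variable", values="value")
    .reset_index()
)
restored.columns.name = None

# Restore the original column order, and sort the original rows by Name.
restored = restored[list(df.columns)]
expected = df.sort_values("Name").reset_index(drop=True)

round_trip_ok = restored.equals(expected)
print(round_trip_ok)
```

Apart from row and column order, nothing is lost in the round trip because every `Name` is unique in this dataset.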
