<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Long and Wide Data

_Authors: Dave Yerrington (SF)_

---

In [2]:
import pandas as pd

### Going from _wide_ to _long_ format of data.

> _While this is important, you may choose to skip it if you would like more time or have other priorities for your class.  Almost every student will need this at some point in the course, but we will not have enough immediate use for it to practice with real-world datasets._

Sometimes you'll get data encoded in a format that isn't very conducive to modeling.  Ideally, we would like each entity to be described by a single row, though there are exceptions depending on how we're modeling.  We might not even care about modeling: we may want to aggregate very specific aspects of the data, or simply reshape it to suit our own needs.


Let's ease into this a little bit first.  Maybe we only want to melt one feature, just to see what this does.


In [3]:
data = [
    (24, 25, 100, "Pikachu"),
    (55, 55, 120, "Bulbasaur"),
    (33, 35, 100, "Charmander"),
    (22, 25, 105, "Geodude"),
    (12, 15, 90,  "JigglyPuff"),
    (55, 55, 115, "Paul"),
]

columns = ["Power", "Speed", "HP", "Name"]
df = pd.DataFrame(data, columns = columns)
df

Unnamed: 0,Power,Speed,HP,Name
0,24,25,100,Pikachu
1,55,55,120,Bulbasaur
2,33,35,100,Charmander
3,22,25,105,Geodude
4,12,15,90,JigglyPuff
5,55,55,115,Paul


In [4]:
pd.melt(df)

Unnamed: 0,variable,value
0,Power,24
1,Power,55
2,Power,33
3,Power,22
4,Power,12
5,Power,55
6,Speed,25
7,Speed,55
8,Speed,35
9,Speed,25


In [5]:
pd.melt(df, id_vars=['Power', 'Speed', 'HP'])

Unnamed: 0,Power,Speed,HP,variable,value
0,24,25,100,Name,Pikachu
1,55,55,120,Name,Bulbasaur
2,33,35,100,Name,Charmander
3,22,25,105,Name,Geodude
4,12,15,90,Name,JigglyPuff
5,55,55,115,Name,Paul


By passing _Power, Speed, and HP_ as `id_vars`, we're telling the `melt()` function that we wish to keep those columns from melting (like putting our ice cream in the freezer).  `Name` wasn't specified, so it was melted: its column name went into a new column called `variable`, and its values (the names of the Pokemon) went into a new column called `value`.

- Every time we `melt()`, the features _variable_ and _value_ are created.
- Columns not specified by the `id_vars=` parameter are unpacked into the _variable_ and _value_ columns.
- The name of the `column` that is being melted becomes the value for each row inside of the `variable` column.
- The value of the data that is being melted is put inside of the `value` column for each row.
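
The rules above can be sketched on a smaller frame. This is a minimal example, assuming a two-row cut of the Pokemon data; it also uses the `value_vars`, `var_name`, and `value_name` parameters of `pd.melt()`, which the lesson does not cover. The names `stat` and `score` are illustrative choices only.

```python
import pandas as pd

# Two rows of the Pokemon data, trimmed for brevity.
df = pd.DataFrame(
    [(24, 25, 100, "Pikachu"), (55, 55, 120, "Bulbasaur")],
    columns=["Power", "Speed", "HP", "Name"],
)

# Keep Name as the identifier and melt only Power and Speed.
# HP is in neither id_vars nor value_vars, so it is left out of the result.
# "stat" and "score" are hypothetical names replacing "variable" and "value".
long_df = pd.melt(
    df,
    id_vars=["Name"],
    value_vars=["Power", "Speed"],
    var_name="stat",
    value_name="score",
)
print(long_df)
```

Renaming the two generated columns can make a long table easier to read when it is melted more than once or merged with other data.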

In [6]:
# Keep only Power and Speed as identifiers; HP and Name are both melted
pd.melt(df, id_vars=['Power', 'Speed'])

Unnamed: 0,Power,Speed,variable,value
0,24,25,HP,100
1,55,55,HP,120
2,33,35,HP,100
3,22,25,HP,105
4,12,15,HP,90
5,55,55,HP,115
6,24,25,Name,Pikachu
7,55,55,Name,Bulbasaur
8,33,35,Name,Charmander
9,22,25,Name,Geodude


In [5]:
melted = df.melt()
melted

Unnamed: 0,variable,value
0,Power,24
1,Power,55
2,Power,33
3,Power,22
4,Power,12
5,Power,55
6,Speed,25
7,Speed,55
8,Speed,35
9,Speed,25


### Visually explained

Melting can be thought of as pairing every cell with its corresponding column name, so that each column and row value gets a row of its own.  By default, `melt()` with no parameters will throw everything into a two-column space: `variable` for the name of the column, and `value` for the corresponding value.

![](https://snag.gy/j7J4tz.jpg)

![](https://snag.gy/myP41e.jpg)
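
To make the picture concrete, here is a rough hand-rolled version of a no-argument melt. It is only a sketch of the idea, not how pandas implements it, and the three-column frame is a trimmed-down assumption.

```python
import pandas as pd

df = pd.DataFrame(
    [(24, 25, "Pikachu"), (55, 55, "Bulbasaur")],
    columns=["Power", "Speed", "Name"],
)

# Walk the frame column by column and emit one (variable, value) row per cell.
rows = []
for column in df.columns:
    for cell in df[column]:
        rows.append({"variable": column, "value": cell})
manual = pd.DataFrame(rows)

# The built-in call, for side-by-side comparison.
builtin = pd.melt(df)

print(manual)
print(builtin)
```

Both frames list every `Power` cell first, then every `Speed` cell, then every `Name` cell, which is the column-by-column stacking shown in the images.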

## From _long_ to _wide_ again
We can convert from _long_ back to _wide_ using `pivot()`.  Pivot needs the `index`, `columns`, and `values` parameters in order to convert an existing _DataFrame_ to a new one: each unique value in the column named by `columns` becomes a new column, each unique value in the column named by `index` becomes a row, and the cells are filled with the corresponding entries from the column named by `values`.  Each combination of `index` and `columns` values must be unique, because it identifies exactly one cell in the wide format.

> _This is a tough concept but it's not the biggest problem we'll face._
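
A small sketch may help here. It assumes a hand-built long frame with a `Name` identifier column, so that each (`Name`, `variable`) pair appears exactly once.

```python
import pandas as pd

# A tiny long-format frame: one row per (Name, variable) pair.
long_df = pd.DataFrame(
    {
        "Name": ["Pikachu", "Pikachu", "Bulbasaur", "Bulbasaur"],
        "variable": ["Power", "Speed", "Power", "Speed"],
        "value": [24, 25, 55, 55],
    }
)

# index    -> which column supplies the row labels
# columns  -> which column supplies the new column names
# values   -> which column supplies the cell contents
wide = long_df.pivot(index="Name", columns="variable", values="value")
print(wide)
```

Without an identifier such as `Name`, there is nothing to tell `pivot()` which values belong on the same row, which is why the melted frame is rebuilt with `id_vars="Name"` before pivoting.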

In [6]:
# Here's our "long" melted dataset
melted

Unnamed: 0,variable,value
0,Power,24
1,Power,55
2,Power,33
3,Power,22
4,Power,12
5,Power,55
6,Speed,25
7,Speed,55
8,Speed,35
9,Speed,25


### The "variable" column holds our features, and the "value" column holds their values.

Let's see if we can get our data back to its once handsome "wide" format.

In [7]:
melted = pd.melt(df, id_vars="Name")
melted.pivot(index="Name", columns="variable", values="value").reset_index()

variable,Name,HP,Power,Speed
0,Bulbasaur,120,55,55
1,Charmander,100,33,35
2,Geodude,105,22,25
3,JigglyPuff,90,12,15
4,Paul,115,55,55
5,Pikachu,100,24,25


## Uh oh, there's a problem here!
After converting back, it seems like something's not quite right.  What do you suppose it is?

> This is a tricky problem actually and it's ok if you don't quite get it at first.  Maybe you can't figure this out at first glance, but what _appears_ to be wrong?
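
One possible reading, offered as an assumption rather than as the lesson's intended answer: the round trip sorted the rows and columns alphabetically and left `variable` behind as a label on the column axis. A minimal sketch of undoing those cosmetic differences, using a three-row cut of the Pokemon data:

```python
import pandas as pd

df = pd.DataFrame(
    [
        (24, 25, 100, "Pikachu"),
        (55, 55, 120, "Bulbasaur"),
        (33, 35, 100, "Charmander"),
    ],
    columns=["Power", "Speed", "HP", "Name"],
)

melted = pd.melt(df, id_vars="Name")
wide = melted.pivot(index="Name", columns="variable", values="value")

# Drop the leftover "variable" label from the column axis.
wide = wide.rename_axis(columns=None)

# Put the rows back in their original order, then restore Name as a column.
restored = wide.reindex(df["Name"].tolist()).reset_index()

# Put the columns back in their original order.
restored = restored[list(df.columns)]
print(restored)
```

This only works because every `Name` is unique in this frame; it does not address cases where the identifier values repeat.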

In [8]:
data = [
    (15, 150, .5,  "bacon"),
    (45, 150, .99, "stack"),
    (55, 150, .55, "bacon"),
    (42, 150, .98, "stack"),
    (56, 150, .66, "bacon"),
    (33, 225, .65, "bacon"),
    (44, 234, .89, "stack"),
]

columns = ["height", "weight", "success", "category"]
df = pd.DataFrame(data, columns = columns)
df

Unnamed: 0,height,weight,success,category
0,15,150,0.5,bacon
1,45,150,0.99,stack
2,55,150,0.55,bacon
3,42,150,0.98,stack
4,56,150,0.66,bacon
5,33,225,0.65,bacon
6,44,234,0.89,stack
