The way data is collected often makes it harder to analyze with a computer.

Many analysts agree that the cleanest format is one in which each row of your data contains exactly one observation.  
This format is often called tidy data and is explained in full in Hadley Wickham's paper 
[Tidy Data](http://vita.had.co.nz/papers/tidy-data.pdf).

However, we often collect and store data in a format that has multiple observations per row.

As an example, imagine measurements of energy use from a set of rooms in a few dorms on a campus.  

The data will likely be in the following format, where each room's measurements are on a single row.
However, each row actually holds two observations: one of the energy used by phones and one of the energy used by laptops.

|    | household   | dorm    |   phone_energy |   laptop_energy |
|---:|:------------|:--------|---------------:|----------------:|
|  0 | A           | tuscany |             10 |              50 |
|  1 | B           | sauv    |             30 |              60 |
|  2 | C           | tuscany |             12 |              45 |
|  3 | D           | sauv    |             20 |              50 |

We may want to answer a set of questions.

- Which dorm uses more energy?
- Which appliance uses more energy?
- Does the split of energy use between appliances differ from one dorm to another?

To answer these questions, it may make sense to give each phone measurement and each laptop measurement its own row.  If we structure the data this way, we can take advantage of powerful existing tools built to analyze data in this format.



In [1]:
data = '''
household,dorm,phone_energy,laptop_energy
A,tuscany,10,50
B,sauv,30,60
C,tuscany,12,45
D,sauv,20,50
'''

%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from io import StringIO
from tabulate import tabulate

# StringIO wraps the CSV text in a file-like object so read_csv can parse it
df = pd.read_csv(StringIO(data))
df

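|    | household   | dorm    |   phone_energy |   laptop_energy |
|---:|:------------|:--------|---------------:|----------------:|
|  0 | A           | tuscany |             10 |              50 |
|  1 | B           | sauv    |             30 |              60 |
|  2 | C           | tuscany |             12 |              45 |
|  3 | D           | sauv    |             20 |              50 |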


The 
[`melt`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html) 
function in pandas transforms data with multiple observations per row (wide-format data) into data with a single observation per row (long-format data).

The details of the function can be intimidating, but the idea is simple.
For each row of the original data, move each observation onto its own row.  
Look at the first row of the data above and see how it corresponds to the two rows below.

| household | dorm | appliance | energy_kWh |
| --------- | ---- | --------- | ---------- |
| A         | tuscany | phone_energy  | 10 |
| A         | tuscany | laptop_energy | 50 |


The `melt` function performs this for the entire table at once.
The first argument is the data frame that you want to transform to long format.
The argument `id_vars` lists the columns that identify each observation; their values are repeated on every row of your new data frame.
The argument `value_vars` lists the columns containing the observations.
Note that these column names become entries in the new data frame.
The argument `var_name` is the name of the new column that will store the column names from `value_vars`.
Finally, the argument `value_name` is the name of the new column that will hold the data from each of the `value_vars` columns.

Before you had 4 rows of data with 8 total observations of energy.  
Now you have 8 rows of data and still have 8 observations.

In [2]:
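# Reshape from wide to long: one row per (household, appliance) observation
tidy_df = pd.melt(df,
                  id_vars=['household', 'dorm'],  # identifier columns repeated on every row
                  value_vars=['phone_energy', 'laptop_energy'],  # columns holding the observations
                  var_name='appliance',  # new column storing the old column names
                  value_name='energy_kWh')  # new column storing the measured values
tidy_df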

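|    | household   | dorm    | appliance     |   energy_kWh |
|---:|:------------|:--------|:--------------|-------------:|
|  0 | A           | tuscany | phone_energy  |           10 |
|  1 | B           | sauv    | phone_energy  |           30 |
|  2 | C           | tuscany | phone_energy  |           12 |
|  3 | D           | sauv    | phone_energy  |           20 |
|  4 | A           | tuscany | laptop_energy |           50 |
|  5 | B           | sauv    | laptop_energy |           60 |
|  6 | C           | tuscany | laptop_energy |           45 |
|  7 | D           | sauv    | laptop_energy |           50 |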
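
As a quick sanity check on the row counts described earlier, the sketch below rebuilds the same example table and compares the shapes before and after melting. The variable names `n_wide_rows`, `n_long_rows`, and `rows_for_a` are illustrative only and are not part of the original notebook.

```python
import pandas as pd

# Rebuild the wide-format example table so this sketch runs on its own.
df = pd.DataFrame({
    "household": ["A", "B", "C", "D"],
    "dorm": ["tuscany", "sauv", "tuscany", "sauv"],
    "phone_energy": [10, 30, 12, 20],
    "laptop_energy": [50, 60, 45, 50],
})

# Same melt call as in the cell above.
tidy_df = pd.melt(df,
                  id_vars=["household", "dorm"],
                  value_vars=["phone_energy", "laptop_energy"],
                  var_name="appliance",
                  value_name="energy_kWh")

# Wide format: one row per household. Long format: one row per observation.
n_wide_rows = len(df)
n_long_rows = len(tidy_df)
print(n_wide_rows, n_long_rows)

# Household A now occupies two rows, one per appliance.
rows_for_a = tidy_df[tidy_df["household"] == "A"]
print(rows_for_a)
```

The number of long-format rows equals the number of wide-format rows multiplied by the number of columns listed in `value_vars`.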


Now it is straightforward for the computer to find the observations matching certain characteristics.
We can use filtering to isolate the observations we are interested in.
We can also use a powerful related concept called group by to split the data into groups and analyze each group.
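
The filtering and group by ideas can be sketched as follows, under the assumption that summing `energy_kWh` is a reasonable way to compare dorms and appliances. The table is rebuilt here so the example runs on its own, and the result names (`tuscany_laptops`, `energy_by_dorm`, and so on) are illustrative choices rather than names from the original notebook.

```python
import pandas as pd

# Rebuild the long-format table from the example above.
df = pd.DataFrame({
    "household": ["A", "B", "C", "D"],
    "dorm": ["tuscany", "sauv", "tuscany", "sauv"],
    "phone_energy": [10, 30, 12, 20],
    "laptop_energy": [50, 60, 45, 50],
})
tidy_df = pd.melt(df,
                  id_vars=["household", "dorm"],
                  value_vars=["phone_energy", "laptop_energy"],
                  var_name="appliance",
                  value_name="energy_kWh")

# Filtering: keep only the laptop observations from the tuscany dorm.
is_tuscany = tidy_df["dorm"] == "tuscany"
is_laptop = tidy_df["appliance"] == "laptop_energy"
tuscany_laptops = tidy_df[is_tuscany & is_laptop]

# Group by: total energy per dorm (which dorm uses more energy?).
energy_by_dorm = tidy_df.groupby("dorm")["energy_kWh"].sum()

# Group by: total energy per appliance (which appliance uses more energy?).
energy_by_appliance = tidy_df.groupby("appliance")["energy_kWh"].sum()

# Group by two columns: total energy for each dorm and appliance pair.
energy_by_dorm_and_appliance = (
    tidy_df.groupby(["dorm", "appliance"])["energy_kWh"].sum()
)

print(tuscany_laptops)
print(energy_by_dorm)
print(energy_by_appliance)
print(energy_by_dorm_and_appliance)
```

Each of the three questions posed earlier maps onto one `groupby` call, which is the main payoff of the long format: the grouping columns are ordinary columns rather than being encoded in the column names.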