# Objectives
- Define 'tidy' data
- Describe the differences between `melt` and `pivot`
- Use `melt` and `pivot` to reshape a data set into a tidy format
***
## Tidy Data
* each variable is a column
* each row is an observation
* each type of observational unit is a table

## Melt
- Why do we use `melt`?
    - To bring our data from wide form to long form and make it tidy
    - To make each row a single observation
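As a minimal sketch of the wide-to-long idea, the example below melts a tiny made-up table. The frame `wide` and its column names are hypothetical and are not part of the workshop data.

```python
import pandas as pd

# Hypothetical wide table: one row per city, one column per year.
wide = pd.DataFrame({
    "city": ["Denver", "Austin"],
    "2020": [10, 20],
    "2021": [11, 21],
})

# Keep `city` as an identifier and stack the year columns into rows,
# so that each row holds a single (city, year) observation.
long = wide.melt(id_vars="city", var_name="year", value_name="value")
print(long)
```

Each of the two year columns contributes one row per city, so the long table has four rows.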

## Pivot
- Why do we use `pivot`?
    - To make each variable a column
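A small sketch of the opposite direction, again on made-up data: the hypothetical `measure` column holds variable names, and `pivot` spreads them into their own columns.

```python
import pandas as pd

# Hypothetical long table: the `measure` column mixes two variables.
long = pd.DataFrame({
    "city": ["Denver", "Denver", "Austin", "Austin"],
    "measure": ["temp", "rain", "temp", "rain"],
    "value": [70.0, 1.5, 90.0, 2.0],
})

# Each distinct value of `measure` becomes a column; `city` becomes the index.
wide = long.pivot(index="city", columns="measure", values="value")
print(wide)
```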

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Reshaping - Melt
## Read and Manipulate

Read the following URL using the pandas `read_csv` function and assign the result to the variable `house`. You can also click the link and download the CSV if you would like to. Inspect the first five rows.

https://files.zillowstatic.com/research/public_v2/zhvi/Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_mon.csv
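The sketch below shows the read-and-inspect step on a tiny inline stand-in so that it runs without network access. The sample rows and their values are made up, and the column layout is an assumption about the real file; with network access, the URL string can be passed to `read_csv` in place of the buffer.

```python
import io
import pandas as pd

# Made-up stand-in for the Zillow file (assumed layout: id columns, then one
# column per month). The values are illustrative only.
sample_csv = io.StringIO(
    'RegionID,SizeRank,RegionName,RegionType,StateName,2000-01-31,2000-02-29\n'
    '102001,0,United States,country,,100000.0,100500.0\n'
    '394530,21,"Denver, CO",msa,CO,,180500.0\n'
)

# read_csv accepts a file-like object, a local path, or a URL string.
house = pd.read_csv(sample_csv)

# head() returns the first five rows (here there are only two).
print(house.head())
```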


It looks like we have some unnecessary columns: `RegionID`, `SizeRank`, and `RegionType`. Let's `drop` them from our DataFrame.
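A minimal sketch of the drop step, using a small made-up frame that only mimics the column names mentioned above:

```python
import pandas as pd

# Made-up rows that mimic the assumed layout of the real file.
house = pd.DataFrame({
    "RegionID": [102001, 394530],
    "SizeRank": [0, 21],
    "RegionName": ["United States", "Denver, CO"],
    "RegionType": ["country", "msa"],
    "StateName": [None, "CO"],
    "2000-01-31": [100000.0, None],
})

# drop returns a new DataFrame, so assign the result back to `house`.
house = house.drop(columns=["RegionID", "SizeRank", "RegionType"])
print(house.columns.tolist())
```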

Let's check out all of our columns using `info` and see if we have any null values to deal with. We have too many columns for pandas to display by default, so we must pass `True` to the `verbose` parameter and `True` to the `show_counts` parameter (named `null_counts` in pandas versions before 1.2).
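A short sketch of the `info` call on a made-up frame, assuming a pandas version that provides the `show_counts` parameter (1.2 or later):

```python
import pandas as pd

# Made-up frame with a few missing values.
house = pd.DataFrame({
    "RegionName": ["United States", "Denver, CO"],
    "StateName": [None, "CO"],
    "2000-01-31": [100000.0, None],
    "2000-02-29": [100500.0, 180500.0],
})

# verbose=True lists every column even when there are many of them;
# show_counts=True adds the non-null count for each column.
house.info(verbose=True, show_counts=True)
```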

Let's check out which `StateName` `isnull`.

It looks like the United States is missing a state name. We don't want to drop this data point, so let's fill the `StateName` with `'US'`.
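The two steps above, finding the rows with a missing `StateName` and then filling them, might look like this on a made-up frame:

```python
import pandas as pd

# Made-up frame in which the national row has no state name.
house = pd.DataFrame({
    "RegionName": ["United States", "Denver, CO"],
    "StateName": [None, "CO"],
    "2000-01-31": [100000.0, None],
})

# Boolean mask selects the rows where StateName is null.
missing = house[house["StateName"].isnull()]
print(missing)

# Replace the missing state name instead of dropping the row.
house["StateName"] = house["StateName"].fillna("US")
```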

In [None]:
# Print info


Let's find out why we have null values in a lot of our early data columns.

Since we will just be plotting, we can leave these as null. 
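One way to explore the missing values is to count nulls per date column and look at which regions are missing the earliest month. The sketch below does this on made-up data; it only shows the pattern and makes no claim about the cause in the real file.

```python
import pandas as pd

# Made-up frame in which earlier months have more missing prices.
house = pd.DataFrame({
    "RegionName": ["United States", "Denver, CO", "Boulder, CO"],
    "StateName": ["US", "CO", "CO"],
    "2000-01-31": [100000.0, None, None],
    "2000-02-29": [100500.0, 180500.0, None],
    "2000-03-31": [101000.0, 181000.0, 250000.0],
})

# Every column except the two identifier columns is a date column.
date_columns = house.columns.drop(["RegionName", "StateName"])

# Number of missing prices in each month.
nulls_per_date = house[date_columns].isnull().sum()
print(nulls_per_date)

# Regions that have no price in the first month.
rows_missing_first_month = house[house[date_columns[0]].isnull()]
print(rows_missing_first_month["RegionName"].tolist())
```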

## Melt
Melt `house` after resetting its index. Keep the columns `RegionName` and `StateName` as identifiers (parameter: `id_vars`), name the new column that holds the former column labels `Date` (parameter: `var_name`), and name the new column that holds the values from the date columns `Price` (parameter: `value_name`). Assign the melted DataFrame to `house_tidy`.
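A sketch of this melt on a made-up frame follows. The toy frame still has its default integer index, so no `reset_index` call is needed here; if the identifier columns had been moved into the index, `reset_index()` would bring them back as columns first.

```python
import pandas as pd

# Made-up wide frame: two identifier columns plus one column per month.
house = pd.DataFrame({
    "RegionName": ["United States", "Denver, CO"],
    "StateName": ["US", "CO"],
    "2000-01-31": [100000.0, None],
    "2000-02-29": [100500.0, 180500.0],
})

# The identifier columns are repeated on every row; the former column labels
# go into `Date` and the cell values go into `Price`.
house_tidy = house.melt(
    id_vars=["RegionName", "StateName"],
    var_name="Date",
    value_name="Price",
)
print(house_tidy)
```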

Let's plot the house prices in Colorado over time.
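A possible plotting sketch, using plain matplotlib on made-up tidy data. The region names and prices are hypothetical, and the `Agg` backend line is only there so the sketch runs as a script; it can be dropped inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up tidy data in the shape produced by the melt step.
house_tidy = pd.DataFrame({
    "RegionName": ["United States", "Denver, CO", "Boulder, CO"] * 2,
    "StateName": ["US", "CO", "CO"] * 2,
    "Date": ["2000-01-31"] * 3 + ["2000-02-29"] * 3,
    "Price": [100000.0, 180000.0, 250000.0, 100500.0, 180500.0, 251000.0],
})

# Keep only Colorado rows and parse the date strings for a proper time axis.
colorado = house_tidy[house_tidy["StateName"] == "CO"].copy()
colorado["Date"] = pd.to_datetime(colorado["Date"])

# Draw one line per region.
fig, ax = plt.subplots()
for name, group in colorado.groupby("RegionName"):
    ax.plot(group["Date"], group["Price"], label=name)
ax.set_xlabel("Date")
ax.set_ylabel("Price")
ax.legend()
```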

# 2. Reshaping - Pivot & Pivot Table
***
## Pivot
- When should we use `pivot`?
    - to make each variable a column
    - when there is **at most** one observation for each unique combination of 2 columns

## Pivot Table
- When should we use `pivot_table`?
    - to make each variable a column
    - when there is more than one observation for each unique combination of 2 columns
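The difference can be seen on a small made-up table in which one student/subject combination appears twice. `pivot` refuses to reshape it, while `pivot_table` aggregates the duplicates (the mean is chosen here for illustration).

```python
import pandas as pd

# Hypothetical scores: Ann has two math scores, so (student, subject) repeats.
scores = pd.DataFrame({
    "student": ["Ann", "Ann", "Bob"],
    "subject": ["math", "math", "math"],
    "score": [80, 90, 70],
})

# pivot raises a ValueError when an index/columns pair is duplicated.
try:
    scores.pivot(index="student", columns="subject", values="score")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)

# pivot_table combines the duplicates with an aggregation function.
table = scores.pivot_table(
    index="student", columns="subject", values="score", aggfunc="mean"
)
print(table)
```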

## Read

In [13]:
url = 'http://bit.ly/2cLzoxH'
df = pd.read_csv(url).drop(['pop', 'gdpPercap'], axis=1)
df

Unnamed: 0,country,year,continent,lifeExp
0,Afghanistan,1952,Asia,28.801
1,Afghanistan,1957,Asia,30.332
2,Afghanistan,1962,Asia,31.997
3,Afghanistan,1967,Asia,34.020
4,Afghanistan,1972,Asia,36.088
...,...,...,...,...
1699,Zimbabwe,1987,Africa,62.351
1700,Zimbabwe,1992,Africa,60.377
1701,Zimbabwe,1997,Africa,46.809
1702,Zimbabwe,2002,Africa,39.989


Let's pause and think about what questions might be asked and whether we should use `pivot` or `pivot_table`. First let's find all of the combinations of our columns (excluding `lifeExp` since that is the data we want to plot).

#### Unique Combinations
- country-year
- country-continent
- year-continent

#### One Observation per Combination
- country-year

#### More than One Observation per Combination
- country-continent (multiple years)
- year-continent (multiple countries)
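The classification above can be checked by counting rows per combination. The sketch below uses a made-up subset with the same column names; apart from the two Afghanistan values, the life expectancy numbers are illustrative only.

```python
import pandas as pd

# Small stand-in with the same columns as `df`.
df = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Japan", "Japan", "Zimbabwe", "Zimbabwe"],
    "year": [1952, 1957, 1952, 1957, 1952, 1957],
    "continent": ["Asia", "Asia", "Asia", "Asia", "Africa", "Africa"],
    "lifeExp": [28.801, 30.332, 63.0, 65.0, 48.0, 50.0],
})

# For each pair of columns, find the largest number of rows sharing one combination.
pairs = [("country", "year"), ("country", "continent"), ("year", "continent")]
max_rows = {pair: int(df.groupby(list(pair)).size().max()) for pair in pairs}
print(max_rows)
```

A maximum of 1 means `pivot` is safe for that pair; anything larger calls for `pivot_table`.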

## Pivot
We are going to try to answer a question about the **trend of life expectancy in each country over time**.
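Because each country-year pair has exactly one observation, `pivot` is sufficient for this question. A sketch on a made-up subset follows; the Zimbabwe values are illustrative, and `life_by_country` is a hypothetical name.

```python
import pandas as pd

# Small stand-in for `df`.
df = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Zimbabwe", "Zimbabwe"],
    "year": [1952, 1957, 1952, 1957],
    "continent": ["Asia", "Asia", "Africa", "Africa"],
    "lifeExp": [28.801, 30.332, 48.0, 50.0],
})

# One row per year and one column per country.
life_by_country = df.pivot(index="year", columns="country", values="lifeExp")
print(life_by_country)
```

Calling `.plot()` on a frame shaped like this draws one line per country, with the year on the x-axis.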

In [None]:
# Use pivot on df


In [1]:
# Plot the data


In [2]:
# Create lineplot


## Pivot Table
We are going to try to answer a question about the **average life expectancy in each continent**. Right now our data is in long format, which is semi-tidy. However, we want each column to be its own variable. To answer our question, we have 6 variables:
- Year
- Africa Life Expectancy
- Americas Life Expectancy
- Asia Life Expectancy
- Europe Life Expectancy
- Oceania Life Expectancy

In order to answer our question, we need to make our DataFrame tidy. Let's pivot our data using `year` as the `index`, `continent` for the `columns`, and `lifeExp` for the `values`.
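A sketch of this pivot table on made-up data is shown below. Since several countries share each year/continent combination, `pivot_table` is used, and its default aggregation (the mean) produces the average per continent.

```python
import pandas as pd

# Small stand-in for `df`; the life expectancy values are illustrative only.
df = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year": [1952, 1952, 1952, 1957, 1957, 1957],
    "continent": ["Asia", "Asia", "Africa", "Asia", "Asia", "Africa"],
    "lifeExp": [30.0, 40.0, 38.0, 32.0, 42.0, 40.0],
})

# Rows are years, columns are continents, cells hold the mean lifeExp.
table = df.pivot_table(index="year", columns="continent", values="lifeExp")
print(table)
```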

Now we can easily plot using pandas!

Using a pivot table like this is equivalent to using `groupby()` with the same aggregation; only the layout of the result differs. Let's group by `continent` and `year` to find the `mean` of each continent/year combination. Use `reset_index` so we don't have to deal with a multi-level index.
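The relationship between the two approaches can be sketched on made-up data: the grouped result is in long form, and pivoting it gives the same numbers as the pivot table.

```python
import pandas as pd

# Small stand-in for `df`; the life expectancy values are illustrative only.
df = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year": [1952, 1952, 1952, 1957, 1957, 1957],
    "continent": ["Asia", "Asia", "Africa", "Asia", "Asia", "Africa"],
    "lifeExp": [30.0, 60.0, 38.0, 32.0, 62.0, 40.0],
})

# Long-form means: one row per continent/year combination.
grouped = df.groupby(["continent", "year"])["lifeExp"].mean().reset_index()
print(grouped)

# Wide-form means from pivot_table, and the grouped result reshaped to match.
table = df.pivot_table(index="year", columns="continent", values="lifeExp")
grouped_wide = grouped.pivot(index="year", columns="continent", values="lifeExp")
```

The long form in `grouped` is the shape that plotting functions taking `x`, `y`, and `hue` column names usually expect.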

In [3]:
# Create barplot
