
_Lambda School Data Science_

# Reshape data

Objectives
-  understand tidy data formatting
-  melt and pivot data with pandas

Links
- [Tidy Data](https://en.wikipedia.org/wiki/Tidy_data)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)
  - Tidy Data
  - Reshaping Data
- Python Data Science Handbook
  - [Chapter 3.8](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html), Aggregation and Grouping
  - [Chapter 3.9](https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html), Pivot Tables
  
Reference
- pandas documentation: [Reshaping and Pivot Tables](https://pandas.pydata.org/pandas-docs/stable/reshaping.html)
- Modern Pandas, Part 5: [Tidy Data](https://tomaugspurger.github.io/modern-5-tidy.html)

## Upgrade Seaborn

Run the cell below, which upgrades Seaborn and then automatically restarts your Google Colab runtime.

In [0]:
!pip install seaborn --upgrade
import os
os.kill(os.getpid(), 9)

Requirement already up-to-date: seaborn in /usr/local/lib/python3.6/dist-packages (0.9.0)


## Hadley Wickham's Examples

From his paper, [Tidy Data](http://vita.had.co.nz/papers/tidy-data.html)

In [0]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns

table1 = pd.DataFrame(
    [[np.nan, 2],
     [16,    11], 
     [3,      1]],
    index=['John Smith', 'Jane Doe', 'Mary Johnson'], 
    columns=['treatmenta', 'treatmentb'])

table2 = table1.T

"Table 1 provides some data about an imaginary experiment in a format commonly seen in the wild. 

The table has two columns and three rows, and both rows and columns are labelled."

In [0]:
table1

"There are many ways to structure the same underlying data. 

Table 2 shows the same data as Table 1, but the rows and columns have been transposed. The data is the same, but the layout is different."

In [0]:
table2

"Table 3 reorganises Table 1 to make the values, variables and observations more clear.

Table 3 is the tidy version of Table 1. Each row represents an observation, the result of one treatment on one person, and each column is a variable."

| name         | trt | result |
|--------------|-----|--------|
| John Smith   | a   | -      |
| Jane Doe     | a   | 16     |
| Mary Johnson | a   | 3      |
| John Smith   | b   | 2      |
| Jane Doe     | b   | 11     |
| Mary Johnson | b   | 1      |

## Table 1 --> Tidy

We can use the pandas `melt` function to reshape Table 1 into Tidy format.

In [0]:
tidy = table1.reset_index().melt(id_vars='index')
tidy = tidy.rename(columns={'index':'name','variable':'trt','value':'result'})
tidy.trt = tidy.trt.str.replace('treatment','')
tidy
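As a hedged alternative to the rename step above, `melt` can label its output columns directly through its `var_name` and `value_name` parameters. The sketch below rebuilds a copy of Table 1 so it runs on its own; the names `wide` and `tidy_sketch` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Stand-alone copy of Table 1 (same values as in the cell above).
wide = pd.DataFrame(
    [[np.nan, 2], [16, 11], [3, 1]],
    index=['John Smith', 'Jane Doe', 'Mary Johnson'],
    columns=['treatmenta', 'treatmentb'])

# Name the index so reset_index() produces a 'name' column,
# then let melt name the variable and value columns itself.
tidy_sketch = (wide.rename_axis('name')
                   .reset_index()
                   .melt(id_vars='name', var_name='trt', value_name='result'))

# Strip the shared 'treatment' prefix, leaving 'a' and 'b'.
tidy_sketch['trt'] = tidy_sketch['trt'].str.replace('treatment', '', regex=False)
print(tidy_sketch)
```

This should give the same six-row layout as Table 3, without a separate `rename` call.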

In [0]:
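# Encode the treatment as 0/1 with replace. The result is only displayed, not
# assigned, so `trt` keeps its 'a'/'b' labels for the cells that follow.
tidy.trt.replace({'a': 0, 'b': 1})

In [0]:
# The same encoding with map; labels missing from the dict would become NaN.
tidy.trt.map({'a':0,'b':1})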

In [0]:
tidy.head()
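The two encodings above, `replace` and `map`, agree only while every label appears in the mapping. A minimal sketch of where they differ, using a made-up Series with an extra label `'c'` that is not in the notebook's data:

```python
import pandas as pd

# 'c' is a hypothetical label with no entry in the mapping.
trt = pd.Series(['a', 'b', 'c'], dtype=object)
codes = {'a': 0, 'b': 1}

mapped = trt.map(codes)        # the unmapped 'c' becomes NaN
replaced = trt.replace(codes)  # the unmapped 'c' is left unchanged

print(mapped)
print(replaced)
```

Which behavior is preferable depends on whether an unexpected label should surface as a missing value or pass through untouched.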

## Table 2 --> Tidy

In [0]:
tidy2 = table2.reset_index().melt(id_vars='index')
tidy2 = tidy2.rename(columns={'index':'trt','variable':'name','value':'result'})
tidy2.trt = tidy2.trt.str.replace('treatment','')
tidy2

## Tidy --> Table 1

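The `pivot_table` function reverses `melt`: it spreads the values of one column back out into column headers. Because `pivot_table` aggregates (taking the mean by default), the round trip is exact only when each `name`/`trt` pair appears once, as it does here.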

In [0]:
tidy.pivot_table(index='name',columns='trt', values='result')
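One caveat with reversing `melt` this way is that `pivot_table` aggregates repeated index/column pairs, while `DataFrame.pivot` does not. The sketch below uses made-up data with a deliberately duplicated pair; `long_df`, `wide_mean`, and `pivot_failed` are illustrative names.

```python
import pandas as pd

# Made-up long-form data: ('Jane Doe', 'a') appears twice.
long_df = pd.DataFrame({
    'name':   ['Jane Doe', 'Jane Doe', 'Jane Doe'],
    'trt':    ['a', 'a', 'b'],
    'result': [16, 18, 11],
})

# pivot_table averages the two 'a' results (its default aggfunc is the mean).
wide_mean = long_df.pivot_table(index='name', columns='trt', values='result')
print(wide_mean)

# pivot performs no aggregation, so the duplicated pair raises ValueError.
try:
    long_df.pivot(index='name', columns='trt', values='result')
    pivot_failed = False
except ValueError:
    pivot_failed = True
print(pivot_failed)
```

With the notebook's `tidy` table every pair is unique, so either function would likely reproduce the wide layout.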

## Tidy --> Table 2

In [0]:
# Table 2 has treatments as rows and names as columns.
tidy2.pivot_table(index='trt',columns='name', values='result')

## Seaborn uses tidy data

> "[Seaborn will be most powerful when your datasets have a particular organization.](https://seaborn.pydata.org/introduction.html#organizing-datasets) This format is alternately called “long-form” or “tidy” data and is described in detail by Hadley Wickham. The rules can be simply stated:

> - Each variable is a column
> - Each observation is a row

> A helpful mindset for determining whether your data are tidy is to think backwards from the plot you want to draw. From this perspective, a “variable” is something that will be assigned a role in the plot."

In [0]:
import seaborn as sns
sns.catplot(x='trt',y='result',col='name',kind='bar',data=tidy,height=2);

## Load Instacart data

Let's return to the dataset of [3 Million Instacart Orders](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2)

If necessary, run the cells below to re-download and extract the data.

In [0]:
!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
!tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

Run these cells to load the data

In [0]:
%cd instacart_2017_05_01

In [0]:
products = pd.read_csv('products.csv')

order_products = pd.concat([pd.read_csv('order_products__prior.csv'), 
                            pd.read_csv('order_products__train.csv')])

orders = pd.read_csv('orders.csv')

## Goal: Reproduce part of this example

Instead of a plot with 50 products, we'll plot just two, the first product from each list:
- Half And Half Ultra Pasteurized
- Half Baked Frozen Yogurt

In [0]:
from IPython.display import display, Image
url = 'https://cdn-images-1.medium.com/max/1600/1*wKfV6OV-_1Ipwrl7AjjSuw.png'
example = Image(url=url, width=600)

display(example)

So, given a `product_name` we need to calculate its `order_hour_of_day` pattern.

## Subset and Merge

In [0]:
products.columns.tolist()

In [0]:
orders.columns.tolist()

In [0]:
order_products.columns.tolist()

In [0]:
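# Keep only the columns needed to link product names to order hours.
a=products[['product_id','product_name']]
b=order_products[['order_id','product_id']]
c=orders[['order_id','order_hour_of_day']]

# pd.merge joins on the shared column by default; `on` makes the key explicit.
merged1=pd.merge(a,b,on='product_id')
merged2=pd.merge(merged1,c,on='order_id')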
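The merges above chain two inner joins: products to order lines on `product_id`, then order lines to orders on `order_id`. A small sketch with made-up tables (the `*_toy` names and their values are invented) shows the resulting shape:

```python
import pandas as pd

# Tiny stand-ins for the three Instacart tables.
products_toy = pd.DataFrame({'product_id': [1, 2],
                             'product_name': ['Cream', 'Froyo']})
order_products_toy = pd.DataFrame({'order_id': [10, 10, 11],
                                   'product_id': [1, 2, 1]})
orders_toy = pd.DataFrame({'order_id': [10, 11],
                           'order_hour_of_day': [9, 20]})

# One output row per order line, carrying the product name and the order hour.
merged_toy = (products_toy
              .merge(order_products_toy, on='product_id', how='inner')
              .merge(orders_toy, on='order_id', how='inner'))
print(merged_toy)
```

The result has as many rows as the order-lines table, which is why `merged2` is so much larger than `products` or `orders`.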

In [0]:
merged2.shape

In [0]:
merged2.head()

In [0]:
product_names = ['Half Baked Frozen Yogurt', 'Half And Half Ultra Pasteurized']
# isin is equivalent to chaining equality tests with |:
# condition = (merged2.product_name=='Half Baked Frozen Yogurt') | (merged2.product_name=='Half And Half Ultra Pasteurized')
condition = merged2.product_name.isin(product_names)
subset=merged2[condition]

In [0]:
subset.sample(n=10)

In [0]:
subset.product_name.value_counts()

## 4 ways to reshape and plot

### 1. value_counts

In [0]:
cream=subset[subset.product_name=='Half And Half Ultra Pasteurized']
froyo=subset[subset.product_name=='Half Baked Frozen Yogurt']
cream.order_hour_of_day.value_counts().sort_index().plot();
froyo.order_hour_of_day.value_counts().sort_index().plot();

In [0]:
cream.order_hour_of_day.value_counts(normalize=True).sort_index()
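To make the `value_counts` approach concrete, here is a minimal sketch on a made-up Series of order hours. It also shows one possible way, not used in the notebook, to fill hours that never occur with zero via `reindex`.

```python
import pandas as pd

# Made-up order hours for a single product.
hours = pd.Series([9, 9, 10, 15], name='order_hour_of_day')

counts = hours.value_counts().sort_index()                # raw counts per hour
shares = hours.value_counts(normalize=True).sort_index()  # fractions that sum to 1

# Hours with no orders are absent from value_counts; reindex adds them as 0.
full_day = counts.reindex(range(24), fill_value=0)

print(shares)
print(full_day)
```

Normalizing matters when comparing two products with very different order volumes, since it puts both curves on the same scale.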

### 2. crosstab

In [0]:
# normalize converts counts to proportions:
#   'all' or True normalizes over all values
#   'index' normalizes over each row
#   'columns' normalizes over each column
pd.crosstab(subset.order_hour_of_day,subset.product_name,normalize='columns').plot();
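The effect of the `normalize` argument is easier to see on a small table. This sketch uses invented rows (the frame `toy` and its values are made up) and compares column-wise with row-wise normalization:

```python
import pandas as pd

# Invented order lines: hour of day and product for each line.
toy = pd.DataFrame({
    'order_hour_of_day': [9, 9, 10, 20, 20, 20],
    'product_name': ['Cream', 'Cream', 'Cream', 'Froyo', 'Froyo', 'Cream'],
})

# Each product's column sums to 1: the share of that product's orders per hour.
by_column = pd.crosstab(toy.order_hour_of_day, toy.product_name, normalize='columns')

# Each hour's row sums to 1: the product mix within that hour.
by_row = pd.crosstab(toy.order_hour_of_day, toy.product_name, normalize='index')

print(by_column)
print(by_row)
```

For the hour-of-day pattern of each product, `normalize='columns'` is the variant that matches the goal stated earlier.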

### 3. pivot_table

In [0]:
subset.pivot_table(index='order_hour_of_day',columns='product_name',values='order_id',aggfunc=len).plot();
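Note that the `pivot_table` call above counts rows with `aggfunc=len`, so it plots raw counts rather than the column-normalized shares from the crosstab cell. A sketch on invented data (the frame `toy` is made up) suggests the counts match an un-normalized `crosstab` once empty cells are filled with zero:

```python
import pandas as pd

# Invented order lines with an order_id to count.
toy = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5, 6],
    'order_hour_of_day': [9, 9, 10, 20, 20, 20],
    'product_name': ['Cream', 'Cream', 'Cream', 'Froyo', 'Froyo', 'Cream'],
})

# Count rows per hour/product; fill_value=0 replaces the NaN left by empty cells.
counts_pivot = toy.pivot_table(index='order_hour_of_day', columns='product_name',
                               values='order_id', aggfunc=len, fill_value=0)

# crosstab without normalize produces the same counts.
counts_xtab = pd.crosstab(toy.order_hour_of_day, toy.product_name)

print(counts_pivot)
print(counts_xtab)
```

To get proportions from the pivoted counts, one option is to divide each column by its sum, for example `counts_pivot / counts_pivot.sum()`.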

### 4. melt

In [0]:
table=pd.crosstab(subset.order_hour_of_day,subset.product_name,normalize='columns')

In [0]:
melted=(table
        .reset_index()
        .melt(id_vars='order_hour_of_day')
        .rename(columns={'order_hour_of_day':'Hour of Day Ordered',
                         'product_name':'Product',
                         'value':'Percent of Orders by Product'}))

In [0]:
sns.relplot(x='Hour of Day Ordered',y='Percent of Orders by Product',hue='Product',data=melted,kind='line');

In [0]:
#assignment

flights = sns.load_dataset('flights')
flights.head()

In [0]:
flight_table = flights.pivot_table(index='year',columns='month', values='passengers')

In [0]:
ax = sns.heatmap(flight_table)

In [0]:
#stretch

aisles = pd.read_csv('aisles.csv')
departments = pd.read_csv('departments.csv')
aisles.columns, orders.columns, order_products.columns, products.columns, departments.columns

In [0]:
df = pd.merge(order_products[['order_id','product_id','reordered']],products[['product_id','product_name','aisle_id','department_id']])
df=pd.merge(df,aisles[['aisle_id','aisle']])
df=pd.merge(df,departments[['department_id','department']])

In [0]:
df.head()

In [0]:
# Count order lines and reorders per aisle.
orders_and_reorders = df.groupby('aisle').order_id.count().to_frame('orders')
orders_and_reorders['reorders'] = df.groupby('aisle').reordered.sum()
# Stored as a fraction between 0 and 1, despite the column name.
orders_and_reorders['percent reordered'] = orders_and_reorders['reorders']/orders_and_reorders['orders']

orders_and_reorders.head(200)
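The per-aisle summary above can also be written as a single `groupby(...).agg(...)` call. This is a sketch on invented rows, assuming a pandas version that supports named aggregation (0.25 or later); `toy` and `summary` are illustrative names.

```python
import pandas as pd

# Invented order lines: one row per product in an order.
toy = pd.DataFrame({
    'aisle': ['yogurt', 'yogurt', 'yogurt', 'cream'],
    'order_id': [1, 2, 3, 1],
    'reordered': [1, 0, 1, 1],
})

# Named aggregation: output column name = (input column, aggregation).
summary = toy.groupby('aisle').agg(
    orders=('order_id', 'count'),
    reorders=('reordered', 'sum'),
)

# Fraction of order lines that were reorders, between 0 and 1.
summary['percent reordered'] = summary['reorders'] / summary['orders']
print(summary)
```

Because `reordered` is a 0/1 flag, `toy.groupby('aisle').reordered.mean()` would give the same fraction directly.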

In [0]:
# De-duplicate aisle/department pairs first; df has one row per ordered product.
output=pd.merge(orders_and_reorders,df[['aisle','department']].drop_duplicates(),left_index=True,right_on='aisle')


In [0]:
output.head()