<a href="https://colab.research.google.com/github/dustiny5/DS-Unit-1-Sprint-2-Data-Wrangling/blob/master/module3-reshape-data/LS_DS_123_Reshape_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science_

# Reshape data

Objectives
-  understand tidy data formatting
-  melt and pivot data with pandas

Links
- [Tidy Data](https://en.wikipedia.org/wiki/Tidy_data)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)
  - Tidy Data
  - Reshaping Data
- Python Data Science Handbook
  - [Chapter 3.8](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html), Aggregation and Grouping
  - [Chapter 3.9](https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html), Pivot Tables
  
Reference
- pandas documentation: [Reshaping and Pivot Tables](https://pandas.pydata.org/pandas-docs/stable/reshaping.html)
- Modern Pandas, Part 5: [Tidy Data](https://tomaugspurger.github.io/modern-5-tidy.html)

## Why reshape data?

#### Some libraries prefer data in different formats

For example, the Seaborn data visualization library often (but not always) prefers data in "tidy" format.

> "[Seaborn will be most powerful when your datasets have a particular organization.](https://seaborn.pydata.org/introduction.html#organizing-datasets) This format is alternately called “long-form” or “tidy” data and is described in detail by Hadley Wickham. The rules can be simply stated:

> - Each variable is a column
> - Each observation is a row

> A helpful mindset for determining whether your data are tidy is to think backwards from the plot you want to draw. From this perspective, a “variable” is something that will be assigned a role in the plot."

#### Data science is often about putting square pegs in round holes

Here's an inspiring [video clip from _Apollo 13_](https://www.youtube.com/watch?v=ry55--J4_VQ): “Invent a way to put a square peg in a round hole.” It's a good metaphor for data wrangling!

## Upgrade Seaborn

Run the cell below, which upgrades Seaborn and then automatically restarts your Google Colab runtime.

In [0]:
!pip install seaborn --upgrade
import os
os.kill(os.getpid(), 9)


## Hadley Wickham's Examples

From his paper, [Tidy Data](http://vita.had.co.nz/papers/tidy-data.html)

In [0]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns

table1 = pd.DataFrame(
    [[np.nan, 2],
     [16,    11], 
     [3,      1]],
    index=['John Smith', 'Jane Doe', 'Mary Johnson'], 
    columns=['treatmenta', 'treatmentb'])

table2 = table1.T

"Table 1 provides some data about an imaginary experiment in a format commonly seen in the wild. 

The table has two columns and three rows, and both rows and columns are labelled."

In [0]:
table1

"There are many ways to structure the same underlying data. 

Table 2 shows the same data as Table 1, but the rows and columns have been transposed. The data is the same, but the layout is different."

In [0]:
table2

"Table 3 reorganises Table 1 to make the values, variables and observations more clear.

Table 3 is the tidy version of Table 1. Each row represents an observation, the result of one treatment on one person, and each column is a variable."

| name         | trt | result |
|--------------|-----|--------|
| John Smith   | a   | -      |
| Jane Doe     | a   | 16     |
| Mary Johnson | a   | 3      |
| John Smith   | b   | 2      |
| Jane Doe     | b   | 11     |
| Mary Johnson | b   | 1      |

## Table 1 --> Tidy

We can use the pandas `melt` function to reshape Table 1 into Tidy format.

In [0]:
table1.columns.tolist()

In [0]:
table1.index.tolist()

In [0]:
tidy = table1.reset_index().melt(id_vars='index')

tidy = tidy.rename(columns={'index':'name', 'variable':'trt', 'value':'result'})

tidy['trt'] = tidy['trt'].str.replace('treatment', '')
# Note: tidy['trt'].str.strip('tremn') is not a substitute here. strip removes the listed
# characters only from the two ends and stops at the first other character,
# so 'treatmenta' would become 'atmenta'.

# Four equivalent ways to encode trt as binary (a = 0, b = 1).
# None of them is assigned back, so tidy['trt'] keeps its letter codes.
tidy['trt'].replace('a', 0).replace('b', 1)  # replace also accepts a dictionary

tidy['trt'].map({'a': 0, 'b': 1})

(tidy['trt'] == 'b').astype(int)  # boolean to integer: True = 1, False = 0; astype converts a whole column, int() a single value

tidy['trt'].apply(lambda x: ord(x) - ord('a'))  # ord returns the Unicode code point of a character
tidy
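As a side note, `melt` accepts `var_name` and `value_name`, which can stand in for the separate `rename` step. The sketch below rebuilds Table 1 under a different name (`table1_copy`) so it runs on its own without touching the notebook's variables; `tidy_alt` is also a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild Table 1 under a separate name so this sketch is self-contained.
table1_copy = pd.DataFrame(
    [[np.nan, 2],
     [16,    11],
     [3,      1]],
    index=['John Smith', 'Jane Doe', 'Mary Johnson'],
    columns=['treatmenta', 'treatmentb'])

# Name the index first, then let melt name the two new columns directly.
tidy_alt = (table1_copy
            .rename_axis('name')
            .reset_index()
            .melt(id_vars='name', var_name='trt', value_name='result'))

# Keep only the treatment letter: 'treatmenta' -> 'a'.
tidy_alt['trt'] = tidy_alt['trt'].str.replace('treatment', '', regex=False)

print(tidy_alt)
```

The result should have the same three columns as `tidy` above, with one row per person and treatment.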

## Table 2 --> Tidy

In [0]:
#Table 2 dataframe
table2

In [0]:
# Reset the index so the treatment labels become a regular column.
# Melt with id_vars='index' so each row pairs one treatment with one name and its result.
# Rename the columns.
tidy2 = (table2
        .reset_index()
        .melt(id_vars='index')
        .rename(columns={'index': 'trt',
                     'variable': 'name',
                     'value': 'result'}))

#Use string method to replace 'treatment' with an empty string
tidy2['trt'] = tidy2['trt'].str.replace('treatment', '')
tidy2

## Tidy --> Table 1

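The `pivot_table` function is roughly the inverse of `melt`: it spreads a long table back out into wide form. Unlike `melt`, it aggregates values (taking the mean by default) and sorts the row and column labels, so the result can differ from the original table in ordering.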

In [0]:
table1

In [0]:
tidy

In [0]:
tidy.pivot_table(index='name', columns='trt', values='result')
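To get back to Table 1's exact layout, the pivoted result still needs its original row order and column names. A minimal sketch, using a hand-built copy of the tidy table (`tidy_copy`) so it runs on its own; `wide` and `restored` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the tidy table from above.
tidy_copy = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Mary Johnson'] * 2,
    'trt': ['a', 'a', 'a', 'b', 'b', 'b'],
    'result': [np.nan, 16, 3, 2, 11, 1],
})

wide = tidy_copy.pivot_table(index='name', columns='trt', values='result')

# Restore Table 1's row order and column names by hand,
# then clear the axis labels that pivot_table added.
restored = (wide
            .reindex(['John Smith', 'Jane Doe', 'Mary Johnson'])
            .rename(columns={'a': 'treatmenta', 'b': 'treatmentb'}))
restored.index.name = None
restored.columns.name = None

print(restored)
```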

## Tidy --> Table 2

In [0]:
#Tidy2 data from Table2
tidy2

In [0]:
table2

In [0]:
tidy2_pivot = tidy2.pivot_table(index='trt', columns='name', values='result')
tidy2_pivot

In [0]:
# Clear the columns' axis label ('name')
tidy2_pivot.columns.name = None

# Clear the index's axis label ('trt')
tidy2_pivot.index.name = None

tidy2_pivot

In [0]:
#Rename index
tidy2_pivot = tidy2_pivot.rename(index={'a':'treatmenta', 'b':'treatmentb'})

tidy2_pivot

In [0]:
#Original order of columns from table 2
column_order = table2.columns

# Reorder the columns with reindex, passing column_order along axis=1
tidy2_pivot = tidy2_pivot.reindex(column_order, axis=1)

tidy2_pivot

## Seaborn example
- Each variable is a column
- Each observation is a row

In [0]:
sns.catplot(x='trt', y='result', col='name', kind='bar', data=tidy, height=3);

## Load Instacart data

Let's return to the dataset of [3 Million Instacart Orders](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2)

If necessary, run the cells below to re-download and extract the data.

In [0]:
!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
!tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

Run these cells to load the data

In [0]:
# Change into the extracted directory so the CSV files can be read by bare filename
%cd instacart_2017_05_01

In [0]:
products = pd.read_csv('products.csv')

order_products = pd.concat([pd.read_csv('order_products__prior.csv'), 
                            pd.read_csv('order_products__train.csv')])

orders = pd.read_csv('orders.csv')

## Goal: Reproduce part of this example

Instead of a plot with 50 products, we'll plot just two: the first product from each list.
- Half And Half Ultra Pasteurized
- Half Baked Frozen Yogurt

In [0]:
from IPython.display import display, Image
url = 'https://cdn-images-1.medium.com/max/1600/1*wKfV6OV-_1Ipwrl7AjjSuw.png'
example = Image(url=url, width=600)

display(example)

So, given a `product_name` we need to calculate its `order_hour_of_day` pattern.
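One way to frame that calculation, sketched on a tiny made-up table rather than the real Instacart data: filter to the product, then take the share of its orders that fall in each hour. The helper `hourly_share`, the short product names, and the toy rows are all assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the merged Instacart table (values are made up).
toy = pd.DataFrame({
    'product_name': ['Froyo', 'Froyo', 'Froyo', 'Froyo', 'Cream', 'Cream'],
    'order_hour_of_day': [20, 20, 21, 9, 9, 10],
})

def hourly_share(df, product_name):
    """Return the fraction of a product's orders placed in each hour."""
    hours = df.loc[df['product_name'] == product_name, 'order_hour_of_day']
    return hours.value_counts(normalize=True).sort_index()

toy_froyo_share = hourly_share(toy, 'Froyo')
print(toy_froyo_share)
```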

## Subset and Merge

In [0]:
product_names = ['Half Baked Frozen Yogurt', 'Half And Half Ultra Pasteurized']

In [0]:
products.columns.tolist()

In [0]:
orders.columns.tolist()

In [0]:
order_products.columns.tolist()

In [0]:
merged = (products[['product_id', 'product_name']]
          .merge(order_products[['order_id', 'product_id']])
          .merge(orders[['order_id', 'order_hour_of_day']]))

merged

In [0]:
products.shape, order_products.shape, orders.shape, merged.shape

In [0]:
merged.head()
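Called without extra arguments, `merge` performs an inner join on the columns the two frames have in common. The sketch below uses made-up rows with `toy_` names so it does not overwrite the real tables. Spelling out `on`, `how`, and `validate` is optional and is a choice made here, but it makes the join keys and the expected one-to-many relationships explicit.

```python
import pandas as pd

# Made-up miniature versions of the three tables.
toy_products = pd.DataFrame({'product_id': [1, 2],
                             'product_name': ['Froyo', 'Cream']})
toy_order_products = pd.DataFrame({'order_id': [10, 10, 11],
                                   'product_id': [1, 2, 1]})
toy_orders = pd.DataFrame({'order_id': [10, 11, 12],
                           'order_hour_of_day': [9, 20, 14]})

toy_merged = (toy_products
              # each product can appear in many order lines
              .merge(toy_order_products, on='product_id', how='inner',
                     validate='one_to_many')
              # many order lines belong to one order
              .merge(toy_orders, on='order_id', how='inner',
                     validate='many_to_one'))

print(toy_merged)
```

Order 12 has no matching order lines, so the inner join drops it.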

In [0]:
# Option 1: combine two equality tests with |
condition = ((merged['product_name'] == 'Half Baked Frozen Yogurt') |
             (merged['product_name'] == 'Half And Half Ultra Pasteurized'))

# Option 2 (equivalent, and the one used below): test membership with isin
product_names = ['Half Baked Frozen Yogurt', 'Half And Half Ultra Pasteurized']

condition = merged['product_name'].isin(product_names)

subset = merged[condition]

In [0]:
merged.shape, subset.shape

In [0]:
subset.sample(5) #Give a random sample of 5

## 4 ways to reshape and plot

### 1. value_counts

In [0]:
froyo = subset[ subset['product_name'] == 'Half Baked Frozen Yogurt' ]
cream = subset[ subset['product_name'] == 'Half And Half Ultra Pasteurized' ]

In [0]:
# normalize=True turns the counts into fractions of each product's orders
(froyo['order_hour_of_day']
 .value_counts(normalize=True)
 .sort_index()
 .plot())

(cream['order_hour_of_day']
 .value_counts(normalize=True)
 .sort_index()
 .plot())

### 2. crosstab

In [0]:
# The first argument becomes the index and the second the columns; normalize='columns' scales each column to sum to 1
(pd.crosstab(subset['order_hour_of_day'],
           subset['product_name'],
           normalize='columns')* 100).plot()
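To see what `normalize='columns'` does, here is a minimal sketch on made-up rows (the short product names and `toy_` variables are illustrative). Each column is divided by its own total, so every product's column sums to 1 before the multiplication by 100.

```python
import pandas as pd

# Made-up orders: four for 'Froyo', two for 'Cream'.
toy = pd.DataFrame({
    'product_name': ['Froyo', 'Froyo', 'Froyo', 'Froyo', 'Cream', 'Cream'],
    'order_hour_of_day': [20, 20, 21, 9, 9, 10],
})

# Rows are hours, columns are products; each column is scaled to sum to 1.
toy_ct = pd.crosstab(toy['order_hour_of_day'],
                     toy['product_name'],
                     normalize='columns')

print(toy_ct * 100)
```

Hours in which a product has no orders show up as 0 rather than as missing values.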

### 3. pivot_table

In [0]:
# pivot_table aggregates with the mean by default; aggfunc=len counts rows instead, giving raw order counts
subset.pivot_table(index='order_hour_of_day',
                  columns='product_name',
                  values='order_id',
                  aggfunc=len).plot();
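A small check, on made-up rows, that counting with `pivot_table` matches an un-normalized `crosstab`. Using `aggfunc='count'` and `fill_value=0` is a choice made for this sketch rather than part of the lesson; `fill_value` replaces the missing values that otherwise appear where a product has no orders in a given hour.

```python
import pandas as pd

# Made-up order lines.
toy = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5, 6],
    'product_name': ['Froyo', 'Froyo', 'Froyo', 'Froyo', 'Cream', 'Cream'],
    'order_hour_of_day': [20, 20, 21, 9, 9, 10],
})

# Count order lines per hour and product, filling empty cells with 0.
toy_counts = toy.pivot_table(index='order_hour_of_day',
                             columns='product_name',
                             values='order_id',
                             aggfunc='count',
                             fill_value=0)

# The same counts from crosstab, which fills empty cells with 0 by default.
toy_ct_counts = pd.crosstab(toy['order_hour_of_day'], toy['product_name'])

print(toy_counts)
```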

### 4. melt

In [0]:
table = pd.crosstab(subset['order_hour_of_day'],
                    subset['product_name'],
                    normalize='columns') * 100  # scale fractions to percentages to match the axis label

melted = (table
          .reset_index()
          .melt(id_vars='order_hour_of_day')
          .rename(columns={'order_hour_of_day': 'Hour of Day Ordered',
                           'product_name': 'Product',
                           'value': 'Percent of Orders by Product'}))

sns.relplot(x='Hour of Day Ordered',
           y='Percent of Orders by Product',
           hue='Product',
           data=melted,
           kind='line')
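As an aside, `stack` is another way to turn a wide crosstab into long form. The sketch below, on made-up rows, compares it with `melt`. The two give the same rows in a different order; the column name `share` and the `toy_`/`long_` variables are introduced here for illustration.

```python
import pandas as pd

# Made-up order lines.
toy = pd.DataFrame({
    'product_name': ['Froyo', 'Froyo', 'Froyo', 'Froyo', 'Cream', 'Cream'],
    'order_hour_of_day': [20, 20, 21, 9, 9, 10],
})

toy_ct = pd.crosstab(toy['order_hour_of_day'],
                     toy['product_name'],
                     normalize='columns')

# melt: move the index into a column, then unpivot the product columns.
long_melt = (toy_ct
             .reset_index()
             .melt(id_vars='order_hour_of_day', value_name='share'))

# stack: move the column labels into the index, then flatten the index.
long_stack = toy_ct.stack().rename('share').reset_index()

print(long_melt)
print(long_stack)
```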

# ASSIGNMENT
- Replicate the lesson code
- Complete the code cells we skipped near the beginning of the notebook
  - Table 2 --> Tidy
  - Tidy --> Table 2

- Load seaborn's `flights` dataset by running the cell below. Then create a pivot table showing the number of passengers by month and year. Use year for the index and month for the columns. You've done it right if you get 112 passengers for January 1949 and 432 passengers for December 1960.

In [0]:
flights = sns.load_dataset('flights')
flights.head()

In [0]:
# Pivot the long (tidy) table into a wide table: one row per year, one column per month
flights_pv = flights.pivot_table(index='year', columns='month', values='passengers')
flights_pv

In [0]:
import matplotlib.pyplot as plt
for col in flights_pv.columns:
  sns.distplot(flights_pv[col], axlabel='Passengers', label=col)
plt.legend()

# STRETCH OPTIONS

_Try whatever sounds most interesting to you!_

- Replicate more of Instacart's visualization showing "Hour of Day Ordered" vs "Percent of Orders by Product"
- Replicate parts of the other visualization from [Instacart's blog post](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2), showing "Number of Purchases" vs "Percent Reorder Purchases"
- Get the most recent order for each user in Instacart's dataset. This is a useful baseline when [predicting a user's next order](https://www.kaggle.com/c/instacart-market-basket-analysis)
- Replicate parts of the blog post linked at the top of this notebook: [Modern Pandas, Part 5: Tidy Data](https://tomaugspurger.github.io/modern-5-tidy.html)
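For the "most recent order" option, one possible approach is sketched below on made-up rows. It assumes `orders` has `user_id` and `order_number` columns, with `order_number` increasing over time for each user. Neither column is displayed in this notebook, so check `orders.columns` before adapting it.

```python
import pandas as pd

# Made-up stand-in for orders; user_id and order_number are assumed columns.
toy_orders = pd.DataFrame({
    'order_id': [101, 102, 103, 201, 202],
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
})

# idxmax returns, for each user, the row label of the highest order_number.
latest_idx = toy_orders.groupby('user_id')['order_number'].idxmax()

# Select those rows to get one (most recent) order per user.
latest_orders = toy_orders.loc[latest_idx].reset_index(drop=True)

print(latest_orders)
```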

## CrossTab Method

In [0]:
merged2 = merged.copy()
merged2.head()

In [0]:
merged2 = merged2.drop(columns=['product_id','order_id'])
merged2.head()

In [0]:
# Crosstab: each value is the fraction (not the percentage) of a product's orders placed in that hour
merged2_ct = pd.crosstab(merged2['order_hour_of_day'], merged2['product_name'], normalize = 'columns')
merged2_ct

In [0]:
merge2_reset = merged2_ct.reset_index()
merge2_reset

In [0]:
sns.relplot(x='order_hour_of_day', y='#2 Coffee Filters', data=merge2_reset, kind='line')

## Pivot Table Method

In [0]:
merged3 = merged.copy()
merged3.head()