<a href="https://colab.research.google.com/github/goldenbear7/DS-Unit-1-Sprint-1-Data-Wrangling-and-Storytelling/blob/master/Bjorn_LS_DS_113_Join_and_Reshape_Data_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 1, Sprint 1, Module 3*

---

# Join and Reshape datasets

Objectives
- concatenate data with pandas
- merge data with pandas
- understand tidy data formatting
- melt and pivot data with pandas

Links
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)
- [Tidy Data](https://en.wikipedia.org/wiki/Tidy_data)
  - Combine Data Sets: Standard Joins
  - Tidy Data
  - Reshaping Data
- Python Data Science Handbook
  - [Chapter 3.6](https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html), Combining Datasets: Concat and Append
  - [Chapter 3.7](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html), Combining Datasets: Merge and Join
  - [Chapter 3.8](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html), Aggregation and Grouping
  - [Chapter 3.9](https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html), Pivot Tables
  
Reference
- Pandas Documentation: [Reshaping and Pivot Tables](https://pandas.pydata.org/pandas-docs/stable/reshaping.html)
- Modern Pandas, Part 5: [Tidy Data](https://tomaugspurger.github.io/modern-5-tidy.html)

In [1]:
!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

--2019-11-06 02:16:40--  https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.98.245
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.98.245|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 205548478 (196M) [application/x-gzip]
Saving to: ‘instacart_online_grocery_shopping_2017_05_01.tar.gz’


2019-11-06 02:16:46 (32.1 MB/s) - ‘instacart_online_grocery_shopping_2017_05_01.tar.gz’ saved [205548478/205548478]



In [2]:
!tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

instacart_2017_05_01/
instacart_2017_05_01/._aisles.csv
instacart_2017_05_01/aisles.csv
instacart_2017_05_01/._departments.csv
instacart_2017_05_01/departments.csv
instacart_2017_05_01/._order_products__prior.csv
instacart_2017_05_01/order_products__prior.csv
instacart_2017_05_01/._order_products__train.csv
instacart_2017_05_01/order_products__train.csv
instacart_2017_05_01/._orders.csv
instacart_2017_05_01/orders.csv
instacart_2017_05_01/._products.csv
instacart_2017_05_01/products.csv


In [3]:
%cd instacart_2017_05_01

/content/instacart_2017_05_01


In [4]:
!ls -lh *.csv

-rw-r--r-- 1 502 staff 2.6K May  2  2017 aisles.csv
-rw-r--r-- 1 502 staff  270 May  2  2017 departments.csv
-rw-r--r-- 1 502 staff 551M May  2  2017 order_products__prior.csv
-rw-r--r-- 1 502 staff  24M May  2  2017 order_products__train.csv
-rw-r--r-- 1 502 staff 104M May  2  2017 orders.csv
-rw-r--r-- 1 502 staff 2.1M May  2  2017 products.csv
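
Before loading the larger files in full, it can help to read only the header row of each CSV to see which columns live where. A minimal sketch, assuming the extracted files sit in the directory passed in; `csv_columns` is a hypothetical helper name, not part of pandas:

```python
from pathlib import Path

import pandas as pd


def csv_columns(directory="."):
    """Map each CSV file name in `directory` to its list of column names."""
    columns = {}
    for path in sorted(Path(directory).glob("*.csv")):
        # Skip the hidden "._*.csv" companion files that the archive also contains.
        if path.name.startswith("."):
            continue
        # nrows=0 parses only the header row, so large files return quickly.
        columns[path.name] = pd.read_csv(path, nrows=0).columns.tolist()
    return columns
```

Calling `csv_columns()` from inside `instacart_2017_05_01` should then list the columns of all six files without reading their rows.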


# Assignment

## Join Data Practice

These are the top 10 most frequently ordered products. How many times was each ordered? 

1. Banana
2. Bag of Organic Bananas
3. Organic Strawberries
4. Organic Baby Spinach 
5. Organic Hass Avocado
6. Organic Avocado
7. Large Lemon 
8. Strawberries
9. Limes 
10. Organic Whole Milk

First, write down which columns you need and which dataframes have them.

Next, merge these into a single dataframe.

Then, use pandas functions from the previous lesson to get the counts of the top 10 most frequently ordered products.
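
The three steps above can be sketched on toy data. This is a sketch under assumptions, not the assignment solution: it assumes the link between orders and products lives in `order_products__prior.csv` and `order_products__train.csv` (each with `order_id` and `product_id` columns) and that `products.csv` supplies `product_name`. The small `toy_*` frames are made-up stand-ins for those files.

```python
import pandas as pd

# Made-up stand-ins for order_products__prior.csv and order_products__train.csv.
toy_prior = pd.DataFrame({"order_id": [1, 1, 2, 3], "product_id": [10, 20, 10, 10]})
toy_train = pd.DataFrame({"order_id": [4, 4], "product_id": [20, 30]})

# Made-up stand-in for products.csv.
toy_products = pd.DataFrame({
    "product_id": [10, 20, 30],
    "product_name": ["Banana", "Limes", "Large Lemon"],
})

# Step 1: stack the two order-product tables (same columns, more rows).
toy_order_products = pd.concat([toy_prior, toy_train], ignore_index=True)

# Step 2: attach product names through the shared product_id key.
toy_merged = toy_order_products.merge(toy_products, on="product_id", how="left")

# Step 3: count how often each product name appears and keep the top n.
toy_top = toy_merged["product_name"].value_counts().head(2)
print(toy_top)
```

Each row of the merged frame is one product in one order, so counting names counts purchases.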

In [13]:
import pandas as pd
import numpy as np

orders = pd.read_csv('orders.csv')
orders.shape

(3421083, 7)

In [20]:
products = pd.read_csv('products.csv')
products.shape

(49688, 4)

In [17]:
products.sample()

Unnamed: 0,product_id,product_name,aisle_id,department_id
45051,45052,Fast-Max Maximum Strength Cold Flu & Sore Throat,11,11


In [19]:
orders.sample()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
2589450,2336410,155880,prior,40,5,8,5.0


In [0]:
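# Merge orders and products into a single dataframe.

# Rename user_id to product_id so the two frames share a key column.
# Caveat: user_id and product_id are unrelated identifiers, so this key pairs
# each order with an arbitrary product rather than the products it contains.
orders = orders.rename(columns={'user_id': 'product_id'})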




In [37]:
merged = pd.merge(orders, products, on='product_id', how='left')

merged.shape

merged


Unnamed: 0,order_id,product_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_name,aisle_id,department_id
0,2539329,1,prior,1,2,8,,Chocolate Sandwich Cookies,61.0,19.0
1,2398795,1,prior,2,3,7,15.0,Chocolate Sandwich Cookies,61.0,19.0
2,473747,1,prior,3,3,12,21.0,Chocolate Sandwich Cookies,61.0,19.0
3,2254736,1,prior,4,4,7,29.0,Chocolate Sandwich Cookies,61.0,19.0
4,431534,1,prior,5,4,15,28.0,Chocolate Sandwich Cookies,61.0,19.0
...,...,...,...,...,...,...,...,...,...,...
3421078,2266710,206209,prior,10,5,18,29.0,,,
3421079,1854736,206209,prior,11,4,10,30.0,,,
3421080,626363,206209,prior,12,1,12,18.0,,,
3421081,2977660,206209,prior,13,1,12,7.0,,,


In [40]:
# mode() returns every value tied for most frequent; its argument is dropna, not a count
merged['product_id'].mode()

0          210
1          310
2          313
3          690
4          786
         ...  
1369    205483
1370    205543
1371    205878
1372    205972
1373    206105
Length: 1374, dtype: int64
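
`Series.mode` returns every value tied for the highest frequency, and its first argument is `dropna` rather than a count, which is why the output above lists 1,374 ids instead of ten. A minimal sketch on a made-up series, showing `value_counts().head(n)` as one way to get a top-n list instead:

```python
import pandas as pd

toy_ids = pd.Series([7, 7, 3, 3, 5, 9])

# mode() returns all values tied for the highest frequency, sorted ascending.
toy_modes = toy_ids.mode().tolist()
print(toy_modes)

# value_counts() sorts by frequency, so head(n) keeps the n most common values.
toy_top_two = toy_ids.value_counts().head(2)
print(toy_top_two)
```

Here both 3 and 7 occur twice, so `mode()` returns both, while `head(2)` always returns exactly two rows along with their counts.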

In [44]:
cols = ['product_id', 'product_name']
merged = merged[cols]
merged

Unnamed: 0,product_id,product_name
0,1,Chocolate Sandwich Cookies
1,1,Chocolate Sandwich Cookies
2,1,Chocolate Sandwich Cookies
3,1,Chocolate Sandwich Cookies
4,1,Chocolate Sandwich Cookies
...,...,...
3421078,206209,
3421079,206209,
3421080,206209,
3421081,206209,


In [51]:
merged['product_name'].value_counts()

Stainless Steel Grater                           100
Passionfruit Papaya Tea Bags                     100
California Style Marinated Artichoke Hearts      100
Dark Chocolate Mint Cups                         100
Wild Blueberry Scone Mox                         100
                                                ... 
Aged Cheddar                                       4
Chewy Chocolate Chip                               4
2nd Foods Peas                                     4
G Series Orange Sports Drink                       4
Organic Mini Sandwich Crackers Cheddar Cheese      4
Name: product_name, Length: 49688, dtype: int64

In [54]:
print(merged[merged.product_name == 'Banana'])

        product_id product_name
412896       24852       Banana
412897       24852       Banana
412898       24852       Banana
412899       24852       Banana
412900       24852       Banana


In [58]:
print(merged[merged.product_name == 'Bag of Organic Bananas'])

        product_id            product_name
218034       13176  Bag of Organic Bananas
218035       13176  Bag of Organic Bananas
218036       13176  Bag of Organic Bananas
218037       13176  Bag of Organic Bananas
218038       13176  Bag of Organic Bananas
218039       13176  Bag of Organic Bananas
218040       13176  Bag of Organic Bananas
218041       13176  Bag of Organic Bananas
218042       13176  Bag of Organic Bananas
218043       13176  Bag of Organic Bananas
218044       13176  Bag of Organic Bananas
218045       13176  Bag of Organic Bananas
218046       13176  Bag of Organic Bananas
218047       13176  Bag of Organic Bananas
218048       13176  Bag of Organic Bananas
218049       13176  Bag of Organic Bananas
218050       13176  Bag of Organic Bananas
218051       13176  Bag of Organic Bananas
218052       13176  Bag of Organic Bananas
218053       13176  Bag of Organic Bananas
218054       13176  Bag of Organic Bananas
218055       13176  Bag of Organic Bananas
218056     

In [60]:
print(merged[merged.product_name == 'Organic Strawberries'])

        product_id          product_name
349406       21137  Organic Strawberries
349407       21137  Organic Strawberries
349408       21137  Organic Strawberries
349409       21137  Organic Strawberries
349410       21137  Organic Strawberries
349411       21137  Organic Strawberries
349412       21137  Organic Strawberries
349413       21137  Organic Strawberries
349414       21137  Organic Strawberries
349415       21137  Organic Strawberries


In [62]:
print(merged[merged.product_name == 'Organic Baby Spinach'])

        product_id          product_name
362406       21903  Organic Baby Spinach
362407       21903  Organic Baby Spinach
362408       21903  Organic Baby Spinach
362409       21903  Organic Baby Spinach
362410       21903  Organic Baby Spinach


In [65]:
print(merged[merged.product_name == 'Organic Hass Avocado'])

        product_id          product_name
785128       47209  Organic Hass Avocado
785129       47209  Organic Hass Avocado
785130       47209  Organic Hass Avocado
785131       47209  Organic Hass Avocado
785132       47209  Organic Hass Avocado
785133       47209  Organic Hass Avocado


In [67]:
print(merged[merged.product_name == 'Organic Avocado'])

        product_id     product_name
794399       47766  Organic Avocado
794400       47766  Organic Avocado
794401       47766  Organic Avocado
794402       47766  Organic Avocado
794403       47766  Organic Avocado
794404       47766  Organic Avocado
794405       47766  Organic Avocado
794406       47766  Organic Avocado
794407       47766  Organic Avocado
794408       47766  Organic Avocado


In [69]:
print(merged[merged.product_name == 'Large Lemon'])

        product_id product_name
792056       47626  Large Lemon
792057       47626  Large Lemon
792058       47626  Large Lemon
792059       47626  Large Lemon
792060       47626  Large Lemon
792061       47626  Large Lemon
792062       47626  Large Lemon
792063       47626  Large Lemon
792064       47626  Large Lemon
792065       47626  Large Lemon


In [71]:
print(merged[merged.product_name == 'Strawberries'])

        product_id  product_name
278117       16797  Strawberries
278118       16797  Strawberries
278119       16797  Strawberries
278120       16797  Strawberries
278121       16797  Strawberries
278122       16797  Strawberries
278123       16797  Strawberries
278124       16797  Strawberries
278125       16797  Strawberries
278126       16797  Strawberries
278127       16797  Strawberries
278128       16797  Strawberries
278129       16797  Strawberries
278130       16797  Strawberries
278131       16797  Strawberries
278132       16797  Strawberries
278133       16797  Strawberries
278134       16797  Strawberries
278135       16797  Strawberries
278136       16797  Strawberries
278137       16797  Strawberries
278138       16797  Strawberries
278139       16797  Strawberries
278140       16797  Strawberries
278141       16797  Strawberries
278142       16797  Strawberries
278143       16797  Strawberries
278144       16797  Strawberries
278145       16797  Strawberries
278146    

In [81]:
print(merged[merged.product_name == 'Limes'])

        product_id product_name
436201       26209        Limes
436202       26209        Limes
436203       26209        Limes
436204       26209        Limes
436205       26209        Limes
436206       26209        Limes
436207       26209        Limes
436208       26209        Limes
436209       26209        Limes
436210       26209        Limes
436211       26209        Limes


In [75]:
print(merged[merged.product_name == 'Organic Whole Milk'])

        product_id        product_name
462458       27845  Organic Whole Milk
462459       27845  Organic Whole Milk
462460       27845  Organic Whole Milk
462461       27845  Organic Whole Milk
462462       27845  Organic Whole Milk
462463       27845  Organic Whole Milk
462464       27845  Organic Whole Milk
462465       27845  Organic Whole Milk
462466       27845  Organic Whole Milk
462467       27845  Organic Whole Milk
462468       27845  Organic Whole Milk
462469       27845  Organic Whole Milk
462470       27845  Organic Whole Milk


In [82]:
n = 10
merged['product_name'].value_counts()[:n].index.tolist()



['Stainless Steel Grater',
 'Passionfruit Papaya Tea Bags',
 'California Style Marinated Artichoke Hearts',
 'Dark Chocolate Mint Cups',
 'Wild Blueberry Scone Mox',
 'Goat Milk Mozzarella',
 'Grated Parmesan',
 'Shrimp Ceviche',
 'Sparkling Clementine Juice',
 'Mighty Pacs Laundry Detergent']

In [83]:
merged['product_name'].value_counts()

Stainless Steel Grater                           100
Passionfruit Papaya Tea Bags                     100
California Style Marinated Artichoke Hearts      100
Dark Chocolate Mint Cups                         100
Wild Blueberry Scone Mox                         100
                                                ... 
Aged Cheddar                                       4
Chewy Chocolate Chip                               4
2nd Foods Peas                                     4
G Series Orange Sports Drink                       4
Organic Mini Sandwich Crackers Cheddar Cheese      4
Name: product_name, Length: 49688, dtype: int64
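
The counts above range from 4 to 100 and the head of the list does not match the expected top 10, which suggests the join key is not linking orders to the products they contain. One way to inspect a key before trusting a merge is the `indicator` and `validate` options of `DataFrame.merge`. A minimal sketch on made-up frames:

```python
import pandas as pd

toy_left = pd.DataFrame({"order_id": [1, 2, 3], "product_id": [10, 10, 99]})
toy_right = pd.DataFrame({"product_id": [10, 20], "product_name": ["Banana", "Limes"]})

checked = toy_left.merge(
    toy_right,
    on="product_id",
    how="left",
    indicator=True,          # adds a _merge column: both / left_only / right_only
    validate="many_to_one",  # raises MergeError if product_id repeats in toy_right
)

# Rows tagged left_only found no matching product.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["order_id", "product_id"]])
```

A check like this catches keys that fail to match. It cannot catch a key that matches by coincidence, so it complements, rather than replaces, confirming that both columns describe the same thing.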

## Reshape Data Section

- Replicate the lesson code
- Complete the code cells we skipped near the beginning of the notebook
- Table 2 --> Tidy
- Tidy --> Table 2
- Load seaborn's `flights` dataset by running the cell below. Then create a pivot table showing the number of passengers by month and year. Use year for the index and month for the columns. You've done it right if you get 112 passengers for January 1949 and 432 passengers for December 1960.

In [0]:
import seaborn as sns
flights = sns.load_dataset('flights')
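
The pivot described in the last bullet can be sketched with `pivot_table`. The block uses a small made-up frame that borrows the `year`, `month`, and `passengers` column names from `flights`, so it runs without downloading anything; the passenger values are invented.

```python
import pandas as pd

toy_flights = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [10, 20, 30, 40],
})

# One row per year, one column per month, passenger counts in the cells.
# Each (year, month) pair appears once here, so "sum" just passes the value through.
toy_wide = toy_flights.pivot_table(
    index="year", columns="month", values="passengers", aggfunc="sum"
)
print(toy_wide)
```

With plain string months the columns come out in alphabetical order; a categorical `month` column would keep its category order instead.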

In [0]:
##### YOUR CODE HERE #####
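
For the "Table 2 --> Tidy" and "Tidy --> Table 2" bullets, a round trip between wide and tidy layouts can be sketched with `melt` and `pivot`. The frame here is a made-up wide table, not the lesson's Table 2, and the column names are hypothetical:

```python
import pandas as pd

# Made-up wide table: one row per treatment, one column per person.
toy_table = pd.DataFrame({
    "treatment": ["a", "b"],
    "alice": [1, 4],
    "bob": [2, 5],
})

# Wide -> tidy: one row per (treatment, name) observation.
toy_tidy = toy_table.melt(id_vars="treatment", var_name="name", value_name="result")

# Tidy -> wide: spread the names back out into columns.
toy_back = toy_tidy.pivot(index="treatment", columns="name", values="result").reset_index()
toy_back.columns.name = None  # drop the leftover "name" label on the column axis

print(toy_tidy)
print(toy_back)
```

`pivot` requires each (index, columns) pair to be unique; when duplicates exist, `pivot_table` with an aggregation function is the usual alternative.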

## Join Data Stretch Challenge

The [Instacart blog post](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2) has a visualization of "**Popular products** purchased earliest in the day (green) and latest in the day (red)." 

The post says,

> "We can also see the time of day that users purchase specific products.

> Healthier snacks and staples tend to be purchased earlier in the day, whereas ice cream (especially Half Baked and The Tonight Dough) are far more popular when customers are ordering in the evening.

> **In fact, of the top 25 latest ordered products, the first 24 are ice cream! The last one, of course, is a frozen pizza.**"

Your challenge is to reproduce the list of the top 25 latest ordered popular products.

We'll define "popular products" as products with more than 2,900 orders.
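
One possible reading of "latest ordered" is the highest mean `order_hour_of_day` among popular products. That interpretation is an assumption, since the blog post may have ranked products differently. A minimal sketch on made-up frames, with the popularity threshold shrunk from 2,900 to 1 so the toy data can pass it:

```python
import pandas as pd

# Made-up stand-ins for orders.csv, the order_products files, and products.csv.
toy_orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "order_hour_of_day": [8, 9, 22, 23],
})
toy_order_products = pd.DataFrame({
    "order_id": [1, 2, 3, 3, 4, 4],
    "product_id": [10, 10, 10, 20, 20, 30],
})
toy_products = pd.DataFrame({
    "product_id": [10, 20, 30],
    "product_name": ["Banana", "Ice Cream", "Frozen Pizza"],
})

MIN_ORDERS = 1  # the challenge uses 2,900; lowered for the toy data

# One row per purchased item, with the order hour and product name attached.
detail = (
    toy_order_products
    .merge(toy_orders, on="order_id")
    .merge(toy_products, on="product_id")
)

# Per product: how many times it was ordered and its mean order hour.
stats = detail.groupby("product_name")["order_hour_of_day"].agg(["count", "mean"])
popular = stats[stats["count"] > MIN_ORDERS]  # "more than" the threshold
latest = popular.sort_values("mean", ascending=False).head(25)
print(latest)
```

On the real data the same pipeline would use the concatenated order-product files and `MIN_ORDERS = 2900`.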



In [0]:
##### YOUR CODE HERE #####

## Reshape Data Stretch Challenge

_Try whatever sounds most interesting to you!_

- Replicate more of Instacart's visualization showing "Hour of Day Ordered" vs "Percent of Orders by Product"
- Replicate parts of the other visualization from [Instacart's blog post](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2), showing "Number of Purchases" vs "Percent Reorder Purchases"
- Get the most recent order for each user in Instacart's dataset. This is a useful baseline when [predicting a user's next order](https://www.kaggle.com/c/instacart-market-basket-analysis)
- Replicate parts of the blog post linked at the top of this notebook: [Modern Pandas, Part 5: Tidy Data](https://tomaugspurger.github.io/modern-5-tidy.html)
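
For the most-recent-order bullet, one approach is to pick, per user, the row with the largest `order_number`. A minimal sketch on a made-up frame that borrows the `order_id`, `user_id`, and `order_number` column names from `orders.csv`, and assumes a larger `order_number` means a later order:

```python
import pandas as pd

toy_user_orders = pd.DataFrame({
    "order_id": [101, 102, 103, 201, 202],
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
})

# idxmax gives, for each user, the row label of their highest order_number.
last_rows = toy_user_orders.groupby("user_id")["order_number"].idxmax()
most_recent = toy_user_orders.loc[last_rows].reset_index(drop=True)
print(most_recent)
```

Selecting rows by label keeps every column of the original frame, which a plain `groupby(...).max()` would not guarantee to come from the same row.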

In [0]:
##### YOUR CODE HERE #####