Lambda School Data Science

*Unit 1, Sprint 1, Module 3*

---

# Join and Reshape datasets

Objectives
- concatenate data with pandas
- merge data with pandas
- understand tidy data formatting
- melt and pivot data with pandas

Links
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)
  - Combine Data Sets: Standard Joins
  - Tidy Data
  - Reshaping Data
- [Tidy Data](https://en.wikipedia.org/wiki/Tidy_data)
- Python Data Science Handbook
  - [Chapter 3.6](https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html), Combining Datasets: Concat and Append
  - [Chapter 3.7](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html), Combining Datasets: Merge and Join
  - [Chapter 3.8](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html), Aggregation and Grouping
  - [Chapter 3.9](https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html), Pivot Tables
  
Reference
- Pandas Documentation: [Reshaping and Pivot Tables](https://pandas.pydata.org/pandas-docs/stable/reshaping.html)
- Modern Pandas, Part 5: [Tidy Data](https://tomaugspurger.github.io/modern-5-tidy.html)

In [5]:
!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

--2020-07-09 02:16:55--  https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.13.94
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.13.94|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 205548478 (196M) [application/x-gzip]
Saving to: ‘instacart_online_grocery_shopping_2017_05_01.tar.gz.2’


2020-07-09 02:17:00 (41.4 MB/s) - ‘instacart_online_grocery_shopping_2017_05_01.tar.gz.2’ saved [205548478/205548478]



In [6]:
!tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

instacart_2017_05_01/
instacart_2017_05_01/._aisles.csv
instacart_2017_05_01/aisles.csv
instacart_2017_05_01/._departments.csv
instacart_2017_05_01/departments.csv
instacart_2017_05_01/._order_products__prior.csv
instacart_2017_05_01/order_products__prior.csv
instacart_2017_05_01/._order_products__train.csv
instacart_2017_05_01/order_products__train.csv
instacart_2017_05_01/._orders.csv
instacart_2017_05_01/orders.csv
instacart_2017_05_01/._products.csv
instacart_2017_05_01/products.csv


In [7]:
%cd instacart_2017_05_01

/content/instacart_2017_05_01


In [8]:
!ls -lh *.csv

-rw-r--r-- 1 502 staff 2.6K May  2  2017 aisles.csv
-rw-r--r-- 1 502 staff  270 May  2  2017 departments.csv
-rw-r--r-- 1 502 staff 551M May  2  2017 order_products__prior.csv
-rw-r--r-- 1 502 staff  24M May  2  2017 order_products__train.csv
-rw-r--r-- 1 502 staff 104M May  2  2017 orders.csv
-rw-r--r-- 1 502 staff 2.1M May  2  2017 products.csv


# Assignment

## Join Data Practice

These are the top 10 most frequently ordered products. How many times was each ordered? 

1. Banana
2. Bag of Organic Bananas
3. Organic Strawberries
4. Organic Baby Spinach 
5. Organic Hass Avocado
6. Organic Avocado
7. Large Lemon 
8. Strawberries
9. Limes 
10. Organic Whole Milk

First, write down which columns you need and which dataframes have them.

Next, merge these into a single dataframe.

Then, use pandas functions from the previous lesson to get the counts of the top 10 most frequently ordered products.
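
The three steps above can be sketched on toy data. The tiny frames and their values below are made up for illustration; only the column names match the Instacart files.

```python
import pandas as pd

# Toy stand-ins for the Instacart tables (values are illustrative only)
products = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['Banana', 'Limes', 'Large Lemon'],
})
prior = pd.DataFrame({'order_id': [10, 10, 11], 'product_id': [1, 2, 1]})
train = pd.DataFrame({'order_id': [12, 12], 'product_id': [1, 3]})

# 1. Stack the two order_products tables (same columns) on top of each other
order_products = pd.concat([prior, train], ignore_index=True)

# 2. Attach product names through the shared product_id key
merged = order_products.merge(products, on='product_id', how='inner')

# 3. Count how often each product name appears
counts = merged['product_name'].value_counts()
print(counts)
```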

In [9]:
# Columns needed: product_name, product_id, order_id
# order_products__prior and order_products__train have order_id and product_id
# products has product_id and product_name


In [10]:
import pandas as pd
import numpy as np

In [11]:
# step 1: read the CSVs we need into dataframes
products = pd.read_csv('products.csv')

print(products.shape)
products.head()

(49688, 4)


Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


In [12]:
order_products__prior = pd.read_csv('order_products__prior.csv')
print(order_products__prior.shape)
order_products__prior.head()

(32434489, 4)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


In [13]:
order_products__train = pd.read_csv('order_products__train.csv')
print(order_products__train.shape)
order_products__train.head()

(1384617, 4)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1


In [14]:
orders = pd.read_csv('orders.csv')
print(orders.shape)
orders.head()

(3421083, 7)


Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [15]:
# step 2: concat order_products__prior and order_products__train
order_products = pd.concat([order_products__prior, order_products__train])
print(order_products.shape)
order_products.head()

(33819106, 4)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0
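
`pd.concat` keeps each input's original row labels, so the combined frame can contain repeated index values. A small sketch of how `ignore_index=True` gives a fresh 0..n-1 index instead, using the first two rows of each table shown earlier:

```python
import pandas as pd

a = pd.DataFrame({'order_id': [2, 2], 'product_id': [33120, 28985]})
b = pd.DataFrame({'order_id': [1, 1], 'product_id': [49302, 11109]})

kept = pd.concat([a, b])                      # original labels are kept
fresh = pd.concat([a, b], ignore_index=True)  # labels are renumbered

print(kept.index.tolist())
print(fresh.index.tolist())
```

Duplicate labels are harmless for the merges in this notebook, but they can surprise later `.loc` lookups.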


In [16]:
# step 3: slim down the df order_products 
order_products = order_products.drop(['add_to_cart_order', 'reordered'], axis=1)
order_products.head()

Unnamed: 0,order_id,product_id
0,2,33120
1,2,28985
2,2,9327
3,2,45918
4,2,30035


In [17]:
# step 4 : slim down orders
orders = orders[['order_id', 'order_number']]
orders

Unnamed: 0,order_id,order_number
0,2539329,1
1,2398795,2
2,473747,3
3,2254736,4
4,431534,5
...,...,...
3421078,2266710,10
3421079,1854736,11
3421080,626363,12
3421081,2977660,13


In [18]:
# step 5: slim down products to product_id and product_name
products = products[['product_id', 'product_name']]

print(products.shape)
products.head()

(49688, 2)


Unnamed: 0,product_id,product_name
0,1,Chocolate Sandwich Cookies
1,2,All-Seasons Salt
2,3,Robust Golden Unsweetened Oolong Tea
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...
4,5,Green Chile Anytime Sauce
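
Since the larger CSVs run to hundreds of megabytes, an alternative to loading everything and then dropping columns is to read only the columns needed. A hedged sketch using `read_csv`'s `usecols` parameter; the in-memory CSV text stands in for `products.csv` so the snippet runs on its own.

```python
import io
import pandas as pd

# Stand-in for products.csv (header plus the first two rows shown above)
csv_text = io.StringIO(
    "product_id,product_name,aisle_id,department_id\n"
    "1,Chocolate Sandwich Cookies,61,19\n"
    "2,All-Seasons Salt,104,13\n"
)

# usecols tells read_csv to parse only these columns
products_slim = pd.read_csv(csv_text, usecols=['product_id', 'product_name'])
print(products_slim.columns.tolist())
```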


In [20]:
# list of the top 10 products to keep
top_ten_products = ['Banana', 'Bag of Organic Bananas', 'Organic Strawberries', 'Organic Baby Spinach', 'Organic Hass Avocado', 'Organic Avocado', 'Large Lemon', 'Strawberries', 'Limes', 'Organic Whole Milk']

In [21]:
# keep only the rows for the top 10 products
condition = products['product_name'].isin(top_ten_products)
products = products[condition]
print(products.shape)
products.head()

(10, 2)


Unnamed: 0,product_id,product_name
13175,13176,Bag of Organic Bananas
16796,16797,Strawberries
21136,21137,Organic Strawberries
21902,21903,Organic Baby Spinach
24851,24852,Banana


In [22]:
# step 6: merge using product_id
list_of_products = pd.merge(products, order_products, on='product_id', how='inner')
print(list_of_products.shape)
list_of_products.head()

(2418314, 3)


Unnamed: 0,product_id,product_name,order_id
0,13176,Bag of Organic Bananas,5
1,13176,Bag of Organic Bananas,27
2,13176,Bag of Organic Bananas,29
3,13176,Bag of Organic Bananas,32
4,13176,Bag of Organic Bananas,42
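
When merging on a key, it can help to have pandas check the relationship you are assuming. The sketch below uses `merge`'s `validate` and `indicator` parameters on toy frames (values are made up). With `validate='one_to_many'`, pandas raises a `MergeError` if `product_id` is duplicated in the left table.

```python
import pandas as pd

products = pd.DataFrame({'product_id': [1, 2],
                         'product_name': ['Banana', 'Limes']})
order_products = pd.DataFrame({'order_id': [10, 11, 12],
                               'product_id': [1, 1, 3]})

# Each product_id should appear once in products, many times in order_products
checked = pd.merge(products, order_products, on='product_id',
                   how='outer', validate='one_to_many', indicator=True)

# _merge shows which side each row came from: both, left_only, right_only
print(checked[['product_id', '_merge']])
```

An outer join is used here only so that unmatched keys become visible; the notebook's inner join silently drops them.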


In [23]:
# merge using order_id
top10_products = pd.merge(list_of_products, orders, on='order_id', how='inner')
print(top10_products.shape)
top10_products.head()

(2418314, 4)


Unnamed: 0,product_id,product_name,order_id,order_number
0,13176,Bag of Organic Bananas,5,42
1,47209,Organic Hass Avocado,5,42
2,13176,Bag of Organic Bananas,27,16
3,47766,Organic Avocado,27,16
4,13176,Bag of Organic Bananas,29,14


In [24]:
# step 7: count how many times each of the top 10 products was ordered
top10_products['product_name'].value_counts().sort_values(ascending=False)

Banana                    491291
Bag of Organic Bananas    394930
Organic Strawberries      275577
Organic Baby Spinach      251705
Organic Hass Avocado      220877
Organic Avocado           184224
Large Lemon               160792
Strawberries              149445
Limes                     146660
Organic Whole Milk        142813
Name: product_name, dtype: int64
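
`value_counts()` already returns counts in descending order, so the extra `sort_values(ascending=False)` is not strictly needed. A minimal sketch on a made-up Series, using `head(n)` to keep the top entries:

```python
import pandas as pd

names = pd.Series(['Banana', 'Limes', 'Banana', 'Banana', 'Limes', 'Large Lemon'])

counts = names.value_counts()   # sorted from most to least frequent
top2 = counts.head(2)
print(top2)
```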

## Reshape Data Section

- Replicate the lesson code. Remember, if you haven't followed along typing out what we did during lecture, do that now to get more repetition with the syntax.
- Take table 2 (the transpose of table 1) and practice changing it into Tidy format and back again. You should not use the transpose operation anywhere in this code:
  - Table 2 --> Tidy
  - Tidy --> Table 2
- Load seaborn's `flights` dataset by running the cell below. Then create a pivot table showing the number of passengers by month and year. Use year for the index and month for the columns. You've done it right if you get 112 passengers for January 1949 and 432 passengers for December 1960.

In [1]:
import pandas as pd
import numpy as np

table1 = pd.DataFrame(
    [[np.nan, 2],
     [16,    11], 
     [3,      1]],
    index=['John Smith', 'Jane Doe', 'Mary Johnson'], 
    columns=['treatmenta', 'treatmentb'])

table2 = table1.T

table2

Unnamed: 0,John Smith,Jane Doe,Mary Johnson
treatmenta,,16.0,3.0
treatmentb,2.0,11.0,1.0


In [2]:
table2 = table2.reset_index()
table2


Unnamed: 0,index,John Smith,Jane Doe,Mary Johnson
0,treatmenta,,16.0,3.0
1,treatmentb,2.0,11.0,1.0


In [6]:
# table2 -> tidy
tidy1 = table2.melt(id_vars='index', value_vars=['John Smith', 'Jane Doe', 'Mary Johnson'], var_name='Name')
tidy1

Unnamed: 0,index,Name,value
0,treatmenta,John Smith,
1,treatmentb,John Smith,2.0
2,treatmenta,Jane Doe,16.0
3,treatmentb,Jane Doe,11.0
4,treatmenta,Mary Johnson,3.0
5,treatmentb,Mary Johnson,1.0


In [11]:
# tidy -> table2
wide = tidy1.pivot_table(index=['index'], columns='Name', values='value')
wide

Name,Jane Doe,John Smith,Mary Johnson
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
treatmenta,16.0,,3.0
treatmentb,11.0,2.0,1.0
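
`pivot_table` sorts the new column labels and leaves axis names (`index`, `Name`) behind, so `wide` is not yet identical to Table 2. Below is a sketch of one way to restore the original column order and drop the axis names. It uses `pivot` rather than `pivot_table` because each (treatment, name) pair occurs only once; `table2` is still built with `.T` only to match the assignment's setup cell.

```python
import numpy as np
import pandas as pd

table1 = pd.DataFrame(
    [[np.nan, 2], [16, 11], [3, 1]],
    index=['John Smith', 'Jane Doe', 'Mary Johnson'],
    columns=['treatmenta', 'treatmentb'])
table2 = table1.T

# Table 2 -> tidy
tidy = table2.reset_index().melt(id_vars='index', var_name='Name')

# tidy -> wide, then put the name columns back in Table 2's order
wide = tidy.pivot(index='index', columns='Name', values='value')
restored = wide[list(table2.columns)]

# Drop the leftover axis names so the frame prints like Table 2
restored.index.name = None
restored.columns.name = None
print(restored)
```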


In [15]:
import seaborn as sns
flights = sns.load_dataset('flights')



In [16]:
# Flights Pivot Table
flights

Unnamed: 0,year,month,passengers
0,1949,January,112
1,1949,February,118
2,1949,March,132
3,1949,April,129
4,1949,May,121
...,...,...,...
139,1960,August,606
140,1960,September,508
141,1960,October,461
142,1960,November,390


In [20]:
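# flights is already tidy (one row per year/month); melting `passengers`
# here is only practice for the melt -> pivot_table round trip
melt_table_flights = flights.melt(id_vars=['year', 'month'], value_vars=['passengers'], var_name='Passengers')
melt_table_flights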

Unnamed: 0,year,month,Passengers,value
0,1949,January,passengers,112
1,1949,February,passengers,118
2,1949,March,passengers,132
3,1949,April,passengers,129
4,1949,May,passengers,121
...,...,...,...,...
139,1960,August,passengers,606
140,1960,September,passengers,508
141,1960,October,passengers,461
142,1960,November,passengers,390


In [22]:
# melt to pivot_table
pivot_table_flights = melt_table_flights.pivot_table(index='year', columns='month', values='value')
pivot_table_flights

month,January,February,March,April,May,June,July,August,September,October,November,December
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1949,112,118,132,129,121,135,148,148,136,119,104,118
1950,115,126,141,135,125,149,170,170,158,133,114,140
1951,145,150,178,163,172,178,199,199,184,162,146,166
1952,171,180,193,181,183,218,230,242,209,191,172,194
1953,196,196,236,235,229,243,264,272,237,211,180,201
1954,204,188,235,227,234,264,302,293,259,229,203,229
1955,242,233,267,269,270,315,364,347,312,274,237,278
1956,284,277,317,313,318,374,413,405,355,306,271,306
1957,315,301,356,348,355,422,465,467,404,347,305,336
1958,340,318,362,348,363,435,491,505,404,359,310,337
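
Because `flights` is already in tidy form, the pivot can also be done directly on it without the intermediate melt. The sketch below assumes a small hand-built sample (six values copied from the table above) in place of the seaborn download, and uses an ordered categorical so the months stay in calendar order.

```python
import pandas as pd

months = ['January', 'February', 'March']
sample = pd.DataFrame({
    'year': [1949, 1949, 1949, 1950, 1950, 1950],
    'month': pd.Categorical(months * 2, categories=months, ordered=True),
    'passengers': [112, 118, 132, 115, 126, 141],
})

# One row per year, one column per month, passenger counts as values
wide = sample.pivot_table(index='year', columns='month',
                          values='passengers', aggfunc='sum', observed=True)
print(wide)
```

`aggfunc='sum'` is a choice made for this sketch; with one row per year/month pair it simply passes the value through.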


## Join Data Stretch Challenge

The [Instacart blog post](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2) has a visualization of "**Popular products** purchased earliest in the day (green) and latest in the day (red)." 

The post says,

> "We can also see the time of day that users purchase specific products.

> Healthier snacks and staples tend to be purchased earlier in the day, whereas ice cream (especially Half Baked and The Tonight Dough) are far more popular when customers are ordering in the evening.

> **In fact, of the top 25 latest ordered products, the first 24 are ice cream! The last one, of course, is a frozen pizza.**"

Your challenge is to reproduce the list of the top 25 latest ordered popular products.

We'll define "popular products" as products with more than 2,900 orders.
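
One possible reading of "latest ordered" is the highest average `order_hour_of_day` among popular products. The blog post does not spell out its exact metric, so that is an assumption. A sketch on toy data, with a hypothetical threshold of 2 orders standing in for 2,900:

```python
import pandas as pd

# Toy order lines; product names and hours are illustrative only
lines = pd.DataFrame({
    'product_name': ['Ice Cream', 'Ice Cream', 'Ice Cream',
                     'Banana', 'Banana', 'Banana', 'Rare Item'],
    'order_hour_of_day': [22, 21, 23, 9, 10, 14, 23],
})

MIN_ORDERS = 2  # hypothetical stand-in for the 2,900 cutoff

# Per product: how many times it was ordered and its average order hour
stats = lines.groupby('product_name')['order_hour_of_day'].agg(
    n_orders='count', mean_hour='mean')

# Keep popular products, then rank them from latest to earliest
popular = stats[stats['n_orders'] > MIN_ORDERS]
latest = popular.sort_values('mean_hour', ascending=False).head(25)
print(latest)
```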



In [None]:
# Columns needed:
# 1. order_hour_of_day
# 2. product_name
# 3. product_id
# 4. order_id

In [67]:
orders = pd.read_csv('orders.csv')
orders = orders[['order_id', 'order_hour_of_day','order_number']]
print(orders.shape)
orders.head()

(3421083, 3)


Unnamed: 0,order_id,order_hour_of_day,order_number
0,2539329,8,1
1,2398795,7,2
2,473747,12,3
3,2254736,7,4
4,431534,15,5


In [31]:
# we need product name and product id 
products = pd.read_csv('products.csv')
products = products[['product_name', 'product_id']]
print(products.shape)
products.head()

(49688, 2)


Unnamed: 0,product_name,product_id
0,Chocolate Sandwich Cookies,1
1,All-Seasons Salt,2
2,Robust Golden Unsweetened Oolong Tea,3
3,Smart Ones Classic Favorites Mini Rigatoni Wit...,4
4,Green Chile Anytime Sauce,5


In [32]:
order_products.head()

Unnamed: 0,order_id,product_id
0,2,33120
1,2,28985
2,2,9327
3,2,45918
4,2,30035


In [33]:
# merge products and order_products
orders_and_products = pd.merge(products, order_products, on='product_id', how='inner')
orders_and_products

Unnamed: 0,product_name,product_id,order_id
0,Chocolate Sandwich Cookies,1,1107
1,Chocolate Sandwich Cookies,1,5319
2,Chocolate Sandwich Cookies,1,7540
3,Chocolate Sandwich Cookies,1,9228
4,Chocolate Sandwich Cookies,1,9273
...,...,...,...
33819101,Fresh Foaming Cleanser,49688,3401313
33819102,Fresh Foaming Cleanser,49688,655800
33819103,Fresh Foaming Cleanser,49688,2198380
33819104,Fresh Foaming Cleanser,49688,2508423


In [69]:
# now we merge orders with orders_and_products using 'order_id'
popular_products = pd.merge(orders, orders_and_products, on='order_id', how='inner')
popular_products.head(25)

Unnamed: 0,order_id,order_hour_of_day,order_number,product_name,product_id
0,2539329,8,1,Soda,196
1,2539329,8,1,Original Beef Jerky,12427
2,2539329,8,1,Organic Unsweetened Vanilla Almond Milk,14084
3,2539329,8,1,Aged White Cheddar Popcorn,26088
4,2539329,8,1,XL Pick-A-Size Paper Towel Rolls,26405
5,2398795,7,2,Soda,196
6,2398795,7,2,Pistachios,10258
7,2398795,7,2,Original Beef Jerky,12427
8,2398795,7,2,Cinnamon Toast Crunch,13032
9,2398795,7,2,Bag of Organic Bananas,13176


In [35]:
popular_products['product_name'].value_counts().nlargest(25)

Banana                        491291
Bag of Organic Bananas        394930
Organic Strawberries          275577
Organic Baby Spinach          251705
Organic Hass Avocado          220877
Organic Avocado               184224
Large Lemon                   160792
Strawberries                  149445
Limes                         146660
Organic Whole Milk            142813
Organic Raspberries           142603
Organic Yellow Onion          117716
Organic Garlic                113936
Organic Zucchini              109412
Organic Blueberries           105026
Cucumber Kirby                 99728
Organic Fuji Apple             92889
Organic Lemon                  91251
Organic Grape Tomatoes         88078
Apple Honeycrisp Organic       87272
Seedless Red Grapes            86748
Organic Cucumber               85005
Honeycrisp Apple               83320
Organic Baby Carrots           80493
Sparkling Water Grapefruit     79245
Name: product_name, dtype: int64

In [36]:
popular_products = popular_products.sort_values(by=['order_hour_of_day'], ascending=False)
popular_products

Unnamed: 0,order_id,order_hour_of_day,product_name,product_id
2662734,1730200,23,Kale & Spinach Superfood Puffs,38984
2797020,3105038,23,Organic Old Fashioned Rolled Oats,44422
17707703,2232147,23,Large Lemon,47626
17707704,2232147,23,Mini Banana Chocolate Chip Bars Snack Cakes,48103
17707705,2232147,23,0% Fat Free Organic Milk,49517
...,...,...,...,...
6926011,2752807,0,Organic Great Northern Beans,42617
6926012,2752807,0,Vegetable Bouillon Cubes,45150
31640749,3024463,0,Organic Quick Rolled Oats,7314
31640750,3024463,0,Organic Yellow Onion,22935


In [37]:
popular_products['product_name'].value_counts().sort_values(ascending=False)

Banana                                                        491291
Bag of Organic Bananas                                        394930
Organic Strawberries                                          275577
Organic Baby Spinach                                          251705
Organic Hass Avocado                                          220877
                                                               ...  
Pantene Pro-V Color Preserve Volume Conditioner                    1
Orangemint Flavored Water                                          1
Drink Distinct All Natural Soda Pineapple Coconut & Nutmeg         1
Tropic Thunder  Coconut & Cream                                    1
Chocolate Go Bites                                                 1
Name: product_name, Length: 49685, dtype: int64

In [23]:
# evening orders: 18:00 or later (matches the hours listed in the output below)
filt1 = popular_products['order_hour_of_day'] >= 18

In [86]:
popular_products_evening = popular_products.loc[filt1]

In [94]:
popular_products_evening.sort_values(by=['order_hour_of_day'], ascending=False)

Unnamed: 0,order_id,order_hour_of_day,order_number,product_name,product_id
31914104,953131,23,2,Pure Sparkling Water,14947
28186173,1202988,23,18,Raspberries,43352
28186175,1202988,23,18,Organic Zucchini,45007
28186176,1202988,23,18,Honeydew Melon,47788
28186177,1202988,23,18,Organic Lite Coconut Milk,48104
...,...,...,...,...,...
11944796,502858,18,20,1 Liter,5428
11944797,502858,18,20,Large Alfresco Eggs,11520
11944798,502858,18,20,Organic Sunday Bacon,12456
11944799,502858,18,20,Strawberries,16797


In [87]:
popular_products_evening['order_hour_of_day'].unique()

array([19, 18, 20, 21, 22, 23])

In [81]:
popular_products_evening['product_name'].value_counts().sort_values(ascending=False).head(25)

Banana                      87425
Bag of Organic Bananas      70226
Organic Strawberries        51300
Organic Baby Spinach        45578
Organic Hass Avocado        39515
Organic Avocado             32798
Large Lemon                 26632
Organic Whole Milk          26623
Strawberries                26328
Organic Raspberries         25974
Limes                       24657
Organic Blueberries         20069
Organic Zucchini            19498
Organic Yellow Onion        19189
Organic Garlic              19093
Cucumber Kirby              18290
Organic Grape Tomatoes      16274
Organic Lemon               16117
Seedless Red Grapes         15646
Organic Fuji Apple          15568
Organic Cucumber            15365
Apple Honeycrisp Organic    14857
Organic Baby Carrots        14343
Honeycrisp Apple            14295
Carrots                     13744
Name: product_name, dtype: int64

In [91]:
popular_products_evening['order_number'].value_counts().sort_values().head(25)

100    1187
99     1883
97     2044
98     2071
96     2084
95     2304
93     2331
94     2431
92     2437
91     2669
90     2806
87     2880
89     3044
88     3124
86     3154
82     3536
85     3559
84     3794
83     3857
81     3979
80     4152
79     4429
78     4804
77     4866
76     5060
Name: order_number, dtype: int64

In [89]:
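# NOTE: order_number is the sequence number of a user's order, not the number
# of times a product was ordered, so this filter matches no rows.
# "More than 2,900 orders" has to be checked against per-product counts instead.
filt = popular_products_evening['order_number'] > 2900
popular = popular_products_evening[filt]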

In [90]:
popular 

Unnamed: 0,order_id,order_hour_of_day,order_number,product_name,product_id
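
The empty result above comes from filtering on `order_number`, which is a user's order sequence rather than a product's order count. Below is a sketch of filtering rows by how often their product appears, using `groupby(...).transform('count')`. The rows are made up, and a hypothetical cutoff of 2 is used in place of 2,900.

```python
import pandas as pd

rows = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5, 6],
    'order_hour_of_day': [18, 19, 23, 20, 21, 22],
    'product_name': ['Banana', 'Banana', 'Banana', 'Limes', 'Limes', 'Carrots'],
})

MIN_ORDERS = 2  # hypothetical; the challenge uses 2,900

# For every row, the number of rows that share its product_name
per_product = rows.groupby('product_name')['order_id'].transform('count')

# Boolean mask aligned with rows: keep only lines of popular products
popular_rows = rows[per_product > MIN_ORDERS]
print(popular_rows['product_name'].unique())
```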


## Reshape Data Stretch Challenge

_Try whatever sounds most interesting to you!_

- Replicate more of Instacart's visualization showing "Hour of Day Ordered" vs "Percent of Orders by Product"
- Replicate parts of the other visualization from [Instacart's blog post](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2), showing "Number of Purchases" vs "Percent Reorder Purchases"
- Get the most recent order for each user in Instacart's dataset. This is a useful baseline when [predicting a user's next order](https://www.kaggle.com/c/instacart-market-basket-analysis)
- Replicate parts of the blog post linked at the top of this notebook: [Modern Pandas, Part 5: Tidy Data](https://tomaugspurger.github.io/modern-5-tidy.html)
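
For the "most recent order for each user" idea, one approach is to take the row with the largest `order_number` per `user_id`. A sketch on toy rows (the user 1 rows come from the `orders.csv` preview earlier in this notebook; the user 2 rows are made up):

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [2539329, 2398795, 473747, 111, 222],
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
})

# idxmax gives the row label of the highest order_number within each user
last_idx = orders.groupby('user_id')['order_number'].idxmax()
most_recent = orders.loc[last_idx]
print(most_recent)
```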

In [None]:
##### YOUR CODE HERE #####