
_Lambda School Data Science_

# Join datasets

Objectives
- concatenate data with pandas
- merge data with pandas

Links
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)
  - Combine Data Sets: Standard Joins
- Python Data Science Handbook
  - [Chapter 3.6](https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html), Combining Datasets: Concat and Append
  - [Chapter 3.7](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html), Combining Datasets: Merge and Join

## Download data

We’ll work with a dataset of [3 Million Instacart Orders, Open Sourced](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2)!

In [1]:
!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

--2019-03-26 21:21:14--  https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.49.67
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.49.67|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 205548478 (196M) [application/x-gzip]
Saving to: ‘instacart_online_grocery_shopping_2017_05_01.tar.gz.1’


2019-03-26 21:21:16 (96.3 MB/s) - ‘instacart_online_grocery_shopping_2017_05_01.tar.gz.1’ saved [205548478/205548478]



In [2]:
!tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

instacart_2017_05_01/
instacart_2017_05_01/._aisles.csv
instacart_2017_05_01/aisles.csv
instacart_2017_05_01/._departments.csv
instacart_2017_05_01/departments.csv
instacart_2017_05_01/._order_products__prior.csv
instacart_2017_05_01/order_products__prior.csv
instacart_2017_05_01/._order_products__train.csv
instacart_2017_05_01/order_products__train.csv
instacart_2017_05_01/._orders.csv
instacart_2017_05_01/orders.csv
instacart_2017_05_01/._products.csv
instacart_2017_05_01/products.csv


In [3]:
%cd instacart_2017_05_01

/content/instacart_2017_05_01


## Goal: Reproduce this example

The first two orders for user id 1:

In [4]:
from IPython.display import display, Image
url = 'https://cdn-images-1.medium.com/max/1600/1*vYGFQCafJtGBBX5mbl0xyw.png'
example = Image(url=url, width=600)

display(example)

## Load data

Here's a list of all six CSV filenames:

In [5]:
!ls -lh

total 681M
-rw-r--r-- 1 502 staff 2.6K May  2  2017 aisles.csv
-rw-r--r-- 1 502 staff  270 May  2  2017 departments.csv
-rw-r--r-- 1 502 staff 551M May  2  2017 order_products__prior.csv
-rw-r--r-- 1 502 staff  24M May  2  2017 order_products__train.csv
-rw-r--r-- 1 502 staff 104M May  2  2017 orders.csv
-rw-r--r-- 1 502 staff 2.1M May  2  2017 products.csv


For each CSV
- Load it with pandas
- Look at the dataframe's shape
- Look at its head (first rows)
- `display(example)`
- Which columns does it have in common with the example we want to reproduce?
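
The checklist above can be wrapped in a small helper. This is only a sketch: `summarize_csv` and `example_columns` are hypothetical names, and the in-memory text stands in for `products.csv` so the snippet runs without the download. The column list is assumed to match the columns visible in the example image.

```python
import io
import pandas as pd

def summarize_csv(source, example_columns):
    """Load a CSV and report its shape plus the columns it shares with the example."""
    df = pd.read_csv(source)
    shared = [col for col in df.columns if col in example_columns]
    return df, df.shape, shared

# Assumed: the columns that appear in the example we want to reproduce.
example_columns = ['user_id', 'order_id', 'order_number', 'order_dow',
                   'order_hour_of_day', 'add_to_cart_order',
                   'product_id', 'product_name']

# Tiny in-memory stand-in for products.csv (two real rows, for illustration).
sample = io.StringIO(
    "product_id,product_name,aisle_id,department_id\n"
    "1,Chocolate Sandwich Cookies,61,19\n"
    "2,All-Seasons Salt,104,13\n"
)

products_sample, shape, shared = summarize_csv(sample, example_columns)
print(shape)   # (2, 4)
print(shared)  # ['product_id', 'product_name']
```

With the real files, passing a filename such as `'products.csv'` instead of the `StringIO` object should work the same way, since `pd.read_csv` accepts both.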

### aisles

In [0]:
import pandas as pd
aisles = pd.read_csv('aisles.csv')

In [7]:
aisles.shape

(134, 2)

In [8]:
aisles.head(10)

Unnamed: 0,aisle_id,aisle
0,1,prepared soups salads
1,2,specialty cheeses
2,3,energy granola bars
3,4,instant foods
4,5,marinades meat preparation
5,6,other
6,7,packaged meat
7,8,bakery desserts
8,9,pasta sauce
9,10,kitchen supplies


### departments

In [9]:
departments = pd.read_csv('departments.csv')
departments.shape

(21, 2)

In [10]:
departments.head(10)

Unnamed: 0,department_id,department
0,1,frozen
1,2,other
2,3,bakery
3,4,produce
4,5,alcohol
5,6,international
6,7,beverages
7,8,pets
8,9,dry goods pasta
9,10,bulk


In [11]:
display(example)

# The example has no department columns, so it seems departments.csv isn't needed.

### order_products__prior

In [12]:
prior_products = pd.read_csv('order_products__prior.csv')
prior_products.shape

(32434489, 4)

In [13]:
prior_products.head(10)

# It looks like I'll need order_id, add_to_cart_order, and product_id.

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0
5,2,17794,6,1
6,2,40141,7,1
7,2,1819,8,1
8,2,43668,9,0
9,3,33754,1,1


### order_products__train

In [14]:
train_products = pd.read_csv('order_products__train.csv')
train_products.shape

(1384617, 4)

In [15]:
train_products.head(10)

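# Same columns as order_products__prior, so the same three are needed:
# order_id, add_to_cart_order, and product_id.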

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1
5,1,13176,6,0
6,1,47209,7,0
7,1,22035,8,1
8,36,39612,1,0
9,36,19660,2,1


### orders

In [16]:
orders = pd.read_csv('orders.csv')
orders.shape

(3421083, 7)

In [17]:
orders.head(10)

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0
5,3367565,1,prior,6,2,7,19.0
6,550135,1,prior,7,1,9,20.0
7,3108588,1,prior,8,1,14,14.0
8,2295261,1,prior,9,1,16,0.0
9,2550362,1,prior,10,4,8,30.0


In [18]:
display(example)

# order_id, user_id, order_number, order_dow, and order_hour_of_day are all in the example.

### products

In [19]:
products = pd.read_csv('products.csv')
products.shape

(49688, 4)

In [20]:
products.head(10)

# product_id and product_name are in the example.

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13
5,6,Dry Nose Oil,11,11
6,7,Pure Coconut Water With Orange,98,7
7,8,Cut Russet Potatoes Steam N' Mash,116,1
8,9,Light Strawberry Blueberry Yogurt,120,16
9,10,Sparkling Orange Juice & Prickly Pear Beverage,115,7


## Concatenate order_products__prior and order_products__train

In [0]:
order_products = pd.concat([prior_products, train_products])

In [22]:
order_products.shape

(33819106, 4)
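
The shape confirms that `pd.concat` stacked the rows: 32,434,489 + 1,384,617 = 33,819,106, with the same four columns. A minimal sketch of that behavior follows, using a few rows copied from the two heads shown earlier. The `_sample` names are hypothetical, and `ignore_index=True` is an optional variation that this notebook does not use.

```python
import pandas as pd

prior_sample = pd.DataFrame({
    'order_id': [2, 2, 2],
    'product_id': [33120, 28985, 9327],
    'add_to_cart_order': [1, 2, 3],
    'reordered': [1, 1, 0],
})
train_sample = pd.DataFrame({
    'order_id': [1, 1],
    'product_id': [49302, 11109],
    'add_to_cart_order': [1, 2],
    'reordered': [1, 1],
})

# Default: rows are stacked and each frame keeps its own index labels,
# so labels can repeat in the result.
stacked = pd.concat([prior_sample, train_sample])
print(stacked.shape)             # (5, 4)
print(stacked.index.tolist())    # [0, 1, 2, 0, 1]

# Optional: build a fresh 0..n-1 index instead.
renumbered = pd.concat([prior_sample, train_sample], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
```

Repeated index labels do not affect the merges in this notebook, because `pd.merge` joins on columns and not on the index.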

## Get a subset of orders — the first two orders for user id 1

In [23]:
display(example)

In [24]:
orders.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [25]:
orders[orders['user_id']==1].head(2)

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0


In [0]:
conditions = (orders['user_id']==1) & (orders['order_number']<=2)

columns = ['user_id',
           'order_id',
           'order_number',
           'order_dow',
           'order_hour_of_day']

subset = orders.loc[conditions, columns]
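
Filtering on `order_number` does not depend on the row order of the file, whereas `.head(2)` does. A small sketch of the same boolean-mask selection follows, using three rows copied from the `orders` head. The `_sample` names are hypothetical.

```python
import pandas as pd

orders_sample = pd.DataFrame({
    'order_id': [2539329, 2398795, 473747],
    'user_id': [1, 1, 1],
    'eval_set': ['prior', 'prior', 'prior'],
    'order_number': [1, 2, 3],
    'order_dow': [2, 3, 3],
    'order_hour_of_day': [8, 7, 12],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than == and <=.
mask = (orders_sample['user_id'] == 1) & (orders_sample['order_number'] <= 2)

# .loc takes the row mask first, then the list of columns to keep.
subset_sample = orders_sample.loc[mask, ['user_id', 'order_id', 'order_number']]
print(subset_sample['order_id'].tolist())  # [2539329, 2398795]
```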

## Merge dataframes

In [0]:
# With no `on` argument, pd.merge does an inner join on the shared column (order_id).
merged = pd.merge(subset,
                  order_products[['order_id', 'add_to_cart_order', 'product_id']])

In [28]:
merged.head()

Unnamed: 0,user_id,order_id,order_number,order_dow,order_hour_of_day,add_to_cart_order,product_id
0,1,2539329,1,2,8,1,196
1,1,2539329,1,2,8,2,14084
2,1,2539329,1,2,8,3,12427
3,1,2539329,1,2,8,4,26088
4,1,2539329,1,2,8,5,26405


In [29]:
display(example)

In [0]:
final = pd.merge(merged, products[['product_id', 'product_name']])

In [31]:
merged.shape, products[['product_id', 'product_name']].shape, final.shape

((11, 7), (49688, 2), (11, 8))
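
The row count stays at 11 because each `product_id` matches exactly one row in `products`; the merge only adds the `product_name` column. The sketch below spells out the defaults that the calls above rely on (`on` set to the shared column, `how='inner'`) on toy frames built from rows shown in this notebook. The `_sample` names are hypothetical, and `validate` is an optional safety check that the notebook itself does not use.

```python
import pandas as pd

subset_sample = pd.DataFrame({
    'user_id': [1, 1],
    'order_id': [2539329, 2398795],
    'order_number': [1, 2],
})
order_products_sample = pd.DataFrame({
    'order_id': [2539329, 2539329, 2398795, 2],
    'add_to_cart_order': [1, 2, 1, 1],
    'product_id': [196, 14084, 196, 33120],
})
products_sample = pd.DataFrame({
    'product_id': [196, 14084, 33120],
    'product_name': ['Soda',
                     'Organic Unsweetened Vanilla Almond Milk',
                     'Organic Egg Whites'],
})

# One order has many order lines: rows for order_id 2 are dropped (inner join).
merged_sample = pd.merge(subset_sample, order_products_sample,
                         on='order_id', how='inner', validate='one_to_many')

# Many order lines point at one product: row count stays the same.
final_sample = pd.merge(merged_sample, products_sample,
                        on='product_id', how='inner', validate='many_to_one')

print(merged_sample.shape)  # (3, 5)
print(final_sample.shape)   # (3, 6)
```

If a key were unexpectedly duplicated on the "one" side, `validate` would raise a `MergeError` rather than silently multiplying rows.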

In [32]:
final.head()

Unnamed: 0,user_id,order_id,order_number,order_dow,order_hour_of_day,add_to_cart_order,product_id,product_name
0,1,2539329,1,2,8,1,196,Soda
1,1,2398795,2,3,7,1,196,Soda
2,1,2539329,1,2,8,2,14084,Organic Unsweetened Vanilla Almond Milk
3,1,2539329,1,2,8,3,12427,Original Beef Jerky
4,1,2398795,2,3,7,3,12427,Original Beef Jerky


In [0]:
final = final.sort_values(by=['order_number', 'add_to_cart_order'])

In [0]:
final.columns = [column.replace('_', ' ') for column in final]

In [35]:
final.head(1)

Unnamed: 0,user id,order id,order number,order dow,order hour of day,add to cart order,product id,product name
0,1,2539329,1,2,8,1,196,Soda
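
The sort and rename steps can be checked on a handful of rows. This sketch uses the five rows from `final.head()` (trimmed to three columns) and shows the vectorized `.str.replace` on the column index as an alternative to the list comprehension; both should give the same labels. The `final_sample` and `ordered` names are hypothetical.

```python
import pandas as pd

final_sample = pd.DataFrame({
    'order_number': [1, 2, 1, 1, 2],
    'add_to_cart_order': [1, 1, 2, 3, 3],
    'product_name': ['Soda', 'Soda',
                     'Organic Unsweetened Vanilla Almond Milk',
                     'Original Beef Jerky', 'Original Beef Jerky'],
})

# Sort by order first, then by position in the cart within each order.
ordered = final_sample.sort_values(by=['order_number', 'add_to_cart_order'])

# Replace underscores in every column label with a plain (non-regex) replacement.
ordered.columns = ordered.columns.str.replace('_', ' ', regex=False)

print(ordered.columns.tolist())
# ['order number', 'add to cart order', 'product name']
print(ordered['order number'].tolist())       # [1, 1, 1, 2, 2]
print(ordered['add to cart order'].tolist())  # [1, 2, 3, 1, 3]
```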


# Assignment

These are the top 10 most frequently ordered products. How many times was each ordered? 

1. Banana
2. Bag of Organic Bananas
3. Organic Strawberries
4. Organic Baby Spinach 
5. Organic Hass Avocado
6. Organic Avocado
7. Large Lemon 
8. Strawberries
9. Limes 
10. Organic Whole Milk

First, write down which columns you need and which dataframes have them.

Next, merge these into a single dataframe.

Then, use pandas functions from the previous lesson to get the counts of the top 10 most frequently ordered products.
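
One possible shape for these steps, sketched on toy data: the counts below are made up for illustration and are not the real Instacart numbers. The `_toy` frames are hypothetical; the column names match `order_products` and `products`.

```python
import pandas as pd

# Toy stand-ins: six order lines across three orders.
order_products_toy = pd.DataFrame({
    'order_id':   [1, 1, 2, 2, 3, 3],
    'product_id': [10, 20, 10, 30, 10, 20],
})
products_toy = pd.DataFrame({
    'product_id': [10, 20, 30],
    'product_name': ['Banana', 'Limes', 'Strawberries'],
})

# Attach a name to every order line, then count lines per name.
named = pd.merge(order_products_toy, products_toy, on='product_id', how='inner')
counts = named['product_name'].value_counts()  # sorted by count, descending

top = counts.head(2)  # on the real data this would be head(10)
print(top.index.tolist())  # ['Banana', 'Limes']
```

Only `product_id` and `product_name` are needed from `products`, so selecting those two columns before merging keeps the merged frame smaller.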

## Stretch challenge

The [Instacart blog post](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2) has a visualization of "**Popular products** purchased earliest in the day (green) and latest in the day (red)." 

The post says,

> "We can also see the time of day that users purchase specific products.

> Healthier snacks and staples tend to be purchased earlier in the day, whereas ice cream (especially Half Baked and The Tonight Dough) are far more popular when customers are ordering in the evening.

> **In fact, of the top 25 latest ordered products, the first 24 are ice cream! The last one, of course, is a frozen pizza.**"

Your challenge is to reproduce the list of the top 25 latest ordered popular products.

We'll define "popular products" as products with more than 2,900 orders.
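
A rough sketch of one way to approach this, on toy data. It assumes "latest ordered" can be measured by the mean `order_hour_of_day` per product, which is an interpretation and not something the blog post specifies. The `_toy` frames, the product names, and the `MIN_ORDERS` threshold are all hypothetical.

```python
import pandas as pd

orders_toy = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'order_hour_of_day': [9, 22, 23, 10],
})
order_products_toy = pd.DataFrame({
    'order_id':   [1, 1, 2, 2, 3, 3, 4, 4],
    'product_id': [10, 20, 20, 30, 20, 40, 10, 30],
})
products_toy = pd.DataFrame({
    'product_id': [10, 20, 30, 40],
    'product_name': ['Banana', 'Ice Cream', 'Frozen Pizza', 'Rare Item'],
})

# One row per order line, with the order's hour and the product's name attached.
lines = (order_products_toy
         .merge(orders_toy, on='order_id')
         .merge(products_toy, on='product_id'))

# Per product: how many times it was ordered, and its mean order hour.
stats = lines.groupby('product_name')['order_hour_of_day'].agg(['count', 'mean'])

MIN_ORDERS = 1  # toy threshold; the challenge uses 2,900 on the real data
popular = stats[stats['count'] > MIN_ORDERS]

# Latest first; on the real data, head(25) gives the top 25.
latest = popular.sort_values('mean', ascending=False).head(25)
print(latest.index.tolist())  # ['Ice Cream', 'Frozen Pizza', 'Banana']
```

`Rare Item` has the latest hour but only one order, so the popularity filter removes it before ranking.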

In [37]:
prior_products = pd.read_csv('order_products__prior.csv')

train_products = pd.read_csv('order_products__train.csv')

products = pd.read_csv('products.csv')



print(prior_products.shape)
print(train_products.shape)
print(products.shape)

(32434489, 4)
(1384617, 4)
(49688, 4)


In [38]:
order_products = pd.concat([prior_products, train_products])

order_products.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


In [39]:
order_products.shape

(33819106, 4)

In [40]:
frequent_products = pd.merge(order_products, products[['product_id', 'product_name']])

frequent_products.head()

# Merging so I can see what the name of each product id is.

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name
0,2,33120,1,1,Organic Egg Whites
1,26,33120,5,0,Organic Egg Whites
2,120,33120,13,0,Organic Egg Whites
3,327,33120,5,1,Organic Egg Whites
4,390,33120,28,1,Organic Egg Whites


In [41]:
frequent_products.head(25)

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name
0,2,33120,1,1,Organic Egg Whites
1,26,33120,5,0,Organic Egg Whites
2,120,33120,13,0,Organic Egg Whites
3,327,33120,5,1,Organic Egg Whites
4,390,33120,28,1,Organic Egg Whites
5,537,33120,2,1,Organic Egg Whites
6,582,33120,7,1,Organic Egg Whites
7,608,33120,5,1,Organic Egg Whites
8,623,33120,1,1,Organic Egg Whites
9,689,33120,4,1,Organic Egg Whites


In [42]:
frequent_products.sort_values(by=['order_id'])

# Sorting by order_id to check that the merge attached a name to each order's products.

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name
30698102,1,43633,5,1,Lightly Smoked Sardines in Olive Oil
1532449,1,47209,7,0,Organic Hass Avocado
17184946,1,10246,3,0,Organic Celery Hearts
33046497,1,49302,1,1,Bulgarian Yogurt
1039726,1,13176,6,0,Bag of Organic Bananas
10299527,1,22035,8,1,Organic Whole String Cheese
5868019,1,49683,4,0,Cucumber Kirby
23821643,1,11109,2,1,Organic 4% Milk Fat Whole Milk Cottage Cheese
174717,2,1819,8,1,All Natural No Stir Creamy Almond Butter
177234,2,43668,9,0,Classic Blend Cole Slaw


In [43]:
frequent_products.describe(include='all')

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name
count,33819110.0,33819110.0,33819110.0,33819110.0,33819106
unique,,,,,49685
top,,,,,Banana
freq,,,,,491291
mean,1710566.0,25575.51,8.367738,0.5900617,
std,987400.8,14097.7,7.13954,0.491822,
min,1.0,1.0,1.0,0.0,
25%,855413.0,13519.0,3.0,0.0,
50%,1710660.0,25256.0,6.0,1.0,
75%,2565587.0,37935.0,11.0,1.0,


In [45]:
frequent_products['product_name'].value_counts()

# So it looks like Banana is number 1, with 491291 orders. That is a lot.

# 1. Banana: 491291
# 2. Bag of Organic Bananas: 394930
# 3. Organic Strawberries: 275577
# 4. Organic Baby Spinach: 251705
# 5. Organic Hass Avocado: 220877
# 6. Organic Avocado: 184224
# 7. Large Lemon: 160792
# 8. Strawberries: 149445
# 9. Limes: 146660
# 10. Organic Whole Milk: 142813

# People are buying a lot of bananas. Combining "Banana" and "Bag of Organic Bananas" into one product
# (491291 + 394930 = 886221) would make that even more obvious. The same goes for combining
# "Organic Hass Avocado" and "Organic Avocado".

Banana                                             491291
Bag of Organic Bananas                             394930
Organic Strawberries                               275577
Organic Baby Spinach                               251705
Organic Hass Avocado                               220877
Organic Avocado                                    184224
Large Lemon                                        160792
Strawberries                                       149445
Limes                                              146660
Organic Whole Milk                                 142813
Organic Raspberries                                142603
Organic Yellow Onion                               117716
Organic Garlic                                     113936
Organic Zucchini                                   109412
Organic Blueberries                                105026
Cucumber Kirby                                      99728
Organic Fuji Apple                                  92889
Organic Lemon 