_Lambda School Data Science_

# Join datasets

Objectives
- concatenate data with pandas
- merge data with pandas

Links
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)
  - Combine Data Sets: Standard Joins
- Python Data Science Handbook
  - [Chapter 3.6](https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html), Combining Datasets: Concat and Append
  - [Chapter 3.7](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html), Combining Datasets: Merge and Join

## Download data

We’ll work with a dataset of [3 Million Instacart Orders, Open Sourced](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2)!

In [1]:
!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

--2019-03-26 20:00:09--  https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.227.35
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.227.35|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 205548478 (196M) [application/x-gzip]
Saving to: ‘instacart_online_grocery_shopping_2017_05_01.tar.gz.2’


2019-03-26 20:00:14 (41.3 MB/s) - ‘instacart_online_grocery_shopping_2017_05_01.tar.gz.2’ saved [205548478/205548478]



In [2]:
!tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

instacart_2017_05_01/
instacart_2017_05_01/._aisles.csv
instacart_2017_05_01/aisles.csv
instacart_2017_05_01/._departments.csv
instacart_2017_05_01/departments.csv
instacart_2017_05_01/._order_products__prior.csv
instacart_2017_05_01/order_products__prior.csv
instacart_2017_05_01/._order_products__train.csv
instacart_2017_05_01/order_products__train.csv
instacart_2017_05_01/._orders.csv
instacart_2017_05_01/orders.csv
instacart_2017_05_01/._products.csv
instacart_2017_05_01/products.csv


In [3]:
%cd instacart_2017_05_01

/content/instacart_2017_05_01


## Goal: Reproduce this example

The first two orders for user id 1:

In [5]:
from IPython.display import display, Image
url = 'https://cdn-images-1.medium.com/max/1600/1*vYGFQCafJtGBBX5mbl0xyw.png'
example = Image(url=url, width=600)

display(example)

## Load data

Here's a list of all six CSV filenames:

In [4]:
!ls -lh *.csv

-rw-r--r-- 1 502 staff 2.6K May  2  2017 aisles.csv
-rw-r--r-- 1 502 staff  270 May  2  2017 departments.csv
-rw-r--r-- 1 502 staff 551M May  2  2017 order_products__prior.csv
-rw-r--r-- 1 502 staff  24M May  2  2017 order_products__train.csv
-rw-r--r-- 1 502 staff 104M May  2  2017 orders.csv
-rw-r--r-- 1 502 staff 2.1M May  2  2017 products.csv


For each CSV
- Load it with pandas
- Look at the dataframe's shape
- Look at its head (first rows)
- `display(example)`
- Which columns does it have in common with the example we want to reproduce?
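
The last step in the checklist above can be done mechanically with a set intersection. This is a minimal sketch, not part of the original lesson: `example_columns` is assembled from the column notes in this notebook, `shared_columns` is a hypothetical helper, and the two-row CSV is a stand-in for `aisles.csv` so the cell runs without the download.

```python
import io
import pandas as pd

# Columns that appear in the example we want to reproduce
# (collected from the notes in this notebook).
example_columns = {'user_id', 'order_id', 'order_number', 'order_dow',
                   'order_hour_of_day', 'add_to_cart_order', 'product_id',
                   'product_name'}

# Tiny stand-in for aisles.csv (assumption: same header, only two rows).
toy_csv = io.StringIO("aisle_id,aisle\n"
                      "1,prepared soups salads\n"
                      "2,specialty cheeses\n")
toy = pd.read_csv(toy_csv)


def shared_columns(df, wanted):
    """Return the sorted column names that df has in common with wanted."""
    return sorted(set(df.columns) & set(wanted))


print(toy.shape)
print(toy.head())
print(shared_columns(toy, example_columns))
```

An empty list means the dataframe contributes nothing to the example, which matches the conclusion for `aisles` below.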

### aisles

In [0]:
import pandas as pd
aisles = pd.read_csv('aisles.csv')

In [8]:
aisles.shape

(134, 2)

In [9]:
aisles.head()

Unnamed: 0,aisle_id,aisle
0,1,prepared soups salads
1,2,specialty cheeses
2,3,energy granola bars
3,4,instant foods
4,5,marinades meat preparation


In [10]:
display(example)

`aisles` does not have any columns that we need for the example.

### departments

In [11]:
departments = pd.read_csv('departments.csv')
departments.shape

(21, 2)

In [12]:
departments.head()

Unnamed: 0,department_id,department
0,1,frozen
1,2,other
2,3,bakery
3,4,produce
4,5,alcohol


`departments` does not seem to have any relevant columns either.

### order_products__prior

In [6]:
order_products_prior = pd.read_csv('order_products__prior.csv')
order_products_prior.shape

(32434489, 4)

In [14]:
order_products_prior.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


There are relevant columns in the **order_products_prior** data
- order_id
- product_id
- add_to_cart_order

In [15]:
order_products_prior.groupby('order_id')['product_id'].count() ## counts the number of rows for each order

order_id
2           9
3           8
4          13
5          26
6           3
7           2
8           1
9          15
10         15
11          5
12         15
13         13
14         11
15          5
16          3
18         28
19          3
20          8
21          5
22         14
23         14
24          3
25         14
26          8
27         27
28         16
29          5
30          3
31         10
32          9
           ..
3421048     8
3421050    13
3421051    31
3421052     2
3421053     9
3421055    19
3421057     5
3421059     6
3421060    17
3421061    22
3421062     7
3421064     3
3421065     5
3421066     6
3421067     1
3421068    14
3421069    12
3421071     5
3421072    12
3421073     2
3421074     4
3421075     8
3421076     8
3421077     4
3421078     9
3421079     1
3421080     9
3421081     7
3421082     7
3421083    10
Name: product_id, Length: 3214874, dtype: int64

In [16]:
order_products_prior.groupby('order_id')['product_id'].count().mean() ## average number of rows for each order!

10.088883421247614

It is important to understand both the data and the results. Small, concrete examples help you build intuition about the data and confirm that a result makes sense.
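
As one such small example, the same `groupby` can be run on a dataframe tiny enough to check by hand. This is an illustrative sketch with made-up rows (two orders, with three and two products), not data from the Instacart files.

```python
import pandas as pd

# Toy data: order 2 has three rows, order 3 has two rows.
toy = pd.DataFrame({
    'order_id':   [2, 2, 2, 3, 3],
    'product_id': [101, 102, 103, 201, 202],
})

# Same pattern as above: count rows per order, then average the counts.
per_order = toy.groupby('order_id')['product_id'].count()
print(per_order)
print(per_order.mean())  # (3 + 2) / 2 = 2.5
```

If the hand-computed answer and the pandas answer agree on the toy data, the result on the full 32-million-row dataframe is easier to trust.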

### order_products__train

In [7]:
order_products_train = pd.read_csv('order_products__train.csv')
order_products_train.shape

(1384617, 4)

In [18]:
order_products_train.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1


**order_products_train** has relevant columns as well
- order_id
- product_id
- add_to_cart_order

### orders

In [8]:
orders = pd.read_csv('orders.csv')
orders.shape

(3421083, 7)

In [20]:
orders.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


Relevant **orders** columns:
- order_id
- user_id
- order_number
- order_dow
- order_hour_of_day

### products

In [9]:
products = pd.read_csv('products.csv')
products.shape

(49688, 4)

In [22]:
products.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


**products** has relevant columns
- product_name
- product_id

## Concatenate order_products__prior and order_products__train

In [0]:
## Both dataframes have exactly the same columns, so we can simply concatenate them
order_products = pd.concat([order_products_prior, order_products_train])

In [24]:
order_products.shape, order_products_prior.shape, order_products_train.shape # compare shapes

((33819106, 4), (32434489, 4), (1384617, 4))

In [0]:
assert len(order_products) == len(order_products_prior) + len(order_products_train) ## check that no rows were lost or duplicated

## assert checks that a condition is true; otherwise it raises an AssertionError

In [0]:
## The number of columns should be the same for all three dataframes, which the shapes above confirm

In [27]:
## unpacking tuples
rows, columns = order_products.shape
rows, columns

## Tuple unpacking assigns each value to the variable name you give it, in this case 'rows' and 'columns'.
## Variable names are important: be descriptive, but also concise.

(33819106, 4)
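
One detail worth checking after a concatenation is the index. By default `pd.concat` keeps each input's original row labels, so labels can repeat. The sketch below uses two made-up two-row frames (stand-ins for the prior and train tables) to show the difference that `ignore_index=True` makes.

```python
import pandas as pd

# Toy stand-ins for order_products_prior and order_products_train.
prior = pd.DataFrame({'order_id': [2, 2], 'product_id': [33120, 28985]})
train = pd.DataFrame({'order_id': [1, 1], 'product_id': [49302, 11109]})

stacked = pd.concat([prior, train])                        # keeps original labels
renumbered = pd.concat([prior, train], ignore_index=True)  # fresh 0..n-1 labels

# Same row-count check as in the lesson.
assert len(stacked) == len(prior) + len(train)

print(list(stacked.index))     # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```

Repeated labels do not affect the merges later in this notebook, because those join on columns rather than on the index.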

In [28]:
condition = order_products['order_id'] == 2539329
order_products[condition]

## select the rows where a column has a specific value

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
24076664,2539329,196,1,0
24076665,2539329,14084,2,0
24076666,2539329,12427,3,0
24076667,2539329,26088,4,0
24076668,2539329,26405,5,0


## Get a subset of orders — the first two orders for user id 1

In [29]:
display(example) ## check to see what it is you want

In [30]:
order_products.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


From the 'orders' data, we need:
- user id
- order id
- order number
- order dow
- order hour of day


In [31]:
## When combining multiple conditions, wrap each one in parentheses, because & binds more tightly than == and <=
orders[(orders['user_id'] == 1) & (orders['order_number'] <= 2)]  ## be aware of the assumptions your conditions make (sorting, negative values, etc.)

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0


In [32]:
## The previous cell successfully shows the first 2 orders for user_id == 1.
## Equivalently, store the condition in a variable first:
condition = (orders['user_id'] == 1) & (orders['order_number'] <= 2)
orders[condition]

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
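
To see why the parentheses matter, the following sketch builds a small made-up `toy_orders` frame and tries the filter both ways. Without parentheses, Python evaluates `1 & toy_orders['order_number']` first and then treats the rest as a chained comparison, which pandas rejects with a `ValueError`.

```python
import pandas as pd

# Toy orders: user 1 has three orders, user 2 has one.
toy_orders = pd.DataFrame({
    'order_id':     [10, 11, 12, 20],
    'user_id':      [1, 1, 1, 2],
    'order_number': [1, 2, 3, 1],
})

# Correct: each comparison is parenthesized before combining with &.
condition = (toy_orders['user_id'] == 1) & (toy_orders['order_number'] <= 2)
subset = toy_orders[condition]
print(subset)

# Incorrect: without parentheses the expression cannot be evaluated.
try:
    toy_orders[toy_orders['user_id'] == 1 & toy_orders['order_number'] <= 2]
    failed = False
except ValueError:
    failed = True
print(failed)
```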


## Merge dataframes

Merge the subset from 'orders' with the columns from 'order_products'

In [0]:
merged = pd.merge(orders[condition],
         order_products[['order_id', 'add_to_cart_order', 'product_id']],
                 how='inner', on='order_id')

In [34]:
orders[condition].shape, order_products.shape, merged.shape

((2, 7), (33819106, 4), (11, 9))
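
The shapes above follow from how an inner merge works: each order row is repeated once per matching product row, and product rows whose `order_id` is not in the subset are dropped. Here is a minimal sketch on made-up data. The `validate='one_to_many'` argument is an optional extra (not used in the lesson) that makes pandas raise an error if `order_id` is duplicated on the left side.

```python
import pandas as pd

# Two orders on the left, six product rows on the right (one for an unrelated order).
toy_orders = pd.DataFrame({'order_id': [10, 11], 'user_id': [1, 1]})
toy_order_products = pd.DataFrame({
    'order_id':          [10, 10, 11, 11, 11, 99],
    'add_to_cart_order': [1, 2, 1, 2, 3, 1],
    'product_id':        [196, 14084, 196, 10258, 12427, 33120],
})

toy_merged = pd.merge(toy_orders, toy_order_products,
                      how='inner', on='order_id',
                      validate='one_to_many')

# 2 rows for order 10 + 3 rows for order 11; order 99 has no match and is dropped.
print(toy_merged.shape)  # (5, 4)
```

The column count is the sum of both sides minus the shared key, which is why the lesson's merge gives 7 + 3 - 1 = 9 columns.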

In [35]:
orders.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [36]:
order_products.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


In [37]:
## Refer to the cheat sheet for the available join types, and pick one once you understand the data
merged

## To do: drop the 'days_since_prior_order' and 'eval_set' columns

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,add_to_cart_order,product_id
0,2539329,1,prior,1,2,8,,1,196
1,2539329,1,prior,1,2,8,,2,14084
2,2539329,1,prior,1,2,8,,3,12427
3,2539329,1,prior,1,2,8,,4,26088
4,2539329,1,prior,1,2,8,,5,26405
5,2398795,1,prior,2,3,7,15.0,1,196
6,2398795,1,prior,2,3,7,15.0,2,10258
7,2398795,1,prior,2,3,7,15.0,3,12427
8,2398795,1,prior,2,3,7,15.0,4,13176
9,2398795,1,prior,2,3,7,15.0,5,26088
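
The notebook notes that `eval_set` and `days_since_prior_order` should be dropped but never does so. One way to do it is `DataFrame.drop` with the `columns` argument, sketched here on a two-row stand-in for `merged` (values copied from the output above).

```python
import pandas as pd

# Two-row stand-in for the merged dataframe.
toy_merged = pd.DataFrame({
    'order_id':               [2539329, 2398795],
    'user_id':                [1, 1],
    'eval_set':               ['prior', 'prior'],
    'days_since_prior_order': [float('nan'), 15.0],
    'product_id':             [196, 196],
})

# drop returns a new dataframe; the original is left unchanged.
trimmed = toy_merged.drop(columns=['eval_set', 'days_since_prior_order'])
print(list(trimmed.columns))
```

An alternative with the same effect is to select only the wanted columns from `orders` before merging, as the lesson already does for `order_products`.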


Merge with columns from 'products'

In [0]:
final = pd.merge(merged, products[['product_name', 'product_id']],
        how='inner', on='product_id')

In [39]:
final.sort_values(by=['order_number', 'add_to_cart_order'])

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,add_to_cart_order,product_id,product_name
0,2539329,1,prior,1,2,8,,1,196,Soda
2,2539329,1,prior,1,2,8,,2,14084,Organic Unsweetened Vanilla Almond Milk
3,2539329,1,prior,1,2,8,,3,12427,Original Beef Jerky
5,2539329,1,prior,1,2,8,,4,26088,Aged White Cheddar Popcorn
7,2539329,1,prior,1,2,8,,5,26405,XL Pick-A-Size Paper Towel Rolls
1,2398795,1,prior,2,3,7,15.0,1,196,Soda
8,2398795,1,prior,2,3,7,15.0,2,10258,Pistachios
4,2398795,1,prior,2,3,7,15.0,3,12427,Original Beef Jerky
9,2398795,1,prior,2,3,7,15.0,4,13176,Bag of Organic Bananas
6,2398795,1,prior,2,3,7,15.0,5,26088,Aged White Cheddar Popcorn


In [40]:
final.columns = [column.replace('_',' ') for column in final.columns]

final

Unnamed: 0,order id,user id,eval set,order number,order dow,order hour of day,days since prior order,add to cart order,product id,product name
0,2539329,1,prior,1,2,8,,1,196,Soda
1,2398795,1,prior,2,3,7,15.0,1,196,Soda
2,2539329,1,prior,1,2,8,,2,14084,Organic Unsweetened Vanilla Almond Milk
3,2539329,1,prior,1,2,8,,3,12427,Original Beef Jerky
4,2398795,1,prior,2,3,7,15.0,3,12427,Original Beef Jerky
5,2539329,1,prior,1,2,8,,4,26088,Aged White Cheddar Popcorn
6,2398795,1,prior,2,3,7,15.0,5,26088,Aged White Cheddar Popcorn
7,2539329,1,prior,1,2,8,,5,26405,XL Pick-A-Size Paper Towel Rolls
8,2398795,1,prior,2,3,7,15.0,2,10258,Pistachios
9,2398795,1,prior,2,3,7,15.0,4,13176,Bag of Organic Bananas
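
Note that the rows above are back in their unsorted order: `sort_values` returns a new dataframe, and the earlier cell did not assign the result. The sketch below, on a made-up three-row frame, keeps the sorted result and also shows `columns.str.replace` as an alternative to the list comprehension (`regex=False` is passed so the underscore is treated literally).

```python
import pandas as pd

toy_final = pd.DataFrame({
    'order_number':      [1, 2, 1],
    'add_to_cart_order': [1, 1, 2],
    'product_name':      ['Soda', 'Soda',
                          'Organic Unsweetened Vanilla Almond Milk'],
})

# The result is discarded here, so toy_final itself stays unsorted.
toy_final.sort_values(by=['order_number', 'add_to_cart_order'])
print(list(toy_final.index))   # [0, 1, 2]

# Assign the result to keep the sorted order.
toy_sorted = toy_final.sort_values(by=['order_number', 'add_to_cart_order'])
toy_sorted.columns = toy_sorted.columns.str.replace('_', ' ', regex=False)

print(list(toy_sorted.index))  # [0, 2, 1]
print(list(toy_sorted.columns))
```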


In [41]:
merged.shape, products[['product_id','product_name']].shape, final.shape

((11, 9), (49688, 2), (11, 10))

In [42]:
final.info() # information about dataframe

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11 entries, 0 to 10
Data columns (total 10 columns):
order id                  11 non-null int64
user id                   11 non-null int64
eval set                  11 non-null object
order number              11 non-null int64
order dow                 11 non-null int64
order hour of day         11 non-null int64
days since prior order    6 non-null float64
add to cart order         11 non-null int64
product id                11 non-null int64
product name              11 non-null object
dtypes: float64(1), int64(7), object(2)
memory usage: 968.0+ bytes


# Assignment

These are the top 10 most frequently ordered products. How many times was each ordered? 

1. Banana
2. Bag of Organic Bananas
3. Organic Strawberries
4. Organic Baby Spinach 
5. Organic Hass Avocado
6. Organic Avocado
7. Large Lemon 
8. Strawberries
9. Limes 
10. Organic Whole Milk

First, write down which columns you need and which dataframes have them.

Next, merge these into a single dataframe.

Then, use pandas functions from the previous lesson to get the counts of the top 10 most frequently ordered products.



***Which columns do I need?***
- product name --> to get the names of the products
- product id --> to merge the datasets that have the information I need
- order id --> to count every time the product was ordered

***Which dataframes will I need?***
- order_products
- products

In [43]:
order_products.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


In [45]:
products.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


I can merge `products` and `order_products` on `product_id`.

In [12]:
merged = pd.merge(products, order_products,
        how='left', on='product_id')
merged.head()


Unnamed: 0,product_id,product_name,aisle_id,department_id,order_id,add_to_cart_order,reordered
0,1,Chocolate Sandwich Cookies,61,19,1107.0,7.0,0.0
1,1,Chocolate Sandwich Cookies,61,19,5319.0,3.0,1.0
2,1,Chocolate Sandwich Cookies,61,19,7540.0,4.0,1.0
3,1,Chocolate Sandwich Cookies,61,19,9228.0,2.0,0.0
4,1,Chocolate Sandwich Cookies,61,19,9273.0,30.0,0.0


In [18]:
## group by the product name
## count how many rows there are for each product
## sort the counts from highest to lowest
## show only the top ten results
merged.groupby('product_name')['product_id'].count().sort_values(ascending=False).head(10)

product_name
Banana                    491291
Bag of Organic Bananas    394930
Organic Strawberries      275577
Organic Baby Spinach      251705
Organic Hass Avocado      220877
Organic Avocado           184224
Large Lemon               160792
Strawberries              149445
Limes                     146660
Organic Whole Milk        142813
Name: product_id, dtype: int64
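
Two caveats about the approach above, illustrated on made-up data. First, after a left merge a product that was never ordered still has one row, so counting `product_id` reports 1 for it while counting `order_id` reports 0. This does not change the top 10, but it matters for rare products. Second, a possible alternative is to count per `product_id` first and merge the small result with `products` afterwards, which avoids building a merged frame with tens of millions of rows. The name `times_ordered` is introduced here only for illustration.

```python
import pandas as pd

toy_products = pd.DataFrame({
    'product_id':   [1, 2, 3],
    'product_name': ['Banana', 'Limes', 'Never Ordered'],
})
toy_order_products = pd.DataFrame({
    'order_id':   [10, 10, 11, 12],
    'product_id': [1, 2, 1, 1],
})

# Caveat 1: counting a column from the left table over-counts unmatched rows.
left = pd.merge(toy_products, toy_order_products, how='left', on='product_id')
by_product_id = left.groupby('product_name')['product_id'].count()
by_order_id = left.groupby('product_name')['order_id'].count()
print(by_product_id['Never Ordered'], by_order_id['Never Ordered'])  # 1 0

# Caveat 2: count first, then attach names to the (much smaller) result.
counts = (toy_order_products.groupby('product_id')['order_id']
          .count()
          .rename('times_ordered')
          .reset_index())
top = (pd.merge(counts, toy_products, how='inner', on='product_id')
       .sort_values('times_ordered', ascending=False)
       .head(10))
print(top[['product_name', 'times_ordered']])
```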

## Stretch challenge

The [Instacart blog post](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2) has a visualization of "**Popular products** purchased earliest in the day (green) and latest in the day (red)." 

The post says,

> "We can also see the time of day that users purchase specific products.

> Healthier snacks and staples tend to be purchased earlier in the day, whereas ice cream (especially Half Baked and The Tonight Dough) are far more popular when customers are ordering in the evening.

> **In fact, of the top 25 latest ordered products, the first 24 are ice cream! The last one, of course, is a frozen pizza.**"

Your challenge is to reproduce the list of the top 25 latest ordered popular products.

We'll define "popular products" as products with more than 2,900 orders.
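
One possible way to start on the challenge is sketched below on made-up data. It rests on two assumptions: "latest ordered" is measured here by the mean `order_hour_of_day` per product, which the blog post does not specify, and `MIN_ORDERS` is set to 1 so the toy data has something to filter (the real threshold is 2,900). The product names in the toy frames are placeholders.

```python
import pandas as pd

toy_orders = pd.DataFrame({
    'order_id':          [10, 11, 12, 13],
    'order_hour_of_day': [8, 9, 21, 22],
})
toy_order_products = pd.DataFrame({
    'order_id':   [10, 11, 12, 12, 13, 13],
    'product_id': [1, 1, 1, 2, 2, 3],
})
toy_products = pd.DataFrame({
    'product_id':   [1, 2, 3],
    'product_name': ['Banana', 'Half Baked Ice Cream', 'Rare Item'],
})

MIN_ORDERS = 1  # stand-in for the 2,900-order threshold

# Attach the hour of day and the product name to every ordered item.
joined = (toy_order_products
          .merge(toy_orders, how='inner', on='order_id')
          .merge(toy_products, how='inner', on='product_id'))

# Per product: how many times it was ordered and its mean order hour.
stats = joined.groupby('product_name')['order_hour_of_day'].agg(['count', 'mean'])

# Keep popular products only, then rank by latest mean hour.
popular = stats[stats['count'] > MIN_ORDERS]
latest = popular.sort_values('mean', ascending=False).head(25)
print(latest)
```

On the toy data, "Rare Item" is filtered out because it has only one order, and the ice cream ranks ahead of the bananas because its mean order hour is later.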