# Part I: Modeling Instacart Data

The following changes include joining tables with similar attributes, adding primary keys to all tables missing one, and basic data cleaning, such as dropping columns that are unnecessary for our purpose.

In [8]:
dataset_id = "instacart_modeled"
!bq --location=US mk --dataset {dataset_id}  #Note: This will not work if you already have a dataset with this name

Dataset 'responsive-cab-267123:instacart_modeled' successfully created.


### The following tables do not need alterations.

#### Aisles Table

In [7]:
%%bigquery
create table instacart_modeled.Aisles as select * from instacart_staging.Aisles

In [9]:
%%bigquery 
select * from instacart_modeled.Aisles limit 5

Unnamed: 0,aisle_id,aisle
0,23,popcorn jerky
1,33,kosher foods
2,122,meat counter
3,123,packaged vegetables fruits
4,103,ice cream toppings


#### Departments

In [10]:
%%bigquery
create table instacart_modeled.Departments as select * from instacart_staging.Departments

In [11]:
%%bigquery 
select * from instacart_modeled.Departments limit 5

Unnamed: 0,department_id,department
0,5,alcohol
1,15,canned goods
2,16,dairy eggs
3,8,pets
4,2,other


#### Products

In [19]:
%%bigquery
create table instacart_modeled.Products as select * from instacart_staging.Products

In [20]:
%%bigquery 
select * from instacart_modeled.Products limit 5

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,9676,"Egg, Bacon & Cheese Breakfast Taquitos",34,1
1,16363,Gluten Free Breaded Chicken Breast Tenders,34,1
2,23095,Flame Grilled Beef Patty,34,1
3,32547,Bite Size Turkey Meatball,34,1
4,45845,Battered Whole Fish Fillet,34,1


### Modeling Data

All tables have appropriate data types representative of the data in each attribute. Therefore, no casting was done.

The modeling focuses on:
* dropping attributes that don't yield useful data for our purpose (Orders)
* joining tables with the same attributes (Order_Products_Prior and Order_Products_Train)
* adding primary keys to all tables (Order_Products_Prior, Order_Products_Train)

### Dropping Unnecessary Columns on Tables

Alterations to the 'Orders' table
* The eval_set attribute is no longer needed because the two evaluation set tables have been merged.
* The days_since_prior_order attribute was dropped because it appeared to contain only nulls and 0.0s, which are not useful for our analysis. (Milestone 5 revisits this conclusion and restores the column.)

The following query was meant to show that the days_since_prior_order attribute yields only 0.0s and nulls. Note that a `not in` list containing `null` can never match a row, so the empty result does not actually confirm this.

In [13]:
%%bigquery
select days_since_prior_order from instacart_staging.Orders where days_since_prior_order not in (null, 0.0, 0)

Unnamed: 0,days_since_prior_order


In [21]:
%%bigquery
create table instacart_modeled.Orders as select order_id, user_id, order_number, order_dow, order_hour_of_day from instacart_staging.Orders

In [22]:
%%bigquery 
select * from instacart_modeled.Orders limit 5

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day
0,2688538,47867,43,0,1
1,2420789,198165,54,0,2
2,742407,199989,50,0,5
3,921979,21991,61,0,3
4,1936280,41896,44,0,1


#### Joining Tables

The Order_Products_Prior and Order_Products_Train tables have the same attributes. After checking the dataset documentation, we found that merging these two tables, without tracking which evaluation set each record came from, fits our purposes best.

In [2]:
%%bigquery
create table instacart_modeled.Order_Products as 
select * from instacart_staging.Order_Products_Prior
union distinct 
select * from instacart_staging.Order_Products_Train
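
The merge semantics can be tried out locally on a small scale. The sketch below is only an illustration, not part of the project: it uses Python's built-in `sqlite3` module with two hypothetical miniature tables (`prior` and `train`), and relies on the fact that SQLite's plain `union` removes duplicate rows in the same way `union distinct` does in BigQuery.

```python
import sqlite3

# Hypothetical miniature versions of the two staging tables (not real data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table prior (order_id integer, product_id integer,
                        add_to_cart_order integer, reordered integer);
    create table train (order_id integer, product_id integer,
                        add_to_cart_order integer, reordered integer);
    insert into prior values (1, 10, 1, 0), (1, 11, 2, 1);
    insert into train values (2, 10, 1, 0), (1, 11, 2, 1);
""")

# The row (1, 11, 2, 1) appears in both tables; a de-duplicating union keeps one copy.
merged = conn.execute(
    "select * from prior union select * from train "
    "order by order_id, product_id"
).fetchall()

for row in merged:
    print(row)
```

Four input rows produce three merged rows here, because the one row that is identical in both tables is kept only once.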

Displaying the merged table

In [3]:
%%bigquery
select * from instacart_modeled.Order_Products limit 12

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,76181,43768,43,0
1,2492557,7361,27,0
2,1347843,26194,33,0
3,843703,7188,29,0
4,2960599,15006,28,0
5,510317,10545,40,0
6,1626114,18987,28,0
7,676086,48437,32,0
8,993081,40264,31,0
9,1945159,32402,26,0


#### Primary Keys

All tables in our dataset had valid primary keys with the exception of the Order_Products_Prior and Order_Products_Train tables, which are now merged into the 'Order_Products' table. The queries below show that a composite key of order_id and product_id is valid in the newly merged table: the total row count equals the number of distinct (order_id, product_id) pairs.

In [4]:
%%bigquery
select count(*) from instacart_modeled.Order_Products

Unnamed: 0,f0_
0,33819106


In [5]:
%%bigquery
select count(*) from (select distinct order_id, product_id from instacart_modeled.Order_Products) 

Unnamed: 0,f0_
0,33819106
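
The same count-versus-distinct-count check can be wrapped in a small helper. This is a minimal sketch that runs against an in-memory SQLite table rather than BigQuery; the function name `is_candidate_key` and the sample rows are made up for illustration, and the table and column names are assumed to be trusted because they are interpolated directly into the SQL.

```python
import sqlite3

def is_candidate_key(conn, table, columns):
    """Return True when no two rows of `table` share the same values in `columns`."""
    cols = ", ".join(columns)
    total = conn.execute(f"select count(*) from {table}").fetchone()[0]
    distinct = conn.execute(
        f"select count(*) from (select distinct {cols} from {table})"
    ).fetchone()[0]
    return total == distinct

# Hypothetical sample data: order 1 contains two products, order 2 contains one.
conn = sqlite3.connect(":memory:")
conn.execute("create table order_products (order_id integer, product_id integer)")
conn.executemany(
    "insert into order_products values (?, ?)", [(1, 10), (1, 11), (2, 10)]
)

print(is_candidate_key(conn, "order_products", ["order_id", "product_id"]))
print(is_candidate_key(conn, "order_products", ["order_id"]))
```

In this sample the pair (order_id, product_id) is unique, while order_id alone is not, which mirrors why a composite key is needed for Order_Products.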


# Part II: Writing Join Queries with the Modeled Dataset

#### Query 1: This query counts how often each product in the "instant foods" aisle was ordered, sorted from most to least frequent.

In [2]:
%%bigquery
select count(*) as frequency, product_name
from instacart_modeled.Products p join instacart_modeled.Order_Products o on p.product_id = o.product_id
join instacart_modeled.Aisles a on a.aisle_id = p.aisle_id
where a.aisle = 'instant foods'
group by product_name
order by frequency desc
limit 12

Unnamed: 0,frequency,product_name
0,11316,Organic Macaroni Shells & Real Aged Cheddar
1,8648,Macaroni Shells & White Cheddar Cheese
2,8435,Bunny Pasta with Yummy Cheese Macaroni & Cheese
3,7832,Shells & Real Aged Cheddar Macaroni & Cheese
4,6596,Organic Shells And White Cheddar
5,6494,Garlic Couscous
6,6397,Macaroni & Cheese Dinner Original Flavor
7,5144,Creamy Deluxe Shells & Real Aged Cheddar Sauce
8,4712,Spanish Rice Pilaf Mix
9,3863,Parmesan Couscous Mix


#### Query 2: This query displays the products each user has ordered, sorted by user and order number

In [3]:
%%bigquery
select o.user_id, o.order_number, p.product_name
from instacart_modeled.Products p join instacart_modeled.Order_Products op on p.product_id = op.product_id
join instacart_modeled.Orders o on op.order_id = o.order_id
order by user_id, order_number
limit 12

Unnamed: 0,user_id,order_number,product_name
0,1,1,Organic Unsweetened Vanilla Almond Milk
1,1,1,XL Pick-A-Size Paper Towel Rolls
2,1,1,Soda
3,1,1,Aged White Cheddar Popcorn
4,1,1,Original Beef Jerky
5,1,2,Soda
6,1,2,Aged White Cheddar Popcorn
7,1,2,Bag of Organic Bananas
8,1,2,Pistachios
9,1,2,Original Beef Jerky


#### Query 3: This query displays which users are most frequently purchasing baby products (aisles 82 and 92)


In [4]:
%%bigquery
select user_id, count(*) as freq
from instacart_modeled.Products p join instacart_modeled.Order_Products op on p.product_id = op.product_id
join instacart_modeled.Orders o on o.order_id = op.order_id
where aisle_id = 92 or aisle_id = 82
group by user_id
order by freq desc
limit 12

Unnamed: 0,user_id,freq
0,128627,820
1,84092,555
2,124042,450
3,21463,427
4,8812,378
5,111128,373
6,197502,369
7,169991,366
8,58919,358
9,108736,354


#### Query 4: This query determines the most popular aisles that users shop from, sorted from most to least popular

In [5]:
%%bigquery
select  count(*) as freq, aisle
from instacart_modeled.Orders o join instacart_modeled.Order_Products op on o.order_id = op.order_id
join instacart_modeled.Products p on p.product_id = op.product_id
join instacart_modeled.Aisles a on p.aisle_id = a.aisle_id
group by aisle
order by freq desc
limit 12


Unnamed: 0,freq,aisle
0,3792661,fresh fruits
1,3568630,fresh vegetables
2,1843806,packaged vegetables fruits
3,1507583,yogurt
4,1021462,packaged cheese
5,923659,milk
6,878150,water seltzer sparkling water
7,753739,chips pretzels
8,664493,soy lactosefree
9,608469,bread


#### Query 5: This query determines the users who have purchased the most items from Instacart (counted by number of items, not cost).

In [6]:
%%bigquery
select user_id, count(*) as freq
from instacart_modeled.Orders o join instacart_modeled.Order_Products op on o.order_id = op.order_id
join instacart_modeled.Products p on p.product_id = op.product_id
group by user_id
order by freq desc
limit 12

Unnamed: 0,user_id,freq
0,201268,3725
1,129928,3689
2,164055,3089
3,176478,2952
4,186704,2936
5,137629,2931
6,182401,2929
7,33731,2912
8,108187,2760
9,4694,2735


#### Query 6: This query counts how many distinct users have ordered at least one frozen item (it counts users, not orders). This could help decide whether Instacart should start distributing coolers to its drivers so that the demand for frozen items is met.

In [7]:
%%bigquery
select count(distinct user_id) as freq_of_frozen_purchases
from instacart_modeled.Orders o join instacart_modeled.Order_Products op on o.order_id = op.order_id
join instacart_modeled.Products p on p.product_id = op.product_id 
join instacart_modeled.Departments d on p.department_id = d.department_id
where department = 'frozen'

Unnamed: 0,freq_of_frozen_purchases
0,166124


# MILESTONE 5

### Casting

The following query returned 0 results in our last milestone when searching for column values other than 0, 0.0 or null. The query showed that this column did not yield useful data for analysis. 

In [1]:
%%bigquery
select days_since_prior_order from instacart_staging.Orders where days_since_prior_order not in (null, 0.0, 0)

Unnamed: 0,days_since_prior_order


This query returns the same empty result.

In [2]:
%%bigquery
select * from instacart_staging.Orders where days_since_prior_order not in (null, 0.0, 0)

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order


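Searching for values between 1.0 and 30.0 (the range given in the documentation) returns a non-empty result. The two previous queries came back empty because of the `null` in the `not in` list: comparing any value with `null` yields `null`, so `not in (null, 0.0, 0)` can only evaluate to false or `null`, never true, and every row is filtered out.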

In [5]:
%%bigquery
select * from instacart_staging.Orders where days_since_prior_order between 1.0 and 30.0 limit 10

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,1630627,54,prior,33,1,1,2.0
1,3287976,190,prior,6,6,2,2.0
2,1403692,262,prior,37,3,2,2.0
3,1712783,313,prior,86,5,4,2.0
4,3341750,313,prior,98,4,3,2.0
5,1710899,409,prior,31,4,4,2.0
6,750107,409,prior,50,3,1,2.0
7,1865455,786,prior,74,3,4,2.0
8,2698935,1024,prior,59,6,2,2.0
9,1452111,1246,prior,39,4,1,2.0
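
The empty results from the earlier `not in (null, 0.0, 0)` queries are consistent with SQL's three-valued logic, and the effect can be reproduced outside BigQuery. The following is a minimal sketch using Python's built-in `sqlite3` module and a hypothetical four-row table; the rewritten filter (`is not null and != 0`) is one possible way to express the intended check, not a query taken from this notebook.

```python
import sqlite3

# Hypothetical sample: one null, one zero, and two genuine day counts.
conn = sqlite3.connect(":memory:")
conn.execute("create table orders (days_since_prior_order real)")
conn.executemany(
    "insert into orders values (?)", [(None,), (0.0,), (2.0,), (30.0,)]
)

# Same predicate as the notebook query: the null in the list means the
# condition is never true, so no rows come back.
original = conn.execute(
    "select days_since_prior_order from orders "
    "where days_since_prior_order not in (null, 0.0, 0)"
).fetchall()

# Handling null explicitly returns the non-zero values.
rewritten = conn.execute(
    "select days_since_prior_order from orders "
    "where days_since_prior_order is not null "
    "and days_since_prior_order != 0 "
    "order by days_since_prior_order"
).fetchall()

print("original predicate: ", original)
print("rewritten predicate:", rewritten)
```

The first query returns no rows even though the table holds 2.0 and 30.0, while the second returns both of them.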


This column is useful and should be added to our table again.

In [24]:
%%bigquery
drop table instacart_modeled.Orders

In [25]:
%%bigquery
create table instacart_modeled.Orders as 
select order_id, user_id, order_number, order_dow, order_hour_of_day, cast(days_since_prior_order as int64) as days_since_prior_order
from instacart_staging.Orders

In [27]:
%%bigquery
select * from instacart_modeled.Orders limit 5

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2847109,15026,1,0,2,
1,3045723,15032,1,0,5,
2,2537448,20322,1,0,2,
3,1024842,72224,1,0,2,
4,749230,115103,1,0,2,


### Checking Primary Keys

#### Aisles
`aisle_id`

In [29]:
%%bigquery
select count(*) from instacart_modeled.Aisles

Unnamed: 0,f0_
0,134


In [30]:
%%bigquery
select count(distinct aisle_id) from instacart_modeled.Aisles

Unnamed: 0,f0_
0,134


#### Departments 
`department_id`

In [32]:
%%bigquery
select count(*) from instacart_modeled.Departments

Unnamed: 0,f0_
0,21


In [33]:
%%bigquery
select count(distinct department_id) from instacart_modeled.Departments

Unnamed: 0,f0_
0,21


#### Orders
`order_id`

In [36]:
%%bigquery
select count(*) from instacart_modeled.Orders

Unnamed: 0,f0_
0,3421083


In [35]:
%%bigquery
select count(distinct order_id) from instacart_modeled.Orders

Unnamed: 0,f0_
0,3421083


#### Products
`product_id`

In [39]:
%%bigquery
select count(*) from instacart_modeled.Products

Unnamed: 0,f0_
0,49688


In [38]:
%%bigquery
select count(distinct product_id) from instacart_modeled.Products

Unnamed: 0,f0_
0,49688


#### Order_Products
`order_id, product_id`

In [41]:
%%bigquery
select count(*) from instacart_modeled.Order_Products

Unnamed: 0,f0_
0,33819106


In [43]:
%%bigquery
select count(*) from (select distinct order_id, product_id from instacart_modeled.Order_Products)

Unnamed: 0,f0_
0,33819106


### Beam Transformation

This transformation is not strictly needed; it was done to practice with Beam.  
The days of the week in `order_dow` are represented by the integers 0-6.  
The Beam pipeline converts this attribute to a string, with 0 becoming Sunday, 1 becoming Monday, etc.
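
The contents of `Orders_beam.py` are not shown in this notebook, so the following is only a sketch of what the per-record conversion might look like. The names `DAY_NAMES` and `convert_dow` are hypothetical, and records are assumed to be dictionaries keyed by column name. It is written as plain Python so it can be run without Beam installed.

```python
# Index i holds the name for order_dow == i (0 is Sunday, as described above).
DAY_NAMES = [
    "sunday", "monday", "tuesday", "wednesday",
    "thursday", "friday", "saturday",
]

def convert_dow(record):
    """Return a copy of `record` with the integer order_dow replaced by a day name."""
    updated = dict(record)  # copy so the input record is left unchanged
    updated["order_dow"] = DAY_NAMES[int(record["order_dow"])]
    return updated

# Hypothetical record shaped like a row of the Orders table.
sample = {"order_id": 2688538, "user_id": 47867, "order_dow": 0}
print(convert_dow(sample)["order_dow"])
```

A function like this could be applied to every element of a PCollection with `beam.Map(convert_dow)`; returning a copy rather than modifying the input fits Beam's expectation that transforms do not mutate their input elements.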

In [1]:
%%bigquery
select order_dow, count(*) 
from instacart_modeled.Orders
group by order_dow

Unnamed: 0,order_dow,f0_
0,0,600905
1,1,587478
2,2,467260
3,3,436972
4,4,426339
5,5,453368
6,6,448761


In [2]:
%run Orders_beam.py

  experiments = p.options.view_as(DebugOptions).experiments or []
INFO:apache_beam.runners.direct.direct_runner:Running pipeline with DirectRunner.
INFO:apache_beam.internal.gcp.auth:Setting socket default timeout to 60 seconds.
INFO:apache_beam.internal.gcp.auth:socket default timeout is 60.0 seconds.
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:apache_beam.io.gcp.bigquery_tools:Using location 'US' from table <TableReference
 datasetId: 'instacart_modeled'
 projectId: 'responsive-cab-267123'
 tableId: 'Orders'> referenced by query SELECT * FROM instacart_modeled.Orders limit 100


current dow: 0
new dow: sunday

INFO:apache_beam.io.gcp.bigquery_tools:Created table responsive-cab-267123.instacart_modeled.Orders_Beam with schema <TableSchema
 fields: [<TableFieldSchema
 fields: []
 mode: 'NULLABLE'
 name: 'order_id'
 type: 'INTEGER'>, <TableFieldSchema
 fields: []
 mode: 'NULLABLE'
 name: 'user_id'
 type: 'INTEGER'>, <TableFieldSchema
 fields: []
 mode: 'NULLABLE'
 name: 'order_number'
 type: 'INTEGER'>, <TableFieldSchema
 fields: []
 mode: 'NULLABLE'
 name: 'order_dow'
 type: 'STRING'>, <TableFieldSchema
 fields: []
 mode: 'NULLABLE'
 name: 'order_hour_of_day'
 type: 'INTEGER'>, <TableFieldSchema
 fields: []
 mode: 'NULLABLE'
 name: 'days_since_prior_order'
 type: 'INTEGER'>]>. Result: <Table
 creationTime: 1583113459989
 etag: 'rL5K9+j+2psvG/mIln6sow=='
 id: 'responsive-cab-267123:instacart_modeled.Orders_Beam'
 kind: 'bigquery#table'
 lastModifiedTime: 1583113460041
 location: 'US'
 numBytes: 0
 numLongTermBytes: 0
 numRows: 0
 schema: <TableSchema
 fields: [<TableFieldSchema
 fields: []
 mod

In [3]:
%%bigquery
select order_dow, count(*) 
from instacart_modeled.Orders_Beam
group by order_dow

Unnamed: 0,order_dow,f0_
0,sunday,100


### Verifying Primary Key in Orders_Beam

In [32]:
%%bigquery
select count(*) from instacart_modeled.Orders_Beam

Unnamed: 0,f0_
0,100


In [33]:
%%bigquery
select count(distinct order_id) from instacart_modeled.Orders_Beam

Unnamed: 0,f0_
0,100


## Milestone 5 Dataflow Beam

In [39]:
%run Orders_beam_dataflow.py

  kms_key=transform.kms_key))
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://bmease_cs327e/staging/transform-orders-df.1583727151.869137/pipeline.pb...
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://bmease_cs327e/staging/transform-orders-df.1583727151.869137/pipeline.pb in 0 seconds.
INFO:apache_beam.runners.portability.stager:Downloading source distribution of the SDK from PyPi
INFO:apache_beam.runners.portability.stager:Executing command: ['/home/jupyter/venv/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/tmptg2a7i8x', 'apache-beam==2.19.0', '--no-deps', '--no-binary', ':all:']
INFO:apache_beam.runners.portability.stager:Staging SDK sources from PyPI to gs://bmease_cs327e/staging/transform-orders-df.1583727151.869137/dataflow_python_sdk.tar
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://bmease_cs327e/staging/transform-orders-df.1583727151.869137/dataflow_python_sdk.tar...
INF

### Verifying Primary Key in Orders_Beam_DF

In [40]:
%%bigquery
select count(*) from instacart_modeled.Orders_Beam_DF

Unnamed: 0,f0_
0,3421083


In [41]:
%%bigquery
select count(distinct order_id) from instacart_modeled.Orders_Beam_DF

Unnamed: 0,f0_
0,3421083


## Milestone 6 Beam

Since our datasets were largely clean after Milestone 4, this transformation analyzes the Order_Products table instead. We examined how often each product appeared in an order and the total number of times each product was ordered across all orders. The output table reflects the latter.
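
As a rough illustration of the per-product total, here is a minimal sketch in plain Python. It does not reproduce `Order_Products_beam.py`, which is not shown here; the function name `count_product_orders` and the sample rows are hypothetical, and each row is assumed to be an (order_id, product_id) pair.

```python
from collections import Counter

def count_product_orders(rows):
    """Count how many order rows mention each product_id."""
    return Counter(product_id for _order_id, product_id in rows)

# Hypothetical (order_id, product_id) pairs.
rows = [(1, 10), (1, 11), (2, 10), (3, 10), (3, 12)]
totals = count_product_orders(rows)

for product_id, total in sorted(totals.items()):
    print(product_id, total)
```

The result has one entry per product, which is why product_id can serve as the key of the output table. In a Beam pipeline, one way to get the same aggregation would be to emit `(product_id, 1)` pairs and combine them with `beam.CombinePerKey(sum)`.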

### Running both the DirectRunner and Dataflow pipelines

In [1]:
%run Order_Products_beam.py

  experiments = p.options.view_as(DebugOptions).experiments or []
INFO:apache_beam.runners.direct.direct_runner:Running pipeline with DirectRunner.
INFO:apache_beam.internal.gcp.auth:Setting socket default timeout to 60 seconds.
INFO:apache_beam.internal.gcp.auth:socket default timeout is 60.0 seconds.
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:apache_beam.io.gcp.bigquery_tools:Using location 'US' from table <TableReference
 datasetId: 'instacart_modeled'
 projectId: 'responsive-cab-267123'
 tableId: 'Order_Products'> referenced by query SELECT product_id, add_to_cart_order as total FROM instacart_modeled.Order_Products limit 100
INFO:apache_beam.io.filebasedsink:Starting finalize_write threads with num_shards: 1 (skipped: 0), batches: 1, num_threads: 1
INFO:apache_beam.io.filebasedsink:Renamed 1 shards in 0.10 seconds.
INFO:apache_beam.io.filebasedsink:Starting finalize_write threads with num_shards: 1 (skipped: 0), batches: 1, num_threads: 1
IN

In [49]:
%run Order_Products_beam_dataflow.py

  kms_key=transform.kms_key))
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://bmease_cs327e/staging/orders-df1.1583730903.426188/pipeline.pb...
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://bmease_cs327e/staging/orders-df1.1583730903.426188/pipeline.pb in 0 seconds.
INFO:apache_beam.runners.portability.stager:Downloading source distribution of the SDK from PyPi
INFO:apache_beam.runners.portability.stager:Executing command: ['/home/jupyter/venv/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/tmpgokvahky', 'apache-beam==2.19.0', '--no-deps', '--no-binary', ':all:']
INFO:apache_beam.runners.portability.stager:Staging SDK sources from PyPI to gs://bmease_cs327e/staging/orders-df1.1583730903.426188/dataflow_python_sdk.tar
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://bmease_cs327e/staging/orders-df1.1583730903.426188/dataflow_python_sdk.tar...
INFO:apache_beam.runners.dataflow.inter

### Verifying Primary Keys in Order_Products_Beam

In [None]:
%%bigquery
select count(*) from instacart_modeled.Order_Products_Beam

In [None]:
%%bigquery
select count(distinct product_id) from instacart_modeled.Order_Products_Beam

### Verifying Primary Keys in Order_Products_Beam_DF

In [50]:
%%bigquery
select count(*) from instacart_modeled.Order_Products_Beam_DF

Unnamed: 0,f0_
0,49685


In [51]:
%%bigquery
select count(distinct product_id) from instacart_modeled.Order_Products_Beam_DF

Unnamed: 0,f0_
0,49685
