# Market Basket Analysis with SQL

# Project Overview

Whether you shop from meticulously planned grocery lists or let whimsy guide your grazing, our unique food rituals define who we are. Instacart, a grocery ordering and delivery app, aims to make it easy to fill your refrigerator and pantry with your personal favorites and staples when you need them.

In this project, I use anonymized Instacart data on customer orders over time to predict which previously purchased products will be in a user’s next order.

## What I Have Learned From This Project

* SQL
* PostgreSQL
* Association rule
* Apriori algorithm
* Machine Learning algorithms: ....

# Data Description

The dataset comes from Kaggle's Instacart Market Basket Analysis competition and is a relational set of files describing customers' orders over time. The goal of the competition is to predict which products will be in a user's next order. The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, it includes between 4 and 100 orders, with the sequence of products purchased in each order, along with the day of the week and hour of day the order was placed and a relative measure of time between orders. Each entity (customer, product, order, aisle, etc.) has an associated unique id. Most of the file and variable names are self-explanatory.

More information about this dataset can be found [here](https://www.kaggle.com/c/instacart-market-basket-analysis/data).

# Load the Data into PostgreSQL Database

In [1]:
# import necessary tools
import warnings
warnings.filterwarnings('ignore')

import psycopg2
import numpy as np
import pandas as pd

# load the sql extension
%load_ext sql

In [2]:
# connect to the database
%sql postgresql://postgres:postgres@localhost:5432/postgres

'Connected: postgres@postgres'
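
The connection string above hard-codes the user name and password. As a minimal sketch, the same URL could be assembled from environment variables instead. The variable names (`PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, `PGDATABASE`) follow the libpq convention, while the helper `build_pg_url` and its defaults are assumptions for illustration and are not part of the original notebook.

```python
import os
from urllib.parse import quote


def build_pg_url(env=None):
    """Assemble a postgresql:// URL from libpq-style environment variables.

    Hypothetical helper: the defaults mirror the local setup used in this notebook.
    """
    env = os.environ if env is None else env
    user = env.get("PGUSER", "postgres")
    password = env.get("PGPASSWORD", "")
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "postgres")
    # Percent-encode credentials so characters such as '@' or ':' stay valid in a URL.
    auth = quote(user, safe="")
    if password:
        auth += ":" + quote(password, safe="")
    return f"postgresql://{auth}@{host}:{port}/{database}"


# Same URL as the cell above, built from an explicit mapping for demonstration.
print(build_pg_url({"PGUSER": "postgres", "PGPASSWORD": "postgres"}))
# postgresql://postgres:postgres@localhost:5432/postgres
```

Passing a mapping explicitly keeps the function easy to test; calling it with no argument reads the real environment.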

## Load aisles.csv

Create a SQL table:

In [3]:
%%sql
CREATE TABLE aisles
(
  aisle_id serial NOT NULL,
  aisle character varying(50)
)

 * postgresql://postgres:***@localhost:5432/postgres
Done.


[]

Copy the csv file into the SQL table:

In [4]:
%%sql
COPY aisles(aisle_id, aisle) 
FROM '/Users/andreduong/market-basket-analysis/aisles.csv' DELIMITER ',' CSV HEADER;

 * postgresql://postgres:***@localhost:5432/postgres
134 rows affected.


[]
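
`COPY ... FROM '<path>'` is executed by the database server, so the file path must be readable by the server process. When the CSV lives on the client machine instead, one option is `COPY ... FROM STDIN` streamed through psycopg2's `cursor.copy_expert`. The sketch below assumes `conn` is an open psycopg2 connection; the helper name `copy_csv_from_client` is hypothetical and not part of the original notebook.

```python
def copy_csv_from_client(conn, table, columns, csv_path):
    """Stream a local CSV file into `table` using COPY ... FROM STDIN.

    Assumes `conn` is a psycopg2 connection. Table and column names are
    interpolated directly into the statement, so pass trusted names only.
    """
    statement = (
        f"COPY {table} ({', '.join(columns)}) "
        "FROM STDIN WITH (FORMAT csv, HEADER true)"
    )
    with open(csv_path, "r", encoding="utf-8") as handle:
        with conn.cursor() as cursor:
            # copy_expert sends the file contents over the client connection.
            cursor.copy_expert(statement, handle)
    conn.commit()
    return statement
```

A call such as `copy_csv_from_client(conn, "aisles", ["aisle_id", "aisle"], "aisles.csv")` would then play the same role as the `COPY` cell above.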

View the table:

In [5]:
%%sql
SELECT * FROM aisles LIMIT 5;

 * postgresql://postgres:***@localhost:5432/postgres
5 rows affected.


aisle_id,aisle
1,prepared soups salads
2,specialty cheeses
3,energy granola bars
4,instant foods
5,marinades meat preparation


## Load departments.csv  

In [6]:
%%sql
CREATE TABLE departments
(
  department_id serial NOT NULL,
  department character varying(50)
)

 * postgresql://postgres:***@localhost:5432/postgres
Done.


[]

In [7]:
%%sql
COPY departments(department_id, department) 
FROM '/Users/andreduong/market-basket-analysis/departments.csv' DELIMITER ',' CSV HEADER

 * postgresql://postgres:***@localhost:5432/postgres
21 rows affected.


[]

In [8]:
%%sql
SELECT * FROM departments LIMIT 5;

 * postgresql://postgres:***@localhost:5432/postgres
5 rows affected.


department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol


## Load orders.csv

In [9]:
%%sql
CREATE TABLE orders
(
  order_id integer NOT NULL,
  user_id integer NOT NULL,
  eval_set character varying(50),
  order_number integer NOT NULL,
  order_dow integer NOT NULL,
  order_hour_of_day integer NOT NULL,
  days_since_prior_order real
)

 * postgresql://postgres:***@localhost:5432/postgres
Done.


[]

In [10]:
%%sql
COPY orders(order_id, user_id, eval_set, order_number, order_dow, order_hour_of_day, days_since_prior_order) 
FROM '/Users/andreduong/market-basket-analysis/orders.csv' DELIMITER ',' CSV HEADER;

 * postgresql://postgres:***@localhost:5432/postgres
3421083 rows affected.


[]

In [11]:
%%sql
SELECT * FROM orders LIMIT 5

 * postgresql://postgres:***@localhost:5432/postgres
5 rows affected.


order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
2539329,1,prior,1,2,8,
2398795,1,prior,2,3,7,15.0
473747,1,prior,3,3,12,21.0
2254736,1,prior,4,4,7,29.0
431534,1,prior,5,4,15,28.0


## Load products.csv

In [12]:
%%sql
CREATE TABLE products
(
  product_id serial NOT NULL,
  product_name character varying(500),
  aisle_id integer NOT NULL,
  department_id integer NOT NULL
)

 * postgresql://postgres:***@localhost:5432/postgres
Done.


[]

In [13]:
%%sql
COPY products(product_id, product_name, aisle_id, department_id) 
FROM '/Users/andreduong/market-basket-analysis/products.csv' DELIMITER ',' CSV HEADER;

 * postgresql://postgres:***@localhost:5432/postgres
49688 rows affected.


[]

In [14]:
%%sql
SELECT * FROM products LIMIT 5

 * postgresql://postgres:***@localhost:5432/postgres
5 rows affected.


product_id,product_name,aisle_id,department_id
1,Chocolate Sandwich Cookies,61,19
2,All-Seasons Salt,104,13
3,Robust Golden Unsweetened Oolong Tea,94,7
4,Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce,38,1
5,Green Chile Anytime Sauce,5,13


## Load order_products__train.csv

In [15]:
%%sql
CREATE TABLE order_products__train
(
  order_id integer NOT NULL,
  product_id integer NOT NULL,
  add_to_cart_order integer NOT NULL,
  reordered integer NOT NULL
)

 * postgresql://postgres:***@localhost:5432/postgres
Done.


[]

In [16]:
%%sql
COPY order_products__train(order_id, product_id, add_to_cart_order, reordered) 
FROM '/Users/andreduong/market-basket-analysis/order_products__train.csv' DELIMITER ',' CSV HEADER;

 * postgresql://postgres:***@localhost:5432/postgres
1384617 rows affected.


[]

In [17]:
%%sql
SELECT * FROM order_products__train LIMIT 5

 * postgresql://postgres:***@localhost:5432/postgres
5 rows affected.


order_id,product_id,add_to_cart_order,reordered
1,49302,1,1
1,11109,2,1
1,10246,3,0
1,49683,4,0
1,43633,5,1


## Load order_products__prior.csv

In [18]:
%%sql
CREATE TABLE order_products__prior
(
  order_id integer NOT NULL,
  product_id integer NOT NULL,
  add_to_cart_order integer NOT NULL,
  reordered integer NOT NULL
)

 * postgresql://postgres:***@localhost:5432/postgres
Done.


[]

In [19]:
%%sql
COPY order_products__prior(order_id, product_id, add_to_cart_order, reordered) 
FROM '/Users/andreduong/market-basket-analysis/order_products__prior.csv' DELIMITER ',' CSV HEADER;

 * postgresql://postgres:***@localhost:5432/postgres
32434489 rows affected.


[]

In [20]:
%%sql
SELECT * FROM order_products__prior LIMIT 5

 * postgresql://postgres:***@localhost:5432/postgres
5 rows affected.


order_id,product_id,add_to_cart_order,reordered
2,33120,1,1
2,28985,2,1
2,9327,3,0
2,45918,4,1
2,30035,5,0
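
With all six tables loaded, a quick sanity check is to compare each table's row count with the number reported by its `COPY` cell. The sketch below is one way to do that from Python with any DB-API connection. The function name `find_count_mismatches` is hypothetical, and the in-memory SQLite database is only a stand-in so the example runs without a PostgreSQL server.

```python
import sqlite3

# Row counts reported by the COPY cells above.
EXPECTED_ROWS = {
    "aisles": 134,
    "departments": 21,
    "orders": 3421083,
    "products": 49688,
    "order_products__train": 1384617,
    "order_products__prior": 32434489,
}


def find_count_mismatches(conn, expected):
    """Return {table: (expected, actual)} for tables whose row count differs."""
    mismatches = {}
    cursor = conn.cursor()
    for table, want in expected.items():
        # Table names come from the trusted dictionary above.
        cursor.execute(f"SELECT COUNT(*) FROM {table}")
        actual = cursor.fetchone()[0]
        if actual != want:
            mismatches[table] = (want, actual)
    return mismatches


# Tiny stand-in table with two invented rows, for demonstration only.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE aisles (aisle_id INTEGER, aisle TEXT)")
demo.executemany("INSERT INTO aisles VALUES (?, ?)", [(1, "a"), (2, "b")])

print(find_count_mismatches(demo, {"aisles": 2}))    # {}
print(find_count_mismatches(demo, {"aisles": 134}))  # {'aisles': (134, 2)}
```

Against the real database, `find_count_mismatches(conn, EXPECTED_ROWS)` should return an empty dictionary if every file loaded completely.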


# Data Cleaning

## Table: orders

Check numbers of null values in each column:

In [4]:
%%sql
SELECT COUNT(*) - COUNT(order_id) AS order_id,
       COUNT(*) - COUNT(user_id) AS user_id,
       COUNT(*) - COUNT(eval_set) AS eval_set,
       COUNT(*) - COUNT(order_number) AS order_number,
       COUNT(*) - COUNT(order_dow) AS order_dow,
       COUNT(*) - COUNT(order_hour_of_day) AS order_hour_of_day,
       COUNT(*) - COUNT(days_since_prior_order) AS days_since_prior_order
FROM orders;

 * postgresql://postgres:***@localhost:5432/postgres
1 rows affected.


order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,0,0,0,0,0,206209
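
The null-count query works because `COUNT(*)` counts every row, while `COUNT(column)` skips rows where that column is NULL, so the difference is the number of NULLs. A minimal sketch of the same idea on three rows taken from the `orders` preview above, using an in-memory SQLite table as a stand-in for PostgreSQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, days_since_prior_order REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(2539329, None), (2398795, 15.0), (473747, 21.0)],
)

total, non_null, nulls = conn.execute(
    """
    SELECT COUNT(*),
           COUNT(days_since_prior_order),
           COUNT(*) - COUNT(days_since_prior_order)
    FROM orders
    """
).fetchone()

print(total, non_null, nulls)  # 3 2 1
```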


days_since_prior_order is NULL for every user's first order, because there is no earlier order to measure from. These 206,209 NULLs are replaced with 0:

In [63]:
%%sql
UPDATE orders SET days_since_prior_order = 0 WHERE days_since_prior_order IS NULL;

 * postgresql://postgres:***@localhost:5432/postgres
206209 rows affected.


[]
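
One consequence of overwriting NULL with 0 is that a first order can no longer be told apart from an order placed on the same day as the previous one. As an alternative sketch (not what this notebook does), the NULLs could stay in the table and be filled at query time with `COALESCE`, alongside an explicit first-order flag. The column aliases `days_filled` and `is_first_order` are hypothetical, and SQLite stands in for PostgreSQL so the example is runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, order_number INTEGER, "
    "days_since_prior_order REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(2539329, 1, None), (2398795, 2, 15.0), (473747, 3, 21.0)],
)

# Fill NULLs only in the query result; the stored values are left untouched.
rows = conn.execute(
    """
    SELECT order_id,
           COALESCE(days_since_prior_order, 0.0) AS days_filled,
           days_since_prior_order IS NULL AS is_first_order
    FROM orders
    ORDER BY order_number
    """
).fetchall()

# The table itself still contains the original NULL.
stored_nulls = conn.execute(
    "SELECT COUNT(*) - COUNT(days_since_prior_order) FROM orders"
).fetchone()[0]

for row in rows:
    print(row)
print(stored_nulls)
```

The trade-off is that every downstream query has to remember the `COALESCE`, whereas the `UPDATE` above fixes the column once.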

Double-check that no NULL values remain in the table:

In [64]:
%%sql
SELECT days_since_prior_order FROM orders LIMIT 10

 * postgresql://postgres:***@localhost:5432/postgres
10 rows affected.


days_since_prior_order
15.0
21.0
29.0
28.0
19.0
20.0
14.0
0.0
30.0
14.0


In [65]:
%%sql
SELECT COUNT(*) - COUNT(order_id) AS order_id,
       COUNT(*) - COUNT(user_id) AS user_id,
       COUNT(*) - COUNT(eval_set) AS eval_set,
       COUNT(*) - COUNT(order_number) AS order_number,
       COUNT(*) - COUNT(order_dow) AS order_dow,
       COUNT(*) - COUNT(order_hour_of_day) AS order_hour_of_day,
       COUNT(*) - COUNT(days_since_prior_order) AS days_since_prior_order
FROM orders;

 * postgresql://postgres:***@localhost:5432/postgres
1 rows affected.


order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,0,0,0,0,0,0


# Feature Engineering

## Merge Tables: departments, aisles, products

We will merge these three tables into a single table called productscombined, so that all product information is in one place.

In [21]:
%%sql
CREATE TABLE productscombined AS
SELECT p.*, d.department, a.aisle
FROM products p
INNER JOIN departments d ON p.department_id = d.department_id
INNER JOIN aisles a ON p.aisle_id = a.aisle_id;

 * postgresql://postgres:***@localhost:5432/postgres
49688 rows affected.


[]

In [22]:
%%sql
SELECT * FROM productscombined LIMIT 5;

 * postgresql://postgres:***@localhost:5432/postgres
5 rows affected.


product_id,product_name,aisle_id,department_id,department,aisle
1,Chocolate Sandwich Cookies,61,19,snacks,cookies cakes
2,All-Seasons Salt,104,13,pantry,spices seasonings
3,Robust Golden Unsweetened Oolong Tea,94,7,beverages,tea
4,Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce,38,1,frozen,frozen meals
5,Green Chile Anytime Sauce,5,13,pantry,marinades meat preparation
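
The merged table has 49,688 rows, the same number that was loaded into products, which indicates that the INNER JOINs dropped no product. An INNER JOIN silently discards rows without a match, so this comparison is worth making. The sketch below illustrates the failure mode on a toy SQLite database (a stand-in for PostgreSQL) with one invented product whose aisle_id has no match, and uses a LEFT JOIN to find it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Product 3 is an invented row whose aisle_id (999) matches no aisle.
conn.executescript(
    """
    CREATE TABLE aisles (aisle_id INTEGER, aisle TEXT);
    CREATE TABLE products (product_id INTEGER, product_name TEXT, aisle_id INTEGER);
    INSERT INTO aisles VALUES (61, 'cookies cakes'), (104, 'spices seasonings');
    INSERT INTO products VALUES
        (1, 'Chocolate Sandwich Cookies', 61),
        (2, 'All-Seasons Salt', 104),
        (3, 'Hypothetical Orphan Product', 999);
    """
)

# INNER JOIN keeps only products that have a matching aisle.
joined = conn.execute(
    "SELECT COUNT(*) FROM products p INNER JOIN aisles a ON p.aisle_id = a.aisle_id"
).fetchone()[0]

# LEFT JOIN plus an IS NULL filter lists the products the INNER JOIN would drop.
orphans = conn.execute(
    """
    SELECT p.product_id
    FROM products p
    LEFT JOIN aisles a ON p.aisle_id = a.aisle_id
    WHERE a.aisle_id IS NULL
    """
).fetchall()

print(joined)   # 2
print(orphans)  # [(3,)]
```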


## Merge Tables: order_products__prior and orders