**Capstone 1  -  Instacart Market Basket Analysis**

Instacart is a U.S.-based company that provides same-day grocery delivery. Customers choose the grocery store they want to order from, select items from its list, and have them hand-delivered by a personal shopper. The company offers the service through both its website and its app.

They have data on over 3 million transactions from shoppers located across the United States.

**Problem Statement**
Instacart wants to leverage this data to build a model that can predict which previously purchased products will be in a user’s next order. This will help the company decide where to show previously ordered items on its page or app. The model will also help recommend items to users while they are browsing.


In [1]:
#Import the necessary modules
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm
from scipy.stats import normaltest
import statsmodels.stats.api as sms
import statsmodels.stats.proportion as ssp
import statsmodels.stats.weightstats as ssw
from statsmodels.stats.weightstats import ztest

#default seaborn color palette
color = sns.color_palette()

In [2]:
#Import the files
#files provided - aisles, departments, products, order_products__prior, order_products__train, orders

#aisles
aisles = pd.read_csv('aisles.csv')
#departments
departments = pd.read_csv('departments.csv')
#products
products = pd.read_csv("products.csv")
#orders_prior - products in each prior order
orders_prior = pd.read_csv("order_products__prior.csv")
#order_train - products in each train order
order_train = pd.read_csv('order_products__train.csv')
#orders
orders = pd.read_csv('orders.csv')

In [3]:
orders_train_sub = orders[(orders['eval_set'] == 'train')]
len(orders_train_sub)

131209

In [30]:
orders_train_sub.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
10,1187899,1,train,11,4,8,14.0
25,1492625,2,train,15,1,11,30.0
49,2196797,5,train,5,0,11,6.0
74,525192,7,train,21,2,11,6.0
78,880375,8,train,4,1,14,10.0


In [31]:
orders.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [32]:
products.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


In [33]:
orders_prior.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


In [5]:
# Merging files
#merging product with aisle and department files

products_a = pd.merge(products,aisles,on='aisle_id')

In [6]:
# merging with departments file
products_all =pd.merge(products_a,departments,left_on = 'department_id'
                       ,right_on = 'department_id',how='left',suffixes = ('_x', '_y'))
products_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49688 entries, 0 to 49687
Data columns (total 6 columns):
product_id       49688 non-null int64
product_name     49688 non-null object
aisle_id         49688 non-null int64
department_id    49688 non-null int64
aisle            49688 non-null object
department       49688 non-null object
dtypes: int64(3), object(3)
memory usage: 2.7+ MB


In [41]:
#orders table - merge with orders_prior (order_products__prior) to get the product-level variables
orders_all = pd.merge(orders,orders_prior,left_on= 'order_id',right_on='order_id',how="inner",suffixes = ('_x', '_y'))
orders_all = pd.merge(orders_all,products_all,left_on= 'product_id',right_on='product_id',how="inner" )
orders_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32434489 entries, 0 to 32434488
Data columns (total 15 columns):
order_id                  int64
user_id                   int64
eval_set                  object
order_number              int64
order_dow                 int64
order_hour_of_day         int64
days_since_prior_order    float64
product_id                int64
add_to_cart_order         int64
reordered                 int64
product_name              object
aisle_id                  int64
department_id             int64
aisle                     object
department                object
dtypes: float64(1), int64(10), object(4)
memory usage: 3.9+ GB


**From the previous EDA, we had noticed the following trends**

Most items are ordered from the produce department

Instacart gets more orders on Sundays and Saturdays

Let us check whether the data supports this statistically.

Null hypothesis - There is no difference in the proportion of ordered items that come from the produce department between orders placed on a weekday and orders placed on a weekend.

Alternate hypothesis - There is a significant difference in the proportion of ordered items that come from the produce department between orders placed on a weekday and orders placed on a weekend.
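
Written in symbols, the test looks roughly as follows. The notation is an assumption introduced here for illustration: $x_1, n_1$ are the number of produce items and the total number of items on weekdays, $x_2, n_2$ are the same counts for weekends, and $\hat{p}_i = x_i / n_i$.

```latex
H_0 : p_1 = p_2 \qquad\qquad H_1 : p_1 \neq p_2

\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}
\qquad
z = \frac{\hat{p}_2 - \hat{p}_1}
         {\sqrt{\hat{p}\,(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}
```

Under $H_0$ and with large samples, $z$ is approximately standard normal, so the two-sided p-value is $2\,P(Z \ge |z|)$.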

In [43]:
#create the order sets for weekdays and weekend (assumes order_dow codes 0 and 6 are the weekend days)
week = [1,2,3,4,5]
weekend = [0,6]
orders_week = orders_all[orders_all.order_dow.isin(week)]
orders_weekend = orders_all[orders_all.order_dow.isin(weekend)]

In [37]:
orders_week.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered
0,2539329,1,prior,1,2,8,,196,1,0
1,2539329,1,prior,1,2,8,,14084,2,0
2,2539329,1,prior,1,2,8,,12427,3,0
3,2539329,1,prior,1,2,8,,26088,4,0
4,2539329,1,prior,1,2,8,,26405,5,0


In [44]:
n1=len(orders_week)
n1

21724519

In [45]:
n2= len(orders_weekend)
n2

10709970

In [46]:
orders_week_prod

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,aisle,department
0,2539329,1,prior,1,2,8,,196,1,0,Soda,77,7,soft drinks,beverages
1,2398795,1,prior,2,3,7,15.0,196,1,1,Soda,77,7,soft drinks,beverages
2,473747,1,prior,3,3,12,21.0,196,1,1,Soda,77,7,soft drinks,beverages
3,2254736,1,prior,4,4,7,29.0,196,1,1,Soda,77,7,soft drinks,beverages
4,431534,1,prior,5,4,15,28.0,196,1,1,Soda,77,7,soft drinks,beverages
5,3367565,1,prior,6,2,7,19.0,196,1,1,Soda,77,7,soft drinks,beverages
6,550135,1,prior,7,1,9,20.0,196,1,1,Soda,77,7,soft drinks,beverages
7,3108588,1,prior,8,1,14,14.0,196,2,1,Soda,77,7,soft drinks,beverages
8,2295261,1,prior,9,1,16,0.0,196,4,1,Soda,77,7,soft drinks,beverages
9,2550362,1,prior,10,4,8,30.0,196,1,1,Soda,77,7,soft drinks,beverages


In [47]:
orders_week_prod=orders_week[orders_week.department=='produce']
n1_week_produce=len(orders_week_prod)
n1_week_produce

6186911

In [48]:
orders_weekend_prod=orders_weekend[orders_weekend.department=='produce']
n2_weekend_produce=len(orders_weekend_prod)
n2_weekend_produce

3292380

In [49]:
#Proportion of ordered items that come from the produce department
#proportion of produce items on weekdays
wc = n1_week_produce/n1
#proportion of produce items on weekends
bc = n2_weekend_produce/n2
wc,bc
diff = bc - wc
diff

0.022623305498974067

In [50]:
wc,bc

(0.28478932030670046, 0.3074126258056745)

In [54]:
#calculate the p value
#pooled sample proportion
#Pooled sample proportion. Since the null hypothesis states that P1=P2, we use a pooled sample proportion (p) to compute the standard error of the sampling distribution.
#p = (p1 * n1 + p2 * n2) / (n1 + n2)
pooled  = (wc*n1+bc*n2)/(n1+n2)
pooled


0.2922596067414535

In [55]:
# calculating the standard error
se_pooled = np.sqrt((pooled*(1 - pooled)/(n1)) + (pooled*(1 - pooled) /(n2)))


se_pooled_1 = np.sqrt( pooled * ( 1 - pooled ) * ((1/n1) + (1/n2) ))

In [56]:
se_pooled,se_pooled_1

(0.0001698070461738605, 0.0001698070461738605)

In [57]:
# calculate the statistic
z = (diff)/se_pooled #pooled standard error calculated in the previous cell
p_values = stats.norm.sf(abs(z))*2 #twoside
print("Z-score : %0.1F  p-value : %0.9F" % (z,p_values))

Z-score : 133.2  p-value : 0.000000000
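
The manual steps above (proportions, pooled proportion, standard error, z-score, p-value) can be collected into one helper. This is a minimal sketch using only the standard library; the function name `two_proportion_ztest` is hypothetical, and the counts are copied from the notebook outputs above.

```python
import math

def two_proportion_ztest(count1, n1, count2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    z is computed as (p2 - p1) / SE, i.e. weekend minus weekday,
    the same direction as the manual calculation in the notebook.
    """
    p1 = count1 / n1
    p2 = count2 / n2
    # Pooled proportion under H0: p1 == p2
    pooled = (count1 + count2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # Two-sided normal p-value: 2 * sf(|z|) equals erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Weekday and weekend produce counts and totals from the notebook output
z, p = two_proportion_ztest(6186911, 21724519, 3292380, 10709970)
print(f"z = {z:.1f}, p = {p:.3g}")
```

Passing the groups in the opposite order flips the sign of `z` but leaves the two-sided p-value unchanged.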


In [58]:
count = np.array([n1_week_produce, n2_weekend_produce])
nobs = np.array([n1, n2])
count,nobs

(array([6186911, 3292380]), array([21724519, 10709970]))

In [59]:
#proportions_ztest returns a z-statistic (first group minus second group) and its p-value
zstat,pval=ssp.proportions_ztest(count,nobs, value=0, alternative='two-sided', prop_var=False)
zstat,pval

(-133.22948610631104, 0.0)

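Based on the p-value, we reject the null hypothesis: the proportion of ordered items that come from the produce department differs between weekends (about 30.7%) and weekdays (about 28.5%). The statistic from proportions_ztest is negative only because it computes weekday minus weekend, whereas the manual calculation used weekend minus weekday.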
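
With samples this large, almost any difference is statistically significant, so it can help to report how big the difference is. The sketch below computes a Wald confidence interval for the difference in proportions using the unpooled standard error. The helper name `diff_proportion_ci` is hypothetical and the counts are copied from the notebook outputs.

```python
import math
from statistics import NormalDist

def diff_proportion_ci(count1, n1, count2, n2, level=0.95):
    """Wald confidence interval for p2 - p1 (unpooled standard error)."""
    p1, p2 = count1 / n1, count2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Critical value of the standard normal for the requested level
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p2 - p1
    return diff - z_crit * se, diff + z_crit * se

# Weekday and weekend produce counts and totals from the notebook output
low, high = diff_proportion_ci(6186911, 21724519, 3292380, 10709970)
print(f"95% CI for weekend minus weekday produce share: ({low:.4f}, {high:.4f})")
```

The interval is narrow and sits a little above 2 percentage points, which describes the practical size of the weekend effect better than the p-value does.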

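The p-value is reported as 0 because the z-score is so large that the tail probability is smaller than the smallest number a float can represent; it is not exactly zero. With tens of millions of rows, even a difference of about 2 percentage points is overwhelmingly significant.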
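
A printed p-value of 0.0 for a z-score this large can be explained by floating-point underflow. The sketch below shows the direct calculation returning 0.0, then estimates the order of magnitude on a log scale using the standard large-z tail approximation sf(z) ≈ pdf(z) / z. The helper `log10_two_sided_p` is hypothetical; `scipy.stats.norm.logsf` is an alternative way to get the log tail probability.

```python
import math

def log10_two_sided_p(z):
    """Approximate log10 of the two-sided normal p-value for large |z|.

    Uses the tail approximation sf(z) ~ pdf(z) / z, which is only
    accurate when |z| is large.
    """
    z = abs(z)
    log_sf = -0.5 * z * z - math.log(z * math.sqrt(2 * math.pi))
    return (math.log(2) + log_sf) / math.log(10)

z = 133.2  # z-score reported in the notebook
direct_p = math.erfc(z / math.sqrt(2))  # two-sided p-value, underflows
log10_p = log10_two_sided_p(z)          # stays finite on the log scale
print(direct_p, log10_p)
```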



**Second Hypothesis Test**

Null hypothesis - There is no difference in the proportion of reordered items between items ordered on a weekday and items ordered on a weekend.

Alternate hypothesis - There is a significant difference in the proportion of reordered items between items ordered on a weekday and items ordered on a weekend.

In [60]:
#create the order sets for weekdays and weekend
week = [1,2,3,4,5]
weekend = [0,6]
orders_week = orders_all[orders_all.order_dow.isin(week)]
orders_weekend = orders_all[orders_all.order_dow.isin(weekend)]

In [63]:
orders_week_re=orders_week[orders_week.reordered==1]
n1_week_re=len(orders_week_re)
n1_week_re

12907335

In [65]:
orders_weekend_re=orders_weekend[orders_weekend.reordered==1]
n2_weekend_re=len(orders_weekend_re)
n2_weekend_re

6219201

In [66]:
#proportion of reorders over the week 
wc = n1_week_re/n1
#proportion of reorders by weekend
bc = n2_weekend_re/n2
wc,bc
diff = bc - wc
diff

-0.0134440817490461

In [67]:
wc,bc

(0.5941367447537044, 0.5806926630046583)

In [68]:
#calculate the p value
#pooled sample proportion
#Pooled sample proportion. Since the null hypothesis states that P1=P2, we use a pooled sample proportion (p) to compute the standard error of the sampling distribution.
#p = (p1 * n1 + p2 * n2) / (n1 + n2)
pooled  = (wc*n1+bc*n2)/(n1+n2)
pooled

0.5896974667922161

In [69]:
# calculating the standard error
se_pooled = np.sqrt((pooled*(1 - pooled)/(n1)) + (pooled*(1 - pooled) /(n2)))


se_pooled_1 = np.sqrt( pooled * ( 1 - pooled ) * ((1/n1) + (1/n2) ))

In [70]:
se_pooled,se_pooled_1

(0.00018365427765138162, 0.0001836542776513816)

In [71]:
# calculate the statistic
z = (diff)/se_pooled #pooled standard error calculated in the previous cell
p_values = stats.norm.sf(abs(z))*2 #twoside
print("Z-score : %0.1F  p-value : %0.9F" % (z,p_values))

Z-score : -73.2  p-value : 0.000000000


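The p-value again prints as zero because it is too small to represent as a float, so we reject the null hypothesis: the proportion of reordered items is slightly higher on weekdays (about 59.4%) than on weekends (about 58.1%).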
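
To judge whether this difference matters in practice, one option is an effect size such as Cohen's h, which compares two proportions on an arcsine-transformed scale. This is a sketch and not part of the original analysis; the proportions are copied from the notebook output, and by Cohen's convention values near 0.2 count as small.

```python
import math

def cohens_h(p1, p2):
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Reorder proportions from the notebook output (weekday, weekend)
h = cohens_h(0.5941367447537044, 0.5806926630046583)
print(f"Cohen's h = {h:.3f}")
```

A value well below 0.2 suggests the weekday and weekend reorder rates are statistically distinguishable but very close in practical terms.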

In [72]:
#Checking for correlation
orders_week_prod.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,aisle,department
63885,2398795,1,prior,2,3,7,15.0,13176,4,0,Bag of Organic Bananas,24,4,fresh fruits,produce
63886,431534,1,prior,5,4,15,28.0,13176,8,1,Bag of Organic Bananas,24,4,fresh fruits,produce
63887,2168274,2,prior,1,2,11,,13176,12,0,Bag of Organic Bananas,24,4,fresh fruits,produce
63888,3120740,7,prior,15,3,16,2.0,13176,8,0,Bag of Organic Bananas,24,4,fresh fruits,produce
63889,1468214,11,prior,7,5,9,30.0,13176,12,0,Bag of Organic Bananas,24,4,fresh fruits,produce


In [76]:
from scipy.stats import pearsonr
corr, p_value = pearsonr(orders_week_prod['order_dow'], orders_week_prod['order_hour_of_day'])

In [77]:
corr

0.03141291423773683

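With a Pearson coefficient of about 0.03, there is essentially no linear correlation between day of week and hour of day for weekday produce items.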
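
Pearson's coefficient only measures linear association, so a value near zero does not rule out other kinds of relationship. The sketch below illustrates this on made-up data (not the Instacart data): `y` is fully determined by `x`, yet the coefficient is zero. The helper `pearson_r` is a hypothetical stand-in for `scipy.stats.pearsonr`.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: y depends on x exactly, but not linearly
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]
r = pearson_r(xs, ys)
print(f"r = {r:.3f}")
```

Since `order_dow` is a day code, not a continuous quantity, comparing the hour-of-day distribution per day may be more informative than a single coefficient.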