**Capstone 1  -  Instacart Market Basket Analysis**

Instacart is a U.S.-based company that provides same-day grocery delivery. Customers choose a grocery store, select items from its product list, and have the order hand-delivered by a personal shopper. The service is available through both the company's website and its app.

The company has data on over 3 million transactions from shoppers located across the United States.

**Problem Statement**
Instacart wants to use this data to build a model that predicts which previously purchased products will be in a user’s next order. This will help the company decide where to show previously ordered items on its page or app. The model will also help recommend items to users while they are browsing.


In [84]:
#Import the necessary modules
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.stats.api as sms
import statsmodels.stats.proportion as ssp
import statsmodels.stats.weightstats as ssw
from scipy import stats
from scipy.stats import norm, normaltest
from statsmodels.stats.weightstats import ztest

color = sns.color_palette()

In [60]:
#Import the files
#files provided - aisles, departments, products, order_products__prior, order_products__train, orders

aisles = pd.read_csv('aisles.csv')
departments = pd.read_csv('departments.csv')
products = pd.read_csv('products.csv')
#products in each prior order
orders_prior = pd.read_csv('order_products__prior.csv')
#products in each train order
order_train = pd.read_csv('order_products__train.csv')
orders = pd.read_csv('orders.csv')

In [61]:
orders_train_sub = orders[(orders['eval_set'] == 'train')]
len(orders_train_sub)

131209

In [62]:
orders_train_sub.head()

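| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 10 | 1187899 | 1 | train | 11 | 4 | 8 | 14.0 |
| 25 | 1492625 | 2 | train | 15 | 1 | 11 | 30.0 |
| 49 | 2196797 | 5 | train | 5 | 0 | 11 | 6.0 |
| 74 | 525192 | 7 | train | 21 | 2 | 11 | 6.0 |
| 78 | 880375 | 8 | train | 4 | 1 | 14 | 10.0 |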


In [63]:
# Merging files
#merging product with aisle and department files

products_a = pd.merge(products,aisles,on='aisle_id')

In [64]:
# merging with departments file
products_all =pd.merge(products_a,departments,left_on = 'department_id'
                       ,right_on = 'department_id',how='left',suffixes = ('_x', '_y'))
products_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49688 entries, 0 to 49687
Data columns (total 6 columns):
product_id       49688 non-null int64
product_name     49688 non-null object
aisle_id         49688 non-null int64
department_id    49688 non-null int64
aisle            49688 non-null object
department       49688 non-null object
dtypes: int64(3), object(3)
memory usage: 2.7+ MB
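
The `info()` output shows 49,688 rows with no nulls, which is what one would expect if every product matched exactly one aisle and one department. A minimal sketch of how that expectation could be checked explicitly, using the `validate` and `indicator` parameters of `pd.merge`. The toy frames and their values are made up for illustration and are not taken from the real files.

```python
import pandas as pd

# Toy stand-ins for products.csv and aisles.csv (values are illustrative only)
products_toy = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Cookies", "Carrots", "Cola"],
    "aisle_id": [61, 83, 77],
})
aisles_toy = pd.DataFrame({
    "aisle_id": [61, 77, 83],
    "aisle": ["cookies cakes", "soft drinks", "fresh vegetables"],
})

# validate="many_to_one" raises MergeError if aisle_id is duplicated in aisles_toy;
# indicator=True adds a _merge column showing where each row was matched.
merged_toy = pd.merge(
    products_toy, aisles_toy,
    on="aisle_id", how="left",
    validate="many_to_one", indicator=True,
)

rows_preserved = len(merged_toy) == len(products_toy)
all_matched = bool((merged_toy["_merge"] == "both").all())
print(rows_preserved, all_matched)
```

If either flag were `False` on the real data, the merge would have duplicated products or left some without an aisle.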


In [65]:
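# orders_prior_p is not defined in the cells shown; assumed here (from the column order and float dtypes in the output below) to be a left merge of products_all with orders_prior
orders_prior_p = pd.merge(products_all, orders_prior, on='product_id', how='left').drop_duplicates()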

In [66]:
#orders table - merge with orders to get the other order-level variables
orders_all = pd.merge(orders_prior_p,orders,left_on= 'order_id',right_on='order_id',how="inner")
orders_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32434489 entries, 0 to 32434488
Data columns (total 15 columns):
product_id                int64
product_name              object
aisle_id                  int64
department_id             int64
aisle                     object
department                object
order_id                  float64
add_to_cart_order         float64
reordered                 float64
user_id                   int64
eval_set                  object
order_number              int64
order_dow                 int64
order_hour_of_day         int64
days_since_prior_order    float64
dtypes: float64(4), int64(7), object(4)
memory usage: 3.9+ GB


**From the previous EDA, we noticed the following trends**

Most items are ordered from the produce department

Instacart gets more orders on Sundays and Saturdays

Let us check whether the data support this statistically. Each row of orders_all is one ordered item, so the test compares proportions of ordered items.

Null hypothesis - The proportion of ordered items that come from the produce department is the same on weekdays and weekends

Alternate hypothesis - The proportion of ordered items that come from the produce department differs between weekdays and weekends
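
In symbols, writing $p_1$ and $p_2$ for the weekday and weekend proportions of ordered items that come from produce, and $n_1$, $n_2$ for the corresponding row counts, the pooled two-proportion z-test used in the cells below can be summarized as follows. The hat notation is introduced here for convenience and is not in the original notebook.

```latex
\begin{aligned}
H_0 &: p_1 = p_2 \qquad H_1 : p_1 \neq p_2 \\
\hat{p} &= \frac{\hat{p}_1 n_1 + \hat{p}_2 n_2}{n_1 + n_2} \\
SE &= \sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \\
z &= \frac{\hat{p}_2 - \hat{p}_1}{SE}
\end{aligned}
```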

In [67]:
#create the order sets for weekdays and weekend
week = [1,2,3,4,5]
weekend = [0,6]
orders_week = orders_all[orders_all.order_dow.isin(week)]
orders_weekend = orders_all[orders_all.order_dow.isin(weekend)]

In [68]:
orders_week.head()

| | product_id | product_name | aisle_id | department_id | aisle | department | order_id | add_to_cart_order | reordered | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 | cookies cakes | snacks | 1107.0 | 7.0 | 0.0 | 38259 | prior | 2 | 1 | 11 | 7.0 |
| 1 | 9007 | Frosted Flakes | 121 | 14 | cereal | breakfast | 1107.0 | 17.0 | 0.0 | 38259 | prior | 2 | 1 | 11 | 7.0 |
| 2 | 32689 | Romaine Hearts | 123 | 4 | packaged vegetables fruits | produce | 1107.0 | 16.0 | 0.0 | 38259 | prior | 2 | 1 | 11 | 7.0 |
| 3 | 28413 | Bunny-Luv Organic Carrots | 83 | 4 | fresh vegetables | produce | 1107.0 | 13.0 | 0.0 | 38259 | prior | 2 | 1 | 11 | 7.0 |
| 4 | 46149 | Zero Calorie Cola | 77 | 7 | soft drinks | beverages | 1107.0 | 6.0 | 0.0 | 38259 | prior | 2 | 1 | 11 | 7.0 |


In [70]:
n1=len(orders_week)
n1

21724519

In [71]:
n2= len(orders_weekend)
n2

10709970

In [72]:
orders_week_prod=orders_week[orders_week.department=='produce']
n1_week_produce=len(orders_week_prod)
n1_week_produce

6186911

In [73]:
orders_weekend_prod=orders_weekend[orders_weekend.department=='produce']
n2_weekend_produce=len(orders_weekend_prod)
n2_weekend_produce

3292380
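
The four filtered frames above each copy a large slice of a table that already occupies several gigabytes. As an alternative sketch (not the approach used in this notebook), the same four counts can be obtained in a single grouped pass. The toy frame and the names `toy`, `period` and `is_produce` are illustrative assumptions; the weekend codes 0 and 6 follow the cell above.

```python
import pandas as pd

# Toy stand-in for orders_all with only the two columns the counts need
toy = pd.DataFrame({
    "order_dow":  [0, 1, 1, 2, 6, 6, 3, 0],
    "department": ["produce", "snacks", "produce", "produce",
                   "beverages", "produce", "snacks", "produce"],
})

summary = (
    toy.assign(
        # same weekday/weekend coding as the cell above
        period=toy["order_dow"].isin([0, 6]).map({True: "weekend", False: "weekday"}),
        is_produce=toy["department"].eq("produce"),
    )
    .groupby("period")["is_produce"]
    # size = all rows, sum = produce rows, mean = proportion of produce rows
    .agg(n="size", n_produce="sum", proportion="mean")
)
print(summary)
```

On the real data, the `n` and `n_produce` columns would correspond to `n1`, `n2`, `n1_week_produce` and `n2_weekend_produce`.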

In [74]:
#Proportion of ordered items that come from the produce department
#weekday proportion
wc = n1_week_produce/n1
#weekend proportion
bc = n2_weekend_produce/n2
#difference in proportions (weekend - weekday)
diff = bc - wc
diff

0.022623305498974067

In [75]:
wc,bc

(0.28478932030670046, 0.3074126258056745)

In [76]:
#Pooled sample proportion
#Since the null hypothesis states that p1 = p2, a pooled sample proportion p
#is used to compute the standard error of the sampling distribution.
#p = (p1 * n1 + p2 * n2) / (n1 + n2)
pooled  = (wc*n1+bc*n2)/(n1+n2)
pooled


0.2922596067414535

In [77]:
# calculating the pooled standard error under the null hypothesis (two equivalent forms)
se_pooled = np.sqrt((pooled*(1 - pooled)/(n1)) + (pooled*(1 - pooled) /(n2)))


se_pooled_1 = np.sqrt( pooled * ( 1 - pooled ) * ((1/n1) + (1/n2) ))

In [78]:
se_pooled,se_pooled_1

(0.0001698070461738605, 0.0001698070461738605)

In [79]:
# calculate the z statistic and the two-sided p-value
z = (diff)/se_pooled #pooled standard error from the cell above
p_values = stats.norm.sf(abs(z))*2 #two-sided
print("Z-score : %0.1F  p-value : %0.9F" % (z,p_values))

Z-score : 133.2  p-value : 0.000000000


In [95]:
count = np.array([n1_week_produce, n2_weekend_produce])
nobs = np.array([n1, n2])
count,nobs

(array([6186911, 3292380]), array([21724519, 10709970]))

In [96]:
#proportions_ztest returns a z statistic, so it is named zstat here
zstat,pval=ssp.proportions_ztest(count,nobs, value=0, alternative='two-sided', prop_var=False)
zstat,pval

(-133.22948610631104, 0.0)
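
The manual result (133.2) and the statsmodels result (-133.2) differ only in sign: the manual cell computes weekend minus weekday, while `proportions_ztest` subtracts the second group from the first, which here is weekday minus weekend. A dependency-free sketch of the same pooled test, using only the counts printed above, can serve as a cross-check. The function name `two_proportion_ztest` is hypothetical.

```python
import math

def two_proportion_ztest(count1, n1, count2, n2):
    """Pooled two-sided z-test for H0: p1 == p2.

    Returns (z, p_value), with z based on p1 - p2.
    """
    p1, p2 = count1 / n1, count2 / n2
    pooled = (count1 + count2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: 2 * P(Z > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Counts printed in the cells above: weekday first, weekend second
z_stat, p_val = two_proportion_ztest(6186911, 21724519, 3292380, 10709970)
print(round(z_stat, 1), p_val)
```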

Based on the p-value, we reject the null hypothesis: the proportion of ordered items that come from the produce department differs between weekends (30.7%) and weekdays (28.5%).
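
With more than 32 million rows, almost any difference will be statistically significant, so the size of the difference is more informative than the p-value. A sketch of a 95% confidence interval for the weekend-minus-weekday difference, using the unpooled standard error and the normal approximation. This interval is an addition for illustration and is not part of the original analysis; the helper name `diff_confidence_interval` is hypothetical.

```python
import math

def diff_confidence_interval(count1, n1, count2, n2, z_crit=1.959964):
    """Normal-approximation (Wald) interval for p2 - p1.

    z_crit defaults to the 97.5th percentile of the standard normal (95% interval).
    """
    p1, p2 = count1 / n1, count2 / n2
    # Unpooled standard error: each group keeps its own variance estimate
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p2 - p1
    return diff - z_crit * se, diff + z_crit * se

# Weekday counts first, weekend counts second (values printed in the cells above)
low, high = diff_confidence_interval(6186911, 21724519, 3292380, 10709970)
print(f"{low:.4f} to {high:.4f}")
```

On these counts the interval is narrow, roughly 2.2 to 2.3 percentage points, so the weekend share of produce is higher by a small but precisely estimated amount.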

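The p-value is reported as 0 because a z-score of about 133 corresponds to a tail probability too small to represent in double-precision floating point, so the computed value underflows to zero. It does not indicate a problem with the normality of the data.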
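
To see why a z-score of this size yields exactly 0, the sketch below evaluates the two-sided p-value directly in double precision and then estimates its logarithm from the standard large-z approximation of the normal tail. The helper name `log_two_sided_p` is hypothetical, and the approximation is only appropriate for large |z|.

```python
import math
import sys

z = 133.2295  # magnitude of the z-score reported above

# Direct evaluation: 2 * P(Z > z) = erfc(z / sqrt(2)); this underflows to 0.0
p_direct = math.erfc(z / math.sqrt(2))

def log_two_sided_p(z):
    """Natural log of 2 * P(Z > |z|), using the tail approximation phi(z) / z."""
    z = abs(z)
    return math.log(2) - 0.5 * z * z - math.log(z) - 0.5 * math.log(2 * math.pi)

log_p = log_two_sided_p(z)
# Compare with the log of the smallest positive normal double
print(p_direct, log_p, math.log(sys.float_info.min))
```

The estimated log p-value is far below the log of the smallest positive normal double, which corresponds to a p-value smaller than 10^-3800.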

