# Instacart 

### Basket Analysis

The main goal of this project is to analyze the shopping carts of Instacart users and identify relationships between products: which products are usually purchased together?

To do this, we use the FP-Growth algorithm, an alternative to the Apriori algorithm that is typically faster because it avoids candidate generation, to find patterns or associations between the purchased products.

The main challenge of this dataset is that it does not contain generic products but highly detailed ones (including brands, versions, etc.). For this reason, I have decided to test three ways of applying FP-Growth:

* Model 1: Look for patterns between detailed products (in this notebook).
* Model 2: Look for patterns between aisles (in this notebook).
* Model 3: Transform detailed products into generic ones and apply FP-Growth (see notebook 3.1 Improved Basket Analysis in the 'notebooks' folder).
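
To see why brand-level detail is a challenge, consider a small sketch. The baskets and the product-to-generic mapping below are invented for illustration (only the product names are borrowed from the dataset). Every order pairs some milk with some bananas, but because the exact product names differ, no detailed pair stands out until the items are mapped to a coarser level.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: each order contains a milk and a banana product,
# but the exact product names differ from order to order.
baskets = [
    ["Organic Whole Milk", "Banana"],
    ["2% Reduced Fat Milk", "Bag of Organic Bananas"],
    ["Organic Whole Milk", "Bag of Organic Bananas"],
]

# Hypothetical mapping from detailed product to a generic label.
generic = {
    "Organic Whole Milk": "milk",
    "2% Reduced Fat Milk": "milk",
    "Banana": "banana",
    "Bag of Organic Bananas": "banana",
}


def pair_support(transactions):
    """Return {pair: fraction of transactions that contain both items}."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n / len(transactions) for pair, n in counts.items()}


detailed = pair_support(baskets)
coarse = pair_support([[generic[p] for p in b] for b in baskets])

print(max(detailed.values()))      # highest pair support at the detailed level
print(coarse[("banana", "milk")])  # support of the same pattern after mapping
```

In this toy case each detailed pair occurs in only one of the three orders, while the generic pair occurs in all of them. This is the motivation for Models 2 and 3.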




Libraries:

In [1]:
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import fpgrowth, association_rules
from mlxtend.preprocessing import TransactionEncoder

In [2]:
instacart = pd.read_csv('../data/instacart_sample.csv')
print(instacart.shape)
instacart.drop('Unnamed: 0', axis=1, inplace=True)
instacart.head()

(5204393, 15)


Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,department_id,aisle_id,aisle,department
0,6,22352,4,1,12,30.0,15873,2,0,Dryer Sheets Geranium Scent,17,75.0,laundry,household
1,8,3107,5,4,6,17.0,23423,1,1,Original Hawaiian Sweet Rolls,3,43.0,buns rolls,bakery
2,13,45082,2,6,17,1.0,3800,12,0,Hampshire 100% Natural Sour Cream,16,108.0,other creams cheeses,dairy eggs
3,13,45082,2,6,17,1.0,25783,7,0,Lemon Lime Thirst Quencher,7,64.0,energy sports drinks,beverages
4,13,45082,2,6,17,1.0,23020,10,0,Diet Tonic Water,7,77.0,soft drinks,beverages


In [3]:
instacart_ba = instacart.copy()

In [4]:
prods = instacart_ba[['order_id', 'product_name']].reset_index(drop=True)
prods.sort_values(by='order_id')

Unnamed: 0,order_id,product_name
0,6,Dryer Sheets Geranium Scent
1,8,Original Hawaiian Sweet Rolls
2,13,Hampshire 100% Natural Sour Cream
3,13,Lemon Lime Thirst Quencher
4,13,Diet Tonic Water
...,...,...
5204391,3421083,All Natural French Toast Sticks
5204387,3421083,Organic Mixed Berry Yogurt & Fruit Snack
5204386,3421083,Banana
5204388,3421083,Freeze Dried Mango Slices


## 1. Group products by order_id

To apply the FP-Growth algorithm, we first need a list of products for each transaction (order).

In [5]:
print('There are {} transactions and {} different products.'.format(prods['order_id'].nunique(), prods['product_name'].nunique()))

There are 761900 transactions and 31148 different products.


Collect the products in each transaction.

In [6]:
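# Group once by order_id instead of filtering the whole frame for every order.
# sort=False keeps orders in order of first appearance, like unique().
grouped = prods.groupby('order_id', sort=False)['product_name'].apply(list)
order_num = grouped.index.to_numpy()
prod_lst = grouped.tolist()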

In [7]:
basket = pd.DataFrame(order_num, columns = ['transaction'])
basket.head()

Unnamed: 0,transaction
0,6
1,8
2,13
3,14
4,22


In [9]:
basket['products_ordered'] = prod_lst
basket.head()

Unnamed: 0,transaction,products_ordered
0,6,[Dryer Sheets Geranium Scent]
1,8,[Original Hawaiian Sweet Rolls]
2,13,"[Hampshire 100% Natural Sour Cream, Lemon Lime..."
3,14,[Unprocessed American Singles Colby-Style Chee...
4,22,"[2% Reduced Fat Milk, Iceberg Lettuce, Large G..."


## 2. Applying FP Growth

#### 2.1. Encode data.

In [11]:
te = TransactionEncoder()
te_ary = te.fit(prod_lst).transform(prod_lst)
te_ary

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [12]:
encoded_df = pd.DataFrame(te_ary, columns = te.columns_)
encoded_df.head()

Unnamed: 0,#2 Coffee Filters,#2 Cone White Coffee Filters,#2 Mechanical Pencils,#4 Natural Brown Coffee Filters,& Go! Hazelnut Spread + Pretzel Sticks,+Energy Black Cherry Vegetable & Fruit Juice,0 Calorie Acai Raspberry Water Beverage,0 Calorie Fuji Apple Pear Water Beverage,0 Calorie Strawberry Dragonfruit Water Beverage,0% Fat Black Cherry Greek Yogurt y,...,with Olive Oil Mayonnaise Dressing,with Pump Rebalancing Shampoo,with Sweet & Smoky BBQ Sauce Cheeseburger Sliders,with Sweet Cinnamon Bunches Cereal,with Xylitol Cinnamon 18 Sticks Sugar Free Gum,with Xylitol Minty Sweet Twist 18 Sticks Sugar Free Gum,with Xylitol Original Flavor 18 Sticks Sugar Free Gum,with Xylitol Unwrapped Original Flavor 50 Sticks Sugar Free Gum,with a Splash of Mango Coconut Water,with a Splash of Pineapple Coconut Water
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
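
As a rough illustration of what the encoding step produces, the sketch below builds the same kind of boolean matrix by hand for three made-up baskets: one row per transaction and one column per distinct item, sorted alphabetically like the columns shown above. It is a plain-Python stand-in written for this explanation, not mlxtend's implementation. It also estimates the size of the dense matrix for this dataset, assuming one byte per boolean cell (as in a NumPy `bool` array).

```python
# Made-up baskets for illustration only.
baskets = [
    ["Banana", "Soda"],
    ["Banana", "Organic Whole Milk"],
    ["Soda"],
]

# One column per distinct item, in sorted order.
columns = sorted({item for basket in baskets for item in basket})

# One row per basket: True where the basket contains the column's item.
matrix = [[col in set(basket) for col in columns] for basket in baskets]

print(columns)
for row in matrix:
    print(row)

# Dense size for the real data, using the counts reported earlier in this
# notebook (761900 transactions, 31148 products).
# Assumption: 1 byte per cell.
n_transactions, n_products = 761_900, 31_148
dense_bytes = n_transactions * n_products
print(f"{dense_bytes / 1e9:.1f} GB")
```

At that size a sparse representation may be worth considering. mlxtend's `TransactionEncoder.transform` accepts a `sparse=True` argument, although that variant was not tested in this notebook.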


#### 2.2. Find frequent itemsets.

In [13]:
freq_items_fp = fpgrowth(encoded_df, min_support=0.01, use_colnames=True)

In [14]:
freq_items_fp

Unnamed: 0,support,itemsets
0,0.012168,(Soda)
1,0.043850,(Organic Whole Milk)
2,0.010637,(Organic Broccoli Florets)
3,0.151427,(Banana)
4,0.011008,(2% Reduced Fat Milk)
...,...,...
73,0.013132,"(Banana, Strawberries)"
74,0.012948,"(Organic Raspberries, Bag of Organic Bananas)"
75,0.011062,"(Organic Strawberries, Organic Raspberries)"
76,0.010427,"(Banana, Organic Fuji Apple)"
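
A `min_support` of 0.01 keeps only itemsets that appear in at least 1% of all transactions. The sketch below shows what "support" measures by brute-force counting every single item and every pair in a handful of made-up baskets. This is only meant to clarify the threshold; it is not how FP-Growth works internally, since FP-Growth avoids enumerating candidates. The helper name `frequent_itemsets` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force support counting for itemsets of up to max_size items."""
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))
        for size in range(1, max_size + 1):
            for itemset in combinations(unique, size):
                counts[itemset] += 1
    n = len(transactions)
    # Keep only itemsets whose share of transactions reaches the threshold.
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


# Made-up baskets for illustration only.
baskets = [
    ["Banana", "Strawberries"],
    ["Banana", "Strawberries", "Soda"],
    ["Banana"],
    ["Soda"],
]

result = frequent_itemsets(baskets, min_support=0.5)
for itemset, support in sorted(result.items()):
    print(itemset, support)
```

For this dataset, 1% of the 761,900 transactions corresponds to roughly 7,619 orders, so every itemset listed above appears in at least that many baskets.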


### Product Rules:

In [15]:
rules_fp = association_rules(freq_items_fp, metric='confidence', min_threshold=0.1)
rules_fp

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Organic Baby Spinach),(Organic Strawberries),0.078472,0.083993,0.012322,0.157021,1.869467,0.005731,1.086632
1,(Organic Strawberries),(Organic Baby Spinach),0.083993,0.078472,0.012322,0.146701,1.869467,0.005731,1.079959
2,(Organic Baby Spinach),(Banana),0.078472,0.151427,0.016753,0.213488,1.409842,0.00487,1.078907
3,(Banana),(Organic Baby Spinach),0.151427,0.078472,0.016753,0.110633,1.409842,0.00487,1.036162
4,(Organic Baby Spinach),(Bag of Organic Bananas),0.078472,0.12146,0.016469,0.209875,1.727941,0.006938,1.1119
5,(Bag of Organic Bananas),(Organic Baby Spinach),0.12146,0.078472,0.016469,0.135595,1.727941,0.006938,1.066084
6,(Organic Strawberries),(Bag of Organic Bananas),0.083993,0.12146,0.019176,0.228303,1.879661,0.008974,1.138452
7,(Bag of Organic Bananas),(Organic Strawberries),0.12146,0.083993,0.019176,0.157878,1.879661,0.008974,1.087737
8,(Banana),(Organic Strawberries),0.151427,0.083993,0.017665,0.116657,1.3889,0.004946,1.036979
9,(Organic Strawberries),(Banana),0.083993,0.151427,0.017665,0.210317,1.3889,0.004946,1.074574
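
The metric columns can be recomputed from the three support values alone. The sketch below applies the standard definitions to rule 0 of the table above: `confidence = support / antecedent support`, `lift = confidence / consequent support`, `leverage = support - antecedent support * consequent support`, and `conviction = (1 - consequent support) / (1 - confidence)`. The function name `rule_metrics` is hypothetical, and because the inputs are the rounded values printed in the table, the results agree with the table only to about four decimal places.

```python
def rule_metrics(antecedent_support, consequent_support, support):
    """Derive confidence, lift, leverage and conviction from three supports."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    leverage = support - antecedent_support * consequent_support
    # Conviction is unbounded when the rule always holds (confidence == 1).
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Rule 0 from the table: (Organic Baby Spinach) -> (Organic Strawberries)
metrics = rule_metrics(
    antecedent_support=0.078472,
    consequent_support=0.083993,
    support=0.012322,
)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A lift above 1 means the two products appear together more often than would be expected if they were bought independently. Here, spinach and strawberries co-occur about 1.87 times as often as independence would predict.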


## Look for associations between aisles.

In [19]:
aisle_df = instacart_ba[['order_id','aisle']]
aisle_df.head()

Unnamed: 0,order_id,aisle
0,6,laundry
1,8,buns rolls
2,13,other creams cheeses
3,13,energy sports drinks
4,13,soft drinks


Get the list of aisles for each order_id.

In [21]:
# Group once by order_id; sort=False keeps the same order as order_num.
aisle_lst = aisle_df.groupby('order_id', sort=False)['aisle'].apply(list).tolist()

In [23]:
aisle_df_fp = pd.DataFrame(order_num, columns = ['num_order'])
aisle_df_fp['aisles'] = aisle_lst
aisle_df_fp.head()

Unnamed: 0,num_order,aisles
0,6,[laundry]
1,8,[buns rolls]
2,13,"[other creams cheeses, energy sports drinks, s..."
3,14,"[packaged cheese, frozen breakfast, frozen pro..."
4,22,"[milk, fresh vegetables, eggs, fresh fruits, f..."


#### Encode data.

In [24]:
te_ary_2 = te.fit(aisle_lst).transform(aisle_lst)
te_ary_2

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ...,  True, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [25]:
aisle_enc = pd.DataFrame(te_ary_2, columns = te.columns_)
aisle_enc.head()

Unnamed: 0,air fresheners candles,asian foods,baby accessories,baby bath body care,baby food formula,bakery desserts,baking ingredients,baking supplies decor,beauty,beers coolers,...,spreads,tea,tofu meat alternatives,tortillas flat bread,trail mix snack mix,trash bags liners,vitamins supplements,water seltzer sparkling water,white wines,yogurt
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
3,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


#### Apply FP Growth

In [31]:
freq_items_aisles = fpgrowth(aisle_enc, min_support=0.1, use_colnames=True)
freq_items_aisles

Unnamed: 0,support,itemsets
0,0.123242,(chips pretzels)
1,0.202859,(milk)
2,0.177834,(packaged cheese)
3,0.101806,(frozen produce)
4,0.482386,(fresh fruits)
5,0.363032,(fresh vegetables)
6,0.127992,(bread)
7,0.103224,(eggs)
8,0.197828,(yogurt)
9,0.277644,(packaged vegetables fruits)


### Aisle Rules:

In [33]:
rules_aisles = association_rules(freq_items_aisles, metric='confidence', min_threshold=0.3)
rules_aisles

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(milk),(fresh fruits),0.202859,0.482386,0.119233,0.587766,1.218456,0.021377,1.255632
1,(packaged cheese),(fresh fruits),0.177834,0.482386,0.103821,0.583806,1.210246,0.018036,1.243683
2,(fresh fruits),(fresh vegetables),0.482386,0.363032,0.223479,0.463279,1.276139,0.048358,1.186777
3,(fresh vegetables),(fresh fruits),0.363032,0.482386,0.223479,0.615592,1.276139,0.048358,1.346521
4,(yogurt),(fresh fruits),0.197828,0.482386,0.125096,0.63235,1.31088,0.029667,1.4079
5,(fresh vegetables),(packaged vegetables fruits),0.363032,0.277644,0.149676,0.412294,1.484973,0.048882,1.229111
6,(packaged vegetables fruits),(fresh vegetables),0.277644,0.363032,0.149676,0.539092,1.484973,0.048882,1.381987
7,(fresh fruits),(packaged vegetables fruits),0.482386,0.277644,0.179835,0.372802,1.342734,0.045903,1.15172
8,(packaged vegetables fruits),(fresh fruits),0.277644,0.482386,0.179835,0.647716,1.342734,0.045903,1.46931
9,"(fresh fruits, fresh vegetables)",(packaged vegetables fruits),0.223479,0.277644,0.105623,0.472629,1.702283,0.043575,1.369729
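
Because lift is symmetric, most rules above appear twice, once in each direction. One way to make the table easier to scan is to keep, for each unordered set of aisles, only the direction with the higher confidence, and then sort by lift. The sketch below does this in plain Python on a few rows copied from the table; the tuple layout and variable names are assumptions made for this example.

```python
# (antecedents, consequents, confidence, lift), copied from the table above.
rules = [
    (("fresh fruits",), ("fresh vegetables",), 0.463279, 1.276139),
    (("fresh vegetables",), ("fresh fruits",), 0.615592, 1.276139),
    (("fresh vegetables",), ("packaged vegetables fruits",), 0.412294, 1.484973),
    (("packaged vegetables fruits",), ("fresh vegetables",), 0.539092, 1.484973),
    (("yogurt",), ("fresh fruits",), 0.63235, 1.31088),
]

best = {}
for antecedents, consequents, confidence, lift in rules:
    # Both directions of a rule share the same unordered set of aisles.
    key = frozenset(antecedents) | frozenset(consequents)
    # Keep the direction with the higher confidence.
    if key not in best or confidence > best[key][2]:
        best[key] = (antecedents, consequents, confidence, lift)

# Strongest associations first.
ranked = sorted(best.values(), key=lambda rule: rule[3], reverse=True)
for antecedents, consequents, confidence, lift in ranked:
    print(f"{antecedents} -> {consequents}: conf={confidence:.3f}, lift={lift:.3f}")
```

All lifts in the aisle table fall between roughly 1.2 and 1.7, so these associations are real but moderate: the aisles involved are popular on their own, which limits how much above chance their co-occurrence can be.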
