# Instacart 

### Basket Analysis

The main goal of this project is to carry out an analysis of the shopping cart of Instacart users and identify the relationships between different types of products (which products are usually purchased together?). 

To do this, we will use the FP-Growth algorithm (a faster alternative to the Apriori algorithm that avoids explicit candidate generation) to find patterns or associations between the purchased products.

The main challenge of this dataset is that it contains highly detailed products (including brand, version, etc.) rather than generic ones. For this reason, I decided to run four FP-Growth experiments:

* Experiment 1: Look for patterns between top 1000 detailed products (in this notebook).
* Experiment 2: Look for patterns between ALL detailed products (in this notebook).
* Experiment 3: Look for patterns between aisles (in this notebook).
* Experiment 4: Transform detailed products into generic ones and apply FP Growth (in notebook 3.1 Improved Basket Analysis - see folder 'notebooks').




Libraries:

In [1]:
import pandas as pd
import numpy as np
import yaml

from functions import pack_items_by_order

from mlxtend.frequent_patterns import fpgrowth, association_rules
from mlxtend.preprocessing import TransactionEncoder

YAML parameters:

In [2]:
try:
    with open("./../params.yaml", 'r') as file:
        config = yaml.safe_load(file)
except (OSError, yaml.YAMLError) as e:
    # Later cells need `config`, so report the problem and stop here
    print(f'Error reading the config file: {e}')
    raise

In [3]:
instacart = pd.read_csv(config['data']['instacart_sample'])
print(instacart.shape)
instacart.drop('Unnamed: 0', axis=1, inplace=True)
instacart.head()

(5204393, 15)


Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,department_id,aisle_id,aisle,department
0,6,22352,4,1,12,30.0,15873,2,0,Dryer Sheets Geranium Scent,17,75.0,laundry,household
1,8,3107,5,4,6,17.0,23423,1,1,Original Hawaiian Sweet Rolls,3,43.0,buns rolls,bakery
2,13,45082,2,6,17,1.0,3800,12,0,Hampshire 100% Natural Sour Cream,16,108.0,other creams cheeses,dairy eggs
3,13,45082,2,6,17,1.0,25783,7,0,Lemon Lime Thirst Quencher,7,64.0,energy sports drinks,beverages
4,13,45082,2,6,17,1.0,23020,10,0,Diet Tonic Water,7,77.0,soft drinks,beverages


In [4]:
instacart_ba = instacart.copy()

In [5]:
prods = instacart_ba[['order_id', 'product_name']].reset_index(drop=True)
prods.sort_values(by='order_id')

Unnamed: 0,order_id,product_name
0,6,Dryer Sheets Geranium Scent
1,8,Original Hawaiian Sweet Rolls
2,13,Hampshire 100% Natural Sour Cream
3,13,Lemon Lime Thirst Quencher
4,13,Diet Tonic Water
...,...,...
5204391,3421083,All Natural French Toast Sticks
5204387,3421083,Organic Mixed Berry Yogurt & Fruit Snack
5204386,3421083,Banana
5204388,3421083,Freeze Dried Mango Slices


### Experiment 1: Top 1000 purchased products.

In [6]:
# value_counts() already sorts by frequency in descending order
prods_to_keep = prods['product_name'].value_counts().head(1000)
prods_to_keep = prods_to_keep.index.tolist()
prods_to_keep[0:10]

['Banana',
 'Bag of Organic Bananas',
 'Organic Strawberries',
 'Organic Baby Spinach',
 'Strawberries',
 'Limes',
 'Organic Raspberries',
 'Organic Whole Milk',
 'Organic Yellow Onion',
 'Organic Garlic']

In [7]:
prods_for_fp = prods.loc[prods['product_name'].isin(prods_to_keep)]
prods_for_fp

Unnamed: 0,order_id,product_name
7,13,Soda
11,14,Organic Mini Homestyle Waffles
12,14,Organic Broccoli Florets
13,14,Naturals Chicken Nuggets
14,14,Sriracha Chili Sauce
...,...,...
5204385,3421068,Strawberries
5204386,3421083,Banana
5204387,3421083,Organic Mixed Berry Yogurt & Fruit Snack
5204390,3421083,Organic Strawberry Yogurt & Fruit Snack


#### Group products by order_id

FP-Growth takes one list of products per transaction (order), so the rows must first be grouped by `order_id`.
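
The helper `pack_items_by_order` comes from the project's `functions` module and its implementation is not shown in this notebook. As a rough illustration of the grouping step only, here is a minimal pure-Python sketch; the function name `group_items_by_order` and the toy rows are hypothetical and are not the project's code.

```python
from collections import defaultdict


def group_items_by_order(rows):
    """Collect the items of each order into one list.

    `rows` is an iterable of (order_id, item) pairs. Orders appear in the
    result in the order in which their id is first seen.
    """
    baskets = defaultdict(list)
    for order_id, item in rows:
        baskets[order_id].append(item)
    return dict(baskets)


# Toy rows (hypothetical), shaped like the order_id / product_name columns
rows = [
    (13, "Soda"),
    (14, "Organic Broccoli Florets"),
    (14, "Sriracha Chili Sauce"),
    (13, "Diet Tonic Water"),
]

baskets = group_items_by_order(rows)
# One list of products per order, the shape expected by the encoder
transactions = list(baskets.values())
print(transactions)
```

With pandas, a similar result could probably be obtained with `prods.groupby('order_id')['product_name'].apply(list)`, although that may not match what `pack_items_by_order` returns.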

In [8]:
order_num = prods['order_id'].unique()
prod_lst_1k = pack_items_by_order(order_num, prods_for_fp, 'product_name')

#### Test FP Growth Algorithm

##### 1. Encode data

In [9]:
te = TransactionEncoder()
te_ary_1k = te.fit(prod_lst_1k).transform(prod_lst_1k)
te_ary_1k

In [None]:
encoded_df_1k = pd.DataFrame(te_ary_1k, columns = te.columns_)
encoded_df_1k.head()
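
To make the encoding step more concrete, the sketch below builds the same kind of boolean matrix by hand: one row per transaction and one column per distinct item. It is only an illustration on toy data, not mlxtend's implementation, and the function name `one_hot_encode` is an assumption introduced here.

```python
def one_hot_encode(transactions):
    """Return (columns, matrix) for a list of item lists.

    `columns` holds the distinct items in sorted order; `matrix[i][j]` is
    True when transaction i contains item `columns[j]`.
    """
    columns = sorted({item for basket in transactions for item in basket})
    matrix = []
    for basket in transactions:
        present = set(basket)  # set lookup keeps the membership test cheap
        matrix.append([item in present for item in columns])
    return columns, matrix


# Toy transactions (hypothetical)
transactions = [
    ["Banana", "Limes"],
    ["Banana"],
    ["Limes", "Strawberries"],
]

columns, matrix = one_hot_encode(transactions)
print(columns)
for row in matrix:
    print(row)
```

On the real data the matrix has one column per product kept, so it is wide and mostly `False`.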

##### 2. Apply FP Growth

In [None]:
freq_items_fp_1k = fpgrowth(encoded_df_1k, min_support=0.01, use_colnames=True)
freq_items_fp_1k
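
With `min_support=0.01`, an itemset is kept only if it appears in at least 1% of the orders. The sketch below shows that threshold on toy data using brute-force counting of single items and pairs. It is not FP-Growth (which reaches the same itemsets without enumerating candidates), and the name `frequent_itemsets` and the `max_size` parameter are assumptions made for this illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force support counting for itemsets of up to `max_size` items.

    Returns {frozenset(items): support}, where support is the fraction of
    transactions that contain every item of the itemset.
    """
    n = len(transactions)
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicates within one order
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[frozenset(itemset)] += 1
    return {
        itemset: count / n
        for itemset, count in counts.items()
        if count / n >= min_support
    }


# Toy transactions (hypothetical)
transactions = [
    ["Banana", "Limes"],
    ["Banana", "Strawberries"],
    ["Banana", "Limes", "Strawberries"],
    ["Limes"],
]

supports = frequent_itemsets(transactions, min_support=0.5)
for itemset, support in supports.items():
    print(sorted(itemset), support)
```

Brute-force counting like this grows quickly with basket size, which is one reason to prefer FP-Growth on a dataset of this scale.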

##### 3. Rules:

In [None]:
# Use a separate name so Experiment 2 does not overwrite these rules
rules_fp_1k = association_rules(freq_items_fp_1k, metric='confidence', min_threshold=0.1)
rules_fp_1k
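
Here `metric='confidence'` with `min_threshold=0.1` keeps the rules whose confidence is at least 0.1. As a reminder of what the main rule metrics mean, the sketch below computes support, confidence and lift for a single rule on toy data. The function name `rule_metrics` is hypothetical, and the code assumes the antecedent and the consequent each occur in at least one transaction.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent.

    Assumes both sides occur in at least one transaction; otherwise the
    divisions below would fail.
    """
    n = len(transactions)
    antecedent, consequent = set(antecedent), set(consequent)
    baskets = [set(basket) for basket in transactions]

    n_antecedent = sum(antecedent <= basket for basket in baskets)
    n_consequent = sum(consequent <= basket for basket in baskets)
    n_both = sum((antecedent | consequent) <= basket for basket in baskets)

    support = n_both / n                    # share of orders with both sides
    confidence = n_both / n_antecedent      # P(consequent | antecedent)
    lift = confidence / (n_consequent / n)  # above 1: positive association
    return {"support": support, "confidence": confidence, "lift": lift}


# Toy transactions (hypothetical)
transactions = [
    ["Banana", "Limes"],
    ["Banana", "Strawberries"],
    ["Banana", "Limes", "Strawberries"],
    ["Limes"],
]

metrics = rule_metrics(transactions, ["Strawberries"], ["Banana"])
print(metrics)
```

A popular consequent such as bananas can push confidence up on its own, so lift is usually worth checking alongside it.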

### Experiment 2: All products.

In [None]:
prod_lst = pack_items_by_order(order_num, prods, 'product_name')

In [None]:
basket = pd.DataFrame(order_num, columns = ['transaction'])
basket.head()

In [None]:
basket['products_ordered'] = list(prod_lst)
basket.head()

#### Test FP Growth Algorithm

##### 1. Encode data

In [None]:
te = TransactionEncoder()
te_ary = te.fit(prod_lst).transform(prod_lst)
te_ary

In [None]:
encoded_df = pd.DataFrame(te_ary, columns = te.columns_)
encoded_df.head()

##### 2. Apply FP Growth

In [None]:
freq_items_fp = fpgrowth(encoded_df, min_support=0.01, use_colnames=True)

In [None]:
freq_items_fp

##### 3. Rules:

In [None]:
rules_fp = association_rules(freq_items_fp, metric='confidence', min_threshold=0.1)
rules_fp

### Experiment 3: Associations between aisles.

In [None]:
aisle_df = instacart_ba[['order_id','aisle']]
aisle_df.head()

Get list of aisles by order_id.

In [None]:
aisle_lst = pack_items_by_order(order_num, aisle_df, 'aisle')

In [None]:
aisle_df_fp = pd.DataFrame(order_num, columns = ['num_order'])
aisle_df_fp['aisles'] = list(aisle_lst)
aisle_df_fp.head()

#### Encode data.

In [None]:
te_ary_2 = te.fit(aisle_lst).transform(aisle_lst)
te_ary_2

In [None]:
aisle_enc = pd.DataFrame(te_ary_2, columns = te.columns_)
aisle_enc.head()

#### Apply FP Growth

In [None]:
freq_items_aisles = fpgrowth(aisle_enc, min_support=0.1, use_colnames=True)
freq_items_aisles

#### Aisle Rules:

In [None]:
rules_aisles = association_rules(freq_items_aisles, metric='confidence', min_threshold=0.3)
rules_aisles