<a href="https://colab.research.google.com/github/arimitramaiti/notebooks/blob/master/Group3_Assignment1_PySpark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### <center>ARM Pyspark Implementation</center>
<center>Machine Learning with Big Data (Week-1)</center>
<center>ePGD ABA 2020-21</center>

**Group-3 includes: Nitin Raheja, Rohan Singh, Anand Dattani and Arimitra Maiti**

You are given an online grocery retail dataset. The dataset primarily contains the following transactional attributes: (a) User identifier, (b) Order identifier, (c) Product identifier, (d) Product name, (e) Aisle, or location where the product is placed, and (f) Department, or category of the product. You are expected to apply rule mining methods on the given dataset, explore interesting patterns and answer the following queries:

In [None]:
!pip install --quiet mlxtend
!pip install mlxtend --upgrade --no-deps
!pip install --quiet pyspark

Collecting mlxtend
  Downloading https://files.pythonhosted.org/packages/86/30/781c0b962a70848db83339567ecab656638c62f05adb064cb33c0ae49244/mlxtend-0.18.0-py2.py3-none-any.whl (1.3MB)
Installing collected packages: mlxtend
  Found existing installation: mlxtend 0.14.0
    Uninstalling mlxtend-0.14.0:
      Successfully uninstalled mlxtend-0.14.0
Successfully installed mlxtend-0.18.0
  Building wheel for pyspark (setup.py) ... done


In [None]:
from pyspark.ml.fpm import FPGrowth
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules
import pyspark
pd.set_option('max_colwidth', 300)

In [None]:
##Download dataset from shared source
url = "https://raw.githubusercontent.com/arimitramaiti/datasets/master/articles/Online%20Grocery%20Retail%20Customer%20Data.csv"
#on_bad_lines="skip" replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in pandas 2.0
data = pd.read_csv(url, on_bad_lines="skip", header=0, index_col=None)

In [None]:
from pyspark import SparkContext
from pyspark.sql import SQLContext
##Declare instance of Spark and SQL
sc = SparkContext(appName="PythonStreamingQueueStream")
sqlContext = SQLContext(sc)

In [None]:
##Convert pandas data frame into Spark dataframe for big data context
sparkDf = sqlContext.createDataFrame(data, ["uid", "oid", "pid", "pname", "aisle", "department", "aisle_department", "pname_aisle_department"])

In [None]:
##Use collect_list or collect_set to combine the items of each transaction into a single row
##pname is the most granular level, so an order should not contain duplicate product names; collect_list is therefore used here
from pyspark.sql.functions import collect_list

In [None]:
#Count distinct transactions (orders); this count is the denominator when converting itemset frequencies into support
from pyspark.sql.functions import col, countDistinct
sparkDf.agg(countDistinct(col("oid")).alias("count")).show()

+-----+
|count|
+-----+
| 3412|
+-----+
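
The distinct order count above is the denominator for every support value used in this notebook. A minimal sketch of that calculation in plain Python, using made-up toy rows rather than the grocery data; the `support` helper is an illustrative name of ours, not part of any library:

```python
# Toy (order id, product) rows standing in for the real transactions.
rows = [
    (1, "bananas"), (1, "strawberries"),
    (2, "bananas"),
    (3, "bananas"), (3, "strawberries"), (3, "avocado"),
    (4, "avocado"),
]

# Number of distinct orders: the denominator for every support value.
n_orders = len({oid for oid, _ in rows})


def support(itemset, rows, n_orders):
    """Fraction of orders that contain every item in `itemset`."""
    baskets = {}
    for oid, item in rows:
        baskets.setdefault(oid, set()).add(item)
    hits = sum(1 for items in baskets.values() if set(itemset) <= items)
    return hits / n_orders


pair_support = support({"bananas", "strawberries"}, rows, n_orders)
print(n_orders, pair_support)
```

In the notebook the same role is played by dividing each itemset frequency by 3412, the distinct order count shown above.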



__1) Identify at least six interesting rules from the given dataset. Explain how you determined minimum threshold levels for rule mining__

_All six rules below combine high confidence (above 0.53), lift above 2 and conviction above 1.6, and each has support above the 2% minimum support threshold_

In [None]:
#Convert long transactions from big data into a wide array of items
mylist = sparkDf.groupBy(sparkDf.oid).agg(collect_list('pname'))

In [None]:
mylist

DataFrame[oid: bigint, collect_list(pname): array<string>]

In [None]:
#Run FPGrowth model using spark big list of items
fpGrowth = FPGrowth(itemsCol="collect_list(pname)", minSupport=0.02)
model = fpGrowth.fit(mylist)

In [None]:
# Store frequent itemsets.
results = model.freqItemsets.collect()
# Store frequent itemsets in pandas dataframe.
frequent_items = pd.DataFrame(results, columns=["itemsets", "freq"])
# Store length of each itemset in pandas dataframe.
frequent_items['size'] = frequent_items["itemsets"].apply(lambda x: len(x))
# Store support value which is an input for  association_rules function.
frequent_items['support'] = frequent_items["freq"]/3412
#Generate rules from the frequent itemsets; the 2% support threshold matches minSupport above
rules = association_rules(frequent_items, metric="support", min_threshold=0.02)
#filter rules based on measures
rules[ (rules['confidence'] > 0.53) & (rules['lift'] > 1) ].sort_values(by=['confidence', 'lift'], ascending=[False, False])

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
205,(Icelandic Style Skyr Blueberry Non-fat Yogurt),(Nonfat Icelandic Style Strawberry Yogurt),0.032239,0.034877,0.026671,0.827273,23.719786,0.025546,5.587555
204,(Nonfat Icelandic Style Strawberry Yogurt),(Icelandic Style Skyr Blueberry Non-fat Yogurt),0.034877,0.032239,0.026671,0.764706,23.719786,0.025546,4.112984
107,"(Bag of Organic Bananas, Organic Whole String Cheese)",(Organic Strawberries),0.034291,0.250293,0.021102,0.615385,2.458656,0.012519,1.949238
203,(Organic Garnet Sweet Potato (Yam)),(Bag of Organic Bananas),0.048652,0.248828,0.028722,0.590361,2.372572,0.016616,1.833744
113,(Organic Whole String Cheese),(Organic Strawberries),0.080012,0.250293,0.045721,0.571429,2.283038,0.025695,1.749316
34,"(Organic Hass Avocado, Organic Raspberries)",(Bag of Organic Bananas),0.041618,0.248828,0.022567,0.542254,2.179233,0.012212,1.641023
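
The measures in the table above follow from three support values per rule. A short sketch, assuming the standard definitions of confidence, lift, leverage and conviction, that recomputes them for the first rule (blueberry skyr => strawberry yogurt) from the rounded supports printed in the table; `rule_measures` is an illustrative helper name:

```python
def rule_measures(sup_antecedent, sup_consequent, sup_both):
    """Recompute rule measures from the three support values."""
    confidence = sup_both / sup_antecedent
    lift = confidence / sup_consequent
    leverage = sup_both - sup_antecedent * sup_consequent
    # Conviction is unbounded when the rule never fails (confidence == 1).
    if confidence < 1:
        conviction = (1 - sup_consequent) / (1 - confidence)
    else:
        conviction = float("inf")
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Rounded supports copied from the first row of the table.
m = rule_measures(0.032239, 0.034877, 0.026671)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Because the inputs are rounded to six decimals, the recomputed values agree with the table only up to rounding.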


__2) What product bundles would you consider for boosting sales of ‘Moroccan Mint Green Tea’?__

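_The consequents below have high support and are positively correlated with one another (lift above 1), so bundling Moroccan Mint Green Tea with them could boost its sales through the higher bundle value. The antecedent column is relabelled as the tea for presentation; the measures are those of the underlying rules among these high-support products. Each pairing appears in both directions (A implies B and B implies A): lift is the same either way, while confidence differs by direction_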

In [None]:
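#Work on a copy so the mined rules are not overwritten
bundle_candidates = rules.copy()
#Relabel every antecedent as the target product; the measures remain those of the original mined rules
bundle_candidates['antecedents']="Moroccan Mint Green Tea"
#filter rules based on measures
bundle_candidates[ (bundle_candidates['consequent support'] > 0.20) & (bundle_candidates['lift'] > 1) & (bundle_candidates['antecedent support'] > 0.20) ]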

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,Moroccan Mint Green Tea,(Organic Strawberries),0.248828,0.250293,0.074443,0.299176,1.195301,0.012163,1.06975
1,Moroccan Mint Green Tea,(Bag of Organic Bananas),0.250293,0.248828,0.074443,0.297424,1.195301,0.012163,1.069169
2,Moroccan Mint Green Tea,(Organic Hass Avocado),0.248828,0.213658,0.08558,0.343934,1.609743,0.032416,1.198572
3,Moroccan Mint Green Tea,(Bag of Organic Bananas),0.213658,0.248828,0.08558,0.400549,1.609743,0.032416,1.2531
10,Moroccan Mint Green Tea,(Organic Hass Avocado),0.250293,0.213658,0.082943,0.331382,1.550994,0.029466,1.176071
11,Moroccan Mint Green Tea,(Organic Strawberries),0.213658,0.250293,0.082943,0.388203,1.550994,0.029466,1.225418
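
The point that a rule can be read in both directions holds for lift but not for confidence. A small sketch using the rounded supports behind rows 2 and 3 of the table above (Bag of Organic Bananas at 0.248828, Organic Hass Avocado at 0.213658, joint support 0.08558); the variable names are ours:

```python
sup_bananas = 0.248828   # support of Bag of Organic Bananas
sup_avocado = 0.213658   # support of Organic Hass Avocado
sup_both = 0.08558       # support of the pair

# Confidence depends on which item is the antecedent.
conf_bananas_to_avocado = sup_both / sup_bananas
conf_avocado_to_bananas = sup_both / sup_avocado

# Lift is the same in both directions.
lift_bananas_to_avocado = conf_bananas_to_avocado / sup_avocado
lift_avocado_to_bananas = conf_avocado_to_bananas / sup_bananas

print(conf_bananas_to_avocado, conf_avocado_to_bananas)
print(lift_bananas_to_avocado, lift_avocado_to_bananas)
```

For bundling, this suggests ranking partners by lift and using confidence only to decide which product should anchor the promotion.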


__3) How would you leverage the aisle and department level information in your analysis? Prepare rule mining workflows using aisle and department level information to demonstrate your ideas.__

__Aisle level__

_From our local implementation, the density at aisle level is about 10%, far higher than the product-level density of 0.03%. This justifies treating aisle as a higher level than product. The minimum support threshold should sit above the 10% density, so we set it between 20% and 30% at aisle level_
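
The density figures quoted above come from a local run that is not shown here. A minimal sketch of one common way to compute density, assuming it means the share of filled cells in the order-by-item matrix; the toy baskets, the aisle mapping and the `density` helper are all illustrative:

```python
def density(baskets):
    """Share of filled cells in the order-by-item matrix."""
    all_items = set().union(*baskets.values())
    filled = sum(len(items) for items in baskets.values())
    return filled / (len(baskets) * len(all_items))


# Toy product-level baskets keyed by order id.
product_baskets = {
    1: {"banana", "apple", "milk"},
    2: {"apple", "yogurt"},
    3: {"banana"},
}

# Hypothetical product-to-aisle mapping used to roll the baskets up.
aisle_of = {"banana": "fruit", "apple": "fruit", "milk": "dairy", "yogurt": "dairy"}
aisle_baskets = {
    oid: {aisle_of[item] for item in items}
    for oid, items in product_baskets.items()
}

product_density = density(product_baskets)
aisle_density = density(aisle_baskets)
print(product_density, aisle_density)
```

Rolling products up to aisles shrinks the number of columns faster than the number of filled cells, which is why density rises at the coarser level.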

In [None]:
##collect_list keeps duplicates, whereas collect_set removes them; collect_set is used from here on because several products in one order can share an aisle or department
from pyspark.sql.functions import collect_set
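
The difference between the two aggregations can be seen without Spark. The sketch below mimics them in plain Python on made-up rows; to our understanding, Spark's FPGrowth rejects transactions that contain repeated items, which is why the set form matters at aisle level:

```python
from collections import defaultdict

# Toy (order id, aisle) rows: order 1 has two products from the same aisle.
rows = [
    (1, "fresh fruits"),
    (1, "fresh fruits"),
    (1, "milk"),
    (2, "milk"),
]

as_list = defaultdict(list)  # analogue of collect_list: keeps duplicates
as_set = defaultdict(set)    # analogue of collect_set: removes duplicates
for oid, aisle in rows:
    as_list[oid].append(aisle)
    as_set[oid].add(aisle)

print(as_list[1])
print(sorted(as_set[1]))
```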

In [None]:
##At aisle level which is one level higher than product name
#Convert long transactions from big data into a wide array of items
mylist_aisle = sparkDf.groupBy(sparkDf.oid).agg(collect_set('aisle'))

In [None]:
mylist_aisle

DataFrame[oid: bigint, collect_set(aisle): array<string>]

In [None]:
#Run FPGrowth model using spark big list of items
fpGrowth = FPGrowth(itemsCol="collect_set(aisle)", minSupport=0.20)
model = fpGrowth.fit(mylist_aisle)

In [None]:
# Store frequent itemsets.
results_aisle = model.freqItemsets.collect()
# Store frequent itemsets in pandas dataframe.
frequent_items_aisle = pd.DataFrame(results_aisle, columns=["itemsets", "freq"])
# Store length of each itemset in pandas dataframe.
frequent_items_aisle['size'] = frequent_items_aisle["itemsets"].apply(lambda x: len(x))

# Store support value which is an input for  association_rules function.
frequent_items_aisle['support'] = frequent_items_aisle["freq"]/3412
#Generate rules from the frequent itemsets

##30% is set in association rules function to limit the number of rows
rules_aisle = association_rules(frequent_items_aisle, metric="support", min_threshold=0.30)
rules_aisle = rules_aisle.sort_values(["support", "confidence", "lift", "conviction"], ascending = (False, False, False, False))

#filter rules based on measures
rules_aisle[ (rules_aisle['confidence'] > 0.90) & (rules_aisle['lift'] > 1) & (rules_aisle['conviction'] > 2) ].sort_values(by=['confidence', 'lift', 'conviction'], ascending=[False, False, False])

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
58,"(milk, fresh vegetables)",(fresh fruits),0.338804,0.830598,0.316823,0.935121,1.125841,0.035413,2.611051
50,"(packaged vegetables fruits, milk)",(fresh fruits),0.338218,0.830598,0.314771,0.930676,1.120489,0.033848,2.443625
40,"(packaged cheese, fresh vegetables)",(fresh fruits),0.366061,0.830598,0.340563,0.930344,1.12009,0.036513,2.431991
33,"(packaged cheese, packaged vegetables fruits)",(fresh fruits),0.354631,0.830598,0.329132,0.928099,1.117387,0.034577,2.356052
23,"(fresh vegetables, yogurt)",(fresh fruits),0.374853,0.830598,0.34789,0.928069,1.11735,0.036537,2.355058
4,"(packaged vegetables fruits, fresh vegetables)",(fresh fruits),0.504396,0.830598,0.466589,0.925044,1.113708,0.047638,2.260008


__Department level__

_From our local implementation, the density at department level is about 35%, higher than the product-level density of 0.03%. This justifies treating department as a higher level than aisle or product. The minimum support threshold should sit above the 35% density, so we set it to 60% at department level_

In [None]:
##At department level, which is above both aisle and product name
#Convert long transactions from big data into a wide array of items
mylist_department = sparkDf.groupBy(sparkDf.oid).agg(collect_set('department'))

In [None]:
mylist_department

DataFrame[oid: bigint, collect_set(department): array<string>]

In [None]:
#Run FPGrowth model using spark big list of items
fpGrowth = FPGrowth(itemsCol="collect_set(department)", minSupport=0.60)
model = fpGrowth.fit(mylist_department)

In [None]:
# Store frequent itemsets.
results_department = model.freqItemsets.collect()
# Store frequent itemsets in pandas dataframe.
frequent_items_department = pd.DataFrame(results_department, columns=["itemsets", "freq"])
# Store length of each itemset in pandas dataframe.
frequent_items_department['size'] = frequent_items_department["itemsets"].apply(lambda x: len(x))

# Store support value which is an input for  association_rules function.
frequent_items_department['support'] = frequent_items_department["freq"]/3412
#Generate rules from the frequent itemsets; the 60% support threshold matches minSupport above
rules_department = association_rules(frequent_items_department, metric="support", min_threshold=0.60)
rules_department = rules_department.sort_values(["support", "confidence", "lift", "conviction"], ascending = (False, False, False, False))

#filter rules based on measures
rules_department[ (rules_department['confidence'] > 0.90) & (rules_department['lift'] > 1) & (rules_department['conviction'] > 2) ].sort_values(by=['confidence', 'lift', 'conviction'], ascending=[False, False, False])

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
5,"(snacks, produce)",(dairy eggs),0.659437,0.917937,0.63687,0.965778,1.052118,0.031548,2.397954
15,"(beverages, produce)",(dairy eggs),0.627784,0.917937,0.603751,0.961718,1.047695,0.027485,2.143654


__Combined Product, Aisle and Department levels__

_From our local implementation, the density at the combined level equals the product-level density of 0.03%, so the combined level is NOT a higher level than product, aisle or department. The minimum support threshold must still exceed the 0.03% density; we set it to 3%, slightly higher than the 2% used above for the product-level rules_
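
Why the combined level has the same density as the product level can be sketched in a few lines, assuming each product belongs to exactly one aisle and one department and that the combined key joins the three fields with `::` (the separator visible in the combined-level output of this notebook):

```python
# (product, aisle, department) triples that appear in this notebook's outputs.
rows = [
    ("Organic Strawberries", "fresh fruits", "produce"),
    ("Organic Whole String Cheese", "packaged cheese", "dairy eggs"),
    ("Organic Hass Avocado", "fresh fruits", "produce"),
]

# Build the combined key for each product.
combined = ["::".join(row) for row in rows]

n_products = len({product for product, _, _ in rows})
n_combined = len(set(combined))
print(combined[0])
print(n_products, n_combined)
```

The combined key only appends labels to each product, so the number of distinct items, and hence the density, is unchanged.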

In [None]:
##At combined product, aisle and department level, which has the same granularity as product name
#Convert long transactions from big data into a wide array of items
mylist_combined = sparkDf.groupBy(sparkDf.oid).agg(collect_set('pname_aisle_department'))

In [None]:
mylist_combined

DataFrame[oid: bigint, collect_set(pname_aisle_department): array<string>]

In [None]:
#Run FPGrowth model using spark big list of items
fpGrowth = FPGrowth(itemsCol="collect_set(pname_aisle_department)", minSupport=0.03)
model = fpGrowth.fit(mylist_combined)

In [None]:
# Store frequent itemsets.
results_combined = model.freqItemsets.collect()
# Store frequent itemsets in pandas dataframe.
frequent_items_combined = pd.DataFrame(results_combined, columns=["itemsets", "freq"])
# Store length of each itemset in pandas dataframe.
frequent_items_combined['size'] = frequent_items_combined["itemsets"].apply(lambda x: len(x))

# Store support value which is an input for  association_rules function.
frequent_items_combined['support'] = frequent_items_combined["freq"]/3412
#Generate rules from the frequent itemsets; the 3% support threshold matches minSupport above
rules_combined = association_rules(frequent_items_combined, metric="support", min_threshold=0.03)
rules_combined = rules_combined.sort_values(["support", "confidence", "lift", "conviction"], ascending = (False, False, False, False))

#filter rules based on measures
rules_combined[ (rules_combined['confidence'] > 0.45) & (rules_combined['lift'] > 1) & (rules_combined['conviction'] > 1) ].sort_values(by=['confidence', 'lift', 'conviction'], ascending=[False, False, False])

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
44,(Organic Whole String Cheese::packaged cheese::dairy eggs),(Organic Strawberries::fresh fruits::produce),0.080012,0.250293,0.045721,0.571429,2.283038,0.025695,1.749316
5,"(Bag of Organic Bananas::fresh fruits::produce, Organic Strawberries::fresh fruits::produce)",(Organic Hass Avocado::fresh fruits::produce),0.074443,0.213658,0.037515,0.503937,2.358619,0.021609,1.585166
6,"(Organic Strawberries::fresh fruits::produce, Organic Hass Avocado::fresh fruits::produce)",(Bag of Organic Bananas::fresh fruits::produce),0.082943,0.248828,0.037515,0.452297,1.817711,0.016876,1.371495


__4) You wish to discontinue ‘Organic Orange Juice’ product. Assess the implications of your decision__

_There is a strong association between {Organic Orange Juice, Organic Hass Avocado} and between {Organic Orange Juice, Bag of Organic Bananas}, although each pair occurs in only about 1.14% of transactions. We also observe a strong association between {Organic Hass Avocado, Bag of Organic Bananas}, with support that is usually above the 1% minimum support. We may conclude that Organic Orange Juice has the potential to serve as a link between avocado and the organic banana bag. If we discontinue the organic orange juice, its associations with avocado and the banana bag are broken, which could adversely affect sales of both. We may therefore think of bundling avocado, the organic banana bag and organic orange juice together as part of a promotion_

_Discontinuing an item on the basis of its support alone (a little over 1% here) may not be wise given the variety of product promotions possible. Creating effective bundles with avocado and organic bananas may instead raise preference for orange juice_
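
The exposure described above can be sketched as two numbers: how often the candidate item is bought, and how often its partners appear in those same baskets. The example uses made-up toy baskets, not the grocery data, and `exposure` is an illustrative helper name:

```python
from collections import Counter


def exposure(baskets, item):
    """Return the share of baskets containing `item` and, within those
    baskets, the share that also contain each co-purchased product."""
    with_item = [basket for basket in baskets if item in basket]
    if not with_item:
        return 0.0, {}
    share = len(with_item) / len(baskets)
    co_counts = Counter(
        other for basket in with_item for other in basket if other != item
    )
    co_share = {other: n / len(with_item) for other, n in co_counts.items()}
    return share, co_share


# Toy baskets; each basket is a set of product names.
baskets = [
    {"orange juice", "avocado", "bananas"},
    {"orange juice", "avocado"},
    {"avocado", "bananas"},
    {"bananas"},
    {"orange juice", "bananas"},
    {"milk"},
]

share, co_share = exposure(baskets, "orange juice")
print(share, co_share)
```

A low share combined with high co-purchase shares is the pattern the text describes: few baskets are affected, but those baskets are tied to high-volume partners.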

**For this product level analysis we lower the support from 3% to 1% as our minimum support threshold**

In [None]:
#Run FPGrowth model using spark big list of items
fpGrowth = FPGrowth(itemsCol="collect_list(pname)", minSupport=0.01)
model = fpGrowth.fit(mylist)

# Store frequent itemsets.
results_OOJ = model.freqItemsets.collect()
# Store frequent itemsets in pandas dataframe.
frequent_items_OOJ = pd.DataFrame(results_OOJ, columns=["itemsets", "freq"])
# Store length of each itemset in pandas dataframe.
frequent_items_OOJ['size'] = frequent_items_OOJ["itemsets"].apply(lambda x: len(x))
# Store support value which is an input for  association_rules function.
frequent_items_OOJ['support'] = frequent_items_OOJ["freq"]/3412
#Generate rules from the frequent itemsets; the 1% support threshold matches minSupport above
rules_OOJ = association_rules(frequent_items_OOJ, metric="support", min_threshold=0.01)

_Organic Orange Juice is strongly associated with avocado and bananas (lift of about 3.1 and 2.7); the high conviction values (above 2) suggest this association is not due to chance_

In [None]:
df_orangejuiceLHS = rules_OOJ[rules_OOJ["antecedents"].apply(lambda x: 'Organic Orange Juice' in str(x))]
df_orangejuiceLHS

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
220,(Organic Orange Juice),(Organic Hass Avocado),0.016999,0.213658,0.01143,0.672414,3.147155,0.007798,2.400413
223,(Organic Orange Juice),(Bag of Organic Bananas),0.016999,0.248828,0.01143,0.672414,2.702327,0.0072,2.293052


_Avocado and bananas are also positively associated with Organic Orange Juice in the reverse direction, since lift is the same both ways. Confidence and conviction are low here, however, because orange juice itself is bought in only about 1.7% of transactions_

In [None]:
df_orangejuiceRHS = rules_OOJ[rules_OOJ["consequents"].apply(lambda x: 'Organic Orange Juice' in str(x))]
df_orangejuiceRHS

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
221,(Organic Hass Avocado),(Organic Orange Juice),0.213658,0.016999,0.01143,0.053498,3.147155,0.007798,1.038562
222,(Bag of Organic Bananas),(Organic Orange Juice),0.248828,0.016999,0.01143,0.045936,2.702327,0.0072,1.030331


In [None]:
Avocado = rules_OOJ[rules_OOJ["antecedents"].apply(lambda x: 'Organic Hass Avocado' in str(x))].sort_values(["support", "confidence", "lift", "conviction"], ascending = (False, False, False, False))

Avocado = Avocado[Avocado["consequents"].apply(lambda x: 'Bag of Organic Bananas' in str(x))].sort_values(["support", "confidence", "lift", "conviction"], ascending = (False, False, False, False))
Avocado.shape

(47, 9)

_Relation between Organic avocado and organic bag of bananas_

In [None]:
Avocado.sort_values(["support", "confidence", "lift", "conviction"], ascending = (False, False, False, False))

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
3,(Organic Hass Avocado),(Bag of Organic Bananas),0.213658,0.248828,0.08558,0.400549,1.609743,0.032416,1.2531
6,"(Organic Strawberries, Organic Hass Avocado)",(Bag of Organic Bananas),0.082943,0.248828,0.037515,0.452297,1.817711,0.016876,1.371495
9,(Organic Hass Avocado),"(Bag of Organic Bananas, Organic Strawberries)",0.213658,0.074443,0.037515,0.175583,2.358619,0.021609,1.12268
86,"(Organic Hass Avocado, Organic Raspberries)",(Bag of Organic Bananas),0.041618,0.248828,0.022567,0.542254,2.179233,0.012212,1.641023
88,(Organic Hass Avocado),"(Bag of Organic Bananas, Organic Raspberries)",0.213658,0.05891,0.022567,0.105624,1.792983,0.009981,1.052231
46,"(Organic Baby Spinach, Organic Hass Avocado)",(Bag of Organic Bananas),0.047186,0.248828,0.019343,0.409938,1.647477,0.007602,1.273039
49,(Organic Hass Avocado),"(Bag of Organic Bananas, Organic Baby Spinach)",0.213658,0.044842,0.019343,0.090535,2.018989,0.009763,1.050242
262,"(Organic Large Extra Fancy Fuji Apple, Organic Hass Avocado)",(Bag of Organic Bananas),0.02755,0.248828,0.018757,0.680851,2.736235,0.011902,2.353673
265,(Organic Hass Avocado),"(Bag of Organic Bananas, Organic Large Extra Fancy Fuji Apple)",0.213658,0.035463,0.018757,0.087791,2.475575,0.01118,1.057365
136,"(Organic Whole Milk, Organic Hass Avocado)",(Bag of Organic Bananas),0.03898,0.248828,0.016999,0.43609,1.752579,0.0073,1.332079


In [None]:
OB = rules_OOJ[rules_OOJ["antecedents"].apply(lambda x: 'Bag of Organic Bananas' in str(x))].sort_values(["support", "confidence", "lift", "conviction"], ascending = (False, False, False, False))

OB = OB[OB["consequents"].apply(lambda x: 'Organic Hass Avocado' in str(x))].sort_values(["support", "confidence", "lift", "conviction"], ascending = (False, False, False, False))
OB.shape

(47, 9)

_Relation between Organic bananas and organic avocado_

In [None]:
OB.sort_values(["support", "confidence", "lift", "conviction"], ascending = (False, False, False, False))

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
2,(Bag of Organic Bananas),(Organic Hass Avocado),0.248828,0.213658,0.08558,0.343934,1.609743,0.032416,1.198572
4,"(Bag of Organic Bananas, Organic Strawberries)",(Organic Hass Avocado),0.074443,0.213658,0.037515,0.503937,2.358619,0.021609,1.585166
7,(Bag of Organic Bananas),"(Organic Strawberries, Organic Hass Avocado)",0.248828,0.082943,0.037515,0.150766,1.817711,0.016876,1.079864
85,"(Bag of Organic Bananas, Organic Raspberries)",(Organic Hass Avocado),0.05891,0.213658,0.022567,0.383085,1.792983,0.009981,1.274636
87,(Bag of Organic Bananas),"(Organic Hass Avocado, Organic Raspberries)",0.248828,0.041618,0.022567,0.090695,2.179233,0.012212,1.053972
44,"(Bag of Organic Bananas, Organic Baby Spinach)",(Organic Hass Avocado),0.044842,0.213658,0.019343,0.431373,2.018989,0.009763,1.382878
47,(Bag of Organic Bananas),"(Organic Baby Spinach, Organic Hass Avocado)",0.248828,0.047186,0.019343,0.077739,1.647477,0.007602,1.033127
260,"(Bag of Organic Bananas, Organic Large Extra Fancy Fuji Apple)",(Organic Hass Avocado),0.035463,0.213658,0.018757,0.528926,2.475575,0.01118,1.669253
263,(Bag of Organic Bananas),"(Organic Large Extra Fancy Fuji Apple, Organic Hass Avocado)",0.248828,0.02755,0.018757,0.075383,2.736235,0.011902,1.051733
134,"(Bag of Organic Bananas, Organic Whole Milk)",(Organic Hass Avocado),0.029015,0.213658,0.016999,0.585859,2.742043,0.010799,1.898729


__5) Suppose that the following rules are generated by mining the transaction histories at the aisle level (at suitable minimum threshold levels).__

Soft drinks => Fresh fruits;
Candy chocolate => Fresh vegetables;
Tea => Fresh fruits;
Soft drinks => Fresh vegetables;
Coffee => Fresh fruits

__Explain how you would use the above rules for making promotional decisions. What rule measures did you consider while making your decision?__

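_For each aisle below, dataframe-b holds the rule with the highest lift (lift above 1, complementary items) and dataframe-c holds the rules with lift below 1 (substitutable items). Complement the consequent of dataframe-c with reference to the consequent of dataframe-b, i.e. pick items with high complementarity from dataframe-b and substitute them for (or create promotional bundles with) the consequents of dataframe-c. The measures considered are support, confidence, lift and conviction_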

**Earlier we had set 30% minimum support at aisle level, however to cover soft drinks, tea and coffee we have to lower the same to 5%**
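
The lift-based reading used for the promotional decisions can be sketched as a simple classification. The confidence and lift values are copied from the soft drinks outputs of this notebook; treating lift below 1 as substitute-like follows the notebook's interpretation, and values close to 1 (such as 0.98) are also consistent with near independence. The `classify` helper is an illustrative name:

```python
def classify(lift):
    """Label a rule by its lift: above 1 complementary, below 1 substitute-like."""
    if lift > 1:
        return "complementary"
    if lift < 1:
        return "substitute-like"
    return "independent"


# (antecedents, consequents, confidence, lift) from the soft drinks outputs.
rules = [
    ("soft drinks, fresh vegetables", "packaged vegetables fruits, fresh fruits", 0.747212, 1.265254),
    ("soft drinks", "fresh fruits", 0.817284, 0.983971),
    ("soft drinks", "fresh vegetables", 0.664198, 0.992225),
]

labels = [classify(lift) for _, _, _, lift in rules]
for (lhs, rhs, confidence, lift), label in zip(rules, labels):
    print(f"{lhs} => {rhs}: confidence={confidence:.2f}, lift={lift:.2f}, {label}")
```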

In [None]:
#Run FPGrowth model using spark big list of items
fpGrowth = FPGrowth(itemsCol="collect_set(aisle)", minSupport=0.05)
model = fpGrowth.fit(mylist_aisle)

In [None]:
# Store frequent itemsets.
results_aisle = model.freqItemsets.collect()
# Store frequent itemsets in pandas dataframe.
frequent_items_aisle = pd.DataFrame(results_aisle, columns=["itemsets", "freq"])
# Store length of each itemset in pandas dataframe.
frequent_items_aisle['size'] = frequent_items_aisle["itemsets"].apply(lambda x: len(x))

# Store support value which is an input for  association_rules function.
frequent_items_aisle['support'] = frequent_items_aisle["freq"]/3412
#Generate rules from the frequent itemsets; the 5% support threshold matches minSupport above
rules_aisle = association_rules(frequent_items_aisle, metric="support", min_threshold=0.05)
rules_aisle = rules_aisle.sort_values(["support", "confidence", "lift", "conviction"], ascending = (False, False, False, False))

In [None]:
##Convert each frozenset of items into a comma-separated string (str is the portable dtype spelling)
rules_aisle["antecedents"] = rules_aisle["antecedents"].apply(lambda x: ', '.join(list(x))).astype(str)
rules_aisle["consequents"] = rules_aisle["consequents"].apply(lambda x: ', '.join(list(x))).astype(str)

In [None]:
rules_aisle.shape

(36244, 9)

In [None]:
##Store the distinct items mentioned in the rules above
items = ['soft drinks', 'fresh fruits', 'candy chocolate', 'fresh vegetables', 'tea', 'coffee']
##Search the list to extract all rules where these items exist
rules_aisle['LHS'] = rules_aisle['antecedents'].str.findall('(' + '|'.join(items) + ')')
rules_aisle['RHS'] = rules_aisle['consequents'].str.findall('(' + '|'.join(items) + ')')

In [None]:
rules_aisle.shape

(36244, 11)
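
The search above matches substrings, so a short item such as `tea` would also match inside a longer aisle name. A hedged sketch of the difference between substring matching and whole-entry matching; the aisle name `steak sauces` is made up for illustration and is not claimed to exist in the dataset, and `exact_matches` is an illustrative helper:

```python
import re

items = ['soft drinks', 'fresh fruits', 'candy chocolate', 'fresh vegetables', 'tea', 'coffee']

# Substring pattern, built the same way as in the cell above.
loose_pattern = '(' + '|'.join(items) + ')'


def exact_matches(rule_side, items):
    """Return the listed items that appear as whole entries in a
    comma-separated rule side such as 'soft drinks, tea'."""
    tokens = [token.strip() for token in rule_side.split(',')]
    return [token for token in tokens if token in items]


hypothetical_side = "steak sauces"  # made-up aisle name containing "tea"
loose_hits = re.findall(loose_pattern, hypothetical_side)
exact_hits = exact_matches(hypothetical_side, items)
print(loose_hits, exact_hits)
```

If no aisle name in the data contains one of the six items as a substring, both approaches return the same rules.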

In [None]:
##Remove blank rows where the items dont exist
aislepromo = rules_aisle[rules_aisle['LHS'].astype(bool) & rules_aisle['RHS'].astype(bool)]
aislepromo = aislepromo.iloc[:, :9].sort_values(["support", "confidence", "lift", "conviction"], ascending = (False, False, False, False))
##Check the dimensions of the dataset
aislepromo.shape

(6990, 9)

__Soft Drinks__

In [None]:
a = aislepromo[(aislepromo['lift']> 1) & (aislepromo['antecedents'].str.contains('soft drinks')) &
               (aislepromo['confidence']> 0.50)].sort_values(["support", "confidence", "lift", "conviction"],
                                                             ascending = (False, False, False, False))

In [None]:
##Implies complementary bundles with high confidence
b = a[a['lift']==a['lift'].max()]
b

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
35092,"soft drinks, fresh vegetables","packaged vegetables fruits, fresh fruits",0.078839,0.590563,0.05891,0.747212,1.265254,0.01235,1.619686


In [None]:
##Implies substitutability bundles with high confidence
c = aislepromo[(aislepromo['lift']< 1) & (aislepromo['lift']>aislepromo['lift'].min()) &
               (aislepromo['antecedents'].str.contains('soft drinks')) &
               (aislepromo['confidence']> 0.50)].sort_values(["support", "confidence", "lift", "conviction"],
                                                             ascending = (False, False, False, False))
c

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
35146,soft drinks,fresh fruits,0.118699,0.830598,0.097011,0.817284,0.983971,-0.00158,0.927133
35126,soft drinks,fresh vegetables,0.118699,0.669402,0.078839,0.664198,0.992225,-0.000618,0.984501
35131,soft drinks,"fresh vegetables, fresh fruits",0.118699,0.606389,0.070633,0.595062,0.98132,-0.001345,0.972027


__Fresh Fruits__

In [None]:
a = aislepromo[(aislepromo['lift']> 1) & (aislepromo['antecedents'].str.contains('fresh fruits')) &
               (aislepromo['confidence']> 0.50)].sort_values(["support", "confidence", "lift", "conviction"],
                                                             ascending = (False, False, False, False))

In [None]:
##Implies complementary bundles with high confidence
b = a[a['lift']==a['lift'].max()]
b

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
35423,"fresh fruits, tortillas flat bread","fresh vegetables, bread",0.104045,0.251172,0.052462,0.504225,2.007488,0.026329,1.510419


In [None]:
##Implies substitutability bundles with high confidence
c = aislepromo[(aislepromo['lift']< 1) & (aislepromo['lift']>aislepromo['lift'].min()) &
               (aislepromo['antecedents'].str.contains('fresh fruits')) &
               (aislepromo['confidence']> 0.50)].sort_values(["support", "confidence", "lift", "conviction"],
                                                             ascending = (False, False, False, False))
c

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
35622,"fresh fruits, coffee",fresh vegetables,0.092321,0.669402,0.056565,0.612698,0.915292,-0.005235,0.853593


__Candy Chocolate__

In [None]:
a = aislepromo[(aislepromo['lift']> 1) & (aislepromo['antecedents'].str.contains('candy chocolate')) &
               (aislepromo['confidence']> 0.50)].sort_values(["support", "confidence", "lift", "conviction"],
                                                             ascending = (False, False, False, False))

In [None]:
##Implies complementary bundles with high confidence
b = a[a['lift']==a['lift'].max()]
b

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
35280,"candy chocolate, fresh vegetables","packaged vegetables fruits, fresh fruits",0.075029,0.590563,0.055393,0.738281,1.250132,0.011083,1.564417


In [None]:
##Implies substitutability bundles with high confidence
c = aislepromo[(aislepromo['lift']< 1) & (aislepromo['lift']>aislepromo['lift'].min()) &
               (aislepromo['antecedents'].str.contains('candy chocolate')) &
               (aislepromo['confidence']> 0.50)].sort_values(["support", "confidence", "lift", "conviction"],
                                                             ascending = (False, False, False, False))
c

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
35306,candy chocolate,fresh vegetables,0.113423,0.669402,0.075029,0.661499,0.988193,-0.000896,0.976652


__Fresh Vegetables__

In [None]:
a = aislepromo[(aislepromo['lift']> 1) & (aislepromo['antecedents'].str.contains('fresh vegetables')) &
               (aislepromo['confidence']> 0.50)].sort_values(["support", "confidence", "lift", "conviction"],
                                                             ascending = (False, False, False, False))

In [None]:
##Implies complementary bundles with high confidence
b = a[a['lift']==a['lift'].max()]
b

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
15827,"fresh vegetables, yogurt, lunch meat","packaged cheese, fresh fruits, packaged vegetables fruits",0.109027,0.329132,0.068581,0.629032,1.911183,0.032697,1.808426


In [None]:
##Implies substitutability bundles with high confidence
c = aislepromo[(aislepromo['lift']< 1) & (aislepromo['lift']>aislepromo['lift'].min()) &
               (aislepromo['antecedents'].str.contains('fresh vegetables')) &
               (aislepromo['confidence']> 0.40)].sort_values(["support", "confidence", "lift", "conviction"],
                                                             ascending = (False, False, False, False))
c

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
2466,"fresh vegetables, soy lactosefree","milk, fresh fruits",0.274033,0.432298,0.113423,0.413904,0.957451,-0.005041,0.968616
1988,"packaged vegetables fruits, fresh vegetables, soy lactosefree","milk, fresh fruits",0.225381,0.432298,0.094666,0.420026,0.971613,-0.002766,0.978841
32291,"fresh vegetables, soup broth bouillon","milk, fresh fruits",0.125733,0.432298,0.050703,0.403263,0.932837,-0.003651,0.951345


__Tea__

In [None]:
a = aislepromo[(aislepromo['lift']> 1) & (aislepromo['antecedents'].str.contains('tea')) &
               (aislepromo['confidence']> 0.50)].sort_values(["support", "confidence", "lift", "conviction"],
                                                             ascending = (False, False, False, False))

In [None]:
##Implies complementary bundles with high confidence
b = a[a['lift']==a['lift'].max()]
b

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
34146,"fresh vegetables, tea","packaged vegetables fruits, fresh fruits",0.081184,0.590563,0.061841,0.761733,1.289842,0.013896,1.718396


In [None]:
##Implies substitutability bundles with high confidence
c = aislepromo[(aislepromo['lift']< 1) & (aislepromo['lift']>aislepromo['lift'].min()) &
               (aislepromo['antecedents'].str.contains('tea')) &
               (aislepromo['confidence']> 0.50)].sort_values(["support", "confidence", "lift", "conviction"],
                                                             ascending = (False, False, False, False))
c

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
34219,tea,fresh fruits,0.128957,0.830598,0.106096,0.822727,0.990524,-0.001015,0.955602
34193,tea,fresh vegetables,0.128957,0.669402,0.081184,0.629545,0.940459,-0.00514,0.892411
34199,tea,"fresh vegetables, fresh fruits",0.128957,0.606389,0.072685,0.563636,0.929496,-0.005513,0.902025
34206,"tea, yogurt",fresh fruits,0.063306,0.830598,0.052462,0.828704,0.997719,-0.00012,0.988942
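
The cells above separate rules by whether lift is above or below 1. A minimal sketch of that classification follows. The function `classify_rule` and its `tolerance` parameter are assumptions added for illustration and are not part of the notebook. The tolerance shows how rules with lift very close to 1 (such as tea with fresh fruits at 0.990524) could be treated as roughly independent rather than as substitutes.

```python
def classify_rule(lift, tolerance=0.0):
    """Label a rule by its lift; `tolerance` is a hypothetical dead band around 1."""
    if lift > 1 + tolerance:
        return "complementary"
    if lift < 1 - tolerance:
        return "substitute"
    return "independent"


# Lift values copied from the tea tables above.
tea_rules = {
    "fresh vegetables, tea -> packaged vegetables fruits, fresh fruits": 1.289842,
    "tea -> fresh fruits": 0.990524,
    "tea -> fresh vegetables, fresh fruits": 0.929496,
}

for rule, lift in tea_rules.items():
    strict = classify_rule(lift)
    banded = classify_rule(lift, tolerance=0.05)
    print(f"{rule}: strict={strict}, with 5% band={banded}")
```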


__Coffee__

In [None]:
a = aislepromo[(aislepromo['lift']> 1) & (aislepromo['antecedents'].str.contains('coffee')) &
               (aislepromo['confidence']> 0.50)].sort_values(["support", "confidence", "lift", "conviction"],
                                                             ascending = (False, False, False, False))

In [None]:
##Implies complementary bundles with high confidence
b = a[a['lift']==a['lift'].max()]
b

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
35621,"fresh vegetables, coffee",fresh fruits,0.061547,0.830598,0.056565,0.919048,1.106489,0.005444,2.092614


In [None]:
##Implies substitutability bundles with high confidence
c = aislepromo[(aislepromo['lift']< 1) & (aislepromo['lift']>aislepromo['lift'].min()) &
               (aislepromo['antecedents'].str.contains('coffee')) &
               (aislepromo['confidence']> 0.50)].sort_values(["support", "confidence", "lift", "conviction"],
                                                             ascending = (False, False, False, False))
c

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
35643,coffee,fresh fruits,0.111958,0.830598,0.092321,0.824607,0.992788,-0.000671,0.965845
35622,"fresh fruits, coffee",fresh vegetables,0.092321,0.669402,0.056565,0.612698,0.915292,-0.005235,0.853593
35625,coffee,"fresh vegetables, fresh fruits",0.111958,0.606389,0.056565,0.505236,0.833187,-0.011325,0.795552
35597,"packaged vegetables fruits, coffee",fresh vegetables,0.077081,0.669402,0.050703,0.657795,0.98266,-0.000895,0.966081


__How would you generate promotional offers for a given customer, e.g. uid = 4962?__

- _Bundle milk, coffee, lunch meat, refrigerated items in separate bundles with fresh items for breakfast and lunch separately (insights from aisle level)_
- _Create a specialised bundle of all kinds of organic fruits that would include avocado and strawberry (insights from combined level)_

In [None]:
sparkDf.where(sparkDf.uid==4962).agg(collect_set('department')).show(truncate=0)
##Not all departments of this customer meet the 60% minimum support used in the rules analysed above,
##so we either need to reduce the department minimum support or go a level further down
##and check the aisle or product levels.

+--------------------------------------------------------------------------------------------------------------------------------+
|collect_set(department)                                                                                                         |
+--------------------------------------------------------------------------------------------------------------------------------+
|[personal care, household, other, produce, bakery, breakfast, deli, snacks, meat seafood, pantry, beverages, frozen, dairy eggs]|
+--------------------------------------------------------------------------------------------------------------------------------+



__Aisle level for Customer 4962__

In [None]:
##Find the unique list of aisles visited by this customer
sparkDf.where(sparkDf.uid==4962).agg(collect_set('aisle')).show(truncate=0)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|collect_set(aisle)                                                                                                                                                                                                                                                                                         

In [None]:
items_4962 = sparkDf.where(sparkDf.uid==4962).agg(collect_set('aisle'))

In [None]:
items_4962

DataFrame[collect_set(aisle): array<string>]

In [None]:
##Convert spark dataframe into python list for iteration
items_4962_list = items_4962.select("collect_set(aisle)").rdd.flatMap(list).collect()

In [None]:
#Run FPGrowth model using spark big list of items
fpGrowth = FPGrowth(itemsCol="collect_set(aisle)", minSupport=0.05)
model = fpGrowth.fit(mylist_aisle)

In [None]:
# Store frequent itemsets.
results_aisle = model.freqItemsets.collect()
# Store frequent itemsets in pandas dataframe.
frequent_items_aisle = pd.DataFrame(results_aisle, columns=["itemsets", "freq"])
# Store length of each itemset in pandas dataframe.
frequent_items_aisle['size'] = frequent_items_aisle["itemsets"].apply(lambda x: len(x))

# Store support (frequency divided by the 3412 baskets), which association_rules needs as input.
frequent_items_aisle['support'] = frequent_items_aisle["freq"]/3412
# The thresholds below are arbitrary starting values; tune them as needed.
rules_aisle = association_rules(frequent_items_aisle, metric="support", min_threshold=0.05)
rules_aisle = rules_aisle.sort_values(["support", "confidence", "lift", "conviction"], ascending = (False, False, False, False))
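
Support here is simply an itemset's frequency divided by the number of baskets. The sketch below illustrates this on a toy set of baskets in plain Python, deriving the denominator from the data instead of hard-coding it. The baskets and the 0.5 threshold are made up for illustration, and brute-force counting stands in for FP-Growth, which yields the same counts far more efficiently.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets; each set is one transaction.
baskets = [
    {"milk", "bread", "fresh fruits"},
    {"milk", "fresh fruits"},
    {"bread", "tea"},
    {"milk", "bread", "fresh fruits", "tea"},
]
n_transactions = len(baskets)

# Count every itemset of size 1 and 2 (brute force, fine for a toy example).
freq = Counter()
for basket in baskets:
    for size in (1, 2):
        for itemset in combinations(sorted(basket), size):
            freq[itemset] += 1

# Support = frequency / number of transactions.
support = {itemset: count / n_transactions for itemset, count in freq.items()}

# Keep only itemsets that meet an illustrative minimum support.
min_support = 0.5
frequent = {itemset: s for itemset, s in support.items() if s >= min_support}

for itemset, s in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(itemset, s)
```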

In [None]:
##Convert each frozenset into a comma-separated string
rules_aisle["antecedents"] = rules_aisle["antecedents"].apply(lambda x: ', '.join(list(x))).astype(str)
rules_aisle["consequents"] = rules_aisle["consequents"].apply(lambda x: ', '.join(list(x))).astype(str)

In [None]:
newlist = items_4962_list[0]

In [None]:
##Find which of customer 4962's aisles appear in each rule's antecedents and consequents
rules_aisle['LHS'] = rules_aisle['antecedents'].str.findall('(' + '|'.join(newlist) + ')')
rules_aisle['RHS'] = rules_aisle['consequents'].str.findall('(' + '|'.join(newlist) + ')')
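
The matching step joins the customer's items into one regular expression. Item names that contain regex metacharacters (parentheses, `+`, `.`) would change the pattern's meaning, so a more defensive variant escapes each item first. The sketch below shows the idea in plain Python on hypothetical items and rules. It is not the notebook's code, and escaping could change the match counts reported in later cells.

```python
import re

# Hypothetical customer items; the last one contains regex metacharacters.
customer_items = ["milk", "fresh fruits", "coffee (ground)"]

# Hypothetical (antecedents, consequents) strings, as produced by ', '.join(...).
rules = [
    ("milk, bread", "fresh fruits"),
    ("tea", "fresh fruits"),
    ("coffee (ground)", "milk"),
]

# re.escape makes every item match literally inside the alternation.
pattern = re.compile("|".join(re.escape(item) for item in customer_items))

# Keep rules where both sides mention at least one of the customer's items.
matched = [
    (lhs, rhs)
    for lhs, rhs in rules
    if pattern.search(lhs) and pattern.search(rhs)
]

for lhs, rhs in matched:
    print(f"{lhs} -> {rhs}")
```

Escaping does not prevent substring hits: a short name such as `tea` would still match inside a longer name that contains it, so exact set membership on the original frozensets may be a safer test.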

In [None]:
##Keep only rules where both sides contain at least one of the customer's aisles
custpromo = rules_aisle[rules_aisle['LHS'].astype(bool) & rules_aisle['RHS'].astype(bool)]
custpromo = custpromo.iloc[:, :9].sort_values(["support", "confidence", "lift", "conviction"], ascending = (False, False, False, False))

custpromo.shape

(33614, 9)

In [None]:
custpromo['lift'].describe()

count    33614.000000
mean         1.324730
std          0.179043
min          0.821238
25%          1.186631
50%          1.300495
75%          1.438553
max          2.122159
Name: lift, dtype: float64

In [None]:
custpromo['conviction'].describe()

count    33614.000000
mean         1.277953
std          0.543689
min          0.734235
25%          1.044190
50%          1.114838
75%          1.297460
max         15.754396
Name: conviction, dtype: float64

_Complementary rules suggest that bread and fresh vegetables pair strongly with fresh fruits and tortillas/flat bread (lift above 2, confidence above 0.50)_

In [None]:
custpromo[(custpromo['lift']>2) & (custpromo['confidence']>.50)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
35423,"fresh fruits, tortillas flat bread","fresh vegetables, bread",0.104045,0.251172,0.052462,0.504225,2.007488,0.026329,1.510419


_Packaged vegetables and fruits and chips pair well with fresh fruits, lactose-free products and lunch meat, so this customer mostly buys lunch meat as part of an assortment_

In [None]:
custpromo[(custpromo['lift']<2) & (custpromo['lift']>1.95) & (custpromo['confidence']>.50)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
16318,"fresh fruits, lunch meat, soy lactosefree","packaged vegetables fruits, chips pretzels",0.100528,0.267585,0.053341,0.530612,1.982967,0.026441,1.560362


_Weaker complementary rules (lift between 1 and 1.10) suggest the customer prefers fresh items: yogurt, chips and milk are mostly bought together with fresh fruits or vegetables_

In [None]:
custpromo[(custpromo['lift']>1) & (custpromo['lift']<1.10) & (custpromo['confidence']>.50)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,fresh vegetables,fresh fruits,0.669402,0.830598,0.606389,0.905867,1.090620,0.050385,1.799602
1,fresh fruits,fresh vegetables,0.830598,0.669402,0.606389,0.730064,1.090620,0.050385,1.224725
10,packaged vegetables fruits,fresh fruits,0.660023,0.830598,0.590563,0.894760,1.077248,0.042349,1.609677
11,fresh fruits,packaged vegetables fruits,0.830598,0.660023,0.590563,0.711009,1.077248,0.042349,1.176427
49,yogurt,fresh fruits,0.516999,0.830598,0.450762,0.871882,1.049704,0.021344,1.322236
...,...,...,...,...,...,...,...,...,...
10571,"water seltzer sparkling water, milk, chips pretzels",fresh vegetables,0.073857,0.669402,0.050117,0.678571,1.013698,0.000677,1.028527
31948,"chips pretzels, spreads","fresh vegetables, fresh fruits",0.079426,0.606389,0.050117,0.630996,1.040580,0.001954,1.066685
32491,"packaged vegetables fruits, fresh vegetables, fresh fruits, soup broth bouillon",yogurt,0.094373,0.516999,0.050117,0.531056,1.027190,0.001327,1.029976
6194,"packaged vegetables fruits, fresh vegetables, chips pretzels, bread",milk,0.095545,0.485639,0.050117,0.524540,1.080103,0.003717,1.081817


_Substitute rules (lift below 0.90) involve only coffee paired with fresh items: confidence stays above 0.50, but coffee buyers pick up fresh vegetables less often than shoppers overall_

In [None]:
custpromo[(custpromo['lift']<.90) & (custpromo['confidence']>.50)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
35619,coffee,fresh vegetables,0.111958,0.669402,0.061547,0.549738,0.821238,-0.013397,0.734235
35625,coffee,"fresh vegetables, fresh fruits",0.111958,0.606389,0.056565,0.505236,0.833187,-0.011325,0.795552


_Checking milk: the customer again pairs milk with fresh items, dry fruits or nuts_

In [None]:
custpromo[(custpromo['antecedents'].str.contains('milk')) & (custpromo['confidence']>.50)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
600,milk,fresh fruits,0.485639,0.830598,0.432298,0.890163,1.071713,0.028927,1.542304
501,milk,fresh vegetables,0.485639,0.669402,0.338804,0.697646,1.042193,0.013717,1.093415
181,milk,packaged vegetables fruits,0.485639,0.660023,0.338218,0.696439,1.055174,0.017685,1.119963
502,"milk, fresh vegetables",fresh fruits,0.338804,0.830598,0.316823,0.935121,1.125841,0.035413,2.611051
504,"milk, fresh fruits",fresh vegetables,0.432298,0.669402,0.316823,0.732881,1.094830,0.027442,1.237644
...,...,...,...,...,...,...,...,...,...
25820,"milk, breakfast bakery, yogurt","packaged vegetables fruits, fresh vegetables",0.084408,0.504396,0.050117,0.593750,1.177150,0.007542,1.219948
25821,"packaged vegetables fruits, milk, breakfast bakery","fresh vegetables, yogurt",0.086460,0.374853,0.050117,0.579661,1.546367,0.017708,1.487244
6206,"milk, chips pretzels, bread","packaged vegetables fruits, fresh vegetables",0.088511,0.504396,0.050117,0.566225,1.122580,0.005473,1.142537
5025,"packaged cheese, milk, fresh fruits, soy lactosefree",bread,0.096717,0.348476,0.050117,0.518182,1.486994,0.016414,1.352220


_The same insight holds for refrigerated items, which are again bought with fresh items_

In [None]:
custpromo[(custpromo['antecedents'].str.contains('refrigerated')) & (custpromo['confidence']>.50)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
9046,refrigerated,fresh fruits,0.308030,0.830598,0.270809,0.879163,1.058470,0.014959,1.401903
8469,refrigerated,fresh vegetables,0.308030,0.669402,0.213658,0.693625,1.036186,0.007461,1.079063
7125,refrigerated,packaged vegetables fruits,0.308030,0.660023,0.213365,0.692674,1.049468,0.010057,1.106239
8470,"refrigerated, fresh vegetables",fresh fruits,0.213658,0.830598,0.196952,0.921811,1.109816,0.019488,2.166564
8472,"refrigerated, fresh fruits",fresh vegetables,0.270809,0.669402,0.196952,0.727273,1.086451,0.015672,1.212192
...,...,...,...,...,...,...,...,...,...
19095,"packaged vegetables fruits, refrigerated, fresh dips tapenades",fresh vegetables,0.061254,0.669402,0.050117,0.818182,1.222258,0.009113,1.818288
19101,"refrigerated, fresh dips tapenades","packaged vegetables fruits, fresh vegetables",0.075615,0.504396,0.050117,0.662791,1.314028,0.011977,1.469721
22437,"refrigerated, fresh fruits, energy granola bars",milk,0.080598,0.485639,0.050117,0.621818,1.280413,0.010976,1.360089
22442,"refrigerated, energy granola bars","milk, fresh fruits",0.091149,0.432298,0.050117,0.549839,1.271899,0.010714,1.261110


_Almost the same insight holds for lunch meat, which is bought with fresh vegetables or fruits_

In [None]:
custpromo[(custpromo['antecedents'].str.contains('lunch meat')) & (custpromo['confidence']>.50)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
17105,lunch meat,fresh fruits,0.272860,0.830598,0.237691,0.871106,1.048770,0.011053,1.314278
15067,lunch meat,packaged vegetables fruits,0.272860,0.660023,0.207796,0.761547,1.153818,0.027702,1.425758
16479,lunch meat,fresh vegetables,0.272860,0.669402,0.194607,0.713212,1.065446,0.011954,1.152759
15089,"packaged vegetables fruits, lunch meat",fresh fruits,0.207796,0.830598,0.189332,0.911142,1.096972,0.016737,1.906446
15090,"fresh fruits, lunch meat",packaged vegetables fruits,0.237691,0.660023,0.189332,0.796547,1.206847,0.032450,1.671036
...,...,...,...,...,...,...,...,...,...
15049,"fresh fruits, crackers, lunch meat",chips pretzels,0.082943,0.377784,0.050117,0.604240,1.599432,0.018783,1.572206
14989,"crackers, lunch meat","packaged cheese, fresh fruits, packaged vegetables fruits",0.092614,0.329132,0.050117,0.541139,1.644138,0.019635,1.462029
15055,"crackers, lunch meat","fresh fruits, chips pretzels",0.092614,0.333236,0.050117,0.541139,1.623894,0.019255,1.453086
19389,"fresh dips tapenades, lunch meat","packaged cheese, fresh vegetables",0.096424,0.366061,0.050117,0.519757,1.419864,0.014820,1.320038


__Combined level for Customer 4962__

In [None]:
##Find the unique list of combined product::aisle::department items purchased by this customer
pnames_4962 = sparkDf.where(sparkDf.uid==4962).agg(collect_set('pname_aisle_department'))

In [None]:
##Convert spark dataframe into python list for iteration
pnames_4962_list = pnames_4962.select("collect_set(pname_aisle_department)").rdd.flatMap(list).collect()

In [None]:
newlist_pname = pnames_4962_list[0]

In [None]:
len(newlist_pname)

129

In [None]:
#Run FPGrowth model using spark big list of items
fpGrowth = FPGrowth(itemsCol="collect_set(pname_aisle_department)", minSupport=0.01)
model = fpGrowth.fit(mylist_combined)

In [None]:
# Store frequent itemsets.
results_combined = model.freqItemsets.collect()
# Store frequent itemsets in pandas dataframe.
frequent_items_combined = pd.DataFrame(results_combined, columns=["itemsets", "freq"])
# Store length of each itemset in pandas dataframe.
frequent_items_combined['size'] = frequent_items_combined["itemsets"].apply(lambda x: len(x))

# Store support (frequency divided by the 3412 baskets), which association_rules needs as input.
frequent_items_combined['support'] = frequent_items_combined["freq"]/3412
# The thresholds below are arbitrary starting values; tune them as needed.
rules_combined = association_rules(frequent_items_combined, metric="support", min_threshold=0.01)
rules_combined = rules_combined.sort_values(["support", "confidence", "lift", "conviction"], ascending = (False, False, False, False))

In [None]:
rules_combined.shape

(7452, 9)

In [None]:
##Convert each frozenset into a comma-separated string
rules_combined["antecedents"] = rules_combined["antecedents"].apply(lambda x: ', '.join(list(x))).astype(str)
rules_combined["consequents"] = rules_combined["consequents"].apply(lambda x: ', '.join(list(x))).astype(str)

##Find which of customer 4962's purchased items appear in each rule's antecedents and consequents
rules_combined['LHS'] = rules_combined['antecedents'].str.findall('(' + '|'.join(newlist_pname) + ')')
rules_combined['RHS'] = rules_combined['consequents'].str.findall('(' + '|'.join(newlist_pname) + ')')

##Keep only rules where both sides contain at least one of the customer's items
custpromo_combined = rules_combined[rules_combined['LHS'].astype(bool) & rules_combined['RHS'].astype(bool)]
custpromo_combined = custpromo_combined.iloc[:, :9].sort_values(["support", "confidence", "lift", "conviction"], ascending = (False, False, False, False))

custpromo_combined.shape

(306, 9)

In [None]:
custpromo_combined['lift'].describe()

count    306.000000
mean       2.325428
std        2.968031
min        0.649333
25%        1.620133
50%        1.892518
75%        2.423128
max       37.516204
Name: lift, dtype: float64

In [None]:
custpromo_combined['conviction'].describe()

count    306.000000
mean       1.237258
std        0.326239
min        0.895930
25%        1.032637
50%        1.114310
75%        1.306565
max        3.311694
Name: conviction, dtype: float64

_Organic products pair very well with each other, with high confidence and strong complementarity_

In [None]:
custpromo_combined[(custpromo_combined['lift']>2) & (custpromo_combined['confidence']> 0.60)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
420,"Organic Whole String Cheese::packaged cheese::dairy eggs, Bag of Organic Bananas::fresh fruits::produce",Organic Strawberries::fresh fruits::produce,0.034291,0.250293,0.021102,0.615385,2.458656,0.012519,1.949238
294,"Organic Large Extra Fancy Fuji Apple::fresh fruits::produce, Organic Hass Avocado::fresh fruits::produce",Bag of Organic Bananas::fresh fruits::produce,0.02755,0.248828,0.018757,0.680851,2.736235,0.011902,2.353673
940,"Honeycrisp Apple::fresh fruits::produce, Bag of Organic Bananas::fresh fruits::produce",Organic Strawberries::fresh fruits::produce,0.024326,0.250293,0.016706,0.686747,2.743771,0.010617,2.393295
933,"Honeycrisp Apple::fresh fruits::produce, Organic Hass Avocado::fresh fruits::produce",Organic Strawberries::fresh fruits::produce,0.024912,0.250293,0.016706,0.670588,2.679212,0.01047,2.275896
932,"Honeycrisp Apple::fresh fruits::produce, Organic Strawberries::fresh fruits::produce",Organic Hass Avocado::fresh fruits::produce,0.02755,0.213658,0.016706,0.606383,2.838105,0.01082,1.997735
941,"Honeycrisp Apple::fresh fruits::produce, Organic Strawberries::fresh fruits::produce",Bag of Organic Bananas::fresh fruits::produce,0.02755,0.248828,0.016706,0.606383,2.43696,0.009851,1.908384
376,"Organic Whole String Cheese::packaged cheese::dairy eggs, Organic Hass Avocado::fresh fruits::produce",Organic Strawberries::fresh fruits::produce,0.025791,0.250293,0.015826,0.613636,2.451671,0.009371,1.940418
912,"Honeycrisp Apple::fresh fruits::produce, Bag of Organic Bananas::fresh fruits::produce",Organic Hass Avocado::fresh fruits::produce,0.024326,0.213658,0.014947,0.614458,2.875899,0.00975,2.039575
68,"Organic Raspberries::packaged vegetables fruits::produce, Organic Strawberries::fresh fruits::produce, Organic Hass Avocado::fresh fruits::produce",Bag of Organic Bananas::fresh fruits::produce,0.01993,0.248828,0.012016,0.602941,2.423128,0.007057,1.891841
918,"Honeycrisp Apple::fresh fruits::produce, Bag of Organic Bananas::fresh fruits::produce, Organic Hass Avocado::fresh fruits::produce",Organic Strawberries::fresh fruits::produce,0.014947,0.250293,0.011137,0.745098,2.976902,0.007396,2.941158
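
Each combined item above packs product, aisle and department into one string separated by `::`. To turn such rules into an organic fruit bundle, the strings can be split back into their parts and filtered. The sketch below does this in plain Python for a few items copied from the table. The helper `parse_item` and the filter criteria are assumptions for illustration only.

```python
def parse_item(combined):
    """Split 'product::aisle::department' into its three parts."""
    product, aisle, department = combined.split("::")
    return product, aisle, department


# Combined items copied from the rules table above.
items = [
    "Organic Strawberries::fresh fruits::produce",
    "Bag of Organic Bananas::fresh fruits::produce",
    "Organic Hass Avocado::fresh fruits::produce",
    "Organic Whole String Cheese::packaged cheese::dairy eggs",
    "Honeycrisp Apple::fresh fruits::produce",
]

# Illustrative bundle: organic products from the fresh fruits aisle.
organic_fruit_bundle = [
    product
    for product, aisle, _department in map(parse_item, items)
    if aisle == "fresh fruits" and "organic" in product.lower()
]

print(organic_fruit_bundle)
```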


_Some organic produce, such as cucumber, garlic, onion and pear, also pairs with other organic items, though more weakly (lift below 1.5)_

In [None]:
custpromo_combined[(custpromo_combined['lift']<1.5)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
2,Bag of Organic Bananas::fresh fruits::produce,Organic Strawberries::fresh fruits::produce,0.248828,0.250293,0.074443,0.299176,1.195301,0.012163,1.06975
3,Organic Strawberries::fresh fruits::produce,Bag of Organic Bananas::fresh fruits::produce,0.250293,0.248828,0.074443,0.297424,1.195301,0.012163,1.069169
552,Organic Yellow Onion::fresh vegetables::produce,Organic Strawberries::fresh fruits::produce,0.074736,0.250293,0.025791,0.345098,1.378776,0.007085,1.144762
553,Organic Strawberries::fresh fruits::produce,Organic Yellow Onion::fresh vegetables::produce,0.250293,0.074736,0.025791,0.103044,1.378776,0.007085,1.03156
645,Organic Cucumber::fresh vegetables::produce,Bag of Organic Bananas::fresh fruits::produce,0.072392,0.248828,0.02374,0.327935,1.317921,0.005727,1.117708
576,Organic Garlic::fresh vegetables::produce,Bag of Organic Bananas::fresh fruits::produce,0.072978,0.248828,0.02374,0.325301,1.307335,0.005581,1.113345
644,Bag of Organic Bananas::fresh fruits::produce,Organic Cucumber::fresh vegetables::produce,0.248828,0.072392,0.02374,0.095406,1.317921,0.005727,1.025442
577,Bag of Organic Bananas::fresh fruits::produce,Organic Garlic::fresh vegetables::produce,0.248828,0.072978,0.02374,0.095406,1.307335,0.005581,1.024794
834,Organic Bartlett Pear::fresh fruits::produce,Organic Strawberries::fresh fruits::produce,0.062427,0.250293,0.021981,0.352113,1.406801,0.006356,1.157156
835,Organic Strawberries::fresh fruits::produce,Organic Bartlett Pear::fresh fruits::produce,0.250293,0.062427,0.021981,0.087822,1.406801,0.006356,1.02784


**Thank You**