# Phase 3: Generating Frequent Itemsets and Mining Association Rules 

**Summary**

1. Importing necessary libraries and the datasets


2. Transforming the datasets
    * Transforming order_products__prior.csv
        1. Dropped "add_to_cart_order" and "reordered" columns.
        ---
    * Transforming cleaned_orders.csv
        1. Dropped all the columns except "order_id" and "user_id".
        2. Investigated minimum and maximum number of transactions made in the dataset.
        3. Performed random sampling and obtained 3 random order_ids from each user.
        ---
    * Obtained transactions from randomly sampled order_ids.
        1. Obtained transactions from the randomly sampled order_ids.
        2. Justified the data loss that occurred while acquiring transactions.


3. Investigating missing products
    * Checking for missing products in transactions variable when compared to products.csv
        1. Found 1888 missing products
        ---
    * Checking for missing products in order_products__prior.csv
        1. Found 11 missing products
        ---
    * Checking for missing products in transactions variable when compared to order_products__prior.csv
        1. Found 1877 missing products
        ---
    * Summary of this section i.e. Investigating missing products
    ---
    * Obtained product with highest count/frequency among the 1877 missing products.
    ---


4. Proved that all the missing products from the transactions variable can be ignored.


--> The doubt starts here

5. Deciding support and confidence

6. Using FP-Growth algorithm to mine strong association rules.

## ---- Importing necessary libraries and datasets ----

In [None]:
!pip install pyfpgrowth

Collecting pyfpgrowth
  Downloading pyfpgrowth-1.0.tar.gz (1.6 MB)
Building wheels for collected packages: pyfpgrowth
  Building wheel for pyfpgrowth (setup.py) ... done
  Created wheel for pyfpgrowth: filename=pyfpgrowth-1.0-py2.py3-none-any.whl size=5504 sha256=e62c4fb959abd3335a5f4fb9f2bc0b3915dff64436210773840c9e0bfa73ea0d
  Stored in directory: /root/.cache/pip/wheels/73/97/4b/f12ac994f6bbb99597396255435824c73ad3916be1e678be55
Successfully built pyfpgrowth
Installing collected packages: pyfpgrowth
Successfully installed pyfpgrowth-1.0


In [7]:
import pandas as pd
import numpy as np
import seaborn as sns
from random import randint
import joblib
import pyfpgrowth as fp
from mlxtend.frequent_patterns import association_rules

In [9]:
order_prod_df = pd.read_csv("order_products__prior.csv") 
orders_df = pd.read_csv("cleaned_orders.csv")
products_df = pd.read_csv("products.csv")
aisles_df = pd.read_csv('aisles.csv')
depts_df = pd.read_csv('departments.csv')

## ---- Transforming the datasets ----

### ---> Transforming order_products__prior.csv

In [None]:
print(order_prod_df.shape)
order_prod_df.head()

(32434489, 4)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


So there are roughly 3.2 crore (about 32.4 million) rows and 4 columns. We will drop the last two columns because we don't need them for this phase.

In [None]:
order_prod_df.drop(['add_to_cart_order','reordered'],axis = 1, inplace = True)

In [None]:
print(order_prod_df.shape)

(32434489, 2)


### ---> Transforming cleaned_orders.csv

In [None]:
print(orders_df.shape)
orders_df.head()

(3214874, 8)


Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,0,2539329,1,prior,1,2,8,0.0
1,1,2398795,1,prior,2,3,7,15.0
2,2,473747,1,prior,3,3,12,21.0
3,3,2254736,1,prior,4,4,7,29.0
4,4,431534,1,prior,5,4,15,28.0


Here we will drop every column except order_id and user_id.

In [None]:
columns_list = list(orders_df.columns)
columns_list

['Unnamed: 0',
 'order_id',
 'user_id',
 'eval_set',
 'order_number',
 'order_dow',
 'order_hour_of_day',
 'days_since_prior_order']

In [None]:
columns_list.remove('order_id')
columns_list.remove('user_id')

In [None]:
orders_df.drop(columns_list,axis = 1, inplace = True)

In [None]:
orders_df.shape

(3214874, 2)

The operation was successful.
___
**Now we need to check the minimum and maximum number of transactions made per user in the dataset. This is an important step, as it checks whether the transactions are dominated by, or skewed towards, a small group of users.**

For example, say users A and B have 3 and 5 transactions respectively, but user C has 20. This skew in the number of transactions per user may lead to inaccurate conclusions and hence must be resolved.
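
To make the example concrete, the short sketch below computes each user's share of the pooled transactions. The counts 3, 5 and 20 are the hypothetical values from the example, not figures from the dataset.

```python
# Hypothetical transaction counts from the example above
transactions_per_user = {"A": 3, "B": 5, "C": 20}

total = sum(transactions_per_user.values())

# Share of the pooled transactions contributed by each user
share = {user: count / total for user, count in transactions_per_user.items()}

for user, fraction in share.items():
    print(f"User {user}: {fraction:.1%} of all transactions")
```

With these counts, user C alone supplies well over half of the pooled transactions, so any itemset that C buys repeatedly would be over-represented.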

In [None]:
df = orders_df.groupby('user_id').count()
print(df.shape)
user_id = list(df.index)
order_id = list(df.order_id.values)
orders_grouped_df = pd.DataFrame({'user_id':user_id, 'order_id':order_id})

In [None]:
print(orders_grouped_df.shape)
orders_grouped_df.head()

(206209, 2)


Unnamed: 0,user_id,order_id
0,1,10
1,2,14
2,3,12
3,4,5
4,5,4


In [None]:
print(f"The minimum number of transactions are: {orders_grouped_df.order_id.min()}")
print(f"The maximum number of transactions are: {orders_grouped_df.order_id.max()}")

The minimum number of transactions are: 3
The maximum number of transactions are: 99


I'm surprised that some users have as many as 99 transactions in the dataset. Let's check how many users have made 99 transactions.

In [None]:
print(f'Number of users that have 99 transaction: {(orders_grouped_df.order_id == 99).sum()}')

Number of users that have 99 transaction: 1374


That's a fairly large number of users.
____
**To resolve this issue, we shall perform random sampling by selecting 3 order ids at random from each user whose total number of transactions is greater than 3. Users with exactly 3 transactions contribute all of them.**
_____
Firstly, we shall obtain 3 order_ids at random from each user, and then store these randomly obtained order_ids in a variable.

In [None]:
random_sampled_order_ids = []
user_ids = list(orders_df.user_id.unique())

In [None]:
# Code to obtain three random order ids from each user
for i in user_ids:
  temp = orders_df[orders_df.user_id == i].order_id.count() 
  if temp > 3:
    data = list(orders_df[orders_df.user_id == i].order_id)
    
    sample = data[randint(0,len(data)-1)]
    data.remove(sample)
    random_sampled_order_ids.append(sample)

    sample = data[randint(0,len(data)-1)]
    data.remove(sample)
    random_sampled_order_ids.append(sample)

    sample = data[randint(0,len(data)-1)]
    data.remove(sample)
    random_sampled_order_ids.append(sample)
  if temp == 3:
    data = list(orders_df[orders_df.user_id == i].order_id)

    random_sampled_order_ids.append(data[0])
    random_sampled_order_ids.append(data[1])
    random_sampled_order_ids.append(data[2])
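
The three draw-and-remove steps above amount to sampling 3 ids without replacement. As a minimal sketch of the same idea, `random.sample` does this in one call. The dictionary below is a toy stand-in for `orders_df` (user 1's ids come from the preview above, the others are made up), and the fixed seed is only there to make the sketch reproducible.

```python
from random import Random

# Toy stand-in for orders_df: user_id -> list of that user's order ids
orders_by_user = {
    1: [2539329, 2398795, 473747, 2254736, 431534],
    2: [11, 12, 13],          # hypothetical user with exactly 3 orders
    3: [21, 22, 23, 24],      # hypothetical user with 4 orders
}

rng = Random(0)  # seeded only so the sketch is reproducible

sampled_order_ids = []
for user_id, order_ids in orders_by_user.items():
    # sample() draws without replacement, so the 3 ids are always distinct;
    # a user with exactly 3 orders simply contributes all of them
    sampled_order_ids.extend(rng.sample(order_ids, 3))

print(len(sampled_order_ids))
```

Because every user in the dataset has at least 3 orders, the `> 3` and `== 3` branches collapse into a single call in this sketch.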

In [1]:
# Checking whether the operation was successful or not
# We know there are 206209 users

if len(set(random_sampled_order_ids)) == (206209*3):
    print('The operation to get random order_ids from each user was successful')

The operation to get random order_ids from each user was successful


In [None]:
# Let's save the obtained random order ids using joblib
joblib.dump(random_sampled_order_ids,'random_sampled_order_ids.pkl')

In [None]:
# Let's check whether the previous operation was performed successfully or not
len(set(joblib.load('random_sampled_order_ids.pkl')))

### ---> Obtaining transactions from randomly sampled order_ids.

**We have successfully obtained 3 random order ids for each user. We will proceed to make a list of lists that will contain all the transactions.**


In [None]:
random_sampled_order_ids = joblib.load('random_sampled_order_ids.pkl') 

In [None]:
dumped = []
transactions = []
for i in random_sampled_order_ids:
  val = list(order_prod_df[order_prod_df.order_id == i].product_id.values)
  if len(val) == 1:
    dumped.append(i)
    continue
  transactions.append(val)
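
The loop above filters the full `order_prod_df` once per sampled order id, which can be slow on roughly 32 million rows. A possible alternative is to group the rows in a single pass. The sketch below illustrates the idea in plain Python on toy `(order_id, product_id)` rows; the row values and `sampled_ids` are illustrative assumptions. In pandas the same idea would be a filter with `isin` followed by a `groupby` on `order_id`.

```python
from collections import defaultdict

# Toy stand-in for order_prod_df rows: (order_id, product_id)
order_product_rows = [
    (2, 33120), (2, 28985), (2, 9327),
    (3, 196),
    (4, 12427), (4, 10258),
    (5, 30450), (5, 25133),
]
sampled_ids = {2, 3, 4}  # hypothetical sampled order ids

# One pass over the rows: collect products only for sampled orders
products_by_order = defaultdict(list)
for order_id, product_id in order_product_rows:
    if order_id in sampled_ids:
        products_by_order[order_id].append(product_id)

# Mirror the loop above: single-item orders go to `dumped`
transactions, dumped = [], []
for order_id, products in products_by_order.items():
    if len(products) == 1:
        dumped.append(order_id)
    else:
        transactions.append(products)

print(transactions, dumped)
```

One difference to keep in mind: a sampled order id with no rows at all never appears in `products_by_order`, whereas the original loop would append an empty list for it.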

In [None]:
transactions = joblib.load('final_transaction.pkl')

In [None]:
print(f'Number of transactions expected:{206209*3} ')
print(f'Number of transactions attained: {len(transactions)}')
print(f'Number of transactions lost: {(206209*3) - len(transactions)}')

Number of transactions expected:618627 
Number of transactions attained: 583276
Number of transactions lost: 35351


We lost approximately 35k transactions because every dropped transaction contained only a single item. A single-item transaction contains no item pairs, so it cannot support any multi-item itemset, and it may cause problems with the FP-Growth algorithm; hence these transactions were removed.

## ---- Investigating missing products ----

In this phase, we will check whether any products are missing from the transactional dataset. Investigating this will give us an idea of the quality of the sample obtained.

In [None]:
transactions = joblib.load('final_transaction.pkl')

In [None]:
print(products_df.shape)
products_df.head()

(49688, 4)


Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


### ---> Checking for missing products in "transactions" variable compared to products.csv

**We will check whether our sample contains all the products listed in our main products dataframe, i.e. "products.csv". If a few products are missing and they are popular, then our sample will not be a good representative of the population and may lead to inaccurate conclusions.**

For convenience, we will convert transactions to a dataframe.

In [None]:
new_transactions = []
for i in range(len(transactions)):
  new_transactions.extend(transactions[i])

In [None]:
# Converting new_transactions to a Dataframe 
df = pd.DataFrame(data=new_transactions,columns=['product_id'])
df['val'] = range(len(df))

print(df.shape)
df.head()

(6125447, 2)


Unnamed: 0,product_id,val
0,196,0
1,12427,1
2,10258,2
3,25133,3
4,30450,4


In [None]:
if products_df.product_id.nunique() == df.product_id.nunique():
  print('All products are included in the sample')
else:
  print('Few products are missing from the sample')

Few products are missing from the sample


Let's store all the missing product ids in a set.

In [2]:
temp_original = set(products_df.product_id.unique())
temp_sample = set(df.product_id.unique())
missing_prod_sample = temp_original - temp_sample
print(f'Total missing products from sample are: {len(missing_prod_sample)}')
print(missing_prod_sample)

Total missing products from sample are: 1888


Hmm, 1888 products are missing from the sample. It is possible that a few products are also missing from the dataset (i.e. order_products__prior.csv) from which the sample was drawn.

### ---> Checking for missing products in order_products__prior.csv

**Checking whether order_products__prior.csv has any missing products as compared to products.csv (i.e. main products dataframe).**

In [None]:
if products_df.product_id.nunique() == order_prod_df.product_id.nunique():
  print('All products are included in the order_prod_df')
else:
  print('Few products are missing from the order_prod_df')

Few products are missing from the order_prod_df


Interesting: even though we have a large amount of transactional data, a few products never appear in any transaction. Let's explore which products these are and how many of them there are.

In [None]:
temp_prod = set(products_df.product_id.unique())
temp_order_prod = set(order_prod_df.product_id.unique())
missing_prod = temp_prod - temp_order_prod
print(f'Total missing products from order_prod_df are: {len(missing_prod)}')
print(missing_prod)

Total missing products from order_prod_df are: 11
{46625, 49540, 7045, 3718, 25383, 37703, 36233, 27499, 43725, 3630, 45971}


### ---> Checking for missing products in "transactions" variable compared to order_products__prior.csv

In [4]:
temp_sample = set(df.product_id.unique())
temp_order_prod = set(order_prod_df.product_id.unique())
missing_prod_sample_2 = temp_order_prod - temp_sample
print(f'Total missing products from sample are: {len(missing_prod_sample_2)}')
print(missing_prod_sample_2)

Total missing products from sample are: 1877


**Summarizing everything we did so far in this section i.e. Investigating missing products**

1. First, we obtained all the products missing from the transactions variable when compared to products.csv. The total number of missing products was 1888.
2. On further analysis, order_products__prior.csv had 11 products missing when compared to products.csv.
3. Finally, we checked for products missing from transactions when compared to order_products__prior.csv and got a total of 1877 missing products.

This makes sense: the sample is drawn from order_products__prior.csv, so the 11 products absent from that file are necessarily absent from the sample too, and subtracting 11 from 1888 gives 1877.
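
The arithmetic can be checked with plain set operations. The sketch below uses small toy sets with illustrative names, and assumes that every product seen in the sample also appears in order_products__prior.csv, and every product there also appears in products.csv.

```python
# Toy product ids; the set names are illustrative, not from the notebook
catalogue = {1, 2, 3, 4, 5, 6}   # plays the role of products.csv
ordered = {1, 2, 3, 4, 5}        # order_products__prior.csv (subset of catalogue)
sampled = {1, 2, 3}              # products seen in transactions (subset of ordered)

missing_vs_catalogue = catalogue - sampled   # analogous to the 1888 products
never_ordered = catalogue - ordered          # analogous to the 11 products
missing_vs_ordered = ordered - sampled       # analogous to the 1877 products

# Because sampled <= ordered <= catalogue, the two smaller sets are disjoint
# and together make up the larger one, so their sizes add up
assert never_ordered.isdisjoint(missing_vs_ordered)
assert missing_vs_catalogue == never_ordered | missing_vs_ordered
print(len(missing_vs_catalogue), len(never_ordered) + len(missing_vs_ordered))
```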
___
Now we will find the product with the highest frequency in order_products__prior.csv among the 1877 products that are missing from the transactions variable.

In [None]:
highest_freq = 0
product = 0
a = []
for i in missing_prod_sample_2:
  temp = (order_prod_df.product_id == i).sum()
  a.append(temp)
  if highest_freq < temp:
    highest_freq = temp
    product = i
print(f'Product {product} has the highest frequency of {highest_freq}')

Product 6389 has the highest frequency of 37


## ---- Frequency Comparison ----

Now that we have identified the product ids missing from "transactions", we need to check whether they are going to affect our process of mining frequent itemsets and finding strong association rules.
___
Let's start by creating a new dataframe containing product frequencies with respect to the order_products__prior.csv file.

In [21]:
freq = list(order_prod_df.groupby('product_id').count().order_id.values)
prod_id = list(order_prod_df.groupby('product_id').count().order_id.index)

grouped_order_prod_df= pd.DataFrame({'product_id':prod_id,
                                        'frequency':freq}) 
print(grouped_order_prod_df.shape)
grouped_order_prod_df.head()

(49677, 2)


Unnamed: 0,product_id,frequency
0,1,1852
1,2,90
2,3,277
3,4,329
4,5,15


Now we will check whether it is okay to ignore the product ids that are missing from "transactions".

We can confirm this by first obtaining the percentage share of two products:

1. A = the top product in order_prod_df
2. B = the product with the highest frequency that is absent from "transactions"

After obtaining the percentages, if B < A/2, then we can conclude that it is okay to ignore all the products that are missing from "transactions".

In [34]:
top_product = grouped_order_prod_df.sort_values(by='frequency',ascending = False).product_id.iloc[0]

In [35]:
A = (order_prod_df.product_id == top_product).sum()
B = (order_prod_df.product_id == product).sum()
total = grouped_order_prod_df.frequency.sum()

if (A/total*100)/2 > B/total*100:
  print('We can safely ignore all the products that are missing in transactions')
  print(f'\nThe highest selling product constitutes about {round(A/total*100,3)}% of the entire dataset')
else:
  print('We cannot ignore the products that are missing in transactions')

We can safely ignore all the products that are missing in transactions

The highest selling product constitutes about 1.457% of the entire dataset


As the output shows, we can safely ignore all the product ids that are absent from "transactions".
___
Why was it safe to ignore them?
Even if we set the support threshold as low as 50% of the top product's frequency, the most frequent missing product (frequency 37) would still fall below it. The remaining missing products have even lower frequencies, so they would be excluded as well.
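
To see how wide the margin is, the comparison can be reproduced with the figures printed in this notebook. This is a sketch under one assumption: that `total` equals the 32,434,489 rows of `order_prod_df`, i.e. that no row is missing its `order_id`.

```python
# Figures taken from the outputs in this notebook
A = 472565        # frequency of the top product in order_prod_df
B = 37            # frequency of product 6389, the most frequent missing product
total = 32434489  # rows in order_prod_df, assumed equal to the sum of all frequencies

share_A = A / total * 100
share_B = B / total * 100

print(f"Top product share:            {share_A:.3f}%")
print(f"Half of top product share:    {share_A / 2:.3f}%")
print(f"Most frequent missing share:  {share_B:.6f}%")
print(f"A is about {A / B:.0f} times more frequent than B")
```

The gap is several orders of magnitude, so the conclusion does not depend on the exact choice of the 50% factor.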

## ---- Deciding Support & Confidence ----

In [None]:
new_df = pd.read_csv('Transactions_Freq.csv')
grouped_order_prod_df = pd.read_csv('Order_Prod_Df_Freq.csv')
new_df.drop('Unnamed: 0',axis=1,inplace = True)
grouped_order_prod_df.drop('Unnamed: 0',axis=1,inplace = True)

In [None]:
print(new_df.shape)
new_df.head()

(47800, 2)


Unnamed: 0,product_id,frequency
0,1,334
1,2,11
2,3,40
3,4,90
4,5,4


In [None]:
print(grouped_order_prod_df.shape)
grouped_order_prod_df.head()

(49677, 2)


Unnamed: 0,product_id,frequency
0,1,1852
1,2,90
2,3,277
3,4,329
4,5,15


To decide the support, we take 5% of the top product's share of all purchases in both the original and the sample datasets.

In [None]:
# xxxxxxxxxxxxxxxxxx For Sample Data xxxxxxxxxxxxxxxxxxx
top_prod_freq = new_df.sort_values(by='frequency',ascending=False).frequency.max()
total = new_df.frequency.sum()

prod_percent = top_prod_freq/total * 100
threshold_percent = 5/100*prod_percent
threshold = round(threshold_percent/100 * total)

print('-------------- Sample Data Support Calculation ------------')
print(f'The top product has frequency count of {top_prod_freq} and constitutes about {round(prod_percent,4)} %')
print(f'The Support freq. threshold  in percent is {threshold_percent}% and the freq. threshold is {threshold}')

# xxxxxxxxxxxxxxxxxx For Original Data xxxxxxxxxxxxxxxxxxx
top_prod_freq = grouped_order_prod_df.sort_values(by='frequency',ascending=False).frequency.max()
total = grouped_order_prod_df.frequency.sum()

prod_percent = top_prod_freq/total * 100
threshold_percent = 5/100*prod_percent
threshold = round(threshold_percent/100 * total)

print('\n-------------- Original Data Support Calculation ------------')
print(f'The top product has frequency count of {top_prod_freq} and constitutes about {round(prod_percent,4)} %')
print(f'The Support freq. threshold  in percent is {threshold_percent}% and the freq. threshold is {threshold}')

-------------- Sample Data Support Calculation ------------
The top product has frequency count of 85905 and constitutes about 1.4024 %
The Support freq. threshold  in percent is 0.070121413180132% and the freq. threshold is 4295

-------------- Original Data Support Calculation ------------
The top product has frequency count of 472565 and constitutes about 1.457 %
The Support freq. threshold  in percent is 0.07284915140793494% and the freq. threshold is 23628


## ---- FP-Growth Algorithm ----

These values correspond to 20% of the top product's share, i.e. four times the 5% figures printed in the previous section.

**Sample Data Support Calculation**

**The support frequency threshold in percent is 0.280485652720528% and the frequency threshold is 17181.**
_____
_____

**Original Data Support Calculation**

**The support frequency threshold in percent is 0.29139660563173975% and the frequency threshold is 94513.**
____
We are going to use the threshold set by the sample data.

In [None]:
round(0.5/new_df.frequency.sum()*100,100)

8.162669597826902e-06

In [None]:
transactions = joblib.load('final_transaction.pkl')

Generating Frequent Itemsets using FP Growth Algorithm

In [None]:
freq_itemset = fp.find_frequent_patterns(transactions=transactions,support_threshold=500)
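
For reference, the absolute `support_threshold=500` can be expressed as a relative support. Support in association rule mining is conventionally the fraction of transactions containing an itemset, whereas the percentages computed in the previous section are relative to the total number of item occurrences. The sketch below assumes that `final_transaction.pkl` holds the 583,276 transactions reported earlier.

```python
support_count = 500        # absolute threshold passed to find_frequent_patterns
n_transactions = 583276    # transactions attained after dropping single-item orders

# Fraction of transactions an itemset must appear in to be kept
relative_support = support_count / n_transactions
print(f"{relative_support:.4%} of transactions")
```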

In [None]:
# Converting generated freq_itemsets to a list of set
temp_itemsets = []
pos_val = []
for i in list(freq_itemset.keys()):
  val = set(i)
  temp_itemsets.append(val)

In [None]:
# Creating a dataframe from temp_itemsets. Both steps are prerequisites for the mlxtend association_rules function
freq_itemset_df = pd.DataFrame({'support':list(freq_itemset.values()),
                                'itemsets':temp_itemsets})
print(freq_itemset_df.shape)
freq_itemset_df.head()

(4429, 2)


Unnamed: 0,support,itemsets
0,500,{8501}
1,500,{40287}
2,500,{29503}
3,500,{39247}
4,500,{42360}


In [None]:
if len(freq_itemset_df) == len(temp_itemsets) == len(freq_itemset):
  print('Previous operations were successful')

Previous operations were successful


#### ---- Generating rules using mlxtend ----

In [None]:
rules_mlx = association_rules(freq_itemset_df,support_only=True)
#rules_mlx['antecedents'] = rules_mlx['antecedents'].astype('string')
rules_mlx.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(23296),(44156),,,501.0,,,,
1,(44156),(23296),,,501.0,,,,
2,(196),(46149),,,570.0,,,,
3,(46149),(196),,,570.0,,,,
4,(44156),(33548),,,504.0,,,,


#### ---- Generating rules using pyfpgrowth ----

In [None]:
rules_fp = fp.generate_association_rules(freq_itemset,0.4)
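
The second argument, 0.4, is the minimum confidence a rule must reach. As a rough illustration of what that means, the sketch below computes the confidence of one rule from raw pattern counts. The helper `confidence` is hypothetical and is not part of pyfpgrowth. The dictionary assumes the `{itemset tuple: count}` shape with sorted tuples; the pair count 501 appears in the mlxtend output above, while the single-item count 1222 is inferred from the printed confidence rather than shown in the notebook.

```python
# Toy patterns in the assumed {itemset tuple: count} shape.
# 501 is taken from the mlxtend output; 1222 is an inferred value.
patterns = {
    (23296,): 1222,
    (23296, 44156): 501,
}

def confidence(patterns, antecedent, consequent):
    """Confidence of antecedent -> consequent from raw pattern counts."""
    union = tuple(sorted(set(antecedent) | set(consequent)))
    return patterns[union] / patterns[tuple(sorted(antecedent))]

conf = confidence(patterns, (23296,), (44156,))
print(conf >= 0.4)  # the rule is kept only if it clears the 0.4 threshold
```

Confidence is a ratio of two counts, so it is unaffected by whether support is stored as an absolute count or as a fraction of transactions.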

In [None]:
rules_fp

{(4605, 16797): ((24852,), 0.46850044365572313),
 (8174, 47209): ((13176,), 0.4466501240694789),
 (8277, 47209): ((13176,), 0.4212095400340716),
 (16797, 28204): ((24852,), 0.4814534443603331),
 (16797, 45066): ((24852,), 0.4873096446700508),
 (16797, 47626): ((24852,), 0.4173262972735268),
 (16797, 49683): ((24852,), 0.41186537364517967),
 (19057, 27966): ((13176,), 0.43211920529801323),
 (19057, 47209): ((13176,), 0.4238121245221191),
 (20114, 28842): ((26209,), 0.41766109785202865),
 (20114, 47626): ((26209,), 0.40189642596644787),
 (21137, 49683): ((24852,), 0.40661387983232417),
 (21709, 35221): ((44632,), 0.4702861335289802),
 (21709, 44632): ((35221,), 0.44637883008356544),
 (22825, 47209): ((13176,), 0.4194528875379939),
 (23296,): ((44156,), 0.4099836333878887),
 (28204, 47766): ((24852,), 0.42105263157894735),
 (28204, 49683): ((24852,), 0.44605358435916004),
 (31717, 47626): ((26209,), 0.40095302927161336),
 (39928, 47209): ((13176,), 0.4116331096196868),
 (41065,): ((45007,