
## <font color='#475468'> Bundling Purchasing Recommendations:</font>
### <font color='#475468'> Can you bundle products that go together based on historical transactions?</font>

## Initialize

In [2]:
import pandas as pd

## Load Data

Remember the retail transactions data set that we used for customer segmentation...

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
dfSales = pd.read_csv('/content/drive/MyDrive/ie 423/Black Friday Sales Data_Task3.csv', encoding = "ISO-8859-1")

In [5]:
dfSales.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969


Support = how likely it is for both items to show up together

Confidence = how likely it is for Product-Y to show up given that Product-X showed up

Lift = how much more likely Product-Y is to show up given Product-X, compared with Product-Y showing up anyway

## Prepare Data

In order to extract relationships between items, the data is first rearranged into a binary table where each transaction is a row, each column is an item, and the values are set to 1 if the item was part of the transaction.
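As a minimal sketch of that rearrangement, the toy example below builds the binary table by hand in plain Python. The item names and baskets are made up for illustration; the notebook itself relies on mlxtend's `TransactionEncoder` for this step.

```python
# Toy baskets (hypothetical item names, not taken from the sales data)
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter"],
]

# Columns are the sorted set of all items seen in any basket
items = sorted({item for basket in transactions for item in basket})

# One row per basket: True if the item is in the basket, else False
binary_table = [[item in basket for item in items] for basket in transactions]

print(items)
for row in binary_table:
    print(row)
```

Each printed row corresponds to one basket, and each position in the row corresponds to one item column.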

In [6]:
# Drop missing values
dfSales.dropna(inplace=True)

In [7]:
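# Build a cleaned item label from Product_ID (spaces become underscores)
dfSales['clean_description'] = dfSales['Product_ID'].str.replace(" ", "_")
# Strip non-word characters; str.replace is not in-place, so assign the result back
dfSales['clean_description'] = dfSales['clean_description'].str.replace(r'\W', '', regex=True)
dfSales['clean_description']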

1         P00248942
6         P00184942
13        P00145042
14        P00231342
16         P0096642
            ...    
545902    P00064042
545904    P00081142
545907    P00277642
545908    P00127642
545914    P00217442
Name: clean_description, Length: 166821, dtype: object

In [8]:
dfSales.dropna(inplace=True)

In [9]:
# Convert to list format
dfSalesList=dfSales.groupby('User_ID').clean_description.apply(list)
dfSalesList

User_ID
1000001    [P00248942, P00085942, P00102642, P00110842, P...
1000002    [P00289342, P00034742, P00177442, P00116842, P...
1000003    [P00128042, P00112142, P00182742, P00110742, P...
1000004    [P00184942, P00046742, P00329542, P00114942, P...
1000005    [P00145042, P00324442, P00036842, P00173342, P...
                                 ...                        
1006036    [P00294442, P00118342, P00243942, P00156742, P...
1006037    [P00177442, P00087042, P00025442, P00086442, P...
1006038                    [P00034742, P00086042, P00109542]
1006039    [P00088542, P00254242, P00202742, P00085942, P...
1006040    [P00148642, P00059442, P00024142, P00192042, P...
Name: clean_description, Length: 5870, dtype: object

We will try to predict which other products the same user will buy, given the products they have already bought.

This type of analysis is known as **Association Rule Mining**, also commonly called **Market Basket Analysis**.

## Build Model

### Association Rule Mining

Suppose we are interested in the rule Product-X --> Product-Y.

Then:

**Support**: Fraction of purchases that contain both X and Y = P(X,Y)

**Confidence**: Fraction of the purchases containing X that also contain Y

= Support / P(X) = P(Y|X)

**Lift**: How much more often Y appears alongside X than Y appears overall = Confidence / P(Y). A lift above 1 means X and Y appear together more often than expected if they were independent.
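To make the three metrics concrete, here is a small worked example on made-up baskets. The item names and the `support` helper are illustrative only, and lift is taken here as confidence divided by P(Y).

```python
# Toy baskets (hypothetical items), one set per basket
baskets = [
    {"X", "Y"},
    {"X", "Y", "Z"},
    {"X"},
    {"Y"},
    {"Z"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

support_xy = support({"X", "Y"}, baskets)          # P(X, Y)
confidence = support_xy / support({"X"}, baskets)  # P(Y | X)
lift = confidence / support({"Y"}, baskets)        # P(Y | X) / P(Y)

print(f"support={support_xy:.3f} confidence={confidence:.3f} lift={lift:.3f}")
# prints: support=0.400 confidence=0.667 lift=1.111
```

Two of the five baskets contain both X and Y, and two of the three baskets with X also contain Y, so the lift is slightly above 1.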

In [10]:
# Encode data as transaction matrix
from mlxtend.preprocessing import TransactionEncoder

mdlSalesTe = TransactionEncoder()
mdlSalesTe_array = mdlSalesTe.fit(dfSalesList).transform(dfSalesList)
dfPurchase = pd.DataFrame(mdlSalesTe_array, columns=mdlSalesTe.columns_)
dfPurchase

Unnamed: 0,P00000142,P00000242,P00000642,P00001042,P00001142,P00001542,P00002142,P00002242,P00003442,P00004242,...,P0096442,P0096542,P0096642,P0096742,P0096842,P0097342,P0099042,P0099742,P0099842,P0099942
0,True,False,False,False,False,False,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5865,False,False,True,False,True,False,False,False,True,False,...,False,True,False,True,False,True,False,False,False,False
5866,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
5867,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
5868,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


This shows which user has bought which product. Each row is one user: a value of False means the user did not buy the product, and True means the user bought it.

In [None]:
%%time
# Determine the items and itemsets with at least 1% support (Apriori builds candidate itemsets level by level and filters them by support)

from mlxtend.frequent_patterns import apriori

frequent_itemsets = apriori(dfPurchase, min_support=0.01, use_colnames=True)
frequent_itemsets

Since this cell uses all the RAM and crashes the session, I am not going to run it and will use fpgrowth instead.
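To get a rough feel for why the Apriori cell can exhaust memory, the sketch below counts how many candidate itemsets of each size could be formed from a given number of frequent single items. The figure of 500 frequent items is a made-up number for illustration, not a value measured on this data set, and a real Apriori implementation prunes many of these candidates, so treat the counts as an upper bound.

```python
from math import comb

# Hypothetical number of items that pass the minimum support threshold
n_frequent_items = 500

# Upper bound on the number of candidate itemsets of size k
candidate_counts = {k: comb(n_frequent_items, k) for k in range(1, 5)}

for k, count in candidate_counts.items():
    print(f"size {k}: up to {count:,} candidates")
```

The count grows very quickly with itemset size, which is likely the main reason a candidate-based method struggles on a wide transaction matrix.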

In [1]:
!pip3 install mlxtend --upgrade

Collecting mlxtend
  Downloading mlxtend-0.23.1-py3-none-any.whl (1.4 MB)
Installing collected packages: mlxtend
  Attempting uninstall: mlxtend
    Found existing installation: mlxtend 0.22.0
    Uninstalling mlxtend-0.22.0:
      Successfully uninstalled mlxtend-0.22.0
Successfully installed mlxtend-0.23.1


In [11]:
%%time
# Speed up by using the FP-growth (frequent pattern growth) method for mining frequent itemsets (builds a compact prefix tree instead of generating candidate itemsets)

from mlxtend.frequent_patterns import fpgrowth

frequent_itemsets = fpgrowth(dfPurchase, min_support=0.01, use_colnames=True)
frequent_itemsets

CPU times: user 2min 3s, sys: 530 ms, total: 2min 4s
Wall time: 2min 11s


Unnamed: 0,support,itemsets
0,0.275128,(P00025442)
1,0.245315,(P00184942)
2,0.239523,(P00059442)
3,0.218228,(P00110842)
4,0.212266,(P00102642)
...,...,...
206858,0.010051,"(P00057442, P00050342)"
206859,0.012266,"(P00145042, P00050342)"
206860,0.010051,"(P00050342, P00034742)"
206861,0.010051,"(P00010742, P00050342)"


This shows the support of each frequent itemset. For instance, products P00057442 and P00050342 appear together in the purchases of about 1% of users (support 0.010051).
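A support value can be turned back into a head count by multiplying it by the number of baskets. The sketch below assumes the 5,870 users in the grouped list shown earlier are the baskets; the variable names are illustrative.

```python
n_users = 5870           # length of the per-user basket list shown earlier
support_pair = 0.010051  # support of (P00057442, P00050342) from the table

# Support is a fraction of baskets, so multiply by the basket count
users_with_pair = round(support_pair * n_users)
print(users_with_pair)  # prints: 59
```

So a support just above the 1% threshold corresponds to only a few dozen users here.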

In [12]:
# Compute the rule metrics and keep only the rules that have at least 70% confidence

from mlxtend.frequent_patterns import association_rules

a_rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
a_rules.sort_values(by=['confidence'],ascending=False,inplace=True)
a_rules.head()



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
41,"(P00122442, P00073842, P00243942)",(P00057642),0.01414,0.250426,0.010733,0.759036,3.030981,0.007192,3.110733,0.679684
77,"(P00277642, P00129342, P00000142)",(P00145042),0.013969,0.239523,0.010562,0.756098,3.15668,0.007216,3.117956,0.692891
67,"(P00155442, P00209742)",(P00112542),0.016525,0.192675,0.012436,0.752577,3.905949,0.009252,3.26294,0.756481
84,"(P00032042, P00222942)",(P00145042),0.014991,0.239523,0.011244,0.75,3.131223,0.007653,3.041908,0.690995
34,"(P00112142, P00245642, P00144642)",(P00110742),0.013629,0.274617,0.010221,0.75,2.731079,0.006479,2.901533,0.642602


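For each rule, this shows how likely a user is to purchase product Y (the consequent) given that they purchased product X (the antecedent).

From the first row we can say that if a user purchases products P00122442, P00073842, and P00243942, they would also purchase product P00057642 with approximately 76% confidence. The lift of about 3.03 means this is roughly three times the rate at which P00057642 is purchased overall.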
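As a sanity check, the metrics in the first row can be recomputed from its three support columns. This is a sketch using the rounded values displayed in the table, so the results agree with the table only up to rounding; the leverage and conviction formulas are the standard definitions, stated here as an assumption about how the library computes those columns.

```python
# Values copied from the first row of the rules table (rounded as displayed)
antecedent_support = 0.01414
consequent_support = 0.250426
rule_support = 0.010733

confidence = rule_support / antecedent_support   # P(Y | X)
lift = confidence / consequent_support           # P(Y | X) / P(Y)
# Leverage: observed joint support minus the support expected under independence
leverage = rule_support - antecedent_support * consequent_support
# Conviction: how often the rule would be wrong under independence vs. observed
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.3f} lift={lift:.3f} "
      f"leverage={leverage:.4f} conviction={conviction:.3f}")
```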