## Detecting-patterns-in-purchase-history-using-association-rule-learning-methods

### 1. Experimental Dataset as proof of concept

In [72]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

In [73]:
data_simple_input = [
["F", "G", "H", "I", "J", "K", "M"],
["F", "H", "I", "J", "K", "L", "M"],
["F", "H", "I", "N"],
["F", "G", "J", "L", "M", "N", "R"],
["F", "G", "J", "N", "R"],
["F", "G", "M", "N", "R"],
["F", "K", "N"],
["F", "G", "I", "R"],
["G", "H", "N"],
["G", "J", "R"]]

# Item counts across the 10 transactions:
# F:8
# G:7
# N:6
# J:5
# R:5
# H:4
# I:4
# M:4
# K:3
# L:2


### 1. Traditional algorithms from libraries

#### 1.1 Apriori
Apriori is the first traditional association-rule algorithm.

In [74]:
# Imports
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

In [75]:
te = TransactionEncoder()
te_ary = te.fit(data_simple_input).transform(data_simple_input)
data_simple = pd.DataFrame(te_ary, columns=te.columns_)
data_simple

Unnamed: 0,F,G,H,I,J,K,L,M,N,R
0,True,True,True,True,True,True,False,True,False,False
1,True,False,True,True,True,True,True,True,False,False
2,True,False,True,True,False,False,False,False,True,False
3,True,True,False,False,True,False,True,True,True,True
4,True,True,False,False,True,False,False,False,True,True
5,True,True,False,False,False,False,False,True,True,True
6,True,False,False,False,False,True,False,False,True,False
7,True,True,False,True,False,False,False,False,False,True
8,False,True,True,False,False,False,False,False,True,False
9,False,True,False,False,True,False,False,False,False,True


In [76]:
from mlxtend.frequent_patterns import apriori

apriori(data_simple, min_support=0.5,use_colnames=True)

Unnamed: 0,support,itemsets
0,0.8,(F)
1,0.7,(G)
2,0.5,(J)
3,0.6,(N)
4,0.5,(R)
5,0.5,"(G, F)"
6,0.5,"(F, N)"
7,0.5,"(R, G)"
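
As a cross-check of the table above, the same supports can be recomputed without any library. The following is a minimal brute-force sketch in plain Python, not the way mlxtend implements Apriori; the helper name `frequent_itemsets` is made up for this illustration. It counts every candidate itemset directly and stops at the first size that yields no frequent itemset, which is the downward-closure property that Apriori relies on.

```python
from itertools import combinations

transactions = [
    ["F", "G", "H", "I", "J", "K", "M"],
    ["F", "H", "I", "J", "K", "L", "M"],
    ["F", "H", "I", "N"],
    ["F", "G", "J", "L", "M", "N", "R"],
    ["F", "G", "J", "N", "R"],
    ["F", "G", "M", "N", "R"],
    ["F", "K", "N"],
    ["F", "G", "I", "R"],
    ["G", "H", "N"],
    ["G", "J", "R"],
]


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    n = len(transactions)
    rows = [frozenset(t) for t in transactions]
    items = sorted(set().union(*rows))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            # Count the transactions that contain every item of the candidate
            count = sum(1 for row in rows if row.issuperset(candidate))
            if count / n >= min_support:
                result[candidate] = count / n
                found = True
        if not found:
            # No frequent itemset of this size, so no larger one can be frequent
            break
    return result


supports = frequent_itemsets(transactions, 0.5)
for itemset, support in supports.items():
    print(itemset, support)
```

For the ten transactions above this yields the same eight itemsets as the table: five single items and the pairs (F, G), (F, N) and (G, R).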


#### 1.2 FP-Growth
FP-Growth gives exactly the same output as Apriori, but is much more efficient, especially on large datasets, because it does not generate candidate itemsets.

In [77]:
from mlxtend.frequent_patterns import fpgrowth

fpgrowth(data_simple, min_support=0.5,use_colnames=True)

Unnamed: 0,support,itemsets
0,0.8,(F)
1,0.7,(G)
2,0.5,(J)
3,0.6,(N)
4,0.5,(R)
5,0.5,"(G, F)"
6,0.5,"(F, N)"
7,0.5,"(R, G)"


### 2. Modified implementation of FP-Growth as Basis

#### 2.1 Classical implementation for proving correct algorithm

The modified implementation of FP-Growth should give back the same result as the classical algorithms.
This means that, without any date and profit parameters for weighted support, the algorithm basically works as expected.
One fundamental modification has been made to how the antecedent is found that leads to the combined itemset of antecedent and consequent.
The modification considers only "one antecedent" instead of many. The "one antecedent" is taken from the strict tree structure (path), ordered from high support to low support.
##### Advantages:
- Clearer rules, with no duplicate rules in which different antecedents lead to the same itemset
- Much better performance, especially for big data. All figures can be derived from one loop through all transactions, i.e. O(n) complexity for the whole dataset. Inside the paths the complexity is O(n^2), but the paths are usually very small compared to the whole dataset. The alternative would be to loop over all combinations of paths again
##### Disadvantages:
- Potential loss of information if a rule with a very specific antecedent leading to the itemset would have been decisive
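
The "one antecedent" idea can be pictured with a small sketch. This is only an illustration under assumptions, not the code in `local_libs.modified_fp_growth`: the helper names `single_rule` and `classical_rule_count` are hypothetical, ties between equally frequent items are broken alphabetically, and it is assumed that the least frequent item of a path becomes the consequent while all more frequent items form the antecedent.

```python
def single_rule(itemset, item_support):
    """Split an itemset into exactly one (antecedent, consequent) pair.

    Items are ordered from high to low support, mirroring a path in the
    FP-tree. Assumption: the last (least frequent) item is the consequent.
    """
    path = sorted(itemset, key=lambda item: (-item_support[item], item))
    return path[:-1], path[-1:]


def classical_rule_count(k):
    """Number of antecedent/consequent splits a classical rule generator
    considers for an itemset of size k (all non-empty proper subsets)."""
    return 2 ** k - 2


# Item counts taken from the proof-of-concept dataset
item_support = {"F": 8, "G": 7, "N": 6, "J": 5, "R": 5}

antecedent, consequent = single_rule({"R", "G"}, item_support)
print(antecedent, "->", consequent)   # ['G'] -> ['R']
print(classical_rule_count(3))        # 6 candidate rules instead of 1
```

The trade-off listed above is visible here: a three-item itemset produces one rule instead of six, at the cost of never evaluating the other five splits.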

In [78]:
data_simple = pd.read_csv("datasets/proof_of_concept/transactions.csv")
df_data_simple = data_simple.groupby("transaction",dropna=True)["item"].agg([lambda x: list(x),"count"])
df_data_simple

Unnamed: 0_level_0,<lambda_0>,count
transaction,Unnamed: 1_level_1,Unnamed: 2_level_1
T01,"[F, G, H, I, J, K, M]",7
T02,"[F, H, I, J, K, L, M]",7
T03,"[F, H, I, N]",4
T04,"[F, G, J, L, M, N, R]",7
T05,"[F, G, J, N, R]",5
T06,"[F, G, M, N, R]",5
T07,"[F, K, N]",3
T08,"[F, G, I, R]",4
T09,"[G, H, N]",3
T10,"[G, J, R]",3


In [79]:
import local_libs.modified_fp_growth as mod_fp_growth

rules = mod_fp_growth.fpgrowthFromDataFrame(df_data_simple, minSupRatio=0.5, maxSupRatio=1, minConf=0, item_col=1) #Traditional Association Rules

print(rules) 

  antecedent sup_antecedent consequent  sup_consequent antecedent&consequent  sup_ant&cons  sup_perc_ant&cons confidence      lift improvement
7         []             NA        [F]               8                   [F]             8                0.8         NA        NA          NA
5         []             NA        [G]               7                   [G]             7                0.7         NA        NA          NA
3         []             NA        [N]               6                   [N]             6                0.6         NA        NA          NA
0         []             NA        [J]               5                   [J]             5                0.5         NA        NA          NA
1         []             NA        [R]               5                   [R]             5                0.5         NA        NA          NA
2        [G]              7        [R]               5                [R, G]             5                0.5   0.714286  1.428571    0.214286
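
The confidence, lift and improvement values of the rule [G] -> [R] in the printout above can be reproduced by hand. The sketch below uses the standard definitions of confidence and lift; the formula for improvement (confidence minus the support ratio of the consequent) is inferred from the printed numbers and is an assumption about `modified_fp_growth`, not taken from its source.

```python
n_transactions = 10
sup_antecedent = 7   # transactions containing G
sup_consequent = 5   # transactions containing R
sup_both = 5         # transactions containing both G and R

# Share of the antecedent's transactions that also contain the consequent
confidence = sup_both / sup_antecedent
# Confidence relative to how often the consequent occurs anyway
lift = confidence / (sup_consequent / n_transactions)
# Assumed definition: absolute gain over the consequent's baseline frequency
improvement = confidence - sup_consequent / n_transactions

print(round(confidence, 6), round(lift, 6), round(improvement, 6))
# 0.714286 1.428571 0.214286
```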

#### 2.2 Weighted support with date-decay function

The idea of a date-decay function for support is that recent transactions should normally carry more weight than older ones.
The following parameters are used, where each date is mapped to a value x in [0,1]:
- max_date=datetime.datetime(2022, 11, 10)
-> This is the maximum date; x is 1 at max_date
- date_range=10
-> This is the range of x; x is 0 at max_date - date_range
- date_sensitivity = lambda x: 1 / (1 + math.exp(-10*x+5))
-> This is the example function used for the date weighting. It is a modified sigmoid that uses the curve over [0,1] to represent date decay
-> It still has to be calibrated and could differ for every new dataset. For example, the curve could be rather flat around 1 to give only a small effect
-> In the example with lambda x: 1 / (1 + math.exp(-10*x+5)), the curve is quite steep and may overvalue recent events
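
To make the effect of these parameters concrete, the following sketch computes the date-weighted support of item G by hand. The linear mapping from a date to x and the clamping to [0,1] are assumptions, and `date_weight` is a hypothetical helper rather than part of `modified_fp_growth`. Under these assumptions the result agrees with the weighted support of 4.445881 that the library prints for G in the next cell.

```python
import datetime
import math

max_date = datetime.datetime(2022, 11, 10)
date_range = 10  # days


def date_sensitivity(x):
    """Modified sigmoid from the parameter list above."""
    return 1 / (1 + math.exp(-10 * x + 5))


def date_weight(date):
    """Assumed mapping: x = 1 at max_date, x = 0 at max_date - date_range."""
    age_days = (max_date - date).days
    x = 1 - age_days / date_range
    x = min(1.0, max(0.0, x))  # clamp dates outside the range (assumption)
    return date_sensitivity(x)


# Days in November 2022 of the transactions that contain G
# (T01, T04, T05, T06, T08, T09, T10)
g_days = [1, 4, 5, 6, 8, 9, 10]
g_dates = [datetime.datetime(2022, 11, day) for day in g_days]

weighted_support_g = sum(date_weight(d) for d in g_dates)
print(round(weighted_support_g, 6))
```

The oldest transaction (T01) contributes a weight of only about 0.018, while the newest (T10) contributes about 0.993, which shows how strongly this particular curve favours recent events.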

In [80]:
data_simple = pd.read_csv("datasets/proof_of_concept/transactions.csv")
data_simple["date"] = pd.to_datetime(data_simple["date"],format='%Y-%m-%d')
# Select both columns with a list; tuple-style selection on a groupby is deprecated in pandas
df_data_simple_withdate = data_simple.groupby("transaction",dropna=True)[["item","date"]].agg([lambda x: list(x)])
df_data_simple_withdate


Unnamed: 0_level_0,item,date
Unnamed: 0_level_1,<lambda>,<lambda>
transaction,Unnamed: 1_level_2,Unnamed: 2_level_2
T01,"[F, G, H, I, J, K, M]","[2022-11-01 00:00:00, 2022-11-01 00:00:00, 202..."
T02,"[F, H, I, J, K, L, M]","[2022-11-02 00:00:00, 2022-11-02 00:00:00, 202..."
T03,"[F, H, I, N]","[2022-11-02 00:00:00, 2022-11-03 00:00:00, 202..."
T04,"[F, G, J, L, M, N, R]","[2022-11-04 00:00:00, 2022-11-04 00:00:00, 202..."
T05,"[F, G, J, N, R]","[2022-11-05 00:00:00, 2022-11-05 00:00:00, 202..."
T06,"[F, G, M, N, R]","[2022-11-06 00:00:00, 2022-11-06 00:00:00, 202..."
T07,"[F, K, N]","[2022-11-07 00:00:00, 2022-11-07 00:00:00, 202..."
T08,"[F, G, I, R]","[2022-11-08 00:00:00, 2022-11-08 00:00:00, 202..."
T09,"[G, H, N]","[2022-11-09 00:00:00, 2022-11-09 00:00:00, 202..."
T10,"[G, J, R]","[2022-11-10 00:00:00, 2022-11-10 00:00:00, 202..."


In [81]:
import math
import datetime

rules = mod_fp_growth.fpgrowthFromDataFrame(\
    df_data_simple_withdate,
    minSupRatio=0.5,
    maxSupRatio=1,
    minConf=0,
    item_col=1,
    date_col=2,
    max_date=datetime.datetime(2022, 11, 10),
    date_range=10,
    date_sensitivity = lambda x: 1 / (1 + math.exp(-10*x+5))
    ) #Only Date


print(rules)

  antecedent sup_antecedent consequent  sup_consequent antecedent&consequent  sup_ant&cons  sup_perc_ant&cons confidence      lift improvement
4         []             NA        [G]        4.445881                   [G]      4.445881           0.809327         NA        NA          NA
2         []             NA        [F]        3.446209                   [F]      3.517986           0.640413         NA        NA          NA
0         []             NA        [R]        3.445881                   [R]      3.445881           0.627287         NA        NA          NA
1        [G]       4.445881        [R]        3.445881                [R, G]      3.445881           0.627287   0.775073  1.235595    0.147786
3         []             NA        [N]        3.482014                   [N]      3.410237           0.620798         NA        NA          NA


#### 2.3 Weighted support with profit-dependent function

The main driver for a business is not the frequency of items but their profitability. One can normally assume that more frequent items are lower in price and higher in margin, yet their profit can be matched by infrequent but highly priced articles.
I've read in one article that one reason association rules are not applied that often is their lack of relevance. Frequency is only one part, not the ultimate driver for business.
Instead of just counting each transaction, we weight each article of every transaction by its profit and relate that to the association rules.
The result has to be interpreted carefully and cannot be read in the same way as the output of the traditional methods. Here sup_ant&cons is the profit resulting from a relation, and sup_perc_ant&cons is its share of the total profit summed over all articles.

Profit and frequency could also be combined after the association rules have been created, with the business joining all frequencies with the profit of the articles afterwards. However, by then the least frequent articles have already been filtered out for performance or relevance reasons, so crucial relations are lost. Moreover, that approach would be less straightforward.
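
As a rough illustration of the weighting, the sketch below computes a profit-weighted support for single items. It assumes that each occurrence is weighted by `profit_sensitivity(profit / max_profit)`; this normalisation and the helper `profit_weighted_support` are assumptions, not the documented behaviour of `modified_fp_growth`. For F and G the values agree with the sup_consequent column printed further down (0.8 and 1.4); the sketch does not attempt to reproduce the other columns.

```python
max_profit = 100


def profit_sensitivity(x):
    """Identity weighting, as passed to the library call in this section."""
    return 1 * x


# Profit per article and number of transactions containing it (from the dataset)
item_profit = {"F": 10, "G": 20}
item_count = {"F": 8, "G": 7}


def profit_weighted_support(item):
    """Assumed weighting: every occurrence counts profit / max_profit."""
    weight = profit_sensitivity(item_profit[item] / max_profit)
    return item_count[item] * weight


for item in ("F", "G"):
    print(item, round(profit_weighted_support(item), 6))
# F 0.8
# G 1.4
```

Note how the ranking flips compared with plain counting: F occurs more often than G, but G carries the larger weighted support because each occurrence is worth twice as much.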


In [82]:
data_simple = pd.read_csv("datasets/proof_of_concept/transactions.csv")
# Select both columns with a list; tuple-style selection on a groupby is deprecated in pandas
df_data_simple_withprofit = data_simple.groupby("transaction",dropna=True)[["item","profit"]].agg([lambda x: list(x)])
df_data_simple_withprofit


Unnamed: 0_level_0,item,profit
Unnamed: 0_level_1,<lambda>,<lambda>
transaction,Unnamed: 1_level_2,Unnamed: 2_level_2
T01,"[F, G, H, I, J, K, M]","[10, 20, 30, 40, 50, 60, 80]"
T02,"[F, H, I, J, K, L, M]","[10, 30, 40, 50, 60, 70, 80]"
T03,"[F, H, I, N]","[10, 30, 40, 90]"
T04,"[F, G, J, L, M, N, R]","[10, 20, 50, 70, 80, 90, 100]"
T05,"[F, G, J, N, R]","[10, 20, 50, 90, 100]"
T06,"[F, G, M, N, R]","[10, 20, 80, 90, 100]"
T07,"[F, K, N]","[10, 60, 90]"
T08,"[F, G, I, R]","[10, 20, 40, 100]"
T09,"[G, H, N]","[20, 30, 90]"
T10,"[G, J, R]","[20, 50, 100]"


In [85]:
import math
import datetime

rules = mod_fp_growth.fpgrowthFromDataFrame(\
    df_data_simple_withprofit,
    minSupRatio=0.02,
    maxSupRatio=1,
    minConf=0,
    item_col=1,
    profit_col=2,
    max_profit = 100,
    profit_sensitivity = lambda x : 1 * x
    ) #Only Profit

print(rules)
rules.to_excel("fp_groth_out.xlsx")


    antecedent sup_antecedent consequent  sup_consequent antecedent&consequent  sup_ant&cons  sup_perc_ant&cons confidence       lift improvement
0           []             NA        [F]             0.8                   [F]           7.4           0.795699         NA         NA          NA
205         []             NA        [G]             1.4                   [G]           4.8           0.516129         NA         NA          NA
96         [N]            0.7        [F]             0.8                [F, N]           4.8           0.516129   6.857143  79.714286    6.771121
5          [G]            4.8        [F]             0.8                [F, G]           4.8           0.516129        1.0     11.625    0.913978
222        [R]            0.9        [G]             1.4                [R, G]           4.0           0.430108   4.444444   29.52381    4.293907
..         ...            ...        ...             ...                   ...           ...                ...        ...  

#### 2.4 Combined weighted support of date-decay and profit-dependent function

### 3. Analysis of Kaggle Dataset

#### 3.1 Investigating Dataset

#### 3.1 Description
Kaggle Dataset: https://www.kaggle.com/datasets/mkechinov/ecommerce-purchase-history-from-electronics-store <br>
The Dataset is Open-Source <br>
This Dataset contains purchase data from April 2020 to November 2020 from a large home appliances and electronics online store. <br>
Each row in the file represents an event. All events are related to products and users. Each event is like a many-to-many relation between products and users. <br>

#### 3.2 Sensitivity Analysis