## ASSOCIATION RULE ANALYSIS WITH APRIORI ALGORITHM

### Data Understanding

This dataset records which foreign language courses were taken together; each row is one course taker.

We load the file into a single column named `courses` by passing `names=['courses']`.

In [46]:
import pandas as pd
df = pd.read_csv("mydataset.csv", names=['courses'], header = None)

In [47]:
df.head()

|   | courses |
|---|---|
| 0 | TURKISH,GERMAN,PORTUGUESE |
| 1 | GERMAN,PORTUGESE,JAPANESE |
| 2 | RUSSIAN,POLISH,TURKISH |
| 3 | GERMAN,SPANISH,TURKISH |
| 4 | KOREAN,FRENCH,ARABIC,POLISH |


`shape` gives the dimensions of the dataset as (rows, columns).

In [48]:
df.shape

(21, 1)

The column labels. Note that `dtype='object'` in the output is the data type of the label index itself, not of the column values.

In [49]:
df.columns

Index(['courses'], dtype='object')

The observations in each row.


In [50]:
df.values

array([['TURKISH,GERMAN,PORTUGUESE'],
       ['GERMAN,PORTUGESE,JAPANESE'],
       ['RUSSIAN,POLISH,TURKISH'],
       ['GERMAN,SPANISH,TURKISH'],
       ['KOREAN,FRENCH,ARABIC,POLISH'],
       ['KOREAN,RUSSIAN,POLISH'],
       ['SPANISH,TURKISH'],
       ['GERMAN,PORTUGUESE'],
       ['GERMAN,PORTUGUESE'],
       ['FRENCH,CHINESE,TURKISH,ARABIC'],
       ['RUSSIAN,TURKISH,GERMAN,SPANISH'],
       ['SPANISH,JAPANESE'],
       ['KOREAN,JAPANESE,CHINESE'],
       ['PORTUGUESE,SPANISH'],
       ['ENGLISH,TURKISH'],
       ['TURKISH,GERMAN'],
       ['ENGLISH,FRENCH,SPANISH'],
       ['ENGLISH,SPANISH,GERMAN'],
       ['ENGLISH,RUSSIAN,TURKISH'],
       ['CHINESE,ARABIC,POLISH'],
       ['CHINESE,ARABIC,POLISH']], dtype=object)

We split each row on `,` so that every course becomes a separate item in a list.

In [51]:
data = list(df["courses"].apply(lambda x:x.split(',')))
data 

[['TURKISH', 'GERMAN', 'PORTUGUESE'],
 ['GERMAN', 'PORTUGESE', 'JAPANESE'],
 ['RUSSIAN', 'POLISH', 'TURKISH'],
 ['GERMAN', 'SPANISH', 'TURKISH'],
 ['KOREAN', 'FRENCH', 'ARABIC', 'POLISH'],
 ['KOREAN', 'RUSSIAN', 'POLISH'],
 ['SPANISH', 'TURKISH'],
 ['GERMAN', 'PORTUGUESE'],
 ['GERMAN', 'PORTUGUESE'],
 ['FRENCH', 'CHINESE', 'TURKISH', 'ARABIC'],
 ['RUSSIAN', 'TURKISH', 'GERMAN', 'SPANISH'],
 ['SPANISH', 'JAPANESE'],
 ['KOREAN', 'JAPANESE', 'CHINESE'],
 ['PORTUGUESE', 'SPANISH'],
 ['ENGLISH', 'TURKISH'],
 ['TURKISH', 'GERMAN'],
 ['ENGLISH', 'FRENCH', 'SPANISH'],
 ['ENGLISH', 'SPANISH', 'GERMAN'],
 ['ENGLISH', 'RUSSIAN', 'TURKISH'],
 ['CHINESE', 'ARABIC', 'POLISH'],
 ['CHINESE', 'ARABIC', 'POLISH']]
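
The second row contains `PORTUGESE`, which looks like a misspelling of `PORTUGUESE`; left as is, it is treated as a separate course. The analysis in this notebook keeps the raw labels. If the labels were to be cleaned first, a minimal sketch could look like the following. The `CORRECTIONS` table and the `normalize` helper are hypothetical names introduced for illustration, and applying them would change the support values reported in this notebook.

```python
# Hypothetical correction table: maps a misspelled label to its intended form.
CORRECTIONS = {"PORTUGESE": "PORTUGUESE"}


def normalize(transactions, corrections=CORRECTIONS):
    """Strip whitespace, upper-case labels, and apply known spelling fixes."""
    cleaned = []
    for items in transactions:
        fixed = []
        for item in items:
            label = item.strip().upper()
            label = corrections.get(label, label)
            if label not in fixed:  # avoid duplicates after merging labels
                fixed.append(label)
        cleaned.append(fixed)
    return cleaned


sample = [
    ["TURKISH", "GERMAN", "PORTUGUESE"],
    ["GERMAN", "PORTUGESE", "JAPANESE"],
]
print(normalize(sample))
```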

### Data Preprocessing

In [52]:
from mlxtend.preprocessing import TransactionEncoder

We convert the dataset to a boolean (one-hot) table with `TransactionEncoder()`: one column per course, with `True` where a row contains that course.

In [53]:
te = TransactionEncoder()
te_data = te.fit(data).transform(data)
df = pd.DataFrame(te_data,columns=te.columns_)
df

|   | ARABIC | CHINESE | ENGLISH | FRENCH | GERMAN | JAPANESE | KOREAN | POLISH | PORTUGESE | PORTUGUESE | RUSSIAN | SPANISH | TURKISH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | True | False | False | False | False | True | False | False | True |
| 1 | False | False | False | False | True | True | False | False | True | False | False | False | False |
| 2 | False | False | False | False | False | False | False | True | False | False | True | False | True |
| 3 | False | False | False | False | True | False | False | False | False | False | False | True | True |
| 4 | True | False | False | True | False | False | True | True | False | False | False | False | False |
| 5 | False | False | False | False | False | False | True | True | False | False | True | False | False |
| 6 | False | False | False | False | False | False | False | False | False | False | False | True | True |
| 7 | False | False | False | False | True | False | False | False | False | True | False | False | False |
| 8 | False | False | False | False | True | False | False | False | False | True | False | False | False |
| 9 | True | True | False | True | False | False | False | False | False | False | False | False | True |
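
To make the encoding step concrete, here is a rough pure-Python sketch of what a one-hot transaction encoding does. It is not mlxtend's implementation; `one_hot_encode` is a hypothetical helper, and the sample is just the first three rows of `data`.

```python
def one_hot_encode(transactions):
    """Return (sorted column labels, boolean rows) for a list of item lists."""
    columns = sorted({item for items in transactions for item in items})
    rows = [[col in set(items) for col in columns] for items in transactions]
    return columns, rows


# First three transactions from the dataset above.
sample = [
    ["TURKISH", "GERMAN", "PORTUGUESE"],
    ["GERMAN", "PORTUGESE", "JAPANESE"],
    ["RUSSIAN", "POLISH", "TURKISH"],
]

columns, rows = one_hot_encode(sample)
print(columns)
for row in rows:
    print(row)
```

Each distinct label becomes its own column, which is why the misspelled `PORTUGESE` shows up next to `PORTUGUESE` in the encoded table.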


### Data Association Rules

In [54]:
from mlxtend.frequent_patterns import apriori

#### Support
Support is the fraction of rows that contain a given set of courses. We compute it for every frequent itemset by running `apriori` on the boolean dataset.
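
As a sanity check that does not depend on mlxtend, support can be counted directly from the transaction lists. The sketch below copies the 21 transactions from `data`; the `support` helper is a hypothetical name introduced here.

```python
# The 21 transactions, copied from the `data` list above.
transactions = [
    ["TURKISH", "GERMAN", "PORTUGUESE"],
    ["GERMAN", "PORTUGESE", "JAPANESE"],
    ["RUSSIAN", "POLISH", "TURKISH"],
    ["GERMAN", "SPANISH", "TURKISH"],
    ["KOREAN", "FRENCH", "ARABIC", "POLISH"],
    ["KOREAN", "RUSSIAN", "POLISH"],
    ["SPANISH", "TURKISH"],
    ["GERMAN", "PORTUGUESE"],
    ["GERMAN", "PORTUGUESE"],
    ["FRENCH", "CHINESE", "TURKISH", "ARABIC"],
    ["RUSSIAN", "TURKISH", "GERMAN", "SPANISH"],
    ["SPANISH", "JAPANESE"],
    ["KOREAN", "JAPANESE", "CHINESE"],
    ["PORTUGUESE", "SPANISH"],
    ["ENGLISH", "TURKISH"],
    ["TURKISH", "GERMAN"],
    ["ENGLISH", "FRENCH", "SPANISH"],
    ["ENGLISH", "SPANISH", "GERMAN"],
    ["ENGLISH", "RUSSIAN", "TURKISH"],
    ["CHINESE", "ARABIC", "POLISH"],
    ["CHINESE", "ARABIC", "POLISH"],
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


print(round(support({"TURKISH"}, transactions), 6))            # 9 of 21 rows
print(round(support({"GERMAN", "TURKISH"}, transactions), 6))  # 4 of 21 rows
```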

In [62]:
df1 = apriori(df, min_support=0.03, use_colnames = True, verbose=1)
df1 = df1.sort_values(by="support", ascending=False)
df1

Processing 156 combinations | Sampling itemset size 2
Processing 387 combinations | Sampling itemset size 3
Processing 200 combinations | Sampling itemset size 4
Processing 15 combinations | Sampling itemset size 5


|   | support | itemsets |
|---|---|---|
| 12 | 0.428571 | (TURKISH) |
| 4 | 0.380952 | (GERMAN) |
| 11 | 0.333333 | (SPANISH) |
| 7 | 0.238095 | (POLISH) |
| 0 | 0.190476 | (ARABIC) |
| ... | ... | ... |
| 40 | 0.047619 | (JAPANESE, SPANISH) |
| 42 | 0.047619 | (RUSSIAN, KOREAN) |
| 44 | 0.047619 | (POLISH, TURKISH) |
| 45 | 0.047619 | (PORTUGUESE, SPANISH) |


This table shows what share of all course takers took each course or combination of courses.
- 43% of all course takers took TURKISH.
- 38% of all course takers took GERMAN.
- 33% of all course takers took SPANISH.
- About 5% of all course takers (1 of 21) took GERMAN, RUSSIAN, TURKISH and SPANISH together.
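
The frequent itemsets come from a level-wise search: itemsets of size k are built only from frequent itemsets of size k-1, because any superset of an infrequent itemset must also be infrequent. The following is a minimal illustrative sketch of that idea on a toy dataset. It is not mlxtend's implementation, and `apriori_sketch` is a hypothetical name.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset: support} for all itemsets with support >= min_support."""
    n = len(transactions)
    tsets = [frozenset(t) for t in transactions]
    items = sorted({item for t in tsets for item in t})
    current = [frozenset([item]) for item in items]  # size-1 candidates
    frequent = {}
    k = 1
    while current:
        # Count each candidate and keep those that meet the threshold.
        level = {}
        for cand in current:
            supp = sum(1 for t in tsets if cand <= t) / n
            if supp >= min_support:
                level[cand] = supp
        frequent.update(level)
        k += 1
        # Join step: combine frequent itemsets into candidates of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


# Toy data, unrelated to the course dataset.
sample = [["A", "B"], ["A", "B", "C"], ["A", "C"], ["B"]]
result = apriori_sketch(sample, min_support=0.5)
for itemset, supp in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

In the toy data, `{A, B, C}` is never counted: its subset `{B, C}` falls below the threshold, so the prune step discards the candidate.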

#### Confidence
From the frequent itemsets found above, we derive association rules and keep those that meet a minimum `confidence`.

In [57]:
from mlxtend.frequent_patterns import association_rules

Confidence is the fraction of course takers with the first course (the antecedent) who also took the second course (the consequent).

In [63]:
association_rules(df1, metric = "confidence", min_threshold = 0.5)

|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (GERMAN) | (TURKISH) | 0.380952 | 0.428571 | 0.190476 | 0.50 | 1.166667 | 0.027211 | 1.142857 |
| 1 | (ARABIC) | (CHINESE) | 0.190476 | 0.190476 | 0.142857 | 0.75 | 3.937500 | 0.106576 | 3.238095 |
| 2 | (CHINESE) | (ARABIC) | 0.190476 | 0.190476 | 0.142857 | 0.75 | 3.937500 | 0.106576 | 3.238095 |
| 3 | (ARABIC) | (POLISH) | 0.190476 | 0.238095 | 0.142857 | 0.75 | 3.150000 | 0.097506 | 3.047619 |
| 4 | (POLISH) | (ARABIC) | 0.238095 | 0.190476 | 0.142857 | 0.60 | 3.150000 | 0.097506 | 2.023810 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 87 | (GERMAN, RUSSIAN, SPANISH) | (TURKISH) | 0.047619 | 0.428571 | 0.047619 | 1.00 | 2.333333 | 0.027211 | inf |
| 88 | (GERMAN, TURKISH, SPANISH) | (RUSSIAN) | 0.095238 | 0.190476 | 0.047619 | 0.50 | 2.625000 | 0.029478 | 1.619048 |
| 89 | (RUSSIAN, TURKISH, SPANISH) | (GERMAN) | 0.047619 | 0.380952 | 0.047619 | 1.00 | 2.625000 | 0.029478 | inf |
| 90 | (GERMAN, RUSSIAN) | (TURKISH, SPANISH) | 0.047619 | 0.142857 | 0.047619 | 1.00 | 7.000000 | 0.040816 | inf |


- GERMAN and TURKISH are taken together by 19% of course takers (support); this is the pair with the highest support.
- ARABIC and CHINESE are taken together by 14% of course takers.
- GERMAN is taken by 38% of course takers, with or without other courses (antecedent support).
- TURKISH is taken by 43% of course takers, with or without other courses (consequent support).
- 50% of those who take GERMAN also take TURKISH (confidence).
- Those who take GERMAN are 1.17 times as likely to take TURKISH as course takers overall (lift).
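
The metrics in the first row of the table can be reproduced by hand. The sketch below uses the standard definitions of confidence, lift, leverage and conviction; the counts are derived from the supports above (for example 0.380952 × 21 = 8) rather than taken from a separate source.

```python
# Counts derived from the support values in the table (21 rows in total).
n = 21
count_german = 8    # rows containing GERMAN
count_turkish = 9   # rows containing TURKISH
count_both = 4      # rows containing GERMAN and TURKISH

support_a = count_german / n    # antecedent support
support_c = count_turkish / n   # consequent support
support_ac = count_both / n     # support of the rule GERMAN -> TURKISH

confidence = support_ac / support_a              # P(TURKISH | GERMAN)
lift = confidence / support_c                    # > 1 means positive association
leverage = support_ac - support_a * support_c    # gap from independence
conviction = (1 - support_c) / (1 - confidence)  # undefined when confidence == 1

print(round(confidence, 6))  # 0.5
print(round(lift, 6))        # 1.166667
print(round(leverage, 6))    # 0.027211
print(round(conviction, 6))  # 1.142857
```

When confidence equals 1 the conviction formula divides by zero, which is why the table reports `inf` for such rules.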


#### Using Support and Confidence Metrics Together
- Support >= 0.05
- Confidence >= 0.7

In [68]:
rules = association_rules(df1, metric = "confidence", min_threshold = 0.7)
rules[ (rules['confidence'] >= 0.7) & (rules['support'] >= 0.05) ]

|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (ARABIC) | (CHINESE) | 0.190476 | 0.190476 | 0.142857 | 0.75 | 3.9375 | 0.106576 | 3.238095 |
| 1 | (CHINESE) | (ARABIC) | 0.190476 | 0.190476 | 0.142857 | 0.75 | 3.9375 | 0.106576 | 3.238095 |
| 2 | (ARABIC) | (POLISH) | 0.190476 | 0.238095 | 0.142857 | 0.75 | 3.15 | 0.097506 | 3.047619 |
| 3 | (PORTUGUESE) | (GERMAN) | 0.190476 | 0.380952 | 0.142857 | 0.75 | 1.96875 | 0.070295 | 2.47619 |
| 4 | (RUSSIAN) | (TURKISH) | 0.190476 | 0.428571 | 0.142857 | 0.75 | 1.75 | 0.061224 | 2.285714 |
| 5 | (POLISH, CHINESE) | (ARABIC) | 0.095238 | 0.190476 | 0.095238 | 1.0 | 5.25 | 0.077098 | inf |


### Reporting

Now we interpret the rules that pass both thresholds.
- ARABIC and CHINESE are taken together by 14% of course takers. This is the highest support among the filtered rules, shared with the other single-course rules in the table.
- ARABIC is taken by 19% of course takers, with or without other courses.
- ARABIC and POLISH are also taken together by 14% of course takers.
- PORTUGUESE is taken by 19% of course takers, with or without other courses.
- 75% of those who take the PORTUGUESE course also take the GERMAN course.
- Those who take PORTUGUESE are 1.97 times as likely to take GERMAN as course takers overall (lift).
- Those who take RUSSIAN are 1.75 times as likely to take TURKISH as course takers overall.
- Those who take both POLISH and CHINESE are 5.25 times as likely to take ARABIC as course takers overall.

### Action Idea

- If we give the CHINESE course as a gift to those who take the POLISH course, the campaign may be profitable, because in this dataset everyone who took both POLISH and CHINESE also took ARABIC (confidence 1.0, based on only two rows).
- If a general discount is applied to the CHINESE course, purchases of the ARABIC course may increase (confidence 0.75), which may in turn increase sales of the POLISH course (ARABIC to POLISH confidence 0.75). In other words, discounting one course could increase the sales of two others.
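
Since the dataset has only 21 rows, it can help to convert support back to absolute row counts before acting on a rule. A small sketch follows, with the support values copied from the filtered table above; `to_count` is a hypothetical helper.

```python
N_ROWS = 21  # from df.shape

# (rule, support) pairs copied from the filtered rules table.
rules = [
    ("ARABIC -> CHINESE", 0.142857),
    ("ARABIC -> POLISH", 0.142857),
    ("PORTUGUESE -> GERMAN", 0.142857),
    ("RUSSIAN -> TURKISH", 0.142857),
    ("POLISH, CHINESE -> ARABIC", 0.095238),
]


def to_count(support, n=N_ROWS):
    """Convert a support fraction back to the number of matching rows."""
    return round(support * n)


for rule, supp in rules:
    print(f"{rule}: {to_count(supp)} of {N_ROWS} rows")
```

Each of these rules rests on two or three rows, so the campaign ideas are best treated as hypotheses to test on more data.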