# Association Rules
- The examples below use R 3.6
- They rely on the arules package: install.packages("arules")

In [68]:
library("arules")

Define a small custom dataset.

In [69]:
df = data.frame(ID=c(1,2,3,4,5,6),
                  Onion=c(1,0,0,1,1,1),
                  Potato=c(1,1,0,1,1,1),
                  Burger=c(1,1,0,0,1,1),
                  Milk=c(0,1,1,1,0,1),
                  Beer=c(0,0,1,0,1,0))

In [70]:
df

ID,Onion,Potato,Burger,Milk,Beer
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,1,1,0,0
2,0,1,1,1,0
3,0,0,0,1,1
4,1,1,0,1,0
5,1,1,1,0,1
6,1,1,1,1,0


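### Setting a support threshold to filter frequent itemsets
- apriori(as.matrix(df[,-1]),parameter=list(support=0.5,minlen=2))
- The call above uses a minimum support of 50%; in practice the threshold is chosen according to the sample size and the characteristics of the data
- With its defaults, apriori() mines rules (target "rules") with a minimum confidence of 0.8, as the parameter listing in the output shows, so the result holds rules rather than bare itemsets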
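
To make the support threshold concrete, here is a minimal base-R sketch that computes support by hand for the toy baskets. The names `baskets` and `itemset_support` are illustrative and are not part of arules.

```r
# Same toy baskets as above, as a 0/1 matrix (one row per transaction)
baskets <- cbind(Onion  = c(1, 0, 0, 1, 1, 1),
                 Potato = c(1, 1, 0, 1, 1, 1),
                 Burger = c(1, 1, 0, 0, 1, 1),
                 Milk   = c(0, 1, 1, 1, 0, 1),
                 Beer   = c(0, 0, 1, 0, 1, 0))

# Support of a single item = share of transactions that contain it
item_support <- colMeans(baskets)

# Support of an itemset = share of transactions that contain every item in it
# (hypothetical helper, written only for this illustration)
itemset_support <- function(m, items) {
  mean(rowSums(m[, items, drop = FALSE]) == length(items))
}

pair_support <- itemset_support(baskets, c("Onion", "Potato"))

print(item_support)
print(pair_support)
```

Beer appears in only 2 of the 6 baskets, so it falls below the 50% threshold. This is consistent with apriori keeping 4 of the 5 items in the log below.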

In [71]:
# The item columns must be logical or factor before mining; the quickest way is to convert them to a matrix directly
frequent_itemsets = apriori(as.matrix(df[,-1]),parameter=list(support=0.5,minlen=2))

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.8    0.1    1 none FALSE            TRUE       5     0.5      2
 maxlen target   ext
     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 3 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[5 item(s), 6 transaction(s)] done [0.00s].
sorting and recoding items ... [4 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [5 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].


In [72]:
# In R, use inspect() to display the results
inspect(frequent_itemsets)

    lhs               rhs      support   confidence lift count
[1] {Burger}       => {Potato} 0.6666667 1.0        1.2  4    
[2] {Potato}       => {Burger} 0.6666667 0.8        1.2  4    
[3] {Onion}        => {Potato} 0.6666667 1.0        1.2  4    
[4] {Potato}       => {Onion}  0.6666667 0.8        1.2  4    
[5] {Onion,Burger} => {Potato} 0.5000000 1.0        1.2  3    


In [73]:
# Sort the rules by lift to find the strongest associations
sort.rule = sort(frequent_itemsets, by=c("lift"))
inspect(sort.rule)

    lhs               rhs      support   confidence lift count
[1] {Potato}       => {Burger} 0.6666667 0.8        1.2  4    
[2] {Potato}       => {Onion}  0.6666667 0.8        1.2  4    
[3] {Burger}       => {Potato} 0.6666667 1.0        1.2  4    
[4] {Onion}        => {Potato} 0.6666667 1.0        1.2  4    
[5] {Onion,Burger} => {Potato} 0.5000000 1.0        1.2  3    


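The output lists the value of each metric for every rule. You can sort by whichever metric interests you, but any interpretation still has to take into account what the data actually mean.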
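
As a cross-check on the table above, the three metrics can be recomputed by hand in base R. This is a minimal sketch under the usual definitions (confidence = support of the whole rule / support of the left-hand side; lift = confidence / support of the right-hand side). The helpers `support` and `rule_metrics` are hypothetical names, not arules functions.

```r
# Toy baskets as a 0/1 matrix (one row per transaction)
baskets <- cbind(Onion  = c(1, 0, 0, 1, 1, 1),
                 Potato = c(1, 1, 0, 1, 1, 1),
                 Burger = c(1, 1, 0, 0, 1, 1),
                 Milk   = c(0, 1, 1, 1, 0, 1),
                 Beer   = c(0, 0, 1, 0, 1, 0))

# Share of transactions that contain all of the given items
support <- function(items) {
  mean(rowSums(baskets[, items, drop = FALSE]) == length(items))
}

# Support, confidence and lift of the rule lhs => rhs
rule_metrics <- function(lhs, rhs) {
  supp <- support(c(lhs, rhs))
  conf <- supp / support(lhs)
  c(support = supp, confidence = conf, lift = conf / support(rhs))
}

burger_potato <- rule_metrics("Burger", "Potato")
potato_burger <- rule_metrics("Potato", "Burger")
onion_burger_potato <- rule_metrics(c("Onion", "Burger"), "Potato")

print(rbind(burger_potato, potato_burger, onion_burger_potato))
```

The two directions of a rule share the same support and lift and differ only in confidence, which is why sorting by lift alone cannot separate {Burger} => {Potato} from {Potato} => {Burger}.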

In [74]:
rules.sub = subset(frequent_itemsets, subset = lift > 1.125 & confidence > 0.8)
inspect(rules.sub)

    lhs               rhs      support   confidence lift count
[1] {Burger}       => {Potato} 0.6666667 1          1.2  4    
[2] {Onion}        => {Potato} 0.6666667 1          1.2  4    
[3] {Onion,Burger} => {Potato} 0.5000000 1          1.2  3    


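This yields a few useful findings:
- (Onion and Potato) and (Burger and Potato) can be sold together
- If Onion and Burger are both in the basket, the customer is also quite likely to buy Potato; if Potato is not in the basket yet, it is worth recommending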

### One-hot encoding conversion
- Real-world data usually has to be converted to one-hot encoding
- The example below works through comma-separated text data

In [75]:
retail = structure(list(ID=c(1,2,3,4,5,6),
                                        Basket=c("'Beer','Diaper','Pretzels','Chips','Aspirin'",
                                                 "'Diaper','Beer','Chips','Lotion','Juice','BabyFood','Milk'",
                                                 "'Soda','Chips','Milk'",
                                                 "'Soup','Beer','Diaper','Milk','IceCream'",
                                                 "'Soda','Coffee','Milk','Bread'",
                                                 "'Beer','Chips'")),
                                   row.names = c(NA, 6),
                                   class = "data.frame")
retail

Unnamed: 0_level_0,ID,Basket
Unnamed: 0_level_1,<dbl>,<chr>
1,1,"'Beer','Diaper','Pretzels','Chips','Aspirin'"
2,2,"'Diaper','Beer','Chips','Lotion','Juice','BabyFood','Milk'"
3,3,"'Soda','Chips','Milk'"
4,4,"'Soup','Beer','Diaper','Milk','IceCream'"
5,5,"'Soda','Coffee','Milk','Bread'"
6,6,"'Beer','Chips'"


In [76]:
# Pull out the ID column first
retail_id = retail['ID']
retail_id

Unnamed: 0_level_0,ID
Unnamed: 0_level_1,<dbl>
1,1
2,2
3,3
4,4
5,5
6,6


In [77]:
# Split each basket into items on commas
library('qdapTools')
retail_Basket = mtabulate(strsplit(retail$Basket, ",",fixed = TRUE))
names(retail_Basket) = gsub("'", "", names(retail_Basket))
retail_Basket

Aspirin,BabyFood,Beer,Bread,Chips,Coffee,Diaper,IceCream,Juice,Lotion,Milk,Pretzels,Soda,Soup
<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,0,1,0,1,0,1,0,0,0,0,1,0,0
0,1,1,0,1,0,1,0,1,1,1,0,0,0
0,0,0,0,1,0,0,0,0,0,1,0,1,0
0,0,1,0,0,0,1,1,0,0,1,0,0,1
0,0,0,1,0,1,0,0,0,0,1,0,1,0
0,0,1,0,1,0,0,0,0,0,0,0,0,0
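
If qdapTools is not available, the same one-hot table can be built with base R alone. The sketch below uses a shortened, made-up set of baskets and illustrative variable names (`basket_text`, `one_hot`); it is an alternative to `mtabulate`, not what the notebook itself runs.

```r
# Three shortened baskets in the same quoted, comma-separated format
basket_text <- c("'Beer','Diaper','Pretzels'",
                 "'Soda','Chips','Milk'",
                 "'Beer','Chips'")

# Split on literal commas, then strip the single quotes around each item
items <- lapply(strsplit(basket_text, ",", fixed = TRUE),
                function(x) gsub("'", "", x, fixed = TRUE))

# One column per distinct item, in alphabetical order
all_items <- sort(unique(unlist(items)))

# For each basket, flag which of the known items it contains
one_hot <- t(vapply(items,
                    function(x) as.integer(all_items %in% x),
                    integer(length(all_items))))
colnames(one_hot) <- all_items

print(one_hot)
```

The result is already a 0/1 matrix, so it could be passed to apriori() without the as.matrix() step.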


In [78]:
# Attach the ID column again
retail = cbind(retail_id,retail_Basket)
retail

Unnamed: 0_level_0,ID,Aspirin,BabyFood,Beer,Bread,Chips,Coffee,Diaper,IceCream,Juice,Lotion,Milk,Pretzels,Soda,Soup
Unnamed: 0_level_1,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,1,1,0,1,0,1,0,1,0,0,0,0,1,0,0
2,2,0,1,1,0,1,0,1,0,1,1,1,0,0,0
3,3,0,0,0,0,1,0,0,0,0,0,1,0,1,0
4,4,0,0,1,0,0,0,1,1,0,0,1,0,0,1
5,5,0,0,0,1,0,1,0,0,0,0,1,0,1,0
6,6,0,0,1,0,1,0,0,0,0,0,0,0,0,0


In [79]:
# As before, convert to a matrix first
frequent_itemsets_2 = apriori(as.matrix(retail[,-1]),parameter=list(support=0.5,minlen=1))

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.8    0.1    1 none FALSE            TRUE       5     0.5      1
 maxlen target   ext
     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 3 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[14 item(s), 6 transaction(s)] done [0.00s].
sorting and recoding items ... [4 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [1 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].


In [80]:
inspect(frequent_itemsets_2)

    lhs         rhs    support confidence lift count
[1] {Diaper} => {Beer} 0.5     1          1.5  3    


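The result shows that Diaper has the stronger influence on Beer. Only {Diaper} => {Beer} passes the thresholds: every basket that contains Diaper also contains Beer (confidence 1), while the reverse rule {Beer} => {Diaper} reaches a confidence of only 0.75 and is dropped by the default minimum confidence of 0.8.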
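
The asymmetry can be checked directly from the Beer and Diaper columns of the one-hot table. This is a minimal base-R sketch; the vectors are copied from the table above and the variable names are illustrative.

```r
# Beer and Diaper columns of the one-hot table, baskets 1 to 6
beer   <- c(1, 1, 0, 1, 0, 1)
diaper <- c(1, 1, 0, 1, 0, 0)

# Share of baskets that contain both items
supp_both <- mean(beer == 1 & diaper == 1)

# Confidence in each direction
conf_diaper_beer <- supp_both / mean(diaper)  # {Diaper} => {Beer}
conf_beer_diaper <- supp_both / mean(beer)    # {Beer} => {Diaper}

# Lift is the same for both directions
lift <- supp_both / (mean(beer) * mean(diaper))

print(c(diaper_to_beer = conf_diaper_beer,
        beer_to_diaper = conf_beer_diaper,
        lift = lift))
```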

### Hands-on exercise: associations between movie genres
Dataset: [MovieLens (small)](https://grouplens.org/datasets/movielens/)

In [81]:
movies = read.csv('movies.csv')
head(movies)

Unnamed: 0_level_0,movieId,title,genres
Unnamed: 0_level_1,<int>,<fct>,<fct>
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,2,Jumanji (1995),Adventure|Children|Fantasy
3,3,Grumpier Old Men (1995),Comedy|Romance
4,4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,5,Father of the Bride Part II (1995),Comedy
6,6,Heat (1995),Action|Crime|Thriller


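The data contain movie titles and genre labels. The first step is again one-hot encoding; because genres is a factor, we convert it to character before splitting.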
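
One detail matters when splitting on the pipe character: `|` is the alternation operator in regular expressions, so it has to be matched literally, either with `fixed = TRUE` or by escaping it. The short sketch below uses two genre strings from the table above; the variable names are illustrative.

```r
genres <- c("Adventure|Children|Fantasy", "Comedy|Romance")

# Literal split on "|": one element per genre
split_fixed <- strsplit(genres, "|", fixed = TRUE)

# Equivalent literal split with an escaped regular expression
split_escaped <- strsplit(genres, "\\|")

# Without fixed = TRUE, "|" is read as a regular expression and the
# string is no longer split into genres
split_regex <- strsplit(genres[2], "|")

print(split_fixed)
print(split_regex)
```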

In [82]:
movies_one = mtabulate(strsplit(as.character(movies$genres), "|",fixed = TRUE))
movies_one = cbind(movieId = movies$movieId,movies_one)
head(movies_one)

Unnamed: 0_level_0,movieId,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,...,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,1,0,0,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,4,0,0,0,0,0,1,0,0,1,...,0,0,0,0,0,1,0,0,0,0
5,5,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,6,0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0


In [83]:
dim(movies_one)

The dataset contains 9125 movies and 20 distinct genre labels in total, including "(no genres listed)".

In [84]:
# In practice the minimum support is estimated from the number of samples and items; it is usually not set too high, otherwise no rules are found
frequent_itemsets_movies = apriori(as.matrix(movies_one[,-1]),parameter=list(support=0.025,confidence=0.4,minlen=2))

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.4    0.1    1 none FALSE            TRUE       5   0.025      2
 maxlen target   ext
     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 228 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[20 item(s), 9125 transaction(s)] done [0.00s].
sorting and recoding items ... [16 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [18 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].


In [85]:
inspect(frequent_itemsets_movies)

     lhs                 rhs         support    confidence lift      count
[1]  {War}            => {Drama}     0.03101370 0.7711172  1.6120147 283  
[2]  {Animation}      => {Children}  0.02706849 0.5525727  8.6487581 247  
[3]  {Children}       => {Animation} 0.02706849 0.4236707  8.6487581 247  
[4]  {Mystery}        => {Thriller}  0.03605479 0.6058932  3.1976723 329  
[5]  {Mystery}        => {Drama}     0.03167123 0.5322284  1.1126194 289  
[6]  {Children}       => {Adventure} 0.02926027 0.4579760  3.7412989 267  
[7]  {Children}       => {Comedy}    0.03287671 0.5145798  1.4164526 300  
[8]  {Horror}         => {Thriller}  0.04339726 0.4515393  2.3830517 396  
[9]  {Fantasy}        => {Adventure} 0.03068493 0.4281346  3.4975182 280  
[10] {Sci-Fi}         => {Action}    0.04098630 0.4722222  2.7890147 374  
[11] {Crime}          => {Thriller}  0.05786301 0.4800000  2.5332562 528  
[12] {Crime}          => {Drama}     0.06761644 0.5609091  1.1725763 617  
[13] {Adventure}      => 

In [86]:
# Find the rules whose left-hand side contains the Children genre
rules_movies = subset(frequent_itemsets_movies, lhs %pin% "Children")
inspect(rules_movies)

    lhs           rhs         support    confidence lift     count
[1] {Children} => {Animation} 0.02706849 0.4236707  8.648758 247  
[2] {Children} => {Adventure} 0.02926027 0.4579760  3.741299 267  
[3] {Children} => {Comedy}    0.03287671 0.5145798  1.416453 300  


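Children and Animation are the most strongly associated pair of genres (lift 8.65), which matches common sense. Adventure comes next (lift 3.74), and that association is also easy to understand. Comedy is last (lift 1.42): it is associated with Children, but much less strongly than the other two.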
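
One way to see why Comedy has the highest confidence but the lowest lift is to recover each right-hand side's own support from the printed values, using lift = confidence / support(rhs). The sketch below copies the confidence and lift figures from the output above; the data frame and column names are illustrative.

```r
# Confidence and lift of the three {Children} => ... rules, copied from the output above
children_rules <- data.frame(
  rhs        = c("Animation", "Adventure", "Comedy"),
  confidence = c(0.4236707, 0.4579760, 0.5145798),
  lift       = c(8.648758, 3.741299, 1.416453)
)

# lift = confidence / support(rhs), so support(rhs) = confidence / lift
children_rules$rhs_support <- children_rules$confidence / children_rules$lift

# Strongest association first
children_rules <- children_rules[order(-children_rules$lift), ]
print(children_rules)
```

Comedy is by far the most common of the three genres across all movies, so a Children movie being a Comedy is less surprising than its being an Animation, and the lift is correspondingly lower.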