## Association Rules - R
### - _Ankur Patel_

### Introduction:

Association rule mining is a technique for identifying underlying relations between different items. An association rule has two parts: an antecedent (if) and a consequent (then). The Apriori algorithm evaluates rules with three main measures: support, confidence, and lift. In this data set, Churn is the only consequent of interest, while the other variables can appear on either side of a rule.
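
As a brief reference, the three measures can be written as follows. The notation is an assumption introduced here for illustration: $A$ is the antecedent, $B$ the consequent, $N$ the total number of transactions, and $\text{count}(\cdot)$ the number of transactions containing the given items.

$$
\text{support}(A \Rightarrow B) = \frac{\text{count}(A \cup B)}{N}
$$

$$
\text{confidence}(A \Rightarrow B) = \frac{\text{support}(A \Rightarrow B)}{\text{support}(A)}
$$

$$
\text{lift}(A \Rightarrow B) = \frac{\text{confidence}(A \Rightarrow B)}{\text{support}(B)}
$$

A lift above 1 suggests that the antecedent and consequent occur together more often than would be expected if they were independent.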

The Churn data set will be used to predict Churn from VMail Plan, Intl Plan, and CustServ Calls (which will be converted to an ordinal factor). 

### Preprocessing:

In [30]:
library(dplyr)

In [3]:
library(tidyverse)

-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.2.1     v purrr   0.3.2
v tibble  2.1.3     v dplyr   0.8.3
v tidyr   1.0.0     v stringr 1.4.0
v readr   1.3.1     v forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()


In [5]:
library(ggplot2)

In [4]:
library(arules)

Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack


Attaching package: 'arules'

The following object is masked from 'package:dplyr':

    recode

The following objects are masked from 'package:base':

    abbreviate, write



In [6]:
library(arulesViz)

Loading required package: grid
Registered S3 method overwritten by 'seriation':
  method         from 
  reorder.hclust gclus


In [12]:
# read dataset
df <- read.csv('Churn.csv')
head(df)

State,Account.Length,Area.Code,Phone,Int.l.Plan,VMail.Plan,VMail.Message,Day.Mins,Day.Calls,Day.Charge,...,Eve.Calls,Eve.Charge,Night.Mins,Night.Calls,Night.Charge,Intl.Mins,Intl.Calls,Intl.Charge,CustServ.Calls,Churn.
KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.
AL,118,510,391-8027,yes,no,0,223.4,98,37.98,...,101,18.75,203.9,118,9.18,6.3,6,1.7,0,False.


In [35]:
# subset of 3 predictors and Churn
df <- select(df, "VMail.Plan","Int.l.Plan","CustServ.Calls","Churn.")
head(df)

VMail.Plan,Int.l.Plan,CustServ.Calls,Churn.
yes,no,1,False.
yes,no,1,False.
no,no,0,False.
no,yes,2,False.
no,yes,3,False.
no,yes,0,False.


In [37]:
# change CustServ Calls to an ordered factor
df$CustServ.Calls <- ordered(as.factor(df$CustServ.Calls))
head(df)

VMail.Plan,Int.l.Plan,CustServ.Calls,Churn.
yes,no,1,False.
yes,no,1,False.
no,no,0,False.
no,yes,2,False.
no,yes,3,False.
no,yes,0,False.


In [45]:
# baseline distributions of the variables
t1 <- table(df$VMail.Plan)
t11 <- rbind(t1, round(prop.table(t1), 4))
colnames(t11) <- c("VMail.Plan = no", "VMail.Plan = yes")
rownames(t11) <- c("Count", "Proportion")
t2 <- table(df$Int.l.Plan)
t22 <- rbind(t2, round(prop.table(t2), 4))
colnames(t22) <- c("Intl.Plan = no", "Intl.Plan = yes")
rownames(t22) <- c("Count", "Proportion")
t3 <- table(df$CustServ.Calls)
t33 <- rbind(t3, round(prop.table(t3), 4))
# column names stay as the call counts 0-9 (t33 has 10 columns, so 4 names would not fit)
rownames(t33) <- c("Count", "Proportion")
t4 <- table(df$Churn.)
t44 <- rbind(t4, round(prop.table(t4), 4))
colnames(t44) <- c("Churn = no", "Churn = yes")
rownames(t44) <- c("Count", "Proportion")

In [49]:
t11

,VMail.Plan = no,VMail.Plan = yes
Count,2411.0,922.0
Proportion,0.7234,0.2766


In [50]:
t22

,Intl.Plan = no,Intl.Plan = yes
Count,3010.0,323.0
Proportion,0.9031,0.0969


In [51]:
t33

,0,1,2,3,4,5,6,7,8,9
Count,697.0,1181.0,759.0,429.0,166.0,66.0,22.0,9.0,2.0,2.0
Proportion,0.2091,0.3543,0.2277,0.1287,0.0498,0.0198,0.0066,0.0027,0.0006,0.0006


In [52]:
t44

,Churn = no,Churn = yes
Count,2850.0,483.0
Proportion,0.8551,0.1449


### Review:
- After loading the libraries and the Churn data set, the data was subset to 3 predictors and the target Churn
- CustServ.Calls was set to ordinal by wrapping as.factor() inside ordered()
- Distributions of the variables were computed using table(), prop.table(), and round()
    - e.g.:
    - t1 - counts of customers with and without VMail.Plan
    - t11 - matrix of the counts and proportions
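
The preprocessing steps reviewed above can be sketched on toy data. This is a minimal illustration, not the notebook's code: `dist_table()` is a hypothetical helper, and the `plan` and `calls` vectors are made-up values rather than rows from Churn.csv.

```r
# Hypothetical helper: counts and rounded proportions for one categorical vector
dist_table <- function(x, digits = 4) {
  counts <- table(x)
  # naming the rbind() arguments sets the row names directly
  rbind(Count = counts, Proportion = round(prop.table(counts), digits))
}

# Toy data (assumed values, not taken from Churn.csv)
plan <- c("no", "yes", "no", "no")
res <- dist_table(plan)
print(res)

# as.factor() builds the levels, ordered() marks them as ordinal
calls <- ordered(as.factor(c(1, 0, 3, 1)))
print(levels(calls))
print(is.ordered(calls))
```

Naming the rows inside `rbind()` avoids a separate `rownames()` call, and the column names come from the factor levels, so they cannot drift out of step with the data.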

### Find the association rule with the greatest lift.
### Report the following for the rule: Number of instances, Support %, Confidence %, Lift.

- minimum antecedent support - 1%
- minimum rule confidence - 5%
- maximum number of antecedents - 1

In [53]:
# convert table to transactions for arules
df1 <- as(df, "transactions")
summary(df1)

transactions as itemMatrix in sparse format with
 3333 rows (elements/itemsets/transactions) and
 16 columns (items) and a density of 0.25 

most frequent items:
   Int.l.Plan=no    Churn.=False.    VMail.Plan=no CustServ.Calls=1 
            3010             2850             2411             1181 
  VMail.Plan=yes          (Other) 
             922             2958 

element (itemset/transaction) length distribution:
sizes
   4 
3333 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      4       4       4       4       4       4 

includes extended item information - examples:
          labels  variables levels
1  VMail.Plan=no VMail.Plan     no
2 VMail.Plan=yes VMail.Plan    yes
3  Int.l.Plan=no Int.l.Plan     no

includes extended transaction information - examples:
  transactionID
1             1
2             2
3             3

In [56]:
# explore 4 transactions
inspect(head(df1, 4))

    items              transactionID
[1] {VMail.Plan=yes,                
     Int.l.Plan=no,                 
     CustServ.Calls=1,              
     Churn.=False.}                1
[2] {VMail.Plan=yes,                
     Int.l.Plan=no,                 
     CustServ.Calls=1,              
     Churn.=False.}                2
[3] {VMail.Plan=no,                 
     Int.l.Plan=no,                 
     CustServ.Calls=0,              
     Churn.=False.}                3
[4] {VMail.Plan=no,                 
     Int.l.Plan=yes,                
     CustServ.Calls=2,              
     Churn.=False.}                4


In [61]:
# target="rules" for association rules
# minlen=2 and maxlen=2 restrict rules to exactly 1 item in the antecedent
# note: supp is the minimum support of the whole rule; ext=TRUE also reports lhs.support
rules <- apriori(df1, parameter=list(target="rules", supp=0.01, 
                                     conf=0.05, maxlen=2, minlen=2, ext=TRUE))

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
       0.05    0.1    1 none FALSE            TRUE       5    0.01      2
 maxlen target  ext
      2  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 33 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[16 item(s), 3333 transaction(s)] done [0.00s].
sorting and recoding items ... [12 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2

"Mining stopped (maxlen reached). Only patterns up to a length of 2 returned!"

 done [0.00s].
writing ... [83 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
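
The call above places no restriction on which items may appear as the consequent, so Churn also shows up as an antecedent in the mined rules. One possible way to keep Churn on the right-hand side only is the `appearance` argument of `apriori()`. The sketch below uses a small made-up data frame (`toy`, with hypothetical `Plan` and `Churn` columns) rather than the Churn data, so the item labels differ from those in this notebook.

```r
library(arules)

# Toy data frame (assumed values, not the Churn data)
toy <- data.frame(
  Plan  = factor(c("yes", "no", "no", "yes", "no", "no")),
  Churn = factor(c("False", "True", "True", "False", "False", "True"))
)
toy_tr <- as(toy, "transactions")

# Allow only Churn items on the right-hand side; all other items default to the left
toy_rules <- apriori(
  toy_tr,
  parameter  = list(target = "rules", supp = 0.01, conf = 0.05,
                    minlen = 2, maxlen = 2),
  appearance = list(rhs = c("Churn=True", "Churn=False"), default = "lhs"),
  control    = list(verbose = FALSE)
)

# Sort by lift so the strongest rule comes first
inspect(sort(toy_rules, by = "lift"))
```

For the Churn data the corresponding item labels would be `Churn.=True.` and `Churn.=False.`, as shown in the transaction summary.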


In [63]:
inspect(head(rules))

    lhs                   rhs                support    confidence lhs.support
[1] {CustServ.Calls=5} => {Churn.=True.}     0.01200120 0.60606061 0.01980198 
[2] {Churn.=True.}     => {CustServ.Calls=5} 0.01200120 0.08281573 0.14491449 
[3] {CustServ.Calls=5} => {VMail.Plan=no}    0.01470147 0.74242424 0.01980198 
[4] {CustServ.Calls=5} => {Int.l.Plan=no}    0.01800180 0.90909091 0.01980198 
[5] {CustServ.Calls=4} => {Churn.=True.}     0.02280228 0.45783133 0.04980498 
[6] {Churn.=True.}     => {CustServ.Calls=4} 0.02280228 0.15734990 0.14491449 
    lift     count
[1] 4.182195 40   
[2] 4.182195 40   
[3] 1.026338 49   
[4] 1.006645 60   
[5] 3.159321 76   
[6] 3.159321 76   


In [65]:
# top 10 rules by support
inspect(head(rules, by="support", n=10))

     lhs                   rhs                support   confidence lhs.support
[1]  {Churn.=False.}    => {Int.l.Plan=no}    0.7992799 0.9347368  0.8550855  
[2]  {Int.l.Plan=no}    => {Churn.=False.}    0.7992799 0.8850498  0.9030903  
[3]  {VMail.Plan=no}    => {Int.l.Plan=no}    0.6540654 0.9041891  0.7233723  
[4]  {Int.l.Plan=no}    => {VMail.Plan=no}    0.6540654 0.7242525  0.9030903  
[5]  {VMail.Plan=no}    => {Churn.=False.}    0.6024602 0.8328494  0.7233723  
[6]  {Churn.=False.}    => {VMail.Plan=no}    0.6024602 0.7045614  0.8550855  
[7]  {CustServ.Calls=1} => {Int.l.Plan=no}    0.3207321 0.9051651  0.3543354  
[8]  {Int.l.Plan=no}    => {CustServ.Calls=1} 0.3207321 0.3551495  0.9030903  
[9]  {CustServ.Calls=1} => {Churn.=False.}    0.3177318 0.8966977  0.3543354  
[10] {Churn.=False.}    => {CustServ.Calls=1} 0.3177318 0.3715789  0.8550855  
     lift      count
[1]  1.0350425 2664 
[2]  1.0350425 2664 
[3]  1.0012167 2180 
[4]  1.0012167 2180 
[5]  0.9739955 2008 
[6]  

In [66]:
# top 10 rules by confidence
inspect(head(rules, by="confidence", n=10))

     lhs                   rhs             support   confidence lhs.support
[1]  {Churn.=False.}    => {Int.l.Plan=no} 0.7992799 0.9347368  0.85508551 
[2]  {CustServ.Calls=2} => {Int.l.Plan=no} 0.2091209 0.9183136  0.22772277 
[3]  {VMail.Plan=yes}   => {Churn.=False.} 0.2526253 0.9132321  0.27662766 
[4]  {CustServ.Calls=3} => {Int.l.Plan=no} 0.1173117 0.9114219  0.12871287 
[5]  {CustServ.Calls=5} => {Int.l.Plan=no} 0.0180018 0.9090909  0.01980198 
[6]  {CustServ.Calls=1} => {Int.l.Plan=no} 0.3207321 0.9051651  0.35433543 
[7]  {VMail.Plan=no}    => {Int.l.Plan=no} 0.6540654 0.9041891  0.72337234 
[8]  {VMail.Plan=yes}   => {Int.l.Plan=no} 0.2490249 0.9002169  0.27662766 
[9]  {CustServ.Calls=3} => {Churn.=False.} 0.1155116 0.8974359  0.12871287 
[10] {CustServ.Calls=1} => {Churn.=False.} 0.3177318 0.8966977  0.35433543 
     lift      count
[1]  1.0350425 2664 
[2]  1.0168569  697 
[3]  1.0680009  842 
[4]  1.0092257  391 
[5]  1.0066445   60 
[6]  1.0022975 1069 
[7]  1.0012167 21

In [67]:
# top 10 rules by lift
inspect(head(rules, by="lift", n=10))

     lhs                   rhs                support    confidence lhs.support
[1]  {CustServ.Calls=5} => {Churn.=True.}     0.01200120 0.60606061 0.01980198 
[2]  {Churn.=True.}     => {CustServ.Calls=5} 0.01200120 0.08281573 0.14491449 
[3]  {CustServ.Calls=4} => {Churn.=True.}     0.02280228 0.45783133 0.04980498 
[4]  {Churn.=True.}     => {CustServ.Calls=4} 0.02280228 0.15734990 0.14491449 
[5]  {Int.l.Plan=yes}   => {Churn.=True.}     0.04110411 0.42414861 0.09690969 
[6]  {Churn.=True.}     => {Int.l.Plan=yes}   0.04110411 0.28364389 0.14491449 
[7]  {Int.l.Plan=yes}   => {CustServ.Calls=0} 0.02490249 0.25696594 0.09690969 
[8]  {CustServ.Calls=0} => {Int.l.Plan=yes}   0.02490249 0.11908178 0.20912091 
[9]  {VMail.Plan=no}    => {Churn.=True.}     0.12091209 0.16715056 0.72337234 
[10] {Churn.=True.}     => {VMail.Plan=no}    0.12091209 0.83436853 0.14491449 
     lift     count
[1]  4.182195  40  
[2]  4.182195  40  
[3]  3.159321  76  
[4]  3.159321  76  
[5]  2.926889 137  


### Conclusion:
The Association Rule with the greatest Lift:
- Lift - 4.18
- Antecedent - {CustServ.Calls=5}
- Consequent - {Churn.=True.}
- Number of instances - 40
- Support - 1.2%
- Confidence - 60.6%
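
As a sanity check, the reported values can be reproduced by hand from the counts shown earlier in the notebook: 3333 customers in total, 66 customers with 5 customer service calls, 483 churners, and 40 customers who both made 5 calls and churned.

$$
\text{support} = \frac{40}{3333} \approx 0.0120
$$

$$
\text{confidence} = \frac{40}{66} \approx 0.6061
$$

$$
\text{lift} = \frac{40/66}{483/3333} \approx \frac{0.6061}{0.1449} \approx 4.18
$$

These agree with the `support`, `confidence`, and `lift` columns reported by `inspect()` for this rule.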