### Introduction

* `dplyr` is fine when we want to use a single LHS item to predict a single RHS item
* What about multiple LHS items?
* What about the best rule among all RHS items?
* We need a better "search" algorithm

### Automation with `arules`

Automate the process with the `arules` package.

In [2]:
install.packages("arules")



In [3]:
library(arules)



To use the `arules` package, the columns must be factors:

In [None]:
library(dplyr)

In [None]:
groceries <- read.csv('https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/data/Groceries.csv')
head(groceries)

In [None]:
# Convert every integer column to a factor so that each column value becomes an item
groc_factors <-
  groceries %>%
  mutate_if(is.integer, as.factor)

### Compute the rules


* Use the `apriori()` function to compute rules; use `parameter = ` to set minimum values
* Default: `parameter = list(support = .1, confidence = .8, maxlen = 10)`
    * `maxlen = 2` limits each rule to two items in total: at most one item on the LHS and one on the RHS
    * **The support filter refers to the JOINT support, *SUPPORT{LHS, RHS}***

In [None]:
groc_rules <- apriori(groc_factors, 
                      parameter = list(supp = 0.05,
                                       conf = 0.2,
                                       maxlen = 2))
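
To see what `maxlen` and `minlen` control, here is a minimal sketch on a hypothetical eight-row toy data set (`toy` and its columns are made up for illustration and are not part of the groceries data). With the default `minlen = 1`, `apriori` also returns rules with an empty LHS; adding `minlen = 2` keeps only rules with exactly one item on each side.

```r
library(arules)

# Hypothetical toy baskets: one row per transaction, 1 = item purchased
toy <- data.frame(
  bread  = c(1, 1, 1, 0, 1, 0, 1, 1),
  butter = c(1, 1, 0, 0, 1, 0, 1, 0),
  milk   = c(1, 0, 1, 1, 1, 0, 1, 1)
)
toy_factors <- as.data.frame(lapply(toy, as.factor))

# maxlen = 2 with the default minlen = 1: rules of size 1 (empty LHS) are allowed
rules_max2 <- apriori(toy_factors,
                      parameter = list(supp = 0.1, conf = 0.2, maxlen = 2),
                      control = list(verbose = FALSE))

# minlen = 2 and maxlen = 2: every rule has one LHS item and one RHS item
rules_len2 <- apriori(toy_factors,
                      parameter = list(supp = 0.1, conf = 0.2,
                                       minlen = 2, maxlen = 2),
                      control = list(verbose = FALSE))

# size() counts the items in each rule (LHS plus RHS)
table(size(rules_max2))
table(size(rules_len2))
```

If the empty-LHS rules are not of interest, setting `minlen = 2` in the groceries call would remove them in the same way.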

### Investigate

* `apriori` output is an object of S4 class `"rules"`
* Access its slots with `@` (as you would use `$` on a list), or with accessor functions

In [None]:
class(groc_rules)
str(groc_rules)
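
As a small sketch of the two access styles, the block below builds rules from a hypothetical toy data set (`toy` is made up for illustration) and reads the quality measures first through the `@quality` slot and then through the `quality()` accessor. It also shows `lhs()`, `rhs()`, and coercion to a data frame, which can be handy if you want to go back to `dplyr`-style work.

```r
library(arules)

# Hypothetical toy baskets: one row per transaction, 1 = item purchased
toy <- data.frame(
  bread  = c(1, 1, 1, 0, 1, 0, 1, 1),
  butter = c(1, 1, 0, 0, 1, 0, 1, 0),
  milk   = c(1, 0, 1, 1, 1, 0, 1, 1)
)
toy_factors <- as.data.frame(lapply(toy, as.factor))

toy_rules <- apriori(toy_factors,
                     parameter = list(supp = 0.25, conf = 0.5,
                                      minlen = 2, maxlen = 2),
                     control = list(verbose = FALSE))

isS4(toy_rules)        # the rules object is an S4 object
slotNames(toy_rules)   # names of the slots that `@` can reach

q_slot <- toy_rules@quality    # direct slot access
q_fun  <- quality(toy_rules)   # accessor function for the same data

# Item sets on each side of the rules, as readable labels
lhs_labels <- labels(lhs(toy_rules))
rhs_labels <- labels(rhs(toy_rules))

# One row per rule, with the quality measures as columns
rules_df <- as(toy_rules, "data.frame")
head(rules_df)
```

Accessor functions are usually the safer choice, since they do not depend on how the object is stored internally.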

### Inspect

Use `inspect()` to get a feel for the structure:

In [None]:
inspect(groc_rules[1:10]) 

**Remember, `support` column is the JOINT support of {LHS,RHS}**
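
To make the joint support concrete, here is a minimal base-R sketch that computes the measures by hand for one rule, `{bread=1} => {milk=1}`, on a hypothetical toy data set (`toy` is made up for illustration). It uses the standard definitions: confidence is the joint support divided by the LHS support, and lift is the confidence divided by the RHS support.

```r
# Hypothetical toy baskets: one row per transaction, 1 = item purchased
toy <- data.frame(
  bread  = c(1, 1, 1, 0, 1, 0, 1, 1),
  butter = c(1, 1, 0, 0, 1, 0, 1, 0),
  milk   = c(1, 0, 1, 1, 1, 0, 1, 1)
)

lhs <- toy$bread == 1   # transactions containing the LHS
rhs <- toy$milk == 1    # transactions containing the RHS

supp_lhs   <- mean(lhs)         # share of baskets with bread: 6/8
supp_rhs   <- mean(rhs)         # share of baskets with milk: 6/8
supp_joint <- mean(lhs & rhs)   # share with BOTH bread and milk: 5/8

confidence <- supp_joint / supp_lhs   # P(milk | bread)
lift       <- confidence / supp_rhs   # P(milk | bread) / P(milk)

c(support = supp_joint, confidence = confidence, lift = lift)
```

Here both items appear in 75% of the baskets, but the rule's support is only 62.5%, so a support threshold of 0.7 would drop the rule even though each item clears it on its own.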

### Pull out the rules with whole.milk

* Use `subset()` function to filter rules
* Use `head(rules, n= , by = , decreasing = )` to select top-n and bottom-n rules 

In [None]:
milk_rules <- subset(groc_rules, subset = rhs %in% 'whole.milk=1')
inspect(head(milk_rules, n = 5, by = 'lift'))                      # top 5 rules by lift
inspect(head(milk_rules, n = 5, by = 'lift', decreasing = FALSE))  # bottom 5 rules by lift
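
`subset()` conditions can also combine item filters with quality measures, and they can target the LHS. The sketch below, on a hypothetical toy data set (`toy` is made up for illustration), keeps rules that predict `milk=1` with lift above 1, and separately uses the partial-matching operator `%pin%` to pick up rules whose LHS mentions any level of the `butter` column.

```r
library(arules)

# Hypothetical toy baskets: one row per transaction, 1 = item purchased
toy <- data.frame(
  bread  = c(1, 1, 1, 0, 1, 0, 1, 1),
  butter = c(1, 1, 0, 0, 1, 0, 1, 0),
  milk   = c(1, 0, 1, 1, 1, 0, 1, 1)
)
toy_factors <- as.data.frame(lapply(toy, as.factor))

toy_rules <- apriori(toy_factors,
                     parameter = list(supp = 0.1, conf = 0.2,
                                      minlen = 2, maxlen = 2),
                     control = list(verbose = FALSE))

# Filter on the RHS item and on a quality measure in one condition
milk_up <- subset(toy_rules, subset = rhs %in% "milk=1" & lift > 1)

# Filter on the LHS with partial matching: matches "butter=0" and "butter=1"
butter_lhs <- subset(toy_rules, subset = lhs %pin% "butter")

inspect(sort(milk_up, by = "lift"))
inspect(butter_lhs)
```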

### Using piping

In [None]:
groc_rules %>%
  subset(rhs %in% 'whole.milk=1') %>%
  head(10, by = 'lift') %>% 
  inspect()

### Considering more than one item on the LHS

In [None]:
# Use control = list(verbose = FALSE) to suppress progress printing
groc_rules2 <- apriori(groc_factors,
                       parameter = list(supp = 0.05,
                                        conf = 0.2,
                                        maxlen = 4),
                       control = list(verbose = FALSE))

#### Finding the 10 best rules for predicting whole milk, considering rules with at least 8% support.

In [None]:
# "At least 8% support" means support >= .08
milk_rules_8pct <- subset(groc_rules2, subset = rhs %in% 'whole.milk=1' & support >= .08)
milk_rules_8pct %>%
  head(n = 10, by = 'lift') %>%
  inspect()

#### Finding the 10 best rules overall, among rules with at least 10% support. 

In [None]:
# "At least 10% support" means support >= .1
rules_10pct <- subset(groc_rules2, subset = support >= .1)
rules_10pct %>%
  head(10, by = 'lift') %>%
  inspect()

> Interpretation of lift = 1.095: *Knowing that vegetables were NOT purchased and soda WAS purchased increases the likelihood that milk was NOT purchased by 9.5%, relative to the overall rate at which milk was NOT purchased.*
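
The arithmetic behind that reading can be written out using the standard definition of lift, where $L$ stands for the rule's LHS (vegetables NOT purchased and soda purchased) and $R$ for its RHS (milk NOT purchased). The symbols $L$ and $R$ are introduced here only for illustration.

$$
\text{lift}(L \Rightarrow R) = \frac{\text{confidence}(L \Rightarrow R)}{\text{support}(R)} = \frac{P(R \mid L)}{P(R)}
$$

With a lift of 1.095:

$$
P(R \mid L) = 1.095 \times P(R)
\quad\Longrightarrow\quad
\frac{P(R \mid L) - P(R)}{P(R)} = 0.095 = 9.5\%
$$

So the 9.5% is a relative increase over the baseline rate $P(R)$, not an increase of 9.5 percentage points.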

### Visualizing association rules


The `arulesViz` package can be used to visualize and interact with individual rules.

In [None]:
install.packages('arulesViz')

In [None]:
library(arulesViz)

In [None]:
plot(milk_rules_8pct)

In [None]:
#change the visual encoding:
plot(milk_rules_8pct, measure = c('support','lift'), shading = 'confidence')

Use `engine = 'interactive'` to highlight and inspect rules: double-click to shade, then click "inspect":

In [None]:
plot(milk_rules_8pct, measure = c('support','lift'), shading = 'confidence', 
     engine = 'interactive')

<img width="400" src="https://github.com/WSU-DataScience/DSCI_210_R_notebooks/raw/main/img/interactive-aviz.png"> 

`method = 'grouped'` is useful for identifying small numbers of quality rules with various `RHS`:

In [None]:
top1000 <- rules_10pct %>% 
  head(1000, by = 'lift') 
plot(top1000, method = 'grouped')

In [None]:
plot(top1000, method = 'grouped', measure = 'lift', shading = 'support')