# Lab 2: Association Rules

Learn how to:

- Read transaction data into R
- Inspect and visualize transaction data
- Find association rules
- Inspect the rules we find

We will use the `arules` package. You need to install and load the package before using it.

In [None]:
# Load the arules package in R

# install.packages("arules") or install `r-arules` in Anaconda.
# Try loading the library first; it may already be installed.
library(arules)

# Data Format

We cannot load transaction data using the traditional `read.csv`:

Is the following output computable?

In [None]:
# Read the raw file as an ordinary data frame
temp <- read.csv("coffeeshop.csv", header = FALSE)
temp

# Data Formats for Transactions

The data shown above is in what we call the "item list" format, or "shopping basket" format.

It is human-readable, just like the way you shop in a store, but it is not helpful in a computational sense.

Let's see how we can do better.

In [None]:
coffee_data = read.transactions("coffeeshop.csv", format = "basket", sep = ",", rm.duplicates = TRUE)

 - `format = "basket"` specifies how the csv file is formatted: each line is one transaction.
 - `sep = ","` is used because the csv file is comma-separated.
 - `rm.duplicates = TRUE` removes duplicate items within a single transaction. (We usually do this.)
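To see these arguments in action without relying on `coffeeshop.csv`, the following sketch writes three made-up baskets to a temporary file and reads them back. The baskets and the names `basket_lines`, `toy_file`, and `toy_data` are assumptions for illustration only; they are not the lab's data.

```r
library(arules)

# Hypothetical baskets, one transaction per line (NOT the contents of coffeeshop.csv).
# The second basket lists "tea" twice on purpose.
basket_lines <- c("coffee,bagel", "tea,cookie,tea", "coffee,tea")

toy_file <- tempfile(fileext = ".csv")
writeLines(basket_lines, toy_file)

# rm.duplicates = TRUE keeps each item at most once per transaction
toy_data <- read.transactions(toy_file, format = "basket", sep = ",",
                              rm.duplicates = TRUE)

inspect(toy_data)  # one itemset per row
size(toy_data)     # number of distinct items in each transaction
```

Because the duplicated `tea` is dropped, every toy basket ends up with two distinct items.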

`inspect()` can be used to see all transactions.

Let's see the new data representation:

In [None]:
inspect(coffee_data)

It still looks like an item list, but now it is formatted in a way that can be computed.

Now R understands each row as an itemset.

Each row is now a set (in the mathematical sense). For example, you can easily list the unique items:

In [None]:
itemInfo(coffee_data)

To learn how many items are in each transaction, we can use `size()`:

In [None]:
size(coffee_data)

## Support

To find the support (relative frequency) of each unique item:

In [None]:
itemFrequency(coffee_data)

In [None]:
itemFrequency(coffee_data, type = "absolute") # or get the support count
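The two calls above are related by a simple division: support is the support count divided by the number of transactions. A minimal sketch on a made-up dataset (the object `toy` and its four baskets are assumptions, not the coffee shop data) checks this by hand:

```r
library(arules)

# Hypothetical toy transactions (not the coffee shop data)
toy <- as(list(
  c("coffee", "bagel"),
  c("tea", "cookie"),
  c("coffee", "tea"),
  c("coffee", "tea", "cookie")
), "transactions")

rel_support <- itemFrequency(toy)                     # fraction of transactions
abs_support <- itemFrequency(toy, type = "absolute")  # number of transactions

# Support = support count / number of transactions
manual <- abs_support / length(toy)

print(cbind(count = abs_support, support = rel_support, manual = manual))
```

The `support` and `manual` columns should agree for every item.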

## Visualizing Support

To plot the support and get a quick glance at all 1-itemsets:

In [None]:
itemFrequencyPlot(coffee_data, ylim = c(0, 1), main = 
                    "Support %", col = "steelblue3")

We can also order the items by support %. With `topN`, the plot shows only the N most frequent items, where N is a number of your choice; this is useful if you have a large dataset.

In [None]:
itemFrequencyPlot(coffee_data, ylim = c(0, 1), main = "Support %", col = "steelblue3", topN = 5)

Rotate the graph to be horizontal.

In [None]:
itemFrequencyPlot(coffee_data, main = "Support %", col = "steelblue3", topN = 5, horiz = TRUE, xlim = c(0,1))

## Visualization of entire dataset

On the horizontal axis, you have the individual items; each column tells us in which
transactions the corresponding item appears. On the vertical axis, you have the transactions;
each row tells us which items are included in the corresponding transaction.

In [None]:
itemLabels <- c("bagel", "chocolate", "coffee", "cookie", "tea") # can you think of a better way when you have thousands of items? Hint: itemInfo
image(coffee_data, xlab = itemLabels)

In [None]:
itemInfo(coffee_data)

In [None]:
itemLabels <- itemInfo(coffee_data)$labels # automated
image(coffee_data, xlab = itemLabels)

This is also called "binary matrix" format (as opposed to "item list"). In the matrix, the dark areas are `1`s, and light areas are `0`s.

Let's try loading the dataset in binary matrix format.

In [None]:
coffee_binary = read.csv("coffeeshop_binary.csv")
coffee_binary[c("bagel", "chocolate", "coffee", "cookie", "tea")] # reorder the columns to compare with the figure above

However, we are not done yet. We need to let R understand that this is transaction data. First, transform the data frame into a matrix; then coerce the matrix to the `transactions` class.

In [None]:
coffee_matrix = as.matrix(coffee_binary)
coffee_matrix # note the difference shown in the header of the output

In [None]:
coffee_transaction = as(coffee_matrix, "transactions")
coffee_transaction

In [None]:
inspect(coffee_transaction)
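The conversion also works in the other direction, which is a handy way to confirm that nothing was lost. The sketch below builds a small 0/1 matrix by hand (the matrix `m` and its column names are made up for illustration), coerces it to `transactions`, and converts it back:

```r
library(arules)

# Hypothetical 3 x 3 binary matrix: rows are transactions, columns are items
m <- matrix(c(1, 0, 1,
              0, 1, 1,
              1, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("bagel", "coffee", "tea")))

toy_tr <- as(m, "transactions")  # binary matrix -> transactions
inspect(toy_tr)

back <- as(toy_tr, "matrix")     # transactions -> logical matrix
print(back)
```

Each `TRUE` in `back` corresponds to a `1` in `m`.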

## Subsetting data

What do coffee buyers also buy?

The operator `%in%` will look for transactions that contain the item specified. If more than one item is specified, it will look for transactions that contain any of the items listed.

In [None]:
temp <- subset(coffee_data, items %in% c("coffee"))
inspect(temp)

Transactions with either coffee or tea, or both.

In [None]:
temp <- subset(coffee_data, items %in% c("coffee", "tea"))
inspect(temp)

If we want the transactions that contain **ALL** the items listed, we need to use the operator `%ain%`.

Transactions with coffee **AND** tea:

In [None]:
temp <- subset(coffee_data, items %ain% c("coffee", "tea"))
inspect(temp) 
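The difference between "any" and "all" matching is easiest to see by counting the matches side by side. This is a small sketch on made-up transactions (the object `toy` is an assumption, not the coffee shop data):

```r
library(arules)

# Hypothetical toy transactions (not the coffee shop data)
toy <- as(list(
  c("coffee", "bagel"),
  c("tea", "cookie"),
  c("coffee", "tea"),
  c("coffee", "tea", "cookie")
), "transactions")

# %in%: transactions containing coffee OR tea (or both)
any_match <- subset(toy, items %in% c("coffee", "tea"))

# %ain%: transactions containing coffee AND tea
all_match <- subset(toy, items %ain% c("coffee", "tea"))

length(any_match)  # every toy transaction has at least one of the two
length(all_match)  # only the transactions with both items
inspect(all_match)
```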

## Contingency Table with `crossTable`

What are the two-item pairs that are most likely to be purchased together (co-occur)?

You can get the support, support count, or lift of these item pairs. However, confidence is not supported.

In [None]:
crossTable(coffee_data, sort = TRUE, measure = "support")

In [None]:
crossTable(coffee_data, sort = TRUE, measure = "count")

In [None]:
crossTable(coffee_data, sort = TRUE, measure = "probability")

In [None]:
crossTable(coffee_data, sort = TRUE, measure = "lift")
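For a pair of items A and B, lift is the support of the pair divided by the product of the two individual supports: `lift(A, B) = supp(A, B) / (supp(A) * supp(B))`. As a sanity check, the sketch below recomputes one cell of the lift table by hand on made-up data (the object `toy` is an assumption, not the coffee shop data):

```r
library(arules)

# Hypothetical toy transactions (not the coffee shop data)
toy <- as(list(
  c("coffee", "bagel"),
  c("tea", "cookie"),
  c("coffee", "tea"),
  c("coffee", "tea", "cookie")
), "transactions")

n <- length(toy)
supp_item <- itemFrequency(toy)

# Support of the pair {coffee, tea}: share of transactions containing both
supp_both <- length(subset(toy, items %ain% c("coffee", "tea"))) / n

# Lift by hand
lift_manual <- supp_both / (supp_item[["coffee"]] * supp_item[["tea"]])

# Lift as reported by crossTable()
lift_table <- crossTable(toy, measure = "lift")

lift_manual
lift_table["coffee", "tea"]
```

A lift below 1 means the two items co-occur less often than expected if they were bought independently; a lift above 1 means they co-occur more often.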

# Mining Itemsets

Find all the itemsets above a certain support threshold, here using the Eclat algorithm via `eclat()`:

In [None]:
coffee_itm <- eclat(coffee_data, parameter = list(support = 0.5))

In [None]:
inspect(coffee_itm)

## Apriori

The `apriori()` function can be used to find frequent itemsets and association rules based on the Apriori algorithm.

In [None]:
frequent <- apriori(coffee_data, parameter = list(supp = 0.5, target = "frequent"))
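Eclat and Apriori search differently, but for the same support threshold they should report the same frequent itemsets. A minimal sketch on made-up data (the object `toy` is an assumption; `control = list(verbose = FALSE)` only silences the progress output):

```r
library(arules)

# Hypothetical toy transactions (not the coffee shop data)
toy <- as(list(
  c("coffee", "bagel"),
  c("tea", "cookie"),
  c("coffee", "tea"),
  c("coffee", "tea", "cookie")
), "transactions")

sets_eclat <- eclat(toy,
                    parameter = list(support = 0.5),
                    control = list(verbose = FALSE))

sets_apriori <- apriori(toy,
                        parameter = list(support = 0.5,
                                         target = "frequent itemsets"),
                        control = list(verbose = FALSE))

inspect(sort(sets_eclat, by = "support"))

# Both algorithms should return the same collection of itemsets
setequal(labels(sets_eclat), labels(sets_apriori))
```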

## Inspecting Mined Association Rules

`summary()` shows:

- the most frequent items
- how many frequent itemsets of each size were found
- summary statistics for support and count
- in the last lines, summary information about the mining run: the data used (the coffee data, with a total of 4 transactions) and the parameters used in the analysis

In [None]:
summary(frequent)

## Further narrowing down with Confidence

Only look for association rules with `supp >= 0.5` and `conf >= 0.8`. (Note: in the code you need to use `=`; the value you give is the minimum threshold.)

Also, the values of support and confidence must be enclosed in a `list()` passed as `parameter`.

In [None]:
rules <- apriori(coffee_data, parameter = list(supp = 0.5, conf = 0.8, target = "rules"))

In [None]:
inspect(rules)
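The quality columns printed by `inspect()` can be recomputed by hand: `confidence(X => Y) = supp(X and Y) / supp(X)` and `lift(X => Y) = confidence(X => Y) / supp(Y)`. The sketch below does this on made-up data (the object `toy` is an assumption, not the coffee shop data). On this toy data the only rule that passes both thresholds is `{cookie} => {tea}`.

```r
library(arules)

# Hypothetical toy transactions (not the coffee shop data)
toy <- as(list(
  c("coffee", "bagel"),
  c("tea", "cookie"),
  c("coffee", "tea"),
  c("coffee", "tea", "cookie")
), "transactions")

rules_toy <- apriori(toy,
                     parameter = list(supp = 0.5, conf = 0.8, target = "rules"),
                     control = list(verbose = FALSE))
inspect(rules_toy)

# Recompute confidence and lift for {cookie} => {tea} by hand
supp_lhs  <- itemFrequency(toy)[["cookie"]]
supp_rhs  <- itemFrequency(toy)[["tea"]]
supp_rule <- length(subset(toy, items %ain% c("cookie", "tea"))) / length(toy)

conf_manual <- supp_rule / supp_lhs
lift_manual <- conf_manual / supp_rhs

c(confidence = conf_manual, lift = lift_manual)
```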

We can also specify the minimum number of items (`minlen`) that should be included in an association rule.

`minlen = 3` means we are only looking for association rules that include at least 3 items, counting both sides of the rule.

In [None]:
rules2 <- apriori(coffee_data, parameter = list(supp = 0.5, conf = 0.8, target = "rules", minlen = 3))

In [None]:
inspect(rules2)
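To see what `minlen` filters out, it helps to compare the rule sizes with and without it. This sketch uses made-up data and lower thresholds than the lab (`supp = 0.25`, `conf = 0.5`) so that some 3-item rules exist; both the object `toy` and these thresholds are assumptions for illustration.

```r
library(arules)

# Hypothetical toy transactions (not the coffee shop data)
toy <- as(list(
  c("coffee", "bagel"),
  c("tea", "cookie"),
  c("coffee", "tea"),
  c("coffee", "tea", "cookie")
), "transactions")

params <- list(supp = 0.25, conf = 0.5, target = "rules")

all_rules <- apriori(toy, parameter = params,
                     control = list(verbose = FALSE))

long_rules <- apriori(toy, parameter = c(params, list(minlen = 3)),
                      control = list(verbose = FALSE))

# size() counts the items on both sides of each rule
table(size(all_rules))
table(size(long_rules))
inspect(long_rules)
```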

## Post-process the rules

The `inspect()` function can be used together with `sort()` to sort the rules by support, confidence, or lift.

In [None]:
inspect(sort(rules, by = "support"))

We can specify whether we would like a certain item to appear in the antecedent (lhs) or the consequent (rhs):

In [None]:
inspect(subset(rules, lhs %in% "tea")) #lhs = left hand side

In [None]:
inspect(subset(rules, rhs %in% "tea"))
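Sorting and subsetting can be combined, and `subset()` also accepts conditions on the quality measures. A minimal sketch on made-up data (the object `toy` and the thresholds are assumptions, not the lab's settings):

```r
library(arules)

# Hypothetical toy transactions (not the coffee shop data)
toy <- as(list(
  c("coffee", "bagel"),
  c("tea", "cookie"),
  c("coffee", "tea"),
  c("coffee", "tea", "cookie")
), "transactions")

rules_toy <- apriori(toy,
                     parameter = list(supp = 0.25, conf = 0.5,
                                      target = "rules", minlen = 2),
                     control = list(verbose = FALSE))

# Sort by lift (highest first) and look at the top three rules
by_lift <- sort(rules_toy, by = "lift")
inspect(head(by_lift, n = 3))

# Keep only rules that predict tea AND have lift above 1
tea_rhs <- subset(rules_toy, rhs %in% "tea" & lift > 1)
inspect(tea_rhs)
```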

# Visualization for Association Rules

A useful library for association rule reporting is `arulesViz`.

In [None]:
# install.packages("arulesViz") or install in Anaconda
library(arulesViz)

Use `inspectDT()` to see the rules in an interactive HTML table.

In [None]:
inspectDT(rules) # may not work on Safari due to a bug in R.

In [None]:
plot(rules, method = "graph") # Plot rules as a Graph

How to read this graph?

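Start from each node (circle); each circle represents one rule. The links coming into a circle are from the items on the rule's left-hand side, and the link going out of the circle points to the item on its right-hand side. By default, the size of the circle reflects the rule's support and its color reflects the rule's lift.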

In [None]:
plot(rules, method = "graph", edgeCol = "#5E5E5EFF") # change the color for the edge

If you have a large dataset and a graph becomes infeasible,
we can draw a scatterplot of the rules instead, where the color changes based
on a chosen measure:

In [None]:
plot(rules, measure = c("support", "lift"), shading = "confidence")

# Next: Hands on Exercise

Questions? Open the `exercise.ipynb`.