title: "Association Rule Mining"
toc: yes
toc_depth: 2
theme: readable
toc: yes
toc_depth: 3
date: "May 23, 2016"
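```{r setup, include = FALSE}
# my set-up
knitr::opts_chunk$set(cache = TRUE, warning = FALSE, message = FALSE,
                      fig.align = 'center')
options(scipen = 999)
# packages used by the chunks in this document
library(dplyr)
library(ggplot2)
library(arules)
library(arulesViz)
colleges = read.delim('./data/colleges.tsv',
                      sep = '\t',
                      header = TRUE)
```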
# Preparing the data
In project 3 we're working with a new data set from the
College Scorecard. According to the data documentation:
> The College Scorecard is designed to increase transparency, putting the power
> in the hands of the public — from those choosing colleges to those improving
> college quality — to see how well different schools are serving their students.
There are a lot of variables available to us in this data set, and we can use
association rule mining to discover trends and patterns.
```{r, eval = FALSE}
install.packages('arules', dependencies = TRUE)
colleges = read.delim('./data/colleges.tsv',
                      sep = '\t',
                      header = TRUE)
```
## Feature engineering
Feature engineering is the process of using domain knowledge or expertise to
construct new variables for use in data exploration and modeling. In association
rule mining, we're looking for associations between *categories* of features, so
the inputs all need to be discrete. Many of the variables in your data set are
already discrete, like `state` or `locale`, but we'll need some feature
engineering to turn other variables of interest into categories.
### `discretize`
The `arules` package provides a function that can help you create discrete
variables from continuous ones, called `discretize`:
```{r discretize}
colleges =
  colleges %>%
  mutate(cost_category = discretize(cost))
# the default results in only 3 bins of uneven size
colleges =
  colleges %>%
  mutate(cost_category = discretize(cost, method = 'frequency', categories = 4))
# that's better
```
`discretize` can break your variable up with several different methods, *e.g.*
equal interval length, equal frequency, clustering-based, and custom interval
discretization. Also you can specify `categories` to get as many or as few as
you like. See `?discretize` for details.
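
To see why equal-frequency bins come out more balanced than equal-width ones, here is a minimal base R sketch using `quantile()` and `cut()`. It only illustrates the idea; it is not how `arules` implements `discretize`, and the `cost` values below are made up for the example.

```r
# Made-up, right-skewed cost values (illustration only).
cost <- c(5, 6, 7, 8, 9, 10, 12, 15, 20, 30, 60, 120)

# Equal-width: split the observed range into 4 intervals of the same length.
equal_width <- cut(cost, breaks = 4, include.lowest = TRUE)

# Equal-frequency: use the quartiles of the data as the break points.
quartile_breaks <- quantile(cost, probs = seq(0, 1, by = 0.25), na.rm = TRUE)
equal_freq <- cut(cost, breaks = quartile_breaks, include.lowest = TRUE,
                  labels = c('Q1', 'Q2', 'Q3', 'Q4'))

# Compare how many observations land in each bin.
table(equal_width)
table(equal_freq)
```

With skewed data, the equal-width bins pile most observations into the first interval, while the quartile-based bins each hold the same number of observations. Heavily tied values can produce duplicate quartile breaks, which `cut()` rejects, so this sketch assumes the break points are distinct.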
```{r}
colleges$earnings_quartiles = discretize(colleges$median_earnings,
                                         method = 'frequency',
                                         categories = 4,
                                         labels = c('Q1', 'Q2', 'Q3', 'Q4'))
colleges$debt_quartiles = discretize(colleges$median_debt,
                                     method = 'frequency',
                                     categories = 4,
                                     labels = c('Q1', 'Q2', 'Q3', 'Q4'))
```
### SAT performance spread
Let's make a couple of new features based on the variables available. I'm curious
about the spread in student performance at schools, so to investigate that I
can compute the difference between the 75th and 25th percentile SAT scores
(the interquartile range) for each section of the test.
```{r}
# difference between Q3 and Q1.
colleges =
  colleges %>%
  mutate(sat_verbal_spread = discretize(sat_verbal_quartile_3 - sat_verbal_quartile_1,
                                        method = 'frequency',
                                        categories = 4,
                                        labels = c('Q1', 'Q2', 'Q3', 'Q4')),
         sat_math_spread = discretize(sat_math_quartile_3 - sat_math_quartile_1,
                                      method = 'frequency',
                                      categories = 4,
                                      labels = c('Q1', 'Q2', 'Q3', 'Q4')),
         sat_writing_spread = discretize(sat_writing_quartile_3 - sat_writing_quartile_1,
                                         method = 'frequency',
                                         categories = 4,
                                         labels = c('Q1', 'Q2', 'Q3', 'Q4')))
```
```{r, echo = FALSE, results = 'hide'}
# mean of median earnings within each SAT math spread quartile
med_earnings_avg =
  colleges %>%
  filter(!is.na(sat_math_spread)) %>%
  group_by(sat_math_spread) %>%
  summarize(mean = mean(median_earnings, na.rm = TRUE))

# histogram of median earnings per quartile, with the group mean as a dashed line
ggplot(na.omit(select(colleges, sat_math_spread, median_earnings)),
       aes(x = median_earnings)) +
  geom_histogram() +
  geom_vline(data = med_earnings_avg, aes(xintercept = mean),
             color = 'red', linetype = 2) +
  facet_grid(sat_math_spread ~ .)
```
### STEM schools
I have a feeling that there are interesting earnings patterns for schools where
students primarily study science, technology, engineering, and math (STEM)
fields. So, I can make a variable indicating whether a school has a large
proportion of STEM students.
```{r}
colleges =
  colleges %>%
  mutate(stem_perc = architecture_major_perc + comm_tech_major_perc +
           computer_science_major_perc + engineering_major_perc +
           eng_tech_major_perc + bio_science_major_perc,
         # flag schools where at least 30% of students are in STEM majors
         high_stem = stem_perc >= 0.30)
```
# Generating rules
Before we start association rule mining, we need to select out the columns to
mine. Remember, this algorithm works on discrete, categorical data so we need to
remove or convert numerical columns.
```{r}
college_features =
  colleges %>%
  select(state, locale, control, pred_deg, historically_black,
         men_only, women_only, religious, online_only,
         earnings_quartiles, debt_quartiles, high_stem, sat_verbal_spread,
         sat_math_spread, sat_writing_spread)
```
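
A quick way to check that no continuous columns slipped into a selection like this is to ask which columns are still numeric. The sketch below uses a small, made-up data frame (`toy_colleges` is hypothetical, not part of the project data) and only base R.

```r
# Hypothetical data frame mixing discrete and continuous columns.
toy_colleges <- data.frame(
  state     = factor(c('OR', 'WA', 'OR')),
  cost      = c(12000, 30500, 18250),
  religious = c(TRUE, FALSE, FALSE)
)

# Numeric columns still need to be discretized or dropped before mining.
is_continuous <- vapply(toy_colleges, is.numeric, logical(1))
names(toy_colleges)[is_continuous]

# Keep only the columns that are already discrete.
discrete_only <- toy_colleges[, !is_continuous, drop = FALSE]
names(discrete_only)
```

Running the same `vapply()` check on your own selection should flag nothing if every column is already discrete.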
Then we're going to load our data into a `transactions` object from the `arules`
package. Each row of `college_features` represents a 'transaction.'
```{r}
college_features = as(college_features, 'transactions')
```
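
For factor columns, each row becomes a set of items labeled `column=level`, and missing values contribute no item. The base R sketch below mimics that labeling on a made-up data frame (`toy_features` and `row_items` are hypothetical names); it is an illustration of the idea, not the `arules` implementation.

```r
# Hypothetical mini data set with factor columns only.
toy_features <- data.frame(
  control = factor(c('Public', 'Private nonprofit', 'Public')),
  locale  = factor(c('City', NA, 'Rural'))
)

# Build the item labels for one row: "column=level", skipping missing values.
row_items <- function(row) {
  present <- !is.na(row)
  paste0(names(row)[present], '=', row[present])
}

# One character vector of items per row (one 'transaction' per college).
items <- lapply(seq_len(nrow(toy_features)), function(i) {
  row_items(vapply(toy_features[i, , drop = FALSE], as.character, character(1)))
})
items
```

Labels of this form are what you later match against when filtering rules, for example `'earnings_quartiles=Q4'`.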
Once our data is a `transactions` object, we can inspect the itemsets, which are
just the characteristics corresponding to each college.
```{r, warning = FALSE}
# view the itemsets
inspect(head(college_features))
# summarize the transactions
summary(college_features)
# plot the most frequent items
itemFrequencyPlot(college_features, topN = 20, cex = 0.70)
```
The summary of the data set provides an overview of the most frequent items
along with other distributional details. Further, we can see which
characteristics are the most common in the data set with `itemFrequencyPlot()`.
Next, use the `apriori()` function to find all rules with a specified minimum
support and confidence, *e.g.* support = 0.01 and confidence = 0.6.
```{r}
# run the apriori algorithm
rules = apriori(college_features,
                parameter = list(sup = 0.01, conf = 0.6, target = 'rules'))
# print distribution information
summary(rules)
# view the rules
inspect(head(rules))
# sort the rules by lift
inspect(head(sort(rules, by = 'lift')))
# add coverage to the quality metrics and view them
quality(rules) = cbind(quality(rules), coverage = coverage(rules))
head(quality(rules))
```
The result of mining the colleges data is a set of associations between the
characteristics, expressed in the form of rules. We end up producing 930 such
rules.

For an overview of the rules, `summary()` can be used. It shows the number of
rules, the most frequent items contained in the left-hand side and the
right-hand side, their respective length distributions, and summary statistics
for the quality measures returned by the mining algorithm.
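
To make the quality measures concrete, here is a small base R sketch that computes support, coverage, confidence, and lift by hand for a single rule. The five transactions are made up for illustration, and `support()` is a hypothetical helper defined here, not an `arules` function.

```r
# Five made-up transactions, each a set of item labels.
toy_transactions <- list(
  c('control=Public', 'high_stem=TRUE', 'earnings_quartiles=Q4'),
  c('control=Public', 'earnings_quartiles=Q4'),
  c('control=Public', 'high_stem=TRUE', 'earnings_quartiles=Q4'),
  c('control=Private', 'high_stem=TRUE'),
  c('control=Private')
)

# Fraction of transactions that contain every item in `items`.
support <- function(items) {
  mean(vapply(toy_transactions,
              function(tr) all(items %in% tr),
              logical(1)))
}

# Rule under study: {high_stem=TRUE} => {earnings_quartiles=Q4}
lhs <- 'high_stem=TRUE'
rhs <- 'earnings_quartiles=Q4'

rule_support    <- support(c(lhs, rhs))            # P(lhs and rhs)
rule_coverage   <- support(lhs)                    # P(lhs)
rule_confidence <- rule_support / rule_coverage    # P(rhs | lhs)
rule_lift       <- rule_confidence / support(rhs)  # confidence relative to P(rhs)

c(support = rule_support, coverage = rule_coverage,
  confidence = rule_confidence, lift = rule_lift)
```

A lift above 1 means the right-hand side shows up more often among transactions matching the left-hand side than it does overall, which is why sorting or filtering by lift is a common first pass.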
## Increasing rule size with `maxlen`
By default, `apriori()` caps the number of items in a rule (see the `maxlen`
entry in `?APparameter` for the default in your installed version of `arules`),
but you can change that length constraint with `maxlen`.
```{r}
# allow rules up to size 10
long_rules = apriori(college_features,
                     parameter = list(sup = 0.001, conf = 0.6,
                                      target = 'rules', maxlen = 10))
inspect(head(sort(long_rules, by = 'lift'), 2))
```
## Investigate interesting patterns
It's common for association rule mining to produce a huge number of rules. The
`subset()` function can be used to narrow in on the results by filtering on
characteristics of interest.
For example, let's filter for rules whose right-hand side is
`earnings_quartiles=Q4` and whose lift is greater than 1, to see what is
associated with the highest earnings quartile.
```{r}
high_earners = subset(rules, subset = rhs %in% 'earnings_quartiles=Q4' & lift > 1)
inspect(head(high_earners, n = 5, by = 'lift'))
```
# Visualizing with `arulesViz`
The authors of `arules` followed on that package with `arulesViz`, which
provides lots of great pre-packaged visualizations for exploring your
association rules.
```{r, eval = FALSE}
# arules plot template
plot(x,
     method = NULL,
     measure = 'support',
     shading = 'lift',
     interactive = FALSE,
     data = NULL,
     control = ...)
```
* *x*: the set of rules to be visualized
* *method*: the visualization method
* *measure* and *shading*: the interest measures used by the plot
* *interactive*: indicates whether you want to explore the plot interactively
* *data*: can contain the transaction data set used to mine the rules
* *control*: a list with further control arguments to customize the plot
```{r, eval = FALSE}
install.packages('arulesViz', dependencies = TRUE)
# you might need to install Rgraphviz from this repository
```
We can start with a visualization of association rules as a scatter plot with
two measures on the axes. The default `plot()` for association rules
is a scatter plot using support and confidence on the axes. Lift is used as the
color of the points.
Another version of the scatter plot is called a two-key plot. Here support and
confidence are used for the x- and y-axes, and the color of the points indicates
the 'order,' *i.e.*, the number of items contained in the rule:
```{r}
plot(rules, shading = 'order')
```
We can filter to narrow in on rules of interest:
```{r, results = 'hide', fig.width = 10, fig.height = 10}
subrules = rules[quality(rules)$confidence > 0.9]
plot(subrules, method = 'grouped')
plot(subrules, method = 'grouped', control = list(k = 50))
```
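```{r, results = 'hide'}
# the first argument is assumed to be `subrules`, as in the neighboring chunks
plot(subrules,
     method = 'matrix',
     measure = 'lift')
# reorder based on
plot(subrules,
     method = 'matrix',
     measure = 'lift',
     control = list(reorder = TRUE))
plot(subrules,
     method = 'matrix3D',
     measure = 'lift',
     control = list(reorder = TRUE))
```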
```{r, fig.width = 10, fig.height = 10}
# plot a graph
plot(subrules, method = 'graph')
```
# Project 3, Part 1
## Feature engineering
1. Use feature engineering to construct 3 additional variables to include in
your association mining.
2. Plot your new features to view their distributions.
## `arules`
3. Mine for rules with data that includes your new features from 1. Filter the
rules for your features and describe the patterns you find. **Note:** you might
need to try different values of support and confidence to generate rules with
your newly created features.