# Module 8 Lab 1 - Frequent Pattern Mining and Association Rule Mining

This lab teaches you how to run association rule mining (ARM) to find possible associations in clinical data.  ARM can serve as a hypothesis generation tool or as an aid to feature selection.  You will almost always follow ARM with some other machine learning method.

Below we install the `mlxtend` Python package in your container.  `mlxtend` contains an implementation of the Apriori ARM algorithm, described in Rakesh Agrawal's [seminal paper](../resources/Mining-association-rules-between-sets-of-items-in-large-databases.pdf).  There are many other algorithms available, and Spark's `MLlib` contains more implementations, but `mlxtend` is nice and lightweight for illustrating the concepts.

In [1]:
import pandas as pd
import numpy as np
import sys
!{sys.executable} -m pip install mlxtend
import mlxtend


Collecting mlxtend
  Downloading https://files.pythonhosted.org/packages/2a/4f/11a257bc17f675691080219c6fe3525e49c7077535c3d64c0c2afc79cfc9/mlxtend-0.19.0-py2.py3-none-any.whl (1.3MB)
Collecting joblib>=0.13.2 (from mlxtend)
  Downloading https://files.pythonhosted.org/packages/3e/d5/0163eb0cfa0b673aa4fe1cd3ea9d8a81ea0f32e50807b0c295871e4aab2e/joblib-1.1.0-py2.py3-none-any.whl (306kB)
Installing collected packages: joblib, mlxtend
Successfully installed joblib-1.1.0 mlxtend-0.19.0


For this lab we will use the diabetes dataset from sklearn.  For ARM, we need the target value in the dataset as well.  The features in this dataset have already been scaled for us (including sex, which originally had just two possible values, 1 or 2), but the target has not, so we will scale the target ourselves.  (Refer to the source URL in the dataset description for additional information, including access to the unscaled data.)

In [2]:
import sklearn.datasets as d
from sklearn import preprocessing

db = d.load_diabetes()
target = preprocessing.scale(db.target)
data = pd.DataFrame(np.c_[db.data, target], columns = np.append(db.feature_names, ['target']))
print(db.DESCR)
display(data.head())

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - Age
      - Sex
      - Body mass index
      - Average blood pressure
      - S1
      - S2
      - S3
      - S4
      - S5
      - S6

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Brad

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646,-0.014719
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204,-1.001659
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593,-0.14458
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362,0.699513
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641,-0.222496


ARM requires that our data be categorical in nature, so we will need to bin the continuous values.  Refer to the supplemental lab covering scikit-learn preprocessing.  We will apply different techniques to different sets of columns.

When binning data for ARM, there is a trade-off: too few bins loses information, while too many bins produces a lot of features.  More features means more rules and longer processing time.  There is no rule of thumb, but the number of bins can be fine-tuned iteratively.

Below, we use generic data binning techniques to categorize the continuous values because these data have been normalized and we don't know the true values; however, this isn't always the best approach if you have raw data.  For example, age can be binned with predetermined age ranges using custom intervals (see lab 3).  That approach is preferred when raw values are available because it gives the categories a meaning that makes interpretation of the mined rules easier.
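
As a minimal sketch of binning with custom intervals, the cell below applies `pandas.cut` to a few raw ages.  The ages, cut points, and labels are hypothetical and chosen only for illustration; they are not taken from the diabetes dataset, which is already scaled.

```python
import pandas as pd

# Hypothetical raw ages in years (illustration only; not from the lab data).
raw_age = pd.Series([23, 37, 45, 58, 61, 79])

# Assumed cut points and labels; choose ranges that are meaningful for your study.
bins = [0, 30, 45, 60, 120]
labels = ['under_30', '30_to_44', '45_to_59', '60_plus']

# right=False makes each interval closed on the left: [0, 30), [30, 45), ...
age_group = pd.cut(raw_age, bins=bins, labels=labels, right=False)
print(age_group.tolist())
```

Named ranges like these carry through to the item names, so a mined rule reads in terms of an age range rather than an arbitrary bin number.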

For data that is already categorical, custom `apply` functions can be used to condense the already categorical data into smaller bins to cut down on the number of features and rules generated.   

In [3]:
binary_col = ['sex']
quant_col = ['age', 'bmi']
kmeans_col = ['bp', 's1', 's2', 's3', 's4', 's5', 's6', 'target']

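# pass threshold by keyword; recent scikit-learn releases reject it as a positional argument
binary_data = preprocessing.Binarizer(threshold=0).fit_transform(data[binary_col].values.reshape(-1,1))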
quant_data = preprocessing.KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile').fit_transform(data[quant_col])
kmeans_data = preprocessing.KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='kmeans').fit_transform(data[kmeans_col])

# join back into a single dataframe.  hstack the individual numpy arrays together.  force datatype to be int
data = pd.DataFrame(np.hstack((binary_data, quant_data, kmeans_data)), columns = binary_col+quant_col+kmeans_col, dtype=int)
display(data.head())

Unnamed: 0,sex,age,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,1,3,4,2,1,1,0,1,2,2,1
1,0,2,0,1,2,1,3,0,0,0,0
2,1,4,4,2,1,1,1,1,2,1,1
3,0,0,2,1,2,2,0,2,2,2,2
4,0,2,1,2,2,2,2,1,1,1,1


## Frequent itemsets
There are two concepts to understand in ARM.  The first is the idea of frequent itemsets.  Given a transaction (in our case a patient), and a list of items (in our case the attributes of that patient), frequent itemsets will identify the sets of items that appear most frequently in the data.  The frequent itemsets can be returned along with their _support_ level, or returned only if above a threshold _support_.  Support is simply the proportion of records that contain the items.

We can better understand how ARM works with a "shopping cart" analogy.  Imagine a shopping cart that has eggs, beer, and bread in it.  This can be represented as an array `['eggs', 'beer', 'bread']`.  Now imagine a different shopper's cart with eggs, diapers, and wine: `['eggs', 'diapers', 'wine']`.  In this data, `eggs` has a support value of 1, because it appears in 100% of the shopping carts.  `beer`, `bread`, `diapers`, and `wine` each have a support of 0.5, because each appears in 50% of the carts.  The data tell us that the possible items that can be in a shopping cart are `eggs`, `beer`, `wine`, `bread`, and `diapers`.
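
The support values in the shopping cart example can be checked by hand.  The cell below is a small sketch in plain Python; the helper name `support` is our own and is not part of `mlxtend`.

```python
# The two shopping carts from the example above.
carts = [
    ['eggs', 'beer', 'bread'],
    ['eggs', 'diapers', 'wine'],
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return hits / len(transactions)

# Support of each individual item.
items = sorted({item for cart in carts for item in cart})
for item in items:
    print(item, support([item], carts))

# Support of a two-item itemset: eggs and beer appear together in one of two carts.
print(support(['eggs', 'beer'], carts))  # 0.5
```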

With our example diabetes dataset, our possible items will consist of one item for each feature and value combination.  For example, we binned the age column into five categories.  We now have five age items: `age-0, age-1, age-2, age-3, age-4`.  Similarly for the other columns, we will have one possible item for each feature and value combination.  Each patient will have one of each of these binned feature items in their "transaction" (i.e. one of the `age` items, one of the `bmi` items, etc.)

The mlxtend library requires one-hot encoding of all possible items in a transaction, so we will apply the pandas method for one-hot encoding our columns, called `get_dummies`.  This method works on already created categorical data. 

**NOTE:** ARM requires a _dense_ one-hot encoding, unlike regression models, which require a k-1 one-hot encoding.  This means we do not drop any of the generated columns.
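
To illustrate the difference, the sketch below encodes a tiny, made-up binned column both ways with `get_dummies`.  The `drop_first=True` option produces the k-1 encoding; the toy values are hypothetical.

```python
import pandas as pd

# A made-up binned column with three categories (illustration only).
binned = pd.DataFrame({'age': [0, 1, 2, 1]})

# Dense encoding: one column per category, which is what ARM needs.
dense = pd.get_dummies(binned, columns=['age'])

# k-1 encoding: the first category is dropped, as is common for regression.
reduced = pd.get_dummies(binned, columns=['age'], drop_first=True)

print(list(dense.columns))    # ['age_0', 'age_1', 'age_2']
print(list(reduced.columns))  # ['age_1', 'age_2']
```

With the k-1 encoding, the dropped category (`age_0` here) could never appear as an item in a frequent itemset or rule, which is why we keep every column for ARM.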

In [4]:
data = pd.get_dummies(data, columns = data.columns)
display(data.head())

Unnamed: 0,sex_0,sex_1,age_0,age_1,age_2,age_3,age_4,bmi_0,bmi_1,bmi_2,...,s6_0,s6_1,s6_2,s6_3,s6_4,target_0,target_1,target_2,target_3,target_4
0,0,1,0,0,0,1,0,0,0,0,...,0,0,1,0,0,0,1,0,0,0
1,1,0,0,0,1,0,0,1,0,0,...,1,0,0,0,0,1,0,0,0,0
2,0,1,0,0,0,0,1,0,0,0,...,0,1,0,0,0,0,1,0,0,0
3,1,0,1,0,0,0,0,0,0,1,...,0,0,1,0,0,0,0,1,0,0
4,1,0,0,0,1,0,0,0,1,0,...,0,1,0,0,0,0,1,0,0,0


Here you can see that whereas we started with just 11 features, we now have 52 after the binning and one-hot encoding.  We can now proceed to get the frequent itemsets.  We will look for a minimum support of 0.05.  In other words, we are looking for itemsets that appear in 5% or more of the rows in our data.

In [5]:
from mlxtend.frequent_patterns import apriori

frequent_itemsets = apriori(data, min_support=0.05, use_colnames=True)
display(frequent_itemsets)


Unnamed: 0,support,itemsets
0,0.531674,(sex_0)
1,0.468326,(sex_1)
2,0.192308,(age_0)
3,0.205882,(age_1)
4,0.183258,(age_2)
...,...,...
842,0.058824,"(s1_1, s3_1, s4_1, s2_1)"
843,0.061086,"(s4_0, s1_1, s2_1, s3_2)"
844,0.058824,"(s4_0, s2_1, s1_2, s3_3)"
845,0.054299,"(s4_0, s2_1, target_0, s1_2)"


## Extracting information from frequent itemsets
Since we are using ARM to further our goal of creating a predictive model, what we are really interested in are itemsets that contain the target feature.  Let's see which frequent itemsets contain the target feature.  Each entry in the `itemsets` column of our resulting dataframe is a `frozenset` (see the documentation for [apriori](http://rasbt.github.io/mlxtend/api_subpackages/mlxtend.frequent_patterns/)).

We use a list comprehension to dynamically build the list of target column names.  This saves us from having to maintain a hand-built list if we decide to change the number of bins we use for target.

We then build a mask using the [intersection](https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset) method of `frozenset`.  Finally, we get the rows where the mask is true.

In [6]:
target_names = [x for x in data.columns if 'target' in x]
print(target_names)
mask = [True if x.intersection(target_names) else False for x in frequent_itemsets.itemsets.values]
frequent_itemsets[mask]

['target_0', 'target_1', 'target_2', 'target_3', 'target_4']


Unnamed: 0,support,itemsets
43,0.328054,(target_0)
44,0.235294,(target_1)
45,0.185520,(target_2)
46,0.151584,(target_3)
47,0.099548,(target_4)
...,...,...
835,0.067873,"(s4_0, s5_1, target_0, sex_0)"
836,0.054299,"(s4_0, s6_1, target_0, sex_0)"
840,0.052036,"(s4_0, s2_0, bmi_0, target_0)"
841,0.054299,"(s4_0, s5_0, bmi_0, target_0)"


Frequent itemsets that contain just one or a few items aren't that interesting.  We can tell from the output above that about 32.8% of the rows contain target_0, but we don't need ARM to give us class distributions.  So let's look at more interesting frequent itemsets, say those with 4 or more items.  We slightly modify our mask to include the length of the `frozenset` as well.

In [7]:
mask = [True if x.intersection(target_names) and len(x) >= 4 else False for x in frequent_itemsets.itemsets.values]
frequent_itemsets[mask]

Unnamed: 0,support,itemsets
820,0.076923,"(s4_0, bmi_0, target_0, sex_0)"
821,0.054299,"(s4_0, bp_1, target_0, sex_0)"
824,0.056561,"(s4_0, s1_1, target_0, sex_0)"
827,0.056561,"(s4_0, s2_0, target_0, sex_0)"
831,0.061086,"(s4_0, s2_1, target_0, sex_0)"
833,0.067873,"(s4_0, s3_3, target_0, sex_0)"
834,0.061086,"(s4_0, s5_0, target_0, sex_0)"
835,0.067873,"(s4_0, s5_1, target_0, sex_0)"
836,0.054299,"(s4_0, s6_1, target_0, sex_0)"
840,0.052036,"(s4_0, s2_0, bmi_0, target_0)"


From this list we can see that s4_0 appears with target_0 frequently, along with a few other features.  Those could be an indication of a relationship.  We can explore this further by mining for association rules.

## Association Rules
Association rules take the form of `{A1..An} -> {C1..Cm}` where each `Ai` is an _antecedent_ item and each `Cj` is a _consequent_ item.  Interpret the rule thus: transactions that contain the _antecedent_ set `{A1..An}` also tend to contain the _consequent_ set `{C1..Cm}`.  This defines an _association_ between the antecedents and consequents.

To generate association rules, we simply need to pass a dataframe of frequent itemsets, along with a metric and the metric threshold, to the [association_rules](http://rasbt.github.io/mlxtend/api_subpackages/mlxtend.frequent_patterns/) function.

The metrics that are defined are:
* **support**: The support for the union of the antecedent set and the consequent set, i.e. the proportion of transactions that contain every item in the rule.
* **confidence**: The support for the rule divided by the support for the antecedent set.  This is the proportion of transactions containing the antecedent that also contain the consequent.
* **lift**: The ratio of the rule's confidence to the unconditional probability (support) of the consequent.  It is a measure of dependence between the antecedent and consequent.  A lift of 1 indicates independence, lift > 1 suggests a positive association, and lift < 1 suggests a negative association.
* **leverage**: The difference between how often the antecedent and consequent appear together and how often they would appear together if they were independent.  A leverage of 0 indicates independence.
* **conviction**: The ratio of the frequency with which the consequent is absent (1 - consequent support) to the observed frequency of incorrect predictions (1 - confidence).  A value of 1 indicates independence; a value above 1 indicates how many times more often the rule would be incorrect if the antecedent and consequent were associated purely by random chance.  The value ranges from 0 to infinity, and is infinite when the confidence is 1.
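
The metric definitions above can also be written compactly.  In this sketch of the standard definitions, $A$ is the antecedent set, $C$ is the consequent set, and $\mathrm{supp}(X)$ is the proportion of transactions that contain every item in $X$; the notation is ours and not taken from the `mlxtend` source.

$$
\begin{aligned}
\mathrm{support}(A \rightarrow C) &= \mathrm{supp}(A \cup C) \\
\mathrm{confidence}(A \rightarrow C) &= \frac{\mathrm{supp}(A \cup C)}{\mathrm{supp}(A)} \\
\mathrm{lift}(A \rightarrow C) &= \frac{\mathrm{confidence}(A \rightarrow C)}{\mathrm{supp}(C)} \\
\mathrm{leverage}(A \rightarrow C) &= \mathrm{supp}(A \cup C) - \mathrm{supp}(A)\,\mathrm{supp}(C) \\
\mathrm{conviction}(A \rightarrow C) &= \frac{1 - \mathrm{supp}(C)}{1 - \mathrm{confidence}(A \rightarrow C)}
\end{aligned}
$$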

The choice of which metric to use to analyze the association rules depends largely on what you're looking for; however, lift, leverage, and conviction are useful for the purposes of finding relationships in the data that could lead to a trainable predictive model.  Support and confidence give you some idea of the prevalence of the associations.  We will use the confidence and conviction metrics below to filter our rules.

In [8]:
from mlxtend.frequent_patterns import association_rules

association_rules(frequent_itemsets, metric="confidence", min_threshold=0.8)


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(bp_0),(sex_0),0.131222,0.531674,0.106335,0.810345,1.524138,0.036568,2.469354
1,(s3_3),(sex_0),0.147059,0.531674,0.126697,0.861538,1.620426,0.048509,3.382353
2,(s1_0),(s2_0),0.106335,0.185520,0.104072,0.978723,5.275558,0.084345,38.280543
3,(s1_0),(s4_0),0.106335,0.382353,0.090498,0.851064,2.225859,0.049840,4.147059
4,(s2_0),(s4_0),0.185520,0.382353,0.153846,0.829268,2.168856,0.082912,3.617647
...,...,...,...,...,...,...,...,...,...
112,"(s1_2, s2_1, s3_3)",(s4_0),0.058824,0.382353,0.058824,1.000000,2.615385,0.036332,inf
113,"(s2_1, s3_3)","(s4_0, s1_2)",0.067873,0.113122,0.058824,0.866667,7.661333,0.051146,6.651584
114,"(s1_2, s3_3)","(s4_0, s2_1)",0.065611,0.185520,0.058824,0.896552,4.832632,0.046651,7.873303
115,"(s4_0, s1_2, target_0)",(s2_1),0.065611,0.404977,0.054299,0.827586,2.043537,0.027728,3.451131


In [9]:
rules = association_rules(frequent_itemsets, metric="conviction", min_threshold=2)
display(rules)


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(bp_0),(sex_0),0.131222,0.531674,0.106335,0.810345,1.524138,0.036568,2.469354
1,(s3_3),(sex_0),0.147059,0.531674,0.126697,0.861538,1.620426,0.048509,3.382353
2,(s3_0),(sex_1),0.251131,0.468326,0.187783,0.747748,1.596640,0.070172,2.107708
3,(bmi_0),(s4_0),0.201357,0.382353,0.158371,0.786517,2.057044,0.081381,2.893189
4,(target_4),(bmi_4),0.099548,0.203620,0.063348,0.636364,3.125253,0.043079,2.190045
...,...,...,...,...,...,...,...,...,...
225,"(s4_2, s3_0, s1_2)",(s2_2),0.076923,0.312217,0.052036,0.676471,2.166667,0.028019,2.125874
226,"(s4_2, s1_2, s2_2)",(s3_0),0.079186,0.251131,0.052036,0.657143,2.616731,0.032150,2.184201
227,"(s4_2, s3_0, s2_2)",(s1_2),0.076923,0.352941,0.052036,0.676471,1.916667,0.024887,2.000000
228,"(s3_0, s1_2, s2_2)",(s4_2),0.058824,0.264706,0.052036,0.884615,3.341880,0.036465,6.372549


## Interpretation of the rules
We got a fairly large list of rules.  Let's examine one of the rules:

| antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (bp_0) | (sex_0) | 0.131222 | 0.531674 | 0.106335 | 0.810345 | 1.524138 | 0.036568 | 2.469354 |

The rule can be phrased like so: a bp of class 0 is associated with a sex of class 0.  The antecedent support tells us that about 13% of the records (patients) have a bp of class 0, and the consequent support tells us that about 53% of the records have a sex of class 0.  The support for the rule is about 11%, which means that this combination occurs in about 11% of the records.

So, these give us some idea of how frequent this data is.

Now let's examine the remaining metrics:

* **confidence** - The confidence for this rule is about 81%.  This tells us that, of all the records that have a bp of class 0 (13% of the records), 81% also have a sex of class 0.

* **lift** - The lift is 1.52.  This means that records with the antecedent bp of class 0 are about 1.52 times as likely to have the consequent sex of class 0 as records overall (more likely to appear together than by chance alone).

* **leverage** - Leverage is about 3.6%.  This means that we see sex 0 and bp 0 together in about 3.6 percentage points more of the records than we would expect if they were independent.

* **conviction** - Conviction is 2.47.  This means that the rule would be wrong about 2.47 times as often (147% more often) if bp_0 and sex_0 were associated purely by random chance.
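
As a sanity check, the metrics for this rule can be recomputed from its three support values alone.  The sketch below uses the standard definitions; the helper `rule_metrics` is our own and is not part of `mlxtend`, and small differences from the table come from rounding in the displayed supports.

```python
def rule_metrics(supp_a, supp_c, supp_ac):
    """Compute rule metrics from antecedent, consequent, and joint support."""
    confidence = supp_ac / supp_a
    lift = confidence / supp_c
    leverage = supp_ac - supp_a * supp_c
    # Conviction is infinite when the rule is never wrong (confidence of 1).
    if confidence == 1:
        conviction = float('inf')
    else:
        conviction = (1 - supp_c) / (1 - confidence)
    return {
        'confidence': confidence,
        'lift': lift,
        'leverage': leverage,
        'conviction': conviction,
    }

# Support values for bp_0 -> sex_0, copied from the rule table above.
metrics = rule_metrics(supp_a=0.131222, supp_c=0.531674, supp_ac=0.106335)
for name, value in metrics.items():
    print(f'{name}: {value:.3f}')
```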

In the following lab, you will use these metrics to whittle down a large set of association rules.