# Module 8 Lab 2 - Filtering association rules

One of the main challenges with using association rule mining is dealing with the potentially immense set of rules that are generated.  In many cases, it is easy to generate millions of association rules, which is both unwieldy and not useful.

In this lab you will learn how to filter association rules so that you can make sense of them and put them to further use.

In [1]:
import pandas as pd
import numpy as np
import sys
!{sys.executable} -m pip install mlxtend
import mlxtend




## Load the data
We will continue to use the sample diabetes data set to generate our rules.

In [2]:
import sklearn.datasets as d
from sklearn import preprocessing

db = d.load_diabetes()
target = preprocessing.scale(db.target)
data = pd.DataFrame(np.c_[db.data, target], columns = np.append(db.feature_names, ['target']))
display(data.head())

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646,-0.014719
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204,-1.001659
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593,-0.14458
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362,0.699513
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641,-0.222496


## Preprocess for ARM
We will apply the same preprocessing as in lab 1.

In [3]:
binary_col = ['sex']
quant_col = ['age', 'bmi']
kmeans_col = ['bp', 's1', 's2', 's3', 's4', 's5', 's6', 'target']

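# threshold is passed by keyword; recent scikit-learn versions reject it as a positional argument
binary_data = preprocessing.Binarizer(threshold=0).fit_transform(data[binary_col].values.reshape(-1,1))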
quant_data = preprocessing.KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile').fit_transform(data[quant_col])
kmeans_data = preprocessing.KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='kmeans').fit_transform(data[kmeans_col])

# join back into a single dataframe.  hstack the individual numpy arrays together.  force datatype to be int
data = pd.DataFrame(np.hstack((binary_data, quant_data, kmeans_data)), columns = binary_col+quant_col+kmeans_col, dtype=int)
display(data.head())

Unnamed: 0,sex,age,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,1,3,4,2,1,1,0,1,2,2,1
1,0,2,0,1,2,1,3,0,0,0,0
2,1,4,4,2,1,1,1,1,2,1,1
3,0,0,2,1,2,2,0,2,2,2,2
4,0,2,1,2,2,2,2,1,1,1,1


## One-hot encode the data
This will produce the format required by the mlxtend package.

In [4]:
data = pd.get_dummies(data, columns = data.columns)
display(data.head())

Unnamed: 0,sex_0,sex_1,age_0,age_1,age_2,age_3,age_4,bmi_0,bmi_1,bmi_2,...,s6_0,s6_1,s6_2,s6_3,s6_4,target_0,target_1,target_2,target_3,target_4
0,0,1,0,0,0,1,0,0,0,0,...,0,0,1,0,0,0,1,0,0,0
1,1,0,0,0,1,0,0,1,0,0,...,1,0,0,0,0,1,0,0,0,0
2,0,1,0,0,0,0,1,0,0,0,...,0,1,0,0,0,0,1,0,0,0
3,1,0,1,0,0,0,0,0,0,1,...,0,0,1,0,0,0,0,1,0,0
4,1,0,0,0,1,0,0,0,1,0,...,0,1,0,0,0,0,1,0,0,0


## Get the frequent itemsets 
This step will produce the frequent itemsets from which we can extract association rules.  We are using a small min_support to get more frequent itemsets.
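
To make "support" and "frequent itemset" concrete, here is a minimal brute-force sketch on a handful of made-up transactions. It is not how `apriori` works internally (Apriori prunes candidates level by level); the toy transactions and the helper names `support` and `frequent_itemsets_bruteforce` are assumptions for illustration only.

```python
from itertools import combinations

# Toy transactions (made up for illustration, not taken from the lab data).
transactions = [
    {"bmi_0", "s5_0", "target_0"},
    {"bmi_0", "s5_0", "target_0"},
    {"bmi_0", "s5_1", "target_1"},
    {"bmi_1", "s5_0", "target_0"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets_bruteforce(transactions, min_support):
    """Enumerate all itemsets whose support is at least `min_support`."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for combo in combinations(items, size):
            s = support(combo, transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
                found_any = True
        # If no itemset of this size is frequent, no larger one can be.
        if not found_any:
            break
    return result


frequent = frequent_itemsets_bruteforce(transactions, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Lowering `min_support` in this sketch admits more (and larger) itemsets, which is the same effect the small threshold has in the cell below.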

In [5]:
from mlxtend.frequent_patterns import apriori

frequent_itemsets = apriori(data, min_support=0.005, use_colnames=True)
display(frequent_itemsets)

Unnamed: 0,support,itemsets
0,0.531674,(sex_0)
1,0.468326,(sex_1)
2,0.192308,(age_0)
3,0.205882,(age_1)
4,0.183258,(age_2)
...,...,...
52423,0.006787,"(s5_0, s1_0, target_0, sex_1, s3_1, s4_0, s6_1..."
52424,0.006787,"(s5_0, s1_0, target_0, s3_1, s4_0, age_0, bp_1..."
52425,0.006787,"(s5_0, s1_0, target_0, s3_1, s4_0, s6_1, bp_1,..."
52426,0.006787,"(s5_0, s1_0, target_0, sex_1, s3_1, s4_0, age_..."


## Get the association rules
We will use confidence to extract the rules from the frequent itemsets.  

Recall that these are the available metrics we can use:
* **support**: the fraction of transactions that contain every item in both the antecedent and the consequent.
* **confidence**: the support of the rule divided by the support of the antecedent.  This is the proportion of transactions containing the antecedent that also contain the consequent.
* **lift**: the ratio of the rule's confidence to the support of the consequent (its unconditional probability).  It measures the dependence between the antecedent and consequent.  Lift > 1 suggests a positive association, lift < 1 a negative association, and lift = 1 is what independence would give.
* **leverage**: the difference between the observed support of the rule and the support expected if the antecedent and consequent were independent (antecedent support × consequent support).  A value of 0 indicates independence.
* **conviction**: (1 - consequent support) / (1 - confidence).  The numerator is how often the rule would be wrong if the antecedent and consequent were independent; the denominator is how often the rule is actually wrong.  A value of 1 indicates independence, and values above 1 mean the rule is wrong less often than chance would predict.  The value ranges from 0 to infinity, and is infinite when confidence is 1.
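
As a worked check of these definitions, the sketch below recomputes the derived metrics from three supports. The helper name `rule_metrics` is an assumption for illustration; the input values are the supports reported in this lab's output for the rule `(bmi_0) -> (target_0)`.

```python
def rule_metrics(antecedent_support, consequent_support, rule_support):
    """Derive confidence, lift, leverage and conviction from three supports."""
    confidence = rule_support / antecedent_support
    lift = confidence / consequent_support
    leverage = rule_support - antecedent_support * consequent_support
    # Conviction is undefined (treated as infinite) when the rule is never wrong.
    if confidence >= 1:
        conviction = float("inf")
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Supports for (bmi_0) -> (target_0) as reported in this lab's rule table.
m = rule_metrics(antecedent_support=0.201357,
                 consequent_support=0.328054,
                 rule_support=0.126697)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

The results should agree, up to rounding, with the confidence, lift, leverage and conviction columns that `association_rules` reports for the same rule.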

In [6]:
from mlxtend.frequent_patterns import association_rules

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)

display(rules)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(age_0),(sex_0),0.192308,0.531674,0.119910,0.623529,1.172766,0.017664,1.243990
1,(age_1),(sex_0),0.205882,0.531674,0.128959,0.626374,1.178116,0.019497,1.253460
2,(age_2),(sex_0),0.183258,0.531674,0.104072,0.567901,1.068138,0.006639,1.083840
3,(bmi_0),(sex_0),0.201357,0.531674,0.149321,0.741573,1.394788,0.042265,1.812217
4,(bmi_3),(sex_0),0.196833,0.531674,0.106335,0.540230,1.016092,0.001684,1.018609
...,...,...,...,...,...,...,...,...,...
285849,"(bp_1, s4_0, s3_1, s6_1)","(s5_0, s1_0, target_0, sex_1, bmi_0, s2_0)",0.013575,0.011312,0.006787,0.500000,44.200000,0.006634,1.977376
285850,"(bp_1, s2_0, s3_1, s6_1)","(s5_0, s1_0, target_0, sex_1, s4_0, bmi_0)",0.009050,0.009050,0.006787,0.750000,82.875000,0.006705,3.963801
285851,"(bp_1, bmi_0, s3_1, s6_1)","(s5_0, s1_0, target_0, sex_1, s4_0, s2_0)",0.009050,0.013575,0.006787,0.750000,55.250000,0.006664,3.945701
285852,"(bmi_0, s2_0, s3_1, s6_1)","(s5_0, s1_0, target_0, sex_1, s4_0, bp_1)",0.013575,0.013575,0.006787,0.500000,36.833333,0.006603,1.972851


## Using rules for feature selection

As you see above, there can be quite a few rules generated.  It is not practical to scan over 280,000 rules hoping that something useful stands out.  So how can we use all that information to help make predictions?  First, by understanding the format of a rule as `{A1..An} -> {C1..Cn}`, or more succinctly `A -> C`, we can observe that the antecedents predict the presence of the consequent, with some measure of "interestingness" like confidence or conviction.  If we are interested in predicting the `target` as the dependent variable, then we are really interested in finding rules where a target item is the only item in the consequent.

We can use a masking technique similar to the one we used in lab 1 to find such rules.  We will start from the rules generated above with the confidence threshold, and filter them further on other metrics, including conviction, later in this lab.  Like confidence, and unlike lift and leverage (which are symmetric), conviction depends on the direction of the rule, i.e. in general conviction(A -> C) ≠ conviction(C -> A).  This is useful to us because we only care about how the antecedent relates to the presence of the consequent (target in our case), and not the other way around.
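
A quick way to see how direction affects each metric is to compute them for a rule and for its reverse. The sketch below does this with the supports this lab reports for `bmi_0` and `target_0`; the helper name `directional_metrics` is an assumption for illustration.

```python
import math


def directional_metrics(sup_a, sup_c, sup_both):
    """Metrics for the rule A -> C, given the three supports."""
    confidence = sup_both / sup_a
    lift = sup_both / (sup_a * sup_c)
    if confidence >= 1:
        conviction = float("inf")
    else:
        conviction = (1 - sup_c) / (1 - confidence)
    return {"confidence": confidence, "lift": lift, "conviction": conviction}


sup_bmi_0, sup_target_0, sup_both = 0.201357, 0.328054, 0.126697

forward = directional_metrics(sup_bmi_0, sup_target_0, sup_both)   # bmi_0 -> target_0
reverse = directional_metrics(sup_target_0, sup_bmi_0, sup_both)   # target_0 -> bmi_0

# Lift is identical in both directions; confidence and conviction are not.
print("lift equal:      ", math.isclose(forward["lift"], reverse["lift"]))
print("confidence:      ", round(forward["confidence"], 3), "vs", round(reverse["confidence"], 3))
print("conviction:      ", round(forward["conviction"], 3), "vs", round(reverse["conviction"], 3))
```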


In [7]:
target_names = {x for x in data.columns if 'target' in x}
# keep rules whose consequent is exactly one item, and that item is a target class
mask = [len(c) == 1 and not c.isdisjoint(target_names) for c in rules.consequents.values]
target_rules = rules[mask]
display(target_rules)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
57,(bmi_0),(target_0),0.201357,0.328054,0.126697,0.629213,1.918016,0.060641,1.812217
63,(bp_0),(target_0),0.131222,0.328054,0.072398,0.551724,1.681807,0.029350,1.498956
88,(s3_3),(target_0),0.147059,0.328054,0.081448,0.553846,1.688276,0.033205,1.506085
92,(s3_4),(target_0),0.029412,0.328054,0.022624,0.769231,2.344828,0.012976,2.911765
97,(s4_0),(target_0),0.382353,0.328054,0.203620,0.532544,1.623342,0.078187,1.437453
...,...,...,...,...,...,...,...,...,...
283972,"(s5_0, s1_0, sex_1, s3_1, s4_0, s6_1, bp_1, s2_0)",(target_0),0.006787,0.328054,0.006787,1.000000,3.048276,0.004561,inf
284214,"(s5_0, s1_0, bmi_0, s3_1, s4_0, age_0, bp_1, s...",(target_0),0.006787,0.328054,0.006787,1.000000,3.048276,0.004561,inf
284426,"(s5_0, s1_0, bmi_0, s3_1, s4_0, s6_1, bp_1, s2_0)",(target_0),0.006787,0.328054,0.006787,1.000000,3.048276,0.004561,inf
284648,"(s5_0, s1_0, sex_1, s3_1, bmi_0, s4_0, age_0, ...",(target_0),0.006787,0.328054,0.006787,1.000000,3.048276,0.004561,inf


Now we have a list of rules where target is the only consequent.  Let's get a list of the distinct features in the antecedents.  This will be a list of features that target is potentially dependent on, and can be used for feature selection purposes.

We can't just use the `unique` method on the `antecedents` column because we don't want the unique sets, we want the unique items across all sets.  We will use the `*` operator in Python for this.  `*` takes an iterable (such as a pandas Series) and passes each of its elements as a separate positional argument to the function being called.  In this case, we pass every antecedent set to the `union` method of `frozenset`, which unions them into one set containing all the unique items.
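
The unpacking step can be tried on its own with a few hand-written sets. This is a minimal sketch; the list `antecedents` below is a made-up stand-in for the `antecedents` column.

```python
# A made-up stand-in for target_rules['antecedents'].
antecedents = [
    frozenset({"bmi_0"}),
    frozenset({"s4_0", "s5_0"}),
    frozenset({"s5_0", "bmi_0"}),
]

# `*antecedents` expands the list into separate positional arguments ...
all_items = frozenset.union(*antecedents)

# ... so the call above is equivalent to writing them out by hand.
written_out = frozenset.union(antecedents[0], antecedents[1], antecedents[2])

# Starting from an empty frozenset also works when the list is empty,
# whereas frozenset.union(*[]) raises a TypeError.
safe_on_empty = frozenset().union(*antecedents)

print(sorted(all_items))
```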

In [8]:
frozenset.union(*target_rules['antecedents'])

frozenset({'age_0',
           'age_1',
           'age_2',
           'age_3',
           'age_4',
           'bmi_0',
           'bmi_1',
           'bmi_2',
           'bmi_3',
           'bmi_4',
           'bp_0',
           'bp_1',
           'bp_2',
           'bp_3',
           'bp_4',
           's1_0',
           's1_1',
           's1_2',
           's1_3',
           's1_4',
           's2_0',
           's2_1',
           's2_2',
           's2_3',
           's3_0',
           's3_1',
           's3_2',
           's3_3',
           's3_4',
           's4_0',
           's4_1',
           's4_2',
           's4_3',
           's4_4',
           's5_0',
           's5_1',
           's5_2',
           's5_3',
           's5_4',
           's6_0',
           's6_1',
           's6_2',
           's6_3',
           's6_4',
           'sex_0',
           'sex_1'})

As you can see, this is nearly the entire feature set: every feature is represented, along with almost every bin (`s2_4` is not listed).  So our rules are not yet providing a lot of value, and we still have more than 10,000 rules.  So far we have only filtered the original rules on a confidence of at least 0.5.  Let's turn to some other, more interesting measures to filter the above list of rules further.

## Finding "interesting" rules
The rules we are interested in are those whose metrics suggest they are more than just random co-occurrences.  There are several metrics that can identify these rules for us.  We will start with lift, which behaves somewhat like a correlation: a value of 1 means the antecedent and consequent look independent, so we want to see values far from 1.  Values above 1 imply a positive association, and values below 1 a negative one.

In [9]:
rules1 = target_rules[target_rules['lift'] > 10]
display(rules1)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
8485,"(s4_2, sex_0, s1_1)",(target_4),0.006787,0.099548,0.006787,1.0,10.045455,0.006112,inf
27985,"(bp_2, s5_4, bmi_4)",(target_4),0.006787,0.099548,0.006787,1.0,10.045455,0.006112,inf
32271,"(bp_4, s6_4, s1_3)",(target_4),0.006787,0.099548,0.006787,1.0,10.045455,0.006112,inf
32358,"(bp_4, s4_2, s3_0)",(target_4),0.006787,0.099548,0.006787,1.0,10.045455,0.006112,inf
42619,"(s5_3, age_3, bmi_4, sex_0)",(target_4),0.006787,0.099548,0.006787,1.0,10.045455,0.006112,inf
49215,"(s5_3, bp_1, bmi_4, sex_0)",(target_4),0.00905,0.099548,0.00905,1.0,10.045455,0.008149,inf
49878,"(s5_3, bmi_4, sex_0, s3_0)",(target_4),0.00905,0.099548,0.00905,1.0,10.045455,0.008149,inf
54402,"(s4_2, sex_0, s1_1, s3_0)",(target_4),0.006787,0.099548,0.006787,1.0,10.045455,0.006112,inf
57428,"(s5_3, s4_2, sex_0, s3_0)",(target_4),0.006787,0.099548,0.006787,1.0,10.045455,0.006112,inf
69081,"(bp_2, s5_4, bmi_4, sex_1)",(target_4),0.006787,0.099548,0.006787,1.0,10.045455,0.006112,inf


From these results, we see that there are some rules with very high lift, and the conviction for these is also infinite.  This may seem exciting, but if you look at the support for each rule, it is very small, indicating that these rules apply to just 0.7% to 0.9% of the rows.  So these rules are very interesting, but only for a very small part of the data.  If the dataset is small, this may represent just a few rows of the original dataset.
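
To put these supports in perspective, they can be converted back to row counts. The sketch below assumes the 442 rows of scikit-learn's diabetes data set; the helper name `rows_covered` is hypothetical.

```python
N_ROWS = 442  # assumption: number of rows in scikit-learn's diabetes data set


def rows_covered(support, n_rows=N_ROWS):
    """Approximate number of rows in which a rule (or itemset) occurs."""
    return round(support * n_rows)


for s in (0.006787, 0.00905, 0.05):
    print(f"support {s} -> about {rows_covered(s)} rows")
```

Under that assumption, the high-lift rules above rest on only three or four rows each, while a support threshold of 0.05 requires more than twenty.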

Let's take another approach by also considering the support along with lift.

In [10]:
rules2 = target_rules[(target_rules['lift'] > 2) & (target_rules['support'] > 0.05)] 
display(rules2)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
105,(s5_0),(target_0),0.151584,0.328054,0.10181,0.671642,2.047349,0.052082,2.04638
2021,"(s5_0, bmi_0)",(target_0),0.081448,0.328054,0.058824,0.722222,2.201533,0.032104,2.419005
2029,"(bmi_0, s6_1)",(target_0),0.072398,0.328054,0.052036,0.71875,2.190948,0.028286,2.38914
2525,"(s4_0, bp_0)",(target_0),0.072398,0.328054,0.052036,0.71875,2.190948,0.028286,2.38914
3724,"(s4_0, s5_0)",(target_0),0.128959,0.328054,0.090498,0.701754,2.139141,0.048192,2.252994
9849,"(s4_0, s5_0, sex_0)",(target_0),0.09276,0.328054,0.061086,0.658537,2.007401,0.030656,1.967841
24652,"(s4_0, s5_0, bmi_0)",(target_0),0.076923,0.328054,0.054299,0.705882,2.151724,0.029064,2.284615


Now we have a smaller set of rules that appear in at least 5% of our data, and also have a lift that indicates some dependency between the antecedent and consequent.

## Conviction

Conviction has an advantage over symmetric metrics such as lift and leverage in that its score depends on the direction of the rule.  Which side is the antecedent and which is the consequent matters when computing conviction, and this is useful when we want interesting rules whose consequent is the target feature we care about.

In [11]:
rules3 = target_rules[(target_rules['conviction'] > 2) & (target_rules['support'] > 0.05)] 
display(rules3)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
105,(s5_0),(target_0),0.151584,0.328054,0.10181,0.671642,2.047349,0.052082,2.04638
2021,"(s5_0, bmi_0)",(target_0),0.081448,0.328054,0.058824,0.722222,2.201533,0.032104,2.419005
2029,"(bmi_0, s6_1)",(target_0),0.072398,0.328054,0.052036,0.71875,2.190948,0.028286,2.38914
2525,"(s4_0, bp_0)",(target_0),0.072398,0.328054,0.052036,0.71875,2.190948,0.028286,2.38914
3724,"(s4_0, s5_0)",(target_0),0.128959,0.328054,0.090498,0.701754,2.139141,0.048192,2.252994
24652,"(s4_0, s5_0, bmi_0)",(target_0),0.076923,0.328054,0.054299,0.705882,2.151724,0.029064,2.284615


As you can see, we got nearly the same set of rules as we did using lift, with just one exception.  We eliminated the `(sex_0, s4_0, s5_0) -> (target_0)` rule because its conviction was just below 2.

## Feature Selection: get the unique antecedents in these rules
A primary use case of ARM is to mine for features that are associated with a target.  If you have a large dataset and/or a lot of features, this method can help to narrow down which features might prove the most useful to include in statistical modeling.  

The code below will give us a list of features that we could investigate further with other methods like regression or decision trees, for example.  There is less utility in this approach if you are planning to use a machine learning method that is robust to large feature sets like random forest, XGBoost, or neural networks, but those methods are plagued by a lack of transparency in how the predictions are made.  We've already discussed the issue of non-transparent models in the health care space previously in this course.

In [12]:
frozenset.union(*rules3['antecedents'])

frozenset({'bmi_0', 'bp_0', 's4_0', 's5_0', 's6_1'})

These results indicate that bmi, bp, s4, s5, and s6 may be strong predictors of the target.  You can experiment with using different metrics and thresholds to see the effect on this final set of features.  The identified features can be run through other machine learning models such as regression as selected features, or you can simply use the rules themselves to make a prediction.
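
Going from the selected items back to the original column names is a small string operation. The sketch below assumes the `feature_bin` naming produced by `pd.get_dummies` in this lab; the helper name `base_feature` is hypothetical.

```python
# The items returned by the union above.
selected_items = frozenset({'bmi_0', 'bp_0', 's4_0', 's5_0', 's6_1'})


def base_feature(item):
    """Strip the trailing '_<bin>' suffix added by one-hot encoding."""
    return item.rsplit("_", 1)[0]


selected_features = sorted({base_feature(item) for item in selected_items})
print(selected_features)  # ['bmi', 'bp', 's4', 's5', 's6']
```

The resulting list can be used to select columns from the original, continuous-valued DataFrame before fitting another model.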

## Using rules for prediction

While the primary use case of ARM is not to create a predictive model, using the generated rules for prediction is straightforward.  Pick a rule, such as `(s4_0, s5_0, bmi_0) -> (target_0)`.  When you see a patient with bmi class 0, s4 class 0, and s5 class 0, you can predict target class 0 with a confidence of about 70.5%.

Note from the table above that the support for the rule is only about 5.4%, meaning that its antecedent and consequent occur together in 5.4% of the rows.  The confidence of the rule is 70.5%, meaning that of the rows that contain bmi class 0, s4 class 0, and s5 class 0, 70.5% also contain target class 0.  This can be interpreted as a measure of accuracy on the data used to mine the rule.  There may be other rules that are associated with different classes, but with much lower support or confidence.
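
Applying a single rule amounts to a subset test on the patient's items. The following is a minimal sketch, not a full rule-based classifier: the helper name `predict_with_rule` and the two example patients are made up, while the rule and its confidence come from the table above.

```python
# The rule (s4_0, s5_0, bmi_0) -> (target_0) with its reported confidence.
RULE = {
    "antecedent": frozenset({"s4_0", "s5_0", "bmi_0"}),
    "consequent": "target_0",
    "confidence": 0.705882,
}


def predict_with_rule(patient_items, rule):
    """Return (predicted item, confidence) if the rule fires, else None."""
    if rule["antecedent"] <= set(patient_items):
        return rule["consequent"], rule["confidence"]
    return None  # the rule says nothing about this patient


# Hypothetical patients, already binned and itemized like the training data.
patient_a = {"sex_1", "age_2", "bmi_0", "bp_1", "s4_0", "s5_0"}
patient_b = {"sex_0", "age_3", "bmi_2", "bp_1", "s4_0", "s5_0"}

print(predict_with_rule(patient_a, RULE))  # rule fires
print(predict_with_rule(patient_b, RULE))  # rule does not apply
```

The `None` case matters in practice: with a support of about 5.4%, this rule makes no statement about most rows.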

You have to be careful with this approach because it is easy to overfit the data for predictions, and in fact we've used the entire dataset to generate these rules, so the full set of rules is actually fitted fairly completely to the original data.

Some practical considerations of using rules for classification generally preclude this approach.  One consideration is that the new data you wish to classify must be itemized in the same way as the training data set.  If you have used KMeans to bin data as we have, then you must save that transformation and reuse it, otherwise the bins could shift with the addition of new data.  This leads to the second problem, which isn't unique to ARM but more obvious.  New data may be outside the original ranges used for itemization, and therefore unable to be fit into the existing association rules.
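
The idea of saving and reusing the binning can be sketched with plain bin edges. Everything below is an assumption for illustration: the edge values are invented, and `to_item` is a hypothetical helper. In practice you would keep the fitted `KBinsDiscretizer` objects (or their `bin_edges_`) from the preprocessing step rather than refitting on new data.

```python
from bisect import bisect_right

# Hypothetical bin edges saved from the training data for one feature (5 bins).
TRAIN_EDGES = [-0.10, -0.04, 0.00, 0.03, 0.07, 0.13]


def to_item(feature, value, edges):
    """Map a raw value to an item name using saved edges.

    Returns (item_name, in_training_range). Values outside the training
    range are placed in the first or last bin and flagged.
    """
    in_range = edges[0] <= value <= edges[-1]
    # Only the interior edges decide the bin index (0 .. number of bins - 1).
    bin_index = bisect_right(edges[1:-1], value)
    return f"{feature}_{bin_index}", in_range


print(to_item("bp", 0.01, TRAIN_EDGES))  # inside the training range
print(to_item("bp", 0.50, TRAIN_EDGES))  # outside: last bin, flagged
```

Flagging out-of-range values, rather than silently placing them in an edge bin, makes it visible when a rule is being applied to data unlike anything it was mined from.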

In short, for this data set, with our current metric threshold, we have covered very little in the way of predicting diabetes disease progression accurately using just association rules.  When this is the case, it may be better to focus on the feature selection method and proceed with other machine learning methods that can handle continuous valued variables.