# Frequent Itemset Mining: Apriori Alternatives

In this notebook, we apply **FP-Growth** and **maximal frequent itemset** (FP-Max) mining, two alternatives to **Apriori**, to the congressional voting records dataset. You can learn more about this dataset here: https://archive.ics.uci.edu/ml/datasets/congressional+voting+records

### Import required libraries

In [108]:
import pandas as pd
import numpy as np
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules, fpgrowth, fpmax
import matplotlib.pyplot as plt
%matplotlib inline

### T1: Data Loading

The data is located here: `/dsa/data/DSA-8410/association-mining/house-vote/house-votes-84.csv`


In [109]:
df = pd.read_csv('/dsa/data/DSA-8410/association-mining/house-vote/house-votes-84.csv') 
df

Unnamed: 0,Class Name,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
430,republican,n,n,y,y,y,y,n,n,y,y,n,y,y,y,n,y
431,democrat,n,n,y,n,n,n,y,y,y,y,n,n,n,n,n,y
432,republican,n,?,n,y,y,y,n,n,n,n,y,y,y,y,n,y
433,republican,n,n,n,y,y,y,?,?,?,?,n,y,y,y,n,y


### T2: Show the number of transactions

In [110]:
print(f"Num of transactions = {df.shape[0]}")

Num of transactions = 435


### T3: Transform the dataset to a binary incidence matrix for applying itemset mining methods

In [111]:
trans_data_enc = pd.get_dummies(df)

In [112]:
trans_data_enc.head()

Unnamed: 0,Class Name_democrat,Class Name_republican,handicapped-infants_?,handicapped-infants_n,handicapped-infants_y,water-project-cost-sharing_?,water-project-cost-sharing_n,water-project-cost-sharing_y,adoption-of-the-budget-resolution_?,adoption-of-the-budget-resolution_n,...,superfund-right-to-sue_y,crime_?,crime_n,crime_y,duty-free-exports_?,duty-free-exports_n,duty-free-exports_y,export-administration-act-south-africa_?,export-administration-act-south-africa_n,export-administration-act-south-africa_y
0,0,1,0,1,0,0,0,1,0,1,...,1,0,0,1,0,1,0,0,0,1
1,0,1,0,1,0,0,0,1,0,1,...,1,0,0,1,0,1,0,1,0,0
2,1,0,1,0,0,0,0,1,0,0,...,1,0,0,1,0,1,0,0,1,0
3,1,0,0,1,0,0,0,1,0,0,...,1,0,1,0,0,1,0,0,0,1
4,1,0,0,0,1,0,0,1,0,0,...,1,0,0,1,0,0,1,0,0,1
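
To make the encoding step concrete, here is a minimal pure-Python sketch of what a one-hot (binary incidence) encoding does. The records and the column names `vote-a` and `vote-b` are made up for illustration; only the `column_value` naming pattern and the fact that `?` becomes its own item are taken from the output above.

```python
# Toy records shaped like the voting data; the values are hypothetical.
records = [
    {"Class Name": "republican", "vote-a": "n", "vote-b": "y"},
    {"Class Name": "democrat", "vote-a": "?", "vote-b": "y"},
    {"Class Name": "democrat", "vote-a": "y", "vote-b": "n"},
]


def one_hot(rows):
    """Return (items, matrix) with one 0/1 column per column_value pair."""
    # Every distinct (column, value) pair becomes one item, named column_value.
    items = sorted({f"{col}_{val}" for row in rows for col, val in row.items()})
    matrix = []
    for row in rows:
        present = {f"{col}_{val}" for col, val in row.items()}
        matrix.append([int(item in present) for item in items])
    return items, matrix


items, matrix = one_hot(records)
for item, column in zip(items, zip(*matrix)):
    print(f"{item:25s} {list(column)}")
```

Each row contains exactly one 1 per original column, so a missing vote (`?`) is treated as an item in its own right rather than being dropped.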


### T4: Identify Frequent Patterns with FP-Growth Method. Use min_support = 0.3. Show the number of itemsets per itemset length.

In [113]:
freq_items = fpgrowth(trans_data_enc, min_support=0.3, use_colnames=True)

In [114]:
freq_items.shape

(973, 2)

In [115]:
# Put itemsets first, then record how many items each itemset contains
freq_items = freq_items.reindex(columns=['itemsets', 'support'])
freq_items['length'] = freq_items['itemsets'].apply(len)

In [116]:
freq_items

Unnamed: 0,itemsets,support,length
0,( religious-groups-in-schools_y),0.625287,1
1,( export-administration-act-south-africa_y),0.618391,1
2,( crime_y),0.570115,1
3,( handicapped-infants_n),0.542529,1
4,( duty-free-exports_n),0.535632,1
...,...,...,...
968,"( aid-to-nicaraguan-contras_y, religious-grou...",0.305747,3
969,"( el-salvador-aid_n, religious-groups-in-scho...",0.301149,3
970,"(Class Name_democrat, aid-to-nicaraguan-contr...",0.303448,4
971,"( water-project-cost-sharing_n, export-admini...",0.303448,2


In [117]:
freq_items.groupby('length').size()  # number of itemsets per length

length
1     33
2    174
3    313
4    270
5    134
6     43
7      6
dtype: int64
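
As a rough illustration of what the support threshold and the per-length counts mean, the sketch below computes support by hand on a toy set of transactions. It uses brute-force enumeration rather than FP-Growth, so it is only suitable for tiny inputs. The transactions and the 0.5 threshold are assumptions chosen for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions; each one is a set of items.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Brute force: test every candidate itemset (not FP-Growth)."""
    all_items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(all_items) + 1):
        for candidate in combinations(all_items, k):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result


freq = frequent_itemsets(transactions, min_support=0.5)
per_length = Counter(len(itemset) for itemset in freq)
print(dict(per_length))
```

FP-Growth should return the same itemsets for the same threshold; it avoids enumerating every candidate by compressing the transactions into a prefix tree.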

### T5: Generate Association Rules from Frequent Itemsets with min 90% confidence.

* Show the total number of rules

In [119]:
rules = association_rules(freq_items, metric="confidence", min_threshold=0.9)
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,"( crime_y, duty-free-exports_n)",( religious-groups-in-schools_y),0.432184,0.625287,0.390805,0.904255,1.446144,0.120565,3.913665
1,"( handicapped-infants_n, duty-free-exports_n)",( religious-groups-in-schools_y),0.335632,0.625287,0.314943,0.938356,1.50068,0.105076,6.078672
2,"( handicapped-infants_n, duty-free-exports_n)",( crime_y),0.335632,0.570115,0.305747,0.910959,1.597851,0.114398,4.82794
3,( el-salvador-aid_y),( religious-groups-in-schools_y),0.487356,0.625287,0.452874,0.929245,1.486109,0.148136,5.295939
4,( el-salvador-aid_y),( crime_y),0.487356,0.570115,0.445977,0.915094,1.605105,0.168128,5.063091


In [120]:
len(rules)

2990

In [121]:
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,"( crime_y, duty-free-exports_n)",( religious-groups-in-schools_y),0.432184,0.625287,0.390805,0.904255,1.446144,0.120565,3.913665
1,"( handicapped-infants_n, duty-free-exports_n)",( religious-groups-in-schools_y),0.335632,0.625287,0.314943,0.938356,1.500680,0.105076,6.078672
2,"( handicapped-infants_n, duty-free-exports_n)",( crime_y),0.335632,0.570115,0.305747,0.910959,1.597851,0.114398,4.827940
3,( el-salvador-aid_y),( religious-groups-in-schools_y),0.487356,0.625287,0.452874,0.929245,1.486109,0.148136,5.295939
4,( el-salvador-aid_y),( crime_y),0.487356,0.570115,0.445977,0.915094,1.605105,0.168128,5.063091
...,...,...,...,...,...,...,...,...,...
2985,"(Class Name_democrat, religious-groups-in-sch...",( aid-to-nicaraguan-contras_y),0.305747,0.556322,0.303448,0.992481,1.784005,0.133354,59.009195
2986,"( aid-to-nicaraguan-contras_y, religious-grou...",(Class Name_democrat),0.305747,0.613793,0.303448,0.992481,1.616964,0.115783,51.365517
2987,"(Class Name_democrat, religious-groups-in-sch...","( aid-to-nicaraguan-contras_y, physician-fee-...",0.310345,0.485057,0.303448,0.977778,2.015798,0.152913,23.172414
2988,"( aid-to-nicaraguan-contras_y, religious-grou...","(Class Name_democrat, physician-fee-freeze_n)",0.326437,0.563218,0.303448,0.929577,1.650474,0.119593,6.202299
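
The metric columns can be checked by hand. The sketch below recomputes confidence, lift, leverage, and conviction for rule 3 (`el-salvador-aid_y` → `religious-groups-in-schools_y`) using their standard definitions. The counts 212, 272, and 197 are not shown in the notebook; they are inferred by multiplying the reported supports by the 435 transactions, and `rule_metrics` is a hypothetical helper name.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Standard association-rule metrics computed from raw transaction counts."""
    sup_a = n_antecedent / n_total   # antecedent support
    sup_c = n_consequent / n_total   # consequent support
    sup = n_both / n_total           # support of antecedent and consequent together
    confidence = sup / sup_a
    lift = confidence / sup_c
    leverage = sup - sup_a * sup_c
    # Conviction is undefined (infinite) when confidence is exactly 1.
    conviction = float("inf") if confidence == 1 else (1 - sup_c) / (1 - confidence)
    return {
        "support": sup,
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Counts inferred from the supports of rule 3 above (assumption: 435 rows).
metrics = rule_metrics(n_total=435, n_antecedent=212, n_consequent=272, n_both=197)
for name, value in metrics.items():
    print(f"{name:11s} {value:.6f}")
```

The printed values should agree with the row for rule 3 up to rounding, which is a quick sanity check on how to read the table.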


### T6: Identify the 5 highest-confidence rules where `consequents` is only `Class Name_democrat`. Similarly, identify the 5 highest-confidence rules where `consequents` is only `Class Name_republican`.

* Iterate over these two subsets of rules and print only antecedents, consequents, and confidence.
* Based on these rules, characterize democrat and republican congress members

In [122]:
top_democrat = rules[rules['consequents'] == {'Class Name_democrat'}].sort_values(by=['confidence'], ascending=False).head()
top_democrat[['antecedents', 'consequents', 'confidence']]

Unnamed: 0,antecedents,consequents,confidence
1359,"( duty-free-exports_y, adoption-of-the-budget...",(Class Name_democrat),1.0
2631,"( aid-to-nicaraguan-contras_y, superfund-righ...",(Class Name_democrat),1.0
2679,"( superfund-right-to-sue_n, anti-satellite-te...",(Class Name_democrat),1.0
2706,"( aid-to-nicaraguan-contras_y, superfund-righ...",(Class Name_democrat),1.0
1404,"( duty-free-exports_y, aid-to-nicaraguan-cont...",(Class Name_democrat),1.0


In [124]:
top_republican = rules[rules['consequents'] == {'Class Name_republican'}].sort_values(by=['confidence'], ascending=False).head()
top_republican[['antecedents', 'consequents', 'confidence']]

Unnamed: 0,antecedents,consequents,confidence
666,"( synfuels-corporation-cutback_n, physician-f...",(Class Name_republican),0.978261
604,"( el-salvador-aid_y, physician-fee-freeze_y, ...",(Class Name_republican),0.971631
623,"( mx-missile_n, physician-fee-freeze_y, adop...",(Class Name_republican),0.97037
608,"( crime_y, physician-fee-freeze_y, adoption-...",(Class Name_republican),0.963768
595,"( physician-fee-freeze_y, adoption-of-the-bud...",(Class Name_republican),0.958904


The table truncates the antecedents because the itemsets are long, but the visible items still show how votes separate the two parties. Each rule reads as "members who vote this way belong to this party with the stated confidence". The Democrat rules feature yes votes on duty-free exports and aid to Nicaraguan contras and no votes on superfund right to sue; all five have confidence 1.0, so every member matching those antecedents is a Democrat. The Republican rules feature yes votes on physician fee freeze, crime, and El Salvador aid and no votes on the Synfuels Corporation cutback and the MX missile, with confidence between about 0.96 and 0.98.
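
The truncated antecedents can be printed in full by iterating over the rules, as T6 asks. The following is a minimal sketch: `format_rule` is a hypothetical helper, and the toy rule uses made-up items (`vote-a_y`, `vote-b_n`) rather than the real antecedents, which are not fully visible above. The helper strips the leading spaces that appear in the item names.

```python
def format_rule(antecedents, consequents, confidence):
    """Render one rule as a single untruncated line of text."""
    lhs = ", ".join(sorted(item.strip() for item in antecedents))
    rhs = ", ".join(sorted(item.strip() for item in consequents))
    return f"{{{lhs}}} -> {{{rhs}}} (confidence = {confidence:.3f})"


# Hypothetical rule in the same shape as a row of the rules table.
toy_rules = [
    (frozenset({" vote-a_y", " vote-b_n"}), frozenset({"Class Name_democrat"}), 1.0),
]

lines = [format_rule(a, c, conf) for a, c, conf in toy_rules]
for line in lines:
    print(line)

# With the notebook's DataFrames, the same helper could be applied like this:
# for _, row in top_democrat.iterrows():
#     print(format_rule(row["antecedents"], row["consequents"], row["confidence"]))
```

Setting `pd.set_option('display.max_colwidth', None)` before displaying the DataFrame is another way to see the full itemsets.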

### T7: Show the number of maximal frequent itemsets for min_support = 0.3

In [125]:
max_patterns = fpmax(trans_data_enc, min_support=0.3, use_colnames=True)

In [126]:
# Put itemsets first, then record how many items each itemset contains
max_patterns = max_patterns.reindex(columns=['itemsets', 'support'])
max_patterns['length'] = max_patterns['itemsets'].apply(len)

In [127]:
print(f"Total number of maximal frequent patterns = {max_patterns.shape[0]}")
max_patterns

Total number of maximal frequent patterns = 179


Unnamed: 0,itemsets,support,length
0,( synfuels-corporation-cutback_y),0.344828,1
1,"( education-spending_n, religious-groups-in-s...",0.301149,2
2,"( adoption-of-the-budget-resolution_y, religi...",0.303448,2
3,"( el-salvador-aid_n, religious-groups-in-scho...",0.301149,3
4,"(Class Name_democrat, aid-to-nicaraguan-contr...",0.303448,4
...,...,...,...
174,"( crime_y, export-administration-act-south-af...",0.340230,2
175,"( synfuels-corporation-cutback_n, crime_y, r...",0.328736,3
176,"( synfuels-corporation-cutback_n, adoption-of...",0.305747,2
177,"( synfuels-corporation-cutback_n, export-admi...",0.381609,2
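
The drop from 973 frequent itemsets to 179 maximal ones follows from the definition: a frequent itemset is maximal when none of its proper supersets is frequent. The sketch below applies that definition directly to a hypothetical list of frequent itemsets. It is a quadratic filter meant for illustration, not the FP-Max algorithm.

```python
def maximal_itemsets(frequent):
    """Keep only the itemsets that have no proper superset in the list."""
    frequent = [frozenset(s) for s in frequent]
    # `s < other` is True when s is a proper subset of other.
    return [s for s in frequent if not any(s < other for other in frequent)]


# Hypothetical frequent itemsets, assumed to come from an earlier mining step.
frequent = [{"a"}, {"b"}, {"c"}, {"d"}, {"a", "b"}, {"a", "c"}]

maximal = maximal_itemsets(frequent)
for itemset in maximal:
    print(sorted(itemset))
```

Every frequent itemset is a subset of at least one maximal itemset, so the maximal set is a compact summary. The supports of the dropped subsets are not recoverable from it.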
