# Apriori


The Apriori algorithm mines frequent itemsets and derives association rules from a transactional database. It is controlled by two parameters, “support” and “confidence”. The support of an itemset is the fraction of transactions that contain it; the confidence of a rule A → B is the conditional probability of B given A, estimated as support(A and B together) / support(A).
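
As a small illustration of the two measures, the sketch below computes them by hand on a made-up list of five transactions. The data and the helper names `support` and `confidence` are assumptions for illustration only; they are not part of the datasets or libraries used later in this notebook.

```python
# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({"bread", "milk"}, transactions))       # 3 of 5 transactions
print(confidence({"bread"}, {"milk"}, transactions))  # 3 of the 4 bread transactions
```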

A key concept in the Apriori algorithm is the anti-monotonicity of the support measure: adding items to an itemset can never increase its support. It follows that

1. All subsets of a frequent itemset must be frequent
2. Similarly, for any infrequent itemset, all its supersets must be infrequent too
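
The property can be checked numerically. The following is a minimal sketch on made-up transactions (the data and the `support` helper are assumptions for illustration): every proper subset of an itemset has support at least as large as the itemset itself.

```python
from itertools import combinations

# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"beer", "chips", "salsa"},
    {"beer", "chips"},
    {"beer", "salsa"},
    {"chips", "salsa"},
    {"beer", "chips", "salsa"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

full = ("beer", "chips", "salsa")
full_support = support(full)

# Every non-empty proper subset must have support >= that of the full itemset.
for k in range(1, len(full)):
    for subset in combinations(full, k):
        assert support(subset) >= full_support
        print(subset, support(subset), ">=", full_support)
```

This is what lets Apriori skip any candidate that has an infrequent subset without counting it.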


###  Algorithm
The following are the main steps of the algorithm:

1. Calculate the support of all itemsets of size k = 1 in the transactional database (note that support is the frequency of 
   occurrence of an itemset). This is called generating the candidate set.
2. Prune the candidate set by eliminating itemsets with a support less than the given threshold.
3. Join the frequent itemsets to form candidate sets of size k + 1, and repeat the above steps until no more itemsets can be formed. This 
   happens when every set formed has a support less than the given threshold.
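
The three steps above can be sketched in plain Python. This is a simplified illustration, not the implementation used by `mlxtend`; the function name `apriori_sketch` and the toy transactions are assumptions introduced here.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Step 1: the candidate set of 1-itemsets.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Step 2: prune candidates whose support is below the threshold.
        level = {c: support(c) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Step 3: join frequent k-itemsets into (k+1)-item candidates, keeping
        # only those whose k-item subsets are all frequent (anti-monotonicity).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Toy transactions (hypothetical data, for illustration only).
toy = [
    {"beer", "chips", "salsa"},
    {"beer", "chips"},
    {"beer", "salsa"},
    {"chips", "salsa"},
    {"beer", "chips", "salsa"},
]
result = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With a threshold of 0.5, the three single items and the three pairs are kept, while the three-item set (support 0.4) is pruned and the loop stops.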

### Libraries used in this notebook are listed below

### Install library for apriori algorithm using:
!pip install mlxtend

In [146]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
from sklearn import preprocessing
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score


from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score

import matplotlib.pyplot as plt

### Load the "basket" data

In [147]:
# Load dataset and display first five rows.
df = pd.read_csv("BASKETS1n")
df.head()

Unnamed: 0,cardid,value,pmethod,sex,homeown,income,age,fruitveg,freshmeat,dairy,cannedveg,cannedmeat,frozenmeal,beer,wine,softdrink,fish,confectionery
0,39808,42.7123,CHEQUE,M,NO,27000,46,F,T,T,F,F,F,F,F,F,F,T
1,67362,25.3567,CASH,F,NO,30000,28,F,T,F,F,F,F,F,F,F,F,T
2,10872,20.6176,CASH,M,NO,13200,36,F,F,F,T,F,T,T,F,F,T,F
3,26748,23.6883,CARD,F,NO,12200,26,F,F,T,F,F,F,F,T,F,F,F
4,91609,18.8133,CARD,M,YES,11000,24,F,F,F,F,F,F,F,F,F,F,F


### Perform pre-processing (if required)

In [148]:
# Select only the product columns and convert the 'T'/'F' flags to booleans
product_cols = ['fruitveg','freshmeat','dairy','cannedveg','cannedmeat','frozenmeal','beer','wine','softdrink','fish','confectionery']
df_prod = (df[product_cols] == 'T')  # the comparison already yields a boolean DataFrame
df_prod

Unnamed: 0,fruitveg,freshmeat,dairy,cannedveg,cannedmeat,frozenmeal,beer,wine,softdrink,fish,confectionery
0,False,True,True,False,False,False,False,False,False,False,True
1,False,True,False,False,False,False,False,False,False,False,True
2,False,False,False,True,False,True,True,False,False,True,False
3,False,False,True,False,False,False,False,True,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...
995,False,False,False,True,False,False,False,False,False,False,False
996,False,False,False,True,False,False,False,False,False,True,False
997,False,True,False,False,False,False,False,False,False,False,False
998,True,False,False,False,False,False,False,True,False,False,True


### Q1. Find frequent itemsets in the dataset using Apriori

In [149]:
# Apriori with min support 0.1 (confidence is not used here; it is applied when generating rules)
ap = apriori(df_prod, min_support=0.1,use_colnames=True)
ap

Unnamed: 0,support,itemsets
0,0.299,(fruitveg)
1,0.183,(freshmeat)
2,0.177,(dairy)
3,0.303,(cannedveg)
4,0.204,(cannedmeat)
5,0.302,(frozenmeal)
6,0.293,(beer)
7,0.287,(wine)
8,0.184,(softdrink)
9,0.292,(fish)


### Q2. Find the association rules in the dataset having min confidence 10%

In [150]:
# find rules
ar=association_rules(ap,metric='confidence',support_only=False,min_threshold=0.1)
ar

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(fruitveg),(fish),0.299,0.292,0.145,0.48495,1.660787,0.057692,1.374623
1,(fish),(fruitveg),0.292,0.299,0.145,0.496575,1.660787,0.057692,1.392463
2,(frozenmeal),(cannedveg),0.302,0.303,0.173,0.572848,1.890586,0.081494,1.631736
3,(cannedveg),(frozenmeal),0.303,0.302,0.173,0.570957,1.890586,0.081494,1.626877
4,(cannedveg),(beer),0.303,0.293,0.167,0.551155,1.881075,0.078221,1.575154
5,(beer),(cannedveg),0.293,0.303,0.167,0.569966,1.881075,0.078221,1.620802
6,(frozenmeal),(beer),0.302,0.293,0.17,0.562914,1.921208,0.081514,1.61753
7,(beer),(frozenmeal),0.293,0.302,0.17,0.580205,1.921208,0.081514,1.662715
8,(confectionery),(wine),0.276,0.287,0.144,0.521739,1.817906,0.064788,1.490818
9,(wine),(confectionery),0.287,0.276,0.144,0.501742,1.817906,0.064788,1.453063
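
The remaining columns in the table above follow from the three support values. A minimal sketch, using the numbers from row 0 (`fruitveg` → `fish`) and the standard definitions of confidence, lift, leverage, and conviction; the variable names are chosen here for illustration:

```python
# Values copied from row 0 of the rules table above.
antecedent_support = 0.299   # support(fruitveg)
consequent_support = 0.292   # support(fish)
rule_support = 0.145         # support(fruitveg and fish together)

confidence = rule_support / antecedent_support
lift = confidence / consequent_support
leverage = rule_support - antecedent_support * consequent_support
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 6), round(lift, 6), round(leverage, 6), round(conviction, 6))
```

A lift above 1 indicates that the two items occur together more often than would be expected if they were independent.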


### Q3. Find association rules having minimum antecedent_len 2 & confidence greater than 0.75

In [151]:
# Rules with confidence of at least 0.75 whose antecedent contains at least 2 items
ar = association_rules(ap, metric='confidence', support_only=False, min_threshold=0.75)
# DataFrame.drop returns a copy, so filter with a boolean mask and reassign instead
ar = ar[ar['antecedents'].apply(len) >= 2]
ar

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,"(frozenmeal, cannedveg)",(beer),0.173,0.293,0.146,0.843931,2.880309,0.095311,4.530037
1,"(frozenmeal, beer)",(cannedveg),0.17,0.303,0.146,0.858824,2.834401,0.09449,4.937083
2,"(cannedveg, beer)",(frozenmeal),0.167,0.302,0.146,0.874251,2.894873,0.095566,5.550762


### Load the "zoo" data

In [152]:
# load the dataset and display first five rows
df = pd.read_csv("zoo.txt",header=None)
df.columns=['animal_name','hair','feathers','eggs','milk','airborne','aquatic','predator','toothed','backbone','breathes','venomous','fins','legs','tail','domestic','catsize','type']
df.head()

Unnamed: 0,animal_name,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs,tail,domestic,catsize,type
0,aardvark,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
1,antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1
2,bass,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4
3,bear,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
4,boar,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1


### Q4. Perform pre-processing (if required)

In [153]:
#dropping first column - name
df=df.drop(['animal_name'],axis=1)
#one hot encoding column legs
df = pd.concat([df,pd.get_dummies(df['legs'], prefix='legs')],axis=1)
df.drop(['legs'],axis=1, inplace=True)
#replacing class type and one hot encoding it
df = pd.concat([df,pd.get_dummies(df['type'], prefix='type')],axis=1)
df.drop(['type'],axis=1, inplace=True)
df


Unnamed: 0,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,...,legs_5,legs_6,legs_8,type_1,type_2,type_3,type_4,type_5,type_6,type_7
0,1,0,0,1,0,0,1,1,1,1,...,0,0,0,1,0,0,0,0,0,0
1,1,0,0,1,0,0,0,1,1,1,...,0,0,0,1,0,0,0,0,0,0
2,0,0,1,0,0,1,1,1,1,0,...,0,0,0,0,0,0,1,0,0,0
3,1,0,0,1,0,0,1,1,1,1,...,0,0,0,1,0,0,0,0,0,0
4,1,0,0,1,0,0,1,1,1,1,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,1,0,0,1,0,0,0,1,1,1,...,0,0,0,1,0,0,0,0,0,0
97,1,0,1,0,1,0,0,0,0,1,...,0,1,0,0,0,0,0,0,1,0
98,1,0,0,1,0,0,1,1,1,1,...,0,0,0,1,0,0,0,0,0,0
99,0,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1


### Q5. Find frequent itemsets in zoo dataset having min support 0.5 

In [154]:
# Apriori with min support 0.5 (confidence is applied later, when generating rules)
ap = apriori(df, min_support=0.5,use_colnames=True)
ap

Unnamed: 0,support,itemsets
0,0.584158,(eggs)
1,0.554455,(predator)
2,0.60396,(toothed)
3,0.821782,(backbone)
4,0.792079,(breathes)
5,0.742574,(tail)
6,0.60396,"(backbone, toothed)"
7,0.514851,"(tail, toothed)"
8,0.683168,"(backbone, breathes)"
9,0.732673,"(backbone, tail)"


### Q6. Find association rules having min confidence 0.5

In [155]:
# Find and display rules
ar=association_rules(ap,metric='confidence',support_only=False,min_threshold=0.5)
ar

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(backbone),(toothed),0.821782,0.60396,0.60396,0.73494,1.216867,0.107637,1.494149
1,(toothed),(backbone),0.60396,0.821782,0.60396,1.0,1.216867,0.107637,inf
2,(tail),(toothed),0.742574,0.60396,0.514851,0.693333,1.147978,0.066366,1.291433
3,(toothed),(tail),0.60396,0.742574,0.514851,0.852459,1.147978,0.066366,1.744774
4,(backbone),(breathes),0.821782,0.792079,0.683168,0.831325,1.049548,0.032252,1.232673
5,(breathes),(backbone),0.792079,0.821782,0.683168,0.8625,1.049548,0.032252,1.29613
6,(backbone),(tail),0.821782,0.742574,0.732673,0.891566,1.200643,0.122439,2.374037
7,(tail),(backbone),0.742574,0.821782,0.732673,0.986667,1.200643,0.122439,13.366337
8,(tail),(breathes),0.742574,0.792079,0.60396,0.813333,1.026833,0.015783,1.113861
9,(breathes),(tail),0.792079,0.742574,0.60396,0.7625,1.026833,0.015783,1.083898


### Q7. Convert the dataset into two classes "Mammal" and "others"

In [156]:
# Take mammal class column as the class column and drop others.
df=df.drop(['type_2','type_3','type_4','type_5','type_6','type_7',],axis=1)
df

Unnamed: 0,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,...,tail,domestic,catsize,legs_0,legs_2,legs_4,legs_5,legs_6,legs_8,type_1
0,1,0,0,1,0,0,1,1,1,1,...,0,0,1,0,0,1,0,0,0,1
1,1,0,0,1,0,0,0,1,1,1,...,1,0,1,0,0,1,0,0,0,1
2,0,0,1,0,0,1,1,1,1,0,...,1,0,0,1,0,0,0,0,0,0
3,1,0,0,1,0,0,1,1,1,1,...,0,0,1,0,0,1,0,0,0,1
4,1,0,0,1,0,0,1,1,1,1,...,1,0,1,0,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,1,0,0,1,0,0,0,1,1,1,...,1,0,1,0,1,0,0,0,0,1
97,1,0,1,0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
98,1,0,0,1,0,0,1,1,1,1,...,1,0,1,0,0,1,0,0,0,1
99,0,0,1,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,0


### Q8. Partition the dataset into training and testing part (70:30)

In [157]:
#partition the data
df_train,df_test = train_test_split(df, test_size=0.30, random_state = 30)

### Q9. Generate association rules for "mammal" class (training data) with min support 0.4 and confidence as 1

In [158]:
# frequent itemsets 
ap1 = apriori(df_train, min_support=0.4,use_colnames=True)
ap1

Unnamed: 0,support,itemsets
0,0.428571,(hair)
1,0.571429,(eggs)
2,0.4,(milk)
3,0.571429,(predator)
4,0.628571,(toothed)
5,0.8,(backbone)
6,0.757143,(breathes)
7,0.7,(tail)
8,0.414286,(catsize)
9,0.4,(type_1)


In [159]:
# find frequent rules
ar1=association_rules(ap1,metric='confidence',support_only=False,min_threshold=1.0)
ar1

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(hair),(breathes),0.428571,0.757143,0.428571,1.0,1.320755,0.104082,inf
1,(milk),(toothed),0.400000,0.628571,0.400000,1.0,1.590909,0.148571,inf
2,(milk),(backbone),0.400000,0.800000,0.400000,1.0,1.250000,0.080000,inf
3,(milk),(breathes),0.400000,0.757143,0.400000,1.0,1.320755,0.097143,inf
4,(milk),(type_1),0.400000,0.400000,0.400000,1.0,2.500000,0.240000,inf
...,...,...,...,...,...,...,...,...,...
110,"(milk, type_1)","(breathes, backbone, toothed)",0.400000,0.485714,0.400000,1.0,2.058824,0.205714,inf
111,"(backbone, type_1)","(breathes, milk, toothed)",0.400000,0.400000,0.400000,1.0,2.500000,0.240000,inf
112,"(type_1, breathes)","(backbone, milk, toothed)",0.400000,0.400000,0.400000,1.0,2.500000,0.240000,inf
113,(milk),"(type_1, breathes, backbone, toothed)",0.400000,0.400000,0.400000,1.0,2.500000,0.240000,inf


In [160]:
# Keep only the rules whose consequent is exactly the mammal class (type_1)
ar1 = ar1[ar1['consequents'].apply(lambda c: set(c) == {'type_1'})]
ar1

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
4,(milk),(type_1),0.4,0.4,0.4,1.0,2.5,0.24,inf
16,"(milk, toothed)",(type_1),0.4,0.4,0.4,1.0,2.5,0.24,inf
24,"(backbone, milk)",(type_1),0.4,0.4,0.4,1.0,2.5,0.24,inf
31,"(milk, breathes)",(type_1),0.4,0.4,0.4,1.0,2.5,0.24,inf
52,"(backbone, milk, toothed)",(type_1),0.4,0.4,0.4,1.0,2.5,0.24,inf
66,"(breathes, milk, toothed)",(type_1),0.4,0.4,0.4,1.0,2.5,0.24,inf
77,"(backbone, milk, breathes)",(type_1),0.4,0.4,0.4,1.0,2.5,0.24,inf
92,"(breathes, backbone, milk, toothed)",(type_1),0.4,0.4,0.4,1.0,2.5,0.24,inf


### Q10. Test the rules generated on testing dataset and find precision and recall for the rule based classifier

In [161]:
# Apply the rules to the test data: predict 1 (mammal) if all antecedent
# items of at least one rule are present in the row, otherwise predict 0.
predictions = []
for _, row in df_test.iterrows():
    matched = any(
        all(row[item] for item in antecedent)
        for antecedent in ar1['antecedents']
    )
    predictions.append(1 if matched else 0)

In [162]:
# evaluation measures
print("Confusion Matrix")
print(confusion_matrix(df_test['type_1'],predictions))
print("\n Accuracy")
print(accuracy_score(df_test['type_1'],predictions))

Confusion Matrix
[[18  0]
 [ 0 13]]

 Accuracy
1.0


In [163]:
# Print classification report (Y_test is not defined until Q11, so use the test labels directly)
print(classification_report(df_test['type_1'],predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        18
           1       1.00      1.00      1.00        13

    accuracy                           1.00        31
   macro avg       1.00      1.00      1.00        31
weighted avg       1.00      1.00      1.00        31
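
Q10 asks for precision and recall specifically. As a sketch, both can be read off the confusion matrix printed above, which follows scikit-learn's convention of true classes in rows and predicted classes in columns. The variable names are chosen here for illustration.

```python
# Confusion matrix copied from the output above:
# rows = true class (0 = others, 1 = mammal), columns = predicted class.
cm = [[18, 0],
      [0, 13]]

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)   # share of rows predicted as mammal that are mammals
recall = tp / (tp + fn)      # share of actual mammals that were predicted as mammal

print(precision, recall)
```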



### Q11. Apply decision tree on the dataset and calculate the performance evaluation measures

In [164]:
# Select the independent variables and target column
X = df[df.columns[:-1]]  # independent variables (all columns except the last)
Y = df[df.columns[-1]]   # target column (type_1, the mammal label)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)

In [165]:
# Apply decision tree
from sklearn.tree import DecisionTreeClassifier
dtree= DecisionTreeClassifier(criterion='entropy', random_state = 30)
dtree = dtree.fit(X_train,Y_train)

In [166]:
# Find predictions by decision tree
predictions = dtree.predict(X_test)

In [167]:
# Evaluation measures and classification report
print(classification_report(Y_test,predictions))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        18
           1       1.00      1.00      1.00        13

    accuracy                           1.00        31
   macro avg       1.00      1.00      1.00        31
weighted avg       1.00      1.00      1.00        31

Confusion Matrix
[[18  0]
 [ 0 13]]

 Accuracy
1.0


### Q12. Which of the two classifiers performs better?

In [None]:
# Name of the classifier with accuracy value.
# Both classifiers perform equally well: each reaches an accuracy_score of 1.0 (100%) on the test set.