# Apriori


The Apriori algorithm mines frequent itemsets and derives association rules from a transactional database. It is controlled by two thresholds, “support” and “confidence”. The support of an itemset is the fraction of transactions that contain it; the confidence of a rule A → B is the conditional probability of B given A, estimated as support(A ∪ B) / support(A).
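
To make the two measures concrete, here is a minimal sketch in plain Python. The toy transactions and the helper names `support` and `confidence` are illustrative assumptions; they are not part of `mlxtend` and are not taken from the datasets used later.

```python
# Hypothetical toy transactions, used only for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support of both / support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

bread_support = support({"bread"}, transactions)               # 3 of 4 transactions
bread_to_milk = confidence({"bread"}, {"milk"}, transactions)  # 2 of the 3 bread transactions
print(bread_support, bread_to_milk)
```

Here bread appears in three of the four transactions, and two of those three also contain milk, so the rule bread → milk has confidence 2/3.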

A key concept in the Apriori algorithm is the anti-monotonicity of the support measure: adding items to an itemset can never increase its support. This implies that

1. All subsets of a frequent itemset must be frequent
2. Similarly, for any infrequent itemset, all its supersets must be infrequent too
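
The property can be checked directly on a small example. The sketch below uses hypothetical toy transactions and a hypothetical `support` helper, and verifies that no subset of an itemset is less frequent than the itemset itself.

```python
from itertools import combinations

# Hypothetical toy transactions, used only for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

itemset = frozenset({"bread", "milk", "butter"})
itemset_support = support(itemset, transactions)

# Support of every non-empty proper subset of the itemset.
subset_supports = {
    frozenset(sub): support(sub, transactions)
    for r in range(1, len(itemset))
    for sub in combinations(itemset, r)
}

# Anti-monotonicity: each subset is at least as frequent as the full itemset.
holds = all(s >= itemset_support for s in subset_supports.values())
print(holds)
```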


###  Algorithm
The following are the main steps of the algorithm:

1. Calculate the support of itemsets of size k = 1 in the transactional database (support is the frequency of 
   occurrence of an itemset). This is called generating the candidate set.
2. Prune the candidate set by eliminating itemsets whose support is below the given threshold.
3. Join the frequent itemsets to form candidate sets of size k + 1, and repeat the above steps until no more frequent itemsets can be formed. This 
   happens when every candidate formed has a support below the given threshold.
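
The steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the `mlxtend` implementation used later: the function name `apriori_sketch` and the toy transactions are hypothetical, and no attempt is made at efficiency.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Step 1: candidate itemsets of size k = 1.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Step 2: prune candidates whose support is below the threshold.
        level = {c: support(c) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Step 3: join frequent k-itemsets into candidates of size k + 1,
        # keeping only those whose k-subsets are all frequent.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
    return frequent

# Hypothetical toy transactions, used only for illustration.
toy = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
result = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The subset check in step 3 is where anti-monotonicity pays off: the candidate {bread, milk, butter} is discarded without counting its support, because its subset {milk, butter} is already known to be infrequent.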

### Libraries

The Apriori implementation used here comes from the `mlxtend` library, which can be installed with `pip install mlxtend`.

In [1]:
import warnings
warnings.filterwarnings('ignore')
!pip3 install mlxtend



In [2]:
# Importing libraries
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# For prediction and evaluation
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score

# Plotting the graph
import matplotlib.pyplot as plt

### Load the "basket" data

In [6]:
# Load dataset and display first five rows.
df = pd.read_csv("BASKETS1n", header=0)
df.head()

Unnamed: 0,cardid,value,pmethod,sex,homeown,income,age,fruitveg,freshmeat,dairy,cannedveg,cannedmeat,frozenmeal,beer,wine,softdrink,fish,confectionery
0,39808,42.7123,CHEQUE,M,NO,27000,46,F,T,T,F,F,F,F,F,F,F,T
1,67362,25.3567,CASH,F,NO,30000,28,F,T,F,F,F,F,F,F,F,F,T
2,10872,20.6176,CASH,M,NO,13200,36,F,F,F,T,F,T,T,F,F,T,F
3,26748,23.6883,CARD,F,NO,12200,26,F,F,T,F,F,F,F,T,F,F,F
4,91609,18.8133,CARD,M,YES,11000,24,F,F,F,F,F,F,F,F,F,F,F


### Perform pre-processing (if required)

In [7]:
# Selecting only the product columns

0


In [8]:
# Keep only the product columns (everything after the seven customer attributes).
# .copy() avoids pandas' SettingWithCopyWarning when the columns are overwritten below.
products = df[df.columns[7:]].copy()
label_encoder = preprocessing.LabelEncoder()
# Encode each product flag: 'F' -> 0, 'T' -> 1.
for column in products.columns:
    products[column] = label_encoder.fit_transform(products[column])
products.head()

Unnamed: 0,fruitveg,freshmeat,dairy,cannedveg,cannedmeat,frozenmeal,beer,wine,softdrink,fish,confectionery
0,0,1,1,0,0,0,0,0,0,0,1
1,0,1,0,0,0,0,0,0,0,0,1
2,0,0,0,1,0,1,1,0,0,1,0
3,0,0,1,0,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0


### Q1. Find frequent itemsets in the dataset using Apriori

In [9]:
# Apriori with min support 0.1 (confidence is applied later, when the rules are generated)
frequent_items = apriori(products, min_support = 0.1, use_colnames = True, max_len=None, verbose=0, low_memory=False)
frequent_items

Unnamed: 0,support,itemsets
0,0.299,(fruitveg)
1,0.183,(freshmeat)
2,0.177,(dairy)
3,0.303,(cannedveg)
4,0.204,(cannedmeat)
5,0.302,(frozenmeal)
6,0.293,(beer)
7,0.287,(wine)
8,0.184,(softdrink)
9,0.292,(fish)


### Q2. Find the association rules in the dataset with a minimum confidence of 10%

In [10]:
# find rules
rules = association_rules(frequent_items, metric='confidence', min_threshold=0.1, support_only=False)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(fruitveg),(fish),0.299,0.292,0.145,0.48495,1.660787,0.057692,1.374623
1,(fish),(fruitveg),0.292,0.299,0.145,0.496575,1.660787,0.057692,1.392463
2,(cannedveg),(frozenmeal),0.303,0.302,0.173,0.570957,1.890586,0.081494,1.626877
3,(frozenmeal),(cannedveg),0.302,0.303,0.173,0.572848,1.890586,0.081494,1.631736
4,(cannedveg),(beer),0.303,0.293,0.167,0.551155,1.881075,0.078221,1.575154
5,(beer),(cannedveg),0.293,0.303,0.167,0.569966,1.881075,0.078221,1.620802
6,(frozenmeal),(beer),0.302,0.293,0.17,0.562914,1.921208,0.081514,1.61753
7,(beer),(frozenmeal),0.293,0.302,0.17,0.580205,1.921208,0.081514,1.662715
8,(wine),(confectionery),0.287,0.276,0.144,0.501742,1.817906,0.064788,1.453063
9,(confectionery),(wine),0.276,0.287,0.144,0.521739,1.817906,0.064788,1.490818
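
The remaining columns of the rules table follow from the three support values. As a worked check, the sketch below recomputes confidence, lift, leverage and conviction for rule 0, (fruitveg) → (fish), from its supports in the table above. The helper name `rule_metrics` is a hypothetical choice for illustration; the formulas are the standard definitions of these measures.

```python
def rule_metrics(antecedent_support, consequent_support, support):
    """Derive rule measures for A -> C from the supports of A, C and A together with C."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    leverage = support - antecedent_support * consequent_support
    # Conviction is infinite for rules that are never violated (confidence 1).
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }

# Supports of rule 0 in the table above: (fruitveg) -> (fish).
metrics = rule_metrics(antecedent_support=0.299, consequent_support=0.292, support=0.145)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

A lift above 1 means the antecedent and consequent occur together more often than they would if they were independent, which is the case for every rule listed here.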


### Q3. Find association rules having minimum antecedent_len 2 & confidence greater than 0.75

In [11]:
# Rules with at least two antecedent items and confidence greater than 0.75
rules["antecedent_len"] = rules["antecedents"].apply(len)
rules[(rules["antecedent_len"] >= 2) &
      (rules["confidence"] > 0.75)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len
10,"(cannedveg, frozenmeal)",(beer),0.173,0.293,0.146,0.843931,2.880309,0.095311,4.530037,2
11,"(cannedveg, beer)",(frozenmeal),0.167,0.302,0.146,0.874251,2.894873,0.095566,5.550762,2
12,"(frozenmeal, beer)",(cannedveg),0.17,0.303,0.146,0.858824,2.834401,0.09449,4.937083,2


### Load the "zoo" data

In [13]:
# load the dataset and display first five rows
zoo_dataset = pd.read_csv("zoo.data", header=None,
                    names=['animal_name', 'hair', 'feathers', 'eggs', 'milk', 'airborne', 'aquatic', 'predator', 'toothed', 'backbone',
                          'breathes', 'venomous', 'fins', 'legs', 'tail', 'domestic', 'catsize', 'type'])
zoo_dataset.head()

Unnamed: 0,animal_name,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs,tail,domestic,catsize,type
0,aardvark,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
1,antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1
2,bass,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4
3,bear,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
4,boar,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1


### Q4. Perform pre-processing (if required)

In [14]:
#dropping first column - name
#one hot encoding column legs
#replacing class type and one hot encoding it
# zoo_df = zoo_dataset[zoo_dataset.columns[1:]]
# zoo_df = pd.concat([zoo_df,pd.get_dummies(zoo_df['legs'], prefix='LEGS')],axis=1)
# zoo_df.drop(['legs'],axis=1, inplace=True)
# zoo_df['type'] = zoo_df['type'].replace({1: 'Mammal', 2: 'Bird', 3: 'Reptile',4: 'Fish',5: 'Amphibia',6: 'Bug',7: 'Invertebrate'})
# zoo_df =pd.concat([zoo_df,pd.get_dummies(zoo_df['type'],prefix='CLASS')],axis=1)
# zoo_df.drop(['type'],axis=1, inplace=True)
# zoo_df.head()

df = zoo_dataset[zoo_dataset.columns[1:]]

# One-hot encode 'legs' first, then every binary attribute, replacing each
# original column with its dummy columns (e.g. LEGS_4, HAIR_0, HAIR_1).
attribute_columns = ['legs', 'hair', 'feathers', 'eggs', 'milk', 'airborne', 'aquatic',
                     'predator', 'toothed', 'backbone', 'breathes', 'venomous',
                     'fins', 'tail', 'domestic', 'catsize']
for column in attribute_columns:
    df = pd.concat([df, pd.get_dummies(df[column], prefix=column.upper())], axis=1)
    df.drop([column], axis=1, inplace=True)

# Replace the numeric class type with its name and one-hot encode it
df['type'] = df['type'].replace({1: 'Mammal', 2: 'Bird', 3: 'Reptile', 4: 'Fish', 5: 'Amphibia', 6: 'Bug', 7: 'Invertebrate'})
df = pd.concat([df, pd.get_dummies(df['type'], prefix='CLASS')], axis=1)
df.drop(['type'], axis=1, inplace=True)

df.head()

Unnamed: 0,LEGS_0,LEGS_2,LEGS_4,LEGS_5,LEGS_6,LEGS_8,HAIR_0,HAIR_1,FEATHERS_0,FEATHERS_1,...,DOMESTIC_1,CATSIZE_0,CATSIZE_1,CLASS_Amphibia,CLASS_Bird,CLASS_Bug,CLASS_Fish,CLASS_Invertebrate,CLASS_Mammal,CLASS_Reptile
0,0,0,1,0,0,0,0,1,1,0,...,0,0,1,0,0,0,0,0,1,0
1,0,0,1,0,0,0,0,1,1,0,...,0,0,1,0,0,0,0,0,1,0
2,1,0,0,0,0,0,1,0,1,0,...,0,1,0,0,0,0,1,0,0,0
3,0,0,1,0,0,0,0,1,1,0,...,0,0,1,0,0,0,0,0,1,0
4,0,0,1,0,0,0,0,1,1,0,...,0,0,1,0,0,0,0,0,1,0


### Q5. Find frequent itemsets in zoo dataset having min support 0.5 

In [15]:
# Apriori with min support 0.5 (confidence is applied later, when the rules are generated)
frequent_items = apriori(df, min_support = 0.5, use_colnames = True, max_len=None, verbose=0, low_memory=False)
frequent_items

Unnamed: 0,support,itemsets
0,0.574257,(HAIR_0)
1,0.801980,(FEATHERS_0)
2,0.584158,(EGGS_1)
3,0.594059,(MILK_0)
4,0.762376,(AIRBORNE_0)
...,...,...
163,0.594059,"(DOMESTIC_0, FINS_0, VENOMOUS_0, BREATHES_1)"
164,0.544554,"(AIRBORNE_0, BACKBONE_1, TOOTHED_1, VENOMOUS_0..."
165,0.514851,"(AQUATIC_0, BACKBONE_1, VENOMOUS_0, BREATHES_1..."
166,0.554455,"(BACKBONE_1, VENOMOUS_0, BREATHES_1, TAIL_1, F..."


### Q6. Find frequent association rules having min confidence 0.5

In [16]:
# Find and display rules
rules = association_rules(frequent_items, metric='confidence', min_threshold=0.5, support_only=False)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(HAIR_0),(EGGS_1),0.574257,0.584158,0.534653,0.931034,1.593805,0.199196,6.029703
1,(EGGS_1),(HAIR_0),0.584158,0.574257,0.534653,0.915254,1.593805,0.199196,5.023762
2,(MILK_0),(HAIR_0),0.594059,0.574257,0.554455,0.933333,1.625287,0.213312,6.386139
3,(HAIR_0),(MILK_0),0.574257,0.594059,0.554455,0.965517,1.625287,0.213312,11.772277
4,(HAIR_0),(VENOMOUS_0),0.574257,0.920792,0.514851,0.896552,0.973674,-0.013920,0.765677
...,...,...,...,...,...,...,...,...,...
1057,(BACKBONE_1),"(DOMESTIC_0, FINS_0, VENOMOUS_0, BREATHES_1)",0.821782,0.594059,0.514851,0.626506,1.054618,0.026664,1.086873
1058,(VENOMOUS_0),"(DOMESTIC_0, FINS_0, BACKBONE_1, BREATHES_1)",0.920792,0.534653,0.514851,0.559140,1.045798,0.022547,1.055542
1059,(BREATHES_1),"(VENOMOUS_0, DOMESTIC_0, FINS_0, BACKBONE_1)",0.792079,0.514851,0.514851,0.650000,1.262500,0.107048,1.386139
1060,(FINS_0),"(VENOMOUS_0, DOMESTIC_0, BACKBONE_1, BREATHES_1)",0.831683,0.554455,0.514851,0.619048,1.116497,0.053720,1.169554


### Q7. Convert the dataset into two classes "Mammal" and "others"

In [17]:
# Keep CLASS_Mammal as the class column and drop the other class columns.
# df2 keeps a copy of the full one-hot encoded data.
df2 = df.copy(deep=True)
df.drop(["CLASS_Amphibia", "CLASS_Bird", "CLASS_Bug", "CLASS_Fish", "CLASS_Invertebrate", "CLASS_Reptile"], axis = 1, inplace = True)
df.head()

Unnamed: 0,LEGS_0,LEGS_2,LEGS_4,LEGS_5,LEGS_6,LEGS_8,HAIR_0,HAIR_1,FEATHERS_0,FEATHERS_1,...,VENOMOUS_1,FINS_0,FINS_1,TAIL_0,TAIL_1,DOMESTIC_0,DOMESTIC_1,CATSIZE_0,CATSIZE_1,CLASS_Mammal
0,0,0,1,0,0,0,0,1,1,0,...,0,1,0,1,0,1,0,0,1,1
1,0,0,1,0,0,0,0,1,1,0,...,0,1,0,0,1,1,0,0,1,1
2,1,0,0,0,0,0,1,0,1,0,...,0,0,1,0,1,1,0,1,0,0
3,0,0,1,0,0,0,0,1,1,0,...,0,1,0,1,0,1,0,0,1,1
4,0,0,1,0,0,0,0,1,1,0,...,0,1,0,0,1,1,0,0,1,1


### Q8. Partition the dataset into training and testing part (70:30)

In [18]:
#partition the data
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.30, random_state = 30)
train


Unnamed: 0,LEGS_0,LEGS_2,LEGS_4,LEGS_5,LEGS_6,LEGS_8,HAIR_0,HAIR_1,FEATHERS_0,FEATHERS_1,...,VENOMOUS_1,FINS_0,FINS_1,TAIL_0,TAIL_1,DOMESTIC_0,DOMESTIC_1,CATSIZE_0,CATSIZE_1,CLASS_Mammal
32,0,1,0,0,0,0,0,1,1,0,...,0,1,0,1,0,1,0,0,1,1
73,1,0,0,0,0,0,1,0,1,0,...,0,0,1,0,1,1,0,1,0,0
70,0,0,1,0,0,0,0,1,1,0,...,0,1,0,0,1,0,1,0,1,1
29,0,1,0,0,0,0,0,1,1,0,...,0,1,0,1,0,0,1,0,1,1
30,0,0,0,0,1,0,1,0,1,0,...,0,1,0,1,0,1,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12,1,0,0,0,0,0,1,0,1,0,...,0,0,1,0,1,1,0,1,0,0
98,0,0,1,0,0,0,0,1,1,0,...,0,1,0,0,1,1,0,0,1,1
45,0,0,1,0,0,0,0,1,1,0,...,0,1,0,0,1,1,0,0,1,1
100,0,1,0,0,0,0,1,0,0,1,...,0,1,0,0,1,1,0,1,0,0


### Q9. Generate association rules for "mammal" class (training data) with min support 0.4 and confidence as 1

In [19]:
# frequent itemsets 
frequent_items = apriori(train, min_support = 0.4, use_colnames = True, max_len=None, verbose=0, low_memory=False)
frequent_items

Unnamed: 0,support,itemsets
0,0.571429,(HAIR_0)
1,0.428571,(HAIR_1)
2,0.828571,(FEATHERS_0)
3,0.428571,(EGGS_0)
4,0.571429,(EGGS_1)
...,...,...
717,0.400000,"(EGGS_0, BACKBONE_1, TOOTHED_1, VENOMOUS_0, BR..."
718,0.400000,"(BACKBONE_1, MILK_1, TOOTHED_1, VENOMOUS_0, BR..."
719,0.400000,"(AIRBORNE_0, BACKBONE_1, TOOTHED_1, VENOMOUS_0..."
720,0.400000,"(EGGS_0, BACKBONE_1, MILK_1, TOOTHED_1, VENOMO..."


In [20]:
# Find rules with a minimum confidence of 1
rules = association_rules(frequent_items, metric='confidence', min_threshold=1, support_only=False)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(HAIR_1),(FEATHERS_0),0.428571,0.828571,0.428571,1.0,1.206897,0.073469,inf
1,(HAIR_1),(BREATHES_1),0.428571,0.757143,0.428571,1.0,1.320755,0.104082,inf
2,(EGGS_0),(FEATHERS_0),0.428571,0.828571,0.428571,1.0,1.206897,0.073469,inf
3,(MILK_1),(FEATHERS_0),0.400000,0.828571,0.400000,1.0,1.206897,0.068571,inf
4,(TOOTHED_1),(FEATHERS_0),0.628571,0.828571,0.628571,1.0,1.206897,0.107755,inf
...,...,...,...,...,...,...,...,...,...
4328,"(CLASS_Mammal, VENOMOUS_0)","(EGGS_0, BACKBONE_1, MILK_1, TOOTHED_1, BREATH...",0.400000,0.400000,0.400000,1.0,2.500000,0.240000,inf
4329,"(CLASS_Mammal, BREATHES_1)","(EGGS_0, BACKBONE_1, MILK_1, TOOTHED_1, VENOMO...",0.400000,0.400000,0.400000,1.0,2.500000,0.240000,inf
4330,"(CLASS_Mammal, FEATHERS_0)","(EGGS_0, BACKBONE_1, MILK_1, TOOTHED_1, VENOMO...",0.400000,0.400000,0.400000,1.0,2.500000,0.240000,inf
4331,(MILK_1),"(EGGS_0, BACKBONE_1, TOOTHED_1, VENOMOUS_0, BR...",0.400000,0.400000,0.400000,1.0,2.500000,0.240000,inf


In [21]:
# selecting rules having consequents as class mammal
rules[rules["consequents"] == {"CLASS_Mammal"}]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
14,(MILK_1),(CLASS_Mammal),0.4,0.4,0.4,1.0,2.5,0.24,inf
64,"(FEATHERS_0, MILK_1)",(CLASS_Mammal),0.4,0.4,0.4,1.0,2.5,0.24,inf
107,"(EGGS_0, MILK_1)",(CLASS_Mammal),0.4,0.4,0.4,1.0,2.5,0.24,inf
126,"(EGGS_0, VENOMOUS_0)",(CLASS_Mammal),0.4,0.4,0.4,1.0,2.5,0.24,inf
144,"(TOOTHED_1, MILK_1)",(CLASS_Mammal),0.4,0.4,0.4,1.0,2.5,0.24,inf
...,...,...,...,...,...,...,...,...,...
3721,"(EGGS_0, BACKBONE_1, MILK_1, VENOMOUS_0, BREAT...",(CLASS_Mammal),0.4,0.4,0.4,1.0,2.5,0.24,inf
3826,"(EGGS_0, BACKBONE_1, TOOTHED_1, VENOMOUS_0, BR...",(CLASS_Mammal),0.4,0.4,0.4,1.0,2.5,0.24,inf
3911,"(BACKBONE_1, MILK_1, TOOTHED_1, VENOMOUS_0, BR...",(CLASS_Mammal),0.4,0.4,0.4,1.0,2.5,0.24,inf
4014,"(EGGS_0, BACKBONE_1, MILK_1, TOOTHED_1, VENOMO...",(CLASS_Mammal),0.4,0.4,0.4,1.0,2.5,0.24,inf
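
Many of these rules are redundant: once (MILK_1) → (CLASS_Mammal) holds with confidence 1, any longer antecedent that contains MILK_1 adds nothing for classification. A minimal sketch of one way to keep only the shortest antecedents is shown below. The helper name `minimal_antecedents` is hypothetical, and the input list copies only the five antecedents visible at the top of the table above.

```python
def minimal_antecedents(antecedents):
    """Keep the antecedents that have no proper subset among the others."""
    sets = [frozenset(a) for a in antecedents]
    # `b < a` is True when b is a proper subset of a.
    return [a for a in sets if not any(b < a for b in sets)]

# Antecedents of rules 14, 64, 107, 126 and 144 from the table above.
antecedents = [
    {"MILK_1"},
    {"FEATHERS_0", "MILK_1"},
    {"EGGS_0", "MILK_1"},
    {"EGGS_0", "VENOMOUS_0"},
    {"TOOTHED_1", "MILK_1"},
]
kept = minimal_antecedents(antecedents)
for a in kept:
    print(sorted(a))
```

Of these five, only (MILK_1) and (EGGS_0, VENOMOUS_0) survive, since the other three all contain MILK_1.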


### Q10. Test the rules generated on testing dataset and find precision and recall for the rule based classifier

In [22]:
#applying rules on test data

In [23]:
# evaluation measures

In [24]:
# print classification report
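
A minimal sketch of how the mined rules might be used as a classifier, in plain Python: a row is predicted to be a mammal if it satisfies every item of at least one antecedent whose consequent is CLASS_Mammal. The helper names `predict_mammal` and `precision_recall` and the four test rows are hypothetical and illustrate the mechanics only; they are not results on the zoo data. With the DataFrames above, rows could presumably be obtained with `test.to_dict("records")` and antecedents from the `antecedents` column of the filtered rules, but that wiring is an assumption.

```python
def predict_mammal(row, antecedents):
    """Return 1 if every item of at least one antecedent is set to 1 in `row`."""
    return int(any(all(row.get(item, 0) == 1 for item in a) for a in antecedents))

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Two antecedents taken from the rules with consequent CLASS_Mammal.
antecedents = [{"MILK_1"}, {"EGGS_0", "VENOMOUS_0"}]

# Hypothetical test rows, used only for illustration.
rows = [
    {"MILK_1": 1, "EGGS_0": 1, "VENOMOUS_0": 1, "CLASS_Mammal": 1},
    {"MILK_1": 0, "EGGS_0": 0, "VENOMOUS_0": 1, "CLASS_Mammal": 0},
    {"MILK_1": 0, "EGGS_0": 1, "VENOMOUS_0": 1, "CLASS_Mammal": 0},
    {"MILK_1": 1, "EGGS_0": 0, "VENOMOUS_0": 1, "CLASS_Mammal": 1},
]
y_true = [row["CLASS_Mammal"] for row in rows]
y_pred = [predict_mammal(row, antecedents) for row in rows]
precision, recall = precision_recall(y_true, y_pred)
print(y_pred, precision, recall)
```

In this toy case the third row matches (EGGS_0, VENOMOUS_0) without being a mammal, so precision drops to 2/3 while recall stays at 1.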

### Q11. Apply decision tree on the dataset and calculate the performance evaluation measures

In [25]:
# Select the independent variables and target column
Y_train = train['CLASS_Mammal']
X_train = train.drop(columns=['CLASS_Mammal'])

In [26]:
# Apply decision tree
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train, Y_train)

DecisionTreeClassifier()

In [None]:
# Build the test features; 'predicted' is dropped only if an earlier cell added it
Y_test = test['CLASS_Mammal']
X_test = test.drop(columns=['CLASS_Mammal', 'predicted'], errors='ignore')

In [28]:
# Evaluation measures and classification report
predictions = dtree.predict(X_test)
print(confusion_matrix(Y_test, predictions))
print(classification_report(Y_test, predictions))

### Q12. Which out of the two classifiers performs better.

In [25]:
# Name of the classifier with accuracy value.