# Apriori algorithm

#### We use the FIM package (PyFIM). For an Anaconda installation, see https://anaconda.org/conda-forge/pyfim.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from fim import apriori

#### Pattern mining on relational data

In [42]:
dataset = pd.read_csv("./datasets/small_transactions.csv", skipinitialspace=True, sep=',', nrows=2000)
dataset.head()

Unnamed: 0,SCONTRINO_ID,COD_MKT_ID
0,2558064013053,1580
1,2558064013053,1661
2,2558064013053,2068
3,2558064013053,2556
4,2558064013053,2650


##### This is a relational representation: each row of the dataframe holds a basket ID and one product bought. For our analysis we need the transactions, that is, the list of products bought in each basket.

In [3]:
transactions = dataset.groupby('SCONTRINO_ID')['COD_MKT_ID'].apply(list)
transactions.head()

SCONTRINO_ID
2558064013053                 [1580, 1661, 2068, 2556, 2650, 4225]
2558064013054    [437, 1278, 1614, 2089, 2243, 2245, 2443, 2551...
2558064013055                         [151, 595, 2650, 4600, 4872]
2558064013056             [142, 437, 2499, 2515, 3458, 3675, 4044]
2558064013057                                          [437, 3087]
Name: COD_MKT_ID, dtype: object

In [4]:
baskets = transactions.values

In [5]:
baskets[0:5]

array([list([1580, 1661, 2068, 2556, 2650, 4225]),
       list([437, 1278, 1614, 2089, 2243, 2245, 2443, 2551, 3448, 6172]),
       list([151, 595, 2650, 4600, 4872]),
       list([142, 437, 2499, 2515, 3458, 3675, 4044]), list([437, 3087])],
      dtype=object)

#### We can now run the apriori algorithm as implemented in fim. To see the available options:

In [6]:
help(apriori)

Help on built-in function apriori in module fim:

apriori(...)
    apriori (tracts, target='s', supp=10, zmin=1, zmax=None, report='a',
             eval='x', agg='x', thresh=10, prune=None, algo='b', mode='',
             border=None)
    Find frequent item sets with the Apriori algorithm.
    tracts  transaction database to mine (mandatory)
            The database must be an iterable of transactions;
            each transaction must be an iterable of items;
            each item must be a hashable object.
            If the database is a dictionary, the transactions are
            the keys, the values their (integer) multiplicities.
    target  type of frequent item sets to find     (default: s)
            s/a   sets/all   all     frequent item sets
            c     closed     closed  frequent item sets
            m     maximal    maximal frequent item sets
            g     gens       generators
            r     rules      association rules
    supp    minimum support of an i

In [None]:
# An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent.
# An itemset is closed if none of its immediate supersets has the same support as the itemset.
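
The two definitions above can be checked by brute force on a tiny example. The following is a minimal sketch in pure Python, not the way fim computes them; the toy baskets, the threshold `min_count`, and the helper names are all made up for illustration.

```python
from itertools import combinations

# Toy transaction database (hypothetical items, not taken from the dataset)
toy_baskets = [
    {'a', 'b', 'c'},
    {'a', 'b'},
    {'a', 'b', 'd'},
    {'c', 'd'},
    {'a'},
]
min_count = 2  # absolute minimum support (assumed threshold)

def support(itemset):
    """Number of baskets that contain every item of the itemset."""
    return sum(1 for b in toy_baskets if itemset <= b)

items = sorted(set().union(*toy_baskets))

# Enumerate all itemsets and keep the frequent ones (fine for toy data only)
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        s = support(frozenset(combo))
        if s >= min_count:
            frequent[frozenset(combo)] = s

def immediate_supersets(itemset):
    """All itemsets obtained by adding exactly one item."""
    return [itemset | {i} for i in items if i not in itemset]

# Maximal: frequent, and no immediate superset is frequent
maximal = [x for x in frequent
           if not any(s in frequent for s in immediate_supersets(x))]

# Closed: no immediate superset has the same support
closed = [x for x in frequent
          if all(support(s) < frequent[x] for s in immediate_supersets(x))]

print('maximal:', [sorted(x) for x in maximal])
print('closed: ', [sorted(x) for x in closed])
```

In this toy database `{a}` is closed but not maximal: its superset `{a, b}` is still frequent, but has a lower support.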

#### First, we want to extract the itemsets: we do so by setting target='a'

In [25]:
# supp=1 is a percentage (at least 1% of the baskets); zmin/zmax bound the itemset size
itemsets = apriori(baskets, supp=1, zmin=2, zmax=5, target='a', eval='s')

In [26]:
# frequent itemset, support
itemsets

[((3086, 2443), 3),
 ((441, 2050), 3),
 ((441, 2650), 3),
 ((1278, 2243), 4),
 ((4805, 437), 3),
 ((396, 2650), 3),
 ((476, 2650), 3),
 ((4163, 2650), 3),
 ((2498, 920), 3),
 ((385, 2729), 3),
 ((2490, 923), 3),
 ((2490, 2650), 3),
 ((1094, 3750), 3),
 ((1094, 2243), 3),
 ((1094, 445), 3),
 ((1094, 2650), 3),
 ((4182, 445), 3),
 ((4182, 445, 2650), 3),
 ((4182, 2650), 3),
 ((624, 597), 3),
 ((4461, 920), 3),
 ((142, 393), 3),
 ((1258, 2650), 3),
 ((2518, 2499), 3),
 ((2518, 920), 3),
 ((599, 597), 3),
 ((2550, 2650), 3),
 ((4136, 6172), 3),
 ((3074, 437), 3),
 ((1658, 445), 4),
 ((1658, 445, 2650), 3),
 ((1658, 2650), 3),
 ((1429, 1428), 4),
 ((207, 2650), 3),
 ((563, 2194), 3),
 ((4600, 3739), 3),
 ((5087, 445), 3),
 ((5087, 445, 2650), 3),
 ((5087, 920), 3),
 ((5087, 2650), 4),
 ((4047, 2729), 3),
 ((4047, 2650), 4),
 ((621, 4801), 3),
 ((621, 1579), 3),
 ((1101, 437), 4),
 ((2556, 1580), 3),
 ((2556, 445), 3),
 ((2556, 920), 3),
 ((2556, 2650), 3),
 ((2181, 3000442), 3),
 ((2245, 30

#### We can now see the itemsets obtained and their support

In [21]:
print('Number of itemsets:', len(itemsets))

Number of itemsets: 392


In [23]:
itemsets[20:35]

[((4461, 920), 3),
 ((142, 393), 3),
 ((1258, 2650), 3),
 ((2518, 2499), 3),
 ((2518, 920), 3),
 ((599, 597), 3),
 ((2550, 2650), 3),
 ((4136, 6172), 3),
 ((3074, 437), 3),
 ((1658, 445), 4),
 ((1658, 445, 2650), 3),
 ((1658, 2650), 3),
 ((1429, 1428), 4),
 ((207, 2650), 3),
 ((563, 2194), 3)]
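
Since every result is an `(itemset, support)` pair, the list can be ranked with plain Python. The sketch below uses a few pairs copied from the output above; the tie-breaking rule (longer itemsets first) is only an illustrative choice.

```python
# A few (itemset, support) pairs copied from the output above
sample_itemsets = [
    ((4461, 920), 3),
    ((1658, 445), 4),
    ((1658, 445, 2650), 3),
    ((1429, 1428), 4),
    ((563, 2194), 3),
]

# Sort by support (descending); ties are broken by itemset length (longer first)
ranked = sorted(sample_itemsets, key=lambda pair: (-pair[1], -len(pair[0])))

for itemset, supp in ranked:
    print(supp, itemset)
```

The same `sorted` call should work on the full `itemsets` list, assuming it keeps the format shown above.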

### Rules

#### With apriori we can also extract the rules, by setting target='r'. Remember that in this case we need to set a minimum confidence as well as a minimum support

A rule can score high simply because its consequent is very popular on its own, not because premise and consequent are really bought together; in that case there is no real correlation between them.
Confidence: among the baskets that contain the premise, the fraction that also contain the consequent.
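
To make the caveat concrete, the quantities can be computed by hand on a toy example. This is a minimal sketch with made-up baskets; lift (confidence divided by the relative support of the consequent) is not used elsewhere in this notebook and is added here only as one common way to spot a consequent that is merely popular.

```python
# Toy baskets (hypothetical, not taken from the dataset)
toy_baskets = [
    ['bread', 'milk'],
    ['bread', 'milk', 'eggs'],
    ['milk', 'eggs'],
    ['milk'],
    ['bread'],
]

def count(items):
    """Number of baskets containing all the given items."""
    wanted = set(items)
    return sum(1 for b in toy_baskets if wanted <= set(b))

n = len(toy_baskets)
premise, consequent = ('bread',), 'milk'

both = count(premise + (consequent,))

rule_support = both / n                  # fraction of baskets with premise and consequent
confidence = both / count(premise)       # fraction of premise baskets that hold the consequent
lift = confidence / (count((consequent,)) / n)  # confidence relative to the consequent's popularity

print(f'support={rule_support:.2f} confidence={confidence:.2f} lift={lift:.2f}')
```

Here milk is in most baskets, so the rule bread -> milk has a fairly high confidence even though its lift is below 1, meaning bread buyers pick milk slightly less often than the average customer.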

In [32]:
rules = apriori(baskets, supp=1, target='r', conf=3, eval='s') 

In [33]:
print('Number of rules:', len(rules))

Number of rules: 20494


In [29]:
# visualization of one rule
# consequent, premise, reported value (with the default report='a' this is the absolute support, not the confidence)
rules[100]

(1188, (4271,), 1)

#### Given a rule, we can print the baskets that contain both the premise and the consequent of the rule

In [35]:
# premise
rules[9][1]

(3664,)

In [36]:
# consequent
rules[9][0]

217

In [37]:
for b in baskets:
    # the basket must contain every item of the premise, plus the consequent
    if set(rules[9][1]) <= set(b) and rules[9][0] in set(b):
        print(b)

[217, 3664, 4658, 5028]


# Pattern mining on tabular data

In [49]:
df = pd.read_csv("./datasets/titanic.csv", skipinitialspace=True, sep=',')

In [50]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Some pre-processing steps

In [51]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [52]:
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

df['Age'] = df['Age'].groupby([df['Sex'], df['Pclass']], group_keys=False).apply(
    lambda x: x.fillna(x.median()))
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1  # family size = siblings/spouses + parents/children + the passenger

df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
FamilySize       0
dtype: int64

In [53]:
column2drop = ['PassengerId', 'Name', 'Cabin', 'SibSp', 
               'Parch', 'Ticket']
df.drop(column2drop, axis=1, inplace=True)

df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,FamilySize
0,0,3,male,22.0,7.25,S,2
1,1,1,female,38.0,71.2833,C,2
2,1,3,female,26.0,7.925,S,1
3,1,1,female,35.0,53.1,S,2
4,0,3,male,35.0,8.05,S,1


Discretize the continuous values into bins: apriori works on discrete items, so each numeric value must be replaced by the interval it falls into.

In [54]:
df['AgeBin'] = pd.cut(df['Age'].astype(int), 10, right=False)
df['FareBin'] = pd.cut(df['Fare'].astype(int), 10, right=False)

df.drop(['Age', 'Fare'], axis=1, inplace=True)

df.head()

Unnamed: 0,Survived,Pclass,Sex,Embarked,FamilySize,AgeBin,FareBin
0,0,3,male,S,2,"[16.0, 24.0)","[0.0, 51.2)"
1,1,1,female,C,2,"[32.0, 40.0)","[51.2, 102.4)"
2,1,3,female,S,1,"[24.0, 32.0)","[0.0, 51.2)"
3,1,1,female,S,2,"[32.0, 40.0)","[51.2, 102.4)"
4,0,3,male,S,1,"[32.0, 40.0)","[0.0, 51.2)"
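
Passing an integer as `bins` makes `pd.cut` split the observed range into equal-width intervals. With `right=False` the intervals are closed on the left, and the last edge is pushed slightly past the maximum so that the maximum still falls into a bin, which would explain labels such as `[72.0, 80.08)` later on. A minimal sketch on made-up ages:

```python
import pandas as pd

# Hypothetical ages spanning 0..80, roughly the range of the Titanic data
ages = pd.Series([0, 22, 38, 80])

# 10 equal-width, left-closed bins over the observed range
binned = pd.cut(ages, 10, right=False)

print(binned.cat.categories)            # the 10 intervals
print(binned.value_counts(sort=False))  # how many values fall in each
```

Equal-width bins can end up very unbalanced on skewed columns such as `Fare`, where most passengers fall into the first bin.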


All variables are categorical at this point. We convert them to strings and append the variable name to the value, so that each item is understandable on its own once the rows become baskets.

In [55]:
df['Survived'] = df['Survived'].map(
    {0: 'Not Survived', 1: 'Survived'}).astype(str)
df['Pclass'] = df['Pclass'].map(
    {1: '1st', 2: '2nd', 3: '3rd'}).astype(str)

df['FamilySize'] = df['FamilySize'].astype(str) + '_Family'
df['AgeBin'] = df['AgeBin'].astype(str) + '_Age'
df['FareBin'] = df['FareBin'].astype(str) + '_Fare'

df.head()

Unnamed: 0,Survived,Pclass,Sex,Embarked,FamilySize,AgeBin,FareBin
0,Not Survived,3rd,male,S,2_Family,"[16.0, 24.0)_Age","[0.0, 51.2)_Fare"
1,Survived,1st,female,C,2_Family,"[32.0, 40.0)_Age","[51.2, 102.4)_Fare"
2,Survived,3rd,female,S,1_Family,"[24.0, 32.0)_Age","[0.0, 51.2)_Fare"
3,Survived,1st,female,S,2_Family,"[32.0, 40.0)_Age","[51.2, 102.4)_Fare"
4,Not Survived,3rd,male,S,1_Family,"[32.0, 40.0)_Age","[0.0, 51.2)_Fare"
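
An alternative to labeling columns one by one is to prefix every value with its column name. This is a sketch of that idea on a made-up frame, not what the notebook does; the `column=value` format is an arbitrary choice.

```python
import pandas as pd

# Small hypothetical frame with the same kind of columns as above
toy = pd.DataFrame({
    'Sex': ['male', 'female'],
    'Embarked': ['S', 'C'],
    'FamilySize': [2, 1],
})

# Turn every cell into a "column=value" string so each item is self-describing
items = toy.apply(lambda col: col.name + '=' + col.astype(str))

# One basket per row
toy_baskets = items.values.tolist()
print(toy_baskets)
```

This also keeps identical values from different columns apart, since they become distinct items.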


In [56]:
baskets = df.values.tolist()

In [57]:
baskets

[['Not Survived',
  '3rd',
  'male',
  'S',
  '2_Family',
  '[16.0, 24.0)_Age',
  '[0.0, 51.2)_Fare'],
 ['Survived',
  '1st',
  'female',
  'C',
  '2_Family',
  '[32.0, 40.0)_Age',
  '[51.2, 102.4)_Fare'],
 ['Survived',
  '3rd',
  'female',
  'S',
  '1_Family',
  '[24.0, 32.0)_Age',
  '[0.0, 51.2)_Fare'],
 ['Survived',
  '1st',
  'female',
  'S',
  '2_Family',
  '[32.0, 40.0)_Age',
  '[51.2, 102.4)_Fare'],
 ['Not Survived',
  '3rd',
  'male',
  'S',
  '1_Family',
  '[32.0, 40.0)_Age',
  '[0.0, 51.2)_Fare'],
 ['Not Survived',
  '3rd',
  'male',
  'Q',
  '1_Family',
  '[24.0, 32.0)_Age',
  '[0.0, 51.2)_Fare'],
 ['Not Survived',
  '1st',
  'male',
  'S',
  '1_Family',
  '[48.0, 56.0)_Age',
  '[0.0, 51.2)_Fare'],
 ['Not Survived',
  '3rd',
  'male',
  'S',
  '5_Family',
  '[0.0, 8.0)_Age',
  '[0.0, 51.2)_Fare'],
 ['Survived',
  '3rd',
  'female',
  'S',
  '3_Family',
  '[24.0, 32.0)_Age',
  '[0.0, 51.2)_Fare'],
 ['Survived',
  '2nd',
  'female',
  'C',
  '2_Family',
  '[8.0, 16.0)_Age',
  

In [58]:
baskets[0]

['Not Survived',
 '3rd',
 'male',
 'S',
 '2_Family',
 '[16.0, 24.0)_Age',
 '[0.0, 51.2)_Fare']

#### We apply apriori on the baskets

In [71]:
rules = apriori(baskets, supp=2, target='r', conf=2, eval='s') 

In [72]:
rules

[('[72.0, 80.08)_Age',
  ('1st', 'Survived', '1_Family', 'male', '[0.0, 51.2)_Fare'),
  1),
 ('[72.0, 80.08)_Age', ('1st', 'Survived', '1_Family', 'male'), 1),
 ('[72.0, 80.08)_Age',
  ('1st', 'Survived', '1_Family', 'S', '[0.0, 51.2)_Fare'),
  1),
 ('[72.0, 80.08)_Age', ('1st', 'Survived', '1_Family', 'S'), 1),
 ('[72.0, 80.08)_Age', ('1st', 'Survived', '1_Family', '[0.0, 51.2)_Fare'), 1),
 ('[72.0, 80.08)_Age', ('1st', 'Survived', 'male', 'S'), 1),
 ('[72.0, 80.08)_Age', ('1st', 'Survived', 'male', '[0.0, 51.2)_Fare'), 1),
 ('[72.0, 80.08)_Age', ('1st', 'Survived', 'male'), 1),
 ('[72.0, 80.08)_Age', ('1st', 'Survived', 'S', '[0.0, 51.2)_Fare'), 1),
 ('[72.0, 80.08)_Age', ('1st', 'Survived', '[0.0, 51.2)_Fare'), 1),
 ('[72.0, 80.08)_Age',
  ('Survived', '1_Family', 'male', 'S', '[0.0, 51.2)_Fare'),
  1),
 ('[153.6, 204.8)_Fare', ('[40.0, 48.0)_Age', '1st', 'Survived'), 1),
 ('[153.6, 204.8)_Fare', ('[40.0, 48.0)_Age', '1st', 'S'), 1),
 ('[153.6, 204.8)_Fare', ('[40.0, 48.0)_Age', 'fe

In [73]:
for r in rules:
    if r[0] == 'Survived':
        print(r)

('Survived', ('4_Family', 'female'), 16)
('Survived', ('4_Family', 'S'), 15)
('Survived', ('4_Family', '[0.0, 51.2)_Fare'), 14)
('Survived', ('4_Family',), 21)
('Survived', ('[102.4, 153.6)_Fare', '1st', 'female'), 18)
('Survived', ('[102.4, 153.6)_Fare', '1st'), 23)
('Survived', ('[102.4, 153.6)_Fare', 'female'), 18)
('Survived', ('[102.4, 153.6)_Fare',), 23)
('Survived', ('[0.0, 8.0)_Age', 'female', 'S'), 12)
('Survived', ('[0.0, 8.0)_Age', 'female', '[0.0, 51.2)_Fare'), 18)
('Survived', ('[0.0, 8.0)_Age', 'female'), 18)
('Survived', ('[0.0, 8.0)_Age', '3rd', '[0.0, 51.2)_Fare'), 17)
('Survived', ('[0.0, 8.0)_Age', '3rd'), 17)
('Survived', ('[0.0, 8.0)_Age', 'male', 'S', '[0.0, 51.2)_Fare'), 12)
('Survived', ('[0.0, 8.0)_Age', 'male', 'S'), 14)
('Survived', ('[0.0, 8.0)_Age', 'male', '[0.0, 51.2)_Fare'), 14)
('Survived', ('[0.0, 8.0)_Age', 'male'), 16)
('Survived', ('[0.0, 8.0)_Age', 'S', '[0.0, 51.2)_Fare'), 24)
('Survived', ('[0.0, 8.0)_Age', 'S'), 26)
('Survived', ('[0.0, 8.0)_Age

In [74]:
for r in rules:
    if r[0] == 'Not Survived':
        print(r)

('Not Survived', ('6_Family', '[0.0, 51.2)_Fare'), 17)
('Not Survived', ('6_Family',), 19)
('Not Survived', ('[48.0, 56.0)_Age', '1_Family', 'male', 'S', '[0.0, 51.2)_Fare'), 16)
('Not Survived', ('[48.0, 56.0)_Age', '1_Family', 'male', 'S'), 16)
('Not Survived', ('[48.0, 56.0)_Age', '1_Family', 'male', '[0.0, 51.2)_Fare'), 16)
('Not Survived', ('[48.0, 56.0)_Age', '1_Family', 'male'), 16)
('Not Survived', ('[48.0, 56.0)_Age', 'male', 'S', '[0.0, 51.2)_Fare'), 17)
('Not Survived', ('[48.0, 56.0)_Age', 'male', 'S'), 20)
('Not Survived', ('[48.0, 56.0)_Age', 'male', '[0.0, 51.2)_Fare'), 17)
('Not Survived', ('[48.0, 56.0)_Age', 'male'), 23)
('Not Survived', ('Q', '[24.0, 32.0)_Age', '3rd', '1_Family', 'male', '[0.0, 51.2)_Fare'), 21)
('Not Survived', ('Q', '[24.0, 32.0)_Age', '3rd', '1_Family', 'male'), 21)
('Not Survived', ('Q', '[24.0, 32.0)_Age', '3rd', '1_Family', '[0.0, 51.2)_Fare'), 22)
('Not Survived', ('Q', '[24.0, 32.0)_Age', '3rd', '1_Family'), 22)
('Not Survived', ('Q', '[24.0
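
Because each rule is a `(consequent, premise, value)` tuple, the two filtering loops above can be folded into one small helper that also ranks the result. A minimal sketch, using a few rules copied from the outputs above; `rules_for` is a hypothetical helper name, not part of fim.

```python
# A few rules copied from the outputs above: (consequent, premise, reported value)
sample_rules = [
    ('Survived', ('4_Family', 'female'), 16),
    ('Survived', ('4_Family',), 21),
    ('Survived', ('[102.4, 153.6)_Fare', '1st'), 23),
    ('Not Survived', ('6_Family',), 19),
    ('Not Survived', ('[48.0, 56.0)_Age', 'male'), 23),
]

def rules_for(consequent, rules):
    """Rules with the given consequent, highest reported value first."""
    selected = [r for r in rules if r[0] == consequent]
    return sorted(selected, key=lambda r: r[2], reverse=True)

for r in rules_for('Survived', sample_rules):
    print(r)
```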