<a href="https://colab.research.google.com/github/flo-s99/ABA_16/blob/main/ABA_16.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Import the packages
import numpy as np  # for computing and manipulating data
import pandas as pd # for dataframes and operations
import matplotlib.pyplot as plt # for plotting
from mlxtend.frequent_patterns import apriori, association_rules # tools for data mining, especially association analysis
from mlxtend.preprocessing import TransactionEncoder

Read the dataset provided on Moodle (db 16). It was uploaded to GitHub so that it can be accessed easily from any device.

In [2]:
url = 'https://raw.githubusercontent.com/flo-s99/ABA_16/main/database_16.csv'
df = pd.read_csv(url, header=None, index_col=0)
# Print the top 5 entries to inspect whether pandas imported correctly
df.head(5)

0,1,2,3,4,5,6
1,cadmium,copper,insecticides,lead,,
2,cadmium,copper,herbicides,insecticides,lead,zinc
3,copper,zinc,,,,
4,cadmium,fungicides,,,,
5,copper,herbicides,insecticides,zinc,,


In [3]:
# Find the item base
# Define the items as a set, then iterate through data columns and check for unique values, while also dropping empty values

items = set()
for col in df:
    items.update(df[col].dropna().unique())
# Print the Item Base
print("The Item Base is as follows: ",items)

The Item Base is as follows:  {'lead', 'cadmium', 'zinc', 'copper', 'insecticides', 'fungicides', 'herbicides'}


In [5]:
# Data preprocessing to use mlxtend package: more on http://rasbt.github.io/mlxtend/
item_base = set(items)
# We need to do a binary encoding as Apriori takes binary values as input
# Construct a new list to contain the encoded df
encoded_vals = []
# Iterate through index and rows to encode using labels True and False (0 and 1 also possible)
for index, row in df.iterrows():
    rowset = set(row) 
    labels = {}
    uncommons = list(item_base - rowset)
    commons = list(item_base.intersection(rowset))
    for uc in uncommons:
        labels[uc] = False
    for com in commons:
        labels[com] = True
    encoded_vals.append(labels)
# Construct a new dataframe containing the encoded data
processed_df = pd.DataFrame(encoded_vals)
processed_df.head(5)

Unnamed: 0,herbicides,zinc,fungicides,cadmium,lead,insecticides,copper
0,False,False,False,True,True,True,True
1,True,True,False,True,True,True,True
2,False,True,False,False,False,False,True
3,False,False,True,True,False,False,False
4,True,True,False,False,False,True,True
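
The encoding loop above can also be sketched in plain Python. The following is a minimal illustration, assuming only the five transactions shown in the `df.head(5)` output; the helper name `encode_transactions` is hypothetical and not part of mlxtend. (mlxtend's `TransactionEncoder`, imported in the first cell but not used in this notebook, is intended for the same step.)

```python
# Hypothetical helper: True/False encoding of a list of transactions.
# The five transactions are copied from the df.head(5) output above.
transactions = [
    ["cadmium", "copper", "insecticides", "lead"],
    ["cadmium", "copper", "herbicides", "insecticides", "lead", "zinc"],
    ["copper", "zinc"],
    ["cadmium", "fungicides"],
    ["copper", "herbicides", "insecticides", "zinc"],
]

def encode_transactions(transactions):
    """Return (sorted item list, list of {item: bool} rows)."""
    # The item base is the union of all items over all transactions.
    items = sorted({item for t in transactions for item in t})
    # One dict per transaction: True if the item occurs in it, else False.
    rows = [{item: (item in t) for item in items} for t in transactions]
    return items, rows

items, rows = encode_transactions(transactions)
print(items)
print(rows[2])
```

Sorting the item base makes the column order reproducible, whereas the set-based loop in the cell above yields an arbitrary column order.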


In [6]:
# Find frequent itemsets using the preprocessed df
# Full algorithm in cell below
# Only keep itemsets that satisfy the min_support of 0.4 given in the exercise sheet
# We use the encoded df and display column names rather than indices
frequent_items = apriori(processed_df, min_support=0.4, use_colnames=True)
# Additionally record the length of each itemset (Exercise Sheet: Consider no more than length 5)
frequent_items['length'] = frequent_items['itemsets'].apply(len)
print("The following are the candidate frequent itemsets determined by the Apriori Algo:")
frequent_items

# Result interpretation:
# The support values show how frequently these itemsets appear in the dataset.
# The most frequent single items are (zinc) and (insecticides), each with a support of 0.666667.
# It also found several two-item frequent itemsets. In the rows shown, (zinc, herbicides), (insecticides, herbicides) and (zinc, insecticides) each have a support of 0.5;
# the rules output further below reports a higher support of 0.566667 for (insecticides, copper).
# Similarly, it also found several three and four item frequent itemsets.

The following are the candidate frequent itemsets determined by the Apriori Algo:


Unnamed: 0,support,itemsets,length
0,0.566667,(herbicides),1
1,0.666667,(zinc),1
2,0.466667,(cadmium),1
3,0.466667,(lead),1
4,0.666667,(insecticides),1
5,0.6,(copper),1
6,0.5,"(zinc, herbicides)",2
7,0.5,"(insecticides, herbicides)",2
8,0.433333,"(herbicides, copper)",2
9,0.5,"(zinc, insecticides)",2
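
The support of an itemset is the fraction of transactions that contain all of its items. As a rough sketch of that calculation, the snippet below uses only the five transactions visible in the `df.head(5)` output, so its numbers differ from the full-dataset values in the table above; the function name `support` is chosen for illustration.

```python
# The five transactions shown in the df.head(5) output, as sets.
transactions = [
    {"cadmium", "copper", "insecticides", "lead"},
    {"cadmium", "copper", "herbicides", "insecticides", "lead", "zinc"},
    {"copper", "zinc"},
    {"cadmium", "fungicides"},
    {"copper", "herbicides", "insecticides", "zinc"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)  # subset test
    return hits / len(transactions)

print(support({"copper"}, transactions))              # 4 of 5 transactions
print(support({"zinc", "herbicides"}, transactions))  # 2 of 5 transactions
```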


The source code of the apriori function from the mlxtend package is reproduced below. Note: We added comments to explain the steps, as asked in the exercise sheet.

In [None]:
# Sebastian Raschka 2014-2018
# mlxtend Machine Learning Library Extensions
# Author: Sebastian Raschka <sebastianraschka.com>
#
# License: BSD 3 clause

import numpy as np
import pandas as pd

# This support function generates new candidate combinations from old_combinations,
# i.e. the combinations from the previous iteration that satisfied the support threshold.
# It is a generator that yields each new combination as a tuple of column indices.

def generate_new_combinations(old_combinations):

    items_types_in_previous_step = np.unique(old_combinations.flatten())
    for old_combination in old_combinations:
        max_combination = max(old_combination)
        for item in items_types_in_previous_step:
            if item > max_combination:
                res = tuple(old_combination) + (item,)
                yield res

# Function that applies the apriori algorithm to get the frequent itemsets.
# The algorithm checks whether the support of each combination is at least the minimum support threshold and keeps track of the frequent itemsets and their support in the itemset_dict and support_dict dictionaries respectively.
# The loop terminates when there are no more frequent itemsets of the current length or the maximum length is reached.
# Finally, the function returns a DataFrame containing the support and itemsets of all frequent itemsets in the input DataFrame.

def apriori(df, min_support=0.5, use_colnames=False, max_len=None, n_jobs=1):

    allowed_val = {0, 1, True, False}
    unique_val = np.unique(df.values.ravel())
    for val in unique_val:
        if val not in allowed_val:
            s = ('The allowed values for a DataFrame'
                 ' are True, False, 0, 1. Found value %s' % (val))
            raise ValueError(s)

    is_sparse = hasattr(df, "to_coo")
    if is_sparse:
        X = df.to_coo().tocsc()
        support = np.array(np.sum(X, axis=0) / float(X.shape[0])).reshape(-1)
    else:
        X = df.values
        support = (np.sum(X, axis=0) / float(X.shape[0]))

    ary_col_idx = np.arange(X.shape[1])
    support_dict = {1: support[support >= min_support]}
    itemset_dict = {1: ary_col_idx[support >= min_support].reshape(-1, 1)}
    max_itemset = 1
    rows_count = float(X.shape[0])

    if max_len is None:
        max_len = float('inf')

    while max_itemset and max_itemset < max_len:
        next_max_itemset = max_itemset + 1
        combin = generate_new_combinations(itemset_dict[max_itemset])
        frequent_items = []
        frequent_items_support = []

        if is_sparse:
            all_ones = np.ones((X.shape[0], next_max_itemset))
        for c in combin:
            if is_sparse:
                together = np.all(X[:, c] == all_ones, axis=1)
            else:
                together = X[:, c].all(axis=1)
            support = together.sum() / rows_count
            if support >= min_support:
                frequent_items.append(c)
                frequent_items_support.append(support)

        if frequent_items:
            itemset_dict[next_max_itemset] = np.array(frequent_items)
            support_dict[next_max_itemset] = np.array(frequent_items_support)
            max_itemset = next_max_itemset
        else:
            max_itemset = 0

    all_res = []
    for k in sorted(itemset_dict):
        support = pd.Series(support_dict[k])
        itemsets = pd.Series([frozenset(i) for i in itemset_dict[k]])

        res = pd.concat((support, itemsets), axis=1)
        all_res.append(res)

    res_df = pd.concat(all_res)
    res_df.columns = ['support', 'itemsets']
    if use_colnames:
        mapping = {idx: item for idx, item in enumerate(df.columns)}
        res_df['itemsets'] = res_df['itemsets'].apply(lambda x: frozenset([
                                                      mapping[i] for i in x]))
    res_df = res_df.reset_index(drop=True)

    return res_df
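
The level-wise search in the listing above can be condensed into a short pure-Python sketch. It is a simplified illustration, not mlxtend's implementation: it works on sets of item names rather than NumPy column indices, the names `frequent_itemsets` and `transactions` are hypothetical, and the data are only the five transactions from the `df.head(5)` output. As in the listing, each frequent k-itemset is extended with items that sort after its largest item.

```python
def frequent_itemsets(transactions, min_support, max_len=None):
    """Level-wise search: extend frequent k-itemsets to (k+1)-candidates."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})

    def support(candidate):
        return sum(1 for t in transactions if set(candidate) <= t) / n

    # Level 1: single items that meet the threshold.
    level = [(i,) for i in items if support((i,)) >= min_support]
    result = {c: support(c) for c in level}
    k = 1
    while level and (max_len is None or k < max_len):
        # Items that still occur in at least one frequent k-itemset.
        alive = sorted({i for c in level for i in c})
        # Append only items sorted after the tuple's last (largest) item,
        # so each candidate is generated exactly once.
        candidates = [c + (i,) for c in level for i in alive if i > c[-1]]
        level = [c for c in candidates if support(c) >= min_support]
        result.update({c: support(c) for c in level})
        k += 1
    return result

transactions = [
    {"cadmium", "copper", "insecticides", "lead"},
    {"cadmium", "copper", "herbicides", "insecticides", "lead", "zinc"},
    {"copper", "zinc"},
    {"cadmium", "fungicides"},
    {"copper", "herbicides", "insecticides", "zinc"},
]
result = frequent_itemsets(transactions, min_support=0.5)
for itemset, sup in result.items():
    print(itemset, sup)
```

Because the example uses five rows and a threshold of 0.5, its output is not comparable with the table produced from the full dataset at `min_support=0.4`.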


The association_rules function is part of the mlxtend package used above. However, to show the calculations as asked in the homework sheet, we provide its source code in the following cell:

In [None]:
# Sebastian Raschka 2014-2018
# mlxtend Machine Learning Library Extensions
#
# Function for generating association rules
#
# Author: Joshua Goerner <https://github.com/JoshuaGoerner>
#         Sebastian Raschka <sebastianraschka.com>
#
# License: BSD 3 clause

from itertools import combinations
import numpy as np
import pandas as pd

# Uses the previously computed dataframe of frequent itemsets (with their support) to find association rules.
# First collects the antecedent, consequent and supports of every rule whose metric score meets the threshold.
# Then computes the final df containing antecedents, consequents,
# antecedent support, consequent support, support, confidence, lift, leverage, and conviction.

def association_rules(df, metric="confidence",
                      min_threshold=0.8, support_only=False):

    # check for mandatory columns
    if not all(col in df.columns for col in ["support", "itemsets"]):
        raise ValueError("Dataframe needs to contain the\
                         columns 'support' and 'itemsets'")

    def conviction_helper(sAC, sA, sC):
        confidence = sAC/sA
        conviction = np.empty(confidence.shape, dtype=float)
        if not len(conviction.shape):
            conviction = conviction[np.newaxis]
            confidence = confidence[np.newaxis]
            sAC = sAC[np.newaxis]
            sA = sA[np.newaxis]
            sC = sC[np.newaxis]
        conviction[:] = np.inf
        conviction[confidence < 1.] = ((1. - sC[confidence < 1.]) /
                                       (1. - confidence[confidence < 1.]))

        return conviction

    # metrics for association rules
    metric_dict = {
        "antecedent support": lambda _, sA, __: sA,
        "consequent support": lambda _, __, sC: sC,
        "support": lambda sAC, _, __: sAC,
        "confidence": lambda sAC, sA, _: sAC/sA,
        "lift": lambda sAC, sA, sC: metric_dict["confidence"](sAC, sA, sC)/sC,
        "leverage": lambda sAC, sA, sC: metric_dict["support"](
             sAC, sA, sC) - sA*sC,
        "conviction": lambda sAC, sA, sC: conviction_helper(sAC, sA, sC)
        }

    columns_ordered = ["antecedent support", "consequent support",
                       "support",
                       "confidence", "lift",
                       "leverage", "conviction"]

    # check for metric compliance
    if support_only:
        metric = 'support'
    else:
        if metric not in metric_dict.keys():
            raise ValueError("Metric must be 'confidence' or 'lift', got '{}'"
                             .format(metric))

    # get dict of {frequent itemset} -> support
    keys = df['itemsets'].values
    values = df['support'].values
    frozenset_vect = np.vectorize(lambda x: frozenset(x))
    frequent_items_dict = dict(zip(frozenset_vect(keys), values))

    # prepare buckets to collect frequent rules
    rule_antecedents = []
    rule_consequents = []
    rule_supports = []

    # iterate over all frequent itemsets
    for k in frequent_items_dict.keys():
        sAC = frequent_items_dict[k]
        # to find all possible combinations
        for idx in range(len(k)-1, 0, -1):
            # of antecedent and consequent
            for c in combinations(k, r=idx):
                antecedent = frozenset(c)
                consequent = k.difference(antecedent)

                if support_only:
                    # support doesn't need these,
                    # hence, placeholders should suffice
                    sA = None
                    sC = None

                else:
                    try:
                        sA = frequent_items_dict[antecedent]
                        sC = frequent_items_dict[consequent]
                    except KeyError as e:
                        s = (str(e) + 'You are likely getting this error'
                                      ' because the DataFrame is missing '
                                      ' antecedent and/or consequent '
                                      ' information.'
                                      ' You can try using the '
                                      ' `support_only=True` option')
                        raise KeyError(s)
                    # check for the threshold

                score = metric_dict[metric](sAC, sA, sC)
                if score >= min_threshold:
                    rule_antecedents.append(antecedent)
                    rule_consequents.append(consequent)
                    rule_supports.append([sAC, sA, sC])

    # check if frequent rule was generated
    if not rule_supports:
        return pd.DataFrame(
            columns=["antecedents", "consequents"] + columns_ordered)

    else:
        # generate metrics
        rule_supports = np.array(rule_supports).T.astype(float)
        df_res = pd.DataFrame(
            data=list(zip(rule_antecedents, rule_consequents)),
            columns=["antecedents", "consequents"])

        if support_only:
            sAC = rule_supports[0]
            for m in columns_ordered:
                df_res[m] = np.nan
            df_res['support'] = sAC

        else:
            sAC = rule_supports[0]
            sA = rule_supports[1]
            sC = rule_supports[2]
            for m in columns_ordered:
                df_res[m] = metric_dict[m](sAC, sA, sC)

        return df_res
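
The nested loops in the listing above enumerate every way of splitting a frequent itemset into an antecedent and a consequent. A minimal sketch of just that step, assuming a hypothetical helper name `candidate_rules` and using one three-item itemset from this notebook as input:

```python
from itertools import combinations

def candidate_rules(itemset):
    """Yield every (antecedent, consequent) split of an itemset."""
    itemset = frozenset(itemset)
    # Antecedent sizes run from len(itemset) - 1 down to 1, as in the listing.
    for size in range(len(itemset) - 1, 0, -1):
        for combo in combinations(sorted(itemset), size):
            antecedent = frozenset(combo)
            # The consequent is whatever remains of the itemset.
            yield antecedent, itemset - antecedent

splits = list(candidate_rules({"zinc", "herbicides", "insecticides"}))
for antecedent, consequent in splits:
    print(sorted(antecedent), "->", sorted(consequent))
```

Each split is only a candidate; in the listing it becomes a rule when its metric score reaches `min_threshold`.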


In [None]:
# Use the frequent itemsets to find association rules
# As a metric we use confidence with the threshold of 0.8 provided in the exercise sheet
rules = association_rules(frequent_items, metric="confidence", min_threshold=0.8)
rules.head()

# Result interpretation:
#
# Rule 0: The first rule suggests a strong association between the use of herbicides and zinc: when herbicides are used, zinc is also used, with high confidence, lift and conviction.
#         The rule has a support of 0.5, meaning that 50% of all transactions contain both herbicides and zinc (herbicides alone appear in 56.67% of the transactions).
#         The confidence of this rule is 0.882353, which means that 88.24% of the transactions that contain herbicides also contain zinc.
#         The lift value of 1.323529 suggests that herbicides and zinc are positively associated: zinc is about 1.32 times as likely in transactions that contain herbicides as in the dataset overall.
#         The leverage of 0.122222 indicates that the observed support is 12.22 percentage points higher than expected if herbicides and zinc were independent.
#         The conviction of 2.833333 indicates that the rule would be wrong (herbicides without zinc) about 2.83 times as often if herbicides and zinc were independent.

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(herbicides),(zinc),0.566667,0.666667,0.5,0.882353,1.323529,0.122222,2.833333
1,(herbicides),(insecticides),0.566667,0.666667,0.5,0.882353,1.323529,0.122222,2.833333
2,(insecticides),(copper),0.666667,0.6,0.566667,0.85,1.416667,0.166667,2.666667
3,(copper),(insecticides),0.6,0.666667,0.566667,0.944444,1.416667,0.166667,6.0
4,"(zinc, herbicides)",(insecticides),0.5,0.666667,0.433333,0.866667,1.3,0.1,2.5
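
The metric columns of row 0 can be re-derived by hand from the three support values, following the formulas in `metric_dict` of the listing above. This sketch uses the rounded supports copied from the table, so the results agree with the table only to about four decimal places.

```python
# Supports copied from row 0 of the rules table (herbicides -> zinc).
s_a = 0.566667   # antecedent support: share of transactions with herbicides
s_c = 0.666667   # consequent support: share of transactions with zinc
s_ac = 0.5       # support: share of transactions with both

confidence = s_ac / s_a
lift = confidence / s_c
leverage = s_ac - s_a * s_c
# Conviction is infinite when confidence equals 1, so guard that case.
conviction = float("inf") if confidence >= 1 else (1 - s_c) / (1 - confidence)

for name, value in [("confidence", confidence), ("lift", lift),
                    ("leverage", leverage), ("conviction", conviction)]:
    print(f"{name}: {value:.4f}")
```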
