## Exercise 1
There are many libraries that implement the Apriori algorithm (e.g.,
mlxtend). Benchmark your implementation with one of the available
implementations online. Do the comparison using at least 3 of the
provided datasets.

## Convert transactions to Pandas DataFrame
The `convertTransactionsToDataFrame()` function converts the transactions loaded by the `loadDataset()` function from a list to a Pandas DataFrame.

To do the conversion we use a `TransactionEncoder` from the mlxtend library.
The transactions returned by the `loadDataset()` function form a list of collections (frozensets in the code below), where each collection is one transaction and the transactions do not necessarily have the same length (the same number of items).

### Fit
We 'fit' the original transactions (with the fit() method), with the help of the mlxtend encoder, in order to determine the unique items and build a mapping between each unique item and a column index. We basically obtain unique column names from the original dataset.

### Transform
The `transform()` method uses the unique column names (the vocabulary) obtained with the `fit()` method to convert each transaction to a boolean (binary) array.
The length of the boolean array is the number of unique items learned by the `fit()` method.

The value at each index in the array is True if the corresponding unique item is present in the current transaction and False if it is not.

The final data type of `binaryArray` is a NumPy `ndarray` or, if `transform()` is called with `sparse=True`, a SciPy sparse matrix.

### Create the data frame
Create a Pandas data frame, which is a 2-dimensional labeled data structure. It is similar
to a table: each column has a label and an array of elements, which are the column's data.

We obtain the Pandas data frame by passing it the NumPy array together with the column names learned by the encoder. Pandas treats each row of the NumPy array as a row of the data frame (one transaction) and each column of the array as a data frame column (one unique item).
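
A minimal sketch of this step, assuming a small hand-written boolean array and made-up item names in place of the encoder output:

```python
import numpy as np
import pandas as pd

# Hypothetical encoded transactions: 3 transactions (rows) x 3 unique items (columns)
binaryArray = np.array([[True, False, True],
                        [False, False, True],
                        [True, True, True]])
itemNames = ["bread", "eggs", "milk"]  # made-up column labels

dataFrame = pd.DataFrame(binaryArray, columns=itemNames)
print(dataFrame)

# The mean of a boolean column is the fraction of transactions
# that contain the item, i.e. the support of that single item
supportPerItem = dataFrame.mean()
print(supportPerItem)
```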

In [None]:
def loadDataset(filepath):
    # The initial list of transactions, it's initially empty
    transactions = []

    # The line (of the current iteration) read from the dataset file
    currentLine = ""

    # The list of items (numbers in this case), of the current iteration,
    # as read from the dataset file
    currentItemsList = []

    # Open the file from the 'filepath' in 'read' mode and use it as an object called 'file'
    with open(filepath, 'r') as file:
        for line in file:
            # First, remove any extra whitespace (spaces, newlines, etc.) from the current line.
            # This still leaves one space between each word of the line (which is good, because
            # we use the space to separate the line into individual words or items)
            currentLine = line.strip()

            # Then, split the line into its individual words (items).
            # We do not need to give the split() function any delimiter by which to split
            # because, by default, it splits on any run of whitespace
            currentItemsList = currentLine.split()

            # Append the current items, as a frozenset, to the transactions list. This generates
            # a list of transactions, where each transaction is a set of unique items.
            # The transactions do not necessarily have the same length.
            transactions.append(frozenset(currentItemsList))

    return transactions

In [140]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

def convertTransactionsToDataFrame(transactions):
    # Instantiate a 'TransactionEncoder' in order to convert a list of transactions
    # into a suitable format for the 'mlxtend' Apriori implementation.
    transactionEncoder = TransactionEncoder()

    # We 'fit' the original transactions, with the help of the mlxtend encoder
    # Then we apply 'transform()', which uses the unique column names (vocabulary) obtained with
    # the 'fit()' method to convert each transaction to a boolean (binary) array.
    binaryArray = transactionEncoder.fit(transactions).transform(transactions)

    # Create a Pandas data frame with one row per transaction and one column per unique item.
    dataFrame = pd.DataFrame(binaryArray, columns = transactionEncoder.columns_)

    return dataFrame

In [141]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

def convertTransactionsToBinaryMatrix(transactions):
    # Instantiate a 'TransactionEncoder' in order to convert a list of transactions
    # into a suitable format for the 'mlxtend' Apriori implementation.
    transactionEncoder = TransactionEncoder()

    # We 'fit' the original transactions, with the help of the mlxtend encoder
    # Then we apply 'transform()', which uses the unique column names (vocabulary) obtained with
    # the 'fit()' method to convert each transaction to a boolean (binary) array.
    binaryArray = transactionEncoder.fit(transactions).transform(transactions)

    # Wrap the binary matrix in a Pandas data frame (one row per transaction, one column per item).
    dataFrame = pd.DataFrame(binaryArray, columns = transactionEncoder.columns_)

    return dataFrame

## Apriori implementation from mlxtend 

In this example, we use an Apriori implementation from the **mlxtend** library.
The `runMlxtendApriori()` function opens the dataset, converts it to a data frame and runs the Apriori algorithm on it.
It returns the frequent itemsets and association rules it found, together with some performance values.

For measuring performance we use **time** and **psutil** Python packages.
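
The measurement pattern can be sketched in isolation as follows. The helper name `measureCall` is hypothetical and is not part of this notebook; the sketch also assumes `time.perf_counter()` as the clock and reports the change in resident set size (RSS), which is only a coarse indicator of how much memory a call used.

```python
import time
import psutil

def measureCall(function, *args, **kwargs):
    # Hypothetical helper: runs 'function' and reports the elapsed time
    # and the change in the process RSS around the call
    process = psutil.Process()

    rssBeforeBytes = process.memory_info().rss
    startTime = time.perf_counter()

    result = function(*args, **kwargs)

    elapsedSeconds = time.perf_counter() - startTime
    rssDeltaBytes = process.memory_info().rss - rssBeforeBytes

    # psutil reports RSS in bytes, so convert the difference to MiB
    return result, elapsedSeconds, rssDeltaBytes / (1024 * 1024)

# Usage example on a cheap built-in call
total, seconds, memoryMiB = measureCall(sum, range(1_000_000))
print(total, seconds, memoryMiB)
```

The RSS difference can be zero or even negative, because the allocator may reuse memory that the process already holds.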

The **mlxtend** library has two functions for the Apriori algorithm
- `apriori()`, which must receive a DataFrame as argument and which, in return, computes the *frequent itemsets*
- `association_rules()`, which must receive the *frequent itemsets* (previously determined by the `apriori()` function) and which, in return, computes the *association rules*

In [142]:
from mlxtend.frequent_patterns import apriori, association_rules

import time
import psutil

def runMlxtendApriori(datasetPath, minSupport, minConfidence):
    # Get the current running Python process as an object
    currentProcess = psutil.Process()

    # Load the full dataset (no limit on the number of transactions is applied here)
    transactions = loadDataset(datasetPath)

    # Convert the dataset to a Pandas data frame
    dataFrame = convertTransactionsToDataFrame(transactions)

    # Apply the Apriori algorithm, using the 'apriori()' implementation from the mlxtend library
    # and also measure performance
    timeBeforeApriori = time.time()
    memoryBeforeApriori = currentProcess.memory_info().rss

    frequentItemsets = apriori(dataFrame, min_support = minSupport, use_colnames = True)

    timeAfterApriori = time.time()
    memoryAfterApriori = currentProcess.memory_info().rss

    aprioriTime = timeAfterApriori - timeBeforeApriori
    aprioriMemory = memoryAfterApriori - memoryBeforeApriori

    # Generate association rules, using 'association_rules()' from the mlxtend library
    # and also measure performance
    timeBeforeAssociationRules = time.time()
    memoryBeforeAssociationRules = currentProcess.memory_info().rss

    associationRules = association_rules(frequentItemsets, metric = "confidence", min_threshold = minConfidence)

    timeAfterAssociationRules = time.time()
    memoryAfterAssociationRules = currentProcess.memory_info().rss

    associationRulesTime = timeAfterAssociationRules - timeBeforeAssociationRules
    associationRulesMemory = memoryAfterAssociationRules - memoryBeforeAssociationRules

    return {
        "frequent_itemsets": frequentItemsets,           # Frequent itemsets with support values
        "rules": associationRules,                       # Association rules with confidence
        "itemsets_time_sec": aprioriTime,                # Itemsets generation execution time
        "assoc_rules_time_sec": associationRulesTime,    # Association rules generation execution time
        "itemsets_memory_MB": aprioriMemory,             # Itemsets RSS change, in bytes (psutil reports bytes)
        "assoc_rules_memory_MB": associationRulesMemory  # Association rules RSS change, in bytes
    }

In [150]:
# Run the Apriori algorithm from the mlxtend library
aprioriResult1 = runMlxtendApriori("chess.dat.txt", minSupport=0.9, minConfidence=0.5)

aprioriResult2 = runMlxtendApriori("mushroom.dat.txt", minSupport=0.4, minConfidence=0.6)

aprioriResult3 = runMlxtendApriori("T10I4D100K.dat.txt", minSupport=0.02, minConfidence=0.6)



## Display the results of the Apriori algorithm

### Chess dataset

In [151]:
aprioriResult1["frequent_itemsets"][:10]

Unnamed: 0,support,itemsets
0,0.995307,(29)
1,0.951189,(34)
2,0.96965,(36)
3,0.991865,(40)
4,0.942741,(48)
5,0.929599,(5)
6,0.996558,(52)
7,0.945244,(56)
8,0.999687,(58)
9,0.985294,(60)


In [152]:
aprioriResult1["rules"][:10]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(29),(34),0.995307,0.951189,0.949937,0.954417,1.003394,1.0,0.003213,1.070813,0.720597,0.953218,0.06613,0.976551
1,(34),(29),0.951189,0.995307,0.949937,0.998684,1.003394,1.0,0.003213,3.566959,0.069288,0.953218,0.719649,0.976551
2,(29),(36),0.995307,0.96965,0.964956,0.969506,0.999852,1.0,-0.000142,0.995307,-0.030494,0.964956,-0.004715,0.982333
3,(36),(29),0.96965,0.995307,0.964956,0.99516,0.999852,1.0,-0.000142,0.96965,-0.00484,0.964956,-0.0313,0.982333
4,(40),(29),0.991865,0.995307,0.987171,0.995268,0.999961,1.0,-3.8e-05,0.991865,-0.004732,0.987171,-0.008202,0.993547
5,(29),(40),0.995307,0.991865,0.987171,0.991826,0.999961,1.0,-3.8e-05,0.995307,-0.008174,0.987171,-0.004715,0.993547
6,(29),(48),0.995307,0.942741,0.938048,0.942471,0.999714,1.0,-0.000269,0.995307,-0.057529,0.938048,-0.004715,0.968746
7,(48),(29),0.942741,0.995307,0.938048,0.995022,0.999714,1.0,-0.000269,0.942741,-0.004978,0.938048,-0.060737,0.968746
8,(29),(5),0.995307,0.929599,0.927409,0.931782,1.002348,1.0,0.002173,1.032,0.499168,0.929737,0.031008,0.964713
9,(5),(29),0.929599,0.995307,0.927409,0.997644,1.002348,1.0,0.002173,1.991999,0.033278,0.929737,0.497992,0.964713


### Mushroom dataset

In [153]:
aprioriResult2["frequent_itemsets"][:10]

Unnamed: 0,support,itemsets
0,0.482029,(1)
1,0.497292,(110)
2,0.517971,(2)
3,0.415559,(23)
4,0.584441,(24)
5,0.434269,(28)
6,0.450025,(3)
7,0.974151,(34)
8,0.838503,(36)
9,0.690793,(39)


In [154]:
aprioriResult2["rules"][:10]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(1),(24),0.482029,0.584441,0.405219,0.840654,1.438389,1.0,0.123502,2.607898,0.588407,0.612807,0.616549,0.766999
1,(24),(1),0.584441,0.482029,0.405219,0.693345,1.438389,1.0,0.123502,1.689099,0.733417,0.612807,0.407968,0.766999
2,(1),(34),0.482029,0.974151,0.479813,0.995403,1.021817,1.0,0.010244,5.623667,0.04122,0.491427,0.82218,0.743974
3,(1),(36),0.482029,0.838503,0.468242,0.971399,1.158492,1.0,0.06406,5.64662,0.264125,0.549393,0.822903,0.764913
4,(1),(85),0.482029,1.0,0.482029,1.0,1.0,1.0,0.0,inf,0.0,0.482029,0.0,0.741014
5,(1),(86),0.482029,0.975382,0.481044,0.997957,1.023145,1.0,0.010882,12.050714,0.043674,0.492688,0.917017,0.745571
6,(1),(90),0.482029,0.921713,0.468735,0.972421,1.055014,1.0,0.024442,2.838613,0.100673,0.501316,0.647715,0.740484
7,(110),(34),0.497292,0.974151,0.485475,0.976238,1.002142,1.0,0.001038,1.087826,0.004252,0.492385,0.080736,0.737297
8,(110),(36),0.497292,0.838503,0.473658,0.952475,1.135923,1.0,0.056677,3.398162,0.238028,0.5494,0.705723,0.75868
9,(110),(85),0.497292,1.0,0.497292,1.0,1.0,1.0,0.0,inf,0.0,0.497292,0.0,0.748646


### T10I4D100K dataset

In [155]:
aprioriResult3["frequent_itemsets"][:10]

Unnamed: 0,support,itemsets
0,0.0268,(112)
1,0.02193,(116)
2,0.03415,(12)
3,0.04973,(120)
4,0.02641,(132)
5,0.02687,(140)
6,0.04559,(145)
7,0.02611,(151)
8,0.0232,(161)
9,0.02791,(175)


In [156]:
aprioriResult3["rules"][:10]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski


## Display the performance metrics of the Apriori algorithm
In this section we display a summary of the Apriori algorithm runs together with the performance metrics: time and memory consumption.

To display the metrics in a readable way, we use the fact that a **Pandas** *DataFrame* is rendered by default as a correctly formatted table in a Python notebook.

First, the data is grouped in a `metrics` Python list, which contains the data in the correct order.
Then, another Python list is created, `columns`, to store the column names only.

In [157]:
columns = ["Library",
           "Dataset",
           "Freq. itemsets found",
           "Assoc. rules found",
           "Itemsets time [sec]",
           "Assoc. rules time [sec]",
           "Itemsets memory [bytes]",
           "Assoc. rules memory [bytes]"
           ]

metrics = [["mlxtend", # library name used for Apriori algorithm
            "chess.dat", # dataset name
            len(aprioriResult1["frequent_itemsets"].index), # how many itemsets were discovered
            len(aprioriResult1["rules"].index), # how many rules were discovered
            aprioriResult1["itemsets_time_sec"],
            aprioriResult1["assoc_rules_time_sec"],
            aprioriResult1["itemsets_memory_MB"], # RSS change, in bytes
            aprioriResult1["assoc_rules_memory_MB"]], # RSS change, in bytes
           ["mlxtend", # library name used for Apriori algorithm
            "mushroom.dat", # dataset name
            len(aprioriResult2["frequent_itemsets"].index), # how many itemsets were discovered
            len(aprioriResult2["rules"].index), # how many rules were discovered
            aprioriResult2["itemsets_time_sec"],
            aprioriResult2["assoc_rules_time_sec"],
            aprioriResult2["itemsets_memory_MB"],
            aprioriResult2["assoc_rules_memory_MB"]],
           ["mlxtend", # library name used for Apriori algorithm
            "T10I4D100K.dat", # dataset name
            len(aprioriResult3["frequent_itemsets"].index), # how many itemsets were discovered
            len(aprioriResult3["rules"].index), # how many rules were discovered
            aprioriResult3["itemsets_time_sec"],
            aprioriResult3["assoc_rules_time_sec"],
            aprioriResult3["itemsets_memory_MB"],
            aprioriResult3["assoc_rules_memory_MB"]]
           ]

metricsDataFrame = pd.DataFrame(metrics, columns = columns)

# Display the metrics
metricsDataFrame

Unnamed: 0,Library,Dataset,Freq. itemsets found,Assoc. rules found,Itemsets time [sec],Assoc. rules time [sec],Itemsets memory [bytes],Assoc. rules memory [bytes]
0,mlxtend,chess.dat,622,10742,0.037033,0.033043,0,0
1,mlxtend,mushroom.dat,565,4570,0.114187,0.022564,19619840,1572864
2,mlxtend,T10I4D100K.dat,155,0,20.647521,0.001046,335872,0
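
The memory values in the table are differences of `memory_info().rss` as reported by psutil, which expresses them in bytes. A minimal sketch of how they could be converted to mebibytes is shown below; the data frame and its column names are hypothetical and only reuse the three memory readings from the table above.

```python
import pandas as pd

# Memory readings copied from the metrics table above (RSS changes, in bytes);
# the column names here are made up for this sketch
memoryFrame = pd.DataFrame({
    "Dataset": ["chess.dat", "mushroom.dat", "T10I4D100K.dat"],
    "Itemsets memory [bytes]": [0, 19619840, 335872],
    "Assoc. rules memory [bytes]": [0, 1572864, 0],
})

bytesPerMiB = 1024 * 1024

# Add one MiB column per byte column
memoryFrame["Itemsets memory [MiB]"] = memoryFrame["Itemsets memory [bytes]"] / bytesPerMiB
memoryFrame["Assoc. rules memory [MiB]"] = memoryFrame["Assoc. rules memory [bytes]"] / bytesPerMiB

print(memoryFrame[["Dataset", "Itemsets memory [MiB]", "Assoc. rules memory [MiB]"]])
```

A reading of 0 does not mean that no memory was used; it only means that the resident set size of the process did not change between the two measurements.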
