
# Data Mining / Prospecção de Dados
### Sara C. Madeira and André Falcão
#### Pattern Mining I

### 0. Getting Started

In this notebook, we use Python 3, Jupyter Notebook and MLxtend. MLxtend (machine learning extensions) is a Python library of useful tools for day-to-day data science tasks, developed by Sebastian Raschka. It builds on Pandas, NumPy, Scikit-learn, Matplotlib and SciPy.

In the lab and at home you should have the latest version of Anaconda, which already installs Python 3, Jupyter, Scikit-learn, Pandas, NumPy, Matplotlib and SciPy.

MLxtend is not installed with Anaconda. Proceed as follows:

    On your own computer, or wherever you have install permissions, you can install MLxtend.

MLxtend is available through conda-forge (https://anaconda.org/conda-forge/mlxtend). To install the package with conda, run the following on the command line and follow the instructions:

conda install -c conda-forge mlxtend

After this you should be ready to start.

OR

    In the lab, or wherever you do not have install permissions, you have to keep the mlxtend folder in your working directory.
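
To confirm that either option worked, one possibility is to ask Python whether it can locate the package before importing it. The sketch below uses only the standard library; the helper name `is_installed` is ours and is not part of MLxtend.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None


# Report the availability of the libraries used in this notebook
for name in ("pandas", "mlxtend"):
    print(name, "available:", is_installed(name))
```

If `mlxtend` is reported as unavailable in the lab, check that the mlxtend folder sits in the same directory as the notebook.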

Wondering why we are not using Scikit-learn?

Scikit-learn does not have implementations of pattern mining algorithms.

## 1. Frequent Pattern Mining and Association Rule Mining in MLxtend

In this section we follow closely the examples on generating frequent Itemsets via Apriori Algorithm and Association Rules Generation from Frequent Itemsets provided in the documentation of MLxtend by Sebastian Raschka.

### 1.1. Preprocess Dataset

Consider the previous example of a set of transactions (baskets), each containing a set of products (items) bought at a given supermarket.


In [None]:
# Transaction data (market baskets)
transactions = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
                ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
                ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
                ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
                ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
transactions

Note that we are not particularly concerned with the representation here, since we will use an external library to convert this information into a new data structure.

The MLxtend library understands transactions represented as a list of lists and constructs a special array to process them.
We are also going to use pandas to get a new look at our data.

In [None]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

The Apriori implementation in MLxtend receives a binary database, so the first step is to transform the transactions database into a binary database stored as an array, where each row i is a transaction, each column j is an item (product), and a 1 (True) at position ij means that item j appears in transaction i.

In [None]:
#Compute binary database
tr_enc = TransactionEncoder()
trans_array = tr_enc.fit(transactions).transform(transactions)
binary_database = pd.DataFrame(trans_array, columns=tr_enc.columns_)
binary_database
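
To make the encoding concrete, the same kind of binary table can be built by hand in plain Python. This is only an illustrative sketch of the idea, not MLxtend's implementation; the names `baskets`, `items` and `binary_rows` are ours, and the alphabetical column order is an assumption made for readability.

```python
# Toy transactions (same baskets as in the example above)
baskets = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

# One column per distinct item, in alphabetical order (our assumption)
items = sorted({item for basket in baskets for item in basket})

# One row per transaction: True where the item appears in the basket
binary_rows = [[item in basket for item in items] for basket in baskets]

print(items)
for row in binary_rows:
    print([int(flag) for flag in row])
```

Note that the repeated 'Onion' in the last basket still yields a single 1: a binary database records presence, not quantity.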

### 1.2. Compute Frequent Itemsets using Apriori

We can now pass the binary database to apriori and compute the frequent itemsets. Consider a minimum support of 60%, which in this case means an itemset is frequent if it appears in at least 3 of the 5 transactions.

In [None]:
from mlxtend.frequent_patterns import apriori

In [None]:
frequent_itemsets = apriori(binary_database, min_support=0.6)
frequent_itemsets
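
As a cross-check, recall that the support of an itemset is the fraction of transactions that contain all of its items. The brute-force sketch below (plain Python, feasible only for tiny datasets, and not how Apriori prunes its candidates) enumerates every possible itemset of the toy baskets and keeps those with support of at least 60%. The helper `support` and the variable names are ours.

```python
from itertools import combinations

baskets = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
items = sorted({item for basket in baskets for item in basket})


def support(itemset):
    """Fraction of baskets that contain every item of the itemset."""
    hits = sum(1 for basket in baskets if set(itemset) <= set(basket))
    return hits / len(baskets)


min_support = 0.6
frequent = {}
# Try every non-empty combination of items (exponential: toy data only)
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        s = support(candidate)
        if s >= min_support:
            frequent[candidate] = s

for itemset, s in frequent.items():
    print(itemset, s)
```

On this toy data the enumeration finds 11 frequent itemsets: five single items, five pairs and one triple.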

By default, apriori returns the column indexes of the items, which may be useful in downstream operations, such as association rule mining. For better readability, we can set use_colnames=True to convert these integer values into the respective item names.

The advantage of working with pandas DataFrames is that we can use its convenient features to filter the results. For instance, let us assume we decide that after all we are only interested in itemsets of length 2 that have a support of at least 80 percent. Given that we already have the frequent itemsets and their support, we can add a new column that stores the length of each itemset:

In [None]:
#Compute itemsets with min_support = 60% with item names
frequent_itemsets = apriori(binary_database, min_support=0.6, use_colnames=True)
# Add new column length
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)

frequent_itemsets

We can now filter the results based on the desired support and pattern length:

In [None]:
# filter using support and pattern length
frequent_itemsets = frequent_itemsets[(frequent_itemsets['support'] >= 0.8) & (frequent_itemsets['length'] == 2)]
frequent_itemsets


Note that if we already knew that we were only interested in patterns with at least 80% support, it would be more efficient to run the Apriori algorithm with this minimum support value from the start and then filter the results based only on pattern length. Can you do this?


In [None]:
# Compute frequent itemsets with min_support=0.8
# add new column length
# filter using pattern length


### 1.3. Generate Association Rules from Frequent Itemsets

The first step in association rule mining is to find the frequent itemsets. We can now generate association rules from the frequent itemsets discovered using Apriori, following closely the examples in Association Rules Generation from Frequent Itemsets.

The function association_rules takes DataFrames of frequent itemsets as produced by the apriori function in mlxtend.frequent_patterns. To demonstrate its usage, we first create a pandas DataFrame of frequent itemsets as generated by the apriori function.

The association_rules function allows you to (1) specify your metric of interest and (2) specify the corresponding threshold. The metrics used in this notebook are confidence and lift.
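
For reference, the two metrics are usually defined as follows for a rule $A \rightarrow C$ with antecedent $A$ and consequent $C$. These are the standard textbook definitions, given here as a reminder rather than quoted from the MLxtend documentation:

$$
\text{confidence}(A \rightarrow C) = \frac{\text{support}(A \cup C)}{\text{support}(A)},
\qquad
\text{lift}(A \rightarrow C) = \frac{\text{confidence}(A \rightarrow C)}{\text{support}(C)}
$$

A lift of 1 means that $A$ and $C$ occur together as often as expected if they were independent; values above 1 indicate a positive association.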

Suppose we are interested in rules derived from the frequent itemsets only if their confidence is at least 60 percent (min_threshold=0.6):


In [None]:
#first retrieve the original itemsets with 60% support
frequent_itemsets = apriori(binary_database, min_support=0.6, use_colnames=True)

In [None]:
from mlxtend.frequent_patterns import association_rules
# Generate association rules with confidence >= 60%

all_rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
all_rules

If we are interested in rules fulfilling a different interest metric, we can simply adjust the parameters. For example, if we are only interested in rules that have a lift score of at least 1.2, we would do the following:

In [None]:
# Generate association rules with lift >= 1.2
good_rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
good_rules

If, on the other hand, we are interested in rules with confidence above 90% and lift >= 1.2, we can generate the rules using the confidence as metric and then filter using the lift, or vice versa:

In [None]:
# Generate association rules with confidence >= 90%
all_rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.9)
# Filter association rules using lift
new_rules = all_rules[all_rules['lift'] >= 1.2]
new_rules
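
To see where such numbers come from, the sketch below recomputes confidence and lift by hand for the toy baskets, using plain Python and the usual definitions (confidence = support(A ∪ C) / support(A), lift = confidence / support(C)). It is an illustrative cross-check rather than MLxtend's implementation. The helper names `support` and `rules_from` are ours, and the multi-item frequent itemsets at 60% support are listed explicitly to keep the block short.

```python
from itertools import combinations

baskets = [set(b) for b in [
    ['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
    ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
    ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
    ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
    ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]]

# Frequent itemsets with two or more items at 60% minimum support
frequent = [('Eggs', 'Kidney Beans'), ('Eggs', 'Onion'),
            ('Kidney Beans', 'Milk'), ('Kidney Beans', 'Onion'),
            ('Kidney Beans', 'Yogurt'), ('Eggs', 'Kidney Beans', 'Onion')]


def support(itemset):
    """Fraction of baskets containing every item of the itemset."""
    return sum(set(itemset) <= b for b in baskets) / len(baskets)


def rules_from(itemset):
    """Yield (antecedent, consequent, confidence, lift) for every split."""
    for size in range(1, len(itemset)):
        for antecedent in combinations(itemset, size):
            consequent = tuple(i for i in itemset if i not in antecedent)
            confidence = support(itemset) / support(antecedent)
            lift = confidence / support(consequent)
            yield antecedent, consequent, confidence, lift


# Keep rules with confidence >= 90% and lift >= 1.2
strong = [r for s in frequent for r in rules_from(s)
          if r[2] >= 0.9 and r[3] >= 1.2]
for antecedent, consequent, confidence, lift in strong:
    print(antecedent, '->', consequent, round(confidence, 2), round(lift, 2))
```

On this toy data three rules pass both filters, and all of them have Onion in the antecedent.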


### 1.4. Small Exercise

Consider the set of transactions below, taken from Han, Kamber and Pei, Chapter 6, and used as an example in the theoretical lesson. Create and preprocess the dataset and then use the functions apriori and association_rules to generate, respectively, frequent patterns with different values for minimum support, and association rules with different values for confidence and lift.

AllElectronics



## 2. Market Basket Analysis in a Real Dataset using MLxtend

We will now use the dataset groceries.csv, which contains 9835 transactions (baskets) and 169 items (products) collected from a supermarket and can be downloaded here.


### 2.1. Preprocess Dataset

Take a look at the dataset by opening the .csv file. We will use the function load_transactions below to load the dataset into the format used in the examples above.


In [None]:
def load_transactions(csv_file):
    # input: csv file with one transaction per line,
    #        where transactions may have a different number of items
    # output: list of lists where each inner list holds the items of one transaction
    # author: Sara C. Madeira, Oct 2017
    transactions_matrix = []
    # the with-statement closes the file even if reading fails
    with open(csv_file, 'r') as f:
        for line in f:
            line = line.rstrip('\n')
            transaction = line.split(',')
            transactions_matrix.append(transaction)
    return transactions_matrix
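
If item names could themselves contain commas or be wrapped in quotes, splitting each line on ',' would break them apart. A hedged alternative, assuming a standard comma-delimited file, is Python's built-in csv module. The function name `load_transactions_csv` and the choice to drop empty fields and blank lines are our additions, not part of the original loader.

```python
import csv


def load_transactions_csv(csv_file):
    """Read one transaction per line; quoted fields are kept intact."""
    transactions = []
    with open(csv_file, newline='') as f:
        for row in csv.reader(f):
            # strip surrounding whitespace and skip empty fields
            items = [item.strip() for item in row if item.strip()]
            if items:  # ignore blank lines
                transactions.append(items)
    return transactions
```

For a file such as groceries.csv, where no item name contains a comma, both loaders should return the same list of lists.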

In [None]:
# Load transactions from file groceries.csv
transactions = load_transactions('groceries.csv')
transactions[:10]

In [None]:
#Check the number of transactions
len(transactions)
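
The number of distinct items can be checked in the same spirit by collecting all items into a set. The sketch below is self-contained and uses a small hypothetical list (`sample`) in place of the loaded data; with the real file one would pass `transactions` instead. The helper name `count_distinct_items` is ours.

```python
def count_distinct_items(transactions):
    """Number of different items appearing in any transaction."""
    return len({item for transaction in transactions for item in transaction})


# Hypothetical mini-dataset standing in for the loaded groceries data
sample = [['whole milk', 'yogurt'], ['yogurt', 'soda'], ['whole milk']]
print(len(sample), 'transactions,', count_distinct_items(sample), 'distinct items')
```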

In [None]:
# Compute binary database (transactions X products )
tr_enc = TransactionEncoder()
trans_array = tr_enc.fit(transactions).transform(transactions)
binary_database = pd.DataFrame(trans_array, columns=tr_enc.columns_)
binary_database


### 2.2. Compute Frequent Itemsets

Let us use apriori to compute the frequent itemsets. Note that, given the number of transactions and distinct items, the computations might not be as instantaneous as before.


In [None]:
#Compute itemsets min_support = 20%
frequent_itemsets = apriori(binary_database, min_support=0.2, use_colnames=True)
frequent_itemsets
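
The slowdown is easy to quantify: with 169 distinct items, the number of possible itemsets of each size grows combinatorially, and lowering the minimum support leaves more of them to examine. The quick standard-library calculation below (Python 3.8+ for `math.comb`) gives an idea of the scale. It counts all possible candidates, which is an upper bound and not what Apriori actually enumerates after pruning.

```python
from math import comb

n_items = 169  # distinct products in groceries.csv

# Number of possible itemsets of size 1, 2 and 3
for k in (1, 2, 3):
    print(k, comb(n_items, k))

# Number of non-empty itemsets overall
total_itemsets = 2 ** n_items - 1
print(len(str(total_itemsets)), 'digits')
```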

In [None]:
#Compute itemsets min_support = 10%
frequent_itemsets = apriori(binary_database, min_support=0.1, use_colnames=True)
frequent_itemsets

In [None]:
#Compute itemsets min_support = 5%
frequent_itemsets = apriori(binary_database, min_support=0.05, use_colnames=True)
frequent_itemsets

In [None]:
#Compute itemsets min_support = 1%
frequent_itemsets = apriori(binary_database, min_support=0.01, use_colnames=True)
frequent_itemsets

In [None]:
# add new column length
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets

In [None]:
# filter using pattern length = 2
frequent_2_itemsets = frequent_itemsets[frequent_itemsets['length'] == 2]
frequent_2_itemsets

In [None]:
# filter using pattern length = 3
frequent_3_itemsets = frequent_itemsets[frequent_itemsets['length'] == 3]
frequent_3_itemsets

### 2.3. Generate Association Rules from Frequent Itemsets

In [None]:
#Compute itemsets min_support = 1%
frequent_itemsets = apriori(binary_database, min_support=0.01, use_colnames=True)
print(len(frequent_itemsets))
# Compute association rules with 40% confidence
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.4)
#pd.options.display.max_rows=None
rules

In [None]:
# Compute association rules with 50% confidence
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
rules


## 3. Your Own Experiments with Frequent Pattern Mining and Association Rule Mining on Other Real Problems and Datasets

On the webpage of SPMF - An Open-Source Data Mining Library you can find a list of Datasets for Frequent Itemset mining / Association Rule Mining: an interesting collection of preprocessed real-life datasets gathered from several machine learning and data mining repositories and competitions, such as Kaggle - The Home of Data Science & Machine Learning, KDD Cup - The Data Mining and Knowledge Discovery competition, and the UCI Machine Learning Repository.

Choose some and have fun!
