<a href="https://colab.research.google.com/github/michalis0/Business-Intelligence-and-Analytics/blob/master/labs/07%20-%20Association%20Rules/Exercises/Solutions/association_sol.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Exercise

Let's now work with a larger, more realistic dataset. Load the `Groceries.csv` file and find association rules using the __confidence__ metric, with a support threshold of __0.01__ and a confidence threshold of __0.1__.

### Load the data
First, load the `Groceries.csv` file from the GitHub repository. It is not a regular CSV file: each row is one transaction, and the rows contain different numbers of values, so the file has to be read manually. Run the code cell below to read the file and store it as a list of lists.

In [1]:
# We need the `requests` package to read the data from the URL.
import requests

response = requests.get("https://media.githubusercontent.com/media/michalis0/Business-Intelligence-and-Analytics/master/data/Groceries.csv")
data = response.text

# Each line of the file is one transaction: a comma-separated list of items.
dataset = []
for line in data.split("\n"):
    # Strip trailing whitespace, then split the line into its items.
    transaction = line.rstrip().split(",")
    dataset.append(transaction)
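
The loop above keeps every line as it is, including a blank line at the end of the file if there is one, which would become a transaction holding a single empty string. Below is a minimal sketch of a stricter parser, assuming blank lines and empty items should be dropped. The function name `parse_transactions` and the sample text are hypothetical, and using it on the real file may change the number of transactions and therefore the support values shown later.

```python
def parse_transactions(text):
    """Split raw text into transactions, skipping blank lines and empty items.

    Surrounding whitespace is stripped from every item.
    """
    transactions = []
    for line in text.splitlines():
        items = [item.strip() for item in line.split(",") if item.strip()]
        if items:  # ignore lines that contain no items at all
            transactions.append(items)
    return transactions


# Hypothetical sample with a trailing newline, like a typical text file.
sample = "citrus fruit,semi-finished bread\nyogurt,coffee\nwhole milk\n"
transactions = parse_transactions(sample)
print(transactions)
```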

Next, create the one-hot encoded dataframe with `TransactionEncoder()`.

In [2]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

In [3]:
# Create the one_hot encoded dataframe with TransactionEncoder()
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df.head()

| | | Instant food products | UHT-milk | abrasive cleaner | artif. sweetener | baby cosmetics | baby food | bags | baking powder | bathroom cleaner | ... | turkey | vinegar | waffles | whipped/sour cream | whisky | white bread | white wine | whole milk | yogurt | zwieback |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | True | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | True | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
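
To make the encoding step concrete, here is a small pure-Python sketch of what a one-hot (boolean) transaction matrix looks like. It illustrates the idea on a hypothetical three-transaction dataset and is not mlxtend's actual implementation; `one_hot_encode` is a made-up helper name.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): sorted unique items and one boolean row per transaction."""
    columns = sorted({item for transaction in transactions for item in transaction})
    rows = []
    for transaction in transactions:
        present = set(transaction)
        # True where the transaction contains the item of that column.
        rows.append([item in present for item in columns])
    return columns, rows


toy = [["yogurt", "whole milk"], ["soda"], ["whole milk", "soda"]]
columns, rows = one_hot_encode(toy)
print(columns)
for row in rows:
    print(row)
```

Each row has exactly one entry per distinct item, which is why the real dataframe is wide and mostly `False`.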


Then find the frequent itemsets with `min_support=0.01` and `max_len=2`, so that only itemsets of at most two items are kept.

Finally, derive the association rules from these itemsets with `metric='confidence'` and `min_threshold=0.1`.

In [4]:
# Find the frequent itemsets with min_support=0.01 and max_len=2
freq_items = apriori(df, min_support=0.01, use_colnames=True, max_len=2)

# Find the association rules with metric='confidence' and min_threshold=0.1
rules = association_rules(freq_items, metric="confidence", min_threshold=0.1)
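
As a reminder of what these thresholds filter on, the basic metrics can be computed by hand. The sketch below uses a hypothetical five-transaction dataset and made-up helper names; it illustrates the definitions rather than how `apriori` or `association_rules` work internally.

```python
# Hypothetical toy dataset: each transaction is a set of items.
toy = [
    {"soda", "rolls/buns"},
    {"soda", "whole milk"},
    {"rolls/buns", "whole milk"},
    {"soda", "rolls/buns", "whole milk"},
    {"whole milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    joint = support(set(antecedent) | set(consequent), transactions)
    confidence = joint / support(antecedent, transactions)
    lift = confidence / support(consequent, transactions)
    return joint, confidence, lift


sup, conf, lift = rule_metrics({"soda"}, {"rolls/buns"}, toy)
print(f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

In this toy data the rule soda → rolls/buns has support 0.4 and confidence of about 0.67, so it would pass both thresholds used in the exercise (0.01 and 0.1).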

Look at the rules you found that have "soda" as the antecedent. Are all of these rules interesting?

<h2> Important: This question is related to the Moodle quiz question 1. </h2>


In [5]:
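# Extract the rules with 'soda' as antecedent and sort them by lift (highest first).
# A notebook cell only displays its last expression, so store the filtered result.
soda_rules = rules[rules["antecedents"].astype(str).str.contains("soda")].sort_values(by="lift", ascending=False)
# Preview the first rows of the full rule set (the output shown below).
# Display `soda_rules` instead to answer the question.
rules.head()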

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (beef) | (other vegetables) | 0.05246 | 0.193473 | 0.019723 | 0.375969 | 1.943264 | 1.0 | 0.009574 | 1.292447 | 0.512276 | 0.087191 | 0.226274 | 0.238957 |
| 1 | (other vegetables) | (beef) | 0.193473 | 0.05246 | 0.019723 | 0.101944 | 1.943264 | 1.0 | 0.009574 | 1.055101 | 0.601842 | 0.087191 | 0.052224 | 0.238957 |
| 2 | (beef) | (rolls/buns) | 0.05246 | 0.183916 | 0.013623 | 0.25969 | 1.412001 | 1.0 | 0.003975 | 1.102354 | 0.30794 | 0.061159 | 0.09285 | 0.166882 |
| 3 | (beef) | (root vegetables) | 0.05246 | 0.108987 | 0.017385 | 0.331395 | 3.040676 | 1.0 | 0.011668 | 1.332645 | 0.708283 | 0.120677 | 0.249613 | 0.245455 |
| 4 | (root vegetables) | (beef) | 0.108987 | 0.05246 | 0.017385 | 0.159515 | 3.040676 | 1.0 | 0.011668 | 1.127372 | 0.753217 | 0.120677 | 0.112982 | 0.245455 |
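
When judging whether a rule is interesting, it helps to remember how the columns relate: confidence is the rule support divided by the antecedent support, and lift is the confidence divided by the consequent support, so a lift close to 1 means the antecedent tells us little about the consequent. The check below recomputes both for two rows of the output above. The values are copied from the table as displayed (rounded), so the agreement is only approximate.

```python
# (antecedent support, consequent support, support, confidence, lift), copied from the table
table_rows = {
    "beef -> other vegetables": (0.05246, 0.193473, 0.019723, 0.375969, 1.943264),
    "beef -> rolls/buns": (0.05246, 0.183916, 0.013623, 0.25969, 1.412001),
}

recomputed = {}
for rule, (ant_sup, cons_sup, sup, conf, lift) in table_rows.items():
    conf_check = sup / ant_sup          # confidence = support / antecedent support
    lift_check = conf_check / cons_sup  # lift = confidence / consequent support
    recomputed[rule] = (conf_check, lift_check)
    print(f"{rule}: confidence {conf_check:.3f} (table {conf:.3f}), "
          f"lift {lift_check:.3f} (table {lift:.3f})")
```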


Look at the rules you found that have "butter" as the antecedent. Are all of these rules interesting?

<h2> Important: This question is related to the Moodle quiz question 2. </h2>

In [6]:
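# Extract the rules with 'butter' as antecedent and sort them by confidence (highest first).
# A notebook cell only displays its last expression, so store the filtered result.
butter_rules = rules[rules["antecedents"].astype(str).str.contains("butter")].sort_values(by="confidence", ascending=False)
# Preview the first rows of the full rule set (the output shown below).
# Display `butter_rules` instead to answer the question.
rules.head()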

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (beef) | (other vegetables) | 0.05246 | 0.193473 | 0.019723 | 0.375969 | 1.943264 | 1.0 | 0.009574 | 1.292447 | 0.512276 | 0.087191 | 0.226274 | 0.238957 |
| 1 | (other vegetables) | (beef) | 0.193473 | 0.05246 | 0.019723 | 0.101944 | 1.943264 | 1.0 | 0.009574 | 1.055101 | 0.601842 | 0.087191 | 0.052224 | 0.238957 |
| 2 | (beef) | (rolls/buns) | 0.05246 | 0.183916 | 0.013623 | 0.25969 | 1.412001 | 1.0 | 0.003975 | 1.102354 | 0.30794 | 0.061159 | 0.09285 | 0.166882 |
| 3 | (beef) | (root vegetables) | 0.05246 | 0.108987 | 0.017385 | 0.331395 | 3.040676 | 1.0 | 0.011668 | 1.332645 | 0.708283 | 0.120677 | 0.249613 | 0.245455 |
| 4 | (root vegetables) | (beef) | 0.108987 | 0.05246 | 0.017385 | 0.159515 | 3.040676 | 1.0 | 0.011668 | 1.127372 | 0.753217 | 0.120677 | 0.112982 | 0.245455 |
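
One caveat with filtering on `astype(str).str.contains(...)`: it is a substring match on the printed frozenset, so a pattern such as "butter" would also match an item named "butter milk" if one is present. Below is a sketch of an exact-membership filter on hypothetical antecedent sets. On the real `rules` dataframe the analogous filter would be `rules[rules["antecedents"].apply(lambda s: "butter" in s)]`.

```python
# Hypothetical antecedents, stored as frozensets like in the mlxtend rules table.
antecedents = [
    frozenset({"butter"}),
    frozenset({"butter milk"}),
    frozenset({"whole milk"}),
]

# Substring match on the string form: also catches "butter milk".
substring_hits = [a for a in antecedents if "butter" in str(a)]

# Exact membership test: only sets that contain the item "butter" itself.
exact_hits = [a for a in antecedents if "butter" in a]

print(len(substring_hits), len(exact_hits))
```

Which behaviour is wanted depends on the question being asked, so it is worth checking the filtered rows before drawing conclusions from them.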
