# Practice Session 04: Basket analysis

Association rule mining techniques are useful for finding common patterns of items in large data sets. One specific application, called **market basket analysis**, is useful for online shops: if we know that items A and B are frequently bought together, we can design actions to increase profit, such as:

- A and B can be placed together, so that a customer who buys one of the products does not have to go far to find the other.
- People who buy one of the products can be targeted through an advertisement campaign to buy the other.
- Collective discounts can be offered on these products if the customer buys both of them.
- Both A and B can be packaged together.

# 0. Preliminaries

## 0.1. Dataset

In this practice we are using a dataset contained in `dataset_associationrules.csv` with 1000 customers that purchased up to 8 different services from a portfolio of a Big Internet Player. The portfolio includes:

- Web hosting
- Office suite that includes email and office tools such as documents, spreadsheets and presentations
- Security solutions to protect against cyber-attacks
- Cloud sub-product: infrastructure as a service
- Cloud sub-product: platform as a service
- Content management systems such as WordPress, Joomla!, Drupal, etc.
- Chatbot for customer care
- Advertising

Each record (row) corresponds to a company and each column represents one of the products from the portfolio and can take the value 1 if the product was purchased or 0 if it was not.

## 0.2. Imports

In [17]:
import numpy as np  
import matplotlib.pyplot as plt  
import pandas as pd  
from apyori import apriori

## 0.3. Load the data

Open the CSV file using "," as the separator and assign it to a dataframe variable (use `read_csv` from the pandas library).

In [18]:
dataset = pd.read_csv("dataset_associationrules.csv", sep=",")
dataset.head()

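| Unnamed: 0 | ID_customer | WEBHOSTING | OFFICESUITE | SECURITY | CLOUD_IAAS | CLOUD_PAAS | CONTENTMGM | CHATBOT | ADVERTISING |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |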


## 0.4. The Apriori Algorithm in a nutshell
The Apriori algorithm relies on three main metrics:

- Support: the overall popularity of an item, calculated as the number of transactions containing that item divided by the total number of transactions. For an item A:

<center>**Support(A) = (Transactions containing A) / (Total transactions)**</center>

- Confidence: the likelihood that item B is also bought when item A is bought, calculated as the number of transactions where A and B are bought together divided by the number of transactions where A is bought:

<center>**Confidence(A→B) = (Transactions containing both A and B) / (Transactions containing A)**</center>

- Lift: Lift(A→B) is the increase in the ratio of sales of B when A is sold, calculated as Confidence(A→B) divided by Support(B):

<center>**Lift(A→B) = Confidence(A→B) / Support(B)**</center>

<UL> A lift of 1 means that there is no association between products A and B. A lift greater than 1 means that A and B are bought together more often than would be expected if they were independent. A lift below 1 means that the two products are bought together less often than expected.</UL>
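
The three metrics can be checked by hand on a very small example. The following is a minimal sketch in plain Python. The five transactions are hypothetical and are not taken from `dataset_associationrules.csv`, and the helper names `support`, `confidence` and `lift` are illustrative only.

```python
# Hypothetical toy transactions (not taken from dataset_associationrules.csv)
transactions = [
    {"SECURITY", "CONTENTMGM"},
    {"SECURITY", "CONTENTMGM", "WEBHOSTING"},
    {"SECURITY"},
    {"OFFICESUITE"},
    {"SECURITY", "OFFICESUITE"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(base, added):
    """Support of base and added together, divided by the support of base."""
    return support(set(base) | set(added)) / support(base)


def lift(base, added):
    """Confidence of base -> added, divided by the support of added."""
    return confidence(base, added) / support(added)


print(support({"CONTENTMGM"}))                    # 0.4
print(confidence({"CONTENTMGM"}, {"SECURITY"}))   # 1.0
print(lift({"CONTENTMGM"}, {"SECURITY"}))         # 1.25
print(lift({"OFFICESUITE"}, {"SECURITY"}))        # 0.625
```

In this toy data, CONTENTMGM → SECURITY has a lift above 1, while OFFICESUITE → SECURITY has a lift below 1.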

# 1. Apriori algorithm

## 1.1. Exploratory data analysis

[**REPORT**] Print the head with the first 5 records

[**REPORT**] Evaluate the dimension of the dataset and the type of the given variables (float, string, integer, etc.).

Several algorithms have been developed for association rule mining, and Apriori is one of them. In this practice we focus on the Apriori algorithm, which we will later apply to our dataset.

We will use an existing implementation of the Apriori algorithm from the [apyori library](https://pypi.org/project/apyori/) to find out which products are commonly sold together.

*Note: If the apyori library is not already installed on your laptop, you can install it with: `pip install apyori`*

## 1.2. Data preparation

The **Apriori** library we are going to use requires our dataset to be a list of lists, where each inner list is one transaction and each element is a product sold.
However, our dataset is a pandas dataframe where each row represents a customer and each column takes the value 1 if the product was sold to that customer or 0 if it was not. Therefore, we need to (1) replace the "1"s by the name of the product and (2) convert the dataframe into a list of lists.
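
To make the two steps concrete, here is a minimal sketch on a hypothetical three-column frame. The names `toy`, `toy_named`, `toy_records` and the `PRODUCT_*` columns are illustrative and do not come from the dataset. Using `Series.map` is only one of several possible ways to do the replacement.

```python
import pandas as pd

# Hypothetical toy frame; PRODUCT_A..C are illustrative names, not dataset columns
toy = pd.DataFrame({
    "PRODUCT_A": [1, 0, 1],
    "PRODUCT_B": [0, 0, 1],
    "PRODUCT_C": [1, 1, 0],
})

# Step 1: replace each 1 by the name of its column, leaving 0 untouched
toy_named = pd.DataFrame(
    {col: toy[col].map({1: col, 0: 0}) for col in toy.columns}
)

# Step 2: convert the dataframe into a list of lists (one inner list per row)
toy_records = toy_named.values.tolist()
print(toy_records)
```

At this stage the inner lists still contain `0` entries for products that were not purchased.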

[**CODE**] Replace "1"s by product names

[**CODE**] The **Apriori** algorithm does not need the **ID_customer** variable. Remove the **ID_customer** column.

At this point, your dataset should look like this:

In [23]:
dataset.head()

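| Unnamed: 0 | WEBHOSTING | OFFICESUITE | SECURITY | CLOUD_IAAS | CLOUD_PAAS | CONTENTMGM | CHATBOT | ADVERTISING |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | SECURITY | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | OFFICESUITE | SECURITY | 0 | 0 | 0 | 0 | 0 |
| 2 | WEBHOSTING | 0 | SECURITY | 0 | 0 | CONTENTMGM | 0 | 0 |
| 3 | 0 | 0 | SECURITY | 0 | 0 | 0 | 0 | 0 |
| 4 | WEBHOSTING | OFFICESUITE | SECURITY | 0 | 0 | CONTENTMGM | 0 | 0 |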


[**CODE**] Convert the dataframe into a list of lists and store it in a `records` list

[**CODE**] Remove all `0` entries, remove all empty transactions, and store the result in the `records_final` list

Now everything is ready to execute the `apriori` function.

## 1.3. Algorithm execution and evaluation

[**REPORT**] Execute the Apriori algorithm using [apyori.apriori](https://pypi.org/project/apyori/) **3 times** with different minimum values for support, confidence, lift and length. **Remember to set the minimum lift to a value strictly greater than 1.0.** For each iteration:
- Indicate the number of association rules
- Create a table with the most relevant association rules and justify the results. Explain their characteristics, i.e. support, confidence and lift

The function `association_result_list` makes it easier to inspect the resulting association rules.

In [26]:
def association_result_list(association_results):
    """Print the first rule of each record returned by apriori."""
    for item in association_results:
        # item[2] holds the rule statistics of this itemset; only the first
        # entry is shown: (base items, added items, confidence, lift)
        first_rule = item[2][0]
        item_origin = [list(first_rule[0])]
        item_destin = [list(first_rule[1])]
        print("Rule: " + str(item_origin) + " -> " + str(item_destin))
        # item[1] is the support of the whole itemset
        print("Support: " + str(item[1]))
        print("Confidence: " + str(first_rule[2]))
        print("Lift: " + str(first_rule[3]))
        print("=====================================")
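
The positional indices used in this function are easier to follow with field names. The sketch below rebuilds one record by hand, using the CONTENTMGM → SECURITY values from the sample output in this section. The two named tuples are local stand-ins that are assumed to mirror the field layout of apyori's `RelationRecord` and `OrderedStatistic`; they are not imported from the library.

```python
from collections import namedtuple

# Local stand-ins assumed to mirror the field layout of apyori's result tuples
RelationRecord = namedtuple(
    "RelationRecord", ["items", "support", "ordered_statistics"]
)
OrderedStatistic = namedtuple(
    "OrderedStatistic", ["items_base", "items_add", "confidence", "lift"]
)

# One record rebuilt by hand from the sample output in this section
record = RelationRecord(
    items=frozenset({"CONTENTMGM", "SECURITY"}),
    support=0.138,
    ordered_statistics=[
        OrderedStatistic(
            items_base=frozenset({"CONTENTMGM"}),
            items_add=frozenset({"SECURITY"}),
            confidence=0.9078947368421053,
            lift=1.4932479224376733,
        )
    ],
)

first_rule = record.ordered_statistics[0]

# Positional and named access refer to the same values
print(record[1] == record.support)                 # True
print(record[2][0][2] == first_rule.confidence)    # True
print(record[2][0][3] == first_rule.lift)          # True
```

A record can carry more than one entry in `ordered_statistics`, and the function above prints only the first one.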

Your output should look similar to this one, but numbers may vary depending on the lift and confidence parameters that you provide.

In [30]:
association_result_list(association_results)

Rule: [['CLOUD_IAAS']] -> [['OFFICESUITE']]
Support: 0.03
Confidence: 0.4477611940298507
Lift: 2.5440976933514245
Rule: [['CLOUD_PAAS']] -> [['OFFICESUITE']]
Support: 0.005
Confidence: 0.8333333333333334
Lift: 4.734848484848485
Rule: [['CONTENTMGM']] -> [['OFFICESUITE']]
Support: 0.044
Confidence: 0.2894736842105263
Lift: 1.6447368421052633
Rule: [['CONTENTMGM']] -> [['SECURITY']]
Support: 0.138
Confidence: 0.9078947368421053
Lift: 1.4932479224376733
Rule: [['CONTENTMGM']] -> [['WEBHOSTING']]
Support: 0.055
Confidence: 0.3618421052631579
Lift: 1.3205916250480214
Rule: [['CLOUD_IAAS', 'CONTENTMGM']] -> [['OFFICESUITE']]
Support: 0.009
Confidence: 0.8181818181818181
Lift: 4.648760330578512
Rule: [['CLOUD_IAAS', 'OFFICESUITE']] -> [['SECURITY']]
Support: 0.023
Confidence: 0.7666666666666667
Lift: 1.2609649122807018
Rule: [['CLOUD_IAAS', 'WEBHOSTING']] -> [['OFFICESUITE']]
Support: 0.008
Confidence: 0.4444444444444445
Lift: 2.5252525252525255
Rule: [['CLOUD_IAAS', 'WEBHOSTING']] -> [['SECU

[**REPORT**] Considering the previous results:

- As a data scientist, what is your main recommendation to the Big Internet Player to increase sales? Explain why.
- When a customer purchases **CLOUD_PAAS**, which product do they usually buy as well? Why?
- Describe the type of customer that purchases the **OFFICESUITE** product.
- Indicate two products that do **NOT** usually appear together. Why?

# 2. Deliver

Deliver:

* A zip file containing your notebook (.ipynb file) with all the [**CODE**] parts implemented.
* A PDF report of a maximum of 2 pages including all parts of this notebook marked with "[**REPORT**]"

The report should end with the following statement: **I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.**