### About Dataset

https://archive.ics.uci.edu/ml/datasets/Online+Retail+II

The **Online Retail II** dataset contains all transactions of a UK-based, registered, non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion giftware. Many of its customers are wholesalers.

**Definition of Variables**

- Invoice: Invoice number, a unique identifier for each transaction. Refund invoice numbers start with "C"
- StockCode: Unique product code
- Description: Product name
- Quantity: Quantity of each product in the invoice
- InvoiceDate: Date and time of the purchase
- Price: Unit price of a product (in pounds sterling)
- CustomerID: Unique customer identifier
- Country: Residential country of customers

<br>

<hr>

### Import Libraries

In [2]:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

<br>

<hr>

### Functions

In [30]:
def outlier_thresholds(dataframe, variable):
    quartile1 = dataframe[variable].quantile(0.01) 
    quartile3 = dataframe[variable].quantile(0.99) 
    
    # The 0.01 and 0.99 quantiles are used instead of the usual 0.25 and 0.75,
    # which are not a good choice for this data
    
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit

def replace_with_thresholds(dataframe, variable):
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit

def retail_data_prep(dataframe):
    dataframe = dataframe.dropna()
    # Drop refund invoices, whose invoice numbers are marked with "C"
    dataframe = dataframe[~dataframe["Invoice"].str.contains("C", na=False)]
    dataframe = dataframe[dataframe["Quantity"] > 0]
    # Copy so the in-place capping below writes to an independent frame
    dataframe = dataframe[dataframe["Price"] > 0].copy()
    replace_with_thresholds(dataframe, "Quantity")
    replace_with_thresholds(dataframe, "Price")
    return dataframe


def create_invoice_product_df(dataframe, id=False):
    # Pivot to an invoice x product matrix of 0/1 flags (1 = product appears in the invoice)
    product_col = "StockCode" if id else "Description"
    quantities = dataframe.groupby(["Invoice", product_col])["Quantity"].sum().unstack().fillna(0)
    return (quantities > 0).astype(int)


def check_id(dataframe, stock_code):
    product_name = dataframe[dataframe["StockCode"] == stock_code][["Description"]].values[0].tolist()
    return product_name


def create_rules(dataframe,
                 id=True,
                 country="France",
                 min_support=0.01,
                 min_threshold=0.01):
    
    dataframe = dataframe[dataframe['Country'] == country]
    dataframe = create_invoice_product_df(dataframe, id)
    frequent_itemsets = apriori(dataframe, min_support=min_support, use_colnames=True)
    rules = association_rules(frequent_itemsets, metric="support", min_threshold=min_threshold)
    return rules
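
The capping logic in `outlier_thresholds` and `replace_with_thresholds` can be tried on a small example. The sketch below uses a made-up `toy` frame (not part of the dataset) and `Series.clip` as a compact stand-in for the two `.loc` assignments; it is an illustration under those assumptions, not the notebook's own code.

```python
import pandas as pd

# Hypothetical toy data: quantities 1..100 plus one extreme value
toy = pd.DataFrame({"Quantity": [float(q) for q in range(1, 101)] + [5000.0]})

# Same thresholds as outlier_thresholds: 0.01 / 0.99 quantiles, widened by 1.5x their spread
low_q = toy["Quantity"].quantile(0.01)
high_q = toy["Quantity"].quantile(0.99)
spread = high_q - low_q
low_limit = low_q - 1.5 * spread
up_limit = high_q + 1.5 * spread

# clip() caps values outside [low_limit, up_limit]; values inside are left unchanged
capped = toy["Quantity"].clip(lower=low_limit, upper=up_limit)

print("limits:", low_limit, up_limit)
print("max before:", toy["Quantity"].max(), "max after:", capped.max())
```

Only the extreme value is pulled down to the upper limit; the ordinary quantities keep their original values.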

<br>

<hr>

### Read Dataset and create rules for ARL with Apriori Method

In [31]:
df_ = pd.read_excel("dataset/online_retail_II.xlsx",
                    sheet_name="Year 2010-2011")
df = df_.copy()

df = retail_data_prep(df)
rules = create_rules(df)
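
Before Apriori runs, `create_rules` pivots the transactions into an invoice x product matrix. A minimal sketch of that step on made-up data (the `toy` frame and its values are hypothetical) might look like this:

```python
import pandas as pd

# Hypothetical transactions: three invoices, three products
toy = pd.DataFrame({
    "Invoice": ["A1", "A1", "A2", "A2", "A3"],
    "StockCode": [10, 20, 10, 30, 20],
    "Quantity": [2, 1, 5, 1, 3],
})

# One row per invoice, one column per product, True where the product was bought
basket = (
    toy.groupby(["Invoice", "StockCode"])["Quantity"]
    .sum()
    .unstack(fill_value=0)
    > 0
)

print(basket)
# The column means are the single-item supports (share of invoices containing each product)
print(basket.mean())
```

A matrix of this shape is what `apriori` receives; the boolean flags play the same role as the 0/1 values built by `create_invoice_product_df`.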



<br>

<hr>

### Show only rules above the pre-decided support/confidence/lift thresholds

In [32]:
rules[(rules["support"]>0.05) & (rules["confidence"]>0.1) & (rules["lift"]>5)]. \
sort_values("confidence", ascending=False)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 23707 | (21080, 21094) | (21086) | 0.102828 | 0.138817 | 0.100257 | 0.975000 | 7.023611 | 0.085983 | 34.447301 |
| 23706 | (21080, 21086) | (21094) | 0.102828 | 0.128535 | 0.100257 | 0.975000 | 7.585500 | 0.087040 | 34.858612 |
| 108820 | (21080, POST, 21086) | (21094) | 0.084833 | 0.128535 | 0.082262 | 0.969697 | 7.544242 | 0.071358 | 28.758355 |
| 108821 | (21080, POST, 21094) | (21086) | 0.084833 | 0.138817 | 0.082262 | 0.969697 | 6.985410 | 0.070486 | 28.419023 |
| 1777 | (21094) | (21086) | 0.128535 | 0.138817 | 0.123393 | 0.960000 | 6.915556 | 0.105550 | 21.529563 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7212 | (22629) | (22630) | 0.125964 | 0.100257 | 0.071979 | 0.571429 | 5.699634 | 0.059351 | 2.099400 |
| 62249 | (22630) | (POST, 22629) | 0.100257 | 0.100257 | 0.053985 | 0.538462 | 5.370809 | 0.043933 | 1.949443 |
| 62244 | (POST, 22629) | (22630) | 0.100257 | 0.100257 | 0.053985 | 0.538462 | 5.370809 | 0.043933 | 1.949443 |
| 62248 | (22629) | (POST, 22630) | 0.125964 | 0.074550 | 0.053985 | 0.428571 | 5.748768 | 0.044594 | 1.619537 |
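
The metric columns are related by simple formulas. As a sanity check, the sketch below recomputes confidence, lift, leverage and conviction for the first row of the output above from its three support values. The inputs are the rounded figures shown in that row, so the results agree with the table only up to rounding.

```python
# Values copied from the first row of the output: (21080, 21094) -> (21086)
antecedent_support = 0.102828  # share of invoices containing both antecedent items
consequent_support = 0.138817  # share of invoices containing the consequent item
support = 0.100257             # share of invoices containing all three items

# Confidence: how often the consequent appears when the antecedent is present
confidence = support / antecedent_support
# Lift: confidence relative to the consequent's baseline frequency
lift = confidence / consequent_support
# Leverage: observed co-occurrence minus what independence would predict
leverage = support - antecedent_support * consequent_support
# Conviction: how much more often the rule would fail if the two sides were independent
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.4f} lift={lift:.4f} "
      f"leverage={leverage:.4f} conviction={conviction:.2f}")
```

A lift well above 1, as here, means the products are bought together far more often than their individual frequencies would suggest.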


<br>

<hr>

### Recommendation 

In [37]:
def arl_recommender(rules_df, product_id, rec_count=1):
    print("Selected Product ID:",product_id," Product Name:",check_id(df, product_id),"\n")
    sorted_rules = rules_df.sort_values("lift", ascending=False)
    recommendation_list = []
    for i, product in enumerate(sorted_rules["antecedents"]):
        for j in list(product):
            if j == product_id:
                recommendation_list.append(list(sorted_rules.iloc[i]["consequents"])[0])

    # Slice instead of indexing so that fewer than rec_count matches does not raise IndexError
    for i, rec_id in enumerate(recommendation_list[:rec_count]):
        print(f"Recommended Product {i+1}, Product Id: {rec_id}, Product Name: {check_id(df, rec_id)}")

In [38]:
arl_recommender(rules_df = rules,
                product_id = 22492,
                rec_count = 3)

Selected Product ID: 22492  Product Name: ['MINI PAINT SET VINTAGE '] 

Recommended Product 1, Product Id: 22556, Product Name: ['PLASTERS IN TIN CIRCUS PARADE ']
Recommended Product 2, Product Id: 22551, Product Name: ['PLASTERS IN TIN SPACEBOY']
Recommended Product 3, Product Id: 22326, Product Name: ['ROUND SNACK BOXES SET OF4 WOODLAND ']
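
A variant that returns the recommended IDs instead of printing them can be easier to reuse and test. The sketch below is one possible shape, not the notebook's method: `recommend_ids` and `toy_rules` are hypothetical names, and unlike `arl_recommender` it considers every item in a rule's consequents and skips duplicates.

```python
import pandas as pd

def recommend_ids(rules_df, product_id, rec_count=1):
    """Return up to rec_count distinct consequent items from rules
    whose antecedents contain product_id, highest lift first."""
    sorted_rules = rules_df.sort_values("lift", ascending=False)
    recommendations = []
    for antecedents, consequents in zip(sorted_rules["antecedents"],
                                        sorted_rules["consequents"]):
        if product_id not in antecedents:
            continue
        for item in consequents:
            if item not in recommendations:
                recommendations.append(item)
    return recommendations[:rec_count]

# Hypothetical rules in the same layout that association_rules produces
toy_rules = pd.DataFrame({
    "antecedents": [frozenset({1}), frozenset({1, 2}), frozenset({3}), frozenset({1})],
    "consequents": [frozenset({2}), frozenset({3}), frozenset({1}), frozenset({3})],
    "lift": [2.0, 5.0, 4.0, 1.5],
})

print(recommend_ids(toy_rules, product_id=1, rec_count=5))
```

Asking for more recommendations than there are matching rules simply returns a shorter list.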
