
# Background and Problem Statement

Care five is a German multinational retail corporation headquartered in Berlin, Germany.
It is the eighth-largest retailer in the world by revenue. It operates a chain of
hypermarkets, grocery stores, and convenience stores, which, as of January 2021,
comprised 12,000 stores in over 30 countries.

As a data analyst working for one of the stores, you must perform market basket
analysis to help the store maximize revenue. More specifically, your task is to analyze
transactional data to identify the top 10 products likely to be purchased together.
Given a dataset containing transactional data of products sold in the past week, you are
required to do the following:
1. Define the business question
2. Perform data importation and loading
3. Perform data preprocessing
4. Find frequent itemsets
5. Generate association rules
6. Perform metric interpretation and provide recommendation

## Business Question
Which are the top 10 products that are most likely to be purchased together?

## Data Importation and Loading

In [2]:
# Prerequisites

import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [17]:
# Loading the data

care_df = pd.read_csv('https://bit.ly/30A2gHO', index_col=[0])
care_df.sample(5)

A,Quantity,Transaction,Store,Product
33744,2,103646,2,Perfume
30711,1,95267,7,Prescription Med
42131,1,125657,5,Perfume
42437,1,126398,8,Toothpaste
41122,8,122756,2,Toothpaste


In [26]:
care_df.Store.unique()

array([ 6,  1,  8,  4,  7,  5, 10,  3,  2,  9])

Observations:
1. The dataset comprises five columns.
2. The first column, labeled 'A', appears to be a row identifier and is used as the DataFrame index.
3. The data was collected from 10 stores.

For this assignment, I will work with the data from Store 1.

In [22]:
# Store holds integers (see the unique() output above), so compare against the integer 1
care_df1 = care_df[care_df.Store == 1]
care_df1.sample(5)

A,Quantity,Transaction,Store,Product
35084,3,107105,1,Pens
36015,1,109346,1,Wrapping Paper
34449,1,105500,1,Toothpaste
40573,2,121502,1,Bow
39750,20,119561,1,Pencils


## Data preprocessing

#### Step 1

##### (a) Grouping the dataframe by Transaction and Product, then displaying the count of items.

In [27]:
care_df1 = care_df1.groupby(["Transaction","Product"]).size().reset_index(name="Count")
care_df1.head()

,Transaction,Product,Count
0,93197,Pencils,1
1,93248,Candy Bar,1
2,93257,Magazine,1
3,93260,Magazine,1
4,93317,Pens,1


##### (b) Consolidating the items into one transaction per row, with each item one-hot encoded.

In [28]:
care_df1 = (care_df1.groupby(['Transaction', 'Product'])['Count']
          .sum().unstack().reset_index().fillna(0)
          .set_index('Transaction'))

care_df1.head()

Transaction,Bow,Candy Bar,Deodorant,Greeting Cards,Magazine,Markers,Pain Reliever,Pencils,Pens,Perfume,Photo Processing,Prescription Med,Shampoo,Soap,Toothbrush,Toothpaste,Wrapping Paper
93197,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
93248,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
93257,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
93260,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
93317,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


##### (c) Defining a custom function that converts the counts from step 1(b) above into 0 and 1, since the Apriori implementation expects binary (0/1 or True/False) values.

In [29]:
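def encode_units(x):
    # Any count of 1 or more means the product was in the basket;
    # everything else (including values between 0 and 1) maps to 0.
    return 1 if x >= 1 else 0

# On pandas 2.1 or later, DataFrame.map is the preferred name for applymap.
care_df1 = care_df1.applymap(encode_units)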

care_df1.head()

Transaction,Bow,Candy Bar,Deodorant,Greeting Cards,Magazine,Markers,Pain Reliever,Pencils,Pens,Perfume,Photo Processing,Prescription Med,Shampoo,Soap,Toothbrush,Toothpaste,Wrapping Paper
93197,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
93248,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
93257,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
93260,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
93317,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
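
Steps 1(a) to 1(c) can also be collapsed into a single expression. The following is a minimal sketch on a made-up toy table (the `toy` and `toy_basket` names and the data are hypothetical, not part of the store dataset), assuming the same long format with one row per purchased product:

```python
import pandas as pd

# Hypothetical toy data in the same long format as the store data.
toy = pd.DataFrame({
    "Transaction": [1, 1, 2, 3, 3, 3],
    "Product": ["Pens", "Pencils", "Pens", "Pens", "Pencils", "Pens"],
})

# crosstab counts rows per (Transaction, Product) pair;
# comparing with 0 turns the counts into True/False flags.
toy_basket = pd.crosstab(toy["Transaction"], toy["Product"]) > 0
print(toy_basket)
```

The result has one row per transaction and one boolean column per product, which is the same shape as the 0/1 table above.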


#### Step 2: Finding the frequent itemsets

In [30]:
store1_frequent_itemsets = apriori(care_df1, min_support=0.01, use_colnames=True)
store1_frequent_itemsets.head()

,support,itemsets
0,0.048062,(Bow)
1,0.147287,(Candy Bar)
2,0.142636,(Greeting Cards)
3,0.251163,(Magazine)
4,0.015504,(Pain Reliever)
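
The support of an item set is the fraction of transactions that contain every item in it, and `min_support=0.01` keeps only item sets found in at least 1% of transactions. As a rough illustration, the sketch below computes support by hand on a few made-up baskets (the `baskets` list and the `support` helper are hypothetical and are not how `apriori` is implemented internally):

```python
# Hypothetical toy baskets, not taken from the store data.
baskets = [
    {"Magazine", "Candy Bar"},
    {"Magazine"},
    {"Greeting Cards", "Magazine"},
    {"Pens"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

print(support({"Magazine"}, baskets))               # 0.75
print(support({"Magazine", "Candy Bar"}, baskets))  # 0.25
```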


#### Step 3: Generating the association rules

In [32]:
store1_rules = association_rules(store1_frequent_itemsets, metric="lift", min_threshold=1)

# Sorting the rules by confidence, highest first
store1_rules.sort_values("confidence", ascending = False, inplace = True)

# Previewing the top 10 association rules
store1_rules.head(10)

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
53,"(Pencils, Greeting Cards)",(Magazine),0.021705,0.251163,0.013953,0.642857,2.559524,0.008502,2.096744
60,"(Toothpaste, Greeting Cards)",(Magazine),0.037209,0.251163,0.021705,0.583333,2.322531,0.01236,1.797209
22,"(Candy Bar, Magazine)",(Greeting Cards),0.04186,0.142636,0.021705,0.518519,3.635266,0.015735,1.78068
23,"(Candy Bar, Greeting Cards)",(Magazine),0.04186,0.251163,0.021705,0.518519,2.064472,0.011192,1.555277
30,"(Pencils, Greeting Cards)",(Candy Bar),0.021705,0.147287,0.010853,0.5,3.394737,0.007656,1.705426
59,"(Magazine, Toothpaste)",(Greeting Cards),0.044961,0.142636,0.021705,0.482759,3.384558,0.015292,1.657571
42,"(Pencils, Magazine)",(Candy Bar),0.029457,0.147287,0.013953,0.473684,3.216066,0.009615,1.620155
52,"(Pencils, Magazine)",(Greeting Cards),0.029457,0.142636,0.013953,0.473684,3.320938,0.009752,1.628992
11,(Greeting Cards),(Magazine),0.142636,0.251163,0.058915,0.413043,1.644525,0.02309,1.275797
47,"(Candy Bar, Toothpaste)",(Magazine),0.034109,0.251163,0.013953,0.409091,1.628788,0.005387,1.267263
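
The rules above are ranked by confidence, but the same table can be ranked by any other metric. The sketch below re-ranks three rows copied from the output above by lift; the `rules_sample` frame and its plain-string item sets are a simplification for illustration, not the actual `store1_rules` object:

```python
import pandas as pd

# Three rows copied from the output above; the frame name is hypothetical.
rules_sample = pd.DataFrame({
    "antecedents": ["Pencils, Greeting Cards", "Candy Bar, Magazine", "Greeting Cards"],
    "consequents": ["Magazine", "Greeting Cards", "Magazine"],
    "confidence": [0.642857, 0.518519, 0.413043],
    "lift": [2.559524, 3.635266, 1.644525],
})

# Same rules, ranked by lift instead of confidence.
by_lift = rules_sample.sort_values("lift", ascending=False)
print(by_lift[["antecedents", "consequents", "lift"]])
```

With this ordering, {Candy Bar, Magazine} -> {Greeting Cards} moves to the top even though its confidence is lower than that of the first rule.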


## Metric interpretation | Recommendations

1. The above output shows the 10 association rules with the highest confidence, that is, the product combinations most likely to be purchased together.
2. The output is sorted by the confidence metric in descending order.
3. All 10 rules have a support value greater than 1%, consistent with the `min_support=0.01` threshold. This means that each of these item sets appears in more than 1% of the store's transactions.
4. Lift compares how often the two item sets are bought together with how often that would happen if they were purchased independently. All 10 rules have a lift greater than one, which means that customers who buy the first item set are more likely than average to buy the second, and the lift is the factor by which that likelihood increases.
5. The first rule, {Pencils, Greeting Cards} -> {Magazine}, has a support value of 0.013953, which means that nearly 1.4% of all transactions in this store contain this combination.
6. The confidence of the first rule is 0.642857. This means that, of the customers who purchase the item set {Pencils, Greeting Cards}, about 64% also purchase a Magazine, which is the second item set.
7. The top five rules all have a confidence of at least 50% (the fifth is exactly 0.5), which is a strong indicator of an opportunity to cross-sell.
8. Of the top 10 rules, the one with the greatest lift is {Candy Bar, Magazine} -> {Greeting Cards}, with a lift of 3.635266. This means that customers who purchase the {Candy Bar, Magazine} item set are about 3.6 times as likely to purchase {Greeting Cards} as customers in general (a confidence of 51.9% against a baseline support of 14.3%). This is a great opportunity for the store to stock a wide variety of Greeting Cards and to encourage staff to suggest them to customers purchasing Candy Bars and Magazines.
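
As a sanity check on the interpretation above, the metrics of the first rule can be recomputed from its three support values using the standard definitions of confidence, lift, leverage, and conviction. The inputs are copied from the rules table and are rounded to six decimals, so the results should match the reported values only approximately:

```python
# Values copied from the first row of the rules table:
# {Pencils, Greeting Cards} -> {Magazine}
antecedent_support = 0.021705
consequent_support = 0.251163
rule_support = 0.013953

# confidence: share of antecedent baskets that also contain the consequent
confidence = rule_support / antecedent_support
# lift: confidence relative to the consequent's baseline support
lift = confidence / consequent_support
# leverage: observed joint support minus the support expected under independence
leverage = rule_support - antecedent_support * consequent_support
# conviction: how often the rule would be wrong under independence vs. observed
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.4f} lift={lift:.4f} "
      f"leverage={leverage:.4f} conviction={conviction:.4f}")
```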