<div id="container" style="position:relative;">
<div style="float:left"><h1> Forecasting Bakery Sales - Abi Magnall </h1></div>
<div style="position:relative; float:right"><img style="height:65px" src ="https://twomagpiesbakery.co.uk/wp-content/uploads/2020/11/logo-no-site.jpg" />
</div>
</div>

# Notebook 11 : Market Basket Analysis

---

**This notebook outlines the steps to carried out to perform market analysis on the four bakeries to identify rules of association between products.**

Using the Market Basket Analysis, rules of association for each of the bakeries can be identified. The different types of relationships that will be explored and attempted to identified include: 

- **Complementary products**: products which are often bought together, like `Tea` and `Scones`
- **Substitute products**: products which replace each other, like `Tea` and `Coffee`
- **Trigger products**: products which when bought, trigger other purchases
- **Common Baskets**: combinations of products that are often bought together 

**Initial Hypothesis:**
- Based on the EDA it appeared that `Coffee` had a greater correlation with other products than `Tea` did, so it is hypothesised that Coffee is often bought with other products 
- Speaking to the bakery owner, the customers that frequently visit the different shops have different preferences and spending habits. It is thought that `Aldeburgh` customers favour savoury products, whilst `Southwold` favours sweet, `Darsham` and `Norwich` have no preference. This analysis will help to identify and validate if that is the case. 

---

# Contents  

**1. [Bakery Data EDA](#Bakery-Data-EDA)**

**2. [Aldeburgh Basket Analysis](#Aldeburgh-Basket-Analysis)**
- [Aldeburgh Support Observations](#Aldebugh-Support-Observations)
- [Apriori Algorithm for Aldeburgh](#Apriori-Algorithm-for-Aldeburgh)
- [Determining Rules of Association for Aldeburgh](#Determining-Rules-of-Association-for-Aldeburgh)
- [Aldeburgh Basket Observations](#Aldeburgh-Basket-Observations)

**3. [Southwold Basket Analysis](#Southwold-Basket-Analysis)**
- [Southwold Support Observations](#Southwold-Support-Observations)
- [Apriori Algorithm for Southwold](#Apriori-Algorithm-for-Southwold)
- [Determining Rules of Association for Southwold](#Determining-Rules-of-Association-for-Southwold)
- [Southwold Basket Observations](#Southwold-Basket-Observations)

**4. [Darsham Basket Analysis](#Darsham-Basket-Analysis)**
- [Darsham Support Observations](#Darsham-Support-Observations)
- [Apriori Algorithm for Darsham](#Apriori-Algorithm-for-Darsham)
- [Determining Rules of Association for Darsham](#Determining-Rules-of-Association-for-Darsham)
- [Darsham Basket Observations](#Darsham-Basket-Observations)

**5. [Norwich Basket Analysis](#Norwich-Basket-Analysis)**
- [Norwich Support Observations](#Norwich-Support-Observations)
- [Apriori Algorithm for Norwich](#Apriori-Algorithm-for-Norwich)
- [Determining Rules of Association for Norwich](#Determining-Rules-of-Association-for-Norwich)
- [Norwich Basket Observations](#Norwich-Basket-Observations)

**6. [Summary](#Summary)**

**7. [Recommendations](#Recommendations)**

___

## Importing Libraries

In [1]:
# Main libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.preprocessing import TransactionEncoder
import os
import datetime as dt
import plotly.express as px
import plotly.graph_objects as go
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules


from plotly.subplots import make_subplots
from pandas.tseries.offsets import DateOffset

## Importing Custom Functions

In [2]:
import BakeryFunctions as bakery

## To Get Current Directory

In [3]:
working_directory = os.getcwd()
working_directory

'/Users/abimagnall/Documents/BrainStation/Capstone/Data/Abi_Magnall_Captsone_Project'

## Importing Preprocessed Dataset

In [4]:
aldeburgh = pd.read_csv(working_directory+'/4_processed_data/aldeburgh_basket.csv', index_col=0)
southwold = pd.read_csv(working_directory+'/4_processed_data/southwold_basket.csv', index_col=0)
darsham = pd.read_csv(working_directory+'/4_processed_data/darsham_basket.csv', index_col=0)
norwich = pd.read_csv(working_directory+'/4_processed_data/norwich_basket.csv', index_col=0)

---

# Bakery Data EDA
Basic EDA will be performed on the imported datasets to confirm they contain the correct data, columns, datatypes and no missing or duplicated rows. 

In [5]:
bakery.basic_eda(aldeburgh)

The number of duplicated rows are: 35753

The number of missing values are: 
TransactionId    0
ProductName      0
Quantity         0
dtype: int64

The first 5 rows are:



Unnamed: 0,TransactionId,ProductName,Quantity
1,325399103,Coffee,1
5,325403315,Madagascan Brownie,4
8,325407468,Madagascan Brownie,1
11,325413684,Tea,2
12,325413684,Friand,1



Information summary of the dataset:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 365332 entries, 1 to 643970
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   TransactionId  365332 non-null  int64 
 1   ProductName    365332 non-null  object
 2   Quantity       365332 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 11.1+ MB


None

In [6]:
# To check if the duplicated rows of data are actual duplicates or not
aldeburgh[aldeburgh.duplicated()]

Unnamed: 0,TransactionId,ProductName,Quantity
39,325365820,Coffee,1
49,325380962,Coffee,1
63,325395191,Coffee,1
93,325390595,Coffee,1
109,325384267,Coffee,1
...,...,...,...
643853,467527572,Coffee,1
643862,467527034,Coffee,1
643913,467506322,Coffee,1
643947,467511980,Coffee,1


---

In [7]:
# To validate southwold dataset
bakery.basic_eda(southwold)

The number of duplicated rows are: 23038

The number of missing values are: 
TransactionId    0
ProductName      0
Quantity         0
dtype: int64

The first 5 rows are:



Unnamed: 0,TransactionId,ProductName,Quantity
0,368108160,Coffee,1
2,368121026,Coffee,1
3,368121026,Bakewell,1
4,368121026,Croissant,1
5,368121026,Pain Au Chocolate,1



Information summary of the dataset:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 304736 entries, 0 to 552753
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   TransactionId  304736 non-null  int64 
 1   ProductName    304736 non-null  object
 2   Quantity       304736 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 9.3+ MB


None

In [8]:
# To check if the duplicated rows of data are actual duplicates or not
southwold[southwold.duplicated()]

Unnamed: 0,TransactionId,ProductName,Quantity
19,368128682,Coffee,1
26,368140346,Coffee,1
29,368142803,Coffee,1
39,368156375,Coffee,1
40,368156375,Coffee,1
...,...,...,...
552635,467582197,Coffee,1
552640,467584100,Coffee,1
552671,467589962,Coffee,1
552698,467594014,Tea,1


---

In [9]:
# To check if the duplicated rows of data are actual duplicates or not
darsham[darsham.duplicated()]

Unnamed: 0,TransactionId,ProductName,Quantity
22,368130843,Coffee,1
24,368136666,Coffee,1
53,368222359,Coffee,1
55,368224531,Coffee,1
91,368203620,Coffee,1
...,...,...,...
502939,467523872,Coffee,1
503013,467545199,Coffee,1
503017,467545963,Coffee,1
503093,467578869,Tea,1


In [10]:
# To validate darsham dataset
bakery.basic_eda(darsham)

The number of duplicated rows are: 17316

The number of missing values are: 
TransactionId    0
ProductName      0
Quantity         0
dtype: int64

The first 5 rows are:



Unnamed: 0,TransactionId,ProductName,Quantity
0,368107752,Pain Au Chocolate,1
1,368107752,Coffee,1
2,368107986,Magpie Sourdough,1
5,368107986,Cheese Straw,6
9,368116678,Madagascan Brownie,1



Information summary of the dataset:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 265343 entries, 0 to 503246
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   TransactionId  265343 non-null  int64 
 1   ProductName    265343 non-null  object
 2   Quantity       265343 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 8.1+ MB


None

---

In [11]:
# To validate norwich dataset
bakery.basic_eda(norwich)

The number of duplicated rows are: 12863

The number of missing values are: 
TransactionId    0
ProductName      0
Quantity         0
dtype: int64

The first 5 rows are:



Unnamed: 0,TransactionId,ProductName,Quantity
1,368106886,Tea,1
3,368107620,Coffee,1
4,368107620,Sausage Roll,1
5,368108547,Coffee,1
7,368115933,Coffee,1



Information summary of the dataset:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 161657 entries, 1 to 279144
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   TransactionId  161657 non-null  int64 
 1   ProductName    161657 non-null  object
 2   Quantity       161657 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 4.9+ MB


None

In [12]:
# To check if the duplicated rows of data are actual duplicates or not
norwich[norwich.duplicated()]

Unnamed: 0,TransactionId,ProductName,Quantity
71,368358272,Coffee,1
120,368106339,Coffee,1
162,368346295,Coffee,1
185,368119689,Coffee,1
226,368123360,Coffee,1
...,...,...,...
279071,467580556,Coffee,1
279080,467582600,Coffee,1
279102,467589600,Coffee,1
279120,467595267,Tea,1


## Observations
- There are no missing rows of data for any of the bakery datasets 
- The duplicated rows are not true duplicates but where the same product and quantity have been bought, which is shown by the different `TransactionId` for each row 

---

# Aldeburgh Basket Analysis 
In order to perform Market Basket Analysis, the data needs to be grouped by the `TransactionId` and `ProductName`, to get a series with a list of the products bought per `TransactionId`. This can then get transformed and transposed to provide a dataframe. The new dataframe will have the product names as column headers and each transaction is a row, with a boolean of True of False if that product was in that transaction. This is all performed using the `transaction_encoder` function. 

In [13]:
# Copy is taken for audit trail purposes 
aldeburgh_encoded = aldeburgh.copy()

In [14]:
# Transaction encoder function is called 
aldeburgh_encoded = bakery.transaction_encoder(aldeburgh_encoded)

In [15]:
# Validate it worked 
aldeburgh_encoded

Unnamed: 0,Almond Toast,Baguette,Bakewell,Cheese & Tomato Melt,Cheese Scone,Cheese Straw,Cinnamon Swirl,Coffee,Croissant,Croque Monsieur,Danish,Empanada,Friand,Fruit Scone,Madagascan Brownie,Magpie Sourdough,Moroccan Vegan Roll,Pain Au Chocolate,Sausage Roll,Tea
0,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False
1,False,False,False,False,False,False,True,True,False,False,False,False,True,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,True,False,False,False,False,False,False,False,False,False,True,False,True,False,False,False,False,False
4,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,True,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
189793,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False
189794,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False
189795,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
189796,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


## Observations 
It can be seen that the number of rows in the `aldeburgh_basket` dataset has reduced from 365,332 to 189,798, showing that nearly half of the transactions in the data contained multiple products. 

# Apriori Algorithm for Aldeburgh 
To perform the market basket analysis the Apriori Algorith, which is an algorithm for frequent item set mining and association rule learning over relational databses.

The algorithm works by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often (set by the minimum support hyperparameter) in the database. 

Whilst calculating the Apriori Algorithm a minimum support was set to 0.01, this was deemed suitable as setting higher would result in too few products being returned as there are c. 200,000 'baskets' being examined, and therefore the support value of each product or combination of products will be incredibly low. 

In [16]:
# alpriori algorithm called 
ald_basket = apriori(aldeburgh_encoded, min_support=0.01, use_colnames=True)

First the individual items and sets of items will be assessed with their related `support` score. The `support` score is the proportion of baskets that contain a the given item or combinations of items. In other words, it shows the top most popular baskets for the `Aldeburgh` bakery. 

From the initial EDA in [Bakery Data EDA Notebook](./5_Bakery_Data_EDA.ipynb) the top 20 products for each shop was idetntified. However, that did not include items bought together. This analysis will provide insights into the top baskets bought at each bakery. 

In [17]:
print(f'The total number of itemsets combinations of the 20 products bought in Aldeburgh is: \
{ald_basket.itemsets.count()}')

The total number of itemsets combinations of the 20 products bought in Aldeburgh is: 46


In [18]:
# To see the top 15 itemsets bought 
ald_basket.sort_values('support', ascending=False).head(15)

Unnamed: 0,support,itemsets
7,0.592467,(Coffee)
18,0.144106,(Sausage Roll)
19,0.131587,(Tea)
15,0.090143,(Magpie Sourdough)
6,0.086502,(Cinnamon Swirl)
2,0.082672,(Bakewell)
8,0.076998,(Croissant)
40,0.072087,"(Tea, Coffee)"
14,0.062156,(Madagascan Brownie)
17,0.059748,(Pain Au Chocolate)


In [19]:
# To view the bottom 15 bought transactions 
ald_basket.sort_values('support', ascending=True).head(15)

Unnamed: 0,support,itemsets
43,0.010116,"(Sausage Roll, Empanada)"
31,0.010711,"(Coffee, Croque Monsieur)"
22,0.011222,"(Sausage Roll, Bakewell)"
23,0.011338,"(Bakewell, Tea)"
41,0.011528,"(Croissant, Magpie Sourdough)"
44,0.012281,"(Sausage Roll, Moroccan Vegan Roll)"
32,0.012982,"(Coffee, Empanada)"
3,0.013051,(Cheese & Tomato Melt)
26,0.013841,"(Sausage Roll, Cheese Straw)"
10,0.013888,(Danish)


## Aldeburgh Support Observations
From the above aldeburgh basket, the different `support` values have been provided for each product or combination of products. A total of 46 different combinations of the 20 products have been bought over the 2 years, which is less than expected, suggesting that a fair few of the products are bought separately. 

**Top 15 Itemsets Observations:**

For the top 15 baskets for `Aldeburgh`, it can be seen that the majority of the products are individual products, except for `Coffee and Tea` and `Coffee and Sausage Roll`, with `Coffee` being the most popular item with a support score of c.0.60, meaning it is present in 60% of all baskets. 

The results shown half support the hypothesis that the `Aldeburgh` customers tend to prefer savoury products over sweet ones as half of the top baskets bought contain savoury items, and the top 4 products are savoury or `Tea` or `Coffee` which account for the majority of the baskets. 

The `Coffee and Sausage Roll` isn't a surprise as it is known that they are the top 2 selling products for the bakery. However, the `Coffee and Tea` is an interesting insight as it was hypothesised that these would be substitute products. This is most likley due to multiple people visiting the bakery and combing their purchases. The face that `Coffee and Tea` is the 8th most frequent basket in the shop highlights the fact that customers likely visit the bakery with other people as opposed to on their own. 

**Bottom 15 Itemsets Observations:**

`Danish`, `Almond Toast` and `Cheese & Tomato Melt` are the only individual items in the bottom 15 itemsets. Speaking to the owner, the pattern and demand for these three products has not quite been determined and therefore they have been produced less consistently than the other products. This could also suggest that `Almond Toast` and `Danish` are bought as substitute products for `Croissant`, `Pain Au Chocolate` or `Cinnamon Swirl` products. 

# Determining Rules of Association for Aldeburgh
Now the data is in the correct format, the association rules can be determined, using the `association_rules` function. 

This function returns a dataframe containing the frequently bought together products grouped into 'antecedents' and 'consequents'. The way this is interpreted is that 'given the group of antecedents, this group of consequents can be seen with a certain frequency. 

The dataframe being created will contain numerous columns, the key columns include: 
- **Support**: how often the basket occurs

- **Confidence**: to see the strength of the rule. What proportion of transactions with the first item also contain the other item(s)

- **Lift**: how much more often are the antecedent and consequent products occur together than expected if their purchase frequency were independent. It can also be interprested as a measure of how much the consequent sales could potentially be driven up by the relationship

- **Conviction**: a measure of the dependence of the consequent on the antecedent. A high value denotes that the consequent is always seen to be purchased with the antecedent

In [20]:
# Appliting association_rules function and ordering the values by the life column 
association_rules(ald_basket, metric='lift', min_threshold=1).sort_values('lift', ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
10,(Croissant),(Pain Au Chocolate),0.076998,0.059748,0.021828,0.283495,4.744871,0.017228,1.312276
11,(Pain Au Chocolate),(Croissant),0.059748,0.076998,0.021828,0.365344,4.744871,0.017228,1.454335
6,(Cinnamon Swirl),(Pain Au Chocolate),0.086502,0.059748,0.016154,0.186746,3.125579,0.010986,1.156161
7,(Pain Au Chocolate),(Cinnamon Swirl),0.059748,0.086502,0.016154,0.27037,3.125579,0.010986,1.252002
4,(Cinnamon Swirl),(Croissant),0.086502,0.076998,0.015264,0.176453,2.291663,0.008603,1.120764
5,(Croissant),(Cinnamon Swirl),0.076998,0.086502,0.015264,0.198235,2.291663,0.008603,1.139358
14,(Sausage Roll),(Moroccan Vegan Roll),0.144106,0.045638,0.012281,0.085225,1.867422,0.005705,1.043276
15,(Moroccan Vegan Roll),(Sausage Roll),0.045638,0.144106,0.012281,0.269106,1.867422,0.005705,1.171024
12,(Sausage Roll),(Empanada),0.144106,0.037687,0.010116,0.070199,1.862651,0.004685,1.034966
13,(Empanada),(Sausage Roll),0.037687,0.144106,0.010116,0.268419,1.862651,0.004685,1.169924


# Aldeburgh Basket Observations 
From the above dataframe the interesting observations and association rules determined are: 
- The top 6 baskets all contain different combinations of pastries, with high lift values. For example, for `Pain Au Chocolate` and `Croissant` the lift score is 4.7, suggesting that for the bakery customers are 4.7 times more likely to have `Pain Au Chocolate` and `Croissant` in their basket than either one of them individually 

This is an interesting observation as again the different pastry products would be assumed to be substitute products as opposed to complimentary products. This further reinforces the hypothesis that a lot of customers visiting the bakery are couples or groups of people as opposed to solo customers. 

- `Sausage Roll` and `Moroccan Vegan Roll` also appear relatively high up in the table, with a lift score of c.1.9

These products would be assumed as substitute or even competing products as one is for a certain customer audience, those who eat meat, and the other is for a completely different group of customers, those who don't eat any animal products. Again, this strengthens the hypothesis that a lot of the customers visiting the bakery are in groups of 2 or more. 

- `Coffee` did not appear at all in the top 16 baskets, suggesting that customers are more likely to puchase a `Coffee` without any other products than a `Coffee` with other products

- `Tea` and `Bakewell` are the only products that appeared in the dataframe. The lift is relatively low, suggesting that it's only slightly significant. However, this was one of the products identified in the EDA process to be strongly correlated with `Tea` and therefore likely to be purchased together 

---

# Southwold Basket Analysis
The above process will now be repeated for the remainder of the shops to identify commonly purchased together products and identify trends across the shops. 

In [21]:
# Copy is taken for audit trail purposes 
southwold_encoded = southwold.copy()

In [22]:
# The transaction_encoder is called on the southwold dataset
southwold_encoded = bakery.transaction_encoder(southwold_encoded)

In [23]:
# To validate it worked
southwold_encoded

Unnamed: 0,Almond Toast,Baguette,Bakewell,Cheese & Tomato Melt,Cheese Scone,Cheese Straw,Cinnamon Swirl,Coffee,Croissant,Croque Monsieur,Danish,Empanada,Friand,Fruit Scone,Madagascan Brownie,Magpie Sourdough,Moroccan Vegan Roll,Pain Au Chocolate,Sausage Roll,Tea
0,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False
1,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
3,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165014,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False
165015,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
165016,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False
165017,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False


In [24]:
# The apriori functin is called on the dataframe to determine the support of each basket combination 
sw_basket = apriori(southwold_encoded, min_support=0.01, use_colnames=True)

In [25]:
print(f'The total number of itemsets combinations of the 20 products bought in Southwold is: \
{sw_basket.itemsets.count()}')

The total number of itemsets combinations of the 20 products bought in Southwold is: 47


In [26]:
# The itemsets are ordered by the support value and the top 15 baskets are displayed 
sw_basket.sort_values('support', ascending=False).head(15)

Unnamed: 0,support,itemsets
7,0.51748,(Coffee)
18,0.165211,(Sausage Roll)
15,0.112823,(Magpie Sourdough)
6,0.103085,(Cinnamon Swirl)
8,0.087814,(Croissant)
2,0.082681,(Bakewell)
17,0.068901,(Pain Au Chocolate)
19,0.06478,(Tea)
14,0.063696,(Madagascan Brownie)
5,0.061332,(Cheese Straw)


In [27]:
# The itemsets are ordered by the support value and the bottom 15 baskets are displayed 
sw_basket.sort_values('support', ascending=True).head(15)

Unnamed: 0,support,itemsets
3,0.010671,(Cheese & Tomato Melt)
20,0.010841,"(Coffee, Almond Toast)"
46,0.011302,"(Coffee, Croissant, Pain Au Chocolate)"
23,0.011417,"(Sausage Roll, Bakewell)"
31,0.011999,"(Sausage Roll, Cinnamon Swirl)"
35,0.012271,"(Coffee, Fruit Scone)"
44,0.012302,"(Sausage Roll, Empanada)"
42,0.012895,"(Croissant, Magpie Sourdough)"
24,0.013186,"(Cheese Scone, Coffee)"
29,0.013295,"(Cinnamon Swirl, Magpie Sourdough)"


## Southwold Support Observations
From the above `Southwold` basket, the different `support` values have been provided for each product or combination of products. A total of 47 different combinations of the 20 products have been bought over the 2 years, which is less than expected, suggesting that a fair few of the products are bought separately. 

**Top 15 Itemsets Observations:**
For the top 15 baskets for `Southwold`, it can be seen again that the majority of the products are individual products, except for `Coffee and Sausage Roll` and `Coffee and Cinnamon Swirl`.

The results shown half support the hypothesis that the `Southwold` customers tend to prefer sweet products over savoury ones asjust over half of the top baskets bought contain sweet items, excluding `Tea` and `Coffee`. The top 5 products also contain 2 sweet products, unlike `Aldeburgh` baskets. 

It can also be seen that `Coffee` is preferred far more than `Tea` in `Southwold` compared to `Aldeburgh`. For `Aldeburgh`, `Tea` was the third most popular basket item with a support score over over 0.1, whereas for `Southwold` it is 8th with a score c.0.06. This is strengthened by teh fact that the only combination of products that made the top 15 baskets were `Coffee` and something else. 

**Bottom 15 Itemsets Observations:**

`Danish`, `Croque Monsieur` and `Cheese & Tomato Melt` are the only individual items in the bottom 15 itemsets. Again, these are products that the bakery struggles with determining the demand for. One theory why the `Croque Monsieur` and `Cheese & Tomato Melt` are two of the least bought products could be due to the fact the for the bakeries `Aldeburgh` and `Southwold`, the top selling products discovered in the EDA are all Take Away products. This is due to their location by the seaside and it likely that customers purchase food and drinks to sit on the beach. Therefore, they don't want to spend time waiting for their food to be heated up when they could just grab a `Sausage Roll` and leave. Furthermore, `Southwold` has one of the smaller sitting areas inside comapred to the other bakeries, which makes it less likely for sit in, messy and time consuming products to be chosen. 

# Determining Rules of Association for Southwold
Now the data is in the correct format, the association rules can be determined, using the `association_rules` function. 

In [28]:
association_rules(sw_basket, metric='lift', min_threshold=1).sort_values('lift', ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
14,(Croissant),(Pain Au Chocolate),0.087814,0.068901,0.026009,0.296184,4.298677,0.019959,1.322929
15,(Pain Au Chocolate),(Croissant),0.068901,0.087814,0.026009,0.377485,4.298677,0.019959,1.465323
23,(Pain Au Chocolate),"(Coffee, Croissant)",0.068901,0.038935,0.011302,0.164028,4.212881,0.008619,1.149638
20,"(Coffee, Croissant)",(Pain Au Chocolate),0.038935,0.068901,0.011302,0.290272,4.212881,0.008619,1.31191
22,(Croissant),"(Coffee, Pain Au Chocolate)",0.087814,0.035075,0.011302,0.128701,3.669323,0.008222,1.107455
21,"(Coffee, Pain Au Chocolate)",(Croissant),0.035075,0.087814,0.011302,0.322218,3.669323,0.008222,1.34584
9,(Pain Au Chocolate),(Cinnamon Swirl),0.068901,0.103085,0.020337,0.295163,2.863292,0.013234,1.272513
8,(Cinnamon Swirl),(Pain Au Chocolate),0.103085,0.068901,0.020337,0.197284,2.863292,0.013234,1.159936
4,(Cinnamon Swirl),(Croissant),0.103085,0.087814,0.019773,0.191817,2.184353,0.010721,1.128687
5,(Croissant),(Cinnamon Swirl),0.087814,0.103085,0.019773,0.225174,2.184353,0.010721,1.15757


# Southwold Basket Observations 
From the above dataframe the interesting observations and association rules determined are: 
- The top 10 baskets all contain different combinations of pastries, with high lift values. All with incredibly high lift scores of over 4  

- `Coffee` is present in a lot more of the baskets compared to the `Aldeburgh` basket. In particular, `Coffee` is frequently bought with a pastry or two. For example, row 3 demonstrates that a customer is roughly 4.3 times more likely to buy a `Coffee` and `Croissant` if they already have a `Pain Au Chocolate` in their basket 

- `Sausage Roll` and `Moroccan Vegan Roll` also appear relatively high up in the table, with a lift score of c.1.9, as does `Empanada` and `Sausage Roll`, which are products which again would have thought to be catered to different customers but are seen bought together 

- `Tea` and `Coffee` also appeared in the dataframe, with a relatively low lift, but highlighting the fact that customers are likely to be visiting the bakery in groups 

---

# Darsham Basket Analysis 
The above process will be repeated for the `Darsham` bakery.

In [29]:
# Copy taken for audit trail purposes
darsham_encoded = darsham.copy()

In [30]:
# Transaction encoder is applied to the darsham bakery dataset 
darsham_encoded = bakery.transaction_encoder(darsham_encoded)

In [31]:
# To validate it worked
darsham_encoded.head()

Unnamed: 0,Almond Toast,Baguette,Bakewell,Cheese & Tomato Melt,Cheese Scone,Cheese Straw,Cinnamon Swirl,Coffee,Croissant,Croque Monsieur,Danish,Empanada,Friand,Fruit Scone,Madagascan Brownie,Magpie Sourdough,Moroccan Vegan Roll,Pain Au Chocolate,Sausage Roll,Tea
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,False
1,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False
3,False,True,False,False,False,False,False,False,True,False,False,True,False,False,False,False,False,True,False,False
4,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,True,False,False


In [32]:
# Applying apriori function
dar_basket = apriori(darsham_encoded, min_support=0.01, use_colnames=True)

In [33]:
print(f'The total number of itemsets combinations of the 20 products bought in Darsham is: \
{dar_basket.itemsets.count()}')

The total number of itemsets combinations of the 20 products bought in Darsham is: 60


In [34]:
# To see the top 15 itemsets bought
dar_basket.sort_values('support', ascending=False).head(15)

Unnamed: 0,support,itemsets
7,0.550799,(Coffee)
18,0.193519,(Sausage Roll)
15,0.129913,(Magpie Sourdough)
5,0.110035,(Cheese Straw)
19,0.0988,(Tea)
6,0.096232,(Cinnamon Swirl)
2,0.087612,(Bakewell)
16,0.085036,(Moroccan Vegan Roll)
47,0.081593,"(Sausage Roll, Coffee)"
8,0.078016,(Croissant)


In [35]:
# To see the bottom 15 itemsets bought
dar_basket.sort_values('support', ascending=True).head(15)

Unnamed: 0,support,itemsets
57,0.010022,"(Sausage Roll, Pain Au Chocolate)"
22,0.010069,"(Baguette, Magpie Sourdough)"
20,0.010684,"(Coffee, Almond Toast)"
28,0.010881,"(Cheese Straw, Cinnamon Swirl)"
59,0.010999,"(Sausage Roll, Coffee, Cheese Straw)"
31,0.011259,"(Cheese Straw, Moroccan Vegan Roll)"
23,0.01133,"(Sausage Roll, Baguette)"
51,0.011487,"(Sausage Roll, Croissant)"
27,0.011653,"(Sausage Roll, Cheese Scone)"
30,0.01196,"(Cheese Straw, Magpie Sourdough)"


## Darsham Support Observations
From the above `Darsham` basket, the different `support` values have been provided for each product or combination of products. 

**Top 15 Selling Itemsets**

The top 15 products for `Darsham` follow a similar pattern to that of `Aldeburgh` and `Southwold`, with the majority of the top purchased products being solo products, with `Coffee` and `Sausage Roll` being the top 2 products bought. In addition to this, the only combination of products that made the top 15 baskets is `Coffee and Sausage Roll`. However, the total number different combinations of the 20 products is 60, which is a lot larger than for `Aldeburgh` or `Southwold`.This suggests that a wider range of products are purchased together in this bakery but a lot less frequently. 

The mixture of sweet and savoury products is relatively even, however the top 5 products are all savoury or `Coffee` or `Tea`. 

**Bottom 15 Selling Itemsets**

Again `Cheese & Tomato Melt` is featured in the bototm 15 itemsets, however this is the only individual item for `Darsham`. The remainder of the items appear to have a two items, one being a sweet product and the other being a savoury. For example, `Cinnamon Swirl and Cheese Straw` or `Sausage Roll and Croissant`. 

In [36]:
association_rules(dar_basket, metric='lift', min_threshold=1).sort_values('lift', ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
23,(Pain Au Chocolate),(Croissant),0.063393,0.078016,0.01727,0.272434,3.49203,0.012325,1.267216
22,(Croissant),(Pain Au Chocolate),0.078016,0.063393,0.01727,0.221369,3.49203,0.012325,1.20289
25,(Moroccan Vegan Roll),(Empanada),0.085036,0.068207,0.01519,0.178634,2.619014,0.00939,1.134444
24,(Empanada),(Moroccan Vegan Roll),0.068207,0.085036,0.01519,0.22271,2.619014,0.00939,1.177121
14,(Cinnamon Swirl),(Pain Au Chocolate),0.096232,0.063393,0.015072,0.156624,2.470685,0.008972,1.110545
15,(Pain Au Chocolate),(Cinnamon Swirl),0.063393,0.096232,0.015072,0.237758,2.470685,0.008972,1.185671
11,(Croissant),(Cinnamon Swirl),0.078016,0.096232,0.013812,0.177035,1.839676,0.006304,1.098186
10,(Cinnamon Swirl),(Croissant),0.096232,0.078016,0.013812,0.143524,1.839676,0.006304,1.076486
30,(Sausage Roll),(Moroccan Vegan Roll),0.193519,0.085036,0.024338,0.125763,1.478946,0.007882,1.046586
31,(Moroccan Vegan Roll),(Sausage Roll),0.085036,0.193519,0.024338,0.286204,1.478946,0.007882,1.129848


# Darsham Basket Observations 
From the above dataframe the interesting observations and association rules determined are: 

 - Similar with the other baskets, the top baskets contain a mixture of pastries or puff pastry products (such as `Sausage Roll` or `Cheese Straw`
 
 - `Darsham` customers appear to prefer puff pastry product combinations more than the other shops, which has more puff pastry combinations with greater lifts compared to the other shops and at a greater frequency, with either `Moroccan Vegan Roll`, `Cheese Straw` or `Sausage Roll` making up 10 out of the above 18 baskets 

- The lift scores for all the above transactions are far lowers than `Southwold` or `Darsham` demonstrating that customers are perhaps less likely to buy combinations of products, which is also strengthened by the fact only one of the itemsets in the top 15 was not an individual product 

# Norwich Basket Analysis

In [37]:
# Copy of the dataset taken for audit trail purposes
norwich_encoded = norwich.copy()

In [38]:
# The transaction_encoder is applied 
norwich_encoded = bakery.transaction_encoder(norwich_encoded)

In [39]:
# To validate it worked
norwich_encoded.head()

Unnamed: 0,Almond Toast,Baguette,Bakewell,Cheese & Tomato Melt,Cheese Scone,Cheese Straw,Cinnamon Swirl,Coffee,Croissant,Croque Monsieur,Danish,Empanada,Friand,Fruit Scone,Madagascan Brownie,Magpie Sourdough,Moroccan Vegan Roll,Pain Au Chocolate,Sausage Roll,Tea
0,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False


In [40]:
nor_basket = apriori(darsham_encoded, min_support=0.01, use_colnames=True)

In [41]:
print(f'The total number of itemsets combinations of the 20 products bought in Norwich is: \
{nor_basket.itemsets.count()}')

The total number of itemsets combinations of the 20 products bought in Norwich is: 60


In [42]:
# The top 15 selling itemsets
nor_basket.sort_values('support', ascending=False).head(15)

Unnamed: 0,support,itemsets
7,0.550799,(Coffee)
18,0.193519,(Sausage Roll)
15,0.129913,(Magpie Sourdough)
5,0.110035,(Cheese Straw)
19,0.0988,(Tea)
6,0.096232,(Cinnamon Swirl)
2,0.087612,(Bakewell)
16,0.085036,(Moroccan Vegan Roll)
47,0.081593,"(Sausage Roll, Coffee)"
8,0.078016,(Croissant)


In [43]:
# The bottom 15 selling itemsets 
nor_basket.sort_values('support', ascending=True).head(15)

Unnamed: 0,support,itemsets
57,0.010022,"(Sausage Roll, Pain Au Chocolate)"
22,0.010069,"(Baguette, Magpie Sourdough)"
20,0.010684,"(Coffee, Almond Toast)"
28,0.010881,"(Cheese Straw, Cinnamon Swirl)"
59,0.010999,"(Sausage Roll, Coffee, Cheese Straw)"
31,0.011259,"(Cheese Straw, Moroccan Vegan Roll)"
23,0.01133,"(Sausage Roll, Baguette)"
51,0.011487,"(Sausage Roll, Croissant)"
27,0.011653,"(Sausage Roll, Cheese Scone)"
30,0.01196,"(Cheese Straw, Magpie Sourdough)"


## Norwich Support Observations
From the above `Norwich` basket, the different `support` values have been provided for each product or combination of products. 

**Top 15 Itemsets:**

The top 15 products for `Norwich` is identical to that of `Darsham`. This suggests again that customers are likely to prefer purchasing single products than combinations, and a mixture of sweet and savoury products. However, simialr to `Darsham` the total number different combinations of the 20 products is 60, suggesting that a wider range of products are purchased together in this bakery but a lot less frequently. 

**Bottom 15 Itemsets:**

Again, `Cheese & Tomoato Melt` features in the bottom 15 itemsets. This suggests that is is not a frequently bought product in any of the bakeries. Further analysis and discussions with the owner are required to determine if that's through customers preference or due to sporadic production and availability of that product. 

In [44]:
association_rules(nor_basket, metric='lift', min_threshold=1).sort_values('lift', ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
23,(Pain Au Chocolate),(Croissant),0.063393,0.078016,0.01727,0.272434,3.49203,0.012325,1.267216
22,(Croissant),(Pain Au Chocolate),0.078016,0.063393,0.01727,0.221369,3.49203,0.012325,1.20289
25,(Moroccan Vegan Roll),(Empanada),0.085036,0.068207,0.01519,0.178634,2.619014,0.00939,1.134444
24,(Empanada),(Moroccan Vegan Roll),0.068207,0.085036,0.01519,0.22271,2.619014,0.00939,1.177121
14,(Cinnamon Swirl),(Pain Au Chocolate),0.096232,0.063393,0.015072,0.156624,2.470685,0.008972,1.110545
15,(Pain Au Chocolate),(Cinnamon Swirl),0.063393,0.096232,0.015072,0.237758,2.470685,0.008972,1.185671
11,(Croissant),(Cinnamon Swirl),0.078016,0.096232,0.013812,0.177035,1.839676,0.006304,1.098186
10,(Cinnamon Swirl),(Croissant),0.096232,0.078016,0.013812,0.143524,1.839676,0.006304,1.076486
30,(Sausage Roll),(Moroccan Vegan Roll),0.193519,0.085036,0.024338,0.125763,1.478946,0.007882,1.046586
31,(Moroccan Vegan Roll),(Sausage Roll),0.085036,0.193519,0.024338,0.286204,1.478946,0.007882,1.129848


# Norwich Basket Observations 
The above dataframe is again almost identical to that of `Darsham` with: 
- The top frequently bought together items are either sweet or puff pastry items 
- Puff pastry items appear in the majority of combination baskets
- All the combinations have lower lift scores than `Aldeburgh` and `Southwold` suggesting customers here prefer purchasing single items than combinations compared to customer at the other bakeries

---

# Summary
The main insights gained from this analysis are:
- For all the shops, the top frequently bought together and more likely to be bought together given one are pastries and puff pastry products
- For all the shops `Coffee` is the most purchased product, followed by `Sausage Roll`
- For `Aldeburgh` and `Southwold` a customer was over 4 times more likely to buy a `Croissant` given they already had a `Pain Au Chocolate` in the basket than a `Pain Au Chocolate` on it's own 
- `Norwich` and `Darsham` customers appear to have similar purchasing habits, having identical top 15 selling itemsets, number of combinations of itemsets and very similar basket behaviours
- `Norwich` and `Darsham` customers frequently purchase puff pastry products together and with other items, more than at `Aldeburgh` and `Southwold` which favour purchasing sweet pastry products together and more frequently  
- `Southwold` customers prefer `Coffee` considerable more than `Tea`, with the `Tea` support score being only 12% of that of the `Coffee` support score 
- `Cheese & Tomato Melt` feautred in the bottom 15 frequently bought item across all bakeries and itemsets


# Recommendations
To make full recommendations further analysis is required than this brief introductory one.

For promotional offer recommendations, products that are frequently bought together should not be made a promotion as they are likely to be bought together anyway. Instead, the business should focus on products that are bought frequently, such as `Sausage Roll`, `Coffee`, `Magpie Sourdough`, `Cheese Straw` or `Cinnamon Swirl` and create a promotion that offers a discounted product of either one of the products that:
- Has the greatest margin if the business wants to drive revenue
- Is less frequenely bought is the business wants to promote that product 
- There is a surplus of the product that needs selling 

For individual recommendations: 
- `Aldeburgh` customers frequently purchase sweet pastry products together, therefore a promotion could be one sweet pastry and one other product that fits in the above category (either has the greater margin or is less frequnetly bought such as `Almond Toast` 
- `Southwold` customers have a strong preferenc for `Coffee` and often buy `Coffee` with other products. A promotional offer could be `Coffee` and another product, that isn't already likely to be bought with `Coffee` but is similar, to drive sales of that product 
- `Darsham` and `Norwich` customers have a strong preference for puff pastry products. Therefore, a promotion involving either `Cheese Staws` or `Sausage Rolls` should be effective 

>[Return to Contents](#Contents)