<div id="container" style="position:relative;">
<div style="float:left"><h1> Forecasting Bakery Sales - Abi Magnall </h1></div>
<div style="position:relative; float:right"><img style="height:65px" src ="https://twomagpiesbakery.co.uk/wp-content/uploads/2020/11/logo-no-site.jpg" />
</div>
</div>

# Notebook 1 : Bakery Data Gathering

---

The purpose of this notebook is to explain the problem and purpose of this project, and to collect the data from a cloud-based till system using its API. 

Transaction data will be collected for four bakeries, referred to as `Aldeburgh`, `Southwold`, `Darsham` and `Norwich` throughout this project, spanning 01-09-2020 to 31-10-2022. This data was provided by the amazing `Two Magpies Bakery` so that the outcomes of this project can be used to improve their business planning and strategy. 

**N.B.** **This notebook does not need to be run.** It only contains cells that make the API calls to the till system, and therefore will not run without the API tokens. The download cells have been commented out just in case. 

---

# Contents  

**1. [Introduction](#Introduction)**
- [Problem Statement](#Problem-Statement)


**2. [Data Collection](#Data-Collection)**
- [Data and Column Descriptions](#Data-and-Column-Descriptions)

**3. [Next Steps](#Next-Steps)**

___

# Introduction 
## Problem Statement 

Cash flow is a key metric in determining whether a business is able to grow and survive, particularly during times of economic uncertainty. A business needs to know how much revenue it is going to generate and, depending on the type of business, when it will receive payments from customers throughout the year, so that it can plan any investments accordingly. Without an accurate, data-driven forecast, all planning and strategy would be guesswork. A reliable forecast is therefore essential for the success and growth of any business.



    "The two most important things about forecast accuracy are:
    1. Surety of cash and so giving confidence to the banks and any potential investors
    2. To manage & optimise manufacturing efficiency - ensuring that there is the right labour producing the right product at the right time" - Co-Owner, Two Magpies Bakery



**The Value Add of an Accurate Data-Driven Revenue Forecast:**
- Managers can be proactive instead of reactive and can adapt their strategy to achieve the sales needed
- Better investment, staffing and hiring decisions based on peaks and troughs throughout the year (achieved through a weekly and monthly forecast) 
- A monthly forecast is required when raising funding, as it shows potential investors that the business is not only on target but also going to grow

All of these are required to support the growth of the business. 

**The Aim of this Project:**

The aim of this project is to develop accurate revenue forecasts for a leading Artisan Bakery in East Anglia, called the [Two Magpies Bakery](https://twomagpiesbakery.co.uk/). The bakery requires: 
1. A daily forecast up to 7 days ahead, *with a target accuracy of 95%*
2. A weekly forecast up to 6 weeks ahead, *with a target accuracy of 95%*
3. A monthly forecast up to 6 months ahead, *with a target accuracy of 92-95%* 
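
The targets above depend on how accuracy is measured, which this notebook does not define. As a minimal sketch, one common convention is assumed here: accuracy = 100% minus the mean absolute percentage error (MAPE). The function name and the sample figures are hypothetical.

```python
import numpy as np

def forecast_accuracy(actual, forecast):
    """Return accuracy in percent, defined here (an assumption) as 100 - MAPE."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    # MAPE is undefined where the actual value is zero, so those periods are skipped
    mask = actual != 0
    mape = np.mean(np.abs((actual[mask] - forecast[mask]) / actual[mask])) * 100
    return 100 - mape

# Hypothetical daily revenues and forecasts, for illustration only
actual = [1000.0, 1200.0, 800.0, 1000.0]
forecast = [950.0, 1260.0, 840.0, 1000.0]
accuracy = forecast_accuracy(actual, forecast)
print(f"Accuracy: {accuracy:.2f}%")
```

Under this definition, a forecast that is off by 5% on three days out of four would meet the 95% daily target.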


**How to Achieve This Project:**

Following thorough research, the modelling plan for this project includes: 
- Moving Average Model, which will be the baseline model as it is the simplest to implement and the easiest to understand 
- Linear Regression Model, another model that is simple to implement and understand, and that can perform well depending on the scenario 
- SARIMAX, which is known in industry as one of the leading time series models
- Facebook Prophet, also known in industry as one of the leading time series algorithms, which is particularly good at dealing with special dates
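
To illustrate the baseline in the plan above, the following is a minimal sketch of a moving average forecast in pandas. The window length, the function name and the revenue figures are assumptions for illustration; they are not taken from the modelling notebooks.

```python
import pandas as pd

def moving_average_forecast(revenue, window=7):
    """Forecast each day as the mean of the previous `window` days."""
    # shift(1) makes sure the forecast for a day only uses earlier days
    return revenue.shift(1).rolling(window=window).mean()

# Hypothetical daily revenue series
dates = pd.date_range("2022-10-01", periods=10, freq="D")
revenue = pd.Series(
    [100, 110, 120, 130, 140, 150, 160, 170, 180, 190],
    index=dates,
    dtype=float,
)
baseline = moving_average_forecast(revenue, window=3)
print(baseline)
```

The first `window` days have no forecast, since there is not yet enough history to average over.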

---

# Data Collection
The data was collected by calling the API of the EposNow till system used by the bakery. The API tokens are saved in external text files which are read in, so the code blocks below that perform the API calls will not run without the tokens. 

The API returns the transactions between two given dates, which are appended to a dataframe. Each row of the resulting dataframe is one line item of a transaction, i.e. a product bought within that transaction. 

The full raw dataset runs from 01/09/2020 to 31/10/2022. These dates were selected as this is the longest period available after the main lockdowns of the coronavirus pandemic. This amount of data is the minimum required for time series analysis, and its short length may prove problematic. However, according to the bakery owner, the sales and spending habits of customers have completely changed since the pandemic. It was therefore thought that pre-pandemic data would contain different revenue trends, which the models would identify and forecast even though they no longer reflect the true patterns. 

## Imports

In [None]:
import requests
import pandas as pd
import os

## To Get Current Directory

In [None]:
working_directory = os.getcwd()
working_directory

## Importing Custom Functions

In [None]:
import BakeryFunctions as bakery

## Importing the API Tokens

In [None]:
ald_token = open('aldeburgh_token.txt', 'r')
sw_token = open('southwold_token.txt', 'r')
dar_token = open('darsham_token.txt', 'r')
nor_token = open('norwich_token.txt', 'r')
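
The `open(...)` calls above leave the file handles open, and a handle only returns its contents on the first `.read()`. As a hedged alternative, the sketch below reads a token into a plain string and builds the header in the same `Basic <token>` form used in this notebook. The helper names and the demo file are hypothetical.

```python
import os
import tempfile
from pathlib import Path

def read_token(path):
    """Return the token stored in a text file, without surrounding whitespace."""
    # read_text opens and closes the file; strip() removes a trailing newline
    return Path(path).read_text().strip()

def build_headers(token):
    """Build the Authorization header in the 'Basic <token>' form."""
    return {"Authorization": f"Basic {token}"}

# Demonstration with a throwaway file instead of a real token
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_token.txt")
    Path(demo_path).write_text("abc123\n")
    headers = build_headers(read_token(demo_path))

print(headers)
```

Because the token is a string, it can be reused for every page request without re-reading the file.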

---

# Aldeburgh Data Download 
*The inputs below are for the final download of data, which was the test set spanning 01-10-2022 to 01-11-2022, not the full data set from 01-09-2020.*

The raw data is returned as a list of dictionaries, where each dictionary contains the information for one transaction. The transaction items are retrieved from each dictionary and appended to the dataframe. 
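
As an illustration of this structure, the sketch below flattens a small, made-up response into one row per transaction item using `pd.json_normalize`. The sample values are hypothetical; only the `TransactionItems` and `DateTime` keys are taken from the download code in this notebook. This is an alternative to the row-by-row `pd.concat` approach, not the method used for the actual download.

```python
import pandas as pd

# Hypothetical sample with the same shape as the response described above
sample_page = [
    {
        "Id": 1,
        "DateTime": "2022-10-01T09:15:00",
        "TransactionItems": [
            {"ProductId": 10, "UnitPrice": 3.5, "Quantity": 2},
            {"ProductId": 11, "UnitPrice": 2.0, "Quantity": 1},
        ],
    },
    {
        "Id": 2,
        "DateTime": "2022-10-01T09:20:00",
        "TransactionItems": [
            {"ProductId": 10, "UnitPrice": 3.5, "Quantity": 1},
        ],
    },
]

# One row per item, with the parent transaction's DateTime carried along
items = pd.json_normalize(sample_page, record_path="TransactionItems", meta=["DateTime"])
items = items.rename(columns={"DateTime": "Date"})
print(items)
```

Flattening a whole page in one call avoids growing the dataframe one row at a time, which tends to be slow for large downloads.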

In [None]:
# # Aldeburgh Download

# # An empty dataframe is created to be filled with transaction data 
# aldeburgh_df = pd.DataFrame()

# # To read the API token once, before the loop (a second .read() on the same open file returns an empty string)
# token = ald_token.read().strip()
# headers = {'Authorization': "Basic {}".format(token)}

# # To loop through the page numbers 
# for i in range(1, 65): 
#     # Link to call the API which gets the transaction rows between two dates 
#     api_url = 'https://api.eposnowhq.com/api/v4/Transaction/GetByDate/2022-10-01/2022-11-01/?page='+str(i)
#     auth_response = requests.get(api_url, headers=headers)
#     print(f'Processing page {i}: {len(auth_response.json())} transactions found.')
#     for j in range(200):
#         try:
#             # To append the data in the correct format, with the correct date for each transaction 
#             for transaction_item in auth_response.json()[j]['TransactionItems']:
#                 tmp = pd.DataFrame.from_dict(transaction_item, orient = 'index').T
#                 tmp['Date'] = auth_response.json()[j]['DateTime']
#                 #print(tmp)
#                 try:
#                     aldeburgh_df = pd.concat([aldeburgh_df, tmp])
#                 except:
#                     # To print if there are any issues processing that transaction row 
#                     print(f'Issue processing transaction: page {i}, transaction {j}')
#         except:
#             print(f'Issue processing transaction: page {i}, transaction {j}')
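
The loop above requests a fixed range of pages (1 to 64). A minimal sketch of an alternative is shown below: it keeps requesting pages until one comes back empty. It assumes that the API returns an empty list once the pages run out, which is not confirmed by this notebook. `fetch_page` is a hypothetical callable, replaced here by a stand-in so the example runs without a token.

```python
def fetch_all_pages(fetch_page, max_pages=1000):
    """Collect transactions page by page until an empty page is returned."""
    transactions = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        # Assumption: an empty page means there is no more data
        if not batch:
            break
        transactions.extend(batch)
    return transactions

# Stand-in for the API: two pages of made-up transactions, then nothing
fake_pages = {1: [{"Id": 1}, {"Id": 2}], 2: [{"Id": 3}]}

def fake_fetch(page):
    return fake_pages.get(page, [])

all_transactions = fetch_all_pages(fake_fetch)
print(len(all_transactions))
```

In real use, `fetch_page` would wrap the `requests.get` call and return the parsed JSON for that page; `max_pages` is a safety limit so the loop always ends.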

In [None]:
# # To validate it downloaded successfully 
# aldeburgh_df.head()

In [None]:
# # To export the data to a new csv file 
# aldeburgh_df.to_csv(working_directory+'/1_raw_data/ald_oct22.csv',index=False)

## Observations 
While the data was being retrieved from the API, some transactions could not be processed. This will therefore be explored in [Bakery Data Preprocessing](./4_Bakery_Data_Preprocessing.ipynb) to ensure that all transaction rows are accounted for, by comparing the calculated revenue to the true revenue. 
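
A minimal sketch of such a check is given below. It assumes that the revenue of a row is `UnitPrice * Quantity - DiscountAmount`, which is an assumption about how these columns relate rather than something confirmed by the till system. The item rows and the reported total are made up.

```python
import pandas as pd

# Hypothetical item rows; values are strings, as in the raw download
items = pd.DataFrame({
    "UnitPrice": ["3.50", "2.00", "3.50"],
    "Quantity": ["2", "1", "1"],
    "DiscountAmount": ["0.00", "0.50", "0.00"],
})

# Convert the text columns to numbers before doing any arithmetic
numeric = items.apply(pd.to_numeric)

# Assumed revenue formula: price times quantity, minus the discount on the row
calculated_revenue = (
    numeric["UnitPrice"] * numeric["Quantity"] - numeric["DiscountAmount"]
).sum()

reported_revenue = 15.50  # hypothetical "true" revenue for the same period
shortfall = reported_revenue - calculated_revenue
print(f"Calculated: {calculated_revenue:.2f}, shortfall: {shortfall:.2f}")
```

A non-zero shortfall would suggest that some transaction rows were lost during the download.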

---

# Southwold Data Download

In [None]:
# # Southwold Download

# # A new empty dataframe is created 
# southwold_df = pd.DataFrame()
# # To read the API token once, before the loop (a second .read() on the same open file returns an empty string)
# token = sw_token.read().strip()
# headers = {'Authorization': "Basic {}".format(token)}
# # To loop through the page numbers 
# for i in range(1, 65): 
#     # To call the API 
#     api_url = 'https://api.eposnowhq.com/api/v4/Transaction/GetByDate/2022-10-01/2022-11-01/?page='+str(i)
#     auth_response = requests.get(api_url, headers=headers)
#     print(f'Processing page {i}: {len(auth_response.json())} transactions found.')
#     for j in range(200):
#         try:
#             for transaction_item in auth_response.json()[j]['TransactionItems']:
#                 tmp = pd.DataFrame.from_dict(transaction_item, orient = 'index').T
#                 tmp['Date'] = auth_response.json()[j]['DateTime']
#                 try:
#                     southwold_df = pd.concat([southwold_df, tmp])
#                 except:
#                     print(f'Issue processing transaction: page {i}, transaction {j}')
#         except:
#             print(f'Issue processing transaction: page {i}, transaction {j}')

In [None]:
# # To validate it worked 
# southwold_df

In [None]:
# # To save the raw data to a csv file 
# southwold_df.to_csv(working_directory+'/1_raw_data/sw_oct22.csv',index=False)

---

# Darsham Data Download

In [None]:
# # Darsham Download

# # Empty dataframe is created
# darsham_df = pd.DataFrame()
# # To read the API token once, before the loop (a second .read() on the same open file returns an empty string)
# token = dar_token.read().strip()
# headers = {'Authorization': "Basic {}".format(token)}
# # To loop through the page numbers 
# for i in range(1, 65): 
#     api_url = 'https://api.eposnowhq.com/api/v4/Transaction/GetByDate/2022-10-01/2022-11-01/?page='+str(i)
#     auth_response = requests.get(api_url, headers=headers)
#     print(f'Processing page {i}: {len(auth_response.json())} transactions found.')
#     for j in range(200):
#         try:
#             for transaction_item in auth_response.json()[j]['TransactionItems']:
#                 tmp = pd.DataFrame.from_dict(transaction_item, orient = 'index').T
#                 tmp['Date'] = auth_response.json()[j]['DateTime']
#                 #print(tmp)
#                 try:
#                     darsham_df = pd.concat([darsham_df, tmp])
#                 except:
#                     print(f'Issue processing transaction: page {i}, transaction {j}')
#         except:
#             print(f'Issue processing transaction: page {i}, transaction {j}')

In [None]:
# # To validate it worked 
# darsham_df

In [None]:
# # To save the raw data to csv 
# darsham_df.to_csv(working_directory+'/1_raw_data/dars_oct22.csv',index=False)

---

# Norwich Data Download

In [None]:
# # Norwich Download

# # New empty dataframe is created
# norwich_df = pd.DataFrame()
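# # To read the API token once, before the loop (nor_token is an open file, so its contents must be read out)
# token = nor_token.read().strip()
# headers = {'Authorization': "Basic {}".format(token)}
# # To loop through the page numbers 
# for i in range(1, 65): 
#     api_url = 'https://api.eposnowhq.com/api/v4/Transaction/GetByDate/2022-10-01/2022-11-01/?page='+str(i)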
#     auth_response = requests.get(api_url, headers=headers)
#     print(f'Processing page {i}: {len(auth_response.json())} transactions found.')
#     for j in range(200):
#         try:
#             for transaction_item in auth_response.json()[j]['TransactionItems']:
#                 tmp = pd.DataFrame.from_dict(transaction_item, orient = 'index').T
#                 tmp['Date'] = auth_response.json()[j]['DateTime']
#                 #print(tmp)
#                 try:
#                     norwich_df = pd.concat([norwich_df, tmp])
#                 except:
#                     print(f'Issue processing transaction: page {i}, transaction {j}')
#         except:
#             print(f'Issue processing transaction: page {i}, transaction {j}')

In [None]:
# # To validate it worked
# norwich_df

In [None]:
# # To save the raw data to csv
# norwich_df.to_csv(working_directory+'/1_raw_data/nor_oct22.csv',index=False)

---

# Data and Column Descriptions
Basic EDA will be carried out on one of the dataframes to identify what format the raw data is collected in.


In [None]:
# # To drop the two columns that contain lists for each row
# darsham_df_eda = darsham_df.drop(columns=['Taxes', 'MultipleChoiceItems']).copy()

In [None]:
# bakery.further_eda(darsham_df_eda)
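
`bakery.further_eda` is a custom function whose code is not shown in this notebook. As a rough sketch of the kind of checks such a summary might include (not the author's implementation), the hypothetical helper below reports the shape, empty columns, missing values, duplicated rows and datatypes of a dataframe. The sample data is made up.

```python
import pandas as pd

def summarise_frame(df):
    """Return basic structural facts about a dataframe."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        # Columns in which every value is missing
        "empty_columns": [c for c in df.columns if df[c].isna().all()],
        "missing_values": int(df.isna().sum().sum()),
        "duplicated_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }

# Hypothetical sample: one empty column and one repeated row
sample = pd.DataFrame({
    "Id": ["1", "2", "2"],
    "Notes": [None, None, None],
    "Quantity": ["1", "3", "3"],
})
summary = summarise_frame(sample)
print(summary)
```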

## Observations 
From the above initial EDA, it can be seen that:
- The raw data consists of multiple rows of data (c.20,000 for this API call but over 500,000 for the full raw data) and 23 columns. 
- There are 4 columns that are completely empty, which are likely to be redundant and will be removed in the cleaning process. 
- There are a lot of missing values that need to be dealt with appropriately. 
- There appear to be no duplicated rows of data, but this is a small sample of the full dataset, so this will be re-assessed in the cleaning process.
- All the datatypes are object, which will have to be changed to the appropriate datatypes in the cleaning process.
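
As an illustration of the last point, the sketch below converts object columns to numeric and datetime types with pandas. The sample values and the date format are assumptions; the actual conversion is left to the cleaning notebook.

```python
import pandas as pd

# Hypothetical raw rows, stored as objects as in the download
raw = pd.DataFrame(
    {
        "UnitPrice": ["3.50", "2.00"],
        "Quantity": ["2", "1"],
        "Date": ["2022-10-01T09:15:00", "2022-10-01T09:20:00"],
    },
    dtype=object,
)

clean = raw.copy()
# errors="coerce" turns unparseable values into NaN/NaT instead of raising
for column in ["UnitPrice", "Quantity"]:
    clean[column] = pd.to_numeric(clean[column], errors="coerce")
clean["Date"] = pd.to_datetime(clean["Date"], errors="coerce")

print(clean.dtypes)
```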

A data dictionary for the raw data can be found below: 

|Column| Description | 
|:--| :- | 
|Id    |   The ID of the row of data from the till system  |                
|TransactionId| The unique transaction ID | 
|ProductId     |  The ID of the product(s) bought |                
|UnitPrice      |       The unit price of the product(s) bought |         
|UnitPriceExcTax |     The unit price of the product(s) bought excluding tax |           
|CostPrice        |    The cost price of the product(s) that is stored in the system |           
|TaxGroupId        |  The tax group ID of the product(s) |
|Quantity         |    The quantity of the product bought in that transaction |         
|DiscountAmount  |    The discount amount on the transaction |        
|DiscountReasonId |     The ID of the discount reason on the transaction | 
|DiscountAmountExcTax|  The discount amount excluding tax |      
|RefundReasonId     |    The ID of the refund reason |     
|Notes               |   Additional notes added to the order |     
|PrintOnOrder         |    Boolean if the order was printed |       
|MultipleChoiceProductId | Feature of the till system not utilised by the bakery |   
|ParentId       |   Feature of the till system not utilised by the bakery |          
|IsTaxExempt     | Boolean if the order is tax exempt or not |               
|MeasurementDetails |  Feature of the till system not utilised by the bakery |
|Taxes| List of dictionaries containing Tax information | 
|MultipleChoiceItems| List of dictionaries containing the selection of different flavours available of that product | 
|CourseFired         |   Feature of the till system not utilised by the bakery |        
|Date                 | The date of the transaction |           

---

# Next Steps
- The weather data will be cleaned and explored to ensure it is in the correct format required for the modelling phase. This can be found in [Weather Cleaning and EDA](./2_Weather_Cleaning_EDA.ipynb). 

- The raw bakery transaction data needs to be explored for missing, duplicated or erroneous data and cleaned to remove redundant columns. This is carried out in [Bakery Data Cleaning Notebook](./3_Bakery_Data_Cleaning.ipynb).



>[Return to Contents](#Contents)