# Data Exploration (f_demand_actual)

The demand data used by the ELT Demand Planners for forecasting is held in the AWS Redshift data warehouse in the table r2ibp.f_demand_actual. This is daily demand data and has the following structure:

- date (daily)
- sold_to_customer_key
- ship_to_country_key
- isbn
- quantity_demanded (NB: this can be negative for some countries)

This notebook explores the content of this table. It will be the main data source for any new forecasting algorithms used to create the starting point forecast used by Demand Planning. 

In [None]:
#Library imports

from sqlalchemy import create_engine
import psycopg2
import numpy as np
import pandas as pd
import datetime as dt
import dateutil.relativedelta  #import the submodule explicitly as relativedelta is used below

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Retrieve Data from f_demand_actual

Get the data from f_demand_actual, aggregated by month for each isbn/country combination. Only complete months are selected, and ISBNs beginning with 555 are excluded.

In [None]:
#Redshift user credentials - set here
USER = 
PASSWORD = 

#Create SQLAlchemy engine for Redshift database connection
user = USER
password = PASSWORD
host= 
port='5439'
dbname='prod'

url = "postgresql+psycopg2://{0}:{1}@{2}:{3}/{4}".format(user, password, host, port, dbname)
engine = create_engine(url)
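
One thing to watch with the URL above: a password containing characters such as `@` or `/` needs percent-encoding before it goes into the connection string. The following is a minimal sketch of one way to do that while keeping credentials out of the notebook. The function name `build_redshift_url` and the environment variable names are hypothetical, and the fallback values are placeholders, not real credentials.

```python
import os
from urllib.parse import quote_plus


def build_redshift_url(user, password, host, port="5439", dbname="prod"):
    """Build the connection URL, percent-encoding the user and password."""
    return "postgresql+psycopg2://{0}:{1}@{2}:{3}/{4}".format(
        quote_plus(user), quote_plus(password), host, port, dbname
    )


# REDSHIFT_USER / REDSHIFT_PASSWORD / REDSHIFT_HOST are assumed variable names;
# the defaults are dummy values so that the sketch runs anywhere.
example_url = build_redshift_url(
    os.environ.get("REDSHIFT_USER", "example_user"),
    os.environ.get("REDSHIFT_PASSWORD", "p@ss/word"),
    os.environ.get("REDSHIFT_HOST", "example-host"),
)
```

The resulting string could then be passed to `create_engine` in the same way as `url`.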

In [None]:
query = """
    select
        isbn + ship_to_country_key as key,
        isbn,
        ship_to_country_key as country,
        last_day(date) as month,
        sum(quantity_demanded) as qty
    from r2ibp.f_demand_actual
    where month <= current_date
    and isbn not like '555%%'
    group by key, isbn, country, month
    order by key, isbn, country, month asc
    """

conn = engine.connect()
df = pd.read_sql_query(query, conn)
conn.close()

#Convert month to datetime.date objects
df['month'] = pd.to_datetime(df['month']).dt.date
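
To make the aggregation in the query easier to follow, here is a rough pandas equivalent on a handful of invented daily rows. It is only a sketch: the values are made up, `today` stands in for `current_date`, and rolling each date forward with `pd.offsets.MonthEnd(0)` is assumed to play the role of `last_day(date)`.

```python
import pandas as pd

# Toy daily demand rows (invented values, not taken from the warehouse)
daily = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-10", "2023-03-03"]),
    "isbn": ["9780000000001"] * 4,
    "country": ["ES"] * 4,
    "qty": [10, -2, 5, 7],
})

today = pd.Timestamp("2023-03-15")  # stand-in for current_date

# Roll each date forward to the last day of its month
daily["month"] = daily["date"] + pd.offsets.MonthEnd(0)

monthly = (
    daily[daily["month"] <= today]   # keep only months that have already ended
    .groupby(["isbn", "country", "month"], as_index=False)["qty"]
    .sum()
)
print(monthly)
```

March is dropped because its month end falls after `today`, which is how the `where month <= current_date` condition restricts the data to complete months.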

## Basic Statistics

How much data is available: what period does it cover, and how many ISBNs, countries and ISBN/country combinations does it contain?

In [None]:
first_month = df['month'].min()
last_full_month = df['month'].max()
twelve_months_ago =  last_full_month - dateutil.relativedelta.relativedelta(months=12)

print('The first month in the dataset is', first_month)
print('The last month in the dataset is', last_full_month)
n_months = (last_full_month.year - first_month.year)*12 + (last_full_month.month - first_month.month) + 1
print('Which gives', n_months, 'months of data in total')
print('Made up of', len(df), 'separate monthly demand quantities\n')

print('The full dataset has:')
print(df['isbn'].nunique(), 'unique ISBNs, across')
print(df['country'].nunique(), 'different countries, resulting in')
print(df['key'].nunique(), 'separate ISBN/country combinations')


#The last 12 months - I'll want to come back to this
# print('\nFor last 12 months\n')
# print('Number of unique ISBNs    ', df[df['month'] > twelve_months_ago]['isbn'].nunique())
# print('Number of unique countries', df[df['month'] > twelve_months_ago]['country'].nunique())
# print('Number of ISBN/country    ', df[df['month'] > twelve_months_ago]['key'].nunique())
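
As a quick check on the month-count arithmetic used above, here is the same formula as a small function with a worked example. The name `months_inclusive` is just for illustration.

```python
import datetime as dt


def months_inclusive(first, last):
    """Number of calendar months from first to last, counting both ends."""
    return (last.year - first.year) * 12 + (last.month - first.month) + 1


# November, December, January and February
n = months_inclusive(dt.date(2019, 11, 30), dt.date(2020, 2, 29))
print(n)
```

The `+ 1` is what makes the count inclusive: a dataset that starts and ends in the same month covers one month, not zero.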

## Demand Analysis

How much demand is there for ELT products?

### Aggregate Global Demand

What does demand look like at the overall level, and how has it changed over the lifespan of the dataset?

In [None]:
#Let's have a look at the overall demand profile

ts = df.copy().groupby("month")[["qty"]].sum()
ts_12ma = ts.rolling(window=12).mean()

plt.subplots(figsize=(8, 6))

plt.plot(ts, label="Total")
plt.plot(ts_12ma, label = '12m MA')

#plt.title('Global demand')
plt.xlabel('Month')
plt.ylabel('Total Units')
plt.ylim(bottom=0)
plt.legend()
plt.grid()
plt.show();

#We can see that:
#1) There is significant monthly variation in the demand
#2) Underlying demand fell significantly with the pandemic and remains lower (although slowly trending upwards?)
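
A small sketch of why a 12-month moving average is useful here, assuming a 12-month seasonal cycle. The toy series below is invented and repeats exactly every 12 months, so the moving average is flat once the window is full.

```python
import pandas as pd

# Two years of invented monthly data with a pattern that repeats every 12 months
idx = pd.date_range("2020-01-31", periods=24, freq=pd.offsets.MonthEnd())
toy = pd.Series(list(range(1, 13)) * 2, index=idx, dtype=float)

toy_12ma = toy.rolling(window=12).mean()

# The first 11 values are NaN because the window is not yet full
print(toy_12ma.isna().sum())
print(toy_12ma.dropna().unique())
```

Because every 12-month window contains each seasonal value exactly once, the seasonal variation averages out and only the underlying level remains.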

In [None]:
#Looking a bit more into the change by looking at the 12 month diff
#I'm assuming 12 month seasonality here

ts_12m_diff = ts.diff(12).dropna()

plt.subplots(figsize=(8, 6))

plt.plot(ts_12m_diff)

#plt.title('Global Demand - Change from 12 months earlier')
plt.xlabel('Month')
plt.ylabel('Units Change')
plt.grid()
plt.show();

#There was a drop of about 1m units in 2020 which has not (significantly) recovered
#There is also significant variation +/-0.5m on orders of about 1.5m units
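
To illustrate what the 12-month difference isolates, here is a toy example with invented numbers. The second year has the same seasonal shape as the first but sits 20 units lower every month, so `diff(12)` removes the seasonality and leaves just the year-on-year change.

```python
import pandas as pd

idx = pd.date_range("2020-01-31", periods=24, freq=pd.offsets.MonthEnd())

# Invented values: year 2 is year 1 shifted down by 20 units in every month
year1 = [100, 120, 90, 300, 80, 70, 60, 400, 110, 95, 85, 130]
year2 = [v - 20 for v in year1]
toy = pd.Series(year1 + year2, index=idx)

# Each value is compared with the same month one year earlier
toy_12m_diff = toy.diff(12).dropna()
print(toy_12m_diff)
```

The first 12 months have nothing to compare against, which is why they are dropped and the difference series is a year shorter than the original.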

### Individual ISBN/Country Demand

What does the demand look like at the level that we want to forecast i.e. by month for each isbn/country?

In [None]:
#Let's look at 12 ISBN/country combinations at random
#Sample from the unique keys so that no combination is picked twice; random_state fixes the selection
plot_list = df['key'].drop_duplicates().sample(12, random_state=0).to_list()

COLS = 3

rows = int(np.ceil(len(plot_list)/COLS))  #round up
fig, axes = plt.subplots(rows, COLS, figsize = (16,rows*4))
#The following is to iterate the axes
axes_flat = axes.flat

for i, key in enumerate(plot_list):
    
    ts_actuals = df[df['key'] == key][['month', 'qty']]
    
    #Plot these all on the same date range
    start = first_month
#     start = ts_actuals['month'].min()
#     if start > twelve_months_ago:
#         start = twelve_months_ago
    
    idx = pd.date_range(start, last_full_month, freq='M') 

    ts_actuals.set_index(pd.to_datetime(ts_actuals.month), inplace=True)
    ts_actuals.drop("month", axis=1, inplace=True)

    #This is used to fill in the missing months
    ts_actuals = ts_actuals.reindex(idx, fill_value=0).astype(int)
    
    ax = axes_flat[i]
    ax.plot(ts_actuals)
    ax.grid()
    ax.set_title(key);
         
plt.tight_layout()
plt.show();

#This highlights a number of points straightaway.
#FOR EXAMPLE - BUILD ON THIS LIST
#1. Data tends to be very "spikey"
#2. Lots of low demand
#3. Lots of months with zero demand (how many?)
#4. Not all TS cover the whole period (product lifecycle)
#5. Negative demand in some countries (which is an artefact of the date being used)
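
The reindexing step in the loop is what turns "no row for that month" into an explicit zero, and it also gives a way to answer the "how many?" in point 3. A minimal sketch on an invented series:

```python
import pandas as pd

# Invented series with demand recorded in only three of six months
sparse = pd.Series(
    [5, 3, -1],
    index=pd.to_datetime(["2023-01-31", "2023-03-31", "2023-06-30"]),
)

# Full month-end index over the period of interest
idx = pd.date_range("2023-01-31", "2023-06-30", freq=pd.offsets.MonthEnd())

# Months with no row become explicit zeros
dense = sparse.reindex(idx, fill_value=0)

# Share of months with zero demand
zero_share = (dense == 0).mean()
print(dense)
print(zero_share)
```

The same calculation applied to each ISBN/country series would give the proportion of zero-demand months, although looping over every key in `df` may be slow.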

### What's the variation of demand?

This varies both between isbn/countries (level) and within isbn/countries (seasonality). Looking at the last 12 months:
- Total demand, i.e. the overall LEVEL of demand
- Number of months with demand, i.e. the "lumpiness" of demand, which might indicate seasonality

In [None]:
#Let's look at the last 12 months

df_12m_demand = df.copy()
df_12m_demand = df_12m_demand[df_12m_demand['month'] > twelve_months_ago]

#Calculate the number of months with demand - before aggregating
df_mths_w_demand = df_12m_demand[['key', 'month']].groupby(['key']).count()
df_mths_w_demand.rename(columns = {'month':'mths_w_orders'}, inplace = True)

#Now aggregate and use cut to put order quantity into log10 bins
df_12m_demand = df_12m_demand[['key', 'qty']].groupby(['key']).sum()
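#NB Totals <= 0 (net negative demand) fall outside the bins and get NaN; np.inf keeps the top bin open-ended
df_12m_demand['qty_bin'] = pd.cut(df_12m_demand['qty'], [0, 10, 100, 1000, 10000, np.inf],
                           labels = ['<=10', '10-100', '100-1000', '1000-10000', '>10000'])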

#Join the number of months with demand
df_12m_demand = df_12m_demand.join(df_mths_w_demand)
#And tidy up
del df_mths_w_demand

#Plot a pie chart of the distribution of demand
# y = df_12m_demand['qty_bin'].value_counts()
# my_labels = ['<=10', '10-100', '100-1000', '1000-10000', '>10000']   
# plt.pie(y, labels = my_labels)
# plt.show();

#And also the numbers of month with demand
# y = df_12m_demand['mths_w_orders'].value_counts()
# my_labels = ['1', '2', '3', '4', '5', '12', '6', '7', '8', '10', '9', '11']
# plt.pie(y, labels = my_labels)
# plt.show();

#Create the crosstab
df_crosstab = pd.crosstab(df_12m_demand['mths_w_orders'], columns=df_12m_demand['qty_bin'],
                  values=df_12m_demand['qty'], aggfunc='count', margins = True)

#Print and plot
print(df_crosstab)

# print('\nHeatmap of the log10 values')
# sns.heatmap(np.log10(df_crosstab.iloc[:12, :5]), vmin=0, vmax = 4.5, annot=True);

#From which we can see that:
#1. 50% of TS have total demand of no more than 10 units/year (i.e. very low volumes)
#2. Nearly 50% of TS have orders in only a single month (I can calculate more accurately, if needed)
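
It is worth being clear about how `pd.cut` treats the bin edges, since that affects how the table reads. The sketch below uses invented totals. The open-ended top edge (`np.inf`) is an assumption made here so that the `>10000` label means what it says.

```python
import numpy as np
import pandas as pd

# Invented 12-month totals for six hypothetical ISBN/country keys
totals = pd.DataFrame(
    {"qty": [4, 10, 11, 250, 20000, -3]},
    index=["a", "b", "c", "d", "e", "f"],
)

bins = [0, 10, 100, 1000, 10000, np.inf]
labels = ['<=10', '10-100', '100-1000', '1000-10000', '>10000']

# Bins are right-inclusive by default, so exactly 10 lands in '<=10'
totals["qty_bin"] = pd.cut(totals["qty"], bins, labels=labels)

counts = totals["qty_bin"].value_counts()
print(totals)
print(counts)
```

Key `f` has a net negative total, so it falls below the lowest edge and gets `NaN`. Such rows are missing from the binned counts, which is worth bearing in mind when reading the percentages.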


### How does this vary by country?

There is value in looking at this because demand planners have to interact with sales managers at the country level.

In [None]:
df_country = df[['country', 'qty']].groupby(['country']).sum().sort_values(by=['qty'], ascending = False)
print('Total demand by country (units)\n')
print(df_country.head(12))

In [None]:
country_list = list(df_country.index[:12])

COLS = 3
rows = int(np.ceil(len(country_list)/COLS))  #round up
fig, axes = plt.subplots(rows, COLS, figsize = (16,rows*4))
#The following is to iterate the axes
axes_flat = axes.flat

for i, country in enumerate(country_list):
    
    ts_actuals = df[df['country'] == country].groupby("month")[["qty"]].sum().reset_index()
    
    #Plot these all on the same date range
    start = first_month
    
    idx = pd.date_range(start, last_full_month, freq='M') 

    ts_actuals.set_index(pd.to_datetime(ts_actuals.month), inplace=True)
    ts_actuals.drop("month", axis=1, inplace=True)

    #This is used to fill in the missing months
    ts_actuals = ts_actuals.reindex(idx, fill_value=0).astype(int)
      
    ax = axes_flat[i]
    ax.plot(ts_actuals)
    ax.set_ylim(bottom=0)
    ax.grid()
    ax.set_title(country);
         
plt.tight_layout()
plt.show();

#Some countries have clear patterns e.g. ES and TR.

### And what about at ISBN level?

In [None]:
df_isbn = df[['isbn', 'qty']].groupby(['isbn']).sum().sort_values(by=['qty'], ascending = False)
print('Total demand by isbn (units)\n')
print(df_isbn.head(12))

In [None]:
#Let's look at the top 12 ISBNs

isbn_list = list(df_isbn.index[:12])

COLS = 3
rows = int(np.ceil(len(isbn_list)/COLS))  #round up
fig, axes = plt.subplots(rows, COLS, figsize = (16,rows*4))
#The following is to iterate the axes
axes_flat = axes.flat

for i, isbn in enumerate(isbn_list):
    
    ts_actuals = df[df['isbn'] == isbn].groupby("month")[["qty"]].sum().reset_index()
    
    #Plot these all on the same date range
    start = first_month
    
    idx = pd.date_range(start, last_full_month, freq='M') 

    ts_actuals.set_index(pd.to_datetime(ts_actuals.month), inplace=True)
    ts_actuals.drop("month", axis=1, inplace=True)

    #This is used to fill in the missing months
    ts_actuals = ts_actuals.reindex(idx, fill_value=0).astype(int)
    ax = axes_flat[i]
    ax.plot(ts_actuals)
    ax.set_ylim(bottom=0)
    ax.grid()
    ax.set_title(isbn);
         
plt.tight_layout()
plt.show();

#Very little pattern at this level