# End-to-end demand forecasting and retraining workflow

Author: Andrii Kruchko

Version Date: 03/27/2023

[Reference DataRobot's API documentation](https://docs.datarobot.com/en/docs/api/reference/index.html)

## Summary

This notebook illustrates an end-to-end demand forecasting workflow in DataRobot. DataRobot provides a broad suite of time series tools and approaches for handling highly complex multiseries problems. These include:

- Automatic feature engineering and the creation of lagged variables across multiple data types, as well as training dataset creation.
- Diverse approaches for time series modeling with text data, learning from cross-series interactions, and scaling to hundreds or thousands of series.
- Feature generation from an uploaded calendar of events file specific to your business or use case.
- Automatic backtesting controls for regular and irregular time series.
- Training dataset creation for an irregular series via custom aggregations.
- Segmented modeling, hierarchical clustering for multiseries models, text support, and ensembling.
- Periodicity and stationarity detection and automatic feature list creation with various differencing strategies.
- Cold start modeling on series with limited or no history.
- Insights for models.
- Data and accuracy drift monitoring.
- Automated retraining.

This notebook demonstrates retraining policies with DataRobot MLOps deployments. 

The dataset consists of 50 series (46 SKUs across 22 stores) over a two-year period with varying series history, typical of a business releasing and removing products over time.

DataRobot is used for model training, selection, deployment, and making predictions. Snowflake serves as the data source for both training and testing, and as the destination to which predictions are written back. This workflow, however, applies to any data source, e.g., Redshift, S3, BigQuery, or Synapse. For examples of data loading from other environments, check out the other end-to-end examples in this GitHub repo.

The notebook covers the following steps:

- [Ingest the data from Snowflake into AI Catalog within DataRobot](#data_prep)
- [Run a new DataRobot project](#modeling)
- [Deploy the recommended model](#deployment)
- [Set up automated retraining](#retr)
- [Define and run a job to make predictions and write them back into Snowflake](#preds)



## Setup

### Optional: Import public demo data

For this workflow, you can download publicly available datasets (training, scoring, and calendar data) from DataRobot's S3 bucket to your database or load them into your DataRobot instance. 

If you are using Snowflake, you will need to update the fields below with your Snowflake information. The data is loaded and created in your Snowflake instance. You will also need the following files found in the same repo as this notebook:

* dr_utils.py
* datasets.yaml

Once you are done with this notebook, remember to delete the data from your Snowflake instance.

In [1]:
#requires Python 3.8 or higher
from dr_utils import prepare_demo_tables_in_db

Fill out the credentials for your Snowflake instance. You will need write access to a database.

In [2]:
db_user = 'your_username' # Username to access Snowflake database
db_password = 'your_password' # Password 
account = 'account' # Snowflake account identifier
db = 'YOUR_DB_NAME' # Database to write to
warehouse = 'YOUR_WAREHOUSE' # Warehouse 
schema = 'YOUR_SCHEMA' # Schema
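
If you prefer not to keep passwords in the notebook itself, the same values can be read from environment variables. The sketch below is one way to do that; the function name and the `SNOWFLAKE_*` variable names are hypothetical and not part of `dr_utils`.

```python
import os


def load_snowflake_settings(env=os.environ):
    """Read Snowflake settings from environment variables (hypothetical names)."""
    variable_names = {
        "db_user": "SNOWFLAKE_USER",
        "db_password": "SNOWFLAKE_PASSWORD",
        "account": "SNOWFLAKE_ACCOUNT",
        "db": "SNOWFLAKE_DATABASE",
        "warehouse": "SNOWFLAKE_WAREHOUSE",
        "schema": "SNOWFLAKE_SCHEMA",
    }
    # Fail early with a clear message instead of a confusing login error later.
    missing = [var for var in variable_names.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[var] for name, var in variable_names.items()}
```

With the variables set, `settings = load_snowflake_settings()` returns a dictionary whose keys match the variable names used in the cell above, for example `settings["db_user"]`.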

Use the utility function `prepare_demo_tables_in_db` to pull the data from DataRobot's public S3 bucket and import it into your Snowflake instance.

In [3]:
response = prepare_demo_tables_in_db(
    db_user = db_user,                        
    db_password = db_password,                
    account = account,                        
    db = db,                                  
    warehouse = warehouse,                     
    schema = schema
)

******************************
table: ts_demand_forecasting_train


Unnamed: 0,STORE_SKU,DATE,UNITS,UNITS_MIN,UNITS_MAX,UNITS_MEAN,UNITS_STD,TRANSACTIONS_SUM,PROMO_MAX,PRICE_MEAN,STORE,SKU,SKU_CATEGORY
0,store_130_SKU_120931082,2019-05-06,388.0,44.0,69.0,55.428571,8.182443,243.0,1.0,44.8,store_130,SKU_120931082,cat_1160
1,store_130_SKU_120931082,2019-05-13,318.0,37.0,62.0,45.428571,8.079958,210.0,1.0,44.8,store_130,SKU_120931082,cat_1160


writing ts_demand_forecasting_train to snowflake from:  https://s3.amazonaws.com/datarobot_public_datasets/ai_accelerators/ts_demand_forecasting_train.csv
******************************
table: ts_demand_forecasting_calendar


Unnamed: 0,date,event_type
0,2017-01-01,New Year's Day
1,2017-01-02,New Year's Day (Observed)


writing ts_demand_forecasting_calendar to snowflake from:  https://s3.amazonaws.com/datarobot_public_datasets/ai_accelerators/ts_demand_forecasting_calendar.csv


### Import libraries

In [4]:
from platform import python_version

import pandas as pd
import datarobot as dr

import dr_utils as dru

print('Python version:', python_version())
print('Client version:', dr.__version__)

Python version: 3.11.1
Client version: 3.0.2


### Connect to DataRobot

1. In DataRobot, navigate to **Developer Tools** by clicking on the user icon in the top-right corner. From here you can generate an API key that you will use to authenticate to DataRobot. You can find more details on creating an API key [in the DataRobot documentation](https://app.datarobot.com/docs/api/api-quickstart/index.html#create-a-datarobot-api-key). 

2. Determine your DataRobot API Endpoint. The API endpoint is the same as your DataRobot UI root. Replace {datarobot.example.com} with your deployment endpoint.

    API endpoint root: `https://{datarobot.example.com}/api/v2`
    
    For users of the AI Cloud platform, the endpoint is `https://app.datarobot.com/api/v2`
    
3. After obtaining your API Key and endpoint, there are several options to [connect to DataRobot](https://app.datarobot.com/docs/api/api-quickstart/index.html#configure-api-authentication).

In [5]:
# Instantiate the DataRobot connection

DATAROBOT_API_TOKEN = "" # Get this from the Developer Tools page in the DataRobot UI
# Endpoint - This notebook uses the default endpoint for DataRobot Managed AI Cloud (US)
DATAROBOT_ENDPOINT = "https://app.datarobot.com/api/v2" # The URL you use to access the DataRobot UI, followed by /api/v2

client = dr.Client(
    token=DATAROBOT_API_TOKEN, 
    endpoint=DATAROBOT_ENDPOINT,
    user_agent_suffix='AIA-E2E-AWS-7' # Optional. Helps DataRobot improve this workflow
)

dr.client._global_client = client
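
The endpoint passed to the client is the API root described in step 2 (the UI root followed by `/api/v2`). If you tend to copy the URL from the browser, a small helper can normalize it first. This is a minimal sketch; `normalize_endpoint` is a hypothetical helper, not part of the DataRobot client.

```python
def normalize_endpoint(url: str) -> str:
    """Return the API root for a DataRobot URL, appending /api/v2 when it is absent."""
    base = url.strip().rstrip("/")
    if not base.endswith("/api/v2"):
        base += "/api/v2"
    return base


print(normalize_endpoint("https://app.datarobot.com/"))  # https://app.datarobot.com/api/v2
```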

<a id='data_prep'></a>
## Data preparation

### Define variables

In [6]:
date_col = 'DATE'
series_id = 'STORE_SKU'
target = 'UNITS'

### Configure a data connection

DataRobot supports connections to a wide variety of databases through AI Catalog, allowing repeated access to the database as an AI Catalog data store. You can find the examples [in the DataRobot documentation](https://docs.datarobot.com/en/docs/data/connect-data/data-sources/index.html).

[Credentials for the connection to your Data Store can be securely stored within DataRobot](https://docs.datarobot.com/en/docs/data/connect-data/stored-creds.html). They will be used during the dataset creation in the AI Catalog, and can be found under the `Data Connections` tab in DataRobot.  

If you don't have credentials and a datastore created, uncomment and run the cell below.

In [None]:
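# # Find the driver ID from its name.
# # Can be skipped if you already have the ID; shown here for completeness.
# for d in dr.DataDriver.list():
#     if d.canonical_name == 'Snowflake (3.13.9 - recommended)':
#         print((d.id, d.canonical_name))

# # JDBC URL for the data store. The format below is an assumption; adjust it to your Snowflake account.
# db_url = f'jdbc:snowflake://{account}.snowflakecomputing.com/?warehouse={warehouse}&db={db}'

# # Create a datastore and datastore ID (replace driver_id with the ID printed above)
# data_store = dr.DataStore.create(data_store_type='jdbc', canonical_name='Snowflake Demo DB', driver_id='626bae0a98b54f9ba70b4122', jdbc_url=db_url)
# data_store.test(username=db_user, password=db_password)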

# # Create and store credentials to allow the AI Catalog access to this database
# # These can be found in the Data Connections tab under your profile in DataRobot  
# cred = dr.Credential.create_basic(name='test_cred',user=db_user, password=db_password,)

Use the snippet below to find a credential and a data connection (AKA a data store).

In [7]:
creds_name = 'your_stored_credential'
data_store_name = 'your_datastore_name'

credential_id = [cr.credential_id for cr in dr.Credential.list() if cr.name == creds_name][0]
data_store_id = [ds.id for ds in dr.DataStore.list() if ds.canonical_name == data_store_name][0]
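
Indexing with `[0]` raises a bare `IndexError` when the name is misspelled and silently picks the first match when several objects share a name. A minimal sketch of a stricter lookup follows; `find_one` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for the objects returned by the client.

```python
from types import SimpleNamespace


def find_one(items, attr, value, kind):
    """Return the single item whose attribute equals value, or raise a descriptive error."""
    matches = [item for item in items if getattr(item, attr) == value]
    if not matches:
        raise LookupError(f"No {kind} named {value!r} was found")
    if len(matches) > 1:
        raise LookupError(f"{len(matches)} {kind}s are named {value!r}; expected exactly one")
    return matches[0]


# Stand-in objects for illustration only.
stores = [SimpleNamespace(id="abc123", canonical_name="Snowflake Demo DB")]
print(find_one(stores, "canonical_name", "Snowflake Demo DB", "data store").id)  # abc123
```

In the notebook, the same helper could be applied to the lists used above, for example `find_one(dr.DataStore.list(), "canonical_name", data_store_name, "data store").id`.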

Use the snippet below to create or get an existing data connection based on a query and upload a training dataset into the AI Catalog.

You'll create two datasets with different training end dates to show how to set up model retraining policies and track data and accuracy drift over time with DataRobot MLOps. In addition, the last four weeks of data are withheld from training entirely so that the performance of the original and retrained models can be compared.

In [8]:
date_train_max = "'2022-05-09'"

data_source_train, dataset_train = dru.create_dataset_from_data_source(
    data_source_name='ts_training_data_first', query=f'select * from {db}.{schema}."ts_demand_forecasting_train" where {date_col} <= {date_train_max};',
    data_store_id=data_store_id, credential_id=credential_id)

new data source: DataSource('ts_training_data_first')


In [9]:
date_train_max = "'2022-09-26'"

data_source_retrain, dataset_retrain = dru.create_dataset_from_data_source(
    data_source_name='ts_retraining_data', query=f'select * from {db}.{schema}."ts_demand_forecasting_train" where {date_col} <= {date_train_max};',
    data_store_id=data_store_id, credential_id=credential_id)

new data source: DataSource('ts_retraining_data')
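
The two cutoffs used above are related by simple date arithmetic, sketched below with the standard library. The `training_filter` helper is hypothetical, and the comparison end date assumes the source table extends four weeks past the retraining cutoff, as the text above implies.

```python
from datetime import date, timedelta

first_cutoff = date(2022, 5, 9)      # last date in the first training dataset
retrain_cutoff = date(2022, 9, 26)   # last date in the retraining dataset
held_out = timedelta(weeks=4)        # period never used for training

weeks_between = (retrain_cutoff - first_cutoff).days // 7
comparison_end = retrain_cutoff + held_out


def training_filter(date_col, cutoff):
    """Build the WHERE condition used in the dataset queries (hypothetical helper)."""
    return f"{date_col} <= '{cutoff.isoformat()}'"


print(weeks_between)                           # 20
print(comparison_end)                          # 2022-10-24
print(training_filter("DATE", first_cutoff))   # DATE <= '2022-05-09'
```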


Use the following snippet to create or get an existing data connection based on a query and upload a calendar dataset into the AI Catalog.

In [10]:
data_source_calendar, dataset_calendar = dru.create_dataset_from_data_source(
    data_source_name='ts_calendar', query=f'select * from {db}.{schema}."ts_demand_forecasting_calendar";',
    data_store_id=data_store_id, credential_id=credential_id)

new data source: DataSource('ts_calendar')


The following snippet is optional and can be used to get data to investigate before modeling.

In [11]:
df_train = dataset_train.get_as_dataframe()
df_train[date_col] = pd.to_datetime(df_train[date_col], format='%Y-%m-%d')
print('the original data shape   :', df_train.shape)
print('the total number of series:', df_train[series_id].nunique())
print('the min date:', df_train[date_col].min())
print('the max date:', df_train[date_col].max())
df_train.head()

the original data shape   : (7389, 13)
the total number of series: 50
the min date: 2019-05-06 00:00:00
the max date: 2022-05-09 00:00:00


Unnamed: 0,STORE_SKU,DATE,UNITS,UNITS_MIN,UNITS_MAX,UNITS_MEAN,UNITS_STD,TRANSACTIONS_SUM,PROMO_MAX,PRICE_MEAN,STORE,SKU,SKU_CATEGORY
0,store_130_SKU_120931082,2019-05-06,388.0,44.0,69.0,55.428571,8.182443,243.0,1.0,44.8,store_130,SKU_120931082,cat_1160
1,store_130_SKU_120931082,2019-05-13,318.0,37.0,62.0,45.428571,8.079958,210.0,1.0,44.8,store_130,SKU_120931082,cat_1160
2,store_130_SKU_120931082,2019-05-20,126.0,13.0,23.0,18.0,3.91578,118.0,0.0,44.8,store_130,SKU_120931082,cat_1160
3,store_130_SKU_120931082,2019-05-27,285.0,23.0,65.0,40.714286,14.067863,197.0,1.0,44.8,store_130,SKU_120931082,cat_1160
4,store_130_SKU_120931082,2019-06-03,93.0,10.0,20.0,13.285714,3.352327,87.0,0.0,44.8,store_130,SKU_120931082,cat_1160


In [12]:
df_retrain = dataset_retrain.get_as_dataframe()
df_retrain[date_col] = pd.to_datetime(df_retrain[date_col], format='%Y-%m-%d')
print('the original data shape   :', df_retrain.shape)
print('the total number of series:', df_retrain[series_id].nunique())
print('the min date:', df_retrain[date_col].min())
print('the max date:', df_retrain[date_col].max())
df_retrain.head()

the original data shape   : (8389, 13)
the total number of series: 50
the min date: 2019-05-06 00:00:00
the max date: 2022-09-26 00:00:00


Unnamed: 0,STORE_SKU,DATE,UNITS,UNITS_MIN,UNITS_MAX,UNITS_MEAN,UNITS_STD,TRANSACTIONS_SUM,PROMO_MAX,PRICE_MEAN,STORE,SKU,SKU_CATEGORY
0,store_130_SKU_120931082,2019-05-06,388.0,44.0,69.0,55.428571,8.182443,243.0,1.0,44.8,store_130,SKU_120931082,cat_1160
1,store_130_SKU_120931082,2019-05-13,318.0,37.0,62.0,45.428571,8.079958,210.0,1.0,44.8,store_130,SKU_120931082,cat_1160
2,store_130_SKU_120931082,2019-05-20,126.0,13.0,23.0,18.0,3.91578,118.0,0.0,44.8,store_130,SKU_120931082,cat_1160
3,store_130_SKU_120931082,2019-05-27,285.0,23.0,65.0,40.714286,14.067863,197.0,1.0,44.8,store_130,SKU_120931082,cat_1160
4,store_130_SKU_120931082,2019-06-03,93.0,10.0,20.0,13.285714,3.352327,87.0,0.0,44.8,store_130,SKU_120931082,cat_1160


<a id='modeling'></a>
## Modeling

The next step after the data preparation and before modeling is [to specify the modeling parameters](https://datarobot-public-api-client.readthedocs-hosted.com/en/v2.28.0/reference/modeling/spec/time_series.html#):

- *Features known in advance* are things you know in the future, such as product metadata or a planned marketing event. If all features are known in advance, use the setting `default_to_known_in_advance`.

- *Do not derive features* are excluded from time-based feature derivation (for example, lags and rolling statistics). If all features should be excluded, `default_to_do_not_derive` can be used.

- `metric` is used for evaluating models. [DataRobot supports a wide variety of metrics](https://app.datarobot.com/docs/modeling/reference/model-detail/opt-metric.html). The metric used depends on the use case. If the value is not specified, DataRobot suggests a metric based on the target distribution.

- `feature_derivation_window_start` and `feature_derivation_window_end` define the feature derivation window (FDW). The FDW represents the rolling window that is used to derive time series features and lags. The FDW should be long enough to capture the trends relevant to your use case. On the other hand, it shouldn't be too long (e.g., 365 days), because a long window shrinks the available training data and increases the size of the feature list, and older data does not help the model learn recent trends. A year-long FDW is not necessary to capture seasonality: DataRobot auto-derives date features (month indicator, day indicator) and learns effects near calendar events (if a calendar is provided), in addition to blueprint-specific techniques for seasonality.

- `gap_duration` is the duration of the gap between training and validation or holdout scoring data, representing delays in data availability. For example, at prediction time, if events occurring on Monday aren't reported or made available until Wednesday, you would have a gap of two days. This can occur with reporting lags or with data that requires some form of validation before being stored on a system of record.

- `forecast_window_start` and `forecast_window_end` define the forecast window (FW), the rolling window of future values to predict. The FW depends on the business application of the model predictions.

- `number_of_backtests` and `validation_duration`. Proper backtest configuration helps evaluate the model’s ability to generalize to the appropriate time periods. The main considerations during the backtests specification are listed below.
    - Validation from all backtests combined should span the region of interest.
    - Fewer backtests means the validation lengths might need to be longer.
    - After the specification of the appropriate validation length, the number of backtests should be adjusted until they span a full region of interest.
    - The validation duration should be at least as long as your best estimate of the amount of time the model will be in production without retraining.

- `holdout_start_date`, together with either `holdout_end_date` or `holdout_duration`, can also be specified. If they are not specified, DataRobot defines them based on `validation_duration`.

- `calendar_id` is the ID of the previously created calendar. DataRobot automatically creates features based on the calendar events (such as “days to next event”). There are several options to create a calendar:
    
    - From an AI Catalog dataset:
        ```python
        calendar = dr.CalendarFile.create_calendar_from_dataset(dataset_id)
        ```
    - Based on the provided country code and dataset start date and end dates:
        ```python
        calendar = dr.CalendarFile.create_calendar_from_country_code(country_code, start_date, end_date)
        ```
    - From a local file:
        ```python
        calendar = dr.CalendarFile.create(path_to_calendar_file)
        ```

- `allow_partial_history_time_series_predictions` - Not all blueprints are designed to predict on new series with only partial history, as it can lead to suboptimal predictions. This is because for those blueprints the full history is needed to derive the features for specific forecast points. "Cold start" is the ability to model on series that were not seen in the training data; partial history refers to prediction datasets with series history that is only partially known (historical rows are partially available within the feature derivation window). If `True`, Autopilot runs the blueprints optimized for cold start and also for partial history modeling, eliminating models with less accurate results for partial history support.

- `segmented_project` set to `True` and the cluster name `cluster_id` can be used for segmented modeling. This feature offers the ability to build multiple forecasting models simultaneously. DataRobot creates multiple projects “under the hood”. Each project is specific to its own data per `cluster_id`. The model benefits by having forecasts tailored to the specific data subset, rather than assuming that the important features are going to be the same across all series. The models for different `cluster_id`s will have features engineered specifically from cluster-specific data. The benefits of segmented modeling also extend to deployments. Rather than deploying each model separately, you can deploy all of them at once within one segmented deployment.
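
The backtest guidance above (validation from all backtests combined should span the region of interest) reduces to a small piece of arithmetic. The sketch below assumes non-overlapping validation folds of equal length; `backtests_needed` is a hypothetical helper, and the day counts are illustrative only.

```python
def backtests_needed(region_of_interest_days: int, validation_duration_days: int) -> int:
    """Smallest number of equal, non-overlapping validation folds that covers the region."""
    if region_of_interest_days <= 0 or validation_duration_days <= 0:
        raise ValueError("Both durations must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-region_of_interest_days // validation_duration_days)


print(backtests_needed(140, 28))  # 5
print(backtests_needed(90, 28))   # 4
```

Halving the number of backtests in the first example would require validation folds roughly twice as long to cover the same 140 days, which is the trade-off described in the list above.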

> The function `dr.helpers.partitioning_methods.construct_duration_string()` can be used to construct a valid ISO 8601 duration string for `gap_duration`, `validation_duration`, and `holdout_duration`.
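
To show what an ISO 8601 duration string looks like, here is a standalone sketch that builds one with the standard library only. `iso8601_duration` is a hypothetical stand-in written for illustration; it is not the DataRobot helper, whose output may be formatted differently (for example, by including zero-valued components).

```python
def iso8601_duration(years=0, months=0, days=0, hours=0, minutes=0, seconds=0):
    """Build a compact ISO 8601 duration string such as 'P28D' or 'P1Y2M3DT4H'."""
    values = (years, months, days, hours, minutes, seconds)
    if any(value < 0 for value in values):
        raise ValueError("Duration components must be non-negative")
    date_part = "".join(
        f"{value}{unit}" for value, unit in ((years, "Y"), (months, "M"), (days, "D")) if value
    )
    time_part = "".join(
        f"{value}{unit}" for value, unit in ((hours, "H"), (minutes, "M"), (seconds, "S")) if value
    )
    if not date_part and not time_part:
        return "P0D"  # zero-length duration
    return "P" + date_part + ("T" + time_part if time_part else "")


print(iso8601_duration(days=28))   # P28D
print(iso8601_duration(hours=12))  # PT12H
```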

### Create a calendar

Create a calendar based on the dataset in the AI Catalog.

In [13]:
calendar = dr.CalendarFile.create_calendar_from_dataset(dataset_id=dataset_calendar.id, calendar_name=dataset_calendar.name)
calendar_id = calendar.id

### Configure modeling settings

In [14]:
features_known_in_advance = ['STORE', 'SKU', 'SKU_CATEGORY', 'PROMO_MAX']
do_not_derive_features = ['STORE', 'SKU', 'SKU_CATEGORY']

params = {
    'metric': None,
    'features_known_in_advance': features_known_in_advance,
    'do_not_derive_features': do_not_derive_features,
    
    'target': target,
    'mode': 'quick',
    
    'datetime_partition_column': date_col,
    'multiseries_id_columns': [series_id],
    'use_time_series': True,
    
    'feature_derivation_window_start': None,
    'feature_derivation_window_end': None,
    
    'gap_duration': None,
    
    'forecast_window_start': None,
    'forecast_window_end': None,
    
    'number_of_backtests': None,
    'validation_duration': None,
    
    'calendar_id': calendar_id,
    
    'allow_partial_history_time_series_predictions': True,
}
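
Many of the settings above are `None`. How `dru.run_project` treats them is not shown in this notebook; presumably they are left for DataRobot to choose. A quick way to see which settings are explicit and which are left open is sketched below, using a shortened copy of the dictionary. `split_settings` is a hypothetical helper.

```python
def split_settings(settings):
    """Separate explicitly set values from those left as None."""
    explicit = {key: value for key, value in settings.items() if value is not None}
    left_open = sorted(key for key, value in settings.items() if value is None)
    return explicit, left_open


# Shortened copy of the settings above, for illustration.
example = {
    "metric": None,
    "target": "UNITS",
    "mode": "quick",
    "feature_derivation_window_start": None,
    "feature_derivation_window_end": None,
    "forecast_window_start": None,
    "forecast_window_end": None,
}
explicit, left_open = split_settings(example)
print(explicit)
print(left_open)
```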

### Create and run projects

In order to try all cold start solution methods, run two projects on the same training data. 

Use the following snippet to create and run the project without new series support.

In [15]:
project = dru.run_project(data=dataset_train, params=params)
project.wait_for_autopilot(verbosity=dr.enums.VERBOSITY_LEVEL.SILENT)

DataRobot will define FDW and FD automatically.
2023-03-27 11:21:16.314331 start: UNITS_20230327_1121


<a id='deployment'></a>
## Deployment

You have multiple options for a production deployment in DataRobot MLOps. Creating a deployment adds your model package to the Model Registry and containerizes all model artifacts, generates compliance documentation, exposes a production quality REST API on a prediction server in your DataRobot cluster, and enables all lifecycle management functionality, like drift monitoring.

`dru.make_deployment` can be used to deploy a specific model by passing the `model_id`. If `model_id` is **None**, the DataRobot recommended model from the project is deployed.

In [18]:
deployment = dru.make_deployment(project, target_drift_enabled=True, feature_drift_enabled=True)

Deployment ID: 64215a1adecb97f050eccafd; URL: https://app.datarobot.com/deployments/64215a1adecb97f050eccafd/overview



Update your deployment's additional settings in order to set up automated retraining later:

- `update_association_id_settings`: The association ID functions as a foreign key for your prediction dataset so you can later match up actuals with those predictions. It corresponds to an event for which you want to track the outcome. In this case, combine **series ID** and the corresponding **forecast date**.

- `update_predictions_data_collection_settings` and `update_challenger_models_settings`: These settings allow predictions to be scored on selected models in order to compare their performance with that of the currently deployed model.

In [20]:
deployment.update_association_id_settings(column_names=[f'{series_id}_{date_col}'], required_in_prediction_requests=True)
deployment.update_predictions_data_collection_settings(enabled=True)
deployment.update_challenger_models_settings(challenger_models_enabled=True)
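
The association ID column name used above is built from the series ID and date column names, and each value combines a series identifier with a date. A minimal sketch of that construction follows; `association_id` is a hypothetical helper, and it assumes dates are rendered as `YYYY-MM-DD`.

```python
from datetime import date

series_id = "STORE_SKU"
date_col = "DATE"


def association_id(series_value: str, forecast_date: date) -> str:
    """Combine a series identifier and a date into one key per predicted row."""
    return f"{series_value}_{forecast_date.isoformat()}"


print(f"{series_id}_{date_col}")  # STORE_SKU_DATE
print(association_id("store_130_SKU_120931082", date(2022, 5, 16)))
# store_130_SKU_120931082_2022-05-16
```

Whatever produces the scoring data needs to emit the same key for a given series and date, so that actuals can later be matched to predictions.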

Additionally, use the DataRobot UI to enable the option to infer actual values from time series history and automatically use them for accuracy estimation.

<img src="images/auto_feedback.png"/>

<a id='retr'></a>
## Set up automatic retraining

To maintain model performance after deployment without extensive manual work, DataRobot provides an automatic retraining capability for deployments. Upon providing a retraining dataset registered in the **AI Catalog**, you can define up to five retraining policies on each deployment, each consisting of a trigger, a modeling strategy, modeling settings, and a replacement action. When triggered, retraining will produce a new model based on these settings and notify you to consider promoting it.

You can find more details on how to set up automated retraining [in the DataRobot documentation](https://app.datarobot.com/docs/mlops/manage-mlops/set-up-auto-retraining.html#set-up-retraining-for-a-deployment).

### Provide retraining data

All retraining policies on a deployment refer to the same **AI Catalog** dataset. You can register the dataset by navigating to the **Settings > Data** tab of the deployment and adding it to the **Learning** section. Alternatively, you can add training data directly from the **Challengers and Retraining** tab.
<img src="images/reatraining_data.png"/>

### Schedule monitoring

DataRobot provides automated monitoring with a notification system. You can configure notifications to alert you when the service health, data drift status, model accuracy, or fairness exceed your defined acceptable levels.

#### [Set up data drift monitoring](https://app.datarobot.com/docs/mlops/governance/deploy-notifications.html#set-up-data-drift-monitoring)
Drift assesses how the distribution of data changes across all features, for a specified range. The thresholds you set determine the amount of drift you will allow before a notification is triggered.

Use the **Data Drift** section of the **Monitoring** tab to set thresholds for drift and importance:

- Drift is a measure of how new prediction data differs from the original data used to train the model.
- Importance allows you to separate the features you care most about from those that are less important.

For both drift and importance, you can visualize the thresholds and how they separate the features on the [Data Drift tab](https://app.datarobot.com/docs/mlops/monitor/data-drift.html).
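
To build intuition for what a drift score measures, the sketch below computes the Population Stability Index (PSI), one widely used drift statistic, for a single binned feature. This notebook does not state which metric your deployment is configured with, so treat this purely as an illustration; the bin proportions are made up.

```python
import math


def population_stability_index(expected, actual, eps=1e-6):
    """PSI between two binned distributions given as lists of proportions."""
    if len(expected) != len(actual):
        raise ValueError("Both distributions must use the same bins")
    psi = 0.0
    for e, a in zip(expected, actual):
        # Clip empty bins so the logarithm stays defined.
        e, a = max(e, eps), max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi


training_share = [0.25, 0.25, 0.25, 0.25]  # share of training rows per bin (made up)
scoring_share = [0.10, 0.20, 0.30, 0.40]   # share of scoring rows per bin (made up)

print(population_stability_index(training_share, training_share))  # 0.0
print(round(population_stability_index(training_share, scoring_share), 3))
```

Identical distributions score zero, and the score grows as the scoring data moves away from the training data.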

#### [Set up accuracy monitoring](https://app.datarobot.com/docs/mlops/governance/deploy-notifications.html#set-up-accuracy-monitoring)

For Accuracy, the notification conditions relate to a performance optimization metric for the underlying model in the deployment. Select from the same set of metrics that are available on the Leaderboard. You can visualize accuracy using the [Accuracy over Time graph](https://app.datarobot.com/docs/mlops/monitor/deploy-accuracy.html#accuracy-over-time-graph) and the [Prediction & Actual graph](https://app.datarobot.com/docs/mlops/monitor/deploy-accuracy.html#predicted-actual-graph).

Accuracy monitoring is defined by a single accuracy rule. Every 30 seconds, the rule evaluates the deployment's accuracy. Notifications trigger when this rule is violated.

Prior to configuring accuracy notifications and monitoring for a deployment, set an [association ID](https://app.datarobot.com/docs/mlops/manage-mlops/setup-accuracy.html#select-an-association-id).

<img src="images/monitoring.png"/>

### Set up retraining policies

1. On the **Settings > Challengers and Retraining** tab for a deployment, click **+ Add Retraining Policy**.
<img src="images/retrain_add.png"/>

2. Set a [retraining trigger](https://app.datarobot.com/docs/mlops/manage-mlops/set-up-auto-retraining.html#triggers).

3. Configure how DataRobot [selects a model](https://app.datarobot.com/docs/mlops/manage-mlops/set-up-auto-retraining.html#time-series-model-selection) from the new Autopilot project.

4. Set up a replacement strategy by selecting a [model action](https://app.datarobot.com/docs/mlops/manage-mlops/set-up-auto-retraining.html#model-action).

5. Set up a [modeling strategy](https://app.datarobot.com/docs/mlops/manage-mlops/set-up-auto-retraining.html#time-series-modeling-strategy) by selecting settings for the new Autopilot project.

6. Click **Save policy** above the policy settings.

#### Triggers
Retraining policies can be triggered manually or in response to three types of conditions:

- **Automatic schedule**: Pick a time for the retraining policy to trigger automatically. Choose from increments ranging from every three months to every day. Note that DataRobot uses your local time zone.

- **Drift status**: Initiates retraining when the deployment's data drift status declines to the level(s) you select.

- **Accuracy status**: Triggers when the deployment's accuracy status changes from a better status to the levels you select (green to yellow, yellow to red, etc.).
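
The idea behind an accuracy-based trigger can be sketched as a comparison of the current error against a baseline. The code below is an illustration only: the helper names and the 20% threshold are hypothetical, and the actual status levels are configured in the deployment's monitoring settings.

```python
import math


def rmse(actuals, predictions):
    """Root mean squared error of paired actuals and predictions."""
    if not actuals or len(actuals) != len(predictions):
        raise ValueError("Inputs must be non-empty and of equal length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actuals, predictions)) / len(actuals))


def relative_change(baseline, current):
    """Fractional change of current relative to baseline (positive means worse for RMSE)."""
    if baseline <= 0:
        raise ValueError("Baseline must be positive")
    return (current - baseline) / baseline


def should_retrain(baseline_rmse, current_rmse, threshold=0.20):
    """Flag retraining when RMSE has degraded by more than the (hypothetical) threshold."""
    return relative_change(baseline_rmse, current_rmse) > threshold


print(round(relative_change(100.0, 122.5), 3))  # 0.225
print(should_retrain(100.0, 122.5))             # True
print(should_retrain(100.0, 105.0))             # False
```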

#### Model selection

**Same blueprint as champion**: The retraining policy uses the same engineered features as the champion model's blueprint. The search for newly derived features does not occur because it could potentially generate features that are not captured in the champion's blueprint.

**Autopilot**: When using Autopilot instead of the same blueprint, the time series feature derivation process does occur. However, Comprehensive Autopilot mode is not supported. Additionally, time series Autopilot does not support the options to only include Scoring Code blueprints and models with SHAP value support.

<img src="images/model_selection.png"/>

#### Model action

<img src="images/model_action.png"/>

#### Modeling strategy

**Same blueprint as champion**: When creating a "same-blueprint" retraining policy for a time series deployment, you must use the champion model's feature list and advanced modeling options. The only option that you can override is the calendar used because, for example, a new holiday or event may be included in an updated calendar that you want to account for during retraining.

**Autopilot**: When creating an Autopilot retraining policy for a time series deployment, you must use the informative features modeling strategy. This strategy allows Autopilot to derive a new set of feature lists based on the informative features generated by new or different data. You cannot use the model's original feature list because time series Autopilot uses a feature extraction and reduction process by default.

<a id='preds'></a>
## Make predictions

Once you have set up model retraining, use your deployment to make predictions for several consecutive months to track data and accuracy drift over time.

The scoring dataset must meet the following requirements so that the Batch Prediction API can make predictions:

- Sort prediction rows by their series ID then timestamp, with the earliest row first.
- The dataset must contain rows without a target for the desired forecast window.

You can find more details on the scoring dataset structure [in the DataRobot documentation](https://app.datarobot.com/docs/api/reference/batch-prediction-api/batch-pred-ts.html#requirements-for-the-scoring-dataset).
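
The two requirements above can be checked on a sample of rows before a job is submitted. The sketch below is a hypothetical pre-flight check, not part of the Batch Prediction API; it assumes dates are ISO-formatted strings, which sort correctly as text.

```python
def check_scoring_rows(rows, series_col, date_col, target_col, forecast_point):
    """Return a list of problems found in scoring rows (an empty list means none)."""
    problems = []
    keys = [(row[series_col], row[date_col]) for row in rows]
    if keys != sorted(keys):
        problems.append("rows are not sorted by series ID, then timestamp")
    future_rows = [row for row in rows if row[date_col] > forecast_point]
    if not future_rows:
        problems.append("no rows fall inside the forecast window")
    elif any(row[target_col] is not None for row in future_rows):
        problems.append("rows after the forecast point must have an empty target")
    return problems


# Made-up rows for illustration.
rows = [
    {"STORE_SKU": "store_1_SKU_1", "DATE": "2022-05-02", "UNITS": 120.0},
    {"STORE_SKU": "store_1_SKU_1", "DATE": "2022-05-09", "UNITS": 95.0},
    {"STORE_SKU": "store_1_SKU_1", "DATE": "2022-05-16", "UNITS": None},
]
print(check_scoring_rows(rows, "STORE_SKU", "DATE", "UNITS", "2022-05-09"))  # []
```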

In [22]:
forecast_points = ["'2022-05-09'", "'2022-06-06'", "'2022-07-04'", "'2022-08-01'", "'2022-08-29'", "'2022-09-26'"]

model_name = "'first model'"

for forecast_point in forecast_points:
    query = f"""
        select
            concat(t.store_sku, '_', t.date::varchar) store_sku_date,
            t.store_sku,
            t.date,
            iff(t.date > {forecast_point}, null, t.units) units,
            iff(t.date > {forecast_point}, null, t.units_min) units_min,
            iff(t.date > {forecast_point}, null, t.units_max) units_max,
            iff(t.date > {forecast_point}, null, t.units_mean) units_mean,
            iff(t.date > {forecast_point}, null, t.units_std) units_std,
            iff(t.date > {forecast_point}, null, t.transactions_sum) transactions_sum,
            t.promo_max,
            iff(t.date > {forecast_point}, null, t.price_mean) price_mean,
            t.store,
            t.sku,
            t.sku_category,
            {model_name} model_name
        from {db}.{schema}."ts_demand_forecasting_train" t
        where
            t.date >= dateadd(day, -70, {forecast_point}) and 
            t.date <= dateadd(day,  28, {forecast_point})
        order by
            t.store_sku,
            t.date
        ;
    """

    intake_settings = {
        "type": "jdbc",
        "data_store_id": data_store_id,
        "credential_id": credential_id,
        "query": query
        }

    output_settings = {
        "type": "jdbc",
        "data_store_id": data_store_id,
        "credential_id": credential_id,
        "table": "ts_demand_forecasting_predictions",
        "schema": schema,
        "catalog": db,
        "create_table_if_not_exists": True,
        "statement_type": "insert"
        }

    pred_job = dru.make_predictions_from_deployment(deployment=deployment, intake_settings=intake_settings, output_settings=output_settings,
                                    passthrough_columns_set='all')
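
The forecast points in the cell above are spaced 28 days apart, and each query selects 70 days of history plus 28 days of future rows. The sketch below reproduces those dates with the standard library; `scoring_windows` is a hypothetical helper, and the parameter defaults simply mirror the values in the query.

```python
from datetime import date, timedelta


def scoring_windows(first_point, last_point, step_days=28, history_days=70, horizon_days=28):
    """Yield (forecast_point, history_start, horizon_end) for each scoring run."""
    point = first_point
    while point <= last_point:
        yield (
            point,
            point - timedelta(days=history_days),
            point + timedelta(days=horizon_days),
        )
        point += timedelta(days=step_days)


windows = list(scoring_windows(date(2022, 5, 9), date(2022, 9, 26)))
for forecast_point, history_start, horizon_end in windows:
    print(forecast_point, history_start, horizon_end)
```

Generating the list this way keeps the forecast points consistent if the first date or the spacing changes.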


### Data drift

The plot below shows the data drift for a deployment. Currently, only two features with relatively low importance have drifted significantly.
<img src="images/data_drift.png"/>

### Accuracy

According to accuracy tracking, model performance degraded over the several-month scoring period. RMSE is more than 22% worse than the original model's performance, which triggers the corresponding retraining policy.

<img src="images/accuracy.png"/>

### Predictions after retraining

To check the benefits of retraining, compare the performance of the first and retrained models on the period right after the training data ends.

In [24]:
forecast_points = ["'2022-09-26'"]

model_name = "'retrained model'"

for forecast_point in forecast_points:
    query = f"""
        select
            concat(t.store_sku, '_', t.date::varchar) store_sku_date,
            t.store_sku,
            t.date,
            iff(t.date > {forecast_point}, null, t.units) units,
            iff(t.date > {forecast_point}, null, t.units_min) units_min,
            iff(t.date > {forecast_point}, null, t.units_max) units_max,
            iff(t.date > {forecast_point}, null, t.units_mean) units_mean,
            iff(t.date > {forecast_point}, null, t.units_std) units_std,
            iff(t.date > {forecast_point}, null, t.transactions_sum) transactions_sum,
            t.promo_max,
            iff(t.date > {forecast_point}, null, t.price_mean) price_mean,
            t.store,
            t.sku,
            t.sku_category,
            {model_name} model_name
        from {db}.{schema}."ts_demand_forecasting_train" t
        where
            t.date >= dateadd(day, -70, {forecast_point}) and 
            t.date <= dateadd(day,  28, {forecast_point})
        order by
            t.store_sku,
            t.date
        ;
    """

    intake_settings = {
        "type": "jdbc",
        "data_store_id": data_store_id,
        "credential_id": credential_id,
        "query": query
        }

    output_settings = {
        "type": "jdbc",
        "data_store_id": data_store_id,
        "credential_id": credential_id,
        "table": "ts_demand_forecasting_predictions",
        "schema": schema,
        "catalog": db,
        "create_table_if_not_exists": True,
        "statement_type": "insert"
        }

    pred_job = dru.make_predictions_from_deployment(deployment=deployment, intake_settings=intake_settings, output_settings=output_settings,
                                    passthrough_columns_set='all')
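
The scoring query above selects 70 days of history and 28 days of forecast horizon around each forecast point, and nulls out the columns that cannot be known after the forecast point. The sketch below reproduces that logic in plain Python to make the window explicit. The helper names and the sample promo and price values are hypothetical; the column names and day offsets come from the query.

```python
from datetime import date, timedelta

HISTORY_DAYS = 70  # mirrors dateadd(day, -70, forecast_point)
HORIZON_DAYS = 28  # mirrors dateadd(day, 28, forecast_point)

# Columns the query replaces with null after the forecast point.
UNKNOWN_IN_ADVANCE = {
    "units", "units_min", "units_max", "units_mean", "units_std",
    "transactions_sum", "price_mean",
}


def scoring_window(forecast_point: date) -> tuple:
    """Return the first and last dates selected by the query's where clause."""
    start = forecast_point - timedelta(days=HISTORY_DAYS)
    end = forecast_point + timedelta(days=HORIZON_DAYS)
    return start, end


def mask_future(row: dict, forecast_point: date) -> dict:
    """Null out values that are not known after the forecast point."""
    if row["date"] <= forecast_point:
        return dict(row)
    return {k: (None if k in UNKNOWN_IN_ADVANCE else v) for k, v in row.items()}


forecast_point = date(2022, 9, 26)
start, end = scoring_window(forecast_point)
print(start, end)  # 2022-07-18 2022-10-24

# Hypothetical future row: promo_max is kept, units and price_mean are masked.
row = {"date": date(2022, 10, 3), "units": 86.0, "promo_max": 1, "price_mean": 4.5}
print(mask_future(row, forecast_point))
```

With weekly data and a forecast point of 2022-09-26, the forecast rows fall on 2022-10-03 through 2022-10-24. Columns such as `promo_max` pass through unmasked in the query, presumably because they are known in advance.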


## Get predictions

In the previous step, you ran prediction jobs from your registered data store (Snowflake) on the deployment before and after retraining. The records with the value `first model` in the `model_name` column are predictions made before the model was retrained. They cover a longer period so that data drift and accuracy can be checked over time. Because model performance dropped significantly, retraining was triggered. The records with the value `retrained model` in the `model_name` column were made on the data right after the retraining data ends.

The steps below add a new prediction dataset, `ts_demand_forecasting_predictions_combined`, to the AI Catalog. This enables versioning of each prediction run and completes the comparison of the model before and after retraining. The dataset contains the predictions for the period right after the retraining data ends.

In [25]:
query = f"""
    select
        t.store_sku,
        t.date,
        t.units,
        pf."UNITS (actual)_PREDICTION" prediction_first_model,
        pr."UNITS (actual)_PREDICTION" prediction_retrained_model
    from {db}.{schema}."ts_demand_forecasting_train" t
    join {db}.{schema}."ts_demand_forecasting_predictions" pf
        on t.store_sku = pf.store_sku
        and t.date = pf.date
        and pf.model_name = 'first model'
    join {db}.{schema}."ts_demand_forecasting_predictions" pr
        on t.store_sku = pr.store_sku
        and t.date = pr.date
        and pr.model_name = 'retrained model'
    ;
"""
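
Because both joins are inner joins, the combined result keeps only the series and dates that were scored by both models. The following sketch shows the same self-join pattern on a toy in-memory SQLite database. The table names, column layout, and values are simplified and hypothetical, and SQLite stands in for Snowflake only to keep the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table actuals (store_sku text, date text, units real);
    create table predictions (store_sku text, date text, model_name text, prediction real);
""")
conn.executemany("insert into actuals values (?, ?, ?)", [
    ("s1", "2022-09-26", 10.0),
    ("s1", "2022-10-03", 12.0),
])
conn.executemany("insert into predictions values (?, ?, ?, ?)", [
    ("s1", "2022-09-26", "first model", 9.0),   # scored by the first model only
    ("s1", "2022-10-03", "first model", 11.0),
    ("s1", "2022-10-03", "retrained model", 12.5),
])

# Join the predictions table twice, once per model name.
rows = conn.execute("""
    select t.store_sku, t.date, t.units,
           pf.prediction as prediction_first_model,
           pr.prediction as prediction_retrained_model
    from actuals t
    join predictions pf
        on t.store_sku = pf.store_sku and t.date = pf.date
        and pf.model_name = 'first model'
    join predictions pr
        on t.store_sku = pr.store_sku and t.date = pr.date
        and pr.model_name = 'retrained model'
    order by t.store_sku, t.date
""").fetchall()
print(rows)  # [('s1', '2022-10-03', 12.0, 11.0, 12.5)]
conn.close()
```

The row for 2022-09-26 is dropped because only the first model scored it, which is the behavior that limits the combined dataset to the overlapping period.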

In [26]:
# Create or get an existing data source based on the query and upload the combined predictions dataset into the AI Catalog

data_source_preds, dataset_preds = dru.create_dataset_from_data_source(
    data_source_name='ts_demand_forecasting_predictions_combined', query=query,
    data_store_id=data_store_id, credential_id=credential_id)

new data source: DataSource('ts_demand_forecasting_predictions_combined')


In [27]:
# Get predictions as a DataFrame

df_preds = dataset_preds.get_as_dataframe()
df_preds[date_col] = pd.to_datetime(df_preds[date_col], format='%Y-%m-%d')
df_preds.sort_values([series_id, date_col], inplace=True)
print('the original data shape   :', df_preds.shape)
print('the total number of series:', df_preds[series_id].nunique())
print('the min date:', df_preds[date_col].min())
print('the max date:', df_preds[date_col].max())
df_preds.head()

the original data shape   : (200, 5)
the total number of series: 50
the min date: 2022-10-03 00:00:00
the max date: 2022-10-24 00:00:00


|   | STORE_SKU | DATE | UNITS | PREDICTION_FIRST_MODEL | PREDICTION_RETRAINED_MODEL |
|---|---|---|---|---|---|
| 0 | store_130_SKU_120931082 | 2022-10-03 | 86.0 | 73.83269 | 73.985825 |
| 1 | store_130_SKU_120931082 | 2022-10-10 | 97.0 | 74.379489 | 73.38073 |
| 2 | store_130_SKU_120931082 | 2022-10-17 | 73.0 | 72.94839 | 72.261116 |
| 3 | store_130_SKU_120931082 | 2022-10-24 | 64.0 | 71.079112 | 71.776375 |
| 4 | store_130_SKU_120969795 | 2022-10-03 | 63.0 | 52.748136 | 52.485294 |


The overlapping period contains predictions for four consecutive weeks. The retrained model performs noticeably better on it: as computed below, RMSE drops from about 42.0 to about 34.2, an improvement of roughly 18%.

In [29]:
print('RMSE before retraining:', dru.rmse(df_preds, target, 'PREDICTION_FIRST_MODEL'))
print('RMSE after retraining :', dru.rmse(df_preds, target, 'PREDICTION_RETRAINED_MODEL'))

RMSE before retraining: 41.95933587139678
RMSE after retraining : 34.24508060550132
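
For reference, the metric itself is easy to reproduce. The sketch below is a plain-Python RMSE and is not the actual implementation of `dru.rmse`, which is not shown in this notebook. The function name and the toy inputs are hypothetical, while the two RMSE values used for the improvement calculation are the ones reported above.

```python
import math


def rmse_from_lists(actuals, predictions) -> float:
    """Root mean squared error between two equally long sequences."""
    if len(actuals) != len(predictions):
        raise ValueError("actuals and predictions must have the same length")
    if not actuals:
        raise ValueError("at least one observation is required")
    squared_errors = [(a - p) ** 2 for a, p in zip(actuals, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy check: every prediction is off by exactly 2 units.
print(rmse_from_lists([0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0]))  # 2.0

# Relative improvement implied by the RMSE values reported above.
rmse_before = 41.95933587139678
rmse_after = 34.24508060550132
improvement = (rmse_before - rmse_after) / rmse_before
print(f"RMSE improvement: {improvement:.1%}")
```

Because RMSE squares the errors before averaging, a few series with large misses can dominate the total, so a per-series breakdown is a useful follow-up check.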


## Conclusion

This notebook's workflow provides a repeatable framework, from project setup to model deployment, monitoring, and retraining, for time series data with multiple series (SKUs) that have full, partial, or no history.

## Delete project artifacts 

Optional.

In [None]:
# # Uncomment and run this cell to remove everything you added in DataRobot during this session

# dr.Dataset.delete(dataset_train.id)
# dr.Dataset.delete(dataset_retrain.id)
# dr.Dataset.delete(dataset_calendar.id)
# dr.Dataset.delete(dataset_preds.id)

# data_source_train.delete()
# data_source_retrain.delete()
# data_source_calendar.delete()
# data_source_preds.delete()

# deployment.delete()

# project.delete()