## Statsmodel Forecast with Wallaroo Features: Data Connection

Wallaroo Connections are definitions, set by MLOps engineers, that give other Wallaroo users the information needed to connect to a data source.

This provides MLOps engineers a method of creating and updating connection information for data stores:  databases, Kafka topics, etc.  Wallaroo Connections are composed of three main parts:

* Name:  The unique name of the connection.
* Type:  A user defined string that designates the type of connection.  This is used to organize connections.
* Details:  Details are a JSON object containing the information needed to make the connection.  This can include data sources, authentication tokens, etc.

Wallaroo Connections are only used to store the connection information used by other processes to create and use external connections.  The user still has to provide the libraries and other elements to actually make and use the connection.

The primary advantage is that Wallaroo connections allow scripts and other code to retrieve the connection details directly from their Wallaroo instance, then refer to those connection details.  Users don't need to know what those details actually are; they can refer to them in their code, which makes that code more flexible.
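As a rough illustration of this idea, the sketch below models a connection as a plain name/type/details record and shows code that reads the details by key rather than hardcoding them. The `ConnectionRecord` class, the `table_reference` helper, and the sample values are hypothetical stand-ins for illustration; they are not part of the Wallaroo SDK.

```python
from dataclasses import dataclass, field


@dataclass
class ConnectionRecord:
    """Hypothetical stand-in for a stored connection: name, type, and details."""
    name: str
    connection_type: str
    details: dict = field(default_factory=dict)


def table_reference(connection: ConnectionRecord, table_key: str) -> str:
    """Build a dataset.table reference from whatever the details currently hold."""
    details = connection.details
    return f"{details['dataset']}.{details[table_key]}"


# Sample values are made up for this sketch.
sample_connection = ConnectionRecord(
    name="sample-connection",
    connection_type="BIGQUERY",
    details={"dataset": "sample_dataset", "input_table": "sample_inputs"},
)

print(table_reference(sample_connection, "input_table"))
```

If the details are later updated to point at a different dataset, `table_reference` needs no changes, which is the flexibility described above.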

For this step, we will use a Google BigQuery dataset to retrieve the inference information, predict the next month of sales, then store those predictions into another table.  This will use the Wallaroo Connection feature to create a Connection, assign it to our workspace, then perform our inferences by using the Connection details to connect to the BigQuery dataset and tables.

## Prerequisites

* A Wallaroo instance version 2023.2.1 or greater.
* [Google Authentication Credentials](https://cloud.google.com/docs/authentication/external/set-up-adc).  This tutorial allows any authenticated Google account to view the data in the reference dataset and tables.

## References

* [Wallaroo SDK Essentials Guide: Model Uploads and Registrations: Python Models](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-model-uploads/wallaroo-sdk-model-upload-python/)
* [Wallaroo SDK Essentials Guide: Pipeline Management](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-pipelines/wallaroo-sdk-essentials-pipeline/)
* [Wallaroo SDK Essentials Guide: Data Connections Management](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-dataconnections/)

## Statsmodel Forecast Connection Steps

### Import Libraries

The first step is to import the libraries that we will need.

In [107]:
import json
import os
import datetime

import wallaroo
from wallaroo.object import EntityNotFoundError
from wallaroo.framework import Framework

# used to display dataframe information without truncating
from IPython.display import display
import pandas as pd
import numpy as np

from resources import simdb
from resources import util

pd.set_option('display.max_colwidth', None)

# for Big Query connections
from google.cloud import bigquery
from google.oauth2 import service_account
import db_dtypes

import time

In [108]:
display(wallaroo.__version__)

'2023.2.1'

### Connect to the Wallaroo Instance

The next step is to connect to Wallaroo through the Wallaroo client.  The Python library is included in the Wallaroo install and available through the JupyterHub interface provided with your Wallaroo environment.

This is accomplished using the `wallaroo.Client()` command, which provides a URL to grant the SDK permission to your specific Wallaroo environment.  When displayed, enter the URL into a browser and confirm permissions.  Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use `wl = wallaroo.Client()`.  For more information on Wallaroo Client settings, see the [Client Connection guide](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-client/).

In [109]:
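# Login through local Wallaroo instance (for example, from the internal JupyterHub service)

wl = wallaroo.Client()

# Alternative: login with explicit endpoints through SSO.
# Replace the prefix and suffix with the values for your own Wallaroo instance.
wallarooPrefix = "doc-test."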
wallarooSuffix = "wallaroocommunity.ninja"

wl = wallaroo.Client(api_endpoint=f"https://{wallarooPrefix}api.{wallarooSuffix}", 
                    auth_endpoint=f"https://{wallarooPrefix}keycloak.{wallarooSuffix}", 
                    auth_type="sso")

### Set Configurations

The following will set the workspace, model name, and pipeline that will be used for this example.  If the workspace or pipeline already exist, then they will be assigned for use in this example.  If they do not exist, they will be created based on the names listed below.

Workspace names must be unique.  To allow this tutorial to run in the same Wallaroo instance for multiple users, set the `suffix` variable or share the workspace with other users.
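A minimal sketch of the suffix approach follows. The `make_suffix` helper and the seeded generator are assumptions added for illustration; passing a seeded `random.Random` only makes the example repeatable and is not something the tutorial requires.

```python
import random
import string


def make_suffix(length: int = 4, rng=None) -> str:
    """Return a random lowercase suffix; pass a seeded rng for repeatable names."""
    rng = rng or random.Random()
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


# A fixed seed keeps this example repeatable; omit rng for a fresh suffix each run.
suffix_example = make_suffix(rng=random.Random(42))
workspace_name_example = f"forecast-model-workshop{suffix_example}"
print(workspace_name_example)
```

Each user who runs the tutorial with a different suffix gets a distinct workspace name in the same Wallaroo instance.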

#### Set Configurations References

* [Wallaroo SDK Essentials Guide: Workspace Management](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-workspace/)
* [Wallaroo SDK Essentials Guide: Pipeline Management](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-pipelines/wallaroo-sdk-essentials-pipeline/)

In [110]:
# used for unique connection names

import string
import random

suffix= ''.join(random.choice(string.ascii_lowercase) for i in range(4))

suffix='jch'

workspace_name = f'forecast-model-workshop{suffix}'

pipeline_name = 'forecast-workshop-pipeline'

### Set the Workspace and Pipeline

The workspace will be either used or created if it does not exist, along with the pipeline.  The models uploaded in the Upload and Deploy tutorial are referenced in this step.
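The "use it if it exists, otherwise create it" pattern can be sketched independently of Wallaroo. Everything below is a hypothetical stand-in: `NamedThing` only imitates an object that exposes a `name()` method, and `get_or_create` is an illustrative helper, not an SDK function.

```python
def get_or_create(existing, name, create):
    """Return the first item whose name() matches, otherwise create a new one."""
    for item in existing:
        if item.name() == name:
            return item
    return create(name)


class NamedThing:
    """Hypothetical object exposing name(), similar in shape to the objects used here."""

    def __init__(self, name):
        self._name = name

    def name(self):
        return self._name


registry = [NamedThing("alpha"), NamedThing("beta")]
found = get_or_create(registry, "beta", NamedThing)      # reuses the existing item
created = get_or_create(registry, "gamma", NamedThing)   # nothing matched, so one is built
print(found.name(), created.name())
```

One difference from the tutorial's `get_workspace` helper: this sketch returns on the first match, while the helper keeps looping and would keep the last match. With unique names the result is the same.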

In [111]:
def get_workspace(name):
    workspace = None
    for ws in wl.list_workspaces():
        if ws.name() == name:
            workspace= ws
    if workspace is None:
        workspace = wl.create_workspace(name)
    return workspace

def get_pipeline(name):
    try:
        pipeline = wl.pipelines_by_name(name)[0]
    except EntityNotFoundError:
        pipeline = wl.build_pipeline(name)
    return pipeline

workspace = get_workspace(workspace_name)

wl.set_current_workspace(workspace)

pipeline = get_pipeline(pipeline_name)

# Get the most recent version of a model in the workspace
# Assumes that the most recent version is the first in the list of versions.
# wl.get_current_workspace().models() returns a list of models in the current workspace

def get_model(mname):
    modellist = wl.get_current_workspace().models()
    model = [m.versions()[0] for m in modellist if m.name() == mname]
    if len(model) <= 0:
        raise KeyError(f"model {mname} not found in this workspace")
    return model[0]

# names of the three previously uploaded models:  the control and two challengers

control_model_name = 'forecast-control-model'
challenger01_model_name = 'forecast-challenger01-model'
challenger02_model_name = 'forecast-challenger02-model'

# retrieve the models

bike_day_model = get_model(control_model_name)

challenger_model_01 = get_model(challenger01_model_name)

challenger_model_02 = get_model(challenger02_model_name)


### Deploy the Pipeline

We will now add the uploaded model as a step for the pipeline, then deploy it.  The pipeline configuration will allow for multiple replicas of the pipeline to be deployed and spun up in the cluster.  Each pipeline replica will use 0.25 cpu and 512 Mi RAM.
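To make the resource figures concrete, the sketch below converts a memory string into bytes and totals the request across replicas. It assumes the memory string uses Kubernetes-style binary suffixes (`Ki`, `Mi`, `Gi`); the helper names are hypothetical and the arithmetic is only an estimate of what the deployment configuration asks for.

```python
BINARY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def memory_to_bytes(quantity: str) -> int:
    """Convert a quantity such as '512Mi' into bytes (binary suffixes only)."""
    for unit, factor in BINARY_UNITS.items():
        if quantity.endswith(unit):
            return int(quantity[: -len(unit)]) * factor
    return int(quantity)  # no suffix: treat the value as bytes


def total_resources(cpus: float, memory: str, replicas: int):
    """Total CPU and memory (in bytes) requested by identical replicas."""
    return cpus * replicas, memory_to_bytes(memory) * replicas


print(memory_to_bytes("512Mi"))            # 536870912
print(total_resources(0.25, "512Mi", 2))   # (0.5, 1073741824)
print(total_resources(0.25, "512Mi", 5))   # (1.25, 2684354560)
```

With an autoscale range of 2 to 5 replicas, the request therefore ranges from 0.5 CPU and 1 GiB up to 1.25 CPU and 2.5 GiB.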

In [112]:
# Set the deployment to allow for additional engines to run
# Undeploy and clear the pipeline in case it was used in other demonstrations
pipeline.undeploy()
pipeline.clear()
deploy_config = (wallaroo.DeploymentConfigBuilder()
                        .replica_count(1)
                        .replica_autoscale_min_max(minimum=2, maximum=5)
                        .cpus(0.25)
                        .memory("512Mi")
                        .build()
                    )

pipeline.add_model_step(bike_day_model)
# pipeline.add_model_step(step)
pipeline.deploy(deployment_config = deploy_config)

0,1
name,forecast-workshop-pipeline
created,2023-07-27 15:54:55.416132+00:00
last_updated,2023-07-27 19:10:16.024373+00:00
deployed,True
tags,
versions,"ebdc834c-86c4-4818-8b2b-9a308d40c6ee, 08ba54b2-2674-477a-a570-da148296c85d, 2886724f-c93f-4fd1-9592-3c520c3da31c, a6d854cd-273a-462f-b9e4-397cf106aa84, 869bcedb-85f3-4fef-a111-9962a5e0d784, 02fb29d5-d5b4-4be9-adea-b6a9fa09c54c, bf0206fb-139d-4c91-84ae-ca22a42b481e, 02c8f781-adae-466f-9773-c528b05bdbce, bdfd7e4d-95e9-4c1a-b532-97178a6c2bf7, 43f1a7e2-246a-40fc-98c4-78bff1f2d674, 351736d9-07bf-4956-9724-6e135be5fcd8, dc172ae6-00f1-4cdc-8b19-591eaeb72185, 02039466-5f92-40c0-a846-2944cb0cb995, 25d73dc4-00c0-4d09-b897-f8cb7df45ec0, e3a29c1e-4922-4bd5-8fe8-55b8eed2df31, ace2bfea-1c99-45bb-bd46-236600f78e11, 404e87ac-94e2-43f8-bd75-9c7ff76ad7d2, 41298f7f-d353-49f6-ad34-ef251ea321cf"
steps,forecast-challenger02-model


### Create the Connection

The details of the connection are stored in the file `./resources/bigquery_service_account_statsmodel.json`, which includes the credentials to connect to a public BigQuery dataset.  For this example, any credentials for a Google account are sufficient to access this sample dataset and tables.

Alongside the credentials, the file includes three other important fields:

* `dataset`: The BigQuery dataset from the project specified in the service account credentials file.
* `input_table`: The table used for inference inputs.
* `output_table`: The table used to store results.
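Because later cells read these fields by key, it can help to check that a details dictionary contains them before creating the connection. The sketch below is one possible check; `missing_detail_keys` is a hypothetical helper, the sample values are placeholders, and `project_id` is included because the BigQuery client cell in this tutorial reads it from the details.

```python
REQUIRED_KEYS = ("project_id", "dataset", "input_table", "output_table")


def missing_detail_keys(details: dict, required=REQUIRED_KEYS) -> list:
    """Return the required keys that are absent from a connection details dict."""
    return [key for key in required if key not in details]


# Placeholder values only; a real service account file holds more fields.
sample_details = {
    "project_id": "example-project",
    "dataset": "example_dataset",
    "input_table": "example_inputs",
}

print(missing_detail_keys(sample_details))  # ['output_table']
```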

The details on how to generate the table and data for the sample `bike_rentals` table are stored in the file `./resources/create_bike_rentals.table`, with the data used stored in `./resources/bike_rentals.csv`.

Wallaroo connections are created through the Wallaroo Client `create_connection(name, type, details)` method.  See the [Wallaroo SDK Essentials Guide: Data Connections Management guide](https://docs.wallaroo.ai/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-essentials-guide/wallaroo-sdk-essentials-dataconnections/) for full details.

Wallaroo connections are retrieved with the Wallaroo Client `get_connection(name)` method.

Note that connection names must be unique across the Wallaroo instance.  The sample code below assumes that the connection with the same name was either previously created with the proper credentials, or will be created through this step.

In [113]:
# set the connection information for other steps

forecast_connection_input_name = f'statsmodel-bike-rentals-{suffix}'
forecast_connection_input_type = "BIGQUERY"
# load the connection details, closing the file once it has been read
with open('./resources/bigquery_service_account_statsmodel.json.example') as f:
    forecast_connection_input_argument = json.load(f)

# if the connection with the same name exists, use it.  Otherwise, create it.
def get_connection(name, type, details):
    connection = None
    for cn in wl.list_connections():
        if cn.name() == name:
            connection= cn
    if connection is None:
        connection = wl.create_connection(name=name, 
                                         connection_type=type, 
                                         details=details)
    return connection


statsmodel_connection = get_connection(forecast_connection_input_name,
                                       forecast_connection_input_type,
                                       forecast_connection_input_argument)

### Add Connection to Workspace

We'll now add the connection to our workspace so it can be retrieved by other workspace users.  The Workspace method `add_connection(connection_name)` adds a Data Connection to a workspace.

In [114]:
workspace.add_connection(forecast_connection_input_name)
workspace.list_connections()

name,connection type,details,created at,linked workspaces
statsmodel-bike-rentals-jch,BIGQUERY,*****,2023-07-27T18:54:43.671521+00:00,['forecast-model-workshopjch']


### Retrieve Connection from Workspace

To simulate a data scientist's procedural flow, we'll now retrieve the connection from the workspace.

The Workspace method `list_connections()` displays a list of connections attached to the workspace.  By default the details field is obfuscated.  Specific connections are retrieved by specifying their position in the returned list.

In [115]:
forecast_connection = workspace.list_connections()[0]
display(forecast_connection)

Field,Value
Name,statsmodel-bike-rentals-jch
Connection Type,BIGQUERY
Details,*****
Created At,2023-07-27T18:54:43.671521+00:00
Linked Workspaces,['forecast-model-workshopjch']


### Run Inference from BigQuery Table

We'll now retrieve sample data through the Wallaroo connection, and perform a sample inference.  The connection details are retrieved through the Connection `details()` method.

The process is:

* Create the BigQuery credentials.
* Connect to the BigQuery dataset.
* Retrieve the inference data.
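The last step also reshapes the retrieved rows into a dictionary of column lists, which is the input format passed to the pipeline in this tutorial. The sketch below shows that reshaping in plain Python, without BigQuery or pandas. The `rows_to_columns` helper is hypothetical, and the two sample rows simply reuse values from this tutorial's sample output.

```python
import datetime


def rows_to_columns(rows, converters):
    """Turn a list of row dicts into {column: [values]}, converting each value."""
    return {
        column: [convert(row[column]) for row in rows]
        for column, convert in converters.items()
    }


# Rows shaped like a query result: the date arrives as a date object, not text.
sample_rows = [
    {"date": datetime.date(2011, 1, 23), "count": 986},
    {"date": datetime.date(2011, 1, 24), "count": 1416},
]

columns = rows_to_columns(sample_rows, {"date": str, "count": int})
print(columns)
```

Converting the date with `str` mirrors the conversion in the tutorial, where the table returns dates as date values rather than text.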

In [116]:
bigquery_statsmodel_credentials = service_account.Credentials.from_service_account_info(
    forecast_connection.details())

bigquery_statsmodel_client = bigquery.Client(
    credentials=bigquery_statsmodel_credentials, 
    project=forecast_connection.details()['project_id']
)

In [117]:
inference_inputs = bigquery_statsmodel_client.query(
        f"""
        select dteday as date, count FROM {forecast_connection.details()['dataset']}.{forecast_connection.details()['input_table']}
        where dteday > DATE_SUB(DATE('2011-02-22'), 
        INTERVAL 1 month) AND dteday <= DATE('2011-02-22') 
        ORDER BY dteday 
        LIMIT 5
        """
    ).to_dataframe().apply({"date":str, "count":int}).to_dict(orient='list')

# the original table sends back the date schema as a date, not text.  We'll convert it here.

# inference_inputs = inference_inputs.apply({"date":str, "cnt":int})

display(inference_inputs)


{'date': ['2011-01-23',
  '2011-01-24',
  '2011-01-25',
  '2011-01-26',
  '2011-01-27'],
 'count': [986, 1416, 1985, 506, 431]}

### Perform Inference from BigQuery Connection Data

With the data retrieved, we'll perform an inference with it and display the result.

In [118]:
results = pipeline.infer(inference_inputs)
results

[{'forecast': [1177, 1023, 1082, 1060, 1068, 1065, 1066]}]

### Four Weeks of Inference Data

Now we'll go back, starting at the "current date" of the next month in 2011, fetch the month of data preceding that date, then use that to predict what sales will be over the next 7 days.
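The date arithmetic behind this step can be sketched with the standard library. The helper names are hypothetical. The month subtraction clamps to the last day of a shorter month, which is an assumption about how the `DATE_SUB(..., INTERVAL 1 month)` expression used in the queries behaves.

```python
import calendar
import datetime


def one_month_before(day: datetime.date) -> datetime.date:
    """Subtract one calendar month, clamping to the last day of a shorter month."""
    year, month = (day.year, day.month - 1) if day.month > 1 else (day.year - 1, 12)
    last_day = calendar.monthrange(year, month)[1]
    return day.replace(year=year, month=month, day=min(day.day, last_day))


def history_window(forecast_day: datetime.date):
    """All days d with one_month_before(forecast_day) <= d < forecast_day."""
    start = one_month_before(forecast_day)
    return [start + datetime.timedelta(days=i) for i in range((forecast_day - start).days)]


def forecast_days(forecast_day: datetime.date, nforecast: int = 7):
    """The forecast day and the days after it, nforecast days in total."""
    return [forecast_day + datetime.timedelta(days=i) for i in range(nforecast)]


window = history_window(datetime.date(2011, 6, 1))
print(window[0], window[-1], len(window))
print(forecast_days(datetime.date(2011, 6, 1))[-1])
```

For a forecast day of 2011-06-01 the history window covers 2011-05-01 through 2011-05-31, and the forecast covers 2011-06-01 through 2011-06-07.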

The inference data is saved into the `inference_data` List - each element in the list will be a separate inference request.

In [119]:
# Start by getting the current month - we'll always assume we're in 2011 to match the data store

month = datetime.datetime.now().month
month=5
start_date = f"{month+1}-1-2011"
display(start_date)

'6-1-2011'

In [120]:
def get_forecast_days(firstdate):
    # weekly offsets in days; the first entry (-7) is the seed week before firstdate
    days = [i*7 for i in [-1,0,1,2,3,4]]
    deltadays = pd.to_timedelta(pd.Series(days), unit='D') 

    analysis_days = (pd.to_datetime(firstdate) + deltadays).dt.date
    analysis_days = [str(day) for day in analysis_days]
    # drop the seed day so only the five forecast start dates remain
    analysis_days.pop(0)

    return analysis_days

In [121]:
forecast_dates = get_forecast_days(start_date)
display(forecast_dates)

['2011-06-01', '2011-06-08', '2011-06-15', '2011-06-22', '2011-06-29']
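The five weekly start dates shown above can also be reproduced with only the standard library. This is a sketch of the same arithmetic, not a replacement for the pandas helper; `weekly_dates` is a hypothetical name.

```python
import datetime


def weekly_dates(first_day: datetime.date, weeks: int = 5):
    """Return ISO date strings for first_day and the following weekly steps."""
    return [str(first_day + datetime.timedelta(weeks=i)) for i in range(weeks)]


weekly = weekly_dates(datetime.date(2011, 6, 1))
print(weekly)
```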

In [122]:
# get our list of items to run through

inference_data = []
days = []

# get the days from the start date to the end date
def get_forecast_dates(forecast_day: str, nforecast=7):
    days = [i for i in range(nforecast)]
    deltadays = pd.to_timedelta(pd.Series(days), unit='D')
    
    last_day = pd.to_datetime(forecast_day)
    dates = last_day + deltadays
    datestr = dates.dt.date.astype(str)
    return datestr 

# used to generate our queries
def mk_dt_range_query(*, tablename: str, forecast_day: str) -> str:
    assert isinstance(tablename, str)
    assert isinstance(forecast_day, str)
    query = f"""
            select count from {tablename} where 
            dteday >= DATE_SUB(DATE('{forecast_day}'), INTERVAL 1 month) 
            AND dteday < DATE('{forecast_day}') 
            ORDER BY dteday
            """
    return query


for day in forecast_dates:
    print(f"Current date: {day}")
    day_range=get_forecast_dates(day)
    days.append({"date": day_range})
    query = mk_dt_range_query(tablename=f"{forecast_connection.details()['dataset']}.{forecast_connection.details()['input_table']}", forecast_day=day)
    print(query)
    data = bigquery_statsmodel_client.query(query).to_dataframe().apply({"count":int}).to_dict(orient='list')
    # add the date into the list
    inference_data.append(data)

Current date: 2011-06-01

            select count from wallaroo_workshop_public.bike_rentals where 
            dteday >= DATE_SUB(DATE('2011-06-01'), INTERVAL 1 month) 
            AND dteday < DATE('2011-06-01') 
            ORDER BY dteday
            
Current date: 2011-06-08

            select count from wallaroo_workshop_public.bike_rentals where 
            dteday >= DATE_SUB(DATE('2011-06-08'), INTERVAL 1 month) 
            AND dteday < DATE('2011-06-08') 
            ORDER BY dteday
            
Current date: 2011-06-15

            select count from wallaroo_workshop_public.bike_rentals where 
            dteday >= DATE_SUB(DATE('2011-06-15'), INTERVAL 1 month) 
            AND dteday < DATE('2011-06-15') 
            ORDER BY dteday
            
Current date: 2011-06-22

            select count from wallaroo_workshop_public.bike_rentals where 
            dteday >= DATE_SUB(DATE('2011-06-22'), INTERVAL 1 month) 
            AND dteday < DATE('2011-06-22') 
            O

In [123]:
display(inference_data)

[{'count': [3351,
   4401,
   4451,
   2633,
   4433,
   4608,
   4714,
   4333,
   4362,
   4803,
   4182,
   4864,
   4105,
   3409,
   4553,
   3958,
   4123,
   3855,
   4575,
   4917,
   5805,
   4660,
   4274,
   4492,
   4978,
   4677,
   4679,
   4758,
   4788,
   4098,
   3982]},
 {'count': [4333,
   4362,
   4803,
   4182,
   4864,
   4105,
   3409,
   4553,
   3958,
   4123,
   3855,
   4575,
   4917,
   5805,
   4660,
   4274,
   4492,
   4978,
   4677,
   4679,
   4758,
   4788,
   4098,
   3982,
   3974,
   4968,
   5312,
   5342,
   4906,
   4548,
   4833]},
 {'count': [4553,
   3958,
   4123,
   3855,
   4575,
   4917,
   5805,
   4660,
   4274,
   4492,
   4978,
   4677,
   4679,
   4758,
   4788,
   4098,
   3982,
   3974,
   4968,
   5312,
   5342,
   4906,
   4548,
   4833,
   4401,
   3915,
   4586,
   4966,
   4460,
   5020,
   4891]},
 {'count': [4660,
   4274,
   4492,
   4978,
   4677,
   4679,
   4758,
   4788,
   4098,
   3982,
   3974,
   4968,
   5312,
   5

### Parallel Inference Example

For this example, we will use the `parallel_infer` method.  This allows us to submit the entire list of inference inputs in one request.  This is an asynchronous method that will manage submitting the separate inference requests to the pipeline, then gather the results into a List of inference results.  These results will be parsed to display the entire list of results for the entire month.
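The general idea of submitting many requests concurrently and collecting the results in order can be sketched with `asyncio`. This is not how `parallel_infer` is implemented; `fake_infer` is a made-up stand-in for a single inference request, and `gather_in_parallel` is a hypothetical helper that caps concurrency with a semaphore.

```python
import asyncio


async def fake_infer(payload):
    """Stand-in for one inference request: returns the rounded mean of the counts."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    counts = payload["count"]
    return [{"forecast": [round(sum(counts) / len(counts))]}]


async def gather_in_parallel(payloads, num_parallel=16):
    """Run fake_infer on every payload, at most num_parallel at a time."""
    semaphore = asyncio.Semaphore(num_parallel)

    async def bounded(payload):
        async with semaphore:
            return await fake_infer(payload)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(p) for p in payloads))


demo_payloads = [{"count": [1, 2, 3]}, {"count": [10, 20, 30]}]
demo_results = asyncio.run(gather_in_parallel(demo_payloads, num_parallel=2))
print(demo_results)
```

Inside a Jupyter notebook an event loop is already running, so the coroutine would be awaited directly (as the tutorial does with `await pipeline.parallel_infer(...)`) instead of being passed to `asyncio.run`.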

In [124]:
parallel_results = await pipeline.parallel_infer(tensor_list=inference_data, 
                                                 timeout=20, 
                                                 num_parallel=16, 
                                                 retries=2)

In [125]:
display(parallel_results)

[[{'forecast': [4373, 4385, 4379, 4382, 4380, 4381, 4380]}],
 [{'forecast': [4666, 4582, 4560, 4555, 4553, 4553, 4552]}],
 [{'forecast': [4683, 4634, 4625, 4623, 4622, 4622, 4622]}],
 [{'forecast': [4732, 4637, 4648, 4646, 4647, 4647, 4647]}],
 [{'forecast': [4692, 4698, 4699, 4699, 4699, 4699, 4699]}]]

In [126]:
days_results = list(zip(days, parallel_results))

In [127]:
# merge our parallel results into the predicted date sales

forecast_frames = []

for day_range, result in days_results:
    new_days = day_range['date'].tolist()
    new_forecast = result[0]['forecast']
    forecast_frames.append(pd.DataFrame(list(zip(new_days, new_forecast)), columns=['date','forecast']))

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same combined table
results_table = pd.concat(forecast_frames)

Based on all of the predictions, here are the results for the next month.

In [128]:
results_table

index,date,forecast
0,2011-06-01,4373
1,2011-06-02,4385
2,2011-06-03,4379
3,2011-06-04,4382
4,2011-06-05,4380
5,2011-06-06,4381
6,2011-06-07,4380
0,2011-06-08,4666
1,2011-06-09,4582
2,2011-06-10,4560
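The pairing of each forecast with its date can be sketched without pandas. The `flatten_forecasts` helper is hypothetical. The sample inputs reuse the first two dates and forecast values of the first two weeks from the outputs shown in this tutorial.

```python
def flatten_forecasts(day_ranges, results):
    """Pair each date with its forecast value across all inference results."""
    rows = []
    for day_range, result in zip(day_ranges, results):
        forecast = result[0]["forecast"]
        rows.extend(zip(day_range["date"], forecast))
    return rows


sample_days = [
    {"date": ["2011-06-01", "2011-06-02"]},
    {"date": ["2011-06-08", "2011-06-09"]},
]
sample_results = [
    [{"forecast": [4373, 4385]}],
    [{"forecast": [4666, 4582]}],
]

rows = flatten_forecasts(sample_days, sample_results)
print(rows)
```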


### Undeploy the Pipeline

Undeploy the pipeline and return the resources back to the Wallaroo instance.

In [129]:
pipeline.undeploy()

0,1
name,forecast-workshop-pipeline
created,2023-07-27 15:54:55.416132+00:00
last_updated,2023-07-27 19:10:16.024373+00:00
deployed,False
tags,
versions,"ebdc834c-86c4-4818-8b2b-9a308d40c6ee, 08ba54b2-2674-477a-a570-da148296c85d, 2886724f-c93f-4fd1-9592-3c520c3da31c, a6d854cd-273a-462f-b9e4-397cf106aa84, 869bcedb-85f3-4fef-a111-9962a5e0d784, 02fb29d5-d5b4-4be9-adea-b6a9fa09c54c, bf0206fb-139d-4c91-84ae-ca22a42b481e, 02c8f781-adae-466f-9773-c528b05bdbce, bdfd7e4d-95e9-4c1a-b532-97178a6c2bf7, 43f1a7e2-246a-40fc-98c4-78bff1f2d674, 351736d9-07bf-4956-9724-6e135be5fcd8, dc172ae6-00f1-4cdc-8b19-591eaeb72185, 02039466-5f92-40c0-a846-2944cb0cb995, 25d73dc4-00c0-4d09-b897-f8cb7df45ec0, e3a29c1e-4922-4bd5-8fe8-55b8eed2df31, ace2bfea-1c99-45bb-bd46-236600f78e11, 404e87ac-94e2-43f8-bd75-9c7ff76ad7d2, 41298f7f-d353-49f6-ad34-ef251ea321cf"
steps,forecast-challenger02-model
