# Load our dataset

In [1]:
import os
from pathlib import Path

import pandas as pd

data_dir = os.path.expanduser("~/data")  # resolve "~" to the user's home directory

input_df = pd.read_csv(os.path.join(data_dir, "beer_dataset.csv"), parse_dates=True)

input_df.head()


Unnamed: 0,DATE,grain,BeerProduction
0,1992-01-01,grain,3459
1,1992-02-01,grain,3458
2,1992-03-01,grain,4002
3,1992-04-01,grain,4564
4,1992-05-01,grain,4221


# Create a Zendikon Reusable Pipeline

A Zendikon pipeline consists of individual steps that each process some input datasets to produce output datasets.
These steps can be either reusable ones provided by the Zendikon library or fully custom ones specified by the user.

This minimal example mixes simple custom pre-processing steps with the provided step that leverages AML's AutoML for modeling.

## Declare the custom pipeline steps

Each custom step's logic is specified in its own Python script. We extend `BasePipelineStep` with information about each step's script location and use these classes for declaring the pipeline later on.

Zendikon provides function decorators that simplify the definition of custom steps. The scripts in `./step_scripts/*.py` provide more concrete examples of how these decorators are used.

In [2]:
from zendikon.pipelines.step.base_step import BasePipelineStep, StepConfig
from zendikon.pipelines.pipeline import PipelineStepInfo


class DataSplittingStep(BasePipelineStep):
    def __init__(self, step_config: StepConfig) -> None:
        source_directory = Path("./step_scripts")
        script_name = "split_data.py"
        super().__init__(step_config, source_directory, script_name=script_name)

class PreprocessStep(BasePipelineStep):
    def __init__(self, step_config: StepConfig) -> None:
        source_directory = Path("./step_scripts")
        script_name = "preprocess.py"
        super().__init__(step_config, source_directory, script_name=script_name)


## Check our custom steps

In [3]:
import sys
source_dir = "./step_scripts"
if source_dir not in sys.path:
    sys.path.append(source_dir)

from argparse import Namespace
from preprocess import preprocess_data
from split_data import split_data


proc_data = preprocess_data(input_df, cli_args=Namespace(time_column_name="DATE", target_column_name="BeerProduction"))

train_data, valid_data = split_data(proc_data, cli_args=Namespace(time_column_name="DATE", split_date="2012-01-01"))

train_data.head()


Unnamed: 0,DATE,BeerProduction
0,1992-01-01,3459
1,1992-02-01,3458
2,1992-03-01,4002
3,1992-04-01,4564
4,1992-05-01,4221
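The contents of `preprocess.py` and `split_data.py` are not shown in this notebook. The sketch below is only a guess at the kind of logic they contain, written as plain pandas functions without the Zendikon decorators. It assumes that pre-processing keeps just the time and target columns, and that rows dated on or after `split_date` go to the validation set. The sample values are made up for illustration.

```python
from argparse import Namespace

import pandas as pd


def preprocess_data(df: pd.DataFrame, cli_args: Namespace) -> pd.DataFrame:
    """Keep only the time and target columns (hypothetical logic)."""
    columns = [cli_args.time_column_name, cli_args.target_column_name]
    return df[columns].copy()


def split_data(df: pd.DataFrame, cli_args: Namespace):
    """Split rows into those before split_date and those on or after it (assumed rule)."""
    timestamps = pd.to_datetime(df[cli_args.time_column_name])
    is_train = timestamps < pd.Timestamp(cli_args.split_date)
    return df[is_train], df[~is_train]


# Made-up sample rows with the same columns as the beer dataset
sample_df = pd.DataFrame({
    "DATE": ["2011-11-01", "2011-12-01", "2012-01-01", "2012-02-01"],
    "grain": ["grain"] * 4,
    "BeerProduction": [100, 110, 120, 130],
})

proc = preprocess_data(
    sample_df,
    cli_args=Namespace(time_column_name="DATE", target_column_name="BeerProduction"),
)
train, valid = split_data(
    proc,
    cli_args=Namespace(time_column_name="DATE", split_date="2012-01-01"),
)
```

Calling the step functions locally like this, before wiring them into a pipeline, makes it cheap to catch column-name or date-parsing mistakes.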


## Putting all steps together into a pipeline

Here, we declare the flow of our pipeline by declaring steps along with their associated information:
- Input datasets: Can either be the name of 1) a registered dataset in the AML workspace, or 2) an output from a prior step.
- Output datasets: Names of the output datasets, corresponding to the step function's outputs.
- CLI arguments: Argument values to pass to `StepArgument`s of the step function.
- Execution environment: Dependencies required for the step to run.

In this case, besides our custom steps above, we can also simply utilize Zendikon-provided ones such as `ZendikonAutoMLStep` and `ZendikonAutoMLInferenceStep` to leverage AutoML.
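The wiring rule for input datasets (each input is either a registered dataset or an output of an earlier step) can be checked before anything is submitted. The sketch below is not part of Zendikon: it mirrors the step names and dataset names used in this example as plain dictionaries, and `find_unresolved_inputs` is a hypothetical helper.

```python
from typing import Dict, Iterable, List

# Plain-dict mirror of the pipeline declared in this example
steps = [
    {"name": "Select relevant columns",
     "inputs": ["beer_input_dataset"],
     "outputs": ["beer_processed_dataset"]},
    {"name": "Splitting data to train and validation",
     "inputs": ["beer_processed_dataset"],
     "outputs": ["beer_train_dataset", "beer_valid_dataset"]},
    {"name": "AutoML training",
     "inputs": ["beer_train_dataset"],
     "outputs": ["models_info", "best_model"]},
    {"name": "AutoML inferencing with best model",
     "inputs": ["beer_valid_dataset", "best_model"],
     "outputs": ["best_model_predicted"]},
]


def find_unresolved_inputs(
    steps: List[dict], registered_datasets: Iterable[str]
) -> Dict[str, List[str]]:
    """Return, per step, the inputs that are neither registered nor produced earlier."""
    available = set(registered_datasets)
    unresolved: Dict[str, List[str]] = {}
    for step in steps:
        missing = [name for name in step["inputs"] if name not in available]
        if missing:
            unresolved[step["name"]] = missing
        # Outputs of this step become valid inputs for all later steps
        available.update(step["outputs"])
    return unresolved


problems = find_unresolved_inputs(steps, registered_datasets={"beer_input_dataset"})
```

An empty result means every input can be traced back to a registered dataset or to a prior step's output.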

In [4]:
from zendikon.pipelines.reusable_steps.automl_step import ZendikonAutoMLStep
from zendikon.pipelines.reusable_steps.automl_inference_step import ZendikonAutoMLInferenceStep

steps_info = [
    PipelineStepInfo(PreprocessStep, 
        StepConfig("Select relevant columns", inputs=["beer_input_dataset"], outputs=["beer_processed_dataset"], conda_dependencies_file="./conda_dependencies.yml")),
    PipelineStepInfo(DataSplittingStep, 
        StepConfig("Splitting data to train and validation", inputs=["beer_processed_dataset"], outputs=["beer_train_dataset", "beer_valid_dataset"], conda_dependencies_file="./conda_dependencies.yml")),
    PipelineStepInfo(ZendikonAutoMLStep, 
        StepConfig("AutoML training", inputs=["beer_train_dataset"], outputs=["models_info", "best_model"], conda_dependencies_file="./conda_dependencies.yml")),
    PipelineStepInfo(ZendikonAutoMLInferenceStep, 
        StepConfig("AutoML inferencing with best model", inputs=["beer_valid_dataset", "best_model"], outputs=["best_model_predicted"], arguments={"target_column": "BeerProduction"}, conda_dependencies_file="./conda_dependencies.yml")),
]

steps_info

[<zendikon.pipelines._pipeline_step_info.PipelineStepInfo at 0x7f07746c36a0>,
 <zendikon.pipelines._pipeline_step_info.PipelineStepInfo at 0x7f0773bcdd00>,
 <zendikon.pipelines._pipeline_step_info.PipelineStepInfo at 0x7f0773ecad60>,
 <zendikon.pipelines._pipeline_step_info.PipelineStepInfo at 0x7f0794586280>]

# Running our pipeline on AML

From this point on, we will require an AML workspace to work with. The details can be specified in `config.json`.

# Experiment Setup

We begin by using the AML SDK to establish the AML workspace, experiment and compute target we will be utilizing. 

In [1]:
from azureml.core import Workspace, Experiment, ComputeTarget, Dataset, Environment

# AML Setup
workspace = Workspace.from_config()
print('Workspace name: ' + workspace.name,
      'Azure region: ' + workspace.location,
      'Subscription id: ' + workspace.subscription_id,
      'Resource group: ' + workspace.resource_group, sep='\n')

experiment_name = "reusable_pipeline_time_series_forecasting"
experiment = Experiment(workspace=workspace, name=experiment_name)
compute_target = ComputeTarget(workspace=workspace, name="zendikon-cpu-f4")

Workspace name: zendikon-test
Azure region: eastus2
Subscription id: 6f83e421-6b03-4154-9df4-fc8739806b66
Resource group: zendikon


# Zendikon Package
Ensure that the `zendikon` package is available in: 

1. The notebook environment we are currently in. If using JupyterLab, the `zendikon-env` kernel should already have this set up.
2. The pipeline's environment once it is submitted to AML. To do so, execute the following and update the dependency link in `conda_dependencies.yml`.

In [4]:
# Use this wheel URL in conda_dependencies.yml
whl_url = Environment.add_private_pip_wheel(workspace, "./zendikon-1.8.8.post12-py3-none-any.whl", exist_ok=True)
whl_url

'https://zendikontest6824921913.blob.core.windows.net/azureml/Environment/azureml-private-packages/zendikon-1.8.8.post12-py3-none-any.whl'

# Preparing dataset for pipeline
Registered datasets in AML are used as input datasets to Zendikon pipelines. We can achieve this in several ways:

1. Use [AML Studio (UI)](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-connect-data-ui?tabs=credential) to use an existing external datastore and register datasets from it.
2. The same can be achieved with the [AML SDK](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets).

For simplicity, however, this example uploads and registers the small dataset loaded above directly with the SDK.
Run this step only if you choose to register the dataset via code instead of the UI.

In [2]:
datastore = workspace.get_default_datastore()

# Register and upload the entire beer dataset to the workspace
input_dataset = Dataset.Tabular.register_pandas_dataframe(input_df, datastore, "beer_input_dataset", show_progress=True)

Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to managed-dataset/14208358-220a-4952-9845-11857f1edb2f/
Successfully uploaded file to datastore.
Creating and registering a new dataset.
Successfully created and registered a new dataset.


## Prepare the forecasting parameters to be used in the pipeline

In [3]:
from azureml.automl.core.forecasting_parameters import ForecastingParameters

target_column_name = "BeerProduction"
time_column_name = "DATE"
forecast_horizon = 12
freq = "MS"

# Forecasting Parameters
forecasting_parameters = ForecastingParameters(
    time_column_name=time_column_name,
    forecast_horizon=forecast_horizon,
    freq=freq,  # Set the forecast frequency to be monthly (start of the month)
)
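To make the `freq` and `forecast_horizon` settings concrete, the sketch below uses pandas to list the timestamps that a 12-step horizon at month-start (`"MS"`) frequency would cover. The last training date of 2011-12-01 is an assumption, based on splitting the data at 2012-01-01 as in the local check earlier.

```python
import pandas as pd

freq = "MS"            # month start, as in the forecasting parameters
forecast_horizon = 12  # number of periods to forecast

# Assumed last timestamp in the training data
last_train_date = pd.Timestamp("2011-12-01")

# Build horizon + 1 month-start dates beginning at the last training date,
# then drop the first one so only future dates remain
forecast_dates = pd.date_range(
    start=last_train_date, periods=forecast_horizon + 1, freq=freq
)[1:]

print(forecast_dates[0], forecast_dates[-1], len(forecast_dates))
```

Under these assumptions the horizon spans the twelve month-start dates of 2012.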

## Create the pipeline instance

By default, the primary metric is `NormRMSE` (normalized root mean squared error). To change it, specify the `primary_metric` parameter when calling `TimeSeriesForecastingPipeline.from_default_settings`.
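For intuition about the metric, here is a minimal sketch of a normalized RMSE. It assumes the normalization divides the RMSE by the range of the observed target values, which is how the Azure AutoML documentation describes its normalized metrics; the exact computation inside AutoML may differ in detail. The function name `normalized_rmse` is ours, not a library API.

```python
import math
from typing import Sequence


def normalized_rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """RMSE divided by the range (max - min) of the true values (assumed normalization)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    value_range = max(y_true) - min(y_true)
    if value_range == 0:
        raise ValueError("the range of y_true is zero, so the metric is undefined")
    return math.sqrt(mse) / value_range


score = normalized_rmse([0.0, 10.0], [1.0, 9.0])
```

Because the error is scaled by the spread of the target, scores are comparable across series with different magnitudes; lower is better.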


In [8]:
from zendikon.pipelines.time_series.forecasting import TimeSeriesForecastingPipeline

pipeline = TimeSeriesForecastingPipeline.from_default_automl_config(
    input_datasets=[],
    forecasting_parameters=forecasting_parameters,
    label_column_name=target_column_name,
    compute_targets=[compute_target],
    steps_info=steps_info)


If this is not your intention, we recommend interrupting the command using Ctrl + c and check your pipeline config.



# Submit Pipeline

The pipeline will now execute remotely on our specified compute target, and we can track the progress in AML Studio with the generated link below:

In [9]:
pipeline.submit(experiment, wait_for_completion=False)

Created step Select relevant columns [79a5e91c][0f8aa93b-3cd7-4b7f-996f-e7a170ea6bc5], (This step is eligible to reuse a previous run's output)
Created step Splitting data to train and validation [30bdccc9][71c7a0e8-2c3b-499f-87f6-fc4fefa5a086], (This step is eligible to reuse a previous run's output)
Created step AutoML training [05efd96a][72ec4c8d-7669-4cfe-a190-d39f713bd3ed], (This step will run and generate new outputs)
Created step AutoML inferencing with best model [bab490de][1ec7a7fb-d154-486d-9dee-a2dd2d566df3], (This step will run and generate new outputs)
Submitted PipelineRun 6a5fe01c-6c92-44cd-953c-23e5b166775a
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/6a5fe01c-6c92-44cd-953c-23e5b166775a?wsid=/subscriptions/6f83e421-6b03-4154-9df4-fc8739806b66/resourcegroups/zendikon/workspaces/zendikon-test&tid=72f988bf-86f1-41af-91ab-2d7cd011db47


Experiment,Id,Type,Status,Details Page,Docs Page
reusable_pipeline_time_series_forecasting,6a5fe01c-6c92-44cd-953c-23e5b166775a,azureml.PipelineRun,Preparing,Link to Azure Machine Learning studio,Link to Documentation


# Let's get results