# Forecasting using SSA with Luna Dataset

### Forecasting the next-hour load using SSA on Luna dataset

This example shows how to train a time-series forecasting model on the Luna dataset. In this notebook you will learn:
- How to run hyper-parameter optimization and search for the best model for your data using SSA and the built-in `AutoMLExperiment` class.

## Install NuGet packages for training ML.NET models and plotting:

In [1]:
// using nightly-build
#i "nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/MachineLearning/nuget/v3/index.json"
#r "nuget: Plotly.NET.Interactive, 3.0.2"
#r "nuget: Plotly.NET.CSharp, 0.0.1"
#r "nuget: Microsoft.ML.AutoML, 0.20.0-preview.22356.1"
#r "nuget: Microsoft.Data.Analysis, 0.20.0-preview.22356.1"

Loading extensions from `Microsoft.ML.AutoML.Interactive.dll`

Loading extensions from `Microsoft.Data.Analysis.Interactive.dll`

Loading extensions from `Plotly.NET.Interactive.dll`

## Import packages

In [1]:
using static Microsoft.DotNet.Interactive.Formatting.PocketViewTags;
using Microsoft.Data.Analysis;
using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;
using Plotly.NET;
using Microsoft.ML.Transforms.TimeSeries;
using Microsoft.ML.SearchSpace;
using System.Diagnostics;

### Import Dataset

Luna is a time-series dataset that records the hourly-active usage of an internally used service on Azure. It has two columns, `DateTime` and `load`, where `load` holds the hourly-active usage metric. Luna shows a strong weekly seasonal pattern, as expected for an Azure service, along with a slow upward trend over time. These features make it possible to build a forecasting model that predicts the next-hour load, which can then be used to size computing power accordingly.

The code block below loads the dataset into a `DataFrame`.

In [1]:
var dataPath = @"./data/Luna.csv";
var df = DataFrame.LoadCsv(dataPath);
var loads = df["load"].Cast<float?>();

### Plotting last three weeks of Luna

In [1]:
var lastThreeWeek = df["load"].Cast<float>().TakeLast(7 * 24 * 3);
var x = Enumerable.Range(0, lastThreeWeek.Count());
var line = Chart2D.Chart.Line<int, float, string>(x, lastThreeWeek, Name: "load");
line.Display();

## Create `ForecastInput` and `ForecastOutput` classes

In [1]:
public class ForecastInput
{
    [ColumnName("load")]
    public float Load { get; set; }
}

public class ForecastOutnput
{
    [ColumnName("predict")]
    public float[] Predict { get; set; }
}

## Set up search space for SSA
SSA (Singular Spectrum Analysis) is an algorithm for univariate time-series forecasting and can be consumed via [ForecastBySsa](https://docs.microsoft.com/dotnet/api/microsoft.ml.timeseriescatalog.forecastbyssa?view=ml-dotnet) in ML.NET.

The following code shows how to create a search space over selected SSA parameters. This is necessary if you want to set up customized hyper-parameter optimization using `AutoMLExperiment`.

In SSA, the parameters with the most significant impact on the training result are `windowSize`, `seriesLength` and `rank`, so we set up a sweeping range for each of them by applying the `Range` attribute to the corresponding properties.

In [1]:
public class ForecastBySsaSearchSpace
{
    [Range(2, 24 * 7 * 30)]
    public int WindowSize { get; set; } = 2;

    [Range(2, 24 * 7 * 30)]
    public int SeriesLength { get; set; } = 2;

    [Range(1, 24 * 7 * 30)]
    public int Rank { get; set; } = 1;
}
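
To make the roles of `WindowSize`, `SeriesLength` and `Rank` more concrete, the sketch below builds the trajectory (Hankel) matrix that the embedding step of textbook SSA operates on. It is a minimal illustration, not ML.NET's internal implementation: `BuildTrajectoryMatrix` is a hypothetical helper and the toy series is made up. With a window of length L over N points the matrix is L x (N - L + 1), so the window has to be shorter than the series, and the number of components kept (the rank) cannot exceed the matrix's smaller dimension.

```csharp
using System;

// Build the L x K trajectory (Hankel) matrix used by the embedding step of SSA,
// where L is the window size and K = N - L + 1 for a series of length N.
// Hypothetical helper for illustration only.
float[,] BuildTrajectoryMatrix(float[] series, int windowSize)
{
    if (windowSize < 2 || windowSize >= series.Length)
        throw new ArgumentException("windowSize must be in [2, series.Length).");

    int k = series.Length - windowSize + 1;
    var matrix = new float[windowSize, k];
    for (int row = 0; row < windowSize; row++)
        for (int col = 0; col < k; col++)
            matrix[row, col] = series[row + col]; // column `col` is the window starting at `col`
    return matrix;
}

var toySeries = new float[] { 1, 2, 3, 4, 5, 6 };
var trajectory = BuildTrajectoryMatrix(toySeries, windowSize: 3);
// For N = 6 and L = 3 the matrix has 3 rows and 6 - 3 + 1 = 4 columns.
Console.WriteLine($"{trajectory.GetLength(0)} x {trajectory.GetLength(1)}");
```

Because the search space above samples the three parameters independently, some sampled combinations cannot form a valid decomposition.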

## Create a custom `TrialRunner` for `AutoMLExperiment`.
A `TrialRunner`, which implements `ITrialRunner`, takes in parameters and a pipeline, trains the model, evaluates it, and returns the metric. `AutoMLExperiment` has built-in trial runners for binary classification, multiclass classification and regression, but not for time-series forecasting, so we need to provide our own trial runner.

In the code below, we create `SSARunner`, which implements `ITrialRunner`. The core function is `Run`. It first trains an SSA model, then calculates a rolling-update RMSE: it creates a time-series prediction engine from the trained model, predicts the next-hour Luna load, compares it with the actual value, updates the model with the actual value, and repeats for every row in the evaluation set.

In [1]:
public class SSARunner : ITrialRunner
{
    private MLContext _context;
    private IDataView _trainDataset;
    private IDataView _evaluateDataset;

    public SSARunner(MLContext context, IDataView trainDataset, IDataView evaluateDataset)
    {
        this._context = context;
        this._trainDataset = trainDataset;
        this._evaluateDataset = evaluateDataset;
    }

    public TrialResult Run(TrialSettings settings, IServiceProvider provider)
    {
        try
        {
            var trainDataset = this._trainDataset;
            var testDataset = this._evaluateDataset;

            var stopWatch = new Stopwatch();
            stopWatch.Start();
            var pipeline = settings.Pipeline.BuildTrainingPipeline(this._context, settings.Parameter);
            var model = pipeline.Fit(trainDataset);

            var predictEngine = model.CreateTimeSeriesEngine<ForecastInput, ForecastOutnput>(this._context);

            // check point
            predictEngine.CheckPoint(this._context, "origin");

            var predictedLoad1H = new List<float>();
            var N = testDataset.GetRowCount();

            // rolling update evaluate
            foreach (var load in testDataset.GetColumn<Single>("load"))
            {
                // firstly, get next n predict where n is horizon, in this case, it's always 1.
                var predict = predictEngine.Predict();

                predictedLoad1H.Add(predict.Predict[0]);

                // update model with truth value
                predictEngine.Predict(new ForecastInput()
                {
                    Load = load,
                });
            }

            var rmse = Enumerable.Zip(testDataset.GetColumn<float>("load"), predictedLoad1H)
                                   .Select(x => Math.Pow(x.First - x.Second, 2))
                                   .Average();
            rmse = Math.Sqrt(rmse);

            return new TrialResult()
            {
                Metric = rmse,
                Model = model,
                TrialSettings = settings,
                DurationInMilliseconds = stopWatch.ElapsedMilliseconds,
            };

        }
        catch (Exception)
        {
            return new TrialResult()
            {
                Metric = double.MaxValue,
                Model = null,
                TrialSettings = settings,
                DurationInMilliseconds = 0,
            };
        }
    }
}
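
The rolling-update evaluation in `Run` can also be sketched without ML.NET. In the example below, `RollingRmse` is a hypothetical helper and `forecastNext` is a stand-in for `predictEngine.Predict()`: a naive "repeat the last value" forecaster is assumed here instead of SSA. The toy numbers are made up; only the forecast-then-reveal order mirrors the runner.

```csharp
using System;
using System.Collections.Generic;

// Rolling one-step-ahead RMSE. `forecastNext` is a hypothetical stand-in for
// predictEngine.Predict(): it only sees the history observed so far.
double RollingRmse(IReadOnlyList<float> train, IReadOnlyList<float> test,
                   Func<IReadOnlyList<float>, float> forecastNext)
{
    var history = new List<float>(train);
    double squaredErrorSum = 0;
    foreach (var actual in test)
    {
        // 1. Forecast the next point before seeing its actual value.
        var predicted = forecastNext(history);
        squaredErrorSum += Math.Pow(actual - predicted, 2);

        // 2. Reveal the actual value so the next forecast can use it.
        history.Add(actual);
    }
    return Math.Sqrt(squaredErrorSum / test.Count);
}

// Naive baseline: the next value is assumed to equal the last observed one.
// Errors are 3 - 2 = 1 and 5 - 3 = 2, so the result is sqrt((1 + 4) / 2).
var baselineRmse = RollingRmse(
    train: new float[] { 1f, 2f },
    test: new float[] { 3f, 5f },
    forecastNext: history => history[history.Count - 1]);
Console.WriteLine(baselineRmse);
```

A baseline like this can serve as a sanity check: a tuned SSA model would be expected to reach a lower rolling RMSE than the naive forecaster on the same split.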

## Split train-test dataset.
The following code shows how to split the dataset into training and test sets. In classification or regression, we can randomly sample a subset of the dataset as the test set. In forecasting, however, we must avoid leakage by making sure no future data is used to train the model. So we take the first _N_ rows as the training set and keep the rest as the test set.

In [1]:
var rowCount = df.Rows.Count();
var evaluateCount = 24 * 7; // one week of hourly rows
var trainDf = df.Head(rowCount - evaluateCount);
var evaluateDf = df.Tail(evaluateCount);
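
The same chronological split can be sketched on a plain array. `SplitChronologically` is a hypothetical helper and the ten-point series is made up; the point is only that the hold-out set is taken from the end of the series and nothing is shuffled.

```csharp
using System;
using System.Linq;

// Chronological split: the last `holdout` points form the test set and
// everything before them forms the training set. Hypothetical helper.
(float[] Train, float[] Test) SplitChronologically(float[] series, int holdout)
{
    if (holdout <= 0 || holdout >= series.Length)
        throw new ArgumentOutOfRangeException(nameof(holdout));

    int trainCount = series.Length - holdout;
    return (series.Take(trainCount).ToArray(), series.Skip(trainCount).ToArray());
}

var hourly = Enumerable.Range(0, 10).Select(i => (float)i).ToArray();
var (trainPart, testPart) = SplitChronologically(hourly, holdout: 3);

// Every training point precedes every test point in time.
Console.WriteLine($"train: {trainPart.Length}, test: {testPart.Length}");
```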

### Construct training pipeline
The following code shows how to construct a sweepable training pipeline. It first copies column `load` to `newLoad`; this step has no purpose other than turning a single estimator into a pipeline. It then appends a sweepable estimator, which is created from a search space and a lambda function that takes a `Parameter` and returns a trainable `IEstimator<ITransformer>`. During hyper-parameter optimization, a `Parameter` is sampled from the search space and passed to that lambda function, which returns the trainable `IEstimator<ITransformer>` for that trial.

In [1]:
var mlContext = new MLContext();
var searchSpace = new SearchSpace<ForecastBySsaSearchSpace>();
var pipeline = mlContext.Transforms.CopyColumns("newLoad", "load")
                .Append(mlContext.Auto().CreateSweepableEstimator((context, ss) =>
                {
                    return mlContext.Forecasting.ForecastBySsa("predict", "load", ss.WindowSize, ss.SeriesLength, Convert.ToInt32(trainDf.Rows.Count), 1, rank: ss.Rank, variableHorizon: true);
                }, searchSpace));

### Run Hyper-parameter optimization using AutoMLExperiment
The following code shows how to configure an `AutoMLExperiment` with `pipeline` and `SSARunner`. Note that the first few trials are likely to fail (return `Infinity`). This happens when the parameters sampled from the search space do not satisfy the prerequisites for creating an SSA model, most likely because `WindowSize` is smaller than `Rank`. As training continues, more trials will succeed, because the tuner learns from the failed trials and proposes parameters that are more likely to succeed.

In [1]:
// Configure AutoML
var ssaTrialRunner = new SSARunner(mlContext, trainDf, evaluateDf);
// NotebookMonitor plots trials and show best run nicely in notebook output cell.
var monitor = new NotebookMonitor();

var experiment = mlContext.Auto().CreateExperiment()
                    .SetPipeline(pipeline)
                    .SetTrainingTimeInSeconds(120)
                    .SetTrialRunner(ssaTrialRunner)
                    .SetEvaluateMetric(RegressionMetric.RootMeanSquaredError, "load", "Score")
                    .SetMonitor(monitor);

// Configure Visualizer
monitor.SetUpdate(monitor.Display());

// Start Experiment
var res = await experiment.RunAsync();
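
The failing trials described above could in principle be screened with a cheap check before any training happens. The sketch below is a hypothetical helper, `IsPlausibleSsaConfig`, and the conditions it tests are assumptions apart from the rank and window-size relation mentioned in the text; the exact preconditions enforced by `ForecastBySsa` should be confirmed against the ML.NET documentation.

```csharp
using System;

// Hypothetical pre-check for sampled SSA parameters. The constraints below are
// assumptions for illustration, not a confirmed list of ML.NET's own checks.
bool IsPlausibleSsaConfig(int windowSize, int seriesLength, int rank, int trainSize)
{
    return windowSize >= 2
        && seriesLength > windowSize      // the window must fit inside the buffered series
        && trainSize > 2 * windowSize     // assumed minimum amount of training data
        && rank >= 1
        && rank <= windowSize;            // cannot keep more components than the window has
}

// A weekly window with a small rank passes the check.
Console.WriteLine(IsPlausibleSsaConfig(windowSize: 168, seriesLength: 336, rank: 10, trainSize: 5000));

// Here the rank exceeds the window size, so the check rejects the combination.
Console.WriteLine(IsPlausibleSsaConfig(windowSize: 4, seriesLength: 336, rank: 10, trainSize: 5000));
```

A check of this kind could let a trial runner return a failed result immediately instead of waiting for an exception during `Fit`, at the cost of keeping the conditions in sync with the library.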


## Evaluate model using test dataset
The following code shows how to use the best model found by `AutoMLExperiment` to predict the Luna `load` over the held-out test set (the last week of data, 24 * 7 hourly rows), then compare the predictions with the actual `load` and calculate the RMSE. The evaluation should match the one in `SSARunner`, so we calculate the rolling-update RMSE here as well.

In [1]:
var model = res.Model;
// evaluate
var predictEngine = model.CreateTimeSeriesEngine<ForecastInput, ForecastOutnput>(mlContext);

var predictLoads1H = new List<float>();
foreach (var load in evaluateDf.GetColumn<Single>("load"))
{
    // firstly, get next n predict where n is horizon
    var predict = predictEngine.Predict();

    predictLoads1H.Add(predict.Predict[0]);

    // update model with truth value
    predictEngine.Predict(new ForecastInput()
    {
        Load = load,
    });
}

evaluateDf["predict_load_1h"] = DataFrameColumn.Create("predict_load_1h", predictLoads1H);

var mse = (evaluateDf["load"] - evaluateDf["predict_load_1h"]).Cast<float>().Select(x => x * x).Average();
var rmse = Math.Sqrt(mse);
rmse

## Plot both predicted and actual load in the test dataset

In [1]:
var predicted = evaluateDf["predict_load_1h"].Cast<float>();
var truth = evaluateDf["load"].Cast<float>();
var X = Enumerable.Range(0, truth.Count());
var predictedChart = Chart2D.Chart.Line<int, float, string>(X, predicted, Name: "predict_load_1h");
var truthChart = Chart2D.Chart.Line<int, float, string>(X, truth, Name: "truth");
var combineChart = Chart.Combine(new[]{ predictedChart, truthChart});
combineChart.Display()