# Training and AutoML

## In this Notebook, you will learn
- What does __Training__ mean?
- Introduction to trainers,  some of their differences, and how to decide which one to use.
- How hyper-parameters impact training performance.
- How to use AutoML to simplify your training process.

## What does __Training__ mean.
Before diving into code, let's first talk a little about what does "train a model" actually mean. 

In ML.Net, "train a model" usually means call `model.Fit(X)` in ML.Net, where `X` is an `IDataView` which includes both feature and label. So what happen when you call `Fit`? Generally speaking, `Fit` updates parameters in the trainer so it can predict label that is **close** to the actual label in `X`, or in another word, to decrease the distance between predicted and actual label.

In machine learning, the difference or distance between predicted and actual label is usually called **loss** and you use different loss measures based on the task. For classification softmax is a common loss measure. For regression, Root Mean Squared Error (RMSE) is a common loss measure. In general though, they are all metrics to quantify the distance between the predicted and actual label. In most of cases, a **lower loss means a better model**. For more information, see the [ML.NET evaluation metrics guide](https://docs.microsoft.com/dotnet/machine-learning/resources/metrics).

So what `Fit` does is to apply an algorithm to your data to identify patterns and adjust parameters in that algorithm to lowers the loss. When you train a model, you want to decrease its loss to make the prediction of that model closer to the actual label.

## Trainers in ML.Net
ML.NET provides a variety of trainers. You can find most of them under the [StandardTrainersCatalog](https://docs.microsoft.com/dotnet/api/microsoft.ml.standardtrainerscatalog?view=ml-dotnet). Examples of trainers include linear trainers like `SDCA`, `Lbfgs`, `LinearSvm` and tree-based non-linear trainers like `FastTree`, `RandomForest` and `LightGbm`. Generally, each trainer's capability is different. Non-linear models sometimes have better training performance (lower loss) than linear ones, but it doesn't always mean they are always the better choice. Picking the right trainer to build the best model for your data requires many attempts of trial and error.

### Overfitting and Underfitting
Overfitting and underfitting are the two most common problems you encounter when training a model. Underfitting means the selected trainer is not capable enough to fit training dataset and usually result in a high loss during training and low score/metric on test dataset. To resolve this you need to either select a more powerful model or perform more feature engineering. Overfitting is the opposite, which happens when model learns the training data too well. This usually results in low loss metric during training but high loss on test dataset.

A good analogy for these concepts is studying for an exam. Let's say you knew the questions and answers ahead of time. After studying, you take the test and get a perfect score. Great news! However, when you're given the exam again with the questions rearranged and with slightly different wording you get a lower score. That suggests you memorized the answers and didn't actually learn the concepts you were being tested on. This is an example of overfitting. Underfitting is the opposite where the study materials you were given don't accurately represent what you're evaluated on for the exam. As a result, you resort to guessing the answers since you don't have enough knowledge to answer correctly.

### Difference in parameter and hyper-parameter
In the nutshell, parameters are internal to a trainer, and is updated based on training dataset during training(`Fit`) process. While hyper-parameters are external to a trainer and control training process. For example, in `LightGbm`, `LearningRate` is a hyper-parameter which you can designate when creating and it controls the updating steps for the tree nodes weight during training. And tree nodes weight is parameter which is adjusted during `Fit` process.

### Hyper-parameter optimization
Choosing the right trainer impacts your final training performance. Choosing the right hyper-parameters also has a huge impact over the final training performance, especially for tree-base trainers. Hyper-parameters are important because it controls how parameter being updated. For example, larger `numberOfLeaves` in `LightGbm` produces a larger model and usually enables it to fit on a more complex dataset, but it might have countereffect on small dataset and cause **overfitting**. Conversely, if the dataset is complex but you set a small `numberOfLeaves`, it might impair `LightGbm`'s ability on fitting that dataset and cause **underfit**.

The process of finding the best configuration for your trainer is known as hyper-parameter optimization (HPO). Like the process of choosing your trainer it involves a lot of trial and error. The built-in Automated ML (AutoML) capabilities in ML.NET simplify the HPO process.

In the next section, we will go through two examples. The first example trains a regression model on a linear dataset using both linear and more advanced non-linear trainers to highlight the importance of selecting the right trainer. The second example trains a regression model on a non-linear dataset using `LightGbm` with different hyper-parameters to show the importance of hyper-parameter optimization

## Example 1: Linear regression
In the below section, we are going to show the difference of trainers via a linear regression task. First, we fit the linear dataset with the linear trainer, `SDCA`. Then we git the linear dataset with `LightGbm`, a tree-base non-linear trainer. Their performance is evaluated against a test dataset. The code below:
- Creates a linear dataset and splits it into train/test sets
- Create training pipelines using `SDCA` and `LightGbm`
- Trains both `SDCA` and `LightGbm` on the linear training set, and evaluates them on the test set.

In [1]:
// using nightly-build
#i "nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/MachineLearning/nuget/v3/index.json"
#r "nuget: Plotly.NET.Interactive, 3.0.2"
#r "nuget: Plotly.NET.CSharp, 0.0.1"
#r "nuget: Microsoft.ML.AutoML, 0.20.0-preview.22356.1"
#r "nuget: Microsoft.Data.Analysis, 0.20.0-preview.22356.1"

Loading extensions from `Microsoft.ML.AutoML.Interactive.dll`

Loading extensions from `Microsoft.Data.Analysis.Interactive.dll`

Loading extensions from `Plotly.NET.Interactive.dll`

In [1]:
// Import usings.
using Microsoft.Data.Analysis;
using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

## Create linear dataset
The code below creates a linear dataset with a random residual. The dataset is loaded into train and test set.

In [1]:
var rand = new Random(0);
var context =new MLContext(seed: 1);
var x = Enumerable.Range(-10000, 10000).Select(_x => _x * 0.1f).ToArray();
var y = x.Select(_x => 100 * _x + (rand.NextSingle() - 0.5f) * 10).ToArray();
var df = new DataFrame();
df["X"] = DataFrameColumn.Create("X", x);
df["y"] = DataFrameColumn.Create("y", y);
var trainTestSplit = context.Data.TrainTestSplit(df);
df.Head(10)

index,X,y
0,-1000.0,-99997.734
1,-999.9,-99986.83
2,-999.8,-99977.32
3,-999.7,-99969.42
4,-999.60004,-99962.94
5,-999.5,-99949.414
6,-999.4,-99935.94
7,-999.3,-99930.58
8,-999.2,-99915.23
9,-999.10004,-99912.266


## Construct pipeline
The code below shows how to construct training pipelines for both `SDCA` and `LightGbm`. The `Concatenate` transformer is required to convert a `single` column into `Vector<single>` type, which is the expected feature type for both `SDCA` and `LightGbm` regressor.

In [1]:
var sdcaPipeline = context.Transforms.Concatenate("Features", "X")
                            .Append(context.Regression.Trainers.Sdca("y"));

In [1]:
var lgbmPipeline = context.Transforms.Concatenate("Features", "X")
                            .Append(context.Regression.Trainers.LightGbm("y"));

## Train and evaluate model
The code below first trains `sdcaPipeline` and `lgbmPipeline` which are created above, then evaluate their performance on test dataset by calculating `Root Mean Square Error` between predicted and actual value. `SDCA` has better performance with a significantly lower `Root Mean Square Error` compared to `LightGbm` even though it's a simpler linear model. This is because the training dataset is also linear, so `SDCA` can fit the dataset better than `LightGbm`.

In [1]:
var sdcaModel = sdcaPipeline.Fit(trainTestSplit.TrainSet);
var lgbmModel = lgbmPipeline.Fit(trainTestSplit.TrainSet);

// evaluate on train set
var sdcaTrainEval = sdcaModel.Transform(trainTestSplit.TrainSet);
var sdcaTrainMetric = context.Regression.Evaluate(sdcaTrainEval, "y");

var lgbmTrainEval = lgbmModel.Transform(trainTestSplit.TrainSet);
var lgbmTrainMetric = context.Regression.Evaluate(lgbmTrainEval, "y");

Console.WriteLine($"sdca rmse on trainset: {sdcaTrainMetric.RootMeanSquaredError}, lgbm rmse on trainset: {lgbmTrainMetric.RootMeanSquaredError}");

// evaluate on test set
var sdcaTestEval = sdcaModel.Transform(trainTestSplit.TestSet);
var sdcaTestMetric = context.Regression.Evaluate(sdcaTestEval, "y");

var lgbmTestEval = lgbmModel.Transform(trainTestSplit.TestSet);
var lgbmTestMetric = context.Regression.Evaluate(lgbmTestEval, "y");
Console.WriteLine($"sdca rmse on testset: {sdcaTestMetric.RootMeanSquaredError}, lgbm rmse on testset: {lgbmTestMetric.RootMeanSquaredError}");

sdca rmse on trainset: 2.9576339604324984, lgbm rmse on trainset: 117.9591141671402
sdca rmse on testset: 2.989504150889505, lgbm rmse on testset: 119.860254032606

## Example 2: Non-linear regression on LightGbm.
This example shows the importance of hyper-parameter optimization. First we create a non-linear dataset and two pipelines. One pipeline has `LightGbm` with `numberOfLeaves` set to `10`, the other's set to `1000`. Both pipelines are trained with the same training dataset and their training performance is evaluated on the same test dataset.

## Create non-linear dataset
The code below creates a non-linear dataset with a random residual. The dataset is loaded into train and test set

In [1]:
var rand = new Random(0);
var context =new MLContext(seed: 1);
var x = Enumerable.Range(-10000, 10000).Select(_x => _x * 0.1f).ToArray();
var y = x.Select(_x => 100 * _x * _x + (rand.NextSingle() - 0.5f) * 10).ToArray();
var df = new DataFrame();
df["X"] = DataFrameColumn.Create("X", x);
df["y"] = DataFrameColumn.Create("y", y);
var trainTestSplit = context.Data.TrainTestSplit(df);
df.Head(10)

index,X,y
0,-1000.0,100000000
1,-999.9,99980000
2,-999.8,99960000
3,-999.7,99940010
4,-999.60004,99920020
5,-999.5,99900024
6,-999.4,99880050
7,-999.3,99860050
8,-999.2,99840070
9,-999.10004,99820090


## Construct pipeline
The code below shows how to construct training pipelines for `LightGbm` with different hyper-parameters. The `Concatenate` transformer is required because it converts a `single` column into `Vector<single>` type, which is the expected feature type for the `LightGbm` trainer.

In [1]:
var smallLgbmPipeline = context.Transforms.Concatenate("Features", "X")
                            .Append(context.Regression.Trainers.LightGbm("y", numberOfLeaves: 10));

In [1]:
var largeLgbmPipeline = context.Transforms.Concatenate("Features", "X")
                            .Append(context.Regression.Trainers.LightGbm("y", numberOfLeaves: 1000));

## Train and evaluate model
The code below first trains `smallLgbmPipeline` and `largeLgbmPipeline` which are created above, then evaluates their performance on the test dataset by calculating the `Root Mean Square Error` between predicted and actual value. The model created by `largeLgbmPipeline` has better performance with a lower RMSE.

In [1]:
var smallLgbmModel = smallLgbmPipeline.Fit(trainTestSplit.TrainSet);
var largeLgbmModel = largeLgbmPipeline.Fit(trainTestSplit.TrainSet);

// evaluate on test set
var smallTestEval = smallLgbmModel.Transform(trainTestSplit.TrainSet);
var smallLgbmMetric = context.Regression.Evaluate(smallTestEval, "y");

var largeLgbmEval = largeLgbmModel.Transform(trainTestSplit.TrainSet);
var largeLgbmMetric = context.Regression.Evaluate(largeLgbmEval, "y");

Console.WriteLine($"small lgbm rmse on testset: {smallLgbmMetric.RootMeanSquaredError}, large lgbm rmse on testset: {largeLgbmMetric.RootMeanSquaredError}");

small lgbm rmse on testset: 173938.52924678137, large lgbm rmse on testset: 132927.2510939994

## Use AutoML to simplify trainer selection and hyper-parameter optimization.
Trainer selection and Hyper-parameter optimization is an important process with lots of trial and error. This process can be automated and simplified using the built-in `AutoMLExperiment`. `AutoMLExperiment` applies the latest research from Microsoft Research to conduct a swift, accurate and thorough hyper-parameter optimization given a limited time budget.

The code below shows how to use `AutoMLExperiment` for finding the best trainer along with its best hyper parrameter on the non-linear dataset used in Example 2. Firstly, a `SweepableEstimatorPipeline` is created via `context.Auto().Regression("y")`, which returns the most popular regressors along with their default search space in ML.Net. Then an `AutoMLExperiment` is created. It uses `RootMeanSquaredError` as optimization metric and train-test validation to evaluate trial score, and uses `NotebookMonitor` to present training process. Once the training is completed, it returns the best trial as result.

In [1]:
var context =new MLContext(seed: 1);
var pipeline = context.Transforms.Concatenate("Features", "X")
    .Append(context.Auto().Regression("y", useLbfgs: false, useSdca: false, useFastForest: false));

var monitor = new NotebookMonitor();
var experiment = context.Auto().CreateExperiment();
experiment.SetPipeline(pipeline)
        .SetEvaluateMetric(RegressionMetric.RootMeanSquaredError, "y")
        .SetTrainingTimeInSeconds(30)
        .SetDataset(trainTestSplit.TrainSet, trainTestSplit.TestSet)
        .SetMonitor(monitor);

// Configure Visualizer			
monitor.SetUpdate(monitor.Display());

var res = await experiment.RunAsync();

index,Trial,Metric,Trainer,Parameters
⏮⏪◀️Page1▶️⏩⏭️,⏮⏪◀️Page1▶️⏩⏭️,⏮⏪◀️Page1▶️⏩⏭️,⏮⏪◀️Page1▶️⏩⏭️,⏮⏪◀️Page1▶️⏩⏭️


## Check model/metric from experiment result

In [1]:
// get model
var model  = res.Model;
var eval = model.Transform(trainTestSplit.TestSet);
var metric = context.Regression.Evaluate(eval, "y");

// should be identical with res.Metric
metric.RootMeanSquaredError

# Continue learning

> [⏩ Next Module - Model Evaluation](./04-Model%20Evaluation.ipynb)  
> [⏪ Last Module - Data Prep and Feature Engineering](./02-Data%20Preparation%20and%20Feature%20Engineering.ipynb)  