## ==**This notebook is under active development**==

# Model Evaluation

In this notebook you will learn:

- What is model evaluation?
- How to evaluate a model in ML.NET
- Train and evaluate models with cross-validation
- Model explainability
- How to improve your model

## What is model evaluation?

Training is the process of applying algorithms to historical data in order to create a model that accurately represents that data. That model is then used on new data to make predictions. 

Model evaluation is the process of using metrics to quantify how effectively your model learned patterns within your data and applies those learnings to new and unseen data. 

## How to evaluate a model in ML.NET

### Reference ML.NET Daily Build Feed

In [1]:
#i "nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/MachineLearning/nuget/v3/index.json"

### Install dependencies

In [1]:
#r "nuget:Microsoft.Data.Analysis,0.20.0-preview.22226.2"
#r "nuget:Microsoft.ML.AutoML,0.20.0-preview.22226.2"

Loading extensions from `Microsoft.Data.Analysis.Interactive.dll`

Reference packages with `using` statements

In [1]:
using System.Text.Json;
using Microsoft.Data.Analysis;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers.FastTree;
using static Microsoft.ML.Transforms.OneHotEncodingEstimator;

### Load your data

Use the `#!value` and `#!share` magic commands to fetch the data from GitHub, store it in the `taxi_data` variable and load it into a `DataFrame` 

In [1]:
#!value --name taxi_data --from-url https://github.com/dotnet/csharp-notebooks/raw/main/machine-learning/data/taxi-fare.csv

In [1]:
#!share taxi_data --from value

In [1]:
var df = DataFrame.LoadCsvFromString(taxi_data);

Once the data is loaded, use the `Head` method to preview the first five rows.

In [1]:
df.Head(5)

index,vendor_id,rate_code,passenger_count,trip_time_in_secs,trip_distance,payment_type,fare_amount
0,CMT,1,1,1271,3.8,CRD,17.5
1,CMT,1,1,474,1.5,CRD,8.0
2,CMT,1,1,637,1.4,CRD,8.5
3,CMT,1,1,181,0.6,CSH,4.5
4,CMT,1,1,661,1.1,CRD,8.5


### Initialize MLContext

All ML.NET operations start in the [MLContext](https://docs.microsoft.com/dotnet/api/microsoft.ml.mlcontext) class. Initializing mlContext creates a new ML.NET environment that can be shared across the model creation workflow objects. It's similar, conceptually, to DBContext in Entity Framework.

In [1]:
var mlContext = new MLContext();

### Split the data into train, validation and test sets

The original dataset is split into three subsets: train, validation, and test. The **train** set is what you'll use to learn the patterns of your data. The **validation** set is used to optimize the model hyperparameters. The **test** set is used to evaluate the performance of your model using evaluation metrics for the regression task. 

In this case, 80% of the data is used for training as defined by the `testFraction` parameter. The remaining 20% is split in half and used as the validation and test sets. 

#### Why not use the entire dataset?

Although in general providing your trainer with more examples it can learn from is recommended, you don't want a model that only performs well on historical data. Instead, you're looking for a model that can learn from that historical data and generalize or make accurate predictions on new and unseen data. 

Some common problems you encounter during training are overfitting and underfitting. Underfitting means the selected trainer is not capable enough to fit training dataset and usually result in a high loss during training and low score/metric on test dataset. To resolve this you need to either select a more powerful model or perform more feature engineering. Overfitting is the opposite, which happens when model learns the training data too well. This usually results in low loss metric during training but high loss on test dataset.

A good analogy for these concepts is studying for an exam. Let's say you knew the questions and answers ahead of time. After studying, you take the test and get a perfect score. Great news! However, when you're given the exam again with the questions rearranged and with slightly different wording you get a lower score. That suggests you memorized the answers and didn't actually learn the concepts you were being tested on. This is an example of overfitting. Underfitting is the opposite where the study materials you were given don't accurately represent what you're evaluated on for the exam. As a result, you resort to guessing the answers since you don't have enough knowledge to answer correctly.


In [1]:
var trainTestData = mlContext.Data.TrainTestSplit(df,testFraction:0.2);
var validationTestData = mlContext.Data.TrainTestSplit(trainTestData.TestSet,testFraction:0.5);

In [1]:
var trainSet = trainTestData.TrainSet;
var validationSet = validationTestData.TrainSet;
var testSet = validationTestData.TestSet;

### Create training pipeline

For this dataset, the following transforms are applied:

- `OneHotEncoding` to convert categorical values into numerical values
- `ReplaceMissingValues` which as the name suggests is to replace any missing values.
- `Concatenate` takes all of the features and creates a feature vector

AutoML is used to define a regression experiment using the `fare_amount` column as the column to predict or label column. 

In [1]:
var pipeline = 
	mlContext.Transforms.Categorical.OneHotEncoding(new[] { new InputOutputColumnPair(@"vendor_id", @"vendor_id"), new InputOutputColumnPair(@"payment_type", @"payment_type")},outputKind: OutputKind.Binary)
		.Append(mlContext.Transforms.ReplaceMissingValues(new[] { new InputOutputColumnPair(@"rate_code", @"rate_code"), new InputOutputColumnPair(@"passenger_count", @"passenger_count"), new InputOutputColumnPair(@"trip_time_in_secs", @"trip_time_in_secs"), new InputOutputColumnPair(@"trip_distance", @"trip_distance") }))
        .Append(mlContext.Transforms.Concatenate(@"Features", new[] { @"vendor_id", @"payment_type", @"rate_code", @"passenger_count", @"trip_time_in_secs", @"trip_distance" }))
        .Append(mlContext.Auto().Regression(labelColumnName: "fare_amount"));

### Configure experiment

Use AutoML to configure our experiment to train for 60 seconds using the pipeline you've just defined. 

By default, AutoML evaluates the models it trains using the evaluation metric you want to optimize. In this case it's R-Squared which is calculated by comparing the actual value `fare_amount` against the predicted value `Score`. 

Evaluation metrics are highly dependent on the task. For regression, some common metrics include:

- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R-Squared

Your dataset and what you're trying to achieve highly impact your metric selection. If you have outiers in your dataset, it may skew your predictions. MAE, MSE, and RMSE calculate the distance between the predicted and actual data points. All of these measures are sensitive to outliers so if you have outliers in your dataset, they will show up in your metrics. R-Squared calculates the correlation between the actual and predicted values. However, as you add more data points, your R-Squared may continue to increase giving the false impression that a model with high a high R-Squared value has good predictive capabilities. The result of a high R-Squared value can someitmes indicate overfitting.

In [1]:
var experiment = 
	mlContext.Auto().CreateExperiment()
		.SetPipeline(pipeline)
        .SetTrainingTimeInSeconds(60)
        .SetDataset(trainSet, validationSet)
        .SetEvaluateMetric(RegressionMetric.RSquared, "fare_amount", "Score");

### Run experiment

In [1]:
var result = await experiment.Run();

### View evaluation metrics of best model

In [1]:
$"R-Squared: {result.Metric}"

R-Squared: 0.9184243629191207

Note that during training, the evaluation metrics were calculated using the validation set. To see how well your model performs on new data, evaluate its performance against the test set. 

Start by getting the best model using the `Model` property from the training results. Then, use the `Transform` method to use the model to make predictions against the test dataset.

In [1]:
ITransformer bestModel = result.Model;
var predictions = bestModel.Transform(testSet);

Inspect the first few predictions (`Score` column) and compare them against the actual value (`fare_amount` column). Then, calculate the difference between them.

In [1]:
var actual = predictions.GetColumn<float>("fare_amount");
var predicted = predictions.GetColumn<float>("Score");

var compare = 
	actual
		.Zip(predicted,(actual,pred) => new {Actual=actual, Predicted=pred, Difference=actual-pred})
		.Take(5);

compare

index,Actual,Predicted,Difference
0,24.5,25.648157,-1.1481571
1,9.5,8.833447,0.66655254
2,4.5,5.2989626,-0.7989626
3,8.0,8.482488,-0.48248768
4,52.0,51.871746,0.12825394


Just by quickly comparing the first few values you can see the predictions are generally just a few cents off from the actual amount. 

With ML.NET, you don't have to manually calculate the evaluation metrics for your models. ML.NET provides a built-in `Evaluate` method for each of the machine learning tasks it supports.  Use the `Evaluate` method for the regression task to calculate the evaluation metrics for the test set where the `fare_amount` column is the actual value and the `Score` column is te predicted value.

In [1]:
var evaluationMetrics = mlContext.Regression.Evaluate(predictions,"fare_amount", "Score");

Using the `Evaluate` method not only calculates the metric you optimized for during training (R-Squared), but also all the metrics for the regression task. 

In [1]:
evaluationMetrics

MeanAbsoluteError,MeanSquaredError,RootMeanSquaredError,LossFunction,RSquared
0.8647471216996511,7.898706700080591,2.8104637873633225,7.89870670766642,0.914446913884524


### Using R-Squared to evaluate the model

Although you have multiple metrics to choose from, when you trained the model you optimized for R-Squared. R-squared (R2), or Coefficient of determination represents the predictive power of the model as a value between -inf and 1.00. 1.00 means there is a perfect fit, and the fit can be arbitrarily poor so the scores can be negative. A score of 0.00 means the model is guessing the expected value for the label. A negative R2 value indicates the fit does not follow the trend of the data and the model performs worse than random guessing. This is only possible with non-linear regression models or constrained linear regression. R2 measures how close the actual test data values are to the predicted values.


For more information on other evaluation metrics, see the guide on how to [evaluate your ML.NET model with metrics](https://docs.microsoft.com/dotnet/machine-learning/resources/metrics).

## Train and evaluate models using cross-validation

### What is cross-validation?

Cross-validation is a training and model evaluation technique that splits the data into several partitions and trains multiple models on these partitions. This technique improves the robustness of the model by holding out data from the training process. In addition to improving performance on unseen observations, in data-constrained environments it can be an effective tool for training models with a smaller dataset.

### Train a model using cross-validation

Start off by initializing the `MLContext`. 

In [1]:
var cvMLContext = new MLContext();

Then, define your pipeline. In this case, the actual trainer is used instead of the `SweepableEstimator` from AutoML.

In [1]:
var cvMLPipeline = 
	cvMLContext.Transforms.Categorical.OneHotEncoding(new[] { new InputOutputColumnPair(@"vendor_id", @"vendor_id"), new InputOutputColumnPair(@"payment_type", @"payment_type")},outputKind: OutputKind.Binary)
		.Append(cvMLContext.Transforms.ReplaceMissingValues(new[] { new InputOutputColumnPair(@"rate_code", @"rate_code"), new InputOutputColumnPair(@"passenger_count", @"passenger_count"), new InputOutputColumnPair(@"trip_time_in_secs", @"trip_time_in_secs"), new InputOutputColumnPair(@"trip_distance", @"trip_distance") }))
        .Append(cvMLContext.Transforms.Concatenate(@"Features", new[] { @"vendor_id", @"payment_type", @"rate_code", @"passenger_count", @"trip_time_in_secs", @"trip_distance" }))
		.Append(cvMLContext.Regression.Trainers.FastForest(labelColumnName: "fare_amount"));

Use the `CrossValidate` method to start the training and evaluation on your data using the defined pipeline. By default, data is split into five subsets but you can set this to any value you prefer using the `numberOfFolds` parameter.

In [1]:
var cvResults = cvMLContext.Regression.CrossValidate(trainSet, cvMLPipeline, labelColumnName: "fare_amount");

In [1]:
cvResults.Select(x => x.Metrics)

index,MeanAbsoluteError,MeanSquaredError,RootMeanSquaredError,LossFunction,RSquared
0,1.3080861318665928,10.43706410043382,3.230644533283385,10.437063966978595,0.8884018879298109
1,1.360165666717988,10.798742155503342,3.2861439645127146,10.798742212339462,0.8850517190709868
2,1.2518280553341603,8.216770508755435,2.866490974825393,8.216770540499184,0.9086191464533752
3,1.3420262660438258,10.22987080266224,3.1984169213319014,10.229870780300226,0.8886983637741152
4,1.5430782453416307,10.122911443078698,3.181652313355232,10.122911478082717,0.88835065401792


### Calculate test set evaluation metrics

Like in the previous example, use the `Evaluate` method on the full test set to evaluate the performance of the models trained using cross-validation. 

In [1]:
var cvTestEvalMetrics = 
	cvResults
		.Select(fold => fold.Model.Transform(trainTestData.TestSet))
		.Select(predictions => cvMLContext.Regression.Evaluate(predictions, "fare_amount", "Score"));

In [1]:
cvTestEvalMetrics

index,MeanAbsoluteError,MeanSquaredError,RootMeanSquaredError,LossFunction,RSquared
0,1.2976353782858168,9.945038672442944,3.1535755377734245,9.945038530343677,0.8922825239280542
1,1.3543408211140406,10.905401842713026,3.3023327880019946,10.905401854019075,0.8818805636922842
2,1.262055192808878,10.292094677957635,3.2081294671439986,10.29209476837981,0.8885234639383487
3,1.3354133990310488,10.336125614281478,3.2149845433969784,10.336125608934791,0.8880465526375378
4,1.5395103668916794,11.414689409280092,3.378563216706192,11.414689431665815,0.8763643286053923


## Model explainability

Evaluation metrics are a good way to quantify how well your model makes predictions on new data. However, a good evaluation metric shouldn't be the only factor you consider when evaluating your model. To build more trust around your model and the decisions it makes, it's important to understand how and why it makes the decisions that it does. 

Models are becoming more commonplace in society and are affecting the lives of individuals. For example, let's imagine that a machine learning model was used to make a medical diagnosis. It's likely the model is right in its diagnosis but the stakes of making the wrong diagnosis are high due to the effects on an individual's health. Therefore, it's important for all stakeholders (patients, physicians, regulators) to understand what drove a model to make that diagnosis in order to feel confident in its decision. 

### Global and local explanations

When explaining machine learning models, you can do so at the global and local level. 

Global explanations are when you make generalizations at the aggregate level. For example, let's say you're building a model to predict taxi fares. When explaining why some fares are more expensive than others, you'll most likely find that rides that rides that travel a longer distance or for a long periodof time are likely to be more expensive. Although this doesn't tell you exactly why any one particular ride was more expensive than another, at an aggregate level you can see which features are important to the model when it makes its decisions. 

If you need more granularity when explaining your model's decisions, that's where local explanations come in. Local explanations allow you to see for any single prediction, which features contributed to the model's decision. For example, let's say a model is used to determine credit risk for personal loans. Given two customers with different amounts of debt, income, and payment history, the model determines which customer is more likely to pay the loan back. Using local explainability techniques, you are able to inspect at the individual level which features contributed to the decision of denying a loan. 

### Explanability techniques in ML.NET

ML.NET provides two techniques for explaining models:

- Permutation Feature Importance (PFI)
- Feature Contribution Calculation (FCC)

#### Permutation Feature Importance (PFI)

Permutation feature importance is a **global** explainability technique. At a high-level it randomly shuffles data one feature at a time for the entire dataset and calculates how much the performance metric of interest decreases. The larger the change, the more important that feature is. For more information, see [Interpret model predictions using Permutation Feature Importance](https://docs.microsoft.com/dotnet/machine-learning/how-to-guides/explain-machine-learning-model-permutation-feature-importance-ml-net)

#### Feature Contrubition Calculation (FCC)

Feature contribution calculation is a **local** explainability technique.  This technique computes a model-specific list of per-feature contributions to each of the predictions. These contributions can be positive (they make the evaluation metric higher) or negative (they make the evaluation metric lower).

### Explain models with Permutation Feature Importance (PFI)

#### Initialize MLContext

In [1]:
var pfiMLContext = new MLContext();

#### Define data preparation pipeline

In [1]:
var pfiDataPipeline = 
	pfiMLContext.Transforms.Categorical.OneHotEncoding(new[] { new InputOutputColumnPair(@"vendor_id", @"vendor_id"), new InputOutputColumnPair(@"payment_type", @"payment_type")},outputKind: OutputKind.Binary)
		.Append(pfiMLContext.Transforms.ReplaceMissingValues(new[] { new InputOutputColumnPair(@"rate_code", @"rate_code"), new InputOutputColumnPair(@"passenger_count", @"passenger_count"), new InputOutputColumnPair(@"trip_time_in_secs", @"trip_time_in_secs"), new InputOutputColumnPair(@"trip_distance", @"trip_distance") }))
        .Append(pfiMLContext.Transforms.Concatenate(@"Features", new[] { @"vendor_id", @"payment_type", @"rate_code", @"passenger_count", @"trip_time_in_secs", @"trip_distance" }));

#### Apply data transformations to the training data

In [1]:
var pfiPreprocessedData = 
	pfiDataPipeline
		.Fit(trainSet)
		.Transform(trainSet);

#### Define your trainer

In [1]:
var pfiTrainer = pfiMLContext.Regression.Trainers.FastForest(labelColumnName: "fare_amount");

#### Fit the trainer to your preprocessed data

In [1]:
var pfiModel = pfiTrainer.Fit(pfiPreprocessedData);

#### Calculate permutation feature importance (PFI)

In [1]:
var permutationFeatureImportance =
    mlContext
        .Regression
        .PermutationFeatureImportance(pfiModel, pfiPreprocessedData, permutationCount:3, labelColumnName: "fare_amount");

#### Extract R-Squared metric

In [1]:
var pfiMetrics = 
	permutationFeatureImportance
		.Select((metric,idx) => new {idx, metric.RSquared})
		.OrderByDescending(x => Math.Abs(x.RSquared.Mean));

#### Get list of feature names

In [1]:
var featureContributionColumn = pfiPreprocessedData.Schema.GetColumnOrNull("Features");
var slotNames = new VBuffer<ReadOnlyMemory<char>>();
featureContributionColumn.Value.GetSlotNames(ref slotNames);
var slotNameValues = slotNames.DenseValues();

#### Map PFI metrics to feature names

In [1]:
var featureImportance = 
	pfiMetrics
		.Zip(slotNameValues, (a,b) => new KeyValuePair<string,double>(b.ToString(),a.RSquared.Mean));

featureImportance

index,Key,Value
0,vendor_id.Bit2,-0.5099934971385574
1,vendor_id.Bit1,-0.2095275246949585
2,vendor_id.Bit0,-0.2048348541128689
3,payment_type.Bit3,-0.0013997624322795
4,payment_type.Bit2,-0.000544155113559
5,payment_type.Bit1,-0.000150191181407
6,payment_type.Bit0,-6.645860994787996e-05
7,rate_code,-5.199206439450895e-07
8,passenger_count,0.0
9,trip_time_in_secs,0.0


### Explain models with Feature Contribution Calculation (FCC)

#### Initialize MLContext

In [1]:
var fccMLContext = new MLContext();

#### Define data preparation pipeline

In [1]:
var fccDataPipeline = 
	fccMLContext.Transforms.Categorical.OneHotEncoding(new[] { new InputOutputColumnPair(@"vendor_id", @"vendor_id"), new InputOutputColumnPair(@"payment_type", @"payment_type")},outputKind: OutputKind.Binary)
		.Append(fccMLContext.Transforms.ReplaceMissingValues(new[] { new InputOutputColumnPair(@"rate_code", @"rate_code"), new InputOutputColumnPair(@"passenger_count", @"passenger_count"), new InputOutputColumnPair(@"trip_time_in_secs", @"trip_time_in_secs"), new InputOutputColumnPair(@"trip_distance", @"trip_distance") }))
        .Append(fccMLContext.Transforms.Concatenate(@"Features", new[] { @"vendor_id", @"payment_type", @"rate_code", @"passenger_count", @"trip_time_in_secs", @"trip_distance" }));

#### Apply data transformations to the training data

In [1]:
var fccPreprocessedData = 
	fccDataPipeline
		.Fit(trainSet)
		.Transform(trainSet);

#### Define your trainer

In [1]:
var fccTrainer = fccMLContext.Regression.Trainers.FastForest(labelColumnName: "fare_amount");

#### Fit the trainer to your preprocessed data

In [1]:
var fccModel = fccTrainer.Fit(fccPreprocessedData);

#### Calculate feature contributions

In [1]:
var featureContributionCalc = 
	fccMLContext.Transforms.CalculateFeatureContribution(fccModel,normalize:false)
		.Fit(fccPreprocessedData)
		.Transform(fccPreprocessedData);

#### Get list of feature names

In [1]:
var featureContributionColumn = featureContributionCalc.Schema.GetColumnOrNull("FeatureContributions");
var slotNames = new VBuffer<ReadOnlyMemory<char>>();
featureContributionColumn.Value.GetSlotNames(ref slotNames);
var slotNameValues = slotNames.DenseValues();

#### Get feature contribution values

In [1]:
var featureContributionValues = featureContributionCalc.GetColumn<float[]>("FeatureContributions");

#### Map feature contribution values with feature names

In [1]:
var featureContributions = 
	featureContributionValues
		.Select(x => x.Zip(slotNameValues, (a,b) => new KeyValuePair<string,float>(b.ToString(),a)));

#### Display feature contributions for the first prediction

In [1]:
featureContributions.First()

index,Key,Value
0,vendor_id.Bit2,0.0
1,vendor_id.Bit1,0.0
2,vendor_id.Bit0,6.46816
3,payment_type.Bit3,0.0
4,payment_type.Bit2,-3.4585545
5,payment_type.Bit1,19.19871
6,payment_type.Bit0,10.600954
7,rate_code,-1779.6366
8,passenger_count,-1.6658905
9,trip_time_in_secs,75.95777


## How can I improve my model?

Model evaluation is an important step in the machine learning workflow to determine whether a model is ready to be deployed into production. If your model doesn't meet your criteria for being production-ready, there's a few things you can try to improve your model. These include:

- **Reframe the problem** - Are you trying to solve the right problem? Consider looking at the problem from different points of view. 
- **Provide more data samples** - Experience is the best teacher. Providing more examples that represent your problem space help the trainers identify more edge cases. 
- **Add features (more context)** -  Building context around the data points helps algorithms as well as subject matter experts better make decisions. For example, the fact that a house has three bedrooms does not on its own give a good indication of its price. However, if you add context and now know that it is in a suburban neighborhood outside of a major metropolitan area where average age is 38, average household income is $80,000 and schools are in the top 20th percentile then the algorithm has more information to base its decisions on.
- **Use meaninful data and features** - More data and features (context) can help improve accuracy but also introduce noise. Consider using permutation feature importance (PFI) and feature contribution calculation(FCC) to determine the features impacting your predictions and remove any features that don't contribute to your model.
- **Use cross-validation** - Cross-validatoion can be an effective tool for training models with smaller datasets. 
- **Hyperparameter tuning** - Hyperparameters are just as important as the parameters learned during training. Finding the right ones is a process of trial and error. Use AutoML search spaces and sweeping estimators to help you find the right hyperparameters for your algorithm.
- **Choose a different algorithm** - Just like hyperparamter tuning, finding the right algorithm to train your model is a process of trial and error. Use AutoML to help you iterate through the various algorithms available in ML.NET to help you choose the optimal algorithm to solve your problem. 

For more information, see the guide on [how to improve your ML.NET model](https://docs.microsoft.com/dotnet/machine-learning/resources/improve-machine-learning-model-ml-net).

## Continue learning

> [⏪ Last Module - Training and AutoML](https://ntbk.io/ml-03-training)

### More End to End Examples

- [Binary Classification with Titanic Dataset](ntbk.io/ml-ref-kaggle-titanic)  
- [Value Prediction/Regression with Taxi Dataset](https://ntbk.io/ml-e2e-taxi)  