# User Study

Now, we'd like you to use Interactive Notebooks to build, train, evaluate, and consume an ML.NET model. 

- **ML.NET** is a machine learning framework for .NET developers, so you can use C# and F# to build custom machine learning models that are trained on your own data. 
- **Interactive Notebooks** are used to create and share documents that contain live code, equations, visualizations, and text.  

Imagine you are a developer at a taxi company, responsible for building an app that predicts the price of a taxi fare based on factors like distance traveled, payment type, and number of passengers.

You have already used automated tooling to train the model, and the tooling generated the Interactive Notebook that you see here. 

# Task 1 starts here!

## User Study: Task 1

Explore the Interactive Notebook in VS Code and use it to train, evaluate, and consume an ML.NET regression model that predicts the price of a taxi fare.

### This Interactive Notebook was generated by ML.NET Tooling.

The code below demonstrates how to:

1. Define the model input and output schema
1. Load in data from a text file to an IDataView
1. Set up the training pipeline with data transforms
1. Choose an algorithm and append it to the pipeline
1. Train the model
1. Evaluate the model
1. Consume the model

## Your scenario: Regression

A regression model is used to predict the value of the label from a set of related features. The label can be any real value.
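
To make the idea of predicting a real-valued label from features concrete, here is a minimal sketch that fits a straight line to a single feature with ordinary least squares. The data values are made up for illustration, and this is not the algorithm the notebook trains later (that one is LightGBM); it only shows what "predict the label from features" means.

```csharp
using System;
using System.Linq;

// Made-up (distance, fare) pairs for illustration; not taken from the taxi data set.
double[] distance = { 1.0, 2.0, 3.0, 4.0 };
double[] fare     = { 5.0, 7.0, 9.0, 11.0 };

// Ordinary least squares for one feature:
// slope = covariance(x, y) / variance(x), intercept = mean(y) - slope * mean(x).
double meanX = distance.Average();
double meanY = fare.Average();
double slope = distance.Zip(fare, (x, y) => (x - meanX) * (y - meanY)).Sum()
             / distance.Sum(x => (x - meanX) * (x - meanX));
double intercept = meanY - slope * meanX;

// Predict the label (fare) for an unseen feature value (a distance of 5.0).
double predictedFare = intercept + slope * 5.0;
Console.WriteLine($"fare = {intercept} + {slope} * distance; prediction for 5.0 = {predictedFare}");
```

Here `distance` plays the role of a feature and `fare` the role of the label; the taxi model in this notebook does the same thing with six features instead of one.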

## Install the necessary NuGet packages for training an ML.NET model and plotting:

In [1]:
#i "nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet5/nuget/v3/index.json" 
#i "nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-tools/nuget/v3/index.json" 

#r "nuget:Microsoft.ML,1.5.1"
#r "nuget:Microsoft.ML.AutoML,0.17.1"
#r "nuget:Microsoft.Data.Analysis,0.4.0"
#r "nuget:XPlot.Plotly.Interactive, 4.0.1"

In [1]:
using static Microsoft.DotNet.Interactive.Formatting.PocketViewTags;
using Microsoft.DotNet.Interactive.Formatting;
using Microsoft.Data.Analysis;
using XPlot.Plotly;

In [1]:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

## Define the model input and output schemas:

In [1]:
// Define the model input schema (which columns you will be loading in for training)
public class ModelInput
{
    [ColumnName("vendor_id"), LoadColumn(0)]
    public string Vendor_id { get; set; }

    [ColumnName("rate_code"), LoadColumn(1)]
    public float Rate_code { get; set; }

    [ColumnName("passenger_count"), LoadColumn(2)]
    public float Passenger_count { get; set; }

    [ColumnName("trip_time_in_secs"), LoadColumn(3)]
    public float Trip_time_in_secs { get; set; }

    [ColumnName("trip_distance"), LoadColumn(4)]
    public float Trip_distance { get; set; }

    [ColumnName("payment_type"), LoadColumn(5)]
    public string Payment_type { get; set; }

    [ColumnName("fare_amount"), LoadColumn(6)]
    public float Fare_amount { get; set; }
}

In [1]:
// Define the model output schema (what the model will return)
public class ModelOutput
{
    public float Score { get; set; }
}

## Create MLContext and load training data:

In [1]:
// Define path to training data
string taxiDataPath = "taxi-fare-train.csv";

In [1]:
// Create a new MLContext (the starting point for all ML.NET operations)
var mlContext = new MLContext();

// Load data from a text file to an IDataView (a flexible, efficient way of describing tabular data)
IDataView trainingDataView = mlContext.Data.LoadFromTextFile<ModelInput>(
    path: taxiDataPath,
    hasHeader: true,
    separatorChar: ',',
    allowQuoting: true,
    allowSparse: false);

// Display training data schema
display(trainingDataView.Schema);

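| index | Name | Index | IsHidden | Type | Annotations |
| --- | --- | --- | --- | --- | --- |
| 0 | vendor_id | 0 | False | String | |
| 1 | rate_code | 1 | False | Single | |
| 2 | passenger_count | 2 | False | Single | |
| 3 | trip_time_in_secs | 3 | False | Single | |
| 4 | trip_distance | 4 | False | Single | |
| 5 | payment_type | 5 | False | String | |
| 6 | fare_amount | 6 | False | Single | |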


In [1]:
// Helper that returns the first rows of an IDataView (5 by default)
public static List<ModelInput> Head(MLContext mlContext, IDataView dataView, int numberOfRows = 5)
{
    var rows = mlContext.Data.CreateEnumerable<ModelInput>(dataView, reuseRowObject: false)
                    .Take(numberOfRows)
                    .ToList();
    
    return rows;
}

display(h4("Showing 5 rows from training DataView:"));

var fewRows = Head(mlContext, trainingDataView, 5);
display(fewRows);

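| index | Vendor_id | Rate_code | Passenger_count | Trip_time_in_secs | Trip_distance | Payment_type | Fare_amount |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | CMT | 1 | 1 | 1271 | 3.8 | CRD | 17.5 |
| 1 | CMT | 1 | 1 | 474 | 1.5 | CRD | 8.0 |
| 2 | CMT | 1 | 1 | 637 | 1.4 | CRD | 8.5 |
| 3 | CMT | 1 | 1 | 181 | 0.6 | CSH | 4.5 |
| 4 | CMT | 1 | 1 | 661 | 1.1 | CRD | 8.5 |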


## Create the training pipeline, choose an algorithm, and train the model:

In [1]:
// Create the data processing pipeline (data transforms which will convert data into necessary format for training)
var dataProcessPipeline = mlContext.Transforms.Categorical.OneHotEncoding(new[]
    {
        new InputOutputColumnPair("vendor_id", "vendor_id"),
        new InputOutputColumnPair("payment_type", "payment_type")
    })
    .Append(mlContext.Transforms.Concatenate("Features", new[]
    {
        "vendor_id", "payment_type", "rate_code", "passenger_count", "trip_time_in_secs", "trip_distance"
    }));

// Set the training algorithm (trainer) 
var trainer = mlContext.Regression.Trainers.LightGbm(labelColumnName: "fare_amount", featureColumnName: "Features");

// Append the trainer to the data processing pipeline
var trainingPipeline = dataProcessPipeline.Append(trainer);

// Train the model (fit the model to the training data)
ITransformer model = trainingPipeline.Fit(trainingDataView);
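
The pipeline above one-hot encodes the two text columns before concatenating everything into a single numeric `Features` vector. As a rough sketch of what one-hot encoding means (plain C#, not the ML.NET implementation, which also handles details such as unseen values that are ignored here), each distinct category gets its own slot, and a value is represented by a 1 in its slot and 0 elsewhere. The variable names below are hypothetical.

```csharp
using System;
using System.Linq;

// Payment codes like those in the sample rows shown earlier.
string[] paymentTypes = { "CRD", "CRD", "CSH", "CRD" };

// Build the vocabulary: one slot per distinct category, in order of first appearance.
string[] vocabulary = paymentTypes.Distinct().ToArray();

// Encode each value as a vector with a single 1 in the slot of its category.
float[][] encoded = paymentTypes
    .Select(value => vocabulary.Select(category => category == value ? 1f : 0f).ToArray())
    .ToArray();

for (int i = 0; i < paymentTypes.Length; i++)
{
    Console.WriteLine($"{paymentTypes[i]} -> [{string.Join(", ", encoded[i])}]");
}
```

This is why a single text column such as `payment_type` can show up as several separate feature slots (one per category) once the data has been transformed.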

## Evaluate the model:

In [1]:
// Evaluate the model using the cross validation method
// Learn more about cross validation at https://aka.ms/mlnet-cross-validation
var crossValidationResults = mlContext.Regression.CrossValidate(trainingDataView, trainingPipeline, numberOfFolds: 5, labelColumnName: "fare_amount");

// Define which model evaluation metrics you'd like to see
var L1 = crossValidationResults.Select(r => r.Metrics.MeanAbsoluteError);
var L2 = crossValidationResults.Select(r => r.Metrics.MeanSquaredError);
var RMS = crossValidationResults.Select(r => r.Metrics.RootMeanSquaredError);
var lossFunction = crossValidationResults.Select(r => r.Metrics.LossFunction);
var R2 = crossValidationResults.Select(r => r.Metrics.RSquared);

// Print out the evaluation metrics
var metricNames = new StringDataFrameColumn("Metric Name", new[] {"Average L1 Loss", "Average L2 Loss", "Average RMS", "Average Loss Function", "Average R-Squared"});
var metricValues = new StringDataFrameColumn("Value",new[] {$"{L1.Average():0.###}", $"{L2.Average():0.###}", $"{RMS.Average():0.###}", $"{lossFunction.Average():0.###}", $"{R2.Average():0.###}"});
var stats = new DataFrame(metricNames, metricValues);

stats

| index | Metric Name | Value |
| --- | --- | --- |
| 0 | Average L1 Loss | 0.419 |
| 1 | Average L2 Loss | 4.893 |
| 2 | Average RMS | 2.21 |
| 3 | Average Loss Function | 4.893 |
| 4 | Average R-Squared | 0.947 |
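
To help read these numbers, the sketch below computes the same kinds of metrics by hand on a tiny made-up set of actual and predicted values (not the taxi data). It follows the standard definitions: L1 loss is the mean absolute error, L2 loss is the mean squared error, RMS is the square root of the L2 loss, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The Loss Function row matching the L2 row is consistent with squared loss being used as the loss function, which is assumed here to be the ML.NET default for regression.

```csharp
using System;
using System.Linq;

// Toy values for illustration only (not taken from the taxi data set).
double[] actual    = { 10.0, 20.0, 30.0, 40.0 };
double[] predicted = { 12.0, 18.0, 33.0, 37.0 };

// Per-row errors (predicted minus actual).
double[] errors = actual.Zip(predicted, (a, p) => p - a).ToArray();

// L1 loss: mean absolute error.
double mae = errors.Average(e => Math.Abs(e));

// L2 loss: mean squared error.
double mse = errors.Average(e => e * e);

// RMS: square root of the mean squared error.
double rmse = Math.Sqrt(mse);

// R-squared: 1 - (residual sum of squares / total sum of squares).
double meanActual = actual.Average();
double ssRes = errors.Sum(e => e * e);
double ssTot = actual.Sum(a => (a - meanActual) * (a - meanActual));
double r2 = 1.0 - ssRes / ssTot;

Console.WriteLine($"L1 = {mae:0.###}, L2 = {mse:0.###}, RMS = {rmse:0.###}, R-squared = {r2:0.###}");
```

Lower is better for the three loss metrics, which are in fare units (or squared fare units for L2), while R-squared is unitless and closer to 1 is better.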


## Consume the model

In [1]:
// Define sample model input
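var sampleInput = new ModelInput(){
    Vendor_id = "VTS",
    Rate_code = 1f,
    Passenger_count = 1,
    // Trip_time_in_secs is not set here, so it keeps its default value of 0
    Trip_distance = 3.75f,
    Payment_type = "1", // note: the training rows shown above use codes such as "CRD" and "CSH"
    Fare_amount = 0 // actual fare for this trip = 15.5
};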

// Create a Prediction Engine (used to make single predictions)
var predEngine = mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(model);

// Use the model and Prediction Engine to predict on new sample data
var prediction = predEngine.Predict(sampleInput);

// Print out the actual value and the predicted value
Console.WriteLine($"Single prediction:");
Console.WriteLine($"  Predicted fare: {prediction.Score:0.####}");
Console.WriteLine($"  Actual fare: 15.5");

Single prediction:
  Predicted fare: 13.9358
  Actual fare: 15.5


# Task 1 ends here! Please do not continue to Task 2 until told to do so.

## User Study: Task 2

Use the given Notebook to explore the taxi fare data and the trained model.


### This Interactive Notebook was generated by ML.NET Tooling.

The code below demonstrates several methods to explain your model, including how to:

1. Get and display a histogram of the distribution of taxi trips per cost
1. Get and display the importance of different features
1. Plot the predicted values against the true values

In [1]:
using System;
using System.Collections.Generic;
using System.Collections.Immutable;
using System.IO;
using System.Linq;
using XPlot.Plotly;

In [1]:
// Extract some data into arrays for plotting:
int numberOfRows = 5000;
float[] fares = trainingDataView.GetColumn<float>("fare_amount").Take(numberOfRows).ToArray();
float[] distances = trainingDataView.GetColumn<float>("trip_distance").Take(numberOfRows).ToArray();
float[] passengerCounts = trainingDataView.GetColumn<float>("passenger_count").Take(numberOfRows).ToArray();

// Distribution of taxi trips per cost

var faresHistogram = Chart.Plot(new Histogram(){x = fares, autobinx = false, nbinsx = 20});
var layout = new Layout.Layout(){title="Distribution of taxi trips per cost"};
faresHistogram.WithLayout(layout);
faresHistogram.WithXTitle("Fare ranges");
faresHistogram.WithYTitle("Number of trips");
display(faresHistogram);

### Permutation Feature Importance (PFI)

PFI measures the importance of each feature by randomly shuffling that feature's values across the dataset and recording how much the model's performance changes.

In other words, it tells you how much the performance of the model would decrease (or increase) if the information carried by feature X were lost.

Learn more at https://aka.ms/mlnet-pfi.
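
As a rough illustration of the idea (not the ML.NET implementation), the sketch below applies PFI to a hypothetical stand-in model whose prediction depends only on trip distance. All names and data values are made up. Shuffling the distance column should hurt R-squared badly, while shuffling the passenger count, which the stand-in model ignores, should change nothing.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in for a trained model: the fare depends on distance only.
double Predict(double[] row) => 2.5 + 2.0 * row[0];

// R-squared of the stand-in model on the given rows and labels.
double RSquared(double[][] rows, double[] labels)
{
    double mean = labels.Average();
    double ssRes = rows.Select((r, i) => Math.Pow(labels[i] - Predict(r), 2)).Sum();
    double ssTot = labels.Sum(y => Math.Pow(y - mean, 2));
    return 1.0 - ssRes / ssTot;
}

// Toy data: column 0 = trip distance, column 1 = passenger count (made-up values).
var rng = new Random(42);
double[][] toyRows = Enumerable.Range(0, 200)
    .Select(_ => new[] { rng.NextDouble() * 10.0, (double)rng.Next(1, 5) })
    .ToArray();
double[] toyLabels = toyRows.Select(r => Predict(r)).ToArray();

double baseline = RSquared(toyRows, toyLabels);
double[] deltas = new double[2];
for (int col = 0; col < 2; col++)
{
    // Copy the rows, then shuffle one column (Fisher-Yates) to break its link to the label.
    double[][] shuffled = toyRows.Select(r => (double[])r.Clone()).ToArray();
    for (int i = shuffled.Length - 1; i > 0; i--)
    {
        int j = rng.Next(i + 1);
        (shuffled[i][col], shuffled[j][col]) = (shuffled[j][col], shuffled[i][col]);
    }

    // Importance = change in R-squared after permuting this column.
    deltas[col] = RSquared(shuffled, toyLabels) - baseline;
}

Console.WriteLine($"distance: {deltas[0]:0.###}, passenger count: {deltas[1]:0.###}");
```

A strongly negative change marks a feature the model relies on, and a change near zero marks a feature the model barely uses, which is how the "R-Squared Impact" values in this section can be read.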

In [1]:
// Calculate PFI
var predictor = (ISingleFeaturePredictionTransformer<object>) ((IEnumerable<ITransformer>)model).Last();
var preprocessedTrainData = model.Transform(trainingDataView);


// Get the names of the feature slots.
// NOTE: The column name "Features" needs to match the featureColumnName used in the trainer;
// the name "SlotNames" is always the same regardless of trainer.
VBuffer<ReadOnlyMemory<char>> nameBuffer = default;
preprocessedTrainData.Schema["Features"].Annotations.GetValue("SlotNames", ref nameBuffer);
var featureColumnNames = nameBuffer.DenseValues().ToList();

ImmutableArray<RegressionMetricsStatistics> permutationFeatureImportance =
    mlContext.Regression
    .PermutationFeatureImportance(predictor, preprocessedTrainData, permutationCount: 1, labelColumnName: "fare_amount");

// Order the features by the absolute change in R-Squared, largest first
var featureImportanceMetrics =
    permutationFeatureImportance
    .Select((metric, index) => new { index, metric.RSquared })
    .OrderByDescending(myFeatures => Math.Abs(myFeatures.RSquared.Mean));

var featureNames = new List<string>();
var featurePFI = new List<double>();
foreach (var feature in featureImportanceMetrics)
{
    featureNames.Add($"{featureColumnNames[feature.index],-20}");
    featurePFI.Add(feature.RSquared.Mean);
}
var featureImportance = new DataFrame(new StringDataFrameColumn("Feature", featureNames.ToArray()), new DoubleDataFrameColumn("R-Squared Impact", featurePFI.ToArray()));

featureImportance

| index | Feature | R-Squared Impact |
| --- | --- | --- |
| 0 | rate_code | -0.4211640524380117 |
| 1 | trip_distance | -0.4192952207213478 |
| 2 | trip_time_in_secs | -0.3527750721171867 |
| 3 | payment_type.CRD | -0.0143097325293196 |
| 4 | payment_type.CSH | -0.0100496067726491 |
| 5 | vendor_id.VTS | -0.0031778669780933 |
| 6 | passenger_count | -0.0011441970407396 |
| 7 | payment_type.UNK | -9.156808224808356e-05 |
| 8 | payment_type.NOC | -6.283711706067674e-05 |
| 9 | payment_type.DIS | -1.1652870217204736e-05 |


In [1]:
// Graph the PFI results
var pfiBar = new Bar()
{
    x = featureNames,
    y = featurePFI,
    dy = featurePFI[0]/100
};

var pfiChart = Chart.Plot(pfiBar);
pfiChart.WithXTitle("Feature");
pfiChart.WithYTitle("Contribution (delta R-Squared)");
pfiChart.Width = 600;
pfiChart.Height = 600;
display(pfiChart);

In [1]:
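// Compare the true values to the predicted values
// (predictions are made on the training data here, so the fit may look better than it would on unseen data)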
var testResults = model.Transform(trainingDataView);

var trueValues = testResults.GetColumn<float>("fare_amount");
var predictedValues = testResults.GetColumn<float>("Score");

var predictedVsTrue = new Scattergl()
{
    x = trueValues,
    y = predictedValues,
    mode = "markers",
};

var maximumValue = Math.Max(trueValues.Max(), predictedValues.Max());

var perfectLine = new Scattergl()
{
    x = new[] {0, maximumValue},
    y = new[] {0, maximumValue},
    mode = "lines",
};

var chart = Chart.Plot(new[] {predictedVsTrue, perfectLine });
chart.WithXTitle("True Values");
chart.WithYTitle("Predicted Values");
chart.WithLegend(false);
chart.Width = 600;
chart.Height = 600;
display(chart);

# Task 2 ends here! Please do not continue to Task 3 until told to do so.

## User Study: Task 3

Use the given Interactive Notebook to explore and test a Web API that consumes the trained ML.NET taxi fare model. 

### This Interactive Notebook was generated by ML.NET Tooling.

The code below demonstrates how to:

1. Consume an ML.NET model in a Web API
1. Test a Web API in an Interactive Notebook

In [1]:
#r "nuget:microsoft.dotnet.interactive.aspnetcore,*-*"


In [1]:
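#!aspnet
// Map a test endpoint that responds with a small JSON payload
Endpoints.MapInteractive("/Endpoint", async context =>
{
    await context.Response.WriteAsJsonAsync(new
    {
        Hello = ".NET Interactive!"
    });
});

// Send a GET request to the endpoint; the response is shown as the cell output
await HttpClient.GetAsync("/Endpoint")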