# Regression with Taxi Dataset

This notebook demonstrates how to:

1. Define the model input and output schema
1. Load data from a text file into an IDataView
1. Set up the training pipeline with data transforms
1. Choose an algorithm and append it to the pipeline
1. Train the model
1. Evaluate the model
1. Consume the model

## Install the NuGet packages needed to train an ML.NET model and plot the results

In [1]:

/* Notebook files contain both code snippets and rich text elements.
Use the "run" button in the left margin to execute each code snippet and explore ML.NET. */

#i "nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet5/nuget/v3/index.json" 
#i "nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-tools/nuget/v3/index.json"
#i "nuget:https://mlnetcli.blob.core.windows.net/mlnetcli/index.json"

// Add the nightly build feed for ML.NET
#i "nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/MachineLearning/nuget/v3/index.json"
#r "nuget:Microsoft.DotNet.Interactive.Formatting, 1.0.0-beta.22256.1"
#r "nuget:MLNetAutoML.InteractiveExtension,0.2.0"
#r "nuget:XPlot.Plotly.Interactive,4.0.6"
#r "nuget:Microsoft.ML.AutoML,0.20.0-preview.22226.2"
#r "nuget:Microsoft.Data.Analysis,0.20.0-preview.22226.2"

Loading extensions from `XPlot.Plotly.Interactive.dll`

Configuring PowerShell Kernel for XPlot.Plotly integration.

Installed support for XPlot.Plotly.

Loading extensions from `MLNetAutoML.InteractiveExtension.dll`

Loading extensions from `Microsoft.Data.Analysis.Interactive.dll`

In [1]:

// Import common usings.
using static Microsoft.DotNet.Interactive.Formatting.PocketViewTags;
using static Microsoft.ML.Transforms.OneHotEncodingEstimator;
using Microsoft.Data.Analysis;
using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;
using MLNetAutoML.InteractiveExtension;

In [1]:

// Load the CSV file into a DataFrame
var trainDataPath = Path.Combine(Directory.GetCurrentDirectory(),"data", "taxi-fare.csv");
var df = DataFrame.LoadCsv(trainDataPath);
var mlContext = new MLContext();

// Build the data processing pipeline, then append the AutoML regression trainer
var pipeline = mlContext.Transforms.Categorical.OneHotEncoding(new[] { new InputOutputColumnPair(@"vendor_id", @"vendor_id"), new InputOutputColumnPair(@"payment_type", @"payment_type")},outputKind: OutputKind.Binary)
                 .Append(mlContext.Transforms.ReplaceMissingValues(new[] { new InputOutputColumnPair(@"rate_code", @"rate_code"), new InputOutputColumnPair(@"passenger_count", @"passenger_count"), new InputOutputColumnPair(@"trip_time_in_secs", @"trip_time_in_secs"), new InputOutputColumnPair(@"trip_distance", @"trip_distance") }))
                 .Append(mlContext.Transforms.Concatenate(@"Features", new[] { @"vendor_id", @"payment_type", @"rate_code", @"passenger_count", @"trip_time_in_secs", @"trip_distance" }))
                 .Append(mlContext.Auto().Regression(labelColumnName: "fare_amount"));

// Split the data: roughly 80% for training, with the remaining 20% halved into validation and test sets
var trainTestSplit = mlContext.Data.TrainTestSplit(df, 0.2);
var validateTestSplit = mlContext.Data.TrainTestSplit(trainTestSplit.TestSet, 0.5);

// Configure AutoML
var monitor = new NotebookMonitor();
var experiment = mlContext.Auto().CreateExperiment()
                    .SetPipeline(pipeline)
                    .SetTrainingTimeInSeconds(50)
                    .SetDataset(trainTestSplit.TrainSet, validateTestSplit.TrainSet)
                    .SetEvaluateMetric(RegressionMetric.RSquared, "fare_amount", "Score")
                    .SetMonitor(monitor);

// Configure the visualizer
monitor.SetUpdate(monitor.Display());

// Start Experiment
var result = experiment.Run().Result;

index,Trial,Metric,Pipeline
0,0,0.9095436,Unknown=>FastForestRegression
1,1,0.6717748,Unknown=>SdcaRegression
2,2,0.8743625,Unknown=>LightGbmRegression
3,3,-0.20010807,Unknown=>FastTreeRegression
4,4,0.920631,Unknown=>FastForestRegression
5,5,-2.4914448,Unknown=>LbfgsPoissonRegressionRegression
6,6,0.8163028,Unknown=>SdcaRegression
7,7,0.83085066,Unknown=>LightGbmRegression
8,8,-0.19410412,Unknown=>FastTreeRegression
9,9,0.9095436,Unknown=>FastForestRegression


## Consume the model

In [1]:
// Define sample data
var data  = new DataFrame(new StringDataFrameColumn("vendor_id"), new PrimitiveDataFrameColumn<float>("rate_code"), new PrimitiveDataFrameColumn<float>("passenger_count"), new PrimitiveDataFrameColumn<float>("trip_time_in_secs"), new PrimitiveDataFrameColumn<float>("trip_distance"),new StringDataFrameColumn("payment_type"));
data.Append(new List<KeyValuePair<string,object>>()
{
	new KeyValuePair<string,object>("vendor_id",@"CMT"),
	new KeyValuePair<string,object>("rate_code",1F),
	new KeyValuePair<string,object>("passenger_count",1F),
	new KeyValuePair<string, object> ("trip_time_in_secs",474F),
	new KeyValuePair<string, object> ("trip_distance",1.5F),
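	// NOTE: "payment_type" looks like a placeholder copied from the column name; use a category that occurs in the training data.
	new KeyValuePair<string, object> ("payment_type",@"payment_type")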
},true);
var model = result.Model;

//Use the model to transform the sample data.
var output = model.Transform(data);

// Get the predicted score with the sample data.
var predictedScore = output.GetColumn<float>("Score");
predictedScore

index,value
0,8.222628


## Evaluate the model

In [1]:
// Score the held-out test set and compute the regression metrics
var model = result.Model;
var eval = model.Transform(validateTestSplit.TestSet);
var metric = mlContext.Regression.Evaluate(eval, "fare_amount");
metric


MeanAbsoluteError,MeanSquaredError,RootMeanSquaredError,LossFunction,RSquared
0.9331983797550202,7.705879028995748,2.775946510470933,7.705879033632088,0.9165354839474686
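
For reference, the reported columns correspond to the standard regression metrics. The definitions below are the textbook forms, given as a reading aid rather than as a description of ML.NET internals. The notation is introduced here: $y_i$ is the actual `fare_amount`, $\hat{y}_i$ is the predicted `Score`, $\bar{y}$ is the mean of the actual values, and $n$ is the number of test rows. They assume the default squared loss, which is consistent with `LossFunction` matching `MeanSquaredError` in the output above.

$$
\begin{aligned}
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \\
\mathrm{MSE} &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}} \\
R^2 &= 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
\end{aligned}
$$

As a check against the output above, $\sqrt{7.705879} \approx 2.77595$, which agrees with the reported `RootMeanSquaredError`. Because $R^2$ compares the model's squared error with that of always predicting $\bar{y}$, it drops below zero when a model does worse than that baseline, which is how some of the AutoML trials above end up with negative metric values.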




The code below demonstrates several ways to explore the data and explain your model, including how to compute and display:

1. A histogram of the distribution of `fare_amount`
1. A scatter plot comparing actual values to predicted values
1. The importance of each feature (permutation feature importance)
1. A scatter plot of `fare_amount` against the most important feature

In [1]:
using System;
using System.Collections.Generic;
using System.Collections.Immutable;
using System.IO;
using System.Linq;
using XPlot.Plotly;

## Plot the distribution of fare_amount

In [1]:
// Extract some data into arrays for plotting

int numberOfRows = 5000;

// Column names are determined by the input data
float[] fare_amount = df.GetColumn<float>("fare_amount").Take(numberOfRows).ToArray();

// Distribution of Number of Instances
var histogram = Chart.Plot(new Histogram(){x = fare_amount, autobinx = false, nbinsx = 20});
var layout = new Layout.Layout(){title="fare_amount vs Number of Instances"};
histogram.WithLayout(layout);
histogram.WithXTitle("fare_amount");
histogram.WithYTitle("Number of Instances");

display(histogram);

## Compare actual values to predicted values in a scatter plot

In [1]:
// Number of rows to display in charts.
int numberOfRows = 1000;
// Use the model to make batch predictions on the full dataset (training, validation, and test rows)
var testResults = model.Transform(df);

// Get the actual values from the dataset
var trueValues = testResults.GetColumn<float>("fare_amount").Take(numberOfRows);

// Get the predicted values from the test results
var predictedValues = testResults.GetColumn<float>("Score").Take(numberOfRows);

// Create scatter plot of actual vs predicted values
var predictedVsTrue = new Scattergl()
{
    x = trueValues,
    y = predictedValues,
    mode = "markers",
};

var maximumValue = Math.Max(trueValues.Max(), predictedValues.Max());

var perfectLine = new Scattergl()
{
    x = new[] {0, maximumValue},
    y = new[] {0, maximumValue},
    mode = "lines",
};

var chart = Chart.Plot(new[] {predictedVsTrue, perfectLine });
chart.WithXTitle("Actual Values");
chart.WithYTitle("Predicted Values");
chart.WithLegend(false);
chart.Width = 600;
chart.Height = 600;
display(chart);

## Calculate and graph the Permutation Feature Importance (PFI)

In [1]:
// Calculate PFI
var preprocessedTrainData = model.Transform(df);

ImmutableDictionary<string, RegressionMetricsStatistics> permutationFeatureImportance =
    mlContext.Regression
        .PermutationFeatureImportance(
            model,
            preprocessedTrainData,
            labelColumnName: "fare_amount",
            useFeatureWeightFilter: false,
            numberOfExamplesToUse: null,
            permutationCount: 1);

// Rank features by the absolute mean change in R-squared
var featureImportanceMetrics =
    permutationFeatureImportance
    .Select((kvp) => new { kvp.Key, kvp.Value.RSquared })
    .OrderByDescending(myFeatures => Math.Abs(myFeatures.RSquared.Mean));

// Collect the feature names and their impact values for display and plotting
var featureNames = new List<string>();
var featurePFI = new List<double>();
foreach (var feature in featureImportanceMetrics)
{
    featureNames.Add(feature.Key);
    featurePFI.Add(Math.Abs(feature.RSquared.Mean));
}
var featureImportance = new DataFrame(new StringDataFrameColumn("Feature", featureNames.ToArray()), new DoubleDataFrameColumn("R-Squared Impact", featurePFI.ToArray()));

featureImportance

index,Feature,R-Squared Impact
0,trip_distance,0.6659205600529384
1,rate_code,0.3015361833993564
2,trip_time_in_secs,0.1731902453258378
3,payment_type.Bit0,0.0
4,vendor_id.Bit0,0.0
5,vendor_id.Bit1,0.0
6,payment_type.Bit1,0.0
7,vendor_id.Bit2,0.0
8,passenger_count,0.0
9,payment_type.Bit2,0.0
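
The `R-Squared Impact` column can be read as follows. This is a sketch of the usual permutation feature importance definition, not a statement of ML.NET's exact internals: for feature $j$, the values of that column are shuffled across rows, the model is re-scored, and the change in the metric is recorded. The symbols $R^2_{\text{base}}$ (score on the unshuffled data), $R^2_{\text{perm}(j),k}$ (score after the $k$-th shuffle of feature $j$), and $K$ are notation introduced here, and the sign convention of the difference is an assumption that does not affect the absolute value.

$$
\Delta R^2_j = \frac{1}{K}\sum_{k=1}^{K}\left(R^2_{\text{perm}(j),k} - R^2_{\text{base}}\right),
\qquad
\text{Impact}_j = \left|\Delta R^2_j\right|
$$

Here $K$ corresponds to `permutationCount` (1 in the cell above), and the absolute value mirrors the `Math.Abs(feature.RSquared.Mean)` call. An impact of 0.0, as for `passenger_count`, means that shuffling that feature left $R^2$ unchanged on this data.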


In [1]:
// Graph the PFI results
var pfiBar = new Bar()
{
    x = featureNames,
    y = featurePFI,
    dy = featurePFI[0]/100
};

var pfiChart = Chart.Plot(pfiBar);
pfiChart.WithXTitle("Feature");
pfiChart.WithYTitle("Contribution (delta R-Squared)");
pfiChart.Width = 600;
pfiChart.Height = 600;
display(pfiChart);

In [1]:
// Plot fare_amount against the most important feature.
// This assumes the top-ranked feature is a numeric column in df (here, trip_distance).
var topFeatureName = featureNames.First();
float[] fare_amount = df.GetColumn<float>("fare_amount").Take(numberOfRows).ToArray();
float[] topFeature = df.GetColumn<float>(topFeatureName).Take(numberOfRows).ToArray();

var chartFareVsTopFeature = Chart.Plot(
    new Scatter()
    {
        x = topFeature,
        y = fare_amount,
        mode = "markers",
    }
);

var layout = new Layout.Layout(){title=$"Plot fare_amount depending on {topFeatureName}"};
chartFareVsTopFeature.WithLayout(layout);
chartFareVsTopFeature.Width = 500;
chartFareVsTopFeature.Height = 500;
chartFareVsTopFeature.WithXTitle(topFeatureName);
chartFareVsTopFeature.WithYTitle("fare_amount");
chartFareVsTopFeature.WithLegend(false);

display(chartFareVsTopFeature);