# Install packages

Finally, let's see how we can use .NET Interactive Notebooks together with ML.NET. The goal of this tutorial is to create an ML.NET machine learning model that classifies the Palmer Penguins data: the model predicts the species of a penguin from the rest of the data. First, we need to install the necessary NuGet packages:

In [None]:
#r "nuget:Microsoft.Data.Analysis,0.18.0"
#r "nuget:Microsoft.ML,1.3.1"
#r "nuget:Microsoft.DotNet.Interactive.ExtensionLab,1.0.0-beta.21506.4"

Loading extensions from `Microsoft.Data.Analysis.Interactive.dll`

Loading extensions from `Microsoft.DotNet.Interactive.ExtensionLab.dll`

In [None]:
using System.IO;
using System.Text;

using Microsoft.Data.Analysis;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;

# Visualizing the Data

Load the data and explore it using SandDance:

In [None]:
var data = DataFrame.LoadCsv("data\\penguins.csv");
data.ExploreWithSandDance().Display();

# Data Model

Now we can bring ML.NET into the picture. To load the dataset and use it with ML.NET algorithms, we need classes that model the data. So we create a cell that implements two classes: PalmerPenguinsData and PalmerPenguinsPrediction. These classes model the input and output data. The output is the species of the penguin (the Label column), while the rest of the data is the input.

In [None]:
public class PalmerPenguinsData
{
    [LoadColumn(0)]
    public string Label { get; set; }

    [LoadColumn(1)]
    public string Island { get; set; }

    [LoadColumn(2)]
    public float CulmenLength { get; set; }

    [LoadColumn(3)]
    public float CulmenDepth { get; set; }

    [LoadColumn(4)]
    public float FliperLength { get; set; }

    [LoadColumn(5)]
    public float BodyMass { get; set; }

    [LoadColumn(6)]
    public string Sex { get; set; }
}

public class PalmerPenguinsPrediction
{
    [ColumnName("PredictedLabel")]
    public string PredictedLabel { get; set; }
}

# Load Data

Once data model classes are created, we can use them to load the data. To do so, we create a new cell:

In [None]:
var mlContext = new MLContext();

IDataView trainingDataView = mlContext.Data.LoadFromTextFile<PalmerPenguinsData>(
    "data\\penguins.csv", hasHeader: true, separatorChar: ',');

DataOperationsCatalog.TrainTestData dataSplit =
    mlContext.Data.TrainTestSplit(trainingDataView, testFraction: 0.3);

In this cell we do more than load data: we also initialize ML.NET by creating an MLContext object. The core of ML.NET revolves around two types, the MLContext class and the IDataView interface. MLContext is the entry point to most ML.NET functionality, such as data loading, transforms, and the machine learning algorithms that ML.NET calls trainers. Typically a single MLContext instance is created and shared across the whole workflow.

The dataSplit variable holds the loaded data, split into a training set (TrainSet) and a test set (TestSet), with roughly 30% of the rows held out for testing. We can see what this data looks like by using the following cell:

In [None]:
dataSplit.TestSet.ToTabularDataResource()

Label,Island,CulmenLength,CulmenDepth,FliperLength,BodyMass,Sex,SamplingKeyColumn
Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,0.17168188
Adelie,Torgersen,38.9,17.8,181.0,3625.0,female,0.1854974
Adelie,Torgersen,37.8,17.1,186.0,3300.0,,0.25095105
Adelie,Torgersen,34.4,18.4,184.0,3325.0,female,0.12809682
Adelie,Biscoe,37.8,18.3,174.0,3400.0,female,0.22980833
Adelie,Biscoe,37.9,18.6,172.0,3150.0,female,0.10179412
Adelie,Dream,37.2,18.1,178.0,3900.0,male,0.25451255
Adelie,Dream,39.5,17.8,188.0,3300.0,female,0.2852285
Adelie,Dream,42.2,18.5,180.0,3550.0,female,0.1627289
Adelie,Dream,36.5,18.0,182.0,3150.0,female,0.26413596
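
In the ten rows shown above, every SamplingKeyColumn value is below 0.3, the testFraction we passed in. That is consistent with a split that gives each row a random key in [0, 1) and sends the rows whose key falls below the fraction to the test set. The following is a minimal plain-C# sketch of that idea, not ML.NET's actual implementation; the row indices, the seed, and all variable names are illustrative assumptions.

```csharp
using System;
using System.Linq;

double testFraction = 0.3;
var random = new Random(42); // fixed seed (an assumption) so the split is repeatable

// Pair each hypothetical row index with a random sampling key in [0, 1).
var keyedRows = Enumerable.Range(0, 10)
    .Select(row => (Row: row, Key: random.NextDouble()))
    .ToList();

// Rows whose key is below the fraction go to the test set, the rest to training.
var testRows = keyedRows.Where(r => r.Key < testFraction).ToList();
var trainRows = keyedRows.Where(r => r.Key >= testFraction).ToList();

Console.WriteLine($"Train rows: {trainRows.Count}, test rows: {testRows.Count}");
```

Because membership depends on a random key, the test set is only approximately 30% of the rows, which is why small datasets rarely split into exact proportions.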


# Initialize and Train Machine Learning ML.NET Model

Now for the important and fun bit: we want to initialize and train the model. In fact, we want to create a complete training pipeline that preprocesses the data and trains the model, and then save the trained model. Here is how we do it:

In [None]:
// The trainer: a linear multiclass classifier trained with SDCA.
SdcaNonCalibratedMulticlassTrainer model = mlContext.MulticlassClassification.Trainers
    .SdcaNonCalibrated(labelColumnName: "Label", featureColumnName: "Features");

EstimatorChain<KeyToValueMappingTransformer> pipeline = mlContext.Transforms.Conversion
    // Convert the text label into a numeric key for the trainer.
    .MapValueToKey(inputColumnName: nameof(PalmerPenguinsData.Label), outputColumnName: "Label")
    // Turn the text columns into numeric feature vectors.
    .Append(mlContext.Transforms.Text.FeaturizeText(
        inputColumnName: "Sex", outputColumnName: "SexFeaturized"))
    .Append(mlContext.Transforms.Text.FeaturizeText(
        inputColumnName: "Island", outputColumnName: "IslandFeaturized"))
    // Combine all inputs into a single "Features" vector.
    .Append(mlContext.Transforms.Concatenate("Features",
        "IslandFeaturized",
        nameof(PalmerPenguinsData.CulmenLength),
        nameof(PalmerPenguinsData.CulmenDepth),
        nameof(PalmerPenguinsData.BodyMass),
        nameof(PalmerPenguinsData.FliperLength),
        "SexFeaturized"))
    // Normalize the features, then train.
    .Append(mlContext.Transforms.NormalizeMinMax("Features", "Features"))
    .Append(model)
    // Convert the predicted key back into the original text label.
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

TransformerChain<KeyToValueMappingTransformer> trainedModel = pipeline.Fit(dataSplit.TrainSet);

mlContext.Model.Save(trainedModel, dataSplit.TrainSet.Schema, "penguinModel.mdl");

First, we create a trainer by calling SdcaNonCalibrated, which returns an SdcaNonCalibratedMulticlassTrainer object. This object is the machine learning algorithm we use for this problem: a linear multiclass classifier trained with the Stochastic Dual Coordinate Ascent (SDCA) method. As the name suggests, the scores it produces are not calibrated probabilities. The algorithm scales well because it is a streaming training algorithm, as described in a KDD best paper.

Then we create a training pipeline. This pipeline first preprocesses the data and then applies the trainer created above. Next, we call the Fit method on the pipeline, which starts the training process. Finally, we save the trained model to the penguinModel.mdl file.
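
The first and last pipeline steps, MapValueToKey and MapKeyToValue, convert the text label into a numeric key for the trainer and back into text for the prediction. The sketch below imitates that round trip in plain C# without ML.NET. The sample labels and the numbering scheme (keys start at 1 and are assigned in order of first appearance) are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical label column, not read from the penguin dataset.
string[] labels = { "Adelie", "Gentoo", "Adelie", "Chinstrap", "Gentoo" };

// Value -> key: number the distinct labels in order of first appearance.
var keyByLabel = new Dictionary<string, uint>();
foreach (var label in labels)
{
    if (!keyByLabel.ContainsKey(label))
        keyByLabel[label] = (uint)keyByLabel.Count + 1;
}
uint[] keys = labels.Select(l => keyByLabel[l]).ToArray();

// Key -> value: invert the dictionary to recover the original text.
var labelByKey = keyByLabel.ToDictionary(pair => pair.Value, pair => pair.Key);
string[] roundTrip = keys.Select(k => labelByKey[k]).ToArray();

Console.WriteLine(string.Join(", ", keys));
Console.WriteLine(string.Join(", ", roundTrip));
```

The trainer only ever sees the numeric keys, so without the final key-to-value step the PredictedLabel would be a number rather than a species name.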

# Evaluate the Model

To evaluate the model, we transform the test data with the trained model and pass the result to the Evaluate method:

In [None]:
IDataView testSetTransform = trainedModel.Transform(dataSplit.TestSet);
MulticlassClassificationMetrics metrics = mlContext.
    MulticlassClassification.Evaluate(testSetTransform);

The result is the metrics variable, which contains useful information about the quality of our model. For example, we can display the macro accuracy:

In [None]:
metrics.MacroAccuracy
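
Macro accuracy is the average of the per-class accuracies, so every class counts equally regardless of its size, whereas micro accuracy is simply the share of all samples that were classified correctly. The following plain-C# sketch computes both by hand to show the difference. The (actual, predicted) pairs are made up for illustration and are not output from the penguin model.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical (actual, predicted) pairs: 4 Adelie, 2 Chinstrap, 2 Gentoo.
var results = new List<(string Actual, string Predicted)>
{
    ("Adelie", "Adelie"), ("Adelie", "Adelie"), ("Adelie", "Adelie"), ("Adelie", "Adelie"),
    ("Chinstrap", "Chinstrap"), ("Chinstrap", "Adelie"),
    ("Gentoo", "Gentoo"), ("Gentoo", "Gentoo"),
};

// Micro accuracy: correct predictions divided by all predictions.
double microAccuracy =
    results.Count(r => r.Actual == r.Predicted) / (double)results.Count;

// Macro accuracy: accuracy within each actual class, averaged over the classes.
double macroAccuracy = results
    .GroupBy(r => r.Actual)
    .Select(g => g.Count(r => r.Actual == r.Predicted) / (double)g.Count())
    .Average();

Console.WriteLine($"Micro accuracy: {microAccuracy}");
Console.WriteLine($"Macro accuracy: {macroAccuracy}");
```

Here micro accuracy is 7/8 = 0.875, while macro accuracy is (1 + 0.5 + 1) / 3, about 0.833: the single Chinstrap mistake weighs more heavily in the macro average because Chinstrap is a small class. The ML.NET metrics object also exposes a MicroAccuracy property for comparison.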

# Using the Model for Prediction

Here is how we can load the model saved in the file and use it to run predictions on new samples:

In [None]:
var newSample = new PalmerPenguinsData
{
    Island = "Torgersen",
    CulmenDepth = 18.7f,
    CulmenLength = 39.3f,
    FliperLength = 180,
    BodyMass = 3700,
    Sex = "MALE"
};

using (var stream = new FileStream("penguinModel.mdl", FileMode.Open,
    FileAccess.Read, FileShare.Read))
{
    ITransformer loadedModel = mlContext.Model.Load(stream, out _);
    PredictionEngine<PalmerPenguinsData, PalmerPenguinsPrediction> predictionEngine =
        mlContext.Model.CreatePredictionEngine<PalmerPenguinsData, PalmerPenguinsPrediction>(loadedModel);

    PalmerPenguinsPrediction prediction = predictionEngine.Predict(newSample);

    Console.WriteLine($"Prediction: {prediction.PredictedLabel}");
}

Prediction: Adelie
