In [1]:
//ML.NET
#r "nuget:Microsoft.ML"
#r "nuget:Microsoft.ML.LightGbm"
#r "nuget:Microsoft.ML.DataView"
    

//Install Daany.DataFrame 
//Nuget package installation
#r "nuget:Daany.DataFrame,1.1.0"
#r "nuget:Daany.DataFrame.Ext,1.1.0"
#r "nuget: Daany.Stat,1.1.0"
#r "../bin/Daany.Util.dll"
    
//Custom formatting capabilities
#r "nuget: Microsoft.DotNet.Interactive.Formatting, 1.0.0-beta.21506.4"

//Plot capabilities
#r "nuget: XPlot.Plotly.Interactive,4.0.2"

using System;
using System.Linq;

//Daany data frame
using Daany;
using Daany.Ext;
using Daany.Util;

//Plotting functionalities
using XPlot.Plotly;

//ML.NET using
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers.LightGbm;

//custom display implementation
using static Microsoft.DotNet.Interactive.Formatting.PocketViewTags;
using Microsoft.AspNetCore.Html;
using Microsoft.DotNet.Interactive.Formatting;
using static System.Diagnostics.Debug;
using System.Globalization;

Installed package Microsoft.ML.DataView version 1.6.0

Installed package Microsoft.ML.LightGbm version 1.6.0

Installed package XPlot.Plotly.Interactive version 4.0.2

Installed package Daany.Stat version 1.1.0

Installed package Daany.DataFrame version 1.1.0

Installed package Microsoft.ML version 1.6.0

Installed package Daany.DataFrame.Ext version 1.1.0

Installed package Microsoft.DotNet.Interactive.Formatting version 1.0.0-beta.21506.4

Loading extensions from `XPlot.Plotly.Interactive.dll`

Configuring PowerShell Kernel for XPlot.Plotly integration.

Installed support for XPlot.Plotly.

In [2]:
Formatter.Register<DataFrame>((df, writer) =>
{
    var headers = new List<IHtmlContent>();

    headers.Add(th(i($"({df.Index.Name})")));
    headers.AddRange(df.Columns.Select(c => (IHtmlContent) th(c)));
    
    //renders the rows
    var rows = new List<List<IHtmlContent>>();
    var take = 20;
    
    //render at most `take` rows, one list of cells per row
    for (var i = 0; i < Math.Min(take, df.RowCount()); i++)
    {
        var cells = new List<IHtmlContent>();
        cells.Add(td(df.Index[i]));
        foreach (var obj in df[i])
        {
            cells.Add(td(obj));
        }
        rows.Add(cells);
    }
    
    var t = table(
        thead(
            headers),
        tbody(
            rows.Select(
                r => tr(r))));
    
    writer.Write(t);
}, "text/html");

# Iris flower detection - ML project



# Introduction
[from Wikipedia](https://en.wikipedia.org/wiki/Iris_flower_data_set)

<img src="../img/campus02-lecture-img09.jpg" alt="drawing" height="200"/>

The Iris flower data set or `Fisher's Iris data set` is a multivariate data set introduced by the British statistician and biologist `Ronald Fisher` in his `1936` paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. It is sometimes called `Anderson's Iris data set` because `Edgar Anderson` collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the `Gaspé Peninsula` "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".

## Quick description of the dataset:

- three types of flowers 
    - setosa
    - versicolor
    - virginica
    
<img src="../img/iris-machinelearning.png" alt="drawing" width="600"/>

# How the flowers are measured to generate the data set

- 4 types of measurements in centimeters
    - sepal_width
    - petal_width
    - sepal_length
    - petal_length
  
 <img src="../img/campus02-lecture-img08.jpg" alt="drawing" width="600"/>

# Iris data set

The resulting data set consists of 5 columns and 150 rows:

<img src="../img/campus02-lecture-img11.jpg" alt="drawing" width="600"/>

# Problem statement

Build a machine learning program that can identify the type of an iris flower from the dimensions of its sepals and petals.

<img src="../img/iris_ml_problem.png" alt="drawing" width="600"/>

# Procedure to build the ML program

1. Identify the data set
2. Load the data and perform analysis
3. EDA - perform exploratory data analysis
4. Prepare the data set for machine learning - feature engineering
5. Define the ML algorithm
6. Train the model
7. Evaluate the model
8. Test the model



# Loading the Iris data set into memory

In [5]:
var url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";

var cols = new string[] {"sepal_length","sepal_width", "petal_length", "petal_width", "flower_type"};

var df = DataFrame.FromWeb(url, sep:',',names:cols);

df.Head(15)

(index),sepal_length,sepal_width,petal_length,petal_width,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa
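To make the file format concrete, here is a minimal sketch of how lines of this shape can be parsed with plain C#. It is not how Daany loads the file; the tuple layout and the variable names `lines` and `parsed` are assumptions made for illustration, and the three sample lines are copied from the table above.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Three sample lines in the same comma-separated layout as the table above.
var lines = new[]
{
    "5.1,3.5,1.4,0.2,Iris-setosa",
    "4.9,3.0,1.4,0.2,Iris-setosa",
    "4.7,3.2,1.3,0.2,Iris-setosa",
};

// Parse each line into four numeric measurements and one label.
// InvariantCulture makes sure '.' is treated as the decimal separator on every machine.
var parsed = lines
    .Select(line => line.Split(','))
    .Select(p => (
        SepalLength: float.Parse(p[0], CultureInfo.InvariantCulture),
        SepalWidth: float.Parse(p[1], CultureInfo.InvariantCulture),
        PetalLength: float.Parse(p[2], CultureInfo.InvariantCulture),
        PetalWidth: float.Parse(p[3], CultureInfo.InvariantCulture),
        FlowerType: p[4]))
    .ToList();

Console.WriteLine($"{parsed.Count} rows parsed, first flower type: {parsed[0].FlowerType}");
```

Parsing with an explicit culture matters here because some locales use a decimal comma, which would collide with the column separator.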


## EDA - Exploratory Data Analysis

In this part we analyze the Iris data set and present the data graphically. First, let's see how many flowers of each type were measured:


In [6]:
//plot the number of measured flowers per flower type
//XPlot Histogram reference: http://tpetricek.github.io/XPlot/reference/xplot-plotly-graph-histogram.html

var flowerHistogram = Chart.Plot(new Histogram(){x = df["flower_type"], autobinx = false, nbinsx = 20});
var layout = new Layout.Layout(){title="Distribution of iris flower types"};
flowerHistogram.WithLayout(layout);
flowerHistogram
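The histogram shows, in effect, a count of rows per flower type. The same numbers can be obtained without a chart; the following is a small sketch with LINQ, where the `labels` array is a made-up stand-in for the `flower_type` column rather than the real data.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in for the flower_type column of the data frame.
var labels = new[]
{
    "Iris-setosa", "Iris-setosa",
    "Iris-versicolor",
    "Iris-virginica", "Iris-virginica", "Iris-virginica",
};

// Group equal labels and count the members of each group.
var counts = labels
    .GroupBy(label => label)
    .ToDictionary(g => g.Key, g => g.Count());

foreach (var kv in counts)
    Console.WriteLine($"{kv.Key}: {kv.Value}");
```

A balanced data set would show the same count for every key, which is what the group statistics later in this notebook report for the Iris data (50 rows per type).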

## Feature Engineering


Feature engineering (FE) is an important step in preparing a data set for machine learning. Here we create two new columns in the data set.

The new columns, `SepalArea` and `PetalArea`, are computed for each flower as the product of width and length.

The expressions we are going to use are:


$$
\begin{aligned}
PetalArea &= petal\_width \cdot petal\_length \\
SepalArea &= sepal\_width \cdot sepal\_length
\end{aligned}
$$

As can be seen, the $\LaTeX$ is fully supported in the notebook.

The above formulas are implemented in the following code:

In [7]:
//add two calculated columns to the data frame
df.AddCalculatedColumns(new string[] { "SepalArea", "PetalArea" }, 
        (r, i) =>
        {
            var aRow = new object[2];
            aRow[0] = Convert.ToSingle(r["sepal_width"]) * Convert.ToSingle(r["sepal_length"]);
            aRow[1] = Convert.ToSingle(r["petal_width"]) * Convert.ToSingle(r["petal_length"]);
            return aRow;
        });
//data frame with only the engineered features and the label
var featuredDf = df["SepalArea", "PetalArea", "flower_type"];
df.Head(5)

(index),sepal_length,sepal_width,petal_length,petal_width,flower_type,SepalArea,PetalArea
0,5.1,3.5,1.4,0.2,Iris-setosa,17.85,0.28
1,4.9,3.0,1.4,0.2,Iris-setosa,14.700001,0.28
2,4.7,3.2,1.3,0.2,Iris-setosa,15.04,0.26
3,4.6,3.1,1.5,0.2,Iris-setosa,14.259999,0.3
4,5.0,3.6,1.4,0.2,Iris-setosa,18.0,0.28


As can be seen, two new columns have been added to the dataset. So now we have seven columns, where the last two are generated from the previous columns.
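As a worked check of the two formulas, the sketch below recomputes the areas for the first two rows of the table with plain `float` arithmetic. The tuple field names (`SL`, `SW`, `PL`, `PW`) are shorthand introduced here, not names from the data set. Values such as 14.700001 in the output table are single-precision rounding artifacts of the same multiplication.

```csharp
using System;

// First two rows of the data set: sepal_length, sepal_width, petal_length, petal_width.
var rows = new (float SL, float SW, float PL, float PW)[]
{
    (5.1f, 3.5f, 1.4f, 0.2f),
    (4.9f, 3.0f, 1.4f, 0.2f),
};

// SepalArea = sepal_width * sepal_length, PetalArea = petal_width * petal_length.
var areas = new (float Sepal, float Petal)[rows.Length];
for (var i = 0; i < rows.Length; i++)
{
    areas[i] = (rows[i].SW * rows[i].SL, rows[i].PW * rows[i].PL);
    Console.WriteLine($"row {i}: SepalArea={areas[i].Sepal}, PetalArea={areas[i].Petal}");
}
```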


Now we can visualize the calculated columns to see how the data points cluster in the plane.

In [12]:
// Plot Sepal vs. Petal area with flower type
var chart = Chart.Plot(
                new Scatter[] {
                    new Scatter
                    {
                        x = df.Filter("flower_type","Iris-virginica", FilterOperator.Equal)["SepalArea"],
                        y = df.Filter("flower_type","Iris-virginica", FilterOperator.Equal)["PetalArea"],
                        mode = "markers",name="Iris-virginica",
                        marker = new Marker(){color=1, colorscale = "Jet"}
                    },
                    new Scatter
                    {
                        x = df.Filter("flower_type","Iris-versicolor", FilterOperator.Equal)["SepalArea"],
                        y = df.Filter("flower_type","Iris-versicolor", FilterOperator.Equal)["PetalArea"],
                        mode = "markers",name="Iris-versicolor",
                        marker = new Marker(){color=2, colorscale = "Jet"}
                    },
                    new Scatter
                    {
                        x = df.Filter("flower_type","Iris-setosa", FilterOperator.Equal)["SepalArea"],
                        y = df.Filter("flower_type","Iris-setosa", FilterOperator.Equal)["PetalArea"],
                        mode = "markers",name="Iris-setosa",
                        marker = new Marker(){ color=3, colorscale = "Jet"}
                    },
                }
            );

var layout = new Layout.Layout(){title="Plot Sepal vs. Petal Area & color scale on flower type"};
chart.WithLayout(layout);
chart.WithLegend(true);
chart.WithXTitle("Sepal Area");
chart.WithYTitle("Petal Area");
chart.Width = 800;
chart.Height = 400;
chart

From the graph above, we can see that the flower types form largely separate groups in the plane.

The data frame now has two new columns holding the sepal and petal areas of each flower. To see basic descriptive statistics for each column, we first describe the whole data set, then group the data by `flower_type` and show the description of each group.

In [13]:
display("Total data set");
display(df.Describe());
display("**********************************************");
//see descriptive stats of the final ds
var groups = df.GroupBy("flower_type");
foreach(var g in groups.Group)
{
   var s =  g.Value.Describe(false);
   display(g.Key);
   display(s);
   display("**********************************************");
   
}

Total data set

(index),sepal_length,sepal_width,petal_length,petal_width,SepalArea,PetalArea
Count,150.0,150.0,150.0,150.0,150.0,150.0
Unique,35.0,23.0,43.0,22.0,108.0,101.0
Top,5.0,3.0,1.5,0.2,13.200001,0.28
Freq,10.0,26.0,14.0,28.0,5.0,8.0
Mean,5.843333,3.054,3.758667,1.198667,17.806534,5.793133
Std,0.828066,0.433594,1.76442,0.763161,3.368692,4.713499
Min,4.3,2.0,1.0,0.1,10.0,0.11
25%,5.1,2.8,1.6,0.3,15.645,0.42
Median,5.8,3.0,4.35,1.3,17.66,5.615
75%,6.4,3.3,5.1,1.8,20.325001,9.69


**********************************************

Iris-setosa

(index),sepal_length,sepal_width,petal_length,petal_width,flower_type,SepalArea,PetalArea
Count,50.0,50.0,50.0,50.0,50,50.0,50.0
Unique,15.0,16.0,9.0,6.0,1,38.0,22.0
Top,5.1,3.4,1.5,0.2,Iris-setosa,15.19,0.28
Freq,8.0,9.0,14.0,28.0,50,3.0,8.0
Mean,5.006,3.418,1.464,0.244,<null>,17.208799,0.3628
Std,0.35249,0.381024,0.173511,0.10721,<null>,2.947688,0.183248
Min,4.3,2.3,1.0,0.1,<null>,10.349999,0.11
25%,4.8,3.125,1.4,0.2,<null>,15.04,0.265
Median,5.0,3.4,1.5,0.2,<null>,17.0,0.3
75%,5.2,3.675,1.575,0.3,<null>,19.155,0.42


**********************************************

Iris-versicolor

(index),sepal_length,sepal_width,petal_length,petal_width,flower_type,SepalArea,PetalArea
Count,50.0,50.0,50.0,50.0,50,50.0,50.0
Unique,21.0,14.0,19.0,9.0,1,42.0,36.0
Top,5.5,3.0,4.5,1.3,Iris-versicolor,13.200001,6.75
Freq,5.0,8.0,7.0,13.0,50,3.0,5.0
Mean,5.936,2.77,4.26,1.326,<null>,16.526199,5.7204
Std,0.516171,0.313798,0.469911,0.197753,<null>,2.866882,1.368403
Min,4.9,2.0,3.0,1.0,<null>,10.0,3.3
25%,5.6,2.525,4.0,1.2,<null>,14.347499,4.86
Median,5.9,2.8,4.35,1.3,<null>,16.385,5.615
75%,6.3,3.0,4.6,1.5,<null>,18.495001,6.75


**********************************************

Iris-virginica

(index),sepal_length,sepal_width,petal_length,petal_width,flower_type,SepalArea,PetalArea
Count,50.0,50.0,50.0,50.0,50,50.0,50.0
Unique,21.0,13.0,20.0,12.0,1,44.0,44.0
Top,6.3,3.0,5.1,1.8,Iris-virginica,19.5,9.69
Freq,6.0,12.0,7.0,11.0,50,3.0,2.0
Mean,6.588,2.974,5.552,2.026,<null>,19.684599,11.2962
Std,0.63588,0.322497,0.551895,0.27465,<null>,3.458783,2.157412
Min,4.9,2.2,4.5,1.4,<null>,12.25,7.5
25%,6.225,2.8,5.1,1.8,<null>,17.429999,9.7175
Median,6.5,3.0,5.55,2.0,<null>,20.059998,11.445
75%,6.9,3.175,5.875,2.3,<null>,21.412501,12.79


**********************************************

As can be seen, the largest petal areas belong to virginica (mean 11.30) and the smallest to setosa (mean 0.36). For the sepal area, however, we cannot conclude much because the values of the three types are very close to each other (means 17.21, 16.53 and 19.68).
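The `Mean` and `Std` rows in the tables above can be reproduced by hand. Below is a minimal sketch that groups a few petal areas by flower type and computes the mean and the sample standard deviation (with $n-1$ in the denominator). The ten sample values are taken from the head and tail outputs of this notebook and rounded; the helper name `sampleStd` is an assumption for illustration, not part of Daany.

```csharp
using System;
using System.Linq;

// A few (flower type, petal area) pairs taken from the notebook output.
var samples = new (string Type, double PetalArea)[]
{
    ("Iris-setosa", 0.28), ("Iris-setosa", 0.28), ("Iris-setosa", 0.26),
    ("Iris-setosa", 0.30), ("Iris-setosa", 0.28),
    ("Iris-virginica", 11.96), ("Iris-virginica", 9.50), ("Iris-virginica", 10.40),
    ("Iris-virginica", 12.42), ("Iris-virginica", 9.18),
};

// Sample standard deviation: square root of the sum of squared deviations over (n - 1).
Func<double[], double> sampleStd = xs =>
{
    var mean = xs.Average();
    return Math.Sqrt(xs.Sum(x => (x - mean) * (x - mean)) / (xs.Length - 1));
};

// Group by flower type and compute both statistics per group.
var stats = samples
    .GroupBy(s => s.Type)
    .ToDictionary(
        g => g.Key,
        g => (Mean: g.Average(s => s.PetalArea),
              Std: sampleStd(g.Select(s => s.PetalArea).ToArray())));

foreach (var kv in stats)
    Console.WriteLine($"{kv.Key}: mean={kv.Value.Mean:F4}, std={kv.Value.Std:F4}");
```

Even on this small sample the gap between the two group means is many standard deviations wide, which is why petal area is such a useful feature here.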

# Preparing the data set for machine learning

Once the data transformation and visualization are finished, we can define the final data frame for machine learning. To this end, we select only two feature columns and one label column, the flower type.

In [14]:
//create new data-frame by selecting only three columns
var derivedDF = df["SepalArea","PetalArea","flower_type"];
derivedDF.Head(5)

(index),SepalArea,PetalArea,flower_type
0,17.85,0.28,Iris-setosa
1,14.700001,0.28,Iris-setosa
2,15.04,0.26,Iris-setosa
3,14.259999,0.3,Iris-setosa
4,18.0,0.28,Iris-setosa


Since we are going to use ML.NET, we need to declare an `Iris` class which holds the properties used in the machine learning pipeline.

In [15]:
//Define an Iris class for machine learning.
class Iris
{
    public float PetalArea { get; set; }
    public float SepalArea { get; set; }
    public string Species { get; set; }
}
//Create the ML context with a fixed seed for reproducible results
MLContext mlContext = new MLContext(seed:2019);

In [16]:
//Load the data frame into the ML.NET data pipeline
IDataView dataView = mlContext.Data.LoadFromEnumerable<Iris>(derivedDF.GetEnumerator<Iris>((oRow) =>
{
    //convert row object array into Iris row

    var prRow = new Iris();
    prRow.SepalArea = Convert.ToSingle(oRow["SepalArea"]);
    prRow.PetalArea = Convert.ToSingle(oRow["PetalArea"]);
    prRow.Species = Convert.ToString(oRow["flower_type"]);
    //
    return prRow;
}));

Once we have data, we can split it into `train` and `test` sets:

In [17]:
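//Split the data set in two parts: training set (about 80%) and test set (about 20%).
//The split is approximate: the confusion matrices below count 123 training rows and 27 test rows.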
var trainTestData = mlContext.Data.TrainTestSplit(dataView, testFraction: 0.2);
var trainData = trainTestData.TrainSet;
var testData = trainTestData.TestSet;
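To illustrate the idea of a reproducible random split, here is a sketch of a different, simpler technique: a seeded Fisher-Yates shuffle of the row indices followed by cutting off the first 20%. This is not how ML.NET's `TrainTestSplit` works internally; the variable names and the choice of `System.Random` are assumptions for illustration only.

```csharp
using System;
using System.Linq;

// Row indices of a data set with 150 rows.
var indices = Enumerable.Range(0, 150).ToArray();

// Fisher-Yates shuffle with a fixed seed, so the same split is produced on every run.
var rng = new Random(2019);
for (var i = indices.Length - 1; i > 0; i--)
{
    var j = rng.Next(i + 1);                       // 0 <= j <= i
    (indices[i], indices[j]) = (indices[j], indices[i]);
}

// The first 20% of the shuffled indices form the test set, the rest the training set.
var testCount = (int)Math.Round(indices.Length * 0.2);
var testIdx = indices.Take(testCount).ToArray();
var trainIdx = indices.Skip(testCount).ToArray();

Console.WriteLine($"train rows: {trainIdx.Length}, test rows: {testIdx.Length}");
```

Shuffling before splitting matters for this data set because the rows are ordered by flower type; cutting an unshuffled file would leave entire classes out of one of the two sets.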

In [18]:
//concatenate the two feature columns into "Features" and map the flower type (string) to a key-typed "Label" column
var dataPipeline = mlContext.Transforms
.Concatenate("Features",nameof(Iris.SepalArea), nameof(Iris.PetalArea))
.Append(mlContext.Transforms.Conversion.MapValueToKey("Label", nameof(Iris.Species)));
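Conceptually, mapping values to keys replaces each distinct string with a small integer, and mapping keys back to values inverts that lookup. The sketch below imitates the idea with a dictionary; it assumes keys are assigned in order of first appearance and start at 1, with 0 left free for missing values. It is an illustration of the concept, not ML.NET's implementation, and all names in it are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var species = new[] { "Iris-setosa", "Iris-versicolor", "Iris-setosa", "Iris-virginica" };

// Value -> key: assign the next free key the first time a value is seen.
var keyOf = new Dictionary<string, uint>();
var keys = new List<uint>();
foreach (var s in species)
{
    if (!keyOf.TryGetValue(s, out var k))
    {
        k = (uint)keyOf.Count + 1;   // keys start at 1; 0 stays free for "missing"
        keyOf[s] = k;
    }
    keys.Add(k);
}

// Key -> value: the inverse lookup, used to turn a predicted key back into a name.
var valueOf = keyOf.ToDictionary(kv => kv.Value, kv => kv.Key);
var roundTrip = keys.Select(k => valueOf[k]).ToArray();

Console.WriteLine(string.Join(", ", keys));
```

The trainer works on the integer keys, which is why the pipeline later needs a key-to-value step to report flower names instead of numbers.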





In [19]:
// Define the LightGbm algorithm estimator
IEstimator<ITransformer> lightGbm = mlContext.MulticlassClassification.Trainers.LightGbm();
//append the trainer to the data pipeline and map the predicted key back to the flower type
var finalPipeline = dataPipeline.Append(lightGbm)
.Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

//train the ML model on the training set
var model = finalPipeline.Fit(trainData);

In [20]:
//evaluate train set
var predictions = model.Transform(trainData);
var metricsTrain = mlContext.MulticlassClassification.Evaluate(predictions);
ConsoleHelper.PrintMultiClassClassificationMetrics("TRAIN Iris DataSet", metricsTrain);
ConsoleHelper.ConsoleWriteHeader("Train Iris DataSet Confusion Matrix ");
ConsoleHelper.ConsolePrintConfusionMatrix(metricsTrain.ConfusionMatrix);

************************************************************
*    Metrics for TRAIN Iris DataSet multi-class classification model   
*-----------------------------------------------------------
    AccuracyMacro = 1, a value between 0 and 1, the closer to 1, the better
    AccuracyMicro = 1, a value between 0 and 1, the closer to 1, the better
    LogLoss = 0.0195, the closer to 0, the better
    LogLoss for class 1 = 0.0112, the closer to 0, the better
    LogLoss for class 2 = 0.0293, the closer to 0, the better
    LogLoss for class 3 = 0.019, the closer to 0, the better
************************************************************
 
Train Iris DataSet Confusion Matrix 
####################################
 

Confusion table
PREDICTED          ||     0 |     1 |     2 | Recall
0.     Iris-setosa ||    45 |     0 |     0 | 1.0000
1. Iris-versicolor ||     0 |    40 |     0 | 1.0000
2.  Iris-virginica ||     0 |     0 |    38 | 1.0000
Precision          ||1.0000 |1.0000 |1.0000 |


In [21]:
//evaluate test set
var testPrediction = model.Transform(testData);
var metricsTest = mlContext.MulticlassClassification.Evaluate(testPrediction);

ConsoleHelper.PrintMultiClassClassificationMetrics("TEST Iris Dataset", metricsTest);
ConsoleHelper.ConsoleWriteHeader("Test Iris DataSet Confusion Matrix ");
ConsoleHelper.ConsolePrintConfusionMatrix(metricsTest.ConfusionMatrix);

************************************************************
*    Metrics for TEST Iris Dataset multi-class classification model   
*-----------------------------------------------------------
    AccuracyMacro = 1, a value between 0 and 1, the closer to 1, the better
    AccuracyMicro = 1, a value between 0 and 1, the closer to 1, the better
    LogLoss = 0.0624, the closer to 0, the better
    LogLoss for class 1 = 0.0068, the closer to 0, the better
    LogLoss for class 2 = 0.0759, the closer to 0, the better
    LogLoss for class 3 = 0.0744, the closer to 0, the better
************************************************************
 
Test Iris DataSet Confusion Matrix 
###################################
 

Confusion table
PREDICTED          ||     0 |     1 |     2 | Recall
0.     Iris-setosa ||     5 |     0 |     0 | 1.0000
1. Iris-versicolor ||     0 |    10 |     0 | 1.0000
2.  Iris-virginica ||     0 |     0 |    12 | 1.0000
Precision          ||1.0000 |1.0000 |1.0000 |
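The accuracy, precision and recall figures follow directly from the confusion table. The sketch below recomputes them from the test-set counts printed above, reading rows as actual classes and columns as predicted classes. Micro accuracy is taken as the share of correct predictions over all rows and macro accuracy as the average of the per-class recalls; the variable names are assumptions for illustration.

```csharp
using System;
using System.Linq;

// Test-set confusion matrix from the output above: rows = actual, columns = predicted.
var cm = new[,]
{
    { 5, 0, 0 },   // Iris-setosa
    { 0, 10, 0 },  // Iris-versicolor
    { 0, 0, 12 },  // Iris-virginica
};

var n = cm.GetLength(0);
var recall = new double[n];
var precision = new double[n];
int correct = 0, total = 0;

for (var i = 0; i < n; i++)
{
    int rowSum = 0, colSum = 0;
    for (var j = 0; j < n; j++)
    {
        rowSum += cm[i, j];   // all rows that actually belong to class i
        colSum += cm[j, i];   // all rows that were predicted as class i
    }
    recall[i] = (double)cm[i, i] / rowSum;
    precision[i] = (double)cm[i, i] / colSum;
    correct += cm[i, i];
    total += rowSum;
}

var microAccuracy = (double)correct / total;  // overall share of correct predictions
var macroAccuracy = recall.Average();         // average of the per-class recalls

Console.WriteLine($"rows={total}, micro={microAccuracy}, macro={macroAccuracy}");
```

With only 27 test rows, a perfect score is encouraging but carries little statistical weight; a single misclassified flower would already lower the micro accuracy to about 0.96.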


In [22]:
//predict on the full data view so that the predictions keep the row order of derivedDF
//(train and test predictions concatenated would not follow the original row order)
var allPredictions = model.Transform(dataView);
var p1 = allPredictions.GetColumn<string>("PredictedLabel").Select(x => x.ToString()).ToList();

//add the predictions as a new column to the data frame
var dic = new Dictionary<string, List<object>> { { "PredictedLabel", p1.Select(x=>(object)x).ToList() } };
var dff = derivedDF.AddColumns(dic);
dff.Head()

(index),SepalArea,PetalArea,flower_type,PredictedLabel
0,17.85,0.28,Iris-setosa,Iris-setosa
1,14.700001,0.28,Iris-setosa,Iris-setosa
2,15.04,0.26,Iris-setosa,Iris-setosa
3,14.259999,0.3,Iris-setosa,Iris-setosa
4,18.0,0.28,Iris-setosa,Iris-setosa


In [23]:
dff.Tail()

(index),SepalArea,PetalArea,flower_type,PredictedLabel
145,20.099998,11.959999,Iris-virginica,Iris-virginica
146,15.75,9.5,Iris-virginica,Iris-virginica
147,19.5,10.4,Iris-virginica,Iris-virginica
148,21.08,12.42,Iris-virginica,Iris-virginica
149,17.7,9.179999,Iris-virginica,Iris-virginica


# Testing the Model

Now that the model is trained and evaluated, it is time to test it. We pass values for `SepalArea` and `PetalArea` to the model and let it predict the flower type. The values used below are those of the first row of the data set.

In [25]:
//Output class that receives the predicted flower type
class IrisPrediction
{
    public string PredictedLabel { get; set; }
}

// Create PredictionEngines
var predictionEngine = mlContext.Model.CreatePredictionEngine<Iris, IrisPrediction>(model);

In [26]:
var row = new Iris(){SepalArea=17.85f,PetalArea=0.28f, Species="Iris-setosa" };

var prediction = predictionEngine.Predict(row);
var predictionLabel = prediction.PredictedLabel;

In [27]:
var result = $"True Value'{row.Species}';\n PredictedValue:'{prediction.PredictedLabel}'";
result

True Value'Iris-setosa';
 PredictedValue:'Iris-setosa'