# Predicting Iris flower using ML.NET 

# Introduction
[from Wikipedia](https://en.wikipedia.org/wiki/Iris_flower_data_set)

<img src="img/campus02-lecture-img09.jpg" alt="drawing" height="200"/>

The Iris flower data set or `Fisher's Iris data set` is a multivariate data set introduced by the British statistician and biologist `Ronald Fisher` in his `1936` paper *The use of multiple measurements in taxonomic problems* as an example of linear discriminant analysis. It is sometimes called `Anderson's Iris data set` because `Edgar Anderson` collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the `Gaspé Peninsula` "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".

## Quick description of the dataset:

- three types of flowers 
    - setosa
    - versicolor
    - virginica
    
<img src="img/campus02-lecture-img07.jpg" alt="drawing" width="600"/>

- 4 types of measurements 
    - sepal_width
    - petal_width
    - sepal_length
    - petal_length
  
 <img src="img/campus02-lecture-img08.jpg" alt="drawing" width="600"/>
    

# Iris data set

The resulting data set consists of 5 columns (the four measurements plus the species) and 150 rows:

<img src="img/campus02-lecture-img11.jpg" alt="drawing" width="600"/>

## Exploratory Data Analysis (EDA)

In this part we analyze the Iris data. First, install the required NuGet packages.

In [2]:
//Install Daany packages
#r "nuget:Daany.DataFrame"
#r "nuget:Daany.DataFrame.Ext"
#r "nuget:Daany.Stat"

Installed package Daany.Stat version 0.6.4

Installed package Daany.DataFrame.Ext version 0.6.4

Installed package Daany.DataFrame version 0.6.4

In [3]:
//using statement of Daany package
using Daany;
using Daany.MathStuff;
using Daany.Ext;

//using Microsoft.ML.Data;
using XPlot.Plotly;
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;



//ML.NET using
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers.LightGbm;

In [1]:
#r "nuget:Microsoft.ML"
#r "nuget:Microsoft.ML.LightGbm"
//Install XPlot package
#r "nuget:XPlot.Plotly"

Installed package XPlot.Plotly version 3.0.1

Installed package Microsoft.ML.LightGbm version 1.4.0

Installed package Microsoft.ML version 1.4.0

In [4]:
//declare iris class type
class Iris
{
    public float PetalArea { get; set; }
    public float SepalArea { get; set; }
    public string Species { get; set; }
}
//create the ML context with a fixed seed
MLContext mlContext = new MLContext(seed:2019);

In [5]:
// Temporary DataFrame formatter for this early preview
using Microsoft.AspNetCore.Html;
Formatter<DataFrame>.Register((df, writer) =>
{
    var headers = new List<IHtmlContent>();
    headers.Add(th(i("index")));
    headers.AddRange(df.Columns.Select(c => (IHtmlContent) th(c)));
    
    //renders the rows
    var rows = new List<List<IHtmlContent>>();
    var take = 20;
    
    //render at most `take` rows
    for (var i = 0; i < Math.Min(take, df.RowCount()); i++)
    {
        var cells = new List<IHtmlContent>();
        cells.Add(td(df.Index[i]));
        foreach (var obj in df[i])
        {
            cells.Add(td(obj));
        }
        rows.Add(cells);
    }
    
    var t = table(
        thead(
            headers),
        tbody(
            rows.Select(
                r => tr(r))));
    
    writer.Write(t);
}, "text/html");

In [6]:
//define file path
var orgdataPath = "data/iris.txt";

//read the iris data and create DataFrame object
var df = DataFrame.FromCsv(orgdataPath,sep:'\t');
df.Head(10)

index,sepal_length,sepal_width,petal_length,petal_width,species
0,6.5,3.0,5.5,1.8,virginica
1,5.6,2.7,4.2,1.3,versicolor
2,4.4,2.9,1.4,0.2,setosa
3,4.9,3.1,1.5,0.1,setosa
4,5.9,3.0,4.2,1.5,versicolor
5,6.9,3.1,4.9,1.5,versicolor
6,7.9,3.8,6.4,2.0,virginica
7,5.2,3.5,1.5,0.2,setosa
8,5.6,2.5,3.9,1.1,versicolor
9,4.6,3.1,1.5,0.2,setosa


In [7]:
//add two calculated columns (sepal and petal area) to the data set
df.AddCalculatedColumns(new string[] { "SepalArea", "PetalArea" }, 
        (r, i) =>
        {
            //area is approximated as width * length
            var aRow = new object[2];
            aRow[0] = Convert.ToSingle(r["sepal_width"]) * Convert.ToSingle(r["sepal_length"]);
            aRow[1] = Convert.ToSingle(r["petal_width"]) * Convert.ToSingle(r["petal_length"]);
            return aRow;
        });
df.Head(5)

index,sepal_length,sepal_width,petal_length,petal_width,species,SepalArea,PetalArea
0,6.5,3.0,5.5,1.8,virginica,19.5,9.9
1,5.6,2.7,4.2,1.3,versicolor,15.12,5.4599996
2,4.4,2.9,1.4,0.2,setosa,12.76,0.28
3,4.9,3.1,1.5,0.1,setosa,15.19,0.15
4,5.9,3.0,4.2,1.5,versicolor,17.7,6.2999997
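
The long decimals in the `PetalArea` column, such as 5.4599996, look like an effect of single-precision (`float`) multiplication rather than of the measurements themselves. The following is a minimal sketch in plain C#, where `Area` is a hypothetical helper that is not part of Daany. It repeats the multiplication for row 1 and rounds the result for display only:

```csharp
using System;

// Hypothetical helper (not part of Daany): area approximated as width * length,
// computed in single precision like Convert.ToSingle(...) * Convert.ToSingle(...).
float Area(float width, float length) => width * length;

// Row 1 of the table above: petal_width = 1.3, petal_length = 4.2
float petalArea = Area(1.3f, 4.2f);

// Round only when presenting the value; keep the raw float for modeling.
double rounded = Math.Round(petalArea, 2);

Console.WriteLine($"{petalArea} -> {rounded}");
```

Rounding is purely cosmetic here, so the unrounded values can still be passed to the model unchanged.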


In [8]:
//show descriptive statistics of the final data set
df.Describe(false)

index,sepal_length,sepal_width,petal_length,petal_width,species,SepalArea,PetalArea
Count,150.0,150.0,150.0,150.0,150,150.0,150.0
Unique,35.0,23.0,43.0,22.0,3,108.0,101.0
Top,5.0,3.0,1.4,0.2,virginica,13.200001,0.28
Freq,10.0,26.0,13.0,29.0,50,5.0,8.0
Mean,5.843333,3.04,3.758,1.199333,<null>,17.822866,5.794066
Std,0.828066,0.541933,1.765298,0.762238,<null>,3.361853,4.71239
Min,4.3,2.0,1.0,0.1,<null>,10.0,0.11
25%,5.1,3.0,1.6,0.3,<null>,15.660001,0.42
Median,5.8,3.0,4.35,1.3,<null>,17.66,5.615
75%,6.4,3.0,5.1,1.8,<null>,20.325001,9.69


In [9]:
//create new data-frame by selecting only three columns
var derivedDF = df["SepalArea","PetalArea","species"];
derivedDF

index,SepalArea,PetalArea,species
0,19.5,9.9,virginica
1,15.12,5.4599996,versicolor
2,12.76,0.28,setosa
3,15.19,0.15,setosa
4,17.7,6.2999997,versicolor
5,21.39,7.3500004,versicolor
6,30.02,12.8,virginica
7,18.199999,0.3,setosa
8,14.0,4.29,versicolor
9,14.259999,0.3,setosa


In [10]:
//plot a histogram of the species column to see how many samples each class has
//XPlot Histogram reference: http://tpetricek.github.io/XPlot/reference/xplot-plotly-graph-histogram.html

var speciesHistogram = Chart.Plot(new Graph.Histogram(){x = derivedDF["species"], autobinx = false, nbinsx = 20});
var layout = new Layout.Layout(){title="Distribution of Iris species"};
speciesHistogram.WithLayout(layout);
display(speciesHistogram);

In [11]:
// Plot Sepal vs. Petal area with flower type

var chart = Chart.Plot(
    new Graph.Scatter()
    {
        x = derivedDF["SepalArea"],
        y = derivedDF["PetalArea"],
        mode = "markers",
        marker = new Graph.Marker()
        {
            color = derivedDF["species"].Select(x=>x.ToString()=="virginica"?1:(x.ToString()=="versicolor"?2:3)),
            colorscale = "Jet"
        }
    }
);

var layout = new Layout.Layout(){title="Plot Sepal vs. Petal Area & color scale on Species"};
chart.WithLayout(layout);
chart.WithLegend(true);
chart.WithLabels(new[]{"virginica","versicolor", "setosa"});
chart.WithXTitle("Sepal Area");
chart.WithYTitle("Petal Area");
chart.Width = 700;
chart.Height = 500;

display(chart);

In [12]:
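//NOTE: this re-declares mlContext without a seed and replaces the seeded context (seed: 2019) created earlier,
//so the split and the training below are not guaranteed to be reproducible across runs
var mlContext = new MLContext();
//load the data frame into the ML.NET data pipeline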
IDataView dataView = mlContext.Data.LoadFromEnumerable<Iris>(derivedDF.GetEnumerator<Iris>((oRow) =>
{
    //convert row object array into Iris row

    var prRow = new Iris();
    prRow.SepalArea = Convert.ToSingle(oRow["SepalArea"]);
    prRow.PetalArea = Convert.ToSingle(oRow["PetalArea"]);
    prRow.Species = Convert.ToString(oRow["species"]);
    //
    return prRow;
}));

In [13]:
//Split dataset in two parts: TrainingDataset (80%) and TestDataset (20%)
var trainTestData = mlContext.Data.TrainTestSplit(dataView, testFraction: 0.2);
var trainData = trainTestData.TrainSet;
var testData = trainTestData.TestSet;
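
The confusion matrices further down contain 129 training rows and 21 test rows rather than exactly 120 and 30, which is consistent with rows being assigned to the two sets at random instead of by an exact cut. The sketch below illustrates that idea in plain C#. It is an assumption-laden illustration, not ML.NET's actual implementation, and `RandomSplit` is a hypothetical helper:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: send each row index to the test set with probability
// testFraction, using a seeded generator so the split can be repeated.
(List<int> Train, List<int> Test) RandomSplit(int rowCount, double testFraction, int seed)
{
    var rng = new Random(seed);
    var train = new List<int>();
    var test = new List<int>();
    for (var i = 0; i < rowCount; i++)
    {
        if (rng.NextDouble() < testFraction) test.Add(i);
        else train.Add(i);
    }
    return (train, test);
}

// 150 rows, 20% expected in the test set; the actual count varies with the seed.
var split = RandomSplit(150, 0.2, 2019);
Console.WriteLine($"train = {split.Train.Count}, test = {split.Test.Count}");
```

Both lists keep the row indices in their original order, so the position of a row inside a split generally differs from its position in the full data set.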

In [14]:
//map the output category column (species) to key values, one key per category
IEstimator<ITransformer> dataPipeline =
mlContext.Transforms.Conversion.MapValueToKey(outputColumnName: "Label", inputColumnName: nameof(Iris.Species))

//define features columns
.Append(mlContext.Transforms.Concatenate("Features",nameof(Iris.SepalArea), nameof(Iris.PetalArea)));
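
To make the value-to-key step more concrete, here is a minimal sketch in plain C# that does not call ML.NET. It assumes keys are assigned in order of first appearance and start at 1, which matches the class order in the confusion tables and the `flowerLabels[x-1]` lookup used later. The `keyOf` dictionary is a hypothetical stand-in for the mapping that the transform learns:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Species of the first five rows of the data frame.
var species = new[] { "virginica", "versicolor", "setosa", "setosa", "versicolor" };

// Assign each distinct label a 1-based key in order of first appearance.
var keyOf = new Dictionary<string, uint>();
foreach (var s in species)
{
    if (!keyOf.ContainsKey(s)) keyOf[s] = (uint)(keyOf.Count + 1);
}

// Replace every label by its key.
var keys = species.Select(s => keyOf[s]).ToArray();
Console.WriteLine(string.Join(", ", keys)); // prints 1, 2, 3, 3, 2
```

Translating a key back to its label is then a matter of subtracting 1 and indexing into the ordered list of labels.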

In [15]:
//define the LightGbm algorithm estimator
IEstimator<ITransformer> lightGbm = mlContext.MulticlassClassification.Trainers.LightGbm();
//train the ML model
TransformerChain<ITransformer> model = dataPipeline.Append(lightGbm).Fit(trainData);

In [16]:
//evaluate train set
var predictions = model.Transform(trainData);
var metricsTrain = mlContext.MulticlassClassification.Evaluate(predictions);
ConsoleHelper.PrintMultiClassClassificationMetrics("TRAIN Iris DataSet", metricsTrain);
ConsoleHelper.ConsoleWriteHeader("Train Iris DataSet Confusion Matrix ");
ConsoleHelper.ConsolePrintConfusionMatrix(metricsTrain.ConfusionMatrix);

************************************************************
*    Metrics for TRAIN Iris DataSet multi-class classification model   
*-----------------------------------------------------------
    AccuracyMacro = 1, a value between 0 and 1, the closer to 1, the better
    AccuracyMicro = 1, a value between 0 and 1, the closer to 1, the better
    LogLoss = 0.0208, the closer to 0, the better
    LogLoss for class 1 = 0.0216, the closer to 0, the better
    LogLoss for class 2 = 0.0312, the closer to 0, the better
    LogLoss for class 3 = 0.0089, the closer to 0, the better
************************************************************
 
Train Iris DataSet Confusion Matrix 
####################################
 

Confusion table
PREDICTED     ||     0 |     1 |     2 | Recall
0.  virginica ||    40 |     0 |     0 | 1.0000
1. versicolor ||     0 |    46 |     0 | 1.0000
2.     setosa ||     0 |     0 |    43 | 1.0000
Precision     ||1.0000 |1.0000 |1.0000 |


In [17]:
//evaluate test set
var testPrediction = model.Transform(testData);
var metricsTest = mlContext.MulticlassClassification.Evaluate(testPrediction);
ConsoleHelper.PrintMultiClassClassificationMetrics("TEST Iris Dataset", metricsTest);
ConsoleHelper.ConsoleWriteHeader("Test Iris DataSet Confusion Matrix ");
ConsoleHelper.ConsolePrintConfusionMatrix(metricsTest.ConfusionMatrix);

************************************************************
*    Metrics for TEST Iris Dataset multi-class classification model   
*-----------------------------------------------------------
    AccuracyMacro = 0.9167, a value between 0 and 1, the closer to 1, the better
    AccuracyMicro = 0.9524, a value between 0 and 1, the closer to 1, the better
    LogLoss = 0.1293, the closer to 0, the better
    LogLoss for class 1 = 0.0177, the closer to 0, the better
    LogLoss for class 2 = 0.6281, the closer to 0, the better
    LogLoss for class 3 = 0.0037, the closer to 0, the better
************************************************************
 
Test Iris DataSet Confusion Matrix 
###################################
 

Confusion table
PREDICTED     ||     0 |     1 |     2 | Recall
0.  virginica ||    10 |     0 |     0 | 1.0000
1. versicolor ||     1 |     3 |     0 | 0.7500
2.     setosa ||     0 |     0 |     7 | 1.0000
Precision     ||0.9091 |1.0000 |1.0000 |
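
The reported test metrics can be recomputed by hand from the confusion table. The sketch below, in plain C# without ML.NET, assumes that rows are actual classes and columns are predicted classes, that micro-accuracy is the share of correct predictions over all rows, and that macro-accuracy is the mean of the per-class recalls. Under those assumptions it arrives at the 0.9524 and 0.9167 shown above:

```csharp
using System;
using System.Linq;

// Test-set confusion table: rows = actual class, columns = predicted class.
int[][] cm =
{
    new[] { 10, 0, 0 },  // virginica
    new[] {  1, 3, 0 },  // versicolor
    new[] {  0, 0, 7 },  // setosa
};
int n = cm.Length;

double total = cm.Sum(r => r.Sum());                          // 21 rows
double correct = Enumerable.Range(0, n).Sum(i => cm[i][i]);   // diagonal = 20

// Recall: correct predictions of a class / actual rows of that class.
double[] recall = Enumerable.Range(0, n)
    .Select(i => (double)cm[i][i] / cm[i].Sum()).ToArray();

// Precision: correct predictions of a class / rows predicted as that class.
double[] precision = Enumerable.Range(0, n)
    .Select(j => (double)cm[j][j] / cm.Sum(r => r[j])).ToArray();

double micro = correct / total;     // 20 / 21
double macro = recall.Average();    // (1 + 0.75 + 1) / 3

Console.WriteLine($"micro = {Math.Round(micro, 4)}, macro = {Math.Round(macro, 4)}");
```

The gap between the two values comes from versicolor: it has only four test rows, so a single mistake lowers its recall to 0.75 and pulls the macro average down more than the micro average.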


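As can be seen, the model classifies every training sample correctly and reaches a micro-accuracy of about 95% on the test set, where one versicolor sample is predicted as virginica. Now let's add a new column called `PredictedLabel` to the data frame so that the model predictions sit next to the actual species.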

In [18]:
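//species names used to translate predicted keys back to labels
var flowerLabels = DataFrameExt.GetLabels(predictions.Schema).ToList();
//predicted keys for the training rows and for the test rows
var p1 = predictions.GetColumn<uint>("PredictedLabel").Select(x=>(int)x).ToList();
var p2 = testPrediction.GetColumn<uint>("PredictedLabel").Select(x => (int)x).ToList();
//join train and test predictions
//NOTE: the joined list follows the order of the split (training rows first, then test rows),
//not the row order of derivedDF, so a label only lines up with its row where the split
//did not change the row position; scoring the full data set with
//model.Transform(dataView) would keep the original row order
p1.AddRange(p2);
//convert 1-based keys to species names
var p = p1.Select(x => (object)flowerLabels[x-1]).ToList();
//add the predictions as a new column of the data frame
var dic = new Dictionary<string, List<object>> { { "PredictedLabel", p } };

var dff = derivedDF.AddColumns(dic);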
dff.Head()

index,SepalArea,PetalArea,species,PredictedLabel
0,19.5,9.9,virginica,virginica
1,15.12,5.4599996,versicolor,versicolor
2,12.76,0.28,setosa,setosa
3,15.19,0.15,setosa,setosa
4,17.7,6.2999997,versicolor,versicolor


In [19]:
dff.Tail()

index,SepalArea,PetalArea,species,PredictedLabel
145,22.08,13.11,virginica,setosa
146,18.0,8.64,virginica,versicolor
147,16.2,6.75,versicolor,setosa
148,21.319998,0.15,setosa,setosa
149,14.400001,0.14,setosa,virginica
