# Getting started with C# Jupyter Notebook!!

In this blog post, we are going to explore the main features of the new C# Jupyter Notebook. For those who have used notebooks with other programming languages such as Python or R, this will be an easy task. The notebook concept provides a quick, simple and straightforward way to present a mix of text, code, the results of running that code, and LaTeX-formatted math. This means you have a full-featured platform for writing papers, blog posts, presentation slides, lecture notes, and other educational materials.

The notebook is split into cells, where the user can write code or Markdown text. Once the cell content is complete, it can be confirmed and run with `Ctrl+Enter` or by pressing the run button on the notebook toolbar. The image below shows the notebook toolbar with the run button. The popup combo box shows the cell types the user can choose. For text, `Markdown` should be selected; for writing source code, the cell type should be `Code`.
![run button](img/jupyter_notebook_b2_img01.png)

Before writing code in a C# notebook, the first thing we should do is install NuGet packages or add assembly references, and declare using statements. The following code installs several `nuget packages` and declares several `using statements`. But before writing code, we should add a new cell by pressing the `+` toolbar button.

The first two NuGet packages we need are `Daany.DataFrame` for data exploration and analysis, and `XPlot` for data visualization.

In [2]:
//Install Daany.DataFrame 
#r "nuget:Daany.DataFrame"
//Install XPlot package
#r "nuget:XPlot.Plotly"

Once the NuGet packages are installed successfully, we can start with data exploration. But before that, we declare a few using statements:

In [3]:
using System;
using System.Linq;

//Daany data frame
using Daany;

//Plotting functionalities
using XPlot.Plotly;

We can define classes or methods globally. The following code implements a formatter method for displaying a `Daany.DataFrame` in the output cell.

In [4]:
// Temporary DataFrame formatter for this early preview
using System.Collections.Generic;
using Microsoft.AspNetCore.Html;
Formatter<DataFrame>.Register((df, writer) =>
{
    var headers = new List<IHtmlContent>();
    headers.Add(th(i("index")));
    headers.AddRange(df.Columns.Select(c => (IHtmlContent) th(c)));
    
    //collect the first `take` rows
    var rows = new List<List<IHtmlContent>>();
    var take = 20;
    
    //render each row as an index cell followed by the row values
    for (var i = 0; i < Math.Min(take, df.RowCount()); i++)
    {
        var cells = new List<IHtmlContent>();
        cells.Add(td(df.Index[i]));
        foreach (var obj in df[i])
        {
            cells.Add(td(obj));
        }
        rows.Add(cells);
    }
    
    var t = table(
        thead(
            headers),
        tbody(
            rows.Select(
                r => tr(r))));
    
    writer.Write(t);
}, "text/html");

For this demo we will use the famous `Iris` data set. We will download the file from the internet, load it using `Daany.DataFrame`, and display the first few rows. To do that, we run the following code:

In [5]:
var url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data";
var cols = new string[] {"sepal_length","sepal_width", "petal_length", "petal_width", "flower_type"};
var df = DataFrame.FromWeb(url, sep:',',names:cols);
df.Head(5)

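| index | sepal_length | sepal_width | petal_length | petal_width | flower_type |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |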


As can be seen, the last line of the previous code has no trailing semicolon, which means the value of that expression is displayed in the output cell. Let's move on and add two new columns. The new columns will be the `sepal` and `petal` areas of the flower.
The expressions we are going to use are:


$$
PetalArea = petal\_width \cdot petal\_length;\\
SepalArea = sepal\_width \cdot sepal\_length;
$$

As can be seen, $\LaTeX$ is fully supported in the notebook.

The above formulas are implemented in the following code:

In [6]:
//calculate two new columns into dataset
df.AddCalculatedColumn("SepalArea", (r, i) => Convert.ToSingle(r["sepal_width"]) * Convert.ToSingle(r["sepal_length"]));
df.AddCalculatedColumn("PetalArea", (r, i) => Convert.ToSingle(r["petal_width"]) * Convert.ToSingle(r["petal_length"]));
df.Head(5)

| index | sepal_length | sepal_width | petal_length | petal_width | flower_type | SepalArea | PetalArea |
|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | 17.85 | 0.28 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | 14.700001 | 0.28 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | 15.04 | 0.26 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | 14.259999 | 0.3 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | 18.0 | 0.28 |
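
As a quick check of the formulas, the first row of the table can be worked out by hand from the values shown above:

$$
SepalArea_0 = 3.5 \cdot 5.1 = 17.85, \qquad PetalArea_0 = 0.2 \cdot 1.4 = 0.28
$$

Values such as 14.700001 in the other rows are most likely single-precision rounding, since the code converts every value with `Convert.ToSingle` before multiplying.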


The data frame now has two new columns, which hold the sepal and petal areas of the flower. To see basic descriptive statistics for each column, we call the `Describe` method.

In [7]:
//see descriptive statistics of the final data set
df.Describe(false)

| index | sepal_length | sepal_width | petal_length | petal_width | flower_type | SepalArea | PetalArea |
|---|---|---|---|---|---|---|---|
| Count | 150.0 | 150.0 | 150.0 | 150.0 | 150 | 150.0 | 150.0 |
| Unique | 35.0 | 23.0 | 43.0 | 22.0 | 3 | 108.0 | 101.0 |
| Top | 5.0 | 3.0 | 1.5 | 0.2 | Iris-setosa | 13.200001 | 0.28 |
| Freq | 10.0 | 26.0 | 14.0 | 28.0 | 50 | 5.0 | 8.0 |
| Mean | 5.843333 | 3.054 | 3.758667 | 1.198667 | &lt;null&gt; | 17.806534 | 5.793133 |
| Std | 0.828066 | 0.433594 | 1.76442 | 0.763161 | &lt;null&gt; | 3.368692 | 4.713499 |
| Min | 4.3 | 2.0 | 1.0 | 0.1 | &lt;null&gt; | 10.0 | 0.11 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | &lt;null&gt; | 15.645 | 0.42 |
| Median | 5.8 | 3.0 | 4.35 | 1.3 | &lt;null&gt; | 17.66 | 5.615 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | &lt;null&gt; | 20.325001 | 9.69 |


From the table above, we can see that the `flower_type` column has only 3 unique values. The most frequent value has a frequency of 50; since the data set has 150 rows and 3 classes, every class must occur exactly 50 times, which indicates a balanced data set.
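
The same balance check can also be done directly by counting rows per class. The following is a minimal sketch in plain C# with LINQ; the `labels` array is a hypothetical stand-in for the values of the `flower_type` column, and none of the names come from the Daany API.

```csharp
using System;
using System.Linq;

// Hypothetical sample of labels; in the notebook these would come from the flower_type column.
string[] labels =
{
    "Iris-setosa", "Iris-versicolor", "Iris-virginica",
    "Iris-setosa", "Iris-versicolor", "Iris-virginica"
};

// Count how many rows fall into each class.
var counts = labels
    .GroupBy(label => label)
    .ToDictionary(group => group.Key, group => group.Count());

// The data set is balanced when every class has the same number of rows.
bool isBalanced = counts.Values.Distinct().Count() == 1;

foreach (var pair in counts)
{
    Console.WriteLine($"{pair.Key}: {pair.Value}");
}
Console.WriteLine($"Balanced: {isBalanced}");
```

With the real column, a balanced Iris data set would give 50 rows for each of the three classes.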



# Data visualization

The most powerful feature of the notebook is data visualization. In this section we are going to plot some interesting charts.

To see how the samples are distributed across the flower types, the following histogram is plotted:

In [9]:
//plot a histogram of the flower types to see how the samples are distributed across classes
//XPlot Histogram reference: http://tpetricek.github.io/XPlot/reference/xplot-plotly-graph-histogram.html

var irisHistogram = Chart.Plot(new Graph.Histogram(){x = df["flower_type"], autobinx = false, nbinsx = 20});
var layout = new Layout.Layout(){title="Distribution of iris flower"};
irisHistogram.WithLayout(layout);
display(irisHistogram);

The chart also indicates a balanced data set.

Now let's plot the areas by flower type:

In [10]:
// Plot Sepal vs. Petal area with flower type

var chart = Chart.Plot(
    new Graph.Scatter()
    {
        x = df["SepalArea"],
        y = df["PetalArea"],
        mode = "markers",
        marker = new Graph.Marker()
        {
            color = df["flower_type"].Select(x=>x.ToString()=="Iris-virginica"?1:(x.ToString()=="Iris-versicolor"?2:3)),
            colorscale = "Jet"
        }
    }
);

var layout = new Layout.Layout(){title="Plot Sepal vs. Petal Area & color scale on flower type"};
chart.WithLayout(layout);
chart.WithLegend(true);
chart.WithLabels(new string[3]{"Iris-virginica","Iris-versicolor", "Iris-setosa"});
chart.WithXTitle("Sepal Area");
chart.WithYTitle("Petal Area");
chart.Width = 800;
chart.Height = 400;

display(chart);

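As can be seen from the chart above, the flower types are almost linearly separable once we use the petal and sepal areas instead of the widths and lengths. This transformation makes the classes easy to tell apart, and the model trained in the next section reaches 100% accuracy on both the training and the test split.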

# Machine Learning 

Once we have finished with data transformation and visualization, we can define the final data frame before applying machine learning. To this end, we are going to select only two feature columns and one label column, which will be the flower type.

In [12]:
//create new data-frame by selecting only three columns
var derivedDF = df["SepalArea","PetalArea","flower_type"];
derivedDF.Head(5)

| index | SepalArea | PetalArea | flower_type |
|---|---|---|---|
| 0 | 17.85 | 0.28 | Iris-setosa |
| 1 | 14.700001 | 0.28 | Iris-setosa |
| 2 | 15.04 | 0.26 | Iris-setosa |
| 3 | 14.259999 | 0.3 | Iris-setosa |
| 4 | 18.0 | 0.28 | Iris-setosa |


Since we are going to use ML.NET, we need the ML.NET NuGet package. Let's install the latest ML.NET NuGet packages for the core component and the LightGBM algorithm.

In [13]:
#r "nuget:Microsoft.ML"
#r "nuget:Microsoft.ML.LightGbm"

In [15]:
//ML.NET using
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers.LightGbm;

In [16]:
//Define an Iris class for machine learning.
class Iris
{
    public float PetalArea { get; set; }
    public float SepalArea { get; set; }
    public string Species { get; set; }
}
//Create ML context
MLContext mlContext = new MLContext(seed:2019);

In [21]:
//Load data frame into the ML.NET data pipeline
IDataView dataView = mlContext.Data.LoadFromEnumerable<Iris>(derivedDF.GetEnumerator<Iris>((oRow) =>
{
    //convert row object array into Iris row

    var prRow = new Iris();
    prRow.SepalArea = Convert.ToSingle(oRow["SepalArea"]);
    prRow.PetalArea = Convert.ToSingle(oRow["PetalArea"]);
    prRow.Species = Convert.ToString(oRow["flower_type"]);
    //
    return prRow;
}));

In [22]:
//Split dataset in two parts: TrainingDataset (about 80%) and TestDataset (about 20%)
var trainTestData = mlContext.Data.TrainTestSplit(dataView, testFraction: 0.2);
var trainData = trainTestData.TrainSet;
var testData = trainTestData.TestSet;

In [23]:
//encode the output category column by mapping each category value to a key
IEstimator<ITransformer> dataPipeline =
mlContext.Transforms.Conversion.MapValueToKey(outputColumnName: "Label", inputColumnName: nameof(Iris.Species))

//define features columns
.Append(mlContext.Transforms.Concatenate("Features",nameof(Iris.SepalArea), nameof(Iris.PetalArea)));
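
To make the value-to-key step more concrete, here is a minimal conceptual sketch in plain C#. It is not ML.NET's implementation: the `species` array and the `keyOf` dictionary are hypothetical names, and numbering the keys from 1 in order of first appearance is an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical label values, in the order they appear in the data.
string[] species =
{
    "Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-versicolor"
};

// Assign each distinct label the next free key, starting at 1
// (an assumption of this sketch; key 0 is left unused here).
var keyOf = new Dictionary<string, uint>();
foreach (var name in species)
{
    if (!keyOf.ContainsKey(name))
    {
        keyOf[name] = (uint)keyOf.Count + 1u;
    }
}

// Replace every label by its numeric key.
uint[] encoded = Array.ConvertAll(species, name => keyOf[name]);

Console.WriteLine(string.Join(", ", encoded)); // prints "1, 1, 2, 3, 2"
```

The trainer then works with these numeric keys instead of the original strings, which is why the label column has to be converted before training.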

In [24]:
%%time
 // Define LightGbm algorithm estimator
IEstimator<ITransformer> lightGbm = mlContext.MulticlassClassification.Trainers.LightGbm();
//train the ML model
TransformerChain<ITransformer> model = dataPipeline.Append(lightGbm).Fit(trainData);

Unhandled Exception: Value cannot be null. (Parameter 'libraryPath')

Wall time: 60.681200000000004ms

In order to print the results with formatting, we are going to install the Daany DataFrame extension, which includes helpers for printing results.

In [59]:
//Install Daany.DataFrame.Ext 
#r "nuget:Daany.DataFrame.Ext"
using Daany.Ext;

In [66]:
//evaluate train set
var predictions = model.Transform(trainData);
var metricsTrain = mlContext.MulticlassClassification.Evaluate(predictions);
ConsoleHelper.PrintMultiClassClassificationMetrics("TRAIN Iris DataSet", metricsTrain);
ConsoleHelper.ConsoleWriteHeader("Train Iris DataSet Confusion Matrix ");
ConsoleHelper.ConsolePrintConfusionMatrix(metricsTrain.ConfusionMatrix);

************************************************************
*    Metrics for TRAIN Iris DataSet multi-class classification model   
*-----------------------------------------------------------
    AccuracyMacro = 1, a value between 0 and 1, the closer to 1, the better
    AccuracyMicro = 1, a value between 0 and 1, the closer to 1, the better
    LogLoss = 0.0166, the closer to 0, the better
    LogLoss for class 1 = 0.0074, the closer to 0, the better
    LogLoss for class 2 = 0.0225, the closer to 0, the better
    LogLoss for class 3 = 0.0196, the closer to 0, the better
************************************************************
 
Train Iris DataSet Confusion Matrix 
####################################
 

Confusion table
PREDICTED          ||     0 |     1 |     2 | Recall
0.     Iris-setosa ||    42 |     0 |     0 | 1.0000
1. Iris-versicolor ||     0 |    43 |     0 | 1.0000
2.  Iris-virginica ||     0 |     0 |    44 | 1.0000
Precision          ||1.0000 |1.0000 |1.0000 |
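
For reference, the log-loss reported above is commonly defined as the average negative natural logarithm of the probability the model assigns to the true class. The symbols below ($n$ for the number of samples, $p_i$ for the predicted probability of the true class of sample $i$) are introduced here for illustration and are assumed to match what the metric printer reports:

$$
LogLoss = -\frac{1}{n}\sum_{i=1}^{n} \ln p_i
$$

Under this definition, a log-loss of 0.0166 corresponds to a geometric-mean probability of about $e^{-0.0166} \approx 0.984$ for the true class.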


In [67]:
//evaluate test set
var testPrediction = model.Transform(testData);
var metricsTest = mlContext.MulticlassClassification.Evaluate(testPrediction);
ConsoleHelper.PrintMultiClassClassificationMetrics("TEST Iris Dataset", metricsTest);
ConsoleHelper.ConsoleWriteHeader("Test Iris DataSet Confusion Matrix ");
ConsoleHelper.ConsolePrintConfusionMatrix(metricsTest.ConfusionMatrix);

************************************************************
*    Metrics for TEST Iris Dataset multi-class classification model   
*-----------------------------------------------------------
    AccuracyMacro = 1, a value between 0 and 1, the closer to 1, the better
    AccuracyMicro = 1, a value between 0 and 1, the closer to 1, the better
    LogLoss = 0.0102, the closer to 0, the better
    LogLoss for class 1 = 0.0083, the closer to 0, the better
    LogLoss for class 2 = 0.0062, the closer to 0, the better
    LogLoss for class 3 = 0.0172, the closer to 0, the better
************************************************************
 
Test Iris DataSet Confusion Matrix 
###################################
 

Confusion table
PREDICTED          ||     0 |     1 |     2 | Recall
0.     Iris-setosa ||     8 |     0 |     0 | 1.0000
1. Iris-versicolor ||     0 |     7 |     0 | 1.0000
2.  Iris-virginica ||     0 |     0 |     6 | 1.0000
Precision          ||1.0000 |1.0000 |1.0000 |


As can be seen, the model classifies all 129 training samples and all 21 test samples correctly, so we have a 100% accurate model for Iris flower recognition on this split.
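
The accuracy figures can be recomputed by hand from the test confusion table. The sketch below uses the usual definitions, micro accuracy as correct predictions over all samples and macro accuracy as the average per-class recall. These are assumed to match the `AccuracyMicro` and `AccuracyMacro` values printed above, and the variable names are illustrative only.

```csharp
using System;
using System.Linq;

// Test-set confusion table from the output above:
// rows are actual classes, columns are predicted classes.
int[][] confusion =
{
    new[] { 8, 0, 0 },   // Iris-setosa
    new[] { 0, 7, 0 },   // Iris-versicolor
    new[] { 0, 0, 6 },   // Iris-virginica
};

int total = confusion.Sum(row => row.Sum());
int correct = Enumerable.Range(0, confusion.Length).Sum(i => confusion[i][i]);

// Micro accuracy: correctly classified samples over all samples.
double microAccuracy = (double)correct / total;

// Macro accuracy: per-class recall averaged over the classes.
double macroAccuracy = Enumerable.Range(0, confusion.Length)
    .Average(i => (double)confusion[i][i] / confusion[i].Sum());

Console.WriteLine($"Samples: {total}, micro: {microAccuracy}, macro: {macroAccuracy}");
```

Because every off-diagonal entry is zero, both values come out as 1 for the 21 test samples.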