Note: The .NET Jupyter notebook for this blog post can be found [here](). 

# Your first data analysis with .NET Jupyter Notebook and Daany.DataFrame

In [None]:
//Nuget package installation
#r "nuget:Daany.DataFrame,1.1.0"
#r "nuget:Daany.DataFrame.Ext,1.1.0"
#r "nuget: Daany.Stat,1.1.0"
//Plot capabilities
#r "nuget: XPlot.Plotly.Interactive,4.0.2"

Loading extensions from `XPlot.Plotly.Interactive.dll`

Configuring PowerShell Kernel for XPlot.Plotly integration.

Installed support for XPlot.Plotly.

In [None]:
//using statements for the Daany packages
using System;
using System.Collections.Generic;
using System.Linq;
using Daany;
using Daany.MathStuff;
using Daany.Ext;

//Plot support
using XPlot.Plotly;
//custom display implementation
using static Microsoft.DotNet.Interactive.Formatting.PocketViewTags;
using Microsoft.AspNetCore.Html;
using Microsoft.DotNet.Interactive.Formatting;
using static System.Diagnostics.Debug;
using System.Globalization;

Formatter.Register<DataFrame>((df, writer) =>
{
    var headers = new List<IHtmlContent>();

    headers.Add(th(i($"({df.Index.Name})")));
    headers.AddRange(df.Columns.Select(c => (IHtmlContent) th(c)));
    
    //collect the rows to render
    var rows = new List<List<IHtmlContent>>();
    var take = 20;
    
    //render at most `take` rows
    for (var i = 0; i < Math.Min(take, df.RowCount()); i++)
    {
        var cells = new List<IHtmlContent>();
        cells.Add(td(df.Index[i]));
        foreach (var obj in df[i])
        {
            cells.Add(td(obj));
        }
        rows.Add(cells);
    }
    
    var t = table(
        thead(
            headers),
        tbody(
            rows.Select(
                r => tr(r))));
    
    writer.Write(t);
}, "text/html");


# The Structure of ```Daany.DataFrame```

The main part of the ```Daany``` project is ```Daany.DataFrame```, a C# implementation of a data frame. A data frame is a software component for handling tabular data, especially for data preparation, feature engineering and analysis during the development of machine learning models. The design of `Daany.DataFrame` is based on simplicity and .NET coding standards. It represents tabular data consisting of columns and rows. Each column has a name and a type, and each row has an index and a label.

By convention, rows run along axis `zero`, while columns run along axis `one`.

The following image shows the data frame structure:

![data frame structure](https://bhrnjica.files.wordpress.com/2019/12/daany_data_frame_structure.png)

The basic components of the data frame are:

-   ```header``` - list of column names,
-   ```index``` - list of objects representing each row,
-   ```data``` - list of values in the data frame,
-   ```missing value``` - data with no value in the data frame.

The image above shows the data frame components visually, and how they are
positioned in the data frame.
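
To make the four components concrete, here is a minimal sketch in plain C#. It is not Daany's internal implementation: the variable names, the flat row-by-row storage and the use of `null` for a missing value are assumptions made only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the four components; not Daany's own types.
var header = new List<string> { "ID", "City", "Values" };   // column names
var index  = new List<object> { 0, 1, 2 };                  // one label per row
var data   = new List<object>                               // values stored row by row
{
    1, "Sarajevo", 3.14,
    2, "Seattle",  null,    // null stands in for a missing value (assumption)
    3, "Berlin",   4.55
};

// Look up a cell by row position and column name in the flat list.
object Cell(int row, string column)
{
    int col = header.IndexOf(column);
    if (col < 0) throw new ArgumentException($"Unknown column '{column}'.");
    return data[row * header.Count + col];
}

Console.WriteLine(Cell(0, "City"));             // Sarajevo
Console.WriteLine(Cell(1, "Values") == null);   // True
```

With this layout, the cell at row `r` and column `c` sits at position `r * header.Count + c`, which is why a single list of values is enough to hold the whole table.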

# Create a Data Frame from a Text-Based File

The data we use are stored in files, and they must be loaded into application memory in order to be analyzed and transformed. Loading data from files with `Daany.DataFrame` is as easy as calling one method.

By using the static method ```DataFrame.FromCsv```, a user can create a data frame object
from a ``csv`` file. Conversely, a data frame can be persisted to disk by calling
the static method ```DataFrame.ToCsv```.

The following code uses the static methods ```ToCsv``` and ```FromCsv``` to persist a data frame to disk and load it back:

In [None]:
string filename = "df_file.txt";
//define a dictionary of data
var dict = new Dictionary<string, List<object>>
{
    { "ID",new List<object>() { 1,2,3} },
    { "City",new List<object>() { "Sarajevo", "Seattle", "Berlin" } },
    { "Zip Code",new List<object>() { 71000,98101,10115 } },
    { "State",new List<object>() {"BiH","USA","GER" } },
    { "IsHome",new List<object>() { true, false, false} },
    { "Values",new List<object>() { 3.14, 3.21, 4.55 } },
    { "Date",new List<object>() { DateTime.Now.AddDays(-20) , DateTime.Now.AddDays(-10) , DateTime.Now.AddDays(-5) } },

};

//create data frame with 3 rows and 7 columns
var df = new DataFrame(dict);

//first save the data frame to disk
DataFrame.ToCsv(filename, df);

//then load it back into a new data frame with 3 rows and 7 columns
var dfFromFile = DataFrame.FromCsv(filename, sep:',');

//show dataframe
dfFromFile

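| (index) | ID | City | Zip Code | State | IsHome | Values | Date |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Sarajevo | 71000 | BiH | True | 3.14 | 2021-06-14 22:19:59Z |
| 1 | 2 | Seattle | 98101 | USA | False | 3.21 | 2021-06-24 22:19:59Z |
| 2 | 3 | Berlin | 10115 | GER | False | 4.55 | 2021-06-29 22:19:59Z |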


When performance is important, you should pass the column types to the `FromCsv` method, which can cut loading time by up to 50%.
For example, the following code loads the data from the file by passing predefined column types:

In [None]:
//define the type of each column, in column order
var colTypes1 = new ColType[] { ColType.I32, ColType.IN, ColType.I32, ColType.STR, ColType.I2, ColType.F32, ColType.DT };
//create data frame with 3 rows and 7 columns using the predefined column types
var dfFromFile = DataFrame.FromCsv(filename, sep: ',', colTypes: colTypes1);
//show dataframe
dfFromFile

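| (index) | ID | City | Zip Code | State | IsHome | Values | Date |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Sarajevo | 71000 | BiH | True | 3.14 | 2021-06-14 22:19:59Z |
| 1 | 2 | Seattle | 98101 | USA | False | 3.21 | 2021-06-24 22:19:59Z |
| 2 | 3 | Berlin | 10115 | GER | False | 4.55 | 2021-06-29 22:19:59Z |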


# Loading Real Data from the Web

Data can be loaded directly from web storage by using the static `FromWeb` method. The `Concrete Slump Test` data set includes 103 data points, with 7 input variables and 3 output variables: `Cement`, `Slag`, `Fly ash`, `Water`, `SP`, `Coarse Aggr.`, `Fine Aggr.`, `SLUMP (cm)`, `FLOW (cm)`, `Strength (Mpa)`.
The following code loads the `Concrete Slump Test` data set into a Daany data frame:

In [None]:
//define web url where the data is stored
var url = "https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/slump/slump_test.data";
//load the data set into a data frame and show the first five rows
var df = DataFrame.FromWeb(url);
df.Head(5)

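| (index) | No | Cement | Slag | Fly ash | Water | SP | Coarse Aggr. | Fine Aggr. | SLUMP(cm) | FLOW(cm) | Compressive Strength (28-day)(Mpa) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 273 | 82 | 105 | 210 | 9 | 904 | 680 | 23 | 62.0 | 34.99 |
| 1 | 2 | 163 | 149 | 191 | 180 | 12 | 843 | 746 | 0 | 20.0 | 41.14 |
| 2 | 3 | 162 | 148 | 191 | 179 | 16 | 840 | 743 | 1 | 20.0 | 41.81 |
| 3 | 4 | 162 | 148 | 190 | 179 | 19 | 838 | 741 | 3 | 21.5 | 42.08 |
| 4 | 5 | 154 | 112 | 144 | 220 | 10 | 923 | 658 | 20 | 64.0 | 26.82 |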


Once we have the data in application memory, we can perform some statistical calculations. First, let's look at the structure of the data by calling the `Describe` method:

In [None]:
df.Describe(false)

| (index) | No | Cement | Slag | Fly ash | Water | SP | Coarse Aggr. | Fine Aggr. | SLUMP(cm) | FLOW(cm) | Compressive Strength (28-day)(Mpa) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Count | 103.0 | 103.0 | 103.0 | 103.0 | 103.0 | 103.0 | 103.0 | 103.0 | 103.0 | 103.0 | 103.0 |
| Unique | 103.0 | 80.0 | 63.0 | 58.0 | 70.0 | 32.0 | 92.0 | 90.0 | 39.0 | 51.0 | 83.0 |
| Top | 1.0 | 159.0 | 0.0 | 0.0 | 210.0 | 6.0 | 904.0 | 757.0 | 0.0 | 20.0 | 34.990002 |
| Freq | 1.0 | 4.0 | 26.0 | 20.0 | 3.0 | 24.0 | 3.0 | 4.0 | 11.0 | 17.0 | 2.0 |
| Mean | 52.0 | 229.864078 | 77.951456 | 149.029126 | 197.145631 | 8.543689 | 883.990291 | 739.582524 | 18.058252 | 49.582524 | 36.039417 |
| Std | 29.877528 | 78.912591 | 60.461846 | 85.432631 | 20.2254 | 2.810264 | 88.417736 | 63.346158 | 8.791512 | 17.547428 | 7.838232 |
| Min | 1.0 | 137.0 | 0.0 | 0.0 | 160.0 | 4.0 | 708.0 | 641.0 | 0.0 | 20.0 | 17.190001 |
| 25% | 26.5 | 152.0 | 0.0 | 115.5 | 180.0 | 6.0 | 819.5 | 684.5 | 14.0 | 38.5 | 30.9 |
| Median | 52.0 | 248.0 | 100.0 | 164.0 | 196.0 | 8.0 | 879.0 | 743.0 | 22.0 | 54.0 | 35.52 |
| 75% | 77.5 | 304.0 | 125.0 | 236.0 | 209.5 | 10.0 | 953.0 | 788.0 | 24.0 | 64.0 | 41.205 |


We can now see that the data frame has `103` rows and that all columns are numerical. The `Freq` row indicates that most values are not heavily repeated. Judging by the minimum and maximum values, the data appear to have no outliers, and the distributions of the values look roughly normal.
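
An impression of "no outliers" can also be checked numerically. The sketch below applies Tukey's interquartile-range rule in plain C#. It is not part of Daany: the helper names `Percentile` and `Outliers` are hypothetical, the percentile uses linear interpolation between sorted values, the input is assumed to be non-empty, and the sample values are made up rather than taken from the slump data set.

```csharp
using System;
using System.Linq;

// Percentile of an already sorted, non-empty array,
// with linear interpolation between neighbouring values.
double Percentile(double[] sorted, double p)
{
    double pos = (sorted.Length - 1) * p;
    int lo = (int)Math.Floor(pos);
    int hi = (int)Math.Ceiling(pos);
    return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (Tukey's rule).
double[] Outliers(double[] values)
{
    var sorted = values.OrderBy(v => v).ToArray();
    double q1 = Percentile(sorted, 0.25);
    double q3 = Percentile(sorted, 0.75);
    double iqr = q3 - q1;
    return sorted.Where(v => v < q1 - 1.5 * iqr || v > q3 + 1.5 * iqr).ToArray();
}

// Made-up sample for illustration only.
var sample = new double[] { 20, 21, 22, 23, 24, 25, 60 };
Console.WriteLine(string.Join(", ", Outliers(sample)));   // 60
```

To apply the same check to a column of the data frame, its values would first be converted to a `double[]`, for example with `Select(x => Convert.ToDouble(x)).ToArray()`.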

# Data Visualization

Let's visualize the data to see what it looks like. First, let's look at the `Slump` distribution with respect to `SP` and `Fly ash`:

In [None]:
var chart = Chart.Plot(
    new Scatter()
    {
        x = df["SP"],
        y = df["Fly ash"],
        mode = "markers",
        marker = new Marker()
        {
            color = df["SLUMP(cm)"].Select(x=>x),
            colorscale = "Jet"
        }
    }
);

var layout = new Layout.Layout(){title="Slump vs. SP and Fly ash"};
chart.WithLayout(layout);
chart.WithXTitle("SP");
chart.WithYTitle("Fly ash");
chart


Now let's look at the correlation between `Slump` and `Flow`:

In [None]:
var chart = Chart.Plot(
    new Scatter()
    {
        x = df["SLUMP(cm)"],
        y = df["FLOW(cm)"],
        mode = "markers",
    }
);

var layout = new Layout.Layout(){title="Slump vs. Flow"};
chart.WithLayout(layout);
chart.WithLegend(true);
chart.WithXTitle("Slump");
chart.WithYTitle("Flow");
chart

The chart shows a positive relation between the two columns: as `Slump` grows, `Flow` grows as well. To measure the strength of this relation, we can use the following code:

In [None]:
var x1= df["SLUMP(cm)"].Select(x=>Convert.ToDouble(x)).ToArray();
var x2= df["FLOW(cm)"].Select(x=>Convert.ToDouble(x)).ToArray();

//calculate the Pearson correlation coefficient
var r=x1.R(x2);
r

As can be seen, the Pearson coefficient is very high.
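
For readers curious about what such a coefficient measures, the following sketch computes Pearson's r by hand in plain C#: the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. It is an illustration under stated assumptions, not the Daany implementation behind `R`. The function name `Pearson` is hypothetical and the input values are made up.

```csharp
using System;
using System.Linq;

// Pearson correlation coefficient of two equally long samples.
// Returns NaN when either sample is constant, since the denominator is zero.
double Pearson(double[] x, double[] y)
{
    if (x.Length != y.Length || x.Length < 2)
        throw new ArgumentException("Both samples need the same length (at least 2).");

    double meanX = x.Average();
    double meanY = y.Average();

    double cov = 0, varX = 0, varY = 0;
    for (int i = 0; i < x.Length; i++)
    {
        double dx = x[i] - meanX;
        double dy = y[i] - meanY;
        cov  += dx * dy;   // co-deviation of the two samples
        varX += dx * dx;   // squared deviation of x
        varY += dy * dy;   // squared deviation of y
    }
    return cov / Math.Sqrt(varX * varY);
}

// Made-up values for illustration only, not taken from the slump data set.
var slump = new double[] { 1, 2, 3, 4, 5 };
var flow  = new double[] { 2, 4, 6, 8, 10 };
Console.WriteLine(Pearson(slump, flow));   // 1 for a perfectly linear relation
```

The result always lies between -1 and 1. Values near 1 indicate a strong positive linear relation, values near -1 a strong negative one, and values near 0 little or no linear relation.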