# An Example of a Classical Machine Learning Pipeline

This section describes a classical machine learning pipeline.

The dataset, called `iris`, is loaded through the [RDatasets](https://vincentarelbundock.github.io/Rdatasets/) library, which gives access to datasets originally distributed with the `R` software environment.

The data is then processed as required by the `DecisionTree` library, which we use to train decision tree models that classify the different types of iris flowers.
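As a preview of where this section is heading, the following is a minimal sketch of the whole pipeline, assuming the `DecisionTree.jl` package is installed and using its `build_tree` and `apply_tree` functions. Converting the labels with `string.` and measuring accuracy on the training data are illustrative choices for this sketch, not necessarily the steps taken later in the notebook.

```julia
using RDatasets      # provides dataset(...) and re-exports the DataFrames API
using DecisionTree   # provides build_tree and apply_tree

iris = dataset("datasets", "iris")

# Features as a plain Float64 matrix, one row per flower.
X = Matrix(iris[:, 1:4])

# Labels as plain strings (assumption: plain strings are convenient here;
# the original column holds CategoricalValues).
y = string.(iris[:, :Species])

# Grow a classification tree with the library defaults.
model = build_tree(y, X)

# Predict on the same data used for training; this only checks that the
# pipeline runs and says nothing about generalization.
predictions = apply_tree(model, X)
training_accuracy = count(predictions .== y) / length(y)
println("Training accuracy: ", training_accuracy)
```

Accuracy measured on the training data is optimistic by construction; a held-out test set would be needed for an honest estimate.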

Later in the notebook, we will repeat the process using the `Sole.jl` library and more-than-propositional logic.

## Data Loading and Description

In [1]:
using RDatasets

iris = dataset("datasets", "iris")
first(iris, 5)

| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Cat… |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [18]:
iris[1, 2]

3.5

In [16]:
iris[1, :SepalWidth]

3.5

In [17]:
iris[1:3, :SepalLength]

3-element Vector{Float64}:
 5.1
 4.9
 4.7

In [2]:
describe(iris)

| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
| 1 | SepalLength | 5.84333 | 4.3 | 5.8 | 7.9 | 0 | Float64 |
| 2 | SepalWidth | 3.05733 | 2.0 | 3.0 | 4.4 | 0 | Float64 |
| 3 | PetalLength | 3.758 | 1.0 | 4.35 | 6.9 | 0 | Float64 |
| 4 | PetalWidth | 1.19933 | 0.1 | 1.3 | 2.5 | 0 | Float64 |
| 5 | Species | | setosa | | virginica | 0 | CategoricalValue{String, UInt8} |


## Data Preprocessing

In the limited scenario of this exercise, there is not much room for complex preprocessing of our data. For example, we are not dealing with unbalanced classes, missing data, or complex encodings.
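The claims about class balance and missing data can be checked directly. The following is a small sketch using the DataFrames functions `groupby`, `combine`, and `nrow`; the variable names `class_counts` and `n_missing` are introduced here for illustration only.

```julia
using RDatasets
using DataFrames   # groupby, combine, nrow, eachcol

iris = dataset("datasets", "iris")

# Number of instances per class: a balanced dataset has equal counts.
class_counts = combine(groupby(iris, :Species), nrow => :count)
println(class_counts)

# Total number of missing entries across all columns.
n_missing = sum(count(ismissing, col) for col in eachcol(iris))
println("Missing values: ", n_missing)
```

Each of the three species has 50 instances and no entry is missing, which agrees with the `nmissing` column reported by `describe(iris)`.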

In the cell below, we simply separate the attributes (`X`) from the target column (`y`), which encodes the class we want to learn to predict.

In [20]:
X = Matrix(iris[:, 1:4])

println("The attributes of the first three instances are:")
X[1:3, :]

The attributes of the first three instances are:


3×4 Matrix{Float64}:
 5.1  3.5  1.4  0.2
 4.9  3.0  1.4  0.2
 4.7  3.2  1.3  0.2
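The cell above builds only `X`. A matching sketch for the target `y` is given below; converting the labels to plain strings with `string.` is an assumption made here for convenience, and the name `y_str` is hypothetical.

```julia
using RDatasets

iris = dataset("datasets", "iris")

# Target column: one class label per instance, stored as CategoricalValues.
y = iris[:, :Species]

# Optional plain-string copy of the labels (an illustrative choice).
y_str = string.(y)

println("The classes of the first three instances are:")
println(y_str[1:3])
```

`X` and `y` are aligned row by row, so `X[i, :]` holds the attributes of the flower whose class is `y[i]`.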

Classes are encoded as `CategoricalValue`s for efficiency. Instead of storing the same string (e.g., "setosa") many times, each class is essentially stored as a small integer (here, a `UInt8`) that is mapped to a string value.

In [26]:
iris[:, :Species][[1,51,101]]

3-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
 "setosa"
 "versicolor"
 "virginica"
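To make the integer encoding visible, the sketch below inspects the levels and their codes. It assumes the `CategoricalArrays` package can be loaded directly in the current environment (it is a dependency of `RDatasets`, but may need to be added explicitly), and uses its `levels` and `levelcode` functions.

```julia
using RDatasets
using CategoricalArrays   # levels, levelcode

iris = dataset("datasets", "iris")
species = iris[:, :Species]

# The distinct classes, in the order used by the encoding.
class_names = levels(species)
println(class_names)

# Integer code of one instance from each class (rows 1, 51 and 101).
codes = levelcode.(species[[1, 51, 101]])
println(codes)
```

Each code is the position of the class in `levels(species)`, so the string value can be recovered as `class_names[code]`.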

# Learning with Sole.jl


In [None]:
# TODO: see Day1-Appetizer.ipynb