This notebook explores classification with MLJ, using two models: k-nearest neighbors (KNN) and multiclass logistic regression.

The focus is on learning Julia and MLJ rather than on inspecting the data or the results.

In [32]:
# NOTEBOOK SETTINGS ------
d_packages = false # set to true to install the packages listed below
# ENV ------
versioninfo()

Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 5600X 6-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)


In [2]:
# GET PACKAGES ------
if d_packages
    using Pkg
    Pkg.add("HTTP")
    Pkg.add("MLJ")
    Pkg.add("PyPlot")
    Pkg.add("DataFrames")
    Pkg.add("UrlDownload")
    Pkg.add("NearestNeighborModels")
    Pkg.add("MLJLinearModels")
end

In [3]:
# PACKAGES ------
using HTTP
using MLJ
using PyPlot
import DataFrames: DataFrame, describe
using UrlDownload

# GET DATA ------

url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
header = ["Class", "Alcool", "Malic acid", "Ash", "Alcalinity of ash",
          "Magnesium", "Total phenols", "Flavanoids",
          "Nonflavanoid phenols", "Proanthcyanins", "Color intensity",
          "Hue", "OD280/OD315 of diluted wines", "Proline"]
data = urldownload(url, true, format=:CSV, header=header);



In [4]:
df = DataFrame(data)
describe(df)

 Row │ variable              mean      min    median  max     nmissing  eltype
     │ Symbol                Float64   Real   Float64 Real    Int64     DataType
─────┼────────────────────────────────────────────────────────────────────────────
   1 │ Class                 1.9382    1.0    2.0     3.0     0         Int64
   2 │ Alcool                13.0006   11.03  13.05   14.83   0         Float64
   3 │ Malic acid            2.33635   0.74   1.865   5.8     0         Float64
   4 │ Ash                   2.36652   1.36   2.36    3.23    0         Float64
   5 │ Alcalinity of ash     19.4949   10.6   19.5    30.0    0         Float64
   6 │ Magnesium             99.7416   70.0   98.0    162.0   0         Int64
   7 │ Total phenols         2.29511   0.98   2.355   3.88    0         Float64
   8 │ Flavanoids            2.02927   0.34   2.135   5.08    0         Float64
   9 │ Nonflavanoid phenols  0.361854  0.13   0.34    0.66    0         Float64
  10 │ Proanthcyanins        1.5909    0.41   1.555   3.58    0         Float64


From the describe table we can see that **Class**, the target column, takes three values: it has a minimum of 1, a maximum of 3 and a median of 2.

Since the column is of type Int, describe also reports a mean. At 1.94 it sits slightly below 2, which means that class 1 is a little more frequent than class 3, but the imbalance is mild.
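
The mean of integer labels is only an indirect hint about class balance; counting the rows per class is more direct. The following is a minimal sketch in plain Julia, assuming a small made-up label vector (`labels` is hypothetical; in this notebook it would be `df.Class`).

```julia
# Hypothetical label vector; in the notebook this would be df.Class.
labels = [1, 2, 2, 3, 1, 2]

# Tally how many rows fall in each class.
class_counts = Dict{Int,Int}()
for label in labels
    class_counts[label] = get(class_counts, label, 0) + 1
end

# Print the counts in class order.
for label in sort(collect(keys(class_counts)))
    println("class ", label, ": ", class_counts[label], " rows")
end
```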

### Create Dependent and independent variables 

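MLJ provides the function **unpack**, which splits a table column-wise into several parts. Each part is defined by a __filter__, a function that returns true or false for a column name; the columns for which a filter returns true form one part. Filters are applied in order, and a column is assigned to the first filter it satisfies. Since we know that the column __Class__ is the target, we pass ==(:Class) first. The second filter, col->true, then accepts every remaining column, which gives X. I find that !=(:Class) may be more explicit.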

In [5]:
y, X = unpack(df, ==(:Class), col->true); # equivalent to y, X = unpack(df, ==(:Class), !=(:Class));
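
To make the "first matching filter wins" behaviour concrete, here is a toy sketch in plain Julia. It works on column names only, and `split_columns` is a hypothetical helper written for illustration; it is not how MLJ implements `unpack`.

```julia
# Toy version of the filter logic: each name goes to the first filter it satisfies.
# `split_columns` is a hypothetical helper, not part of MLJ.
function split_columns(names, filters...)
    groups = [Symbol[] for _ in filters]
    for name in names
        for (i, f) in enumerate(filters)
            if f(name)
                push!(groups[i], name)
                break  # stop at the first match so a column lands in one group only
            end
        end
    end
    return groups
end

cols = [:Class, :Alcool, :Ash]
target, features = split_columns(cols, ==(:Class), col -> true)

println("target columns:  ", target)
println("feature columns: ", features)
```

Because `:Class` is already taken by the first filter, the catch-all `col -> true` only receives the remaining names, which is why it behaves like `!=(:Class)` here.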

Now, as a common practice, we can check the schema. The schema gives the types of the stored data and how MLJ interprets them in terms of scientific types.

In this case, almost everything is **Continuous** and we have two **Count** columns (Magnesium and Proline).

It's also nice to see the number of rows.

In [6]:
schema(X)

┌──────────────────────────────┬─────────┬────────────┐
│ _.names                      │ _.types │ _.scitypes │
├──────────────────────────────┼─────────┼────────────┤
│ Alcool                       │ Float64 │ Continuous │
│ Malic acid                   │ Float64 │ Continuous │
│ Ash                          │ Float64 │ Continuous │
│ Alcalinity of ash            │ Float64 │ Continuous │
│ Magnesium                    │ Int64   │ Count      │
│ Total phenols                │ Float64 │ Continuous │
│ Flavanoids                   │ Float64 │ Continuous │
│ Nonflavanoid phenols         │ Float64 │ Continuous │
│ Proanthcyanins               │ Float64 │ Continuous │
│ Color intensity              │ Float64 │ Continuous │
│ Hue                          │ Float64 │ Continuous │
│ OD280/OD315 of diluted wines │ Float64 │ Continuous │
│ Proline                      │ Int64   │ Count      │
└──────────────────────────────┴─────────┴────────────┘
_.nrows = 178


In [7]:
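methods(X)  # lists methods for calling X itself; methodswith(typeof(X)) lists functions that accept a DataFrame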

### Investigate a little

As expected, y has 3 possible values (1, 2, 3).

In [8]:
unique(y)

3-element Vector{Int64}:
 1
 2
 3

By default, MLJ interprets the column as the __Count__ scitype.

In [9]:
scitype(y)

AbstractVector{Count} (alias for AbstractArray{Count, 1})

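But we should change that, since it is a factor (categorical) column rather than a numerical one.
To change the scitype of a column we use the __coerce__ function. Here we use OrderedFactor; Multiclass is the alternative when the labels have no natural order.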

In [10]:
y= coerce(y, OrderedFactor);

In [11]:
scitype(y)

AbstractVector{OrderedFactor{3}} (alias for AbstractArray{OrderedFactor{3}, 1})

So now we have an OrderedFactor with 3 classes. Great!

As for the predictors, we are dealing exclusively with continuous data. We should coerce Count to Continuous.

In [12]:
X_coerced= coerce(X, Count=>Continuous); # following "old"=> "new"
schema(X_coerced)

┌──────────────────────────────┬─────────┬────────────┐
│ _.names                      │ _.types │ _.scitypes │
├──────────────────────────────┼─────────┼────────────┤
│ Alcool                       │ Float64 │ Continuous │
│ Malic acid                   │ Float64 │ Continuous │
│ Ash                          │ Float64 │ Continuous │
│ Alcalinity of ash            │ Float64 │ Continuous │
│ Magnesium                    │ Float64 │ Continuous │
│ Total phenols                │ Float64 │ Continuous │
│ Flavanoids                   │ Float64 │ Continuous │
│ Nonflavanoid phenols         │ Float64 │ Continuous │
│ Proanthcyanins               │ Float64 │ Continuous │
│ Color intensity              │ Float64 │ Continuous │
│ Hue                          │ Float64 │ Continuous │
│ OD280/OD315 of diluted wines │ Float64 │ Continuous │
│ Proline                      │ Float64 │ Continuous │
└──────────────────────────────┴─────────┴────────────┘
_.nrows = 178


As this is not a deep dive into the problem, we will not check every column's distribution and correlation with the others. From the __describe__ table, we can see that the column values vary in magnitude (look at __Proline__, for example). It might be a good idea to standardize the data, especially if we use a distance metric or an iterative solver like gradient descent.
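
As a reminder of what standardizing means, here is a minimal z-score sketch in plain Julia. The column values are made up, and `zscore` is a hypothetical helper; in the notebook this role is played by `Standardizer()` inside the pipelines.

```julia
using Statistics

# Hypothetical column with large values, on a scale much bigger than the other features.
large_column = [500.0, 750.0, 1000.0, 1250.0, 1500.0]

# Z-score: subtract the mean and divide by the standard deviation.
zscore(v) = (v .- mean(v)) ./ std(v)

scaled = zscore(large_column)
println("scaled values: ", round.(scaled, digits=3))
println("mean after scaling: ", round(mean(scaled), digits=6))
println("std after scaling:  ", round(std(scaled), digits=6))
```

After this transformation every column has mean 0 and standard deviation 1, so no single feature dominates a Euclidean distance just because of its units.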

### Separate Train and Testing 

In MLJ, the partition is usually done on the row indices rather than on the tables themselves. The indices are then used to subset the tables.

In [13]:
### Separate Train and Testing ---
train, test = partition(collect(eachindex(y)), 0.8, shuffle=true, rng=123); # Usual 20 % testing 
# eachindex is a safe 1:length(y)
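
The idea of splitting indices instead of tables can be sketched in plain Julia as follows. This is an illustration under assumptions, not MLJ's `partition` itself: the shuffling scheme is my own, so the resulting indices will differ from the ones the notebook produces even with the same seed.

```julia
using Random

# Shuffle the row indices with a fixed seed, then cut at 80 %.
# Plain-Julia sketch of the idea; it does not reproduce MLJ's exact split.
n_rows = 178
rng = MersenneTwister(123)
shuffled = shuffle(rng, collect(1:n_rows))

n_train = round(Int, 0.8 * n_rows)
train_idx = shuffled[1:n_train]
test_idx = shuffled[n_train+1:end]

println(length(train_idx), " training rows, ", length(test_idx), " testing rows")
```

With 178 rows this gives 142 training rows and 36 testing rows, which is consistent with the error rates reported later in the notebook (4/142 ≈ 0.0282 and 1/36 ≈ 0.0278).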


In [14]:
# Subset ---
Xtrain = selectrows(X_coerced, train)
Xtest = selectrows(X_coerced, test)
ytrain = selectrows(y, train)
ytest = selectrows(y, test);

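We then define the models. Each one is placed in a pipeline behind a Standardizer, so the features are rescaled before they reach the classifier.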

In [15]:
# MODELS ---
KNNC = @load KNNClassifier  # A nearest neighbors model 
MNC = @load MultinomialClassifier pkg=MLJLinearModels; # a logistic model for multiclass data 

KnnPipe = @pipeline(Standardizer(), KNNC())
MnPipe = @pipeline(Standardizer(), MNC());

┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main /root/.julia/packages/MLJModels/5itei/src/loading.jl:168


import NearestNeighborModels ✔
import MLJLinearModels ✔

┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main /root/.julia/packages/MLJModels/5itei/src/loading.jl:168
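
For intuition about what a nearest-neighbors classifier does, here is a toy sketch in plain Julia using Euclidean distance and a majority vote. `knn_predict`, `X_toy` and `y_toy` are hypothetical names and data for illustration; the sketch makes no claim about how `KNNClassifier` is implemented or what its defaults are.

```julia
# Toy k-nearest-neighbors classifier: Euclidean distance plus majority vote.
# `knn_predict` is a hypothetical helper written for illustration only.
function knn_predict(X, y, query, k)
    # Distance from the query point to every training row.
    dists = [sqrt(sum((X[i, :] .- query) .^ 2)) for i in 1:size(X, 1)]
    nearest = sortperm(dists)[1:k]

    # Count the labels among the k nearest rows.
    votes = Dict{eltype(y),Int}()
    for i in nearest
        votes[y[i]] = get(votes, y[i], 0) + 1
    end

    # Return the label with the most votes (ties are broken arbitrarily).
    best_label = first(keys(votes))
    for (label, n_votes) in votes
        if n_votes > votes[best_label]
            best_label = label
        end
    end
    return best_label
end

# Made-up data: two points near the origin (class 1), three near (1, 1) (class 2).
X_toy = [0.0 0.0; 0.1 0.2; 1.0 1.0; 0.9 1.1; 1.2 0.8]
y_toy = [1, 1, 2, 2, 2]

println(knn_predict(X_toy, y_toy, [0.05, 0.1], 3))
println(knn_predict(X_toy, y_toy, [1.0, 1.0], 3))
```

Since the prediction depends entirely on distances, a feature with a much larger scale would dominate the vote, which is the reason for standardizing first.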





As always, we wrap a model in a machine. A machine binds a model to the training data and, once fitted, holds the learned parameters.

In [16]:
# MACHINES --- 
knn = machine(KnnPipe, Xtrain, ytrain)
regression = machine(MnPipe, Xtrain, ytrain)

Machine{Pipeline266,…} @929 trained 0 times; caches data
  args: 
    1:	Source @592 ⏎ `Table{AbstractVector{Continuous}}`
    2:	Source @716 ⏎ `AbstractVector{OrderedFactor{3}}`


As our models are __trained 0 times__, we call fit! on each machine.

In [17]:
# FITTING --- 
fit!(knn)
fit!(regression)

┌ Info: Training Machine{Pipeline259,…} @627.
└ @ MLJBase /root/.julia/packages/MLJBase/rN59G/src/machines.jl:354
┌ Info: Training Machine{Standardizer,…} @759.
└ @ MLJBase /root/.julia/packages/MLJBase/rN59G/src/machines.jl:354
┌ Info: Training Machine{KNNClassifier,…} @028.
└ @ MLJBase /root/.julia/packages/MLJBase/rN59G/src/machines.jl:354
┌ Info: Training Machine{Pipeline266,…} @929.
└ @ MLJBase /root/.julia/packages/MLJBase/rN59G/src/machines.jl:354
┌ Info: Training Machine{Standardizer,…} @247.
└ @ MLJBase /root/.julia/packages/MLJBase/rN59G/src/machines.jl:354
┌ Info: Training Machine{MultinomialClassifier,…} @905.
└ @ MLJBase /root/.julia/packages/MLJBase/rN59G/src/machines.jl:354


Machine{Pipeline266,…} @929 trained 1 time; caches data
  args: 
    1:	Source @592 ⏎ `Table{AbstractVector{Continuous}}`
    2:	Source @716 ⏎ `AbstractVector{OrderedFactor{3}}`


Great, that was fast. We can get predictions by calling the predict method.

One thing to note is that the model returns a probability for every class. The weight matrix of the logistic model therefore has shape (n_features, n_classes), here 13 × 3.

In [18]:
predict(regression, Xtrain[1:3,:]) 

3-element MLJBase.UnivariateFiniteVector{OrderedFactor{3}, Int64, UInt32, Float64}:
 UnivariateFinite{OrderedFactor{3}}(1=>0.993, 2=>0.00696, 3=>9.58e-5)
 UnivariateFinite{OrderedFactor{3}}(1=>0.00551, 2=>0.994, 3=>2.81e-5)
 UnivariateFinite{OrderedFactor{3}}(1=>1.0, 2=>2.65e-5, 3=>8.27e-5)
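
To see why a weight matrix of shape (n_features, n_classes) yields one probability per class, here is a small softmax sketch in plain Julia. The weights and the observation are made up, the intercept is left out for brevity, and the sketch only assumes the standard multinomial (softmax) formulation of multiclass logistic regression.

```julia
# Made-up weights: 2 features (rows) by 3 classes (columns).
# These are not the fitted coefficients from the notebook.
W = [ 0.7 -1.0  0.3;
     -0.8  0.6  0.2]

# One hypothetical standardized observation with 2 features.
x = [1.5, -0.5]

# One linear score per class, then softmax to turn scores into probabilities.
scores = W' * x
probs = exp.(scores) ./ sum(exp.(scores))

println("scores: ", scores)
println("probabilities: ", round.(probs, digits=3))
println("sum of probabilities: ", sum(probs))
```

Each column of `W` produces the score of one class, so the output always has as many entries as there are classes and the entries sum to 1.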

In [19]:
W=fitted_params(regression)
W[1].coefs

13-element Vector{Pair{Symbol, SubArray{Float64, 1, Matrix{Float64}, Tuple{Int64, Base.Slice{Base.OneTo{Int64}}}, true}}}:
                                :Alcool => [0.7440262558090187, -0.983628753301021, 0.23960249749200227]
                   Symbol("Malic acid") => [0.1801466521718594, -0.3929356927714954, 0.21278904059963544]
                                   :Ash => [0.4109091488604062, -0.7643041885115611, 0.35339503965115504]
            Symbol("Alcalinity of ash") => [-0.7928476360704427, 0.5965465593788984, 0.19630107669154379]
                             :Magnesium => [0.10296940790786051, -0.13595956522364963, 0.03299015731578979]
                Symbol("Total phenols") => [0.1775501748972599, 0.18696576707637386, -0.3645159419736337]
                            :Flavanoids => [0.6010862996227639, 0.3344089580899755, -0.9354952577127393]
         Symbol("Nonflavanoid phenols") => [-0.17596525958347306, 0.16472099831170142, 0.011244261271772183]
                        :P

To take the most probable of the 3 classes for each row (the argmax of the predicted probabilities), we use the predict_mode method.

In [20]:
# PREDICT MODE (most probable class) ---
knn_y_hat=predict_mode(knn, Xtrain)
regression_y_hat=predict_mode(regression, Xtrain);
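
The step from probabilities to a single class can be sketched in plain Julia. The matrix below copies the rounded probabilities printed by the earlier `predict` call on the first three training rows; the `probs`, `classes` and `modes` names are only for illustration.

```julia
# Each row holds the probabilities for classes 1, 2 and 3,
# rounded as printed in the notebook's predict output.
probs = [0.993   0.00696 9.58e-5;
         0.00551 0.994   2.81e-5;
         1.0     2.65e-5 8.27e-5]
classes = [1, 2, 3]

# The mode of each predicted distribution is the class with the largest probability.
modes = [classes[argmax(probs[i, :])] for i in 1:size(probs, 1)]
println(modes)
```

For these three rows the most probable classes are 1, 2 and 1.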

With these predictions we can now calculate several metrics. Let's try accuracy and misclassification_rate.

In [21]:
print("KNN MISCLASSIFICATION")
misclassification_rate(knn_y_hat, ytrain)

KNN MISCLASSIFICATION

0.028169014084507043

In [22]:
print("REGRESSION MISCLASSIFICATION")
misclassification_rate(regression_y_hat, ytrain)

REGRESSION MISCLASSIFICATION

0.0

Interestingly, the regression model fits the training set perfectly. The features seem to separate the classes very well.

### Predict on the Testing Set
Now we make our predictions on the testing set.

In [23]:
knn_y_test=predict_mode(knn, Xtest)
regression_y_test=predict_mode(regression, Xtest);

In [24]:
print("[TEST] KNN MISCLASSIFICATION")
round(misclassification_rate(knn_y_test, ytest),sigdigits=3)

[TEST] KNN MISCLASSIFICATION

0.0278

In [25]:
print("[TEST] REGRESSION MISCLASSIFICATION")
round(misclassification_rate(regression_y_test, ytest), sigdigits=5)

[TEST] REGRESSION MISCLASSIFICATION

0.0

Since the misclassification rate is 0, and accuracy is 1 minus the misclassification rate, we should find a perfect accuracy.

In [27]:
print("[TEST] REGRESSION ACCURACY")
accuracy(regression_y_test, ytest)

[TEST] REGRESSION ACCURACY

1.0
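
The relation between the two metrics can be checked by hand. The sketch below uses made-up labels and predictions (not the notebook's results) and plain Julia instead of MLJ's measures.

```julia
# Hypothetical ground truth and predictions, only to illustrate the identity.
y_true = [1, 2, 3, 2, 1, 3, 2, 2]
y_pred = [1, 2, 3, 2, 1, 3, 2, 1]

n = length(y_true)
n_wrong = count(y_pred .!= y_true)

# Misclassification rate is the share of wrong predictions; accuracy is its complement.
misclassification = n_wrong / n
acc = 1 - misclassification

println("misclassification = ", misclassification, ", accuracy = ", acc)
```

With one error out of eight rows the misclassification rate is 0.125 and the accuracy is 0.875; with zero errors, as in the test set above, the accuracy is 1.0.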