### Logistic classification with MNIST

(two predictors)

#### Libraries

In [None]:
using MLDatasets           # mnist
using Images
using PreprocessingImages; pim = PreprocessingImages
using PreprocessingArrays; pa  = PreprocessingArrays

using MLJ                  # make_blobs, rmse, confmat, categorical
using MLDataUtils          # label, nlabel, labelfreq
using GLM

using Metrics              # r2-score
using Random
using Statistics           # mean
using Plots; gr()
using StatsPlots
using DataFrames

### Functions

In [None]:
# metrics
function printMetrics(ŷ, y)
    display(confmat(ŷ, y))
    println("accuracy: ", round(accuracy(ŷ, y); digits=3))
    println("f1-score: ", round(f1score(ŷ, y);  digits=3))
end
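
As a rough illustration of what the two printed numbers mean, the sketch below computes accuracy and F1 by hand for a binary problem. The label vectors are made up, and treating the second class (5) as the "positive" class is an assumption of this sketch.

```julia
# Hand-computed accuracy and F1 score on made-up labels (classes 1 and 5).
function binary_metrics(y_pred, y_true; positive = 5)
    tp = count((y_pred .== positive) .& (y_true .== positive))  # true positives
    fp = count((y_pred .== positive) .& (y_true .!= positive))  # false positives
    fn = count((y_pred .!= positive) .& (y_true .== positive))  # false negatives
    tn = count((y_pred .!= positive) .& (y_true .!= positive))  # true negatives

    acc  = (tp + tn) / length(y_true)
    prec = tp / (tp + fp)              # precision
    rec  = tp / (tp + fn)              # recall
    f1   = 2 * prec * rec / (prec + rec)
    return (accuracy = acc, f1 = f1)
end

y_true = [1, 1, 5, 5, 5, 1, 5, 1]
y_pred = [1, 5, 5, 5, 1, 1, 5, 1]
m = binary_metrics(y_pred, y_true)
println("accuracy: ", m.accuracy, "  f1-score: ", m.f1)
```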


### MNIST

In [None]:
# load mnist
datasetX,    datasetY    = MNIST(:train)[:]
validationX, validationY = MNIST(:test)[:]

display( size(datasetX) )

img  = datasetX[:, :, 1:5]
img2 = permutedims(img, (2, 1, 3))

display(datasetY[1:5]')
mosaicview( Gray.(img2)  ; nrow=1)

In [None]:
# split trainset, testset from dataset
Random.seed!(1)
(trainX, trainY), (testX, testY) = stratifiedobs((datasetX, datasetY), p = 0.7)
size(trainX), size(testX), size(validationX)
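
A stratified split keeps the class proportions roughly equal in both subsets. The following is a minimal sketch of that idea on toy labels; `stratified_indices` is a hypothetical helper written for this illustration, not the `stratifiedobs` implementation.

```julia
using Random

# Shuffle the indices of each class separately, then send the first
# fraction `p` of every class to the training set and the rest to the test set.
function stratified_indices(labels; p = 0.7, rng = Random.default_rng())
    train_idx, test_idx = Int[], Int[]
    for class in unique(labels)
        idx = shuffle(rng, findall(==(class), labels))
        n_train = round(Int, p * length(idx))
        append!(train_idx, idx[1:n_train])
        append!(test_idx,  idx[n_train+1:end])
    end
    return train_idx, test_idx
end

labels = repeat([0, 1, 2], inner = 10)     # toy data: 10 observations per class
train_idx, test_idx = stratified_indices(labels; p = 0.7, rng = MersenneTwister(1))
println(length(train_idx), " training and ", length(test_idx), " test observations")
```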

#### Preprocessing
Data preprocessing depends on the data source, so it can vary widely from what is shown here.

In [None]:
# select classes for prediction
c = (1, 5)


In [None]:
# feature extraction
meanIntensity(img) = mean(Float64.(img))

function hSymmetry(img)
    imgFloat = Float32.(img)
    imgReverse = reverse(imgFloat, dims=1)
    return -mean( abs.(imgFloat - imgReverse) )
end
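
To make the two features concrete, here is a small sketch that applies them to 3×3 toy matrices. The names `mean_intensity` and `symmetry` are stand-ins for this illustration. Note that the result of `reverse(..., dims=1)` depends on the shape of the input: on a matrix it mirrors the rows, while on a flattened vector it reverses the whole pixel order, which is equivalent to mirroring along both axes.

```julia
using Statistics

# Average pixel value of the image.
mean_intensity(img) = mean(Float64.(img))

# Negative mean absolute difference between the image and its mirror along `dims`.
# 0 means perfectly symmetric; more negative means less symmetric.
symmetry(img; dims = 1) = -mean(abs.(Float64.(img) .- reverse(Float64.(img), dims = dims)))

symmetric  = [0 1 0;
              1 1 1;
              0 1 0]
asymmetric = [1 1 0;
              1 0 0;
              0 0 0]

println(mean_intensity(symmetric))   # 5 bright pixels out of 9
println(symmetry(symmetric))         # zero: the mirror image is identical
println(symmetry(asymmetric))        # negative: the mirror image differs

# Reversing a flattened image is the same as mirroring along both axes.
flipped_both = reverse(reverse(asymmetric, dims = 1), dims = 2)
println(reverse(vec(asymmetric)) == vec(flipped_both))
```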


In [None]:
# preprocessing
function preprocess(X, y)
    # process X
    Xs = pim.batchImage2Vector( Float32.(X) )

    # data selection from chosen classes
    Xs = vcat( Xs[y .== c[1] ], Xs[ y .== c[2] ] )
    ys = vcat(  y[y .== c[1] ],  y[ y .== c[2] ] )

    # extract parameters from X
    N = size(Xs)[1]
    x1 = [meanIntensity(Xs[i]) for i in 1:N]
    x2 = [hSymmetry(Xs[i])     for i in 1:N]
    Xs = hcat(x1, x2)
    Xs = pa.rescaleByColumns( Float32.(Xs) )
    
    # formatting for MLJ
    Xs = DataFrame(Xs, :auto)
    ys = coerce(ys, OrderedFactor)
    
    return (Xs, ys)
end


trainXLog, trainYLog = preprocess(trainX, trainY)
size(trainXLog), size(trainYLog), levels(trainYLog)'
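
The class selection and column rescaling steps can be sketched on toy data as follows. The min-max rescaling used here is only an assumed example of a column-wise rescaling; the source does not state what `pa.rescaleByColumns` computes. Unlike `preprocess`, this sketch keeps the original row order instead of grouping rows by class.

```julia
# Toy data: one label per row of `features`.
labels   = [1, 3, 5, 1, 7, 5]
features = [0.2 10.0;
            0.9 30.0;
            0.4 20.0;
            0.6 40.0;
            0.1 50.0;
            0.8 60.0]
chosen = (1, 5)

# Keep only the rows whose label belongs to the chosen classes.
mask          = in.(labels, Ref(chosen))
kept_labels   = labels[mask]
kept_features = features[mask, :]

# Assumed rescaling: map every column linearly onto [0, 1].
function minmax_columns(X)
    lo = minimum(X, dims = 1)
    hi = maximum(X, dims = 1)
    return (X .- lo) ./ (hi .- lo)
end

scaled = minmax_columns(kept_features)
println(kept_labels)
println(scaled)
```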

### Training, Testing, Validation

#### Load the algorithm

In [None]:
LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0

#### Instantiate the model

In MLJ, a "model" is just a container for hyper-parameters.

Note the output of the command below: it lists the value assigned to each hyper-parameter, including the defaults. This information is useful, for example, when tuning a hyper-parameter at a later stage.

In [None]:
model = LogisticClassifier()

In [None]:
info(LogisticClassifier)
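
The idea of a model as a plain container can be sketched with a mutable struct. `ToyLogisticModel` and its default values are hypothetical and exist only for this illustration; they are not the actual `LogisticClassifier` definition.

```julia
# Hypothetical stand-in for a model: it only stores hyper-parameters.
# It holds no data and no learned coefficients.
Base.@kwdef mutable struct ToyLogisticModel
    lambda::Float64     = 1.0     # regularization strength (illustrative default)
    fit_intercept::Bool = true    # illustrative default
end

toy = ToyLogisticModel()          # every field takes its default value
println(toy)

toy.lambda = 0.01                 # hyper-parameters can be changed in place
println(toy.lambda)
```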

#### Create a machine

In MLJ, a "machine" is an object that binds a model (the hyper-parameters) to the training data and, once trained, stores the learned parameters.

In [None]:
mach = MLJ.machine(model, trainXLog, trainYLog)

#### Train the machine

The machine is trained on its bound dataset according to the model's hyper-parameters:

In [None]:
fit!(mach,
    # acceleration = CPUThreads(),   # https://alan-turing-institute.github.io/MLJ.jl/v0.7/acceleration_and_parallelism/
    verbosity=2)

After training, one can inspect the learned parameters:

In [None]:
fitted_params(mach)
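
For intuition on how learned coefficients turn two predictors into a probability, here is a minimal sketch of the logistic function. The coefficient and intercept values are made up, not taken from the fitted machine, and which class the resulting probability refers to depends on how the library encodes the classes.

```julia
# Logistic (sigmoid) function: maps a linear score to a probability in (0, 1).
sigmoid(z) = 1 / (1 + exp(-z))

# Illustrative values only.
coefs     = [2.0, -1.0]      # one weight per predictor (x1, x2)
intercept = 0.5

# Probability assigned to one of the two classes for a single observation x.
class_probability(x) = sigmoid(sum(coefs .* x) + intercept)

println(class_probability([0.0, 0.5]))   # linear score is 0
println(class_probability([1.0, 0.0]))   # linear score is 2.5
```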

Anything else of interest, if available, can be accessed from the training report:

In [None]:
report(mach)

#### Predict an outcome

The trained machine is now used to predict the outcome for the training set:

In [None]:
p = MLJ.predict(mach, trainXLog);


We can inspect a few rows of the prediction, then just a single row:

In [None]:
display(p[1:5])
p[1]

For this particular model, each prediction is a set of probabilities, one per class. To obtain the most likely class instead, we have:

In [None]:
ŷ = predict_mode(mach, trainXLog)
ŷ[1:5]
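
Conceptually, taking the mode of a probabilistic prediction means picking the class with the largest probability. A minimal sketch with made-up probabilities for the classes 1 and 5:

```julia
classes = [1, 5]

# One row per observation, one column per class (illustrative values).
probs = [0.9 0.1;
         0.2 0.8;
         0.4 0.6]

# For each row, pick the class whose probability is largest.
most_likely = [classes[argmax(row)] for row in eachrow(probs)]
println(most_likely)
```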

We can also compute relevant metrics, as in the example below:

In [None]:
printMetrics(ŷ, trainYLog)

#### Tune a single hyper-parameter

The output shown when the model was instantiated above includes the hyper-parameter "lambda" (the regularization strength), which may be relevant for improving the model. Let's tune it in an attempt to minimize the cross-entropy loss and maximize accuracy.

First, we define the parameter and limits to scan:

In [None]:
r = range(model, :lambda, lower = 1e-5, upper=1e-1, scale = :log10)
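
On a log10 scale, the scanned values are evenly spaced in their exponents rather than in the values themselves. The sketch below builds such a grid by hand with only five points; it illustrates the spacing and is not how MLJ generates its grid internally.

```julia
lower, upper = 1e-5, 1e-1

# Evenly spaced exponents between log10(lower) and log10(upper).
exponents = range(log10(lower), log10(upper), length = 5)

# Back to the original scale: each value is 10 times the previous one.
grid = 10 .^ exponents
println(grid)
```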

Then we run a 10-fold cross-validation over that range and capture the scanned values (lambdas) and the corresponding cross-entropy losses (losses). The first two elements of the tuple returned by "learning_curve" are not relevant for this example, so they are ignored:

In [None]:
_, _, lambdas, losses = learning_curve(mach,
                                        range=r,
                                        resampling=CV(nfolds=10),
                                        resolution=100,                 # default 30
                                        measure=cross_entropy,
                                        acceleration=CPUProcesses());   # useful if more than one parameter is plotted
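
For readers unfamiliar with k-fold cross-validation, the following sketch shows how the observations can be partitioned: each fold serves once as the held-out set while the remaining folds are used for training. `kfold_indices` is a hypothetical helper for this illustration, not MLJ's `CV` implementation.

```julia
# Split 1:n into k contiguous folds and return one (train, test) pair per fold.
function kfold_indices(n, k)
    bounds = round.(Int, range(0, n, length = k + 1))   # fold boundaries
    return [(train = vcat(1:bounds[i], bounds[i+1]+1:n),
             test  = collect(bounds[i]+1:bounds[i+1])) for i in 1:k]
end

folds = kfold_indices(20, 10)
for (i, f) in enumerate(folds[1:3])
    println("fold ", i, ": test = ", f.test, ", training size = ", length(f.train))
end
```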

In [None]:
plot(lambdas, losses, title="Error function", size=(500,300), linewidth=2, legend=false)
xlabel!("Lambda")
ylabel!("Cross-entropy loss")
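
The cross-entropy loss on the vertical axis penalizes a prediction according to the probability it assigned to the class that actually occurred. A minimal sketch with made-up probabilities, where `cross_entropy_loss` is a name chosen for this illustration:

```julia
using Statistics

# Loss for one observation: -log of the probability given to the true class.
cross_entropy_loss(p_true_class) = -log(p_true_class)

# Probability the model assigned to the true class of four observations.
p_assigned = [0.9, 0.8, 0.6, 0.1]

per_observation = cross_entropy_loss.(p_assigned)
mean_loss = mean(per_observation)

println(per_observation)   # confident correct predictions give small losses
println(mean_loss)
```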

As seen in the chart above, the best value of the tuning parameter is the one that minimizes the loss:

In [None]:
best_lambda = lambdas[argmin(losses)]
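
The selection works because `argmin` returns the position of the smallest loss, which is then used to index the vector of scanned values. A small sketch with made-up numbers (the `toy_` names are used only here):

```julia
toy_lambdas = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
toy_losses  = [0.31, 0.27, 0.25, 0.29, 0.40]   # made-up loss per lambda

best_index      = argmin(toy_losses)           # position of the smallest loss
toy_best_lambda = toy_lambdas[best_index]
println("best lambda: ", toy_best_lambda, " at position ", best_index)
```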

#### Retrain with best tuning parameter

In [None]:
model.lambda = best_lambda
fit!(mach,
    verbosity=2);

In [None]:
ŷ = predict_mode(mach, trainXLog)
printMetrics(ŷ, trainYLog)

#### Evaluate

(in progress)

In [None]:
MLJ.evaluate!(mach,
    resampling=CV(nfolds=10),
    measures=[f1score])


In [None]:
fitted_params(mach)

In [None]:
ŷ = predict_mode(mach, trainXLog)
printMetrics(ŷ, trainYLog)

#### Testing

In [None]:
testXLog, testYLog = preprocess(testX, testY)

ŷ = predict_mode(mach, testXLog)
printMetrics(ŷ, testYLog)

#### Validation

In [None]:
validationXLog, validationYLog = preprocess(validationX, validationY)

ŷ = predict_mode(mach, validationXLog)
printMetrics(ŷ, validationYLog)