# Classification

Put simply, classification is the task of predicting a label for a given observation. For example, given certain physical descriptions of an animal, your task is to classify it as either a dog or a cat. Here, we will classify iris flowers.

We will fit several classifiers and compare them at the end of this notebook. To get it out of the way, we define the accuracy function now: a simple function that returns the ratio of correctly classified observations to the total number of predictions.

In [25]:
findaccuracy(predictedvals,groundtruthvals) = sum(predictedvals.==groundtruthvals)/length(groundtruthvals)

findaccuracy (generic function with 1 method)
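
As a quick sanity check, here is a minimal sketch of the accuracy function applied to two made-up label vectors. The vectors `toy_predicted` and `toy_truth` are illustrative assumptions, not iris data. Three of the four predictions match, so the ratio is 3/4.

```julia
# Same definition as in the cell above, repeated so the snippet runs on its own.
findaccuracy(predictedvals, groundtruthvals) =
    sum(predictedvals .== groundtruthvals) / length(groundtruthvals)

# Hypothetical toy labels: only the third prediction is wrong.
toy_predicted = [1, 2, 3, 3]
toy_truth     = [1, 2, 2, 3]

toy_accuracy = findaccuracy(toy_predicted, toy_truth)  # 3 correct out of 4
```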

In [1]:
using Pkg
Pkg.activate(".")

[32m[1m  Activating[22m[39m 

environment at `e:\Projects\julia-intro\data-science\Project.toml`


In [11]:
using GLMNet
using RDatasets
using MLBase
using Plots
using DecisionTree
using Distances
using NearestNeighbors
using Random
using LinearAlgebra
using DataStructures
using LIBSVM

Import Datasets

In [12]:
iris = dataset("datasets", "iris")

Row,SepalLength,SepalWidth,PetalLength,PetalWidth,Species
,Float64,Float64,Float64,Float64,Cat…
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3.0,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa
4,4.6,3.1,1.5,0.2,setosa
5,5.0,3.6,1.4,0.2,setosa
6,5.4,3.9,1.7,0.4,setosa
7,4.6,3.4,1.4,0.3,setosa
8,5.0,3.4,1.5,0.2,setosa
9,4.4,2.9,1.4,0.2,setosa
10,4.9,3.1,1.5,0.1,setosa


In [13]:
X = Matrix(iris[:, 1:4])
irislabels = iris[:,5]

150-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 ⋮
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"

In [14]:
X

150×4 Matrix{Float64}:
 5.1  3.5  1.4  0.2
 4.9  3.0  1.4  0.2
 4.7  3.2  1.3  0.2
 4.6  3.1  1.5  0.2
 5.0  3.6  1.4  0.2
 5.4  3.9  1.7  0.4
 4.6  3.4  1.4  0.3
 5.0  3.4  1.5  0.2
 4.4  2.9  1.4  0.2
 4.9  3.1  1.5  0.1
 ⋮              
 6.9  3.1  5.1  2.3
 5.8  2.7  5.1  1.9
 6.8  3.2  5.9  2.3
 6.7  3.3  5.7  2.5
 6.7  3.0  5.2  2.3
 6.3  2.5  5.0  1.9
 6.5  3.0  5.2  2.0
 6.2  3.4  5.4  2.3
 5.9  3.0  5.1  1.8

In [15]:
irislabelsmap = labelmap(irislabels)
y = labelencode(irislabelsmap, irislabels)

150-element Vector{Int64}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 3
 3
 3
 3
 3
 3
 3
 3
 3
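
To make the encoding step less opaque, the following is a hand-rolled sketch of the same idea using only Base Julia. It is not how `MLBase` implements `labelmap` and `labelencode`; the names `toy_labels`, `levels`, `lookup`, `encoded`, and `decoded` are hypothetical, and the sketch assumes that labels are numbered in order of first appearance.

```julia
# Hypothetical toy labels standing in for the Species column.
toy_labels = ["setosa", "setosa", "versicolor", "virginica", "versicolor"]

# Distinct labels in first-seen order.
levels = unique(toy_labels)
# Map each label to an integer code.
lookup = Dict(label => i for (i, label) in enumerate(levels))
# Integer code for every observation.
encoded = [lookup[label] for label in toy_labels]
# Indexing the levels with the codes recovers the original labels.
decoded = levels[encoded]
```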


In classification, we typically fit a model on one part of the data and validate it on the rest (commonly known as the training and testing data). We prepare this split now so that we can reuse it throughout the notebook.

In [16]:
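function perclass_splits(y, at)
    uids = unique(y)
    keepids = Int[]                        # row indices kept for training
    for ui in uids
        curids = findall(y .== ui)         # all rows belonging to class ui
        rowids = randsubseq(curids, at)    # keep each row independently with probability `at`
        append!(keepids, rowids)
    end
    return keepids
end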


perclass_splits (generic function with 1 method)

# Examples

In [17]:
trainids = perclass_splits(y, 0.7)
testids = setdiff(1:length(y), trainids)

49-element Vector{Int64}:
   1
   4
   7
   8
  13
  14
  15
  16
  20
  26
   ⋮
 111
 113
 123
 124
 126
 127
 132
 134
 149
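
Because `randsubseq` keeps each index independently with probability `at`, the split sizes vary from run to run; in this run the split came out as 101 training and 49 test observations rather than exactly 105 and 45. If a fixed-size split is wanted, one possible alternative is sketched below. The helper `exact_perclass_splits` and its `rng` keyword are hypothetical names introduced for illustration, and the toy label vector `y_demo` stands in for `y`.

```julia
using Random

# Hypothetical helper (not part of the notebook): take exactly
# round(at * class size) rows from every class.
function exact_perclass_splits(y, at; rng=Random.default_rng())
    keepids = Int[]
    for label in unique(y)
        # Shuffle the row indices of this class, then keep the first share.
        ids = shuffle(rng, findall(==(label), y))
        ntrain = round(Int, at * length(ids))
        append!(keepids, ids[1:ntrain])
    end
    return sort(keepids)
end

# Toy labels: three classes of 50 observations each, like the encoded iris labels.
y_demo = repeat(1:3, inner=50)
train_demo = exact_perclass_splits(y_demo, 0.7; rng=MersenneTwister(1))
test_demo = setdiff(1:length(y_demo), train_demo)
```

The trade-off is that this version always returns the same number of rows per class, whereas the notebook's version only matches the requested fraction on average.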

We need one more function: one that assigns a class when the predicted values are continuous, by picking the class label (1, 2, or 3) closest to the prediction.



In [20]:
assign_class(predictedvalue) = argmin(abs.(predictedvalue .-[1,2,3]))

assign_class (generic function with 1 method)
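
A short worked example may help here. The values in `toy_predictions` are made up for illustration; each one is mapped to whichever of the labels 1, 2, 3 is nearest.

```julia
# Same definition as in the cell above, repeated so the snippet runs on its own.
assign_class(predictedvalue) = argmin(abs.(predictedvalue .- [1, 2, 3]))

# Hypothetical continuous predictions, similar in spirit to regression output.
toy_predictions = [0.93, 2.41, 2.6]

# Broadcasting applies assign_class to each prediction separately.
toy_classes = assign_class.(toy_predictions)
```

Here 0.93 is closest to 1, 2.41 is closer to 2 than to 3, and 2.6 is closer to 3 than to 2.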

# Method 1: Lasso

In [21]:
path = glmnet(X[trainids,:], y[trainids])
cv = glmnetcv(X[trainids,:], y[trainids])

Least Squares GLMNet Cross Validation
72 models for 4 predictors in 10 folds
Best λ 0.001 (mean loss 0.047, std 0.006)

In [22]:
# Choose the best lambda to predict with.
path = glmnet(X[trainids,:], y[trainids])
cv = glmnetcv(X[trainids, :], y[trainids])
mylambda = path.lambda[argmin(cv.meanloss)]

path = glmnet(X[trainids,:], y[trainids], lambda=[mylambda])

Least Squares GLMNet Solution Path (1 solutions for 4 predictors in 73 passes):
────────────────────────────
     df   pct_dev          λ
────────────────────────────
[1]   3  0.931138  0.0116204
────────────────────────────
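
The selection step above amounts to finding the index of the smallest mean cross-validation loss and reading off the λ at that index. A minimal sketch with made-up numbers follows; the vectors `toy_lambdas` and `toy_meanloss` are illustrative assumptions, not values produced by `glmnetcv`.

```julia
# Hypothetical candidate penalties, largest first.
toy_lambdas  = [0.5, 0.1, 0.05, 0.01, 0.001]
# Hypothetical mean cross-validation loss for each candidate.
toy_meanloss = [0.30, 0.12, 0.06, 0.05, 0.07]

best_index  = argmin(toy_meanloss)      # position of the smallest loss
best_lambda = toy_lambdas[best_index]   # penalty at that position
```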

In [23]:
q = X[testids, :]
predictions_lasso = GLMNet.predict(path, q)

49×1 Matrix{Float64}:
 0.9276082836849912
 1.00238662336695
 1.0162475368481927
 0.9548942295291533
 0.9339538181973837
 0.8995883745026947
 0.8255439981588705
 0.9422031605043685
 0.9643794929626932
 1.0296725692111122
 ⋮
 2.7094933436681297
 2.8597839863701826
 2.956098901823534
 2.620120127166794
 2.666972764317659
 2.5928341813226323
 2.763425478669522
 2.4087742593655133
 2.9306225571227618

In [26]:
predictions_lasso = assign_class.(predictions_lasso)
findaccuracy(predictions_lasso, y[testids])

0.9387755102040817

# Method 2: Ridge

We will use the same `glmnet` function, but set `alpha=0` so that the penalty is pure ridge (the default, `alpha=1`, gives the lasso).

In [28]:
# Choose the best lambda to predict with.
path = glmnet(X[trainids,:], y[trainids], alpha=0);
cv = glmnetcv(X[trainids,:], y[trainids], alpha=0)
mylambda = path.lambda[argmin(cv.meanloss)]
path = glmnet(X[trainids,:], y[trainids], alpha=0, lambda=[mylambda]);

q = X[testids,:];
predictions_ridge = GLMNet.predict(path, q)
predictions_ridge = assign_class.(predictions_ridge)
findaccuracy(predictions_ridge, y[testids])

0.9387755102040817

# Method 3: Elastic Net

We will use the same function but set `alpha` to 0.5, which mixes the lasso and ridge penalties in equal parts.
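
For reference, the least-squares objective minimized by the underlying glmnet library, as given in its documentation, can be written as follows. Here $N$ is the number of training observations, $\lambda$ is the penalty strength, and $\alpha$ is the mixing parameter:

$$
\min_{\beta_0,\,\beta}\; \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right]
$$

Setting $\alpha = 1$ leaves only the $\ell_1$ term (lasso), $\alpha = 0$ leaves only the $\ell_2$ term (ridge), and $\alpha = 0.5$ weights the two equally.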

In [46]:
# Choose the best lambda to predict with.
path = glmnet(X[trainids,:], y[trainids], alpha=0.5);
cv = glmnetcv(X[trainids,:], y[trainids], alpha=0.5)
mylambda = path.lambda[argmin(cv.meanloss)]
path = glmnet(X[trainids,:], y[trainids], alpha=0.5, lambda=[mylambda]);

q = X[testids,:];
predictions_EN = GLMNet.predict(path, q)
predictions_EN = assign_class.(predictions_EN)
findaccuracy(predictions_EN, y[testids])

0.9591836734693877

# Method 4: Decision Trees

We will use the `DecisionTree` package.

In [30]:
model = DecisionTreeClassifier(max_depth=2)
DecisionTree.fit!(model, X[trainids,:], y[trainids])

DecisionTreeClassifier
max_depth:                2
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  [1, 2, 3]
root:                     Decision Tree
Leaves: 3
Depth:  2

In [31]:
q = X[testids,:];
predictions_DT = DecisionTree.predict(model, q)
findaccuracy(predictions_DT, y[testids])

0.9387755102040817

# Method 5: Random Forest

The `RandomForestClassifier` is available through the `DecisionTree` package as well.

In [50]:
model = RandomForestClassifier(n_trees=20)
DecisionTree.fit!(model, X[trainids,:], y[trainids])

RandomForestClassifier
n_trees:             20
n_subfeatures:       -1
partial_sampling:    0.7
max_depth:           -1
min_samples_leaf:    1
min_samples_split:   2
min_purity_increase: 0.0
classes:             [1, 2, 3]
ensemble:            Ensemble of Decision Trees
Trees:      20
Avg Leaves: 6.0
Avg Depth:  4.5

In [51]:
q = X[testids, :];
predictions_RF = DecisionTree.predict(model, q)
findaccuracy(predictions_RF, y[testids])

0.9387755102040817

# Method 6: Nearest Neighbors

We will use the `NearestNeighbors` package here. `KDTree` expects one point per column, which is why the matrices are transposed below.

In [35]:
Xtrain = X[trainids, :]
ytrain = y[trainids]
kdtree = KDTree(Xtrain')

KDTree{StaticArrays.SVector{4, Float64}, Euclidean, Float64}
  Number of points: 101
  Dimensions: 4
  Metric: Euclidean(0.0)
  Reordered: true

In [37]:
queries = X[testids, :]

49×4 Matrix{Float64}:
 5.1  3.5  1.4  0.2
 4.6  3.1  1.5  0.2
 4.6  3.4  1.4  0.3
 5.0  3.4  1.5  0.2
 4.8  3.0  1.4  0.1
 4.3  3.0  1.1  0.1
 5.8  4.0  1.2  0.2
 5.7  4.4  1.5  0.4
 5.1  3.8  1.5  0.3
 5.0  3.0  1.6  0.2
 ⋮              
 6.5  3.2  5.1  2.0
 6.8  3.0  5.5  2.1
 7.7  2.8  6.7  2.0
 6.3  2.7  4.9  1.8
 7.2  3.2  6.0  1.8
 6.2  2.8  4.8  1.8
 7.9  3.8  6.4  2.0
 6.3  2.8  5.1  1.5
 6.2  3.4  5.4  2.3

In [39]:
idxs, dists = knn(kdtree, queries', 5, true)

([[10, 3, 22, 18, 23], [2, 26, 24, 21, 5], [2, 8, 24, 20, 23], [22, 29, 10, 17, 8], [1, 6, 26, 19, 2], [21, 24, 5, 2, 26], [9, 11, 7, 28, 4], [4, 9, 11, 7, 28], [13, 27, 28, 3, 10], [19, 6, 1, 26, 29]  …  [85, 67, 84, 88, 96], [100, 74, 98, 90, 75], [92, 93, 79, 98, 81], [77, 85, 88, 76, 67], [99, 82, 47, 55, 71], [84, 67, 85, 96, 79], [82, 91, 99, 55, 41], [76, 88, 70, 85, 67], [55, 47, 82, 41, 71], [89, 74, 100, 93, 90]], [[0.09999999999999998, 0.1414213562373093, 0.14142135623730964, 0.14142135623730995, 0.17320508075688743], [0.24494897427831802, 0.26457513110645925, 0.29999999999999954, 0.29999999999999954, 0.2999999999999997], [0.264575131106459, 0.3000000000000002, 0.3162277660168373, 0.4123105625617666, 0.42426406871192884], [0.09999999999999964, 0.14142135623730964, 0.1999999999999999, 0.22360679774997902, 0.22360679774997916], [0.1414213562373099, 0.17320508075688812, 0.19999999999999998, 0.20000000000000034, 0.264575131106459], [0.244948974278318, 0.31622776601683816, 0.3464

In [41]:
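# Labels of the 5 nearest training points: one column per test observation.
c = ytrain[hcat(idxs...)]
# Count how often each label occurs among the neighbors of each test observation.
possible_labels = map(i->counter(c[:,i]), 1:size(c,2))
# Majority vote: argmax on a counter returns the label with the highest count.
predictions_NN = map(i->argmax(possible_labels[i]), 1:size(c,2))
findaccuracy(predictions_NN, y[testids])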

0.9387755102040817
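
The tree query and vote above can be hard to follow at first read. Below is a brute-force sketch of the same idea (find the k closest training points, then take a majority vote) using only the standard library. The function `knn_predict` and the toy data are hypothetical and for illustration only; a real workload would use the `KDTree` shown above, which avoids computing every distance.

```julia
using LinearAlgebra  # for norm

# Hypothetical brute-force stand-in for the KDTree query plus majority vote.
function knn_predict(Xtrain, ytrain, query, k)
    # Euclidean distance from the query to every training row.
    dists = [norm(Xtrain[i, :] .- query) for i in 1:size(Xtrain, 1)]
    # Labels of the k closest training points.
    neighbor_labels = ytrain[partialsortperm(dists, 1:k)]
    # Majority vote: count each distinct label and return the most frequent one.
    candidates = unique(neighbor_labels)
    counts = [count(==(label), neighbor_labels) for label in candidates]
    return candidates[argmax(counts)]
end

# Toy data (illustrative only): two well-separated clusters, one row per point.
toy_X = [0.0 0.0; 0.1 0.0; 0.0 0.1; 5.0 5.0; 5.1 5.0; 5.0 5.1]
toy_y = [1, 1, 1, 2, 2, 2]

pred_a = knn_predict(toy_X, toy_y, [0.05, 0.05], 3)  # query near the first cluster
pred_b = knn_predict(toy_X, toy_y, [4.9, 5.0], 3)    # query near the second cluster
```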

# Method 7: Support Vector Machines

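We will use the `LIBSVM` package here. As with `KDTree`, `svmtrain` and `svmpredict` expect one observation per column, hence the transposes.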

In [42]:
Xtrain = X[trainids,:]
ytrain = y[trainids]

101-element Vector{Int64}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 3
 3
 3
 3
 3
 3
 3
 3
 3

In [43]:
model = svmtrain(Xtrain', ytrain)

LIBSVM.SVM{Int64}(SVC, LIBSVM.Kernel.RadialBasis, nothing, 4, 3, [1, 2, 3], Int32[1, 2, 3], Float64[], Int32[], LIBSVM.SupportVectors{Vector{Int64}, Matrix{Float64}}(36, Int32[5, 17, 14], [1, 1, 1, 1, 1, 2, 2, 2, 2, 2  …  3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [4.4 5.7 … 6.5 5.9; 2.9 3.8 … 3.0 3.0; 1.4 1.7 … 5.2 5.1; 0.2 0.3 … 2.0 1.8], Int32[5, 11, 14, 15, 16, 30, 32, 34, 35, 38  …  80, 82, 84, 87, 91, 94, 95, 99, 100, 101], LIBSVM.SVMNode[LIBSVM.SVMNode(1, 4.4), LIBSVM.SVMNode(1, 5.7), LIBSVM.SVMNode(1, 4.6), LIBSVM.SVMNode(1, 5.1), LIBSVM.SVMNode(1, 4.8), LIBSVM.SVMNode(1, 7.0), LIBSVM.SVMNode(1, 6.9), LIBSVM.SVMNode(1, 6.3), LIBSVM.SVMNode(1, 4.9), LIBSVM.SVMNode(1, 5.0)  …  LIBSVM.SVMNode(1, 5.6), LIBSVM.SVMNode(1, 6.1), LIBSVM.SVMNode(1, 7.2), LIBSVM.SVMNode(1, 6.1), LIBSVM.SVMNode(1, 6.0), LIBSVM.SVMNode(1, 6.9), LIBSVM.SVMNode(1, 5.8), LIBSVM.SVMNode(1, 6.3), LIBSVM.SVMNode(1, 6.5), LIBSVM.SVMNode(1, 5.9)]), 0.0, [0.646210754642604 0.7367208204248435; 0.585671002611582 0.79347426660048

In [44]:
predictions_SVM, decision_values = svmpredict(model, X[testids, :]')
findaccuracy(predictions_SVM, y[testids])

0.9591836734693877

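Putting all the results together. With only 49 test observations, the gap between 0.938776 and 0.959184 is a single observation (46 versus 47 correct):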

In [52]:
overall_accuracies = zeros(7)
methods = ["lasso","ridge","EN", "DT", "RF","kNN", "SVM"]
ytest = y[testids]
overall_accuracies[1] = findaccuracy(predictions_lasso,ytest)
overall_accuracies[2] = findaccuracy(predictions_ridge,ytest)
overall_accuracies[3] = findaccuracy(predictions_EN,ytest)
overall_accuracies[4] = findaccuracy(predictions_DT,ytest)
overall_accuracies[5] = findaccuracy(predictions_RF,ytest)
overall_accuracies[6] = findaccuracy(predictions_NN,ytest)
overall_accuracies[7] = findaccuracy(predictions_SVM,ytest)
hcat(methods, overall_accuracies)

7×2 Matrix{Any}:
 "lasso"  0.938776
 "ridge"  0.938776
 "EN"     0.959184
 "DT"     0.938776
 "RF"     0.938776
 "kNN"    0.938776
 "SVM"    0.959184