# Partitioning Data

For all datasets excluding the third, the madelon dataset, partitioning the data into training and test sets involves the use of the MLJ function, partition. This function, when passing in a range of numbers, generates two subsets of that range that contain randomly chosen numbers, and each set is mutually exclusive. For each dataset, partitioning is done in a 0.75-0.25 ratio. The returned values are train and test, which correspond to the indices of the subsets. We can then index by these returned values to get all the rows that correspond to these sets, named accordingly.

For the partition function, a random seed was implemented to ensure consistency between results for debugging. The seed value can be adjusted freely, and can be removed from the function (the "rng = rng" part) if desired.

We only need to partition the data once before feeding them into the models, and this process is roughly identical even between datasets; thus the variable names are roughly the same. Caution must be exercised when running models for other datasets: the code cell directly under "Dataset N: " must be ran first before applying any of the subsequent models.


# Parameter Tuning

All parameters were tuned automatically by the ScikitLearn package. To make predictions more accurate, however, simplifications often had to be made on a case-by-case basis. For instance, in datasets 2 and 4, certain classes were grouped together to allow for more accurate predictions. How this occurs will be explained more thoroughly in comments embedded in the code.

# Tabulated Results


**Important Note:** These values are relative to my machine and are presented for convenience. Thus, all values are subject to change, especially training and testing time. Accuracy is expected to be roughly similar due to setting a random seed.

When running a code chunk for a model, training time, testing time, accuracy, and model size will be displayed under the chunk, particular to the machine running it.



Training Time (seconds)

| Dataset | Classification Tree | Naive Bayes | Support Vector Machine | Neural Network 
| ----------- | ----------- | ----------- | ----------- | ----------- |
| Car Evaluation | 0.003052 | 0.006951 | 0.053788  | 0.908673 |
| Abalone | 0.020845 | 0.011424 | 0.105136 |  1.750691 | 
| Madelon | 0.391014 | 0.014903 | 0.616664 |  0.399532 | 
| KDD Cup | 12.201162 | 12.587327 | 70.719620 | 177.687514 | 





Testing Time (seconds)

| Dataset | Classification Tree | Naive Bayes | Support Vector Machine | Neural Network 
| ----------- | ----------- | ----------- | ----------- | ----------- |
| Car Evaluation | 0.000720 | 0.000942 | 0.023795 | 0.001075 |
| Abalone | 0.003780 | 0.003791 | 0.003925 | 0.004368 | 
| Madelon | 0.001105 | 0.004305 | 0.292484 | 0.002390 | 
| KDD Cup | 2.920390 | 3.126324 | 3.197580 | 3.299938 | 


Accuracy (Truncated to 3 Decimals)

| Dataset | Classification Tree | Naive Bayes | Support Vector Machine | Neural Network 
| ----------- | ----------- | ----------- | ----------- | ----------- |
| Car Evaluation | 0.967 | 0.884 | 0.974 | 0.995 |
| Abalone | 0.547 | 0.577 | 0.651 | 0.660 | 
| Madelon | 0.731 | 0.591 | 0.686 | 0.538 | 
| KDD Cup | 0.999 | 0.978 | 0.990 | 0.997 | 



Model Size (Kilobytes)

| Dataset | Classification Tree | Naive Bayes | Support Vector Machine | Neural Network 
| ----------- | ----------- | ----------- | ----------- | ----------- |
| Car Evaluation | 17.712  | 5.341 | 142.076  | 92.098 |
| Abalone | 105.545 | 1.101 | 0.961 | 105.545 | 
| Madelon |23.131 | 16.696 | 7301.657 | 1611.615 | 
| KDD Cup | 20.967 | 2.005 | 1.071 | 144.074 | 


Before anything, we must import all the necessary packages.

In [1]:
using Pkg
Pkg.add("StableRNGs")
Pkg.add("MLJ")
Pkg.add("ScikitLearn")
Pkg.add("CSV")
Pkg.add("DataFrames")
Pkg.add("PyCall")
Pkg.add("Random")

using StableRNGs
using MLJ:partition
using ScikitLearn
using CSV
using DataFrames
using PyCall:pyimport
using Random


# For dumping files to get size
joblib = pyimport("joblib")

[32m[1m   Updating[22m[39m registry at `C:\Users\rabbl\.julia\registries\General`
[32m[1m  Resolving[22m[39m package versions...
[32m[1mNo Changes[22m[39m to `C:\Users\rabbl\OneDrive - University of Florida\Spring 2021\Intro to Data Science\code\Project.toml`
[32m[1mNo Changes[22m[39m to `C:\Users\rabbl\OneDrive - University of Florida\Spring 2021\Intro to Data Science\code\Manifest.toml`
[32m[1m  Resolving[22m[39m package versions...
[32m[1mNo Changes[22m[39m to `C:\Users\rabbl\OneDrive - University of Florida\Spring 2021\Intro to Data Science\code\Project.toml`
[32m[1mNo Changes[22m[39m to `C:\Users\rabbl\OneDrive - University of Florida\Spring 2021\Intro to Data Science\code\Manifest.toml`
[32m[1m  Resolving[22m[39m package versions...
[32m[1mNo Changes[22m[39m to `C:\Users\rabbl\OneDrive - University of Florida\Spring 2021\Intro to Data Science\code\Project.toml`
[32m[1mNo Changes[22m[39m to `C:\Users\rabbl\OneDrive - University of Florida\Sp

PyObject <module 'joblib' from 'C:\\Users\\rabbl\\.julia\\conda\\3\\lib\\site-packages\\joblib\\__init__.py'>

# Dataset 1: Car Evaluation

This dataset is relatively straight forward to work with. The data is clumped together, so we must partition the data into training and testing sets ourselves. Since it contains strings as some variables, we must also encode them to integers to be accepted into the model. No further processing is necessary.

In [2]:
# Load the Data and separate it into X (feature) and Y (class) data
# We also need to cinvert the dataframe into an array to that the model will accept it
df = CSV.File("Dataset1/car.data", header = false) |> DataFrame
carX = convert(Array, df[:, [1, 2, 3, 4, 5, 6]])
carY = convert(Array, df[:, 7])

# Generate the partition indices for training and subset data
rng = StableRNG(566)
train, test = partition(1:length(carY), 0.75, shuffle = true, rng = rng)

# Define the training and testing subsets
trainX = carX[train, :]
trainY = carY[train, :]
testX = carX[test, :]
testY = carY[test, :]

# Being that carX contains categorical (namely non-integer) data,
# we must encode it so that the model accepts it.
@sk_import preprocessing: OneHotEncoder
enc = OneHotEncoder(handle_unknown = "ignore")

# Fit the encoder to the whole x data
enc.fit(carX)

# Encode the train and test subsets using the fitted encoder
trainX = enc.transform(trainX).toarray()
testX = enc.transform(testX).toarray()


432×21 Array{Float64,2}:
 1.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  …  0.0  0.0  1.0  0.0  0.0  1.0  0.0
 1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0     1.0  0.0  1.0  0.0  0.0  1.0  0.0
 1.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0     0.0  0.0  1.0  0.0  0.0  1.0  0.0
 0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0     1.0  0.0  0.0  1.0  0.0  0.0  1.0
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0     0.0  0.0  0.0  1.0  1.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0  0.0  1.0  0.0  …  0.0  0.0  1.0  0.0  0.0  0.0  1.0
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0     1.0  0.0  1.0  0.0  1.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0     0.0  0.0  0.0  1.0  0.0  1.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0     1.0  0.0  1.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  1.0  1.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0  0.0  0.0  1.0
 0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  …  0.0  1.0  0.0  0.0  0.0  1.0  0.0
 1.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0     0.0  1.0  0.0  0.0  0.0  1.0  0.0
 1.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0

### Classification Tree

In [6]:
# Import the model
@sk_import  tree: DecisionTreeClassifier
tree_model = DecisionTreeClassifier()


# We now fit the model to the training data and time this process
print("Training time:")
@time begin
fit!(tree_model, trainX, trainY)
end 

# With the fitted mode, we make predictions and compare it to the testY, 
# to see if the model was correct. Get the proportion of correct predictions.
# We also time this process
print("Testing time:")
@time begin
accuracy = sum(predict(tree_model, testX) .== testY) / length(testY)
end

println("Proportion correct: ", accuracy)

# To get the model size we dump the variable into the local directory
# using a python module, then check the size using base julia.
joblib.dump(tree_model, "model")
sz = stat("model").size
print("Model size: $(sz/1000) KB")


Training time:  0.003052 seconds (2.61 k allocations: 41.344 KiB)
Testing time:  0.000720 seconds (2.65 k allocations: 77.656 KiB)
Proportion correct: 0.9675925925925926
Model size: 17.712 KB

**Note:** The process above for importing, training, testing, and modeling will be identical for each model within each data set. Thus, future comments will not be as comprehensive for the modeling code blocks.

### Naive Bayes

In [5]:
# Model
@sk_import  naive_bayes: CategoricalNB
bayes_model = CategoricalNB()

# Training and evaluation
print("Training time:")
@time begin
fit!(bayes_model, trainX, trainY)
end

print("Testing time:")
@time begin
accuracy = sum(predict(bayes_model, testX) .== testY) / length(testY)
end

println("Proportion correct: ", accuracy)

# Model size
joblib.dump(bayes_model, "model")
sz = stat("model").size
print("Model size: $(sz/1000) KB")

Training time:  0.005423 seconds (2.61 k allocations: 41.344 KiB)
Testing time:  0.001064 seconds (2.65 k allocations: 77.656 KiB)
Proportion correct: 0.8842592592592593
Model size: 5.341 KB

### Support Vector Machine

In [4]:
# Model
@sk_import  svm: SVC
support_model = SVC()

# Training and evaluation
print("Training time:")
@time begin
fit!(support_model, trainX, trainY)
end

print("Testing time:")
@time begin
accuracy = sum(predict(support_model, testX) .== testY) / length(testY)
end

println("Proportion correct: ", accuracy)

# Model size
joblib.dump(support_model, "model")
sz = stat("model").size
print("Model size: $(sz/1000) KB")


Training time:  0.053105 seconds (2.61 k allocations: 41.344 KiB)
Testing time:  0.024272 seconds (2.65 k allocations: 77.656 KiB)
Proportion correct: 0.9745370370370371
Model size: 142.076 KB

### Neural Network

In [3]:
# Model
@sk_import neural_network: MLPClassifier
nn_model = MLPClassifier()

# Training and evaluation
print("Training time:")
@time begin
fit!(nn_model, trainX, trainY)
end

print("Testing time:")
@time begin
accuracy = sum(predict(nn_model, testX) .== testY) / length(testY)
end

println("Proportion correct: ", accuracy)

# Model size
joblib.dump(nn_model, "model")
sz = stat("model").size
print("Model size: $(sz/1000) KB")


Training time:  1.069224 seconds (302.91 k allocations: 14.971 MiB)
Testing time:  0.247096 seconds (653.91 k allocations: 33.713 MiB, 4.02% gc time)
Proportion correct: 0.9976851851851852
Model size: 92.098 KB

# Dataset 2: Abalone

This dataset had only one string column, so instead of using OneHotEncoder (which is not great for high dimensional data), we use label encoder to manually encode certain columns.

The quirk about this dataset is that there are 29 classes to predict from. Thus, to make the problem more sensible, we group the classes together based on the median. This was done in spirit of a paper mentioned in the abalone.names data. We will group classes 1-8, 9-10, and 11-29 together as 3 distinct classes.

In [7]:
# Load the data, get the X and Y. Convert to arrays
df = CSV.File("Dataset2/abalone.data", header = false) |> DataFrame
abaX = convert(Array, df[:, 1:8])
abaY = convert(Array, df[:, 9])

# Transform y into 3 ranges based on ring number
# class 1-8: 0
# class 9-10: 1
# class 11-29: 2

# For all values less than or equal to 8, turn them into a 0
abaY[abaY .<= 8] .= 0

# For all values that are 9 or 10, turn them into a 1
abaY[abaY .== 9] .= 1
abaY[abaY .== 10] .= 1

# For all values greater than or equal to 11, turn them into a 2
abaY[abaY .>= 11] .= 2

# Now we generate the partitioned indices
rng = StableRNG(566)
train, test = partition(1:length(abaY), 0.75, shuffle = true, rng = rng)

#  Get the training data for Y. We need to encode the X data first, so we generate
# trainX and testX later.
trainY = abaY[train, :]
testY = abaY[test, :]

# Load the encoder and fit it to the whole X data
@sk_import preprocessing: LabelEncoder
lenc = LabelEncoder()
lenc.fit(abaX[:, 1])

# Encode the 1st column which contains string data
abaX[:, 1] = lenc.transform(abaX[:, 1])

# Subset the remaining train and test data for X
trainX = abaX[train, :]
testX = abaX[test, :]

1044×8 Array{Any,2}:
 2  0.595  0.465  0.175  1.115   0.4015  0.254   0.39
 2  0.62   0.495  0.18   1.2555  0.5765  0.254   0.355
 2  0.57   0.465  0.125  0.849   0.3785  0.1765  0.24
 1  0.275  0.2    0.065  0.1035  0.0475  0.0205  0.03
 1  0.485  0.375  0.13   0.6025  0.2935  0.1285  0.16
 2  0.505  0.4    0.125  0.77    0.2735  0.159   0.255
 2  0.62   0.49   0.19   1.218   0.5455  0.2965  0.355
 2  0.76   0.605  0.215  2.173   0.801   0.4915  0.646
 2  0.545  0.425  0.135  0.8445  0.373   0.21    0.235
 1  0.28   0.215  0.07   0.124   0.063   0.0215  0.03
 2  0.7    0.55   0.2    1.523   0.693   0.306   0.4405
 0  0.59   0.44   0.14   1.007   0.4775  0.2105  0.2925
 1  0.335  0.26   0.1    0.192   0.0785  0.0585  0.07
 ⋮                               ⋮               
 1  0.545  0.43   0.15   0.742   0.3525  0.158   0.208
 0  0.58   0.45   0.15   0.92    0.393   0.212   0.2895
 1  0.44   0.34   0.12   0.438   0.2115  0.083   0.12
 0  0.55   0.405  0.125  0.651   0.2965  0.137   0.2


### Classification Tree

In [9]:
# Model
@sk_import  tree: DecisionTreeRegressor
tree_model = DecisionTreeRegressor()

# Training and evaluation
print("Training time:")
@time begin
fit!(tree_model, trainX, trainY)
end 

print("Testing time:")
@time begin
accuracy = sum(predict(tree_model, testX) .== testY) / length(testY)
end

println("Proportion correct: ", accuracy)

# Model size
joblib.dump(tree_model, "model")
sz = stat("model").size
print("Model size: $(sz/1000) KB")


Training time:  0.022262 seconds (56.41 k allocations: 882.031 KiB)




Testing time:  0.003497 seconds (18.83 k allocations: 307.922 KiB)
Proportion correct: 0.5660919540229885
Model size: 106.185 KB

### Naive Bayes

In [10]:
# Model
@sk_import  naive_bayes: GaussianNB
bayes_model = GaussianNB()

# Training and evaluation
print("Training time:")
@time begin
fit!(bayes_model, trainX, trainY)
end

print("Testing time:")
@time begin
accuracy = sum(predict(bayes_model, testX) .== testY) / length(testY)
end

println("Proportion correct: ", accuracy)

# Model size
joblib.dump(bayes_model, "model")
sz = stat("model").size
print("Model size: $(sz/1000) KB")


Training time:  0.012057 seconds (56.41 k allocations: 882.031 KiB)
Testing time:  0.249947 seconds (1.03 M allocations: 48.094 MiB, 4.56% gc time)
Proportion correct: 0.5775862068965517
Model size: 1.101 KB

### Support Vector Machine

In [12]:
# Model
@sk_import  svm: LinearSVC
support_model =LinearSVC()

# Training and evaluation
print("Training time:")
@time begin
fit!(support_model, trainX, trainY)
end 

print("Testing time:")
@time begin
accuracy = sum(predict(support_model, testX) .== testY) / length(testY)
end

println("Proportion correct: ", accuracy)

# Model size
joblib.dump(support_model, "model")
sz = stat("model").size
print("Model size: $(sz/1000) KB")




Training time:  0.091146 seconds (56.41 k allocations: 882.031 KiB)
Testing time:  0.003598 seconds (18.85 k allocations: 309.078 KiB)
Proportion correct: 0.6513409961685823
Model size: 0.961 KB

### Neural Network

In [14]:
# Model
@sk_import neural_network: MLPClassifier
nn_model = MLPClassifier()

# Training and evaluation
print("Training time:")
@time begin
fit!(nn_model, trainX, trainY)
end 

print("Testing time:")
@time begin
accuracy = sum(predict(nn_model, testX) .== testY) / length(testY)
end

println("Proportion correct: ", accuracy)

# Model size
joblib.dump(tree_model, "model")
sz = stat("model").size
print("Model size: $(sz/1000) KB")




Training time:  1.816154 seconds (56.41 k allocations: 882.031 KiB)
Testing time:  0.004466 seconds (18.85 k allocations: 309.078 KiB)
Proportion correct: 0.6590038314176245
Model size: 106.185 KB

# Dataset 3: Madelon

Now this dataset is particularly nice to us since they already partitioned the data into training and testing data (which were called validation data in this case). Additionally, since all features were numeric, we do not need to perform any encoding.

However, we should still shuffle the data to induce some randomness, so we do that using the shuffle function from the Random package.

In [49]:
# Load data. It is already partitioned
dfTrain = CSV.File("Dataset3/madelon_train.data", header = false) |> DataFrame
dfTest = CSV.File("Dataset3/madelon_test.data", header = false) |> DataFrame
dfValid = CSV.File("Dataset3/madelon_valid.data", header = false) |> DataFrame

dfTrainLabels = CSV.File("Dataset3/madelon_train.labels", header = false) |> DataFrame
dfValidLabels = CSV.File("Dataset3/madelon_valid.labels", header = false) |> DataFrame

# Get the x and y data, while converting to arrays
trainX = convert(Array, dfTrain[:, 1:500])
trainY = convert(Array, dfTrainLabels)

testX = convert(Array, dfValid[:, 1:500])
testY = convert(Array, dfValidLabels)

# Shuffle the data. Generate shuffled indices then index the data
rng = StableRNG(566)
trainShuffle = Random.shuffle(rng, 1:length(trainY))
testShuffle = Random.shuffle(rng, 1:length(testY))

# Now we index by the shuffled indices to get the same data, albeit shuffed
trainX = trainX[trainShuffle, :]
trainY = trainY[trainShuffle, :]

testX = testX[testShuffle, :]
testY = testY[testShuffle, :]



600×1 Array{Int64,2}:
 -1
 -1
  1
 -1
 -1
 -1
  1
  1
 -1
 -1
  1
 -1
  1
  ⋮
  1
  1
  1
 -1
  1
  1
  1
 -1
  1
 -1
 -1
 -1

### Classification Tree

In [50]:
# Model
@sk_import  tree: DecisionTreeClassifier
tree_model = DecisionTreeClassifier()

# Training and evaluation
print("Training time:")
@time begin
fit!(tree_model, trainX, trainY)
end 

print("Testing time:")
@time begin
accuracy = sum(predict(tree_model, testX) .== testY) / length(testY)
end

println("Proportion correct: ", accuracy)

# Model size
joblib.dump(tree_model, "model")
sz = stat("model").size
print("Model size: $(sz/1000) KB")




Training time:  0.391014 seconds (22 allocations: 1.125 KiB)
Testing time:  0.001105 seconds (62 allocations: 12.141 KiB)
Proportion correct: 0.7316666666666667
Model size: 23.131 KB

### Naive Bayes

In [51]:
# Model
@sk_import  naive_bayes: GaussianNB
bayes_model =  GaussianNB()

# Training and evaluation
print("Training time:")
@time begin
fit!(bayes_model, trainX, trainY)
end

print("Testing time:")
@time begin
accuracy = sum(predict(bayes_model, testX) .== testY) / length(testY)
end

println("Proportion correct: ", accuracy)

# Model size
joblib.dump(bayes_model, "model")
sz = stat("model").size
print("Model size: $(sz/1000) KB")




Training time:  0.014903 seconds (22 allocations: 1.125 KiB)
Testing time:  0.004305 seconds (62 allocations: 12.141 KiB)
Proportion correct: 0.5916666666666667
Model size: 16.696 KB

### Support Vector Machine

In [52]:
# Model
@sk_import  svm: SVC
support_model = SVC()

# Training and evaluation
print("Training time:")
@time begin
fit!(support_model, trainX, trainY)
end 

print("Testing time:")
@time begin
accuracy = sum(predict(support_model, testX) .== testY) / length(testY)
end

println("Proportion correct: ", accuracy)

# Model size
joblib.dump(support_model, "model")
sz = stat("model").size
print("Model size: $(sz/1000) KB")




Training time:  0.616664 seconds (22 allocations: 1.125 KiB)
Testing time:  0.292484 seconds (62 allocations: 12.141 KiB)
Proportion correct: 0.6866666666666666
Model size: 7301.657 KB

### Neural Network

In [53]:
# Model
@sk_import neural_network: MLPClassifier
nn_model = MLPClassifier()

# Training and evaluation
print("Training time:")
@time begin
fit!(nn_model, trainX, trainY)
end 

print("Testing time:")
@time begin
accuracy = sum(predict(nn_model, testX) .== testY) / length(testY)
end

println("Proportion correct: ", accuracy)

# Model size
joblib.dump(nn_model, "model")
sz = stat("model").size
print("Model size: $(sz/1000) KB")




Training time:  0.399532 seconds (22 allocations: 1.125 KiB)
Testing time:  0.002390 seconds (62 allocations: 12.141 KiB)
Proportion correct: 0.5383333333333333
Model size: 1611.615 KB

# Dataset 4: KDD Cup

This dataset is extremely large, and for computational ease, we are only using the 10% data, which still contains roughly 400 thousand rows. It also has about 42 columns, meaning this data has very high dimensionality. Combined with the fact that it has string data, we opt to use label encoding (instead of OneHot) so that dimensionality isn't increased.

Much like the Abalone dataset, there are many classes. In the context of this dataset, a class variable is either known as "normal," or an "attack," of which there are several types of the latter. We reduce the data into just "normal" or "attack" classes.

In [54]:
# Load the data and convert to arrays
df = CSV.File("Dataset4/kddcup.data_10_percent_corrected", header = false) |> DataFrame
kddX = convert(Array, df[:, 1:41])
kddY = convert(Array, df[:, 42])


# There are several attack types. must map normal to normal and all others to something else
# Convert to binary classification. not-attack vs. attack
kddY[kddY .== "normal."] .= "0"
kddY[kddY .!= "0"] .= "1"

# Being that they were strings, we now parse the data into integers to push into the model
kddY = parse.(Int64, kddY)


# However, there are still string columns remaining, so we must encode them
# There are 3 of them precisely.
@sk_import preprocessing: LabelEncoder
lenc2 = LabelEncoder()
lenc3 = LabelEncoder()
lenc4 = LabelEncoder()

lenc2.fit(kddX[:, 2])
kddX[:, 2] = lenc2.transform(kddX[:, 2])

lenc3.fit(kddX[:, 3])
kddX[:, 3] = lenc3.transform(kddX[:, 3])

lenc4.fit(kddX[:, 4])
kddX[:, 4] = lenc4.transform(kddX[:, 4])


# Now we partition the data into training and test sets like as done before
rng = StableRNG(566)

train, test = partition(1:length(kddY), 0.75, shuffle = true, rng = rng)

trainX = kddX[train, :]
trainY = kddY[train, :]

testX = kddX[test, :]
testY = kddY[test, :]



123505×1 Array{Int64,2}:
 0
 1
 1
 1
 1
 1
 0
 1
 1
 1
 0
 1
 0
 ⋮
 1
 1
 1
 1
 1
 1
 1
 0
 0
 1
 1
 1

### Classification Tree

In [55]:
# Model
@sk_import  tree: DecisionTreeClassifier
tree_model = DecisionTreeClassifier()

# Training and evaluation
print("Training time:")
@time begin
fit!(tree_model, trainX, trainY)
end 

print("Testing time:")
@time begin
accuracy = sum(predict(tree_model, testX) .== testY) / length(testY)
end

println("Proportion correct: ", accuracy)

# Model size
joblib.dump(tree_model, "model")
sz = stat("model").size
print("Model size: $(sz/1000) KB")




Training time: 12.201162 seconds (31.12 M allocations: 474.906 MiB, 7.79% gc time)
Testing time:  2.920390 seconds (10.37 M allocations: 159.265 MiB)
Proportion correct: 0.9997166106635359
Model size: 20.967 KB

### Naive Bayes

In [57]:
# Model
@sk_import  naive_bayes: GaussianNB
bayes_model = GaussianNB()

# Training and evaluation
print("Training time:")
@time begin
fit!(bayes_model, trainX, trainY)
end

print("Testing time:")
@time begin
accuracy = sum(predict(bayes_model, testX) .== testY) / length(testY)
end

println("Proportion correct: ", accuracy)

# Model size
joblib.dump(bayes_model, "model")
sz = stat("model").size
print("Model size: $(sz/1000) KB")




Training time: 12.587327 seconds (31.12 M allocations: 474.905 MiB, 12.08% gc time)
Testing time:  3.126324 seconds (10.37 M allocations: 159.265 MiB)
Proportion correct: 0.9788510586615926
Model size: 2.005 KB

### Support Vector Machine

In [58]:
# Model
@sk_import  svm: LinearSVC
support_model = LinearSVC()

# Training and evaluation
print("Training time:")
@time begin
fit!(support_model, trainX, trainY)
end 

print("Testing time:")
@time begin
accuracy = sum(predict(support_model, testX) .== testY) / length(testY)
end

println("Proportion correct: ", accuracy)

# Model size
joblib.dump(support_model, "model")
sz = stat("model").size
print("Model size: $(sz/1000) KB")




Training time: 70.719620 seconds (31.12 M allocations: 474.905 MiB, 1.08% gc time)
Testing time:  3.197580 seconds (10.37 M allocations: 159.265 MiB)
Proportion correct: 0.9900408890328327
Model size: 1.071 KB

### Neural Network

In [59]:
# Model
@sk_import neural_network: MLPClassifier
nn_model = MLPClassifier()

# Training and evaluation
print("Training time:")
@time begin
fit!(nn_model, trainX, trainY)
end 

print("Testing time:")
@time begin
accuracy = sum(predict(nn_model, testX) .== testY) / length(testY)
end

println("Proportion correct: ", accuracy)

# Model size
joblib.dump(nn_model, "model")
sz = stat("model").size
print("Model size: $(sz/1000) KB")




Training time:177.687514 seconds (31.12 M allocations: 474.905 MiB, 0.44% gc time)
Testing time:  3.299938 seconds (10.37 M allocations: 159.265 MiB)
Proportion correct: 0.9975871422209627
Model size: 144.074 KB