### DataFrames basics

For a complete overview, see the official documentation at `https://juliadata.github.io/DataFrames.jl/stable/`.

In [1]:
using DataFrames

DataFrame construction

From an Array

In [2]:
df = DataFrame(randn(5,4), :auto) # :auto generates the column names x1, x2, ... (required by recent DataFrames.jl versions)

Unnamed: 0_level_0,x1,x2,x3,x4
Unnamed: 0_level_1,Float64,Float64,Float64,Float64
1,-1.2539,-0.669164,0.221726,-0.275339
2,1.16686,-1.72402,-0.585281,-0.847573
3,-1.00434,-1.08028,-0.199503,-1.44058
4,-1.22948,1.25691,0.843868,0.149301
5,-0.186074,2.50235,-0.45772,1.07423


Constructing a DataFrame from named columns is usually more useful.

In [3]:
cars = DataFrame(
    brand = ["Volvo", "Volkswagen", "Škoda"], 
    motor = [2.0, 1.6, 1.3], 
    doors = [3, 5, 5])

Unnamed: 0_level_0,brand,motor,doors
Unnamed: 0_level_1,String,Float64,Int64
1,Volvo,2.0,3
2,Volkswagen,1.6,5
3,Škoda,1.3,5


Indexing by position, as for a matrix, also works, although it is rarely the most convenient option.

In [4]:
cars[2:3, 1:2]

Unnamed: 0_level_0,brand,motor
Unnamed: 0_level_1,String,Float64
1,Volkswagen,1.6
2,Škoda,1.3


list of column names - you can use this to iterate over columns

In [5]:
names(cars)

3-element Array{String,1}:
 "brand"
 "motor"
 "doors"

column names as symbols

In [6]:
propertynames(cars)

3-element Array{Symbol,1}:
 :brand
 :motor
 :doors

getting a column - is a vector

In [7]:
cars.brand

3-element Array{String,1}:
 "Volvo"
 "Volkswagen"
 "Škoda"

In [10]:
cars[!,"brand"]

3-element Array{String,1}:
 "Volvo"
 "Volkswagen"
 "Škoda"

In [11]:
cars[!,:brand]

3-element Array{String,1}:
 "Volvo"
 "Volkswagen"
 "Škoda"

In [12]:
col = :doors
cars[!, col]

3-element Array{Int64,1}:
 3
 5
 5
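
The `!` and `:` row selectors used above differ in one respect that is easy to miss. A minimal sketch, assuming a recent DataFrames.jl release and a throwaway data frame `demo` that is not part of the notebook: `!` returns the column stored in the data frame, while `:` returns a copy.

```julia
using DataFrames

# Hypothetical toy data, used only for illustration.
demo = DataFrame(brand = ["Volvo", "Škoda"], doors = [3, 5])

stored = demo[!, :doors]   # the vector held by the data frame (no copy)
copied = demo[:, :doors]   # a freshly allocated copy

stored[1] = 4              # also changes demo
copied[2] = 2              # leaves demo untouched

println(demo.doors)        # [4, 5]; demo.doors is the same object as demo[!, :doors]
```

Use `:` when the result will be mutated independently of the data frame, and `!` or the dot syntax when avoiding the copy matters.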

getting multiple columns - is a DataFrame

In [13]:
cars[:,[:brand, :motor]]

Unnamed: 0_level_0,brand,motor
Unnamed: 0_level_1,String,Float64
1,Volvo,2.0
2,Volkswagen,1.6
3,Škoda,1.3


In [14]:
cars[:, Not(:brand)]

Unnamed: 0_level_0,motor,doors
Unnamed: 0_level_1,Float64,Int64
1,2.0,3
2,1.6,5
3,1.3,5


Getting a single row, e.g. `cars[1, :]`, gives a DataFrameRow. A range of rows combined with a list of columns gives a DataFrame.

In [19]:
cars[1:2,[:brand, :motor]]

Unnamed: 0_level_0,brand,motor
Unnamed: 0_level_1,String,Float64
1,Volvo,2.0
2,Volkswagen,1.6


multiple rows - is a DataFrame

In [20]:
cars[1:2, :]

Unnamed: 0_level_0,brand,motor,doors
Unnamed: 0_level_1,String,Float64,Int64
1,Volvo,2.0,3
2,Volkswagen,1.6,5


adding a column to an existing df

In [21]:
cars[:,:country]

ArgumentError: ArgumentError: column name :country not found in the data frame

In [22]:
cars[:,:country] = ["Sweden", "Germany", "Czech Republic"];
cars

Unnamed: 0_level_0,brand,motor,doors,country
Unnamed: 0_level_1,String,Float64,Int64,String
1,Volvo,2.0,3,Sweden
2,Volkswagen,1.6,5,Germany
3,Škoda,1.3,5,Czech Republic


adding a row to an existing df

In [23]:
a = [1,2,3]

3-element Array{Int64,1}:
 1
 2
 3

In [24]:
push!(a, 4)

4-element Array{Int64,1}:
 1
 2
 3
 4

In [25]:
a

4-element Array{Int64,1}:
 1
 2
 3
 4

In [26]:
push!(cars, ["Fiat", 1.0, 3, "Italy"])
push!(cars, ["Chrysler", 2.4, 5, "USA"])

Unnamed: 0_level_0,brand,motor,doors,country
Unnamed: 0_level_1,String,Float64,Int64,String
1,Volvo,2.0,3,Sweden
2,Volkswagen,1.6,5,Germany
3,Škoda,1.3,5,Czech Republic
4,Fiat,1.0,3,Italy
5,Chrysler,2.4,5,USA


`describe` returns a DataFrame that summarizes each column of its argument.

In [27]:
describe(cars)

Unnamed: 0_level_0,variable,mean,min,median,max,nunique,nmissing,eltype
Unnamed: 0_level_1,Symbol,Union…,Any,Union…,Any,Union…,Nothing,DataType
1,brand,,Chrysler,,Škoda,5.0,,String
2,motor,1.66,1.0,1.6,2.4,,,Float64
3,doors,4.2,3,5.0,5,,,Int64
4,country,,Czech Republic,,USA,5.0,,String


In [29]:
for col in names(cars)
    println(cars[1,col])
end

Volvo
2.0
3
Sweden


Iterating over rows of a DataFrame

In [31]:
car_description(r::DataFrameRow) = println("A $(r.brand) with $(r.doors) doors and a $(r.motor)l motor.")

car_description (generic function with 1 method)

In [32]:
car_description.(eachrow(cars));

A Volvo with 3 doors and a 2.0l motor.
A Volkswagen with 5 doors and a 1.6l motor.
A Škoda with 5 doors and a 1.3l motor.
A Fiat with 3 doors and a 1.0l motor.
A Chrysler with 5 doors and a 2.4l motor.
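
Broadcasting a printing function works, but it also allocates an array of `nothing` values. As a sketch on a hypothetical two-row data frame `demo`, explicit loops over `eachrow` and `eachcol` do the same job without that by-product; `pairs(eachcol(...))` yields each column together with its name.

```julia
using DataFrames

# Hypothetical toy data, used only for illustration.
demo = DataFrame(brand = ["Volvo", "Škoda"], doors = [3, 5])

# Row-wise iteration: each `row` is a DataFrameRow.
for row in eachrow(demo)
    println(row.brand, " has ", row.doors, " doors")
end

# Column-wise iteration: `name` is a Symbol, `col` is the column vector.
for (name, col) in pairs(eachcol(demo))
    println(name, " holds values of type ", eltype(col))
end
```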


#### A small join example

In [33]:
brands = DataFrame(parent = ["Fiat", "Volkswagen", "Volkswagen", "Geely", "Fiat"], brand = ["Fiat", "Volkswagen", "Škoda", "Volvo", "Chrysler"])

Unnamed: 0_level_0,parent,brand
Unnamed: 0_level_1,String,String
1,Fiat,Fiat
2,Volkswagen,Volkswagen
3,Volkswagen,Škoda
4,Geely,Volvo
5,Fiat,Chrysler


In [34]:
cars

Unnamed: 0_level_0,brand,motor,doors,country
Unnamed: 0_level_1,String,Float64,Int64,String
1,Volvo,2.0,3,Sweden
2,Volkswagen,1.6,5,Germany
3,Škoda,1.3,5,Czech Republic
4,Fiat,1.0,3,Italy
5,Chrysler,2.4,5,USA


In [35]:
cars = leftjoin(cars, brands, on = :brand)

Unnamed: 0_level_0,brand,motor,doors,country,parent
Unnamed: 0_level_1,String,Float64,Int64,String,String?
1,Volvo,2.0,3,Sweden,Geely
2,Volkswagen,1.6,5,Germany,Volkswagen
3,Škoda,1.3,5,Czech Republic,Volkswagen
4,Fiat,1.0,3,Italy,Fiat
5,Chrysler,2.4,5,USA,Fiat
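
The `String?` type of the new `parent` column is shorthand for `Union{Missing, String}`: a left join keeps every row of the left table and fills in `missing` where the right table has no match. A small sketch with hypothetical tables (`left` and `owners` are not part of the notebook) makes this visible:

```julia
using DataFrames

# Hypothetical toy tables; "Saab" has no entry in `owners`.
left   = DataFrame(brand = ["Volvo", "Saab"], doors = [3, 5])
owners = DataFrame(brand = ["Volvo"], parent = ["Geely"])

joined = leftjoin(left, owners, on = :brand)

println(eltype(joined.parent))             # Union{Missing, String}
println(count(ismissing, joined.parent))   # 1, the unmatched "Saab" row
```

In the notebook every brand has a parent, so the column allows `missing` without actually containing one.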


Sorting

In [36]:
sort(cars, :motor)

Unnamed: 0_level_0,brand,motor,doors,country,parent
Unnamed: 0_level_1,String,Float64,Int64,String,String?
1,Fiat,1.0,3,Italy,Fiat
2,Škoda,1.3,5,Czech Republic,Volkswagen
3,Volkswagen,1.6,5,Germany,Volkswagen
4,Volvo,2.0,3,Sweden,Geely
5,Chrysler,2.4,5,USA,Fiat


In [37]:
sort(cars, [:doors, :motor], rev=true)

Unnamed: 0_level_0,brand,motor,doors,country,parent
Unnamed: 0_level_1,String,Float64,Int64,String,String?
1,Chrysler,2.4,5,USA,Fiat
2,Volkswagen,1.6,5,Germany,Volkswagen
3,Škoda,1.3,5,Czech Republic,Volkswagen
4,Volvo,2.0,3,Sweden,Geely
5,Fiat,1.0,3,Italy,Fiat


Filtering

In [38]:
filter(r -> r[:doors] > 3 && r[:motor] < 2, cars)

Unnamed: 0_level_0,brand,motor,doors,country,parent
Unnamed: 0_level_1,String,Float64,Int64,String,String?
1,Volkswagen,1.6,5,Germany,Volkswagen
2,Škoda,1.3,5,Czech Republic,Volkswagen


In [39]:
cars

Unnamed: 0_level_0,brand,motor,doors,country,parent
Unnamed: 0_level_1,String,Float64,Int64,String,String?
1,Volvo,2.0,3,Sweden,Geely
2,Volkswagen,1.6,5,Germany,Volkswagen
3,Škoda,1.3,5,Czech Republic,Volkswagen
4,Fiat,1.0,3,Italy,Fiat
5,Chrysler,2.4,5,USA,Fiat


In [40]:
filter!(r -> r[:doors] > 3 && r[:motor] < 2, cars)

Unnamed: 0_level_0,brand,motor,doors,country,parent
Unnamed: 0_level_1,String,Float64,Int64,String,String?
1,Volkswagen,1.6,5,Germany,Volkswagen
2,Škoda,1.3,5,Czech Republic,Volkswagen


In [41]:
cars

Unnamed: 0_level_0,brand,motor,doors,country,parent
Unnamed: 0_level_1,String,Float64,Int64,String,String?
1,Volkswagen,1.6,5,Germany,Volkswagen
2,Škoda,1.3,5,Czech Republic,Volkswagen
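
`filter` also accepts a `columns => predicate` pair, which passes only the selected columns to the predicate instead of a whole row. The following is a sketch on a hypothetical copy of the first three cars, not a replacement for the cells above:

```julia
using DataFrames

# Hypothetical toy data mirroring the first three rows of `cars`.
demo = DataFrame(brand = ["Volvo", "Volkswagen", "Škoda"],
                 motor = [2.0, 1.6, 1.3],
                 doors = [3, 5, 5])

# Same condition as above, written with column selectors instead of a row closure.
small = filter([:doors, :motor] => (d, m) -> d > 3 && m < 2, demo)

println(nrow(small))   # 2
println(nrow(demo))    # 3, since filter returned a new data frame
```

As in the notebook, `filter` leaves its argument unchanged, whereas `filter!` removes the rows in place.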


### A simple ML experiment

Load the results of several experiments, merge them into one big DataFrame, average the performance over folds and look for the best model. The data was generated with the `MLJ` package by the script `generate_dataframes.jl`.

The data is saved in this folder:

In [42]:
savepath = "./data"
files = readdir(savepath)

4-element Array{String,1}:
 "knn_crabs.csv"
 "knn_iris.csv"
 "xgboost_crabs.csv"
 "xgboost_iris.csv"

In [43]:
using CSV # for data reading
using Statistics # for averaging
results = map(x->CSV.read(joinpath(savepath,x), DataFrame), files) # recent CSV.jl versions require the sink type as the second argument
show(results[1], allcols = true, splitcols=false)

12×6 DataFrame
│ Row │ dataset │ model         │ parameters │ fold  │ cross_entropy │ auc      │
│     │ String  │ String        │ String     │ Int64 │ Float64?      │ Float64? │
├─────┼─────────┼───────────────┼────────────┼───────┼───────────────┼──────────┤
│ 1   │ crabs   │ KNNClassifier │ K=1        │ 1     │ 1.07082       │ missing  │
│ 2   │ crabs   │ KNNClassifier │ K=1        │ 2     │ 1.80432       │ 0.945303 │
│ 3   │ crabs   │ KNNClassifier │ K=1        │ 3     │ 1.62229       │ 0.943503 │
│ 4   │ crabs   │ KNNClassifier │ K=11       │ 1     │ 0.3762        │ 0.936245 │
│ 5   │ crabs   │ KNNClassifier │ K=11       │ 2     │ 0.371544      │ missing  │
│ 6   │ crabs   │ KNNClassifier │ K=11       │ 3     │ missing       │ 0.913999 │
│ 7   │ crabs   │ KNNClassifier │ K=101      │ 1     │ 0.676675      │ 0.669904 │
│ 8   │ crabs   │ KNNClassifier │ K=101      │ 2     │ 0.66818       │ 0.66

#### Missing values

Most operations on a `missing` value, including reductions such as `mean` over a vector that contains one, return `missing`.

In [47]:
mean([1, 2, missing, 3])

missing
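
A few Base-only lines, given as a sketch, show what propagates `missing` and what is defined to handle it explicitly:

```julia
# Arithmetic and comparisons propagate missing ...
println(1 + missing)            # missing
println(missing > 0)            # missing

# ... while some functions are defined to handle it explicitly.
println(ismissing(missing))     # true
println(coalesce(missing, 0))   # 0, the first non-missing argument
println(length([1, missing]))   # 2
```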

That's why we use `skipmissing`.

In [49]:
collect(skipmissing([1, 2, missing, 3]))

3-element Array{Int64,1}:
 1
 2
 3

In [50]:
mean(skipmissing([1, 2, missing, 3]))

2.0

Now that we have the DataFrames from the individual experiments, let's concatenate them into one.

In [53]:
resdf = vcat(results...)
show(resdf, splitcols=false)

48×6 DataFrame
│ Row │ dataset │ model             │ parameters  │ fold  │ cross_entropy │ auc      │
│     │ String  │ String            │ String      │ Int64 │ Float64?      │ Float64? │
├─────┼─────────┼───────────────────┼─────────────┼───────┼───────────────┼──────────┤
│ 1   │ crabs   │ KNNClassifier     │ K=1         │ 1     │ 1.07082       │ missing  │
│ 2   │ crabs   │ KNNClassifier     │ K=1         │ 2     │ 1.80432       │ 0.945303 │
│ 3   │ crabs   │ KNNClassifier     │ K=1         │ 3     │ 1.62229       │ 0.943503 │
│ 4   │ crabs   │ KNNClassifier     │ K=11        │ 1     │ 0.3762        │ 0.936245 │
│ 5   │ crabs   │ KNNClassifier     │ K=11        │ 2     │ 0.371544      │ missing  │
│ 6   │ crabs   │ KNNClassifier     │ K=11        │ 3     │ missing       │ 0.913999 │
│ 7   │ crabs   │ KNNClassifier     │ K=101       │ 1     │ 0.676675      │ 0.669904 │
│ 8   │ crabs   │ KNNClas

Now aggregate them over folds.

In [58]:
groupby(resdf, [:dataset, :model, :parameters])[5]

Unnamed: 0_level_0,dataset,model,parameters,fold,cross_entropy,auc
Unnamed: 0_level_1,String,String,String,Int64,Float64?,Float64?
1,iris,KNNClassifier,K=1,1,1.44175,0.977176
2,iris,KNNClassifier,K=1,2,1.44175,0.951051
3,iris,KNNClassifier,K=1,3,1.44175,0.954223


In [59]:
agdf = combine(groupby(resdf, [:dataset, :model, :parameters]), names(resdf, Not([:dataset, :model, :parameters])) .=> mean)
show(agdf, allrows=true, splitcols=false)

16×6 DataFrame
│ Row │ dataset │ model             │ parameters  │ fold_mean │ cross_entropy_mean │ auc_mean │
│     │ String  │ String            │ String      │ Float64   │ Float64?           │ Float64? │
├─────┼─────────┼───────────────────┼─────────────┼───────────┼────────────────────┼──────────┤
│ 1   │ crabs   │ KNNClassifier     │ K=1         │ 2.0       │ 1.49914            │ missing  │
│ 2   │ crabs   │ KNNClassifier     │ K=11        │ 2.0       │ missing            │ missing  │
│ 3   │ crabs   │ KNNClassifier     │ K=101       │ 2.0       │ 0.672827           │ 0.670749 │
│ 4   │ crabs   │ KNNClassifier     │ K=301       │ 2.0       │ missing            │ missing  │
│ 5   │ iris    │ KNNClassifier     │ K=1         │ 2.0       │ 1.44175            │ 0.960816 │
│ 6   │ iris    │ KNNClassifier     │ K=11        │ 2.0       │ 0.0969837          │ 0.996452 │
│ 7   │ iri
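
The `_mean` suffix in the column names comes from the `column => function` form, which derives the output name automatically. As a sketch on hypothetical scores (the `scores` table is not part of the notebook), the three-part form `column => function => name` sets the output name directly:

```julia
using DataFrames, Statistics

# Hypothetical scores for two models over two folds.
scores = DataFrame(model = ["a", "a", "b", "b"],
                   fold  = [1, 2, 1, 2],
                   auc   = [0.8, 0.9, 0.6, 0.7])

# `:auc => mean` names the result `auc_mean` automatically.
auto_named = combine(groupby(scores, :model), :auc => mean)

# `:auc => mean => :auc` chooses the output name explicitly.
renamed = combine(groupby(scores, :model), :auc => mean => :auc)

println(names(auto_named))   # ["model", "auc_mean"]
println(names(renamed))      # ["model", "auc"]
```

Selecting only the metric columns in the first place would also avoid computing a meaningless mean of the `fold` column.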

Where do the missing values come from?

In [61]:
filter(r->ismissing(r[:cross_entropy]) || ismissing(r[:auc]), resdf)

Unnamed: 0_level_0,dataset,model,parameters,fold,cross_entropy,auc
Unnamed: 0_level_1,String,String,String,Int64,Float64?,Float64?
1,crabs,KNNClassifier,K=1,1,1.07082,missing
2,crabs,KNNClassifier,K=11,2,0.371544,missing
3,crabs,KNNClassifier,K=11,3,missing,0.913999
4,crabs,KNNClassifier,K=301,1,missing,missing
5,crabs,KNNClassifier,K=301,2,missing,missing
6,crabs,KNNClassifier,K=301,3,missing,missing
7,iris,KNNClassifier,K=301,1,missing,missing
8,iris,KNNClassifier,K=301,2,missing,missing
9,iris,KNNClassifier,K=301,3,missing,missing


In [66]:
missing === missing

true

In [67]:
NaN === NaN

true

In [64]:
ismissing(3)

false
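
The two `===` results above are about identity, which is why both are `true`. A short Base-only sketch contrasts this with `==` and `isequal`, which behave differently for `missing` and `NaN`:

```julia
# `===` tests whether two values are indistinguishable.
println(missing === missing)         # true
println(NaN === NaN)                 # true

# `==` follows value semantics: missing propagates, NaN never equals itself.
println(missing == missing)          # missing
println(NaN == NaN)                  # false

# `isequal` treats both as equal to themselves; dictionaries rely on it.
println(isequal(missing, missing))   # true
println(isequal(NaN, NaN))           # true
```

This is why `ismissing(x)` rather than `x == missing` is the right test inside a filter predicate.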

For K=301 every fold is missing, so the aggregate should stay `missing`. For the other parameter values we just want to ignore the occasional missing fold.

In [68]:
mean(skipmissing([1,2,missing]))

1.5

In [69]:
mean(skipmissing([missing, missing, missing]))

MethodError: MethodError: reduce_empty(::typeof(Base.add_sum), ::Type{Union{}}) is ambiguous. Candidates:
  reduce_empty(::typeof(Base.add_sum), ::Type{T}) where T<:Union{Int16, Int32, Int8} in Base at reduce.jl:314
  reduce_empty(::typeof(Base.add_sum), ::Type{T}) where T<:Union{UInt16, UInt32, UInt8} in Base at reduce.jl:315
Possible fix, define
  reduce_empty(::typeof(Base.add_sum), ::Type{Union{}})

In [70]:
missmean(x) = all(ismissing, x) ? missing : mean(skipmissing(x)) # this returns mean ignoring the missing elements, but if all elements of x are missing, it returns missing

missmean (generic function with 1 method)
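
A quick check of `missmean` on hand-made vectors, as a sketch; the definition is repeated so that the snippet runs on its own:

```julia
using Statistics

# Same definition as in the cell above.
missmean(x) = all(ismissing, x) ? missing : mean(skipmissing(x))

println(missmean([1, 2, missing]))      # 1.5
println(missmean([missing, missing]))   # missing
println(missmean([0.5, 1.5]))           # 1.0
```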

In [71]:
agdf = combine(groupby(resdf, [:dataset, :model, :parameters]), names(resdf, Not([:dataset, :model, :parameters])) .=> missmean)

Unnamed: 0_level_0,dataset,model,parameters,fold_missmean,cross_entropy_missmean
Unnamed: 0_level_1,String,String,String,Float64,Float64?
1,crabs,KNNClassifier,K=1,2.0,1.49914
2,crabs,KNNClassifier,K=11,2.0,0.373872
3,crabs,KNNClassifier,K=101,2.0,0.672827
4,crabs,KNNClassifier,K=301,2.0,missing
5,iris,KNNClassifier,K=1,2.0,1.44175
6,iris,KNNClassifier,K=11,2.0,0.0969837
7,iris,KNNClassifier,K=101,2.0,0.899702
8,iris,KNNClassifier,K=301,2.0,missing
9,crabs,XGBoostClassifier,max_depth=2,2.0,0.214894
10,crabs,XGBoostClassifier,max_depth=4,2.0,0.211701


In [72]:
agdf = agdf[!,Not(:fold_missmean)] # drop the means of folds

Unnamed: 0_level_0,dataset,model,parameters,cross_entropy_missmean,auc_missmean
Unnamed: 0_level_1,String,String,String,Float64?,Float64?
1,crabs,KNNClassifier,K=1,1.49914,0.944403
2,crabs,KNNClassifier,K=11,0.373872,0.925122
3,crabs,KNNClassifier,K=101,0.672827,0.670749
4,crabs,KNNClassifier,K=301,missing,missing
5,iris,KNNClassifier,K=1,1.44175,0.960816
6,iris,KNNClassifier,K=11,0.0969837,0.996452
7,iris,KNNClassifier,K=101,0.899702,0.466738
8,iris,KNNClassifier,K=301,missing,missing
9,crabs,XGBoostClassifier,max_depth=2,0.214894,0.976127
10,crabs,XGBoostClassifier,max_depth=4,0.211701,0.968631


In [73]:
rename!(agdf, :cross_entropy_missmean => :cross_entropy) # rename the aggregated columns
rename!(agdf, :auc_missmean => :auc)
show(agdf, allrows=true, splitcols=false)

16×5 DataFrame
│ Row │ dataset │ model             │ parameters  │ cross_entropy │ auc      │
│     │ String  │ String            │ String      │ Float64?      │ Float64? │
├─────┼─────────┼───────────────────┼─────────────┼───────────────┼──────────┤
│ 1   │ crabs   │ KNNClassifier     │ K=1         │ 1.49914       │ 0.944403 │
│ 2   │ crabs   │ KNNClassifier     │ K=11        │ 0.373872      │ 0.925122 │
│ 3   │ crabs   │ KNNClassifier     │ K=101       │ 0.672827      │ 0.670749 │
│ 4   │ crabs   │ KNNClassifier     │ K=301       │ missing       │ missing  │
│ 5   │ iris    │ KNNClassifier     │ K=1         │ 1.44175       │ 0.960816 │
│ 6   │ iris    │ KNNClassifier     │ K=11        │ 0.0969837     │ 0.996452 │
│ 7   │ iris    │ KNNClassifier     │ K=101       │ 0.899702      │ 0.466738 │
│ 8   │ iris    │ KNNClassifier     │ K=301       │ missing       │ missing  │
│ 9   │ crabs   │ XGBoostC

What is the best model on each dataset?

In [74]:
combine(x->sort(x, :cross_entropy), groupby(agdf, [:dataset]), ungroup = false)

Unnamed: 0_level_0,dataset,model,parameters,cross_entropy,auc
Unnamed: 0_level_1,String,String,String,Float64?,Float64?
1,crabs,XGBoostClassifier,max_depth=4,0.211701,0.968631
2,crabs,XGBoostClassifier,max_depth=2,0.214894,0.976127
3,crabs,XGBoostClassifier,max_depth=6,0.216623,0.976535
4,crabs,XGBoostClassifier,max_depth=8,0.241723,0.97217
5,crabs,KNNClassifier,K=11,0.373872,0.925122
6,crabs,KNNClassifier,K=101,0.672827,0.670749
7,crabs,KNNClassifier,K=1,1.49914,0.944403
8,crabs,KNNClassifier,K=301,missing,missing

Unnamed: 0_level_0,dataset,model,parameters,cross_entropy,auc
Unnamed: 0_level_1,String,String,String,Float64?,Float64?
1,iris,KNNClassifier,K=11,0.0969837,0.996452
2,iris,XGBoostClassifier,max_depth=6,0.211618,0.964671
3,iris,XGBoostClassifier,max_depth=8,0.21252,0.960521
4,iris,XGBoostClassifier,max_depth=2,0.219998,0.958139
5,iris,XGBoostClassifier,max_depth=4,0.227554,0.953534
6,iris,KNNClassifier,K=101,0.899702,0.466738
7,iris,KNNClassifier,K=1,1.44175,0.960816
8,iris,KNNClassifier,K=301,missing,missing


In [75]:
combine(x->sort(x, :auc, rev=true), groupby(agdf, [:dataset]), ungroup = false) # reverse the order, since a higher AUC is better

Unnamed: 0_level_0,dataset,model,parameters,cross_entropy,auc
Unnamed: 0_level_1,String,String,String,Float64?,Float64?
1,crabs,KNNClassifier,K=301,missing,missing
2,crabs,XGBoostClassifier,max_depth=6,0.216623,0.976535
3,crabs,XGBoostClassifier,max_depth=2,0.214894,0.976127
4,crabs,XGBoostClassifier,max_depth=8,0.241723,0.97217
5,crabs,XGBoostClassifier,max_depth=4,0.211701,0.968631
6,crabs,KNNClassifier,K=1,1.49914,0.944403
7,crabs,KNNClassifier,K=11,0.373872,0.925122
8,crabs,KNNClassifier,K=101,0.672827,0.670749

Unnamed: 0_level_0,dataset,model,parameters,cross_entropy,auc
Unnamed: 0_level_1,String,String,String,Float64?,Float64?
1,iris,KNNClassifier,K=301,missing,missing
2,iris,KNNClassifier,K=11,0.0969837,0.996452
3,iris,XGBoostClassifier,max_depth=6,0.211618,0.964671
4,iris,KNNClassifier,K=1,1.44175,0.960816
5,iris,XGBoostClassifier,max_depth=8,0.21252,0.960521
6,iris,XGBoostClassifier,max_depth=2,0.219998,0.958139
7,iris,XGBoostClassifier,max_depth=4,0.227554,0.953534
8,iris,KNNClassifier,K=101,0.899702,0.466738
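
Note that the K=301 rows moved from the bottom to the top once the order was reversed. Sorting treats `missing` as larger than any other value, so it comes last in ascending order and first with `rev=true`. A Base-only sketch with made-up AUC values:

```julia
# Hypothetical AUC values, one of them missing.
aucs = [0.92, missing, 0.97]

ascending  = sort(aucs)               # order: 0.92, 0.97, missing
descending = sort(aucs, rev = true)   # order: missing, 0.97, 0.92

# Dropping missing values first keeps the best real score on top.
best_first = sort(collect(skipmissing(aucs)), rev = true)   # order: 0.97, 0.92

println(best_first[1])   # 0.97
```

When reading off the best model by AUC, the first row with a non-missing value is the one that counts.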


### DataFramesMeta.jl

This package works on top of DataFrames and enables SQL-like queries. We can try to do the above in one query.

In [76]:
using DataFramesMeta

Best average result in terms of cross entropy on iris dataset.

In [78]:
@linq resdf |>
    where(.!ismissing.(:cross_entropy), :dataset.=="iris") |>
    by([:dataset, :model, :parameters], cross_entropy=mean(:cross_entropy))

Unnamed: 0_level_0,dataset,model,parameters,cross_entropy
Unnamed: 0_level_1,String,String,String,Float64
1,iris,KNNClassifier,K=1,1.44175
2,iris,KNNClassifier,K=11,0.0969837
3,iris,KNNClassifier,K=101,0.899702
4,iris,XGBoostClassifier,max_depth=2,0.219998
5,iris,XGBoostClassifier,max_depth=4,0.227554
6,iris,XGBoostClassifier,max_depth=6,0.211618
7,iris,XGBoostClassifier,max_depth=8,0.21252


In [77]:
@linq resdf |>
    where(.!ismissing.(:cross_entropy), :dataset.=="iris") |>
    by([:dataset, :model, :parameters], cross_entropy=mean(:cross_entropy)) |>
    orderby(:cross_entropy) |>
    select(:dataset, :model, :parameters, :cross_entropy)

Unnamed: 0_level_0,dataset,model,parameters,cross_entropy
Unnamed: 0_level_1,String,String,String,Float64
1,iris,KNNClassifier,K=11,0.0969837
2,iris,XGBoostClassifier,max_depth=6,0.211618
3,iris,XGBoostClassifier,max_depth=8,0.21252
4,iris,XGBoostClassifier,max_depth=2,0.219998
5,iris,XGBoostClassifier,max_depth=4,0.227554
6,iris,KNNClassifier,K=101,0.899702
7,iris,KNNClassifier,K=1,1.44175


Best average result in terms of AUC on crabs dataset.

In [79]:
@linq resdf |>
    where(.!ismissing.(:auc), :dataset.=="crabs") |>
    by([:dataset, :model, :parameters], auc=mean(:auc)) |>
    orderby(-:auc) |>
    select(:dataset, :model, :parameters, :auc)

Unnamed: 0_level_0,dataset,model,parameters,auc
Unnamed: 0_level_1,String,String,String,Float64
1,crabs,XGBoostClassifier,max_depth=6,0.976535
2,crabs,XGBoostClassifier,max_depth=2,0.976127
3,crabs,XGBoostClassifier,max_depth=8,0.97217
4,crabs,XGBoostClassifier,max_depth=4,0.968631
5,crabs,KNNClassifier,K=1,0.944403
6,crabs,KNNClassifier,K=11,0.925122
7,crabs,KNNClassifier,K=101,0.670749
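
For comparison, the same kind of query can be sketched with plain DataFrames.jl functions. The table `toy` below is a hypothetical stand-in for `resdf` with made-up values, and the intermediate variable names are chosen only for illustration:

```julia
using DataFrames, Statistics

# Hypothetical stand-in for `resdf`, with the same column names.
toy = DataFrame(dataset    = ["crabs", "crabs", "crabs", "iris"],
                model      = ["knn", "knn", "xgb", "knn"],
                parameters = ["K=1", "K=1", "d=2", "K=1"],
                auc        = [0.9, missing, 0.95, 0.99])

crabs   = filter(:dataset => ==("crabs"), toy)   # keep one dataset
valid   = dropmissing(crabs, :auc)               # drop rows without an AUC
avg     = combine(groupby(valid, [:dataset, :model, :parameters]),
                  :auc => mean => :auc)          # average over folds
ranking = sort(avg, :auc, rev = true)            # best AUC first

println(ranking.model)   # ["xgb", "knn"]
```

Each step produces an ordinary DataFrame, which makes intermediate results easy to inspect at the cost of a few extra names.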
