### DataFrames basics

For a complete overview see the official documentation at `https://juliadata.github.io/DataFrames.jl/stable/`

In [None]:
using DataFrames

DataFrame construction

In [None]:
cars = DataFrame(brand = ["Volvo", "Volkswagen", "Škoda"], motor = [2.0, 1.6, 1.3], doors = [3, 5, 5])

indexing by position, like a matrix - works, but selecting by column name is usually more readable

In [None]:
cars[2:3, 1:2]

list of column names - you can use this to iterate over columns

In [None]:
names(cars)

column names as symbols

In [None]:
propertynames(cars)

getting a column - the result is a vector

In [None]:
cars.brand

In [None]:
cars[!,"brand"]

In [None]:
cars[!,:brand]
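The row selectors `!` and `:` differ in whether the column is copied: `!` returns the stored vector itself, `:` returns a copy. A minimal sketch on a hypothetical toy table `df` (not the `cars` table above):

```julia
using DataFrames

# Hypothetical toy table, used only for this illustration.
df = DataFrame(brand = ["Volvo", "Škoda"], doors = [3, 5])

nocopy = df[!, :doors]   # the vector stored inside df (same object as df.doors)
copied = df[:, :doors]   # a freshly allocated copy

same_object = nocopy === df.doors
is_copy     = copied !== df.doors

copied[1] = 99   # leaves df untouched
nocopy[2] = 7    # modifies df in place
```

Use `:` when you plan to mutate the result and want the table left unchanged.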

getting multiple columns - the result is a DataFrame

In [None]:
cars[:,[:brand, :motor]]

getting a row - the result is a DataFrameRow

In [None]:
cars[1,:]

getting multiple rows - the result is a DataFrame

In [None]:
cars[1:2, :]

adding a column to an existing df

In [None]:
cars[:,:country] = ["Sweden", "Germany", "Czech Republic"];
cars

adding a row to an existing df

In [None]:
push!(cars, ["Fiat", 1.0, 3, "Italy"])
push!(cars, ["Chrysler", 2.4, 5, "USA"])
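Pushing a plain vector relies on the values being in column order. A row can also be given as a named tuple, in which case the fields are matched to columns by name. A small sketch on a hypothetical table `df`:

```julia
using DataFrames

# Hypothetical toy table, used only for this illustration.
df = DataFrame(brand = ["Volvo"], motor = [2.0], doors = [3])

# Fields are matched to columns by name, so their order does not matter.
push!(df, (doors = 5, brand = "Fiat", motor = 1.0))
```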

`describe` returns a DataFrame with summary statistics for each column of its argument

In [None]:
describe(cars)

Iterating over rows of a DataFrame

In [None]:
row_stats(r::DataFrameRow) = println("A $(r.brand) with $(r.doors) doors and a $(r.motor)l motor.")

In [None]:
map(row_stats, eachrow(cars));
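`map` collects the return value of every call, which is why the cell above ends with a semicolon to hide the resulting vector of `nothing`. The sketch below, on a hypothetical table `df` with a hypothetical helper `describe_row`, contrasts `map` (when the values are wanted) with `foreach` (when only the side effect matters):

```julia
using DataFrames

# Hypothetical toy table and helper, used only for this illustration.
df = DataFrame(brand = ["Volvo", "Škoda"], doors = [3, 5])

describe_row(r) = "A $(r.brand) with $(r.doors) doors."

# map collects the returned strings into a vector ...
sentences = map(describe_row, eachrow(df))

# ... while foreach only runs the function for its side effect (printing).
foreach(r -> println(describe_row(r)), eachrow(df))
```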

#### A small join example

In [None]:
brands = DataFrame(parent = ["Fiat", "Volkswagen", "Volkswagen", "Geely", "Fiat"], brand = ["Fiat", "Volkswagen", "Škoda", "Volvo", "Chrysler"])

In [None]:
cars = leftjoin(cars, brands, on = :brand)
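In the example above every brand has a match in `brands`. When a key is missing from the right table, `leftjoin` keeps the row and fills the new columns with `missing`, whereas `innerjoin` drops it. A sketch with hypothetical tables `left` and `right`:

```julia
using DataFrames

# Hypothetical toy tables; "Lada" has no entry in `right`.
left  = DataFrame(brand = ["Volvo", "Lada"], doors = [3, 5])
right = DataFrame(brand = ["Volvo"], parent = ["Geely"])

lj = leftjoin(left, right, on = :brand)    # keeps "Lada", parent is missing
ij = innerjoin(left, right, on = :brand)   # keeps only "Volvo"

lada_parent = first(lj[lj.brand .== "Lada", :parent])
```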

Sorting

In [None]:
sort(cars, [:doors, :motor], rev=true)

Filtering

In [None]:
filter(r-> r[:doors] > 3 && r[:motor] < 2, cars)
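The same selection can be written in a few other ways. The sketch below, on a hypothetical copy of the first three cars, compares the row-wise predicate with the `columns => function` form of `filter` and with `subset`; all three should select the same rows.

```julia
using DataFrames

# Hypothetical toy table mirroring the first three rows of `cars`.
df = DataFrame(brand = ["Volvo", "Volkswagen", "Škoda"],
               motor = [2.0, 1.6, 1.3], doors = [3, 5, 5])

# Row-wise predicate: the function receives a DataFrameRow.
a = filter(r -> r.doors > 3 && r.motor < 2, df)

# Column pair: the function receives the selected values of one row.
b = filter([:doors, :motor] => ((d, m) -> d > 3 && m < 2), df)

# subset takes the table first; ByRow applies the predicate element by element.
c = subset(df, :doors => ByRow(>(3)), :motor => ByRow(<(2)))
```

The column-pair form avoids constructing a `DataFrameRow` for every row, which tends to matter on large tables.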

### A simple ML experiment

Load the results of several experiments, merge them into one big DataFrame, average the performance over folds, and look for the best model. The data was generated with the `MLJ` package by the script `generate_dataframes.jl`.

This is the folder where the data is saved

In [None]:
savepath = "./data"
files = readdir(savepath)

In [None]:
using CSV # for data reading
using Statistics # for averaging
results = map(x->CSV.read(joinpath(savepath,x), DataFrame), files) # recent CSV.jl versions require the sink type as the second argument
show(results[1], allcols = true)

#### Missing values

Most operations on a vector containing a `missing` value, such as `mean` or `sum`, return `missing`

In [None]:
mean([1, 2, missing, 3])

That's why we use `skipmissing`

In [None]:
mean(skipmissing([1, 2, missing, 3]))
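A few more ways of handling `missing`, shown on the same small vector: skipping it, replacing it with a default via `coalesce`, and counting it. The variable names are only illustrative.

```julia
using Statistics

x = [1, 2, missing, 3]

propagated = sum(x)                # the missing value propagates
skipped    = sum(skipmissing(x))   # sums only the present values
filled     = coalesce.(x, 0)       # replaces each missing with 0
n_missing  = count(ismissing, x)   # number of missing entries
```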

Now that we have the DataFrames from the individual experiments, let's concatenate them into one.

In [None]:
resdf = vcat(results...)
show(resdf)
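Plain `vcat` expects all tables to have the same columns. If some experiments lack a column, the `cols = :union` keyword keeps every column and fills the gaps with `missing`. A sketch with hypothetical tables in `parts`:

```julia
using DataFrames

# Hypothetical per-experiment tables; the second one has no `auc` column.
parts = [DataFrame(model = ["knn"], auc = [0.9]),
         DataFrame(model = ["tree"])]

# Without the keyword this call would fail because the column names differ.
merged = vcat(parts...; cols = :union)   # absent entries become missing
```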

Now aggregate them over folds.

In [None]:
agdf = combine(groupby(resdf, [:dataset, :model, :parameters]), names(resdf, Not([:dataset, :model, :parameters])) .=> mean)
show(agdf, allrows=true)
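The broadcast `.=> mean` creates one `column => function` pair per selected column, and each output column is named after its source column with the function name as a suffix. A small sketch on a hypothetical table `df`:

```julia
using DataFrames, Statistics

# Hypothetical results table with two folds per model.
df = DataFrame(model = ["knn", "knn", "tree", "tree"],
               fold  = [1, 2, 1, 2],
               auc   = [0.8, 0.9, 0.6, 0.7])

# Equivalent to passing :fold => mean and :auc => mean separately.
agg = combine(groupby(df, :model), [:fold, :auc] .=> mean)
```

The result has the columns `model`, `fold_mean`, and `auc_mean`, which is why the averaged fold index also shows up in the aggregated table.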

Where do the missing values come from?

In [None]:
filter(r->ismissing(r[:cross_entropy]) || ismissing(r[:auc]), resdf)

For K=301 we want the aggregate to stay `missing`, but for the other parameter values we just want to ignore the single missing value.

In [None]:
missmean(x) = all(ismissing, x) ? missing : mean(skipmissing(x)) # this returns mean ignoring the missing elements, but if all elements of x are missing, it returns missing
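A quick check of the two cases on hypothetical input vectors (the function is repeated so that the snippet runs on its own):

```julia
using Statistics

missmean(x) = all(ismissing, x) ? missing : mean(skipmissing(x))

partly = missmean([0.2, missing, 0.4])   # averages the two present values
fully  = missmean([missing, missing])    # nothing to average, stays missing
```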

In [None]:
agdf = combine(groupby(resdf, [:dataset, :model, :parameters]), names(resdf, Not([:dataset, :model, :parameters])) .=> missmean)
agdf = agdf[!,Not(:fold_missmean)] # drop the means of folds
rename!(agdf, :cross_entropy_missmean => :cross_entropy) # rename the aggregated columns
rename!(agdf, :auc_missmean => :auc)
show(agdf, allrows=true)

What is the best model on each dataset?

In [None]:
combine(x->sort(x, :cross_entropy), groupby(agdf, [:dataset]), ungroup = false)

In [None]:
combine(x->sort(x, :auc, rev=true), groupby(agdf, [:dataset]), ungroup = false) # sort in descending order, since a higher AUC is better
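If only the single best row per dataset is needed, rather than the whole sorted group, one option is to return that row from the function passed to `combine`. A sketch on a hypothetical table `df` without missing values:

```julia
using DataFrames

# Hypothetical aggregated results; no missing values in the metric.
df = DataFrame(dataset = ["iris", "iris", "crabs", "crabs"],
               model = ["knn", "tree", "knn", "tree"],
               cross_entropy = [0.3, 0.5, 0.9, 0.4])

best = combine(groupby(df, :dataset)) do sdf
    sdf[argmin(sdf.cross_entropy), :]   # row with the lowest loss in this group
end
```

For a metric where higher is better, such as AUC, `argmax` would take the place of `argmin`. Rows with a `missing` metric would have to be dropped first.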

### DataFramesMeta.jl

This package works on top of DataFrames and enables SQL-like queries. We can try to do the above in one query.

In [None]:
using DataFramesMeta

Best average result in terms of cross entropy on the iris dataset.

In [None]:
@linq resdf |>
    where(.!ismissing.(:cross_entropy), :dataset.=="iris") |>
    by([:dataset, :model, :parameters], cross_entropy=mean(:cross_entropy)) |>
    orderby(:cross_entropy) |>
    select(:dataset, :model, :parameters, :cross_entropy)

Best average result in terms of AUC on the crabs dataset.

In [None]:
@linq resdf |>
    where(.!ismissing.(:auc), :dataset.=="crabs") |>
    by([:dataset, :model, :parameters], auc=mean(:auc)) |>
    orderby(-:auc) |>
    select(:dataset, :model, :parameters, :auc)
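For comparison, the same kind of query can be written with plain DataFrames.jl functions. This is a sketch on a hypothetical table `df` that stands in for `resdf` (without the `parameters` column), not a reproduction of the results above:

```julia
using DataFrames, Statistics

# Hypothetical stand-in for resdf.
df = DataFrame(dataset = ["iris", "iris", "iris", "crabs"],
               model = ["knn", "knn", "tree", "knn"],
               cross_entropy = [0.2, missing, 0.5, 0.9])

# where: keep iris rows with a present cross entropy
iris = filter(r -> !ismissing(r.cross_entropy) && r.dataset == "iris", df)

# by: average within each group, keeping the original column name
avg = combine(groupby(iris, [:dataset, :model]),
              :cross_entropy => mean => :cross_entropy)

# orderby: lowest cross entropy first
ranked = sort(avg, :cross_entropy)
```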