In [1]:
using Suppressor
using MLJ
using Plots
pyplot()

Plots.PyPlotBackend()

In [24]:
using RDatasets
iris = dataset("datasets", "iris");

In [58]:
X, y = iris[:,1:4], iris[:,5];



### DecisionTreeClassifier example

In [25]:
@load DecisionTreeClassifier

┌ Info: A model named "DecisionTreeClassifier" is already loaded.
│ Nothing new loaded. 
└ @ MLJ /Users/davidbuchacaprats/.julia/packages/MLJ/gsBfz/src/loading.jl:159


In [43]:
tree_model = DecisionTreeClassifier(max_depth=4)

DecisionTreeClassifier(target_type = Int64,
                       pruning_purity = 1.0,
                       max_depth = 4,
                       min_samples_leaf = 1,
                       min_samples_split = 2,
                       min_purity_increase = 0.0,
                       n_subfeatures = 0.0,
                       display_depth = 5,
                       post_prune = false,
                       merge_purity_threshold = 0.9,) @ 1…96

In [44]:
tree_machine = machine(tree_model, X, y)

Machine @ 4…82


In [45]:
train, test = partition(eachindex(y), 0.7, shuffle=true);

In [46]:
fit!(tree_machine, rows=train)

┌ Info: Training Machine @ 1…30.
└ @ MLJ /Users/davidbuchacaprats/.julia/packages/MLJ/gsBfz/src/machines.jl:115


ErrorException: Type, String, of target incompatible with type, Int64, of DecisionTreeClassifier @ 1…89.

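Notice that `fit!` expects a target with element type `Int64` (the model's `target_type`), whereas `y` holds strings. The class labels therefore need to be encoded as integers before training.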

https://github.com/JuliaML/MLDataUtils.jl

In [48]:
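function map_class_to_int(y; y_enc_type=Int64)
    # Distinct class labels. A Set has no guaranteed iteration order,
    # so which integer each label receives is arbitrary.
    class_values = Set(y)
    y_enc = zeros(y_enc_type, length(y))
    label_to_int = Dict(class=>y_enc_type(j) for (j,class) in enumerate(class_values))

    # Write each class's integer code into the matching positions of y_enc.
    for (class, code) in label_to_int
        y_enc[y .== class] .= code
    end
    return y_enc, label_to_int
end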

map_class_to_int (generic function with 1 method)

In [49]:
y_enc, label_to_int = map_class_to_int(y; y_enc_type=Int64);

In [50]:
label_to_int

Dict{CategoricalString{UInt8},Int64} with 3 entries:
  CategoricalString{UInt8} "virginica"  => 3
  CategoricalString{UInt8} "versicolor" => 1
  CategoricalString{UInt8} "setosa"     => 2
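
A model trained on `y_enc` works with integer codes, so it helps to keep an inverse lookup for turning codes back into class names. The following is a minimal sketch, assuming a hand-written toy dictionary with plain `String` keys rather than the `CategoricalString` keys shown above. `int_to_label` and `decode` are illustrative names, not part of MLJ.

```julia
# Toy stand-in for the `label_to_int` dictionary above (plain String keys).
label_to_int = Dict("virginica" => 3, "versicolor" => 1, "setosa" => 2)

# Invert the mapping: integer code => class label.
int_to_label = Dict(code => label for (label, code) in label_to_int)

# Hypothetical helper: decode a vector of integer codes back to labels.
decode(codes) = [int_to_label[c] for c in codes]

decoded = decode([2, 2, 1, 3])
println(decoded)
```

The inversion is only well defined because each label receives a distinct code.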

In [52]:
y_enc = CategoricalArray(y_enc)
tree_machine = machine(tree_model, X, y_enc)

Machine @ 9…25


In [53]:
fit!(tree_machine, rows=train)

┌ Info: Training Machine @ 3…55.
└ @ MLJ /Users/davidbuchacaprats/.julia/packages/MLJ/gsBfz/src/machines.jl:115


Machine @ 3…55


In [59]:
y_hat = predict(tree_machine, X[end-10:end,:])

11-element Array{UnivariateNominal{Int64,Float64},1}:
 UnivariateNominal{Int64,Float64}(Dict(2=>0.0,3=>1.0,1=>0.0))
 UnivariateNominal{Int64,Float64}(Dict(2=>0.0,3=>1.0,1=>0.0))
 UnivariateNominal{Int64,Float64}(Dict(2=>0.0,3=>1.0,1=>0.0))
 UnivariateNominal{Int64,Float64}(Dict(2=>0.0,3=>1.0,1=>0.0))
 UnivariateNominal{Int64,Float64}(Dict(2=>0.0,3=>1.0,1=>0.0))
 UnivariateNominal{Int64,Float64}(Dict(2=>0.0,3=>1.0,1=>0.0))
 UnivariateNominal{Int64,Float64}(Dict(2=>0.0,3=>1.0,1=>0.0))
 UnivariateNominal{Int64,Float64}(Dict(2=>0.0,3=>1.0,1=>0.0))
 UnivariateNominal{Int64,Float64}(Dict(2=>0.0,3=>1.0,1=>0.0))
 UnivariateNominal{Int64,Float64}(Dict(2=>0.0,3=>1.0,1=>0.0))
 UnivariateNominal{Int64,Float64}(Dict(2=>0.0,3=>1.0,1=>0.0))
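
Each prediction above is a probability distribution over the integer class codes rather than a single class. One way to reduce such a distribution to a point prediction is to take its most probable class. The sketch below does this in plain Julia on a dictionary copied from the output above. `most_probable_class` is a hypothetical helper, and the code does not use the `UnivariateNominal` type itself.

```julia
# Return the key with the largest probability in a class => probability dictionary.
function most_probable_class(probs::AbstractDict)
    best_class, best_p = first(probs)
    for (class, p) in probs
        if p > best_p
            best_class, best_p = class, p
        end
    end
    return best_class
end

# Probabilities as displayed for the first prediction above.
probs = Dict(2 => 0.0, 3 => 1.0, 1 => 0.0)
point_prediction = most_probable_class(probs)
println(point_prediction)
```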

In [65]:
yhat = predict(tree_machine, X[test,:]);

In [68]:
misclassification_rate(yhat, y_enc[test])

0.08888888888888889
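
The misclassification rate is the fraction of predictions that differ from the true labels. Here is a plain-Julia sketch of that definition on made-up vectors. `misclassification` is an illustrative name and is not the MLJ measure itself.

```julia
# Fraction of positions where the predicted label differs from the true label.
function misclassification(yhat::AbstractVector, y::AbstractVector)
    length(yhat) == length(y) ||
        throw(DimensionMismatch("yhat and y must have equal length"))
    return count(yhat .!= y) / length(y)
end

# Made-up labels for illustration: one of the five predictions is wrong.
y_true = [1, 2, 3, 3, 2]
y_pred = [1, 2, 3, 1, 2]
rate = misclassification(y_pred, y_true)
println(rate)
```

The value reported above equals 4/45, which is consistent with 4 errors on a 45-row test set (30% of the 150 iris rows).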

In [70]:
evaluate!(tree_machine, 
          resampling=Holdout(fraction_train=0.7, shuffle=true),
          measure=misclassification_rate)

┌ Info: Evaluating using a holdout set. 
│ fraction_train=0.7 
│ shuffle=true 
│ measure=MLJ.misclassification_rate 
│ operation=StatsBase.predict 
│ Resampling from all rows. 
└ @ MLJ /Users/davidbuchacaprats/.julia/packages/MLJ/gsBfz/src/resampling.jl:91


0.044444444444444446

## Data containers and scientific types


The MLJ user should acquaint themselves with some
basic assumptions about the form of data expected by MLJ, as outlined
below. 

```
machine(model::Supervised, X, y) 
machine(model::Unsupervised, X)
```

**Multivariate input.** The input `X` in the above machine
constructors can be any table, where *table* means any data type
supporting the [Tables.jl](https://github.com/JuliaData/Tables.jl)
interface.

> At present our API is more restrictive; see this
> [issue](https://github.com/JuliaData/Tables.jl/issues/74) with
> Tables.jl. If your Tables.jl compatible format is not working in
> MLJ, please post an issue.

In particular, `DataFrame`, `JuliaDB.IndexedTable` and
`TypedTables.Table` objects are supported, as are two Julia native
formats: *column tables* (named tuples of equal length vectors) and
*row tables* (vectors of named tuples sharing the same
keys).
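
The two Julia native formats can be written down with Base Julia alone. The sketch below builds the same small table both ways and converts one into the other. The column names and values are made up for illustration, and nothing here calls MLJ or Tables.jl.

```julia
# Column table: a named tuple of equal-length vectors, one vector per column.
column_table = (sepal_length = [5.1, 4.9, 4.7],
                sepal_width  = [3.5, 3.0, 3.2])

# Row table: a vector of named tuples that all share the same keys.
row_table = [(sepal_length = 5.1, sepal_width = 3.5),
             (sepal_length = 4.9, sepal_width = 3.0),
             (sepal_length = 4.7, sepal_width = 3.2)]

# Convert the column table to rows: `map` over a named tuple keeps its keys.
n_rows = length(first(column_table))
rebuilt_rows = [map(col -> col[i], column_table) for i in 1:n_rows]

println(rebuilt_rows == row_table)
```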

> Certain `JuliaDB.NDSparse` tables can be used for sparse data, but
> this is experimental and undocumented.

**Univariate input.** For models which handle only univariate inputs
(`input_is_multivariate(model)=false`) `X` cannot be a table but is
expected to be some `AbstractVector` type.

**Targets.** The target `y` in the first constructor above must be an
`AbstractVector`. A multivariate target `y` will be a vector of
*tuples*. The tuples need not have uniform length, so some forms of
sequence prediction are supported. Only the element types of `y`
matter (the types of `y[j]` for each `j`). Indeed if a machine accepts
`y` as an argument it will be just as happy with `identity.(y)`.
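
A short Base Julia sketch of the two points above: broadcasting `identity` rebuilds a vector with the narrowest element type that fits its elements, and a multivariate target can be written as a vector of tuples of varying length. The variable names and values are made up for illustration.

```julia
# A vector whose declared element type is loose, although every element is an Int.
y_loose = Any[1, 2, 3]

# Broadcasting `identity` rebuilds the vector with a narrowed element type.
y_tight = identity.(y_loose)

println(eltype(y_loose))
println(eltype(y_tight))

# A multivariate target as a vector of tuples; the tuples may differ in length.
y_multi = [(1.0, 2.0), (3.0,), (4.0, 5.0, 6.0)]
println(length.(y_multi))
```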

**Element types.** The types of input and target *elements* have strict
consequences for MLJ's behaviour. 

To articulate MLJ's conventions about data representation, MLJ
distinguishes between *machine* data types on the one hand (`Float64`,
`Bool`, `String`, etc) and *scientific data types* on the other,
represented by new Julia types: `Continuous`, `Count`,
`Multiclass{N}`, `OrderedFactor{N}` and `Unknown`, with obvious
interpretations.  These types are organized in a type
[hierarchy](scitypes.png) rooted in a new abstract type `Found`.

A *scientific type* is any subtype of
`Union{Missing,Found}`. Scientific types have no instances. (They are
used behind the scenes as values for model trait functions.) Such
types appear, for example, when querying model metadata:

```julia
julia> info("DecisionTreeClassifier")[:target_scitype_union]
```

```julia
Finite
```

```julia
subtypes(Finite)
```

```julia
2-element Array{Any,1}:
 Multiclass   
 OrderedFactor
```

This means that the scitype of every element of a `DecisionTreeClassifier`
target must be `Multiclass` or `OrderedFactor`.

To see how MLJ will interpret an object `x` appearing in table or
vector input `X`, or target vector `y`, call `scitype(x)`. The fallback
for this function is `scitype(::Any) = Unknown`. 

```julia
julia> (scitype(42), scitype(float(π)), scitype("Julia"))
```

```julia
(Count, Continuous, Unknown)
```
    
The table below shows machine types that have scientific types
different from `Unknown`:

`T`                         |     `scitype(x)` for `x::T`
----------------------------|:--------------------------------
`AbstractFloat`             | `Continuous`
`Integer`                   | `Count`
`CategoricalValue`          | `Multiclass{N}` where `N = nlevels(x)`, provided `x.pool.ordered == false`
`CategoricalString`         | `Multiclass{N}` where `N = nlevels(x)`, provided `x.pool.ordered == false`
`CategoricalValue`          | `OrderedFactor{N}` where `N = nlevels(x)`, provided `x.pool.ordered == true`
`CategoricalString`         | `OrderedFactor{N}` where `N = nlevels(x)`, provided `x.pool.ordered == true`
`Missing`                   | `Missing`

Here `nlevels(x) = length(levels(x.pool))`.
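
The non-categorical rows of the table, together with the `Unknown` fallback, can be mimicked with ordinary method dispatch. This is only a sketch: `sketch_scitype` is a hypothetical function, the symbols stand in for MLJ's scientific types, and it is not MLJ's implementation.

```julia
# Hypothetical mimic of the table for non-categorical values.
# Symbols stand in for MLJ's scientific types.
sketch_scitype(::AbstractFloat) = :Continuous
sketch_scitype(::Integer)       = :Count
sketch_scitype(::Missing)       = :Missing
sketch_scitype(::Any)           = :Unknown   # fallback for everything else

results = (sketch_scitype(42),
           sketch_scitype(float(π)),
           sketch_scitype("Julia"),
           sketch_scitype(missing))
println(results)
```

Julia picks the most specific matching method, so the `::Any` method is reached only when no other row applies.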

**Special note on using integers.** According to the above, integers
cannot be used to represent `Multiclass` or `OrderedFactor` data. These can be represented by an unordered or ordered `CategoricalValue`
or `CategoricalString` (automatic if they are elements of a
`CategoricalArray`).
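
As a sketch of the note above, integer codes can be wrapped in a categorical vector so that their elements are categorical values rather than plain integers. This assumes CategoricalArrays.jl is installed and uses its `categorical`, `levels` and `isordered` functions. The codes are made up, and the sketch does not call `scitype`.

```julia
using CategoricalArrays  # assumed to be installed

codes = [3, 1, 3, 2]

# Unordered categorical vector: the kind of element the table maps to `Multiclass`.
y_multiclass = categorical(codes)

# Ordered categorical vector: the kind of element the table maps to `OrderedFactor`.
y_ordered = categorical(codes, ordered=true)

println(levels(y_multiclass))
println(isordered(y_multiclass))
println(isordered(y_ordered))
```

This serves the same purpose as the `CategoricalArray(y_enc)` call used in the decision tree example earlier.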

Methods exist to coerce the scientific type of a vector or table (see
below). [Task](working_with_tasks.md) constructors also allow one to
force the data being wrapped to have the desired scientific type.

For more about scientific types and their role, see [Adding Models for
General Use](adding_models_for_general_use.md)





## Other

In [74]:
using MLJ
task = load_iris()

SupervisedTask @ 5…38


In [75]:
using RDatasets
df = dataset("boot", "channing");

In [76]:
scitypes(df)

(Sex = Multiclass{2},
 Entry = Count,
 Exit = Count,
 Time = Count,
 Cens = Count,)

In [77]:
task = supervised(data=df,
                  target=:Exit,
                  ignore=:Time,
                  is_probabilistic=true,
                  types=Dict(:Entry=>Continuous,
                             :Exit=>Continuous,
                             :Cens=>Multiclass))
scitypes(task.X)

UndefVarError: UndefVarError: supervised not defined

## Tasks

- Definition: A task is a set of 3 elements: {data, data interpretation, learning objective}.




We can use tasks to choose learning models.



### `SupervisedTask` type

In [None]:
supertype(SupervisedTask)

In [None]:
?SupervisedTask

In [None]:
methodswith(SupervisedTask)

In [None]:
task = load_iris()

In [None]:
?models

In [None]:
models(task)

In [None]:
typeof(task)

In [None]:
X, y = task()

In [None]:
println(typeof(X))
println(typeof(y))

In [None]:
X

In [None]:
@load DecisionTreeClassifier

In [None]:
tree_model = DecisionTreeClassifier(max_depth=5)

Wrapping the model in data creates a machine which will store training outcomes:

In [None]:
machine_tree = machine(tree_model, X, y)

In [None]:
train, test = partition(eachindex(y), 0.7, shuffle=true);

In [None]:
fit!(machine_tree, rows=train)

In [None]:
yhat = predict(machine_tree, X[test,:]);