# End to end examples

## Using GLM.jl

In [1]:
using Pkg; Pkg.activate("D:/JULIA/6_ML_with_Julia/EX-GLM"); Pkg.instantiate()

  Activating project at `D:\JULIA\6_ML_with_Julia\EX-GLM`
   Installed MLJGLMInterface ─ v0.2.0
Precompiling project...
  ✓ MLJGLMInterface
  1 dependency successfully precompiled in 8 seconds (96 already precompiled)


> Reading the data <br>
> Defining the Linear Model <br>
> Reading the Output of Fitting the Linear Model <br>
> Defining the Logistic Model <br>
> Reading the Output from the Prediction of the Logistic Model <br>

This Jupyter notebook showcases MLJ together with the popular GLM.jl package. It uses two datasets: one was created manually for testing purposes, and the other is the CollegeDistance dataset from the AER package in R.

MLJ lets us define both models and study their results through the same consistent interface.

In [2]:
using MLJ, CategoricalArrays, PrettyPrinting
import DataFrames: DataFrame, nrow
using UrlDownload

LinearRegressor = @load LinearRegressor pkg = GLM
LinearBinaryClassifier = @load LinearBinaryClassifier pkg = GLM

┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main C:\Users\jeffr\.julia\packages\MLJModels\tMgLW\src\loading.jl:168


import MLJGLMInterface

┌ Info: Precompiling MLJGLMInterface [caf8df21-4939-456d-ac9c-5fefbfb04c0c]
└ @ Base loading.jl:1423


 ✔


┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main C:\Users\jeffr\.julia\packages\MLJModels\tMgLW\src\loading.jl:168


import MLJGLMInterface ✔


MLJGLMInterface.LinearBinaryClassifier

### Reading the data

---

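Both datasets are stored as CSV files. Here we download them: `X3.csv` and `Y3.csv` hold the CollegeDistance features and the binary target, while `X1.csv` and `Y1.csv` hold the manually created data.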

In [3]:
baseurl = "https://raw.githubusercontent.com/tlienart/DataScienceTutorialsData.jl/master/data/glm/"

"https://raw.githubusercontent.com/tlienart/DataScienceTutorialsData.jl/master/data/glm/"

In [5]:
dfX = DataFrame(urldownload(baseurl * "X3.csv"))
dfYbinary = DataFrame(urldownload(baseurl * "Y3.csv"))
dfX1 = DataFrame(urldownload(baseurl * "X1.csv"))
dfY1 = DataFrame(urldownload(baseurl * "Y1.csv"));

You can inspect the first rows of each table with `first`:

In [6]:
first(dfX, 3)

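| Row | gender | ethnicity | score | fcollege | mcollege | home | urban | unemp | wage |
|-----|--------|-----------|-------|----------|----------|------|-------|-------|------|
|     | String7 | String15 | Float64 | String3 | String3 | String3 | String3 | Float64 | Float64 |
| 1 | male | other | 39.15 | yes | no | yes | yes | 6.2 | 8.09 |
| 2 | female | other | 48.87 | no | no | yes | yes | 6.2 | 8.09 |
| 3 | male | other | 48.74 | no | no | yes | yes | 6.2 | 8.09 |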


The same works for a target table, here `dfY1`:

In [7]:
first(dfY1, 3)

| Row | Y |
|-----|---|
|     | Float64 |
| 1 | -2.04463 |
| 2 | -0.461529 |
| 3 | 0.458262 |


### Defining the Linear Model

---

Let's see which MLJ models can handle a given kind of target (the `y` variable). First, the models whose target scitype accepts a vector of `Count` values:

In [9]:
ms = models() do m
    AbstractVector{Count} <: m.target_scitype
end
foreach(m -> println(m.name), ms)

EvoTreeCount
LinearCountRegressor
XGBoostCount


And if the target type were `Continuous`:

In [10]:
ms = models() do m
    Vector{Continuous} <: m.target_scitype
end

foreach(m -> println(m.name), ms)

ARDRegressor
AdaBoostRegressor
BaggingRegressor
BayesianRidgeRegressor
ConstantRegressor
DecisionTreeRegressor
DecisionTreeRegressor
DeterministicConstantRegressor
DummyRegressor
ElasticNetCVRegressor
ElasticNetRegressor
ElasticNetRegressor
EpsilonSVR
EvoTreeGaussian
EvoTreeRegressor
ExtraTreesRegressor
GaussianProcessRegressor
GradientBoostingRegressor
HuberRegressor
HuberRegressor
KNNRegressor
KNeighborsRegressor
KPLSRegressor
LADRegressor
LGBMRegressor
LarsCVRegressor
LarsRegressor
LassoCVRegressor
LassoLarsCVRegressor
LassoLarsICRegressor
LassoLarsRegressor
LassoRegressor
LassoRegressor
LinearRegressor
LinearRegressor
LinearRegressor
LinearRegressor
NeuralNetworkRegressor
NuSVR
OrthogonalMatchingPursuitCVRegressor
OrthogonalMatchingPursuitRegressor
PLSRegressor
PassiveAggressiveRegressor
QuantileRegressor
RANSACRegressor
RandomForestRegressor
RandomForestRegressor
RandomForestRegressor
RidgeCVRegressor
RidgeRegressor
RidgeRegressor
RidgeRegressor
RobustRegressor
SGDRegressor
SVMLin

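Names that appear more than once (for example `LinearRegressor`) are models of the same name provided by different packages. Here we use the GLM version, wrapped in a pipeline that standardizes the continuous features and one-hot encodes the categorical ones before fitting.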

In [11]:
X = copy(dfX1)
y = copy(dfY1)

coerce!(X, autotype(X, :string_to_multiclass))
yv = Vector(y[:, 1])

LinearRegressorPipe = Pipeline(
    Standardizer(),
    OneHotEncoder(drop_last = true),
    LinearRegressor()
)

LinearModel = machine(LinearRegressorPipe, X, yv)
fit!(LinearModel)
fp = fitted_params(LinearModel)

┌ Info: Training Machine{ProbabilisticPipeline{NamedTuple{,…},…},…}.
└ @ MLJBase C:\Users\jeffr\.julia\packages\MLJBase\MuLnJ\src\machines.jl:464
┌ Info: Training Machine{Standardizer,…}.
└ @ MLJBase C:\Users\jeffr\.julia\packages\MLJBase\MuLnJ\src\machines.jl:464
┌ Info: Training Machine{OneHotEncoder,…}.
└ @ MLJBase C:\Users\jeffr\.julia\packages\MLJBase\MuLnJ\src\machines.jl:464
┌ Info: Training Machine{LinearRegressor,…}.
└ @ MLJBase C:\Users\jeffr\.julia\packages\MLJBase\MuLnJ\src\machines.jl:464


(linear_regressor = (features = ["V1", "V2", "V3", "V4", "V5"],
                     coef = [1.0207869497405524, 1.03242891546997, 0.009406292423317635, 0.026633915171207456, 0.29985915636370225, 0.01589388399578986],
                     intercept = 0.01589388399578986,),
 one_hot_encoder = (fitresult = OneHotEncoderResult,),
 standardizer = Dict(:V1 => (0.0024456300706479973, 1.1309193246154066), :V2 => (-0.015561621122145304, 1.1238897897565245), :V5 => (0.0077036209704558975, 1.1421493464876622), :V3 => (0.02442889884313839, 2.332713568319154), :V4 => (0.15168404285157286, 6.806065861835239)),
 machines = Machine[Machine{Standardizer,…}, Machine{OneHotEncoder,…}, Machine{LinearRegressor,…}],
 fitted_params_given_machine = OrderedCollections.LittleDict{Any, Any, Vector{Any}, Vector{Any}}(Machine{Standardizer,…} => Dict(:V1 => (0.0024456300706479973, 1.1309193246154066), :V2 => (-0.015561621122145304, 1.1238897897565245), :V5 => (0.0077036209704558975, 1.1421493464876622), :V3 => (0.

### Reading the Output of Fitting the Linear Model

---

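The GLM `LinearRegressor` is probabilistic, so `predict` returns one `Normal` distribution per row, and the mean `μ` of each distribution is the point prediction. From these we can compute residuals and read the fit report. Remember to assess how well the linear model fits, for example through the residuals.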

In [12]:
ŷ = MLJ.predict(LinearModel, X)

4000-element Vector{Distributions.Normal{Float64}}:
 Distributions.Normal{Float64}(μ=-1.6915415373374758, σ=0.9580569656804974)
 Distributions.Normal{Float64}(μ=1.4120055632036437, σ=0.9580569656804974)
 Distributions.Normal{Float64}(μ=0.47362968068623923, σ=0.9580569656804974)
 Distributions.Normal{Float64}(μ=0.7266938985590493, σ=0.9580569656804974)
 Distributions.Normal{Float64}(μ=-1.8396459459760564, σ=0.9580569656804974)
 Distributions.Normal{Float64}(μ=0.17582494693025746, σ=0.9580569656804974)
 Distributions.Normal{Float64}(μ=-0.6198103897510154, σ=0.9580569656804974)
 Distributions.Normal{Float64}(μ=2.180787658539391, σ=0.9580569656804974)
 Distributions.Normal{Float64}(μ=2.350862495689184, σ=0.9580569656804974)
 Distributions.Normal{Float64}(μ=0.8121326438168863, σ=0.9580569656804974)
 Distributions.Normal{Float64}(μ=0.26763461952335066, σ=0.9580569656804974)
 Distributions.Normal{Float64}(μ=-0.9597859195673623, σ=0.9580569656804974)
 Distributions.Normal{Float64}(μ=-0.4558610

In [18]:
yhatResponse = [ŷ[i, 1].μ for i in 1:nrow(y)] # μ is the predicted value of the response variable

4000-element Vector{Float64}:
 -1.6915415373374758
  1.4120055632036437
  0.47362968068623923
  0.7266938985590493
 -1.8396459459760564
  0.17582494693025746
 -0.6198103897510154
  2.180787658539391
  2.350862495689184
  0.8121326438168863
  0.26763461952335066
 -0.9597859195673623
 -0.45586101923927047
  ⋮
 -1.5710352382340806
  0.7892100599570454
  0.6062040281841251
 -0.47942797041989216
  0.41469503403720764
  1.1221537781217268
  1.8298953319112667
 -2.117915663545709
  1.045858447110721
 -1.211004416125827
 -1.8897562526259846
 -2.6383291661394104

In [16]:
residuals = yv .- yhatResponse  # observed minus predicted, using the target vector yv
r = report(LinearModel)

(linear_regressor = (deviance = 3665.985359058753,
                     dof_residual = 3994.0,
                     stderror = [0.015876403107805682, 0.015862782503144914, 0.01515900587321476, 0.015156676986003868, 0.016546721612329368, 0.0151482106987007],
                     vcov = [0.0002520601756415419 2.2602205615189542e-5 … -7.850207954537935e-5 -5.6471623554837194e-21; 2.2602205615189542e-5 0.0002516278687420804 … -7.734342973144671e-5 6.403094529988605e-21; … ; -7.850207954537935e-5 -7.734342973144671e-5 … 0.00027379399611592785 -2.1322073720725597e-21; -5.6471623554837194e-21 6.403094529988605e-21 … -2.1322073720725597e-21 0.00022946828737223037],),
 one_hot_encoder = (features_to_be_encoded = Symbol[],
                    new_features = [:V1, :V2, :V3, :V4, :V5],),
 standardizer = (features_fit = [:V1, :V2, :V5, :V3, :V4],),
 machines = Machine[Machine{Standardizer,…}, Machine{OneHotEncoder,…}, Machine{LinearRegressor,…}],
 report_given_machine = OrderedCollections.LittleDic

In [17]:
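# The third machine in the pipeline is the LinearRegressor;
# fp.linear_regressor.coef holds the same coefficient vector.
k = collect(keys(fp.fitted_params_given_machine))[3]
println("\n Coefficients: ", fp.fitted_params_given_machine[k].coef)
println("\n y \n ", y[1:5, 1])
println("\n ŷ \n ", ŷ[1:5])
println("\n yhatResponse \n", yhatResponse[1:5])
println("\n Residuals \n ", y[1:5, 1] .- yhatResponse[1:5])
# Note: in the output above the intercept is the last entry of coef, and stderror
# appears to follow the same order, so 2:end skips the first feature, not the intercept.
println("\n Standard Error per Coefficient \n", r.linear_regressor.stderror[2:end])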


 Coefficients: [1.0207869497405524, 1.03242891546997, 0.009406292423317635, 0.026633915171207456, 0.29985915636370225, 0.01589388399578986]

 y 
 [-2.0446341129015, -0.461528671336098, 0.458261960749596, 2.2746223981481, -0.969887403007307]

 ŷ 
 Distributions.Normal{Float64}[Distributions.Normal{Float64}(μ=-1.6915415373374758, σ=0.9580569656804974), Distributions.Normal{Float64}(μ=1.4120055632036437, σ=0.9580569656804974), Distributions.Normal{Float64}(μ=0.47362968068623923, σ=0.9580569656804974), Distributions.Normal{Float64}(μ=0.7266938985590493, σ=0.9580569656804974), Distributions.Normal{Float64}(μ=-1.8396459459760564, σ=0.9580569656804974)]

 yhatResponse 
[-1.6915415373374758, 1.4120055632036437, 0.47362968068623923, 0.7266938985590493, -1.8396459459760564]

 Residuals 
 [-0.3530925755640242, -1.8735342345397417, -0.01536771993664321, 1.5479284995890508, 0.8697585429687493]

 Standard Error per Coefficient 
[0.015862782503144914, 0.01515900587321476, 0.015156676986003868, 0.01
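
The residuals printed above can be reproduced by hand, and the same values give a simple error summary. The following is a minimal sketch using only the five rows shown in the output; the names `y_obs`, `y_hat`, `rmse_5` and `mae_5` are illustrative and do not appear in the notebook, and five rows are of course not representative of the full 4000.

```julia
using Statistics  # standard library, provides `mean`

# First five observed targets and point predictions, copied from the output above.
y_obs = [-2.0446341129015, -0.461528671336098, 0.458261960749596,
         2.2746223981481, -0.969887403007307]
y_hat = [-1.6915415373374758, 1.4120055632036437, 0.47362968068623923,
         0.7266938985590493, -1.8396459459760564]

# Residual = observed minus predicted.
res = y_obs .- y_hat

# Root-mean-square error and mean absolute error over these five rows only.
rmse_5 = sqrt(mean(abs2, res))
mae_5 = mean(abs, res)

println("residuals: ", res)
println("RMSE (5 rows): ", rmse_5)
println("MAE  (5 rows): ", mae_5)
```

On the full data, the same expressions apply to `yv` and `yhatResponse`.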

### Defining the Logistic Model

---

In [22]:
X = copy(dfX)
y = copy(dfYbinary)

coerce!(X, autotype(X, :string_to_multiclass))
yc = CategoricalArray(y[:, 1])
yc = coerce(yc, OrderedFactor)

LinearBinaryClassifierPipe = Pipeline(
    Standardizer(),
    OneHotEncoder(drop_last = true),
    LinearBinaryClassifier()
)

LogisticModel = machine(LinearBinaryClassifierPipe, X, yc)
fit!(LogisticModel)
fp = fitted_params(LogisticModel)

┌ Info: Training Machine{ProbabilisticPipeline{NamedTuple{,…},…},…}.
└ @ MLJBase C:\Users\jeffr\.julia\packages\MLJBase\MuLnJ\src\machines.jl:464
┌ Info: Training Machine{Standardizer,…}.
└ @ MLJBase C:\Users\jeffr\.julia\packages\MLJBase\MuLnJ\src\machines.jl:464
┌ Info: Training Machine{OneHotEncoder,…}.
└ @ MLJBase C:\Users\jeffr\.julia\packages\MLJBase\MuLnJ\src\machines.jl:464
┌ Info: Spawning 1 sub-features to one-hot encode feature :gender.
└ @ MLJModels C:\Users\jeffr\.julia\packages\MLJModels\tMgLW\src\builtins\Transformers.jl:1142
┌ Info: Spawning 2 sub-features to one-hot encode feature :ethnicity.
└ @ MLJModels C:\Users\jeffr\.julia\packages\MLJModels\tMgLW\src\builtins\Transformers.jl:1142
┌ Info: Spawning 1 sub-features to one-hot encode feature :fcollege.
└ @ MLJModels C:\Users\jeffr\.julia\packages\MLJModels\tMgLW\src\builtins\Transformers.jl:1142
┌ Info: Spawning 1 sub-features to one-hot encode feature :mcollege.
└ @ MLJModels C:\Users\jeffr\.julia\packages\MLJModels\

(linear_binary_classifier = (features = ["gender__female", "ethnicity__afam", "ethnicity__hispanic", "score", "fcollege__no", "mcollege__no", "home__no", "urban__no", "unemp", "wage", "tuition", "income__high", "region__other"],
                             coef = [0.20250729378868754, 0.13075293910912905, 0.344951624939835, 0.9977565847160846, -0.5022315102984595, -0.47850056260216456, -0.20440507809954991, -0.06922751403500088, 0.05892864973017097, -0.08344749828203235, -0.0023151433338596816, 0.46177653955786585, 0.3843262958100774, -1.076633890579366],
                             intercept = -1.076633890579366,),
 one_hot_encoder = (fitresult = OneHotEncoderResult,),
 standardizer = Dict(:wage => (9.500506478338009, 1.3430670761078416), :unemp => (7.597214581091511, 2.763580873344848), :tuition => (0.8146082493518824, 0.33950381985971717), :score => (50.88902933684601, 8.701909614072397)),
 machines = Machine[Machine{Standardizer,…}, Machine{OneHotEncoder,…}, Machine{LinearBinaryC
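
To make the fitted parameters above more concrete, here is a small sketch of how one feature enters the model. It assumes that each `standardizer` entry is a `(mean, standard deviation)` pair, that the classifier uses the logit link, and that the modelled probability is that of the second class `1`. The helper names `standardize` and `logistic` are illustrative only. All numbers are copied from the outputs in this notebook.

```julia
# Mean and standard deviation of :score, copied from the standardizer output above.
score_mean, score_std = 50.88902933684601, 8.701909614072397

# Coefficient of "score" and the intercept, copied from the fitted parameters above.
coef_score = 0.9977565847160846
intercept = -1.076633890579366

standardize(x, μ, σ) = (x - μ) / σ   # per-column transformation (assumed)
logistic(η) = 1 / (1 + exp(-η))      # inverse of the logit link (assumed)

z = standardize(39.15, score_mean, score_std)  # score of the first row of dfX
contribution = coef_score * z                   # this feature's share of the linear predictor
baseline = logistic(intercept)                  # probability when every model column is zero

println("standardized score: ", z)
println("contribution to linear predictor: ", contribution)
println("baseline probability: ", baseline)
```

A full prediction sums such contributions over all thirteen model columns plus the intercept before applying `logistic`; this sketch only shows a single term.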

### Reading the Output from the Prediction of the Logistic Model

---

The output of the MLJ model contains essentially the same information as the output of the corresponding model in R.

In [23]:
ŷ = MLJ.predict(LogisticModel, X)
# residual = 1 minus the predicted probability of the observed class
residuals = [1 - pdf(ŷ[i], y[i, 1]) for i in 1:nrow(y)]
r = report(LogisticModel)

k = collect(keys(fp.fitted_params_given_machine))[3]
println("\n Coefficients: ", fp.fitted_params_given_machine[k].coef)
println("\n y \n ", y[1:5, 1])
println("\n ŷ \n ", ŷ[1:5])
println("\n residuals \n ", residuals[1:5])
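# Note: the intercept is the last entry of coef above; if stderror follows the same
# order, 2:end skips the first feature rather than the intercept.
println("\n Standard Error per Coefficient \n", r.linear_binary_classifier.stderror[2:end])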


 Coefficients: [0.20250729378868754, 0.13075293910912905, 0.344951624939835, 0.9977565847160846, -0.5022315102984595, -0.47850056260216456, -0.20440507809954991, -0.06922751403500088, 0.05892864973017097, -0.08344749828203235, -0.0023151433338596816, 0.46177653955786585, 0.3843262958100774, -1.076633890579366]

 y 
 [0, 0, 0, 0, 0]

 ŷ 
 UnivariateFinite{OrderedFactor{2}, Int64, UInt32, Float64}[UnivariateFinite{OrderedFactor{2}}(0=>0.881, 1=>0.119), UnivariateFinite{OrderedFactor{2}}(0=>0.838, 1=>0.162), UnivariateFinite{OrderedFactor{2}}(0=>0.866, 1=>0.134), UnivariateFinite{OrderedFactor{2}}(0=>0.936, 1=>0.0637), UnivariateFinite{OrderedFactor{2}}(0=>0.944, 1=>0.056)]

 residuals 
 [0.11944603346742211, 0.16182691493524637, 0.13445730373831222, 0.06370799769022917, 0.05604680411361729]

 Standard Error per Coefficient 
[0.1226000420274196, 0.10934317995152515, 0.04661437250372938, 0.09609243724815363, 0.10743620672240183, 0.10642223545563925, 0.09190778860389329, 0.039227245365088

No logistic analysis is complete without the confusion matrix:

In [24]:
yMode = mode.(ŷ)                      # most probable class for each row
y = coerce(y[:, 1], OrderedFactor)
yMode = coerce(yMode, OrderedFactor)
confusion_matrix(yMode, y)

              ┌───────────────────────────┐
              │       Ground Truth        │
┌─────────────┼─────────────┬─────────────┤
│  Predicted  │      0      │      1      │
├─────────────┼─────────────┼─────────────┤
│      0      │    3283     │     831     │
├─────────────┼─────────────┼─────────────┤
│      1      │     236     │     389     │
└─────────────┴─────────────┴─────────────┘
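
The four counts in the confusion matrix are enough to compute the usual summary metrics by hand. The sketch below treats class `1` as the positive class, which is an assumption and not something the notebook states; the variable names are illustrative.

```julia
# Counts copied from the confusion matrix above (rows: predicted, columns: ground truth).
true_neg, false_neg = 3283, 831   # predicted 0
false_pos, true_pos = 236, 389    # predicted 1

total = true_neg + false_neg + false_pos + true_pos

acc = (true_pos + true_neg) / total       # share of correct predictions
prec = true_pos / (true_pos + false_pos)  # of predicted 1s, how many are truly 1
rec = true_pos / (true_pos + false_neg)   # of true 1s, how many were predicted as 1

println("accuracy:  ", acc)
println("precision: ", prec)
println("recall:    ", rec)
```

The recall is much lower than the accuracy here because most true `1` cases (831 of 1220) are predicted as `0`.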
