In [1]:
cd("../../../../temp")

In [2]:
]activate temp

[32m[1m  Activating[22m[39m project at `c:\Users\Work\Documents\Personal\Work\classes\PUCP 2024-II\repo\temp\temp`


In [32]:
using StatsModels
using DataFrames
using CSV
using GLM
using Statistics

In [4]:
data = CSV.read(
        download("https://raw.githubusercontent.com/d2cml-ai/CausalAI-Course/main/data/wage2015_subsample_inference.csv"), 
        DataFrame, 
        types = Dict(:occ2 => String, :ind2 => String)
);

7.

In [5]:
y = data[:, "lwage"];

8. Now that we have seen how the for loops are implemented in Python, they can easily be ported to Julia, so we will not focus on them here. Remember, however, that the formula-based methods can cause some issues: they are abstractions over larger procedures "under the hood," so we lose some control over exactly how the design matrix is generated.
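To make the "under the hood" point concrete, here is a minimal sketch of what a formula term such as `x * g` expands to when built by hand: the main effects plus their elementwise products, with the categorical variable dummy-coded against a reference level. The toy vectors `x` and `g` and the names `lvls`, `dummies`, and `interactions` are assumptions for illustration and do not come from the wage data.

```julia
# Toy data (hypothetical, not taken from the wage dataset)
x = [1.0, 2.0, 3.0, 4.0]
g = ["a", "b", "a", "c"]

# Dummy-code g, dropping the first level ("a") as the reference category
lvls = sort(unique(g))
dummies = hcat([Float64.(g .== l) for l in lvls[2:end]]...)

# x * g expands to x + g + x & g: main effects plus elementwise products
interactions = x .* dummies

# Intercept, x, two dummies, two interaction columns
X = hcat(ones(length(x)), x, dummies, interactions)
```

Building the matrix this way is more verbose than a formula, but every column is explicit, which is the control the formula abstraction gives up.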

8.1.

In [6]:
basic_formula = @formula(lwage ~ 1 + sex + hsg + scl + clg + ad + so + we + ne + exp1 + occ2 + ind2)
X_basic = modelmatrix(basic_formula, data);

8.2.

In [10]:
flexible_formula = @formula(lwage ~ 1 + sex + (exp1 + exp2 + exp3 + exp4) * (hsg + scl + clg + ad + so + we + ne + occ2 + ind2))
X_flexible = modelmatrix(flexible_formula, data);

8.3.

In [11]:
extra_flexible_formula = @formula(lwage ~ 1 + sex + (exp1 + exp2 + exp3 + exp4) * (hsg + scl + clg + ad + so + we + ne + occ2 + ind2) + (hsg + scl + clg + ad) * (so + we + ne + occ2 + ind2) + (so + we + ne) * (occ2 + ind2) + occ2 * ind2)
X_extra_flexible = modelmatrix(extra_flexible_formula, data);

9.

In [35]:
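# Draw a random split with roughly 80% of rows for training; no seed is set, so results vary between runs
train_sample = rand(Float64, size(data, 1)) .< 0.8
test_sample = .!(train_sample)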
y_train, y_test = y[train_sample], y[test_sample]
X_basic_train, X_basic_test = X_basic[train_sample, :], X_basic[test_sample, :]
X_flexible_train, X_flexible_test = X_flexible[train_sample, :], X_flexible[test_sample, :]
X_extra_flexible_train, X_extra_flexible_test = X_extra_flexible[train_sample, :], X_extra_flexible[test_sample, :];

([1.0 1.0 … 0.0 0.0; 1.0 0.0 … 0.0 0.0; … ; 1.0 0.0 … 0.0 0.0; 1.0 0.0 … 0.0 0.0], [1.0 1.0 … 0.0 0.0; 1.0 0.0 … 0.0 0.0; … ; 1.0 1.0 … 0.0 0.0; 1.0 1.0 … 0.0 0.0])
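Because the split is drawn without a seed, the metrics reported later will differ from run to run. One possible way to make the split reproducible is to pass an explicitly seeded random number generator to `rand`. The seed `1234`, the toy size `n`, and the names `train_mask` and `test_mask` are assumptions for illustration only.

```julia
using Random

# Hypothetical seed and sample size, chosen only for illustration
rng = MersenneTwister(1234)
n = 1000

# Each row lands in the training set with probability 0.8
train_mask = rand(rng, n) .< 0.8
test_mask = .!train_mask

# Re-creating the generator with the same seed reproduces the same mask
same_mask = rand(MersenneTwister(1234), n) .< 0.8
```

Note that this keeps the split sizes random (about 80/20 on average); an exact 80/20 split would need a shuffled index vector instead.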

10.

In [37]:
basic_model = lm(X_basic_train, y_train);

In [40]:
flexible_model = lm(X_flexible_train, y_train);

In [42]:
extra_flexible_model = lm(X_extra_flexible_train, y_train);

11.

In [63]:
basic_mse_training = mean(residuals(basic_model) .^ 2)
basic_r2_training = 1 - basic_mse_training / var(y_train)
basic_adjr2_training = 1 - size(y_train)[1] / (size(y_train)[1] - size(X_basic)[2]) * basic_mse_training / var(y_train)
basic_mse_testing = mean((predict(basic_model, X_basic_test) - y_test) .^ 2)
basic_r2_testing = 1 - basic_mse_testing / var(y_test)

println("The training MSE for the basic model is $basic_mse_training")
println("The training R2 for the basic model is $basic_r2_training")
println("The training Adjusted R2 for the basic model is $basic_adjr2_training")
println("The testing MSE for the basic model is $basic_mse_testing")
println("The testing R2 for the basic model is $basic_r2_testing")

The training MSE for the basic model is 0.22355646807078697
The training R2 for the basic model is 0.31234132737727915
The training Adjusted R2 for the basic model is 0.30363462620951376
The testing MSE for the basic model is 0.23099027643329226
The testing R2 for the basic model is 0.2924014564471741
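The same five metrics are computed for each model with nearly identical code. A small helper could remove the repetition; the sketch below mirrors the arithmetic of the cell above. The function `fit_metrics` and its argument names are hypothetical (not part of GLM or the notebook), and the numbers passed to it are toy values.

```julia
using Statistics

# Hypothetical helper: takes training residuals, test predictions, and
# the number of design-matrix columns p, and returns the five metrics.
function fit_metrics(resid_train, y_train, pred_test, y_test, p)
    n = length(y_train)
    mse_train = mean(resid_train .^ 2)
    r2_train = 1 - mse_train / var(y_train)
    adjr2_train = 1 - n / (n - p) * mse_train / var(y_train)
    mse_test = mean((pred_test .- y_test) .^ 2)
    r2_test = 1 - mse_test / var(y_test)
    return (; mse_train, r2_train, adjr2_train, mse_test, r2_test)
end

# Toy inputs, for illustration only
m = fit_metrics([0.1, -0.1, 0.2, -0.2], [1.0, 2.0, 3.0, 4.0],
                [1.5, 2.5], [1.0, 3.0], 2)
```

With the notebook's objects, a call might look like `fit_metrics(residuals(basic_model), y_train, predict(basic_model, X_basic_test), y_test, size(X_basic, 2))`.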


In [64]:
flexible_mse_training = mean(residuals(flexible_model) .^ 2)
flexible_r2_training = 1 - flexible_mse_training / var(y_train)
flexible_adjr2_training = 1 - size(y_train)[1] / (size(y_train)[1] - size(X_flexible)[2]) * flexible_mse_training / var(y_train)
flexible_mse_testing = mean((predict(flexible_model, X_flexible_test) - y_test) .^ 2)
flexible_r2_testing = 1 - flexible_mse_testing / var(y_test)

println("The training MSE for the flexible model is $flexible_mse_training")
println("The training R2 for the flexible model is $flexible_r2_training")
println("The training Adjusted R2 for the flexible model is $flexible_adjr2_training")
println("The testing MSE for the flexible model is $flexible_mse_testing")
println("The testing R2 for the flexible model is $flexible_r2_testing")

The training MSE for the flexible model is 0.2085168129360234
The training R2 for the flexible model is 0.35860323773900604
The training Adjusted R2 for the flexible model is 0.3174387181678595
The testing MSE for the flexible model is 0.23907818049707769
The testing R2 for the flexible model is 0.26762556880247357


In [65]:
extra_flexible_mse_training = mean(residuals(extra_flexible_model) .^ 2)
extra_flexible_r2_training = 1 - extra_flexible_mse_training / var(y_train)
extra_flexible_adjr2_training = 1 - size(y_train)[1] / (size(y_train)[1] - size(X_extra_flexible)[2]) * extra_flexible_mse_training / var(y_train)
extra_flexible_mse_testing = mean((predict(extra_flexible_model, X_extra_flexible_test) - y_test) .^ 2)
extra_flexible_r2_testing = 1 - extra_flexible_mse_testing / var(y_test)

println("The training MSE for the extra_flexible model is $extra_flexible_mse_training")
println("The training R2 for the extra_flexible model is $extra_flexible_r2_training")
println("The training Adjusted R2 for the extra_flexible model is $extra_flexible_adjr2_training")
println("The testing MSE for the extra_flexible model is $extra_flexible_mse_testing")
println("The testing R2 for the extra_flexible model is $extra_flexible_r2_testing")

The training MSE for the extra_flexible model is 0.17152114446618302
The training R2 for the extra_flexible model is 0.47240174463215134
The training Adjusted R2 for the extra_flexible model is 0.3089038909295265
The testing MSE for the extra_flexible model is 0.28239401244836165
The testing R2 for the extra_flexible model is 0.13493505007252615
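A note on the formulas used in the metric cells, written out as a short derivation. Julia's `var` returns the sample variance with an $n-1$ denominator by default, while the MSE divides by $n$. The symbols $n$ (training rows), $p$ (design-matrix columns), RSS, and TSS are introduced here for illustration.

```latex
\mathrm{MSE} = \frac{\mathrm{RSS}}{n}, \qquad
s_y^2 = \frac{\mathrm{TSS}}{n-1}

% Adjusted R^2 as computed in the cells: matches the usual definition
1 - \frac{n}{n-p}\cdot\frac{\mathrm{MSE}}{s_y^2}
  = 1 - \frac{\mathrm{RSS}/(n-p)}{\mathrm{TSS}/(n-1)}

% Plain R^2 as computed in the cells: differs from 1 - RSS/TSS by a factor (n-1)/n
1 - \frac{\mathrm{MSE}}{s_y^2}
  = 1 - \frac{n-1}{n}\cdot\frac{\mathrm{RSS}}{\mathrm{TSS}}
```

So the adjusted $R^2$ lines agree with the textbook formula, while the plain $R^2$ lines are very slightly larger than $1 - \mathrm{RSS}/\mathrm{TSS}$; with samples of this size the difference is negligible.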
