### GROUP 1

* DEL CARPIO CUENCA, GABRIEL SEBASTIAN
* ESPINOSA CALDERON, MAURICIO GUSTAVO 
* JAIME MARTINEZ, KEVIN OSWALDO
* MELLIZO ANTAZU, MILAGROS ESTEFANY
* QUISPE ROBLADILLO, ALMENDRA VALERIA

**Replication 1: A Prediction problem: The Gender Wage Gap** **(JULIA)**

The data set we consider is from the March Supplement of the U.S. Current Population Survey, year 2015. Its path in the repository is data/wage2015_subsample_inference.csv; it can also be loaded with this link. We select white non-Hispanic individuals, aged 25 to 64 years, who work more than 35 hours per week during at least 50 weeks of the year. We exclude self-employed workers; individuals living in group quarters; individuals in the military, agricultural or private household sectors; individuals with inconsistent reports on earnings and employment status; individuals with allocated or missing information in any of the variables used in the analysis; and individuals with an hourly wage below 3.

The variable of interest $Y$ is the (log) hourly wage rate, constructed as the ratio of annual earnings to the total number of hours worked; the latter is in turn the product of the number of weeks worked and the usual number of hours worked per week. In our analysis, we also focus on single (never married) workers. The final sample is of size $n = 5150$.
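
The construction of the outcome variable described above can be written as a single formula. The symbols $E$ (annual earnings), $W$ (weeks worked) and $H$ (usual hours worked per week) are notation assumed here for illustration; they do not appear in the data set.

$$Y = \log\left(\frac{E}{W \times H}\right)$$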


First we install Julia and its packages, enabling Julia as the runtime used to execute the code.

In [None]:
%%shell
set -e

#---------------------------------------------------#
JULIA_VERSION="1.8.2" # any version ≥ 0.7.0
JULIA_PACKAGES="IJulia BenchmarkTools"
JULIA_PACKAGES_IF_GPU="CUDA" # or CuArrays for older Julia versions
JULIA_NUM_THREADS=2
#---------------------------------------------------#

if [ -z `which julia` ]; then
  # Install Julia
  JULIA_VER=`cut -d '.' -f -2 <<< "$JULIA_VERSION"`
  echo "Installing Julia $JULIA_VERSION on the current Colab Runtime..."
  BASE_URL="https://julialang-s3.julialang.org/bin/linux/x64"
  URL="$BASE_URL/$JULIA_VER/julia-$JULIA_VERSION-linux-x86_64.tar.gz"
  wget -nv $URL -O /tmp/julia.tar.gz # -nv means "not verbose"
  tar -x -f /tmp/julia.tar.gz -C /usr/local --strip-components 1
  rm /tmp/julia.tar.gz

  # Install Packages
  nvidia-smi -L &> /dev/null && export GPU=1 || export GPU=0
  if [ $GPU -eq 1 ]; then
    JULIA_PACKAGES="$JULIA_PACKAGES $JULIA_PACKAGES_IF_GPU"
  fi
  for PKG in `echo $JULIA_PACKAGES`; do
    echo "Installing Julia package $PKG..."
    julia -e 'using Pkg; pkg"add '$PKG'; precompile;"' &> /dev/null
  done

  # Install kernel and rename it to "julia"
  echo "Installing IJulia kernel..."
  julia -e 'using IJulia; IJulia.installkernel("julia", env=Dict(
      "JULIA_NUM_THREADS"=>"'"$JULIA_NUM_THREADS"'"))'
  KERNEL_DIR=`julia -e "using IJulia; print(IJulia.kerneldir())"`
  KERNEL_NAME=`ls -d "$KERNEL_DIR"/julia*`
  mv -f $KERNEL_NAME "$KERNEL_DIR"/julia

  echo ''
  echo "Successfully installed `julia -v`!"
  echo "Please reload this page (press Ctrl+R, ⌘+R, or the F5 key) then"
  echo "jump to the 'Checking the Installation' section."
fi

Installing Julia 1.8.2 on the current Colab Runtime...
2024-09-06 14:01:25 URL:https://storage.googleapis.com/julialang2/bin/linux/x64/1.8/julia-1.8.2-linux-x86_64.tar.gz [135859273/135859273] -> "/tmp/julia.tar.gz" [1]
Installing Julia package IJulia...
Installing Julia package BenchmarkTools...
Installing IJulia kernel...
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mInstalling julia kernelspec in /root/.local/share/jupyter/kernels/julia-1.8

Successfully installed julia version 1.8.2!
Please reload this page (press Ctrl+R, ⌘+R, or the F5 key) then
jump to the 'Checking the Installation' section.




In [None]:
versioninfo()

Julia Version 1.8.2
Commit 36034abf260 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 2 × Intel(R) Xeon(R) CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
  Threads: 2 on 2 virtual cores
Environment:
  LD_LIBRARY_PATH = /usr/local/nvidia/lib:/usr/local/nvidia/lib64
  JULIA_NUM_THREADS = 2


In [None]:
using Pkg
Pkg.add("Plots")
Pkg.add("Random")
Pkg.add("GLM")
Pkg.add("DataFrames")
Pkg.add("LaTeXStrings")
Pkg.add("Statistics")

[32m[1m    Updating[22m[39m registry at `~/.julia/registries/General.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m JpegTurbo_jll ──────────────── v3.0.3+0
[32m[1m   Installed[22m[39m GR_jll ─────────────────────── v0.73.7+0
[32m[1m   Installed[22m[39m Libmount_jll ───────────────── v2.40.1+0
[32m[1m   Installed[22m[39m libfdk_aac_jll ─────────────── v2.0.3+0
[32m[1m   Installed[22m[39m LERC_jll ───────────────────── v3.0.0+1
[32m[1m   Installed[22m[39m x265_jll ───────────────────── v3.5.0+0
[32m[1m   Installed[22m[39m libdecor_jll ───────────────── v0.2.2+0
[32m[1m   Installed[22m[39m LoggingExtras ──────────────── v1.0.3
[32m[1m   Installed[22m[39m Opus_jll ───────────────────── v1.3.3+0
[32m[1m   Installed[22m[39m Xorg_xkbcomp_jll ───────────── v1.4.6+0
[32m[1m   Installed[22m[39m RelocatableFolders ─────────── v1.0.1
[32m[1m   Installed[22m[39m Measures ───────────────────── v0.3.2
[32m[1m

In [None]:
Pkg.add("StatsBase")

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.8/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.8/Manifest.toml`


In [None]:
# Load the packages used below
using Plots
using Random
using GLM
using DataFrames
using LaTeXStrings
using Statistics
using StatsBase

**1. DATA ANALYSIS**

**1. Import the data set. Make sure the column names are imported as intended.**

In [4]:
using Pkg
Pkg.add(["CSV", "DataFrames"])

[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m WorkerUtilities ─ v1.6.1
[32m[1m   Installed[22m[39m WeakRefStrings ── v1.4.2
[32m[1m   Installed[22m[39m FilePathsBase ─── v0.9.22
[32m[1m   Installed[22m[39m CSV ───────────── v0.10.14
[32m[1m    Updating[22m[39m `~/.julia/environments/v1.8/Project.toml`
 [90m [336ed68f] [39m[92m+ CSV v0.10.14[39m
[32m[1m    Updating[22m[39m `~/.julia/environments/v1.8/Manifest.toml`
 [90m [336ed68f] [39m[92m+ CSV v0.10.14[39m
 [90m [48062228] [39m[92m+ FilePathsBase v0.9.22[39m
 [90m [ea10d353] [39m[92m+ WeakRefStrings v1.4.2[39m
 [90m [76eceee3] [39m[92m+ WorkerUtilities v1.6.1[39m
[32m[1mPrecompiling[22m[39m project...
[32m  ✓ [39m[90mWorkerUtilities[39m
[32m  ✓ [39m[90mWeakRefStrings[39m
[32m  ✓ [39m[90mFilePathsBase[39m
[32m  ✓ [39mCSV
  4 dependencies successfully precompiled in 23 seconds. 190 already precompiled.


In [6]:
using CSV
using DataFrames

# Load the CSV file
data = CSV.read("/content/wage2015_subsample_inference.csv", DataFrame)

names(data)


21-element Vector{String}:
 "rownames"
 "wage"
 "lwage"
 "sex"
 "shs"
 "hsg"
 "scl"
 "clg"
 "ad"
 "mw"
 "so"
 "we"
 "ne"
 "exp1"
 "exp2"
 "exp3"
 "exp4"
 "occ"
 "occ2"
 "ind"
 "ind2"

**2. Are there missing values? Display the number of missings (if any) of each variable in the data set.**

In [7]:
# Count the missing values in each column
missing_count = map(col -> sum(ismissing.(data[:, col])), names(data))

# Build a DataFrame to display the results
missing_data = DataFrame(variable = names(data), missing_count = missing_count)
missing_data

| Row | variable (String) | missing_count (Int64) |
|---|---|---|
| 1 | rownames | 0 |
| 2 | wage | 0 |
| 3 | lwage | 0 |
| 4 | sex | 0 |
| 5 | shs | 0 |
| 6 | hsg | 0 |
| 7 | scl | 0 |
| 8 | clg | 0 |
| 9 | ad | 0 |
| 10 | mw | 0 |


**3. Report descriptive statistics of the variables (mean, standard deviation, percentiles, etc.). Interpret your results.**

In [8]:
# Compute the descriptive statistics
stats = describe(data)
stats


| Row | variable (Symbol) | mean (Float64) | min (Real) | median (Float64) | max (Real) | nmissing (Int64) | eltype (DataType) |
|---|---|---|---|---|---|---|---|
| 1 | rownames | 15636.3 | 10.0 | 15260.0 | 32643.0 | 0 | Int64 |
| 2 | wage | 23.4104 | 3.02198 | 19.2308 | 528.846 | 0 | Float64 |
| 3 | lwage | 2.97079 | 1.10591 | 2.95651 | 6.2707 | 0 | Float64 |
| 4 | sex | 0.444466 | 0.0 | 0.0 | 1.0 | 0 | Float64 |
| 5 | shs | 0.023301 | 0.0 | 0.0 | 1.0 | 0 | Float64 |
| 6 | hsg | 0.243883 | 0.0 | 0.0 | 1.0 | 0 | Float64 |
| 7 | scl | 0.278058 | 0.0 | 0.0 | 1.0 | 0 | Float64 |
| 8 | clg | 0.31767 | 0.0 | 0.0 | 1.0 | 0 | Float64 |
| 9 | ad | 0.137087 | 0.0 | 0.0 | 1.0 | 0 | Float64 |
| 10 | mw | 0.259612 | 0.0 | 0.0 | 1.0 | 0 | Float64 |


The descriptive statistics show wide variation across the variables. Hourly wages range from 3.02 to 528.85, with a mean of 23.41 and a median of 19.23, so the wage distribution is right-skewed. For the binary variable sex, the mean of 0.444 indicates that about 44.4% of the individuals are women (sex = 1). The experience variables (exp1, exp2, etc.) span wide ranges, with a maximum of 487.97 for exp4. No variable has missing values. Some variables, such as wage and exp4, show possible extreme values, which suggests checking for outliers in the analysis.
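
The question also asks for standard deviations and percentiles, which the default `describe` output above does not display. The following is a minimal sketch of how they could be computed with the `Statistics` standard library. The vector `toy_wage` is made up for illustration and is not taken from the data set.

```julia
using Statistics

# Hypothetical hourly wages; NOT taken from wage2015_subsample_inference.csv.
toy_wage = [3.5, 12.0, 19.0, 23.0, 31.0, 48.0, 120.0]

# Summary statistics requested in the question.
summary_stats = (
    mean   = mean(toy_wage),
    std    = std(toy_wage),              # sample standard deviation (divides by n - 1)
    q25    = quantile(toy_wage, 0.25),   # 25th percentile
    median = median(toy_wage),
    q75    = quantile(toy_wage, 0.75),   # 75th percentile
)

println(summary_stats)
```

On the real data, the same function calls can be applied column by column, for example `std(data[!, "wage"])`. DataFrames.jl should also accept the statistics as arguments, as in `describe(data, :mean, :std, :q25, :median, :q75)`.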

**4. How many women with a college graduate degree (clg) or above have a wage corresponding to the 25% richest of the sample? Report the dataframe corresponding to these conditions and the result.**

In [11]:
using Statistics

# 1: Keep only women with a college degree or above
women_clg_above = data[(data[!, "sex"] .== 1) .&& (data[!, "clg"] .== 1 .|| data[!, "ad"] .== 1), :]

# 2: 75th percentile of the log wage (lwage) over the whole sample
percentile_75 = quantile(data[!, "lwage"], 0.75)

# 3: Women who are in the richest 25%
rich_women_clg_above = women_clg_above[women_clg_above[!, "lwage"] .>= percentile_75, :]

rich_women_clg_above, size(rich_women_clg_above, 1)


(419×21 DataFrame
 Row │ rownames  wage      lwage    sex      shs      hsg      scl      clg      ad       mw       ⋯
     │ Int64     Float64   Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────────────
   1 │       19   28.8462  3.36198      1.0      0.0      0.0      0.0      1.0      0.0      0.0  ⋯
   2 │      191   42.3077  3.74497      1.0      0.0      0.0      0.0      0.0      1.0      0.0
   3 │      232   41.2088  3.71865      1.0      0.0      0.0      0.0      1.0      0.0      0.0
   4 │      319  100.0     4.60517      1.0      0.0      0.0      0.0      0.0      1.0      0.0
   5 │      563   33.6538  3.51613      1.0      0.0      0.0      0.0      1.0  

**5. How many men with a high school graduate degree (hsg) or below have a wage corresponding to the 25% richest of the sample? Report the dataframe corresponding to these conditions and the result.**

In [12]:
# 1: Keep only men who are high school graduates (hsg) or below (shs)
men_hsg_below = data[(data[!, "sex"] .== 0) .&& (data[!, "hsg"] .== 1 .|| data[!, "shs"] .== 1), :]

# 2: 75th percentile of the log wage (lwage) over the whole sample
percentile_75 = quantile(data[!, "lwage"], 0.75)

# 3: Men who are in the richest 25%
rich_men_hsg_below = men_hsg_below[men_hsg_below[!, "lwage"] .>= percentile_75, :]

rich_men_hsg_below, size(rich_men_hsg_below, 1)


(118×21 DataFrame
 Row │ rownames  wage      lwage    sex      shs      hsg      scl      clg      ad       mw       ⋯
     │ Int64     Float64   Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────────────
   1 │      113   27.8846  3.32808      0.0      0.0      1.0      0.0      0.0      0.0      0.0  ⋯
   2 │      276   28.8462  3.36198      0.0      0.0      1.0      0.0      0.0      0.0      0.0
   3 │      467   28.8462  3.36198      0.0      0.0      1.0      0.0      0.0      0.0      0.0
   4 │      858   28.8462  3.36198      0.0      0.0      1.0      0.0      0.0      0.0      0.0
   5 │      876   29.7143  3.39163      0.0      0.0      1.0      0.0      0.0  
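
Questions 4 and 5 apply the same logic: take the 75th percentile of the log wage over the whole sample, then count the members of a subgroup at or above it. A minimal sketch of that logic as a reusable helper follows. The function name `count_top_quartile` and the toy vectors are hypothetical, introduced only for illustration.

```julia
using Statistics

# Count observations in `group` whose log wage is at or above the
# 75th percentile of the WHOLE sample (not of the group itself).
function count_top_quartile(lwage::AbstractVector{<:Real}, group::AbstractVector{Bool})
    threshold = quantile(lwage, 0.75)
    return count(group .& (lwage .>= threshold))
end

# Hypothetical toy data, for illustration only.
toy_lwage  = [2.1, 2.5, 2.9, 3.0, 3.2, 3.4, 3.8, 4.1]
toy_female = [true, false, true, false, true, false, true, false]

n_rich_women = count_top_quartile(toy_lwage, toy_female)
n_rich_men   = count_top_quartile(toy_lwage, .!toy_female)
println((n_rich_women, n_rich_men))
```

Computing the threshold on the full sample matters: taking the quantile inside each subgroup would answer a different question, namely who is rich relative to their own group.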

**6. Create two dataframes. One containing only the log(wage) and the other containing every variable of the data set but the wage related variables.**

In [13]:
# DataFrame with the log(wage) variable only
df_log_wage = data[:, ["lwage"]]

# DataFrame with every variable except the wage-related ones
df_no_wage = select(data, Not(["lwage", "wage"]))

df_log_wage, df_no_wage


(5150×1 DataFrame
  Row │ lwage
      │ Float64
──────┼─────────
    1 │ 2.26336
    2 │ 3.8728
    3 │ 2.40313
    4 │ 2.63493
    5 │ 3.36198
    6 │ 2.46222
    7 │ 2.95651
    8 │ 2.95651
    9 │ 2.48491
   10 │ 2.95651
   11 │ 2.85115
  ⋮   │    ⋮
 5141 │ 3.81874
 5142 │ 3.11778
 5143 │ 2.82298
 5144 │ 3.17966
 5145 │ 2.62801
 5146 │ 2.69255
 5147 │ 3.13883
 5148 │ 3.64966
 5149 │ 3.49551
 5150 │ 2.85115
5129 rows omitted, 5150×19 DataFrame
  Row │ rownames  sex      shs      hsg      scl      clg      ad       mw       so       we       ⋯
      │ Int64     Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  ⋯
──────┼────────────────────────────────────────────────────────────────────────────────────────────

**2. Data wrangling**

**7. Make an array for our Y variable, which will be the logarithm of wage (lwage column)**

In [14]:
# Create an array for the Y variable, which is the lwage column
Y = data[!, "lwage"]

# Display the array Y
Y


5150-element SentinelArrays.ChainedVector{Float64, Vector{Float64}}:
 2.2633643798407643
 3.872802292274865
 2.403126322215923
 2.634927936273247
 3.361976668508874
 2.4622152385859297
 2.9565115604007097
 2.9565115604007097
 2.4849066497880004
 2.9565115604007097
 2.8511510447428834
 2.486507931154974
 2.486507931154974
 ⋮
 2.981204172991081
 3.0518217402050345
 3.818735071004589
 3.117779707996832
 2.822980167776187
 3.1796551117149194
 2.6280074934286737
 2.6925460145662448
 3.138833117194664
 3.649658740960655
 3.4955080611333966
 2.8511510447428834

**8. Make three new arrays for our predictors:**

**8.1. The basic model will have the columns sex hsg scl clg ad so we ne exp1 occ2 ind2. Make sure to convert occ2 and ind2 to dummies and to drop the first dummy value to prevent multicollinearity.**

sex+exp1+hsg+scl+clg+ad+so+we+ne+dummy(occ2)+dummy(ind2)

In [17]:
Pkg.add("StatsModels")

[32m[1m   Resolving[22m[39m package versions...
[32m[1m    Updating[22m[39m `~/.julia/environments/v1.8/Project.toml`
 [90m [3eaba693] [39m[92m+ StatsModels v0.7.4[39m
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.8/Manifest.toml`


In [25]:
Pkg.add("CategoricalArrays")

[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m CategoricalArrays ─ v0.10.8
[32m[1m    Updating[22m[39m `~/.julia/environments/v1.8/Project.toml`
 [90m [324d7699] [39m[92m+ CategoricalArrays v0.10.8[39m
[32m[1m    Updating[22m[39m `~/.julia/environments/v1.8/Manifest.toml`
 [90m [324d7699] [39m[92m+ CategoricalArrays v0.10.8[39m
[32m[1mPrecompiling[22m[39m project...
[32m  ✓ [39mCategoricalArrays
  1 dependency successfully precompiled in 2 seconds. 194 already precompiled.


In [26]:
using DataFrames
using CategoricalArrays  # provides categorical() for handling categorical variables

# Step 1: start from the data without the original occ2 and ind2 columns
data_dummies = select(data, Not(["occ2", "ind2"]))

# Add occ2 and ind2 back as categorical columns (converted to strings first).
# Note: this stores category labels; it does not expand them into 0/1 dummy columns.
data_dummies[!, "occ2_str"] = categorical(string.(data[!, "occ2"]))
data_dummies[!, "ind2_str"] = categorical(string.(data[!, "ind2"]))

# Step 2: select the columns of the basic model
X_basic = select(data_dummies, ["sex", "hsg", "scl", "clg", "ad", "so", "we", "ne", "exp1", "occ2_str", "ind2_str"])

# Step 3: convert X_basic to an array
X_basic_array = Matrix(X_basic)

# Display the array
X_basic_array


5150×11 Matrix{Any}:
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0   7.0  "11"  "18"
 0.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  31.0  "10"  "9"
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  18.0  "19"  "4"
 1.0  0.0  0.0  0.0  1.0  0.0  0.0  1.0  25.0  "1"   "12"
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  22.0  "6"   "22"
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0   1.0  "5"   "14"
 1.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  42.0  "17"  "14"
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  37.0  "17"  "9"
 1.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  31.0  "13"  "19"
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0   4.0  "10"  "18"
 1.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0   7.0  "13"  "18"
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  30.0  "14"  "18"
 1.0  0.0  1.0  0.0  0.0  0.0  0.0  1.0   5.5  "11"  "18"
 ⋮                        ⋮                          ⋮
 1.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0   8.0  "6"   "18"
 0.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  28.0  "1"   "21"
 1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0   5.0  "10"  "18"
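
The output above is a `Matrix{Any}` in which occ2 and ind2 remain category labels, so the dummy expansion requested in 8.1 is still pending. The following is a minimal sketch of one way to build the dummies while dropping the first level. The helper `dummies_drop_first` and the vector `toy_occ2` are hypothetical names introduced for illustration, and the choice of the smallest level as the reference category is an assumption.

```julia
# Expand a categorical vector into 0/1 dummy columns, dropping the first
# (sorted) level so the columns are not collinear with an intercept.
function dummies_drop_first(v::AbstractVector)
    levels = sort(unique(v))
    kept = levels[2:end]                       # the first level is the reference
    D = zeros(Float64, length(v), length(kept))
    for (j, lvl) in enumerate(kept)
        D[:, j] .= (v .== lvl)                 # 1.0 where the row has this level
    end
    return D, kept
end

# Hypothetical occupation codes, for illustration only.
toy_occ2 = [11, 10, 19, 1, 10, 11]
D_occ2, kept_levels = dummies_drop_first(toy_occ2)

println(kept_levels)   # levels that received a column
println(size(D_occ2))  # (observations, number of levels - 1)
```

Concatenating the numeric regressors with such blocks, for example with `hcat`, would give a `Matrix{Float64}` that can be passed to a regression routine directly.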

**8.2. The flexible model will have the same columns, and will also include polynomials for experience (exp2 exp3 exp4), as well as the interactions between all experience variables and other variables except for sex. This means**

$$sex + exp1 + exp2 + exp3 + exp4 + hsg + scl + clg + ad + so + we + ne + dummy(occ2) + dummy(ind2) + (exp1 + exp2 + exp3 + exp4) \times (hsg + scl + clg + ad + so + we + ne + dummy(occ2) + dummy(ind2))$$

Hint: you can use a for loop to multiply the desired variables and create the interactions. Some packages might also have ways to easily specify the model variables with strings like R does by default

In [30]:
println(names(data_dummies))


["rownames", "wage", "lwage", "sex", "shs", "hsg", "scl", "clg", "ad", "mw", "so", "we", "ne", "exp1", "exp2", "exp3", "exp4", "occ", "ind", "exp1_hsg", "exp1_scl", "exp1_clg", "exp1_ad", "exp1_so", "exp1_we", "exp1_ne"]


In [33]:
using DataFrames
using CategoricalArrays

# Create the experience polynomials
data_dummies[!, :exp2] = data_dummies[!, :exp1].^2
data_dummies[!, :exp3] = data_dummies[!, :exp1].^3
data_dummies[!, :exp4] = data_dummies[!, :exp1].^4

# Variables to interact with the experience terms (all except sex).
# Note: occ and ind enter here as raw numeric codes, not as dummy(occ2) and dummy(ind2).
variables_interaccion = [:hsg, :scl, :clg, :ad, :so, :we, :ne, :occ, :ind]
# Create the interactions between exp1, exp2, exp3, exp4 and the selected variables
for exp_var in [:exp1, :exp2, :exp3, :exp4]
    for var in variables_interaccion
        # Create a new column for each interaction
        data_dummies[!, Symbol("$(exp_var)_$(var)")] = data_dummies[!, exp_var] .* data_dummies[!, var]
    end
end

# Select the base columns of the flexible model
X_flexible = select(data_dummies, [:sex, :hsg, :scl, :clg, :ad, :so, :we, :ne, :exp1, :exp2, :exp3, :exp4, :occ, :ind])

# Append the interaction columns to the DataFrame
for exp_var in [:exp1, :exp2, :exp3, :exp4]
    for var in variables_interaccion
        insertcols!(X_flexible, Symbol("$(exp_var)_$(var)") => data_dummies[!, Symbol("$(exp_var)_$(var)")])
    end
end


X_flexible_array = Matrix(X_flexible)

X_flexible_array



5150×50 Matrix{Float64}:
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0   7.0    49.0   …       8.6436e6       2.00964e7
 0.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  31.0   961.0           2.81674e9      4.68225e9
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  18.0   324.0           6.5715e8       8.08315e7
 1.0  0.0  0.0  0.0  1.0  0.0  0.0  1.0  25.0   625.0           1.64062e8      2.73047e9
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  22.0   484.0           4.72026e8      2.2184e9
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0   1.0     1.0   …    1650.0         7460.0
 1.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  42.0  1764.0           1.59319e10     2.26531e10
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  37.0  1369.0           9.8206e9       1.06452e10
 1.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  31.0   961.0           3.73102e9      7.93305e9
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0   4.0    16.0      833280.0            2.09664e6
 1.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0   7.0    49.0   …       9.65202e6      1.98563e7
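
The cell above multiplies the experience terms by the raw occ and ind codes. If dummy columns are available instead, the same interactions can be formed on matrices. This is a minimal sketch under that assumption; the helper `interact` and the toy inputs `E` and `Z` are hypothetical names introduced for illustration.

```julia
# Build every column-wise product between the columns of E (experience terms)
# and the columns of Z (other regressors, e.g. education, region, and dummies).
function interact(E::AbstractMatrix, Z::AbstractMatrix)
    size(E, 1) == size(Z, 1) || throw(DimensionMismatch("row counts differ"))
    # Outer loop over experience terms, inner loop over the other regressors.
    cols = [E[:, k] .* Z[:, j] for k in axes(E, 2) for j in axes(Z, 2)]
    return reduce(hcat, cols)
end

# Hypothetical toy inputs, for illustration only.
toy_exp1 = [7.0, 31.0, 18.0]
E = hcat(toy_exp1, toy_exp1 .^ 2)          # columns: exp1, exp2
Z = [1.0 0.0; 0.0 1.0; 0.0 0.0]            # two dummy columns

X_inter = interact(E, Z)
println(size(X_inter))
```

The result has `size(E, 2) * size(Z, 2)` columns, ordered by experience term first, and can be concatenated with the main effects using `hcat`.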
 

**8.3. The extra-flexible model will include all two-way interactions between variables, except for sex. This means**

$$sex + (exp1 + exp2 + exp3 + exp4 + hsg + scl + clg + ad + so + we + ne + dummy(occ2) + dummy(ind2))^2$$

Hint: If you use a for loop here, you might create several variables identical to others. You can make another procedure to get rid of duplicates

In [34]:
using DataFrames
using CategoricalArrays

# Define the variables used in the interactions (excluding 'sex')
variables = [:exp1, :exp2, :exp3, :exp4, :hsg, :scl, :clg, :ad, :so, :we, :ne, :occ, :ind]

# Two-way interactions between the variables (except sex)
for i in 1:length(variables)
    for j in (i+1):length(variables)
        # Create a new column for each interaction between two variables
        var1 = variables[i]
        var2 = variables[j]
        interaction_name = Symbol("$(var1)_$(var2)")
        data_dummies[!, interaction_name] = data_dummies[!, var1] .* data_dummies[!, var2]
    end
end

# Select the base columns of the extra-flexible model
X_extra_flexible = select(data_dummies, [:sex, :exp1, :exp2, :exp3, :exp4, :hsg, :scl, :clg, :ad, :so, :we, :ne, :occ, :ind])

# Append the interaction columns
for i in 1:length(variables)
    for j in (i+1):length(variables)
        var1 = variables[i]
        var2 = variables[j]
        interaction_name = Symbol("$(var1)_$(var2)")
        insertcols!(X_extra_flexible, interaction_name => data_dummies[!, interaction_name])
    end
end

X_extra_flexible_array = Matrix(X_extra_flexible)

X_extra_flexible_array


5150×92 Matrix{Float64}:
 1.0   7.0    49.0     343.0      2401.0        0.0  …     0.0  3600.0  8370.0       3.0132e7
 0.0  31.0   961.0   29791.0    923521.0        0.0        0.0  3050.0  5070.0       1.54635e7
 0.0  18.0   324.0    5832.0    104976.0        1.0        0.0  6260.0   770.0       4.8202e6
 1.0  25.0   625.0   15625.0    390625.0        0.0        0.0   420.0  6990.0       2.9358e6
 1.0  22.0   484.0   10648.0    234256.0        0.0        0.0  2015.0  9470.0       1.9082e7
 1.0   1.0     1.0       1.0         1.0        0.0  …     0.0  1650.0  7460.0       1.2309e7
 1.0  42.0  1764.0   74088.0         3.1117e6   1.0        0.0  5120.0  7280.0       3.72736e7
 0.0  37.0  1369.0   50653.0         1.87416e6  1.0        0.0  5240.0  5680.0       2.97632e7
 1.0  31.0   961.0   29791.0    923521.0        1.0        0.0  4040.0  8590.0       3.47036e7
 1.0   4.0    16.0      64.0       256.0        0.0        0.0  3255.0  8190.0       2.66584e7
 1.0   7.0    49.0     343.0  
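
As the hint notes, looping over all pairs can create columns identical to existing ones: for instance, exp1 × exp2 reproduces exp3 when the polynomials are plain powers of exp1. A minimal sketch of one way to remove exact duplicates follows, using `unique` along the column dimension. The helper name `drop_duplicate_columns` and the toy data are hypothetical.

```julia
# Keep only the first occurrence of each distinct column.
drop_duplicate_columns(X::AbstractMatrix) = unique(X, dims = 2)

# Hypothetical toy data, for illustration only.
toy_exp1 = [1.0, 2.0, 3.0]
toy_exp2 = toy_exp1 .^ 2
toy_exp3 = toy_exp1 .^ 3

# The last column (exp1 × exp2) duplicates exp3.
X = hcat(toy_exp1, toy_exp2, toy_exp3, toy_exp1 .* toy_exp2)
X_unique = drop_duplicate_columns(X)
println(size(X), " -> ", size(X_unique))
```

This comparison is exact, so columns that agree only up to floating-point rounding would survive and would need a tolerance-based check instead.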

**3. Linear Regressions**

**Split each of the dataframes created (basic, flexible and extra-flexible models) into a training sample (80% of the data) and a test sample. Use the normalized data for this. Hint: You do not need to normalize the data for dummy variables.**
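
The cells below use `X_basic_norm`, `X_flexible_norm` and `X_extra_flexible_norm`, whose construction is not shown here. The following is a minimal sketch of one way such matrices could be produced, assuming z-score normalization that leaves 0/1 dummy columns untouched. The helper `standardize_non_dummies` and the toy matrix are hypothetical, and the normalization actually used may differ.

```julia
using Statistics

# Z-score every column that is not a 0/1 dummy; leave dummies untouched.
function standardize_non_dummies(X::AbstractMatrix{<:Real})
    Xn = float.(X)                              # new Float64 copy; X is not modified
    for j in axes(Xn, 2)
        col = Xn[:, j]
        is_dummy = all(x -> x == 0 || x == 1, col)
        s = std(col)
        if !is_dummy && s > 0                   # skip dummies and constant columns
            Xn[:, j] .= (col .- mean(col)) ./ s
        end
    end
    return Xn
end

# Hypothetical toy matrix: a dummy column and an experience column.
toy_X = [1.0 7.0; 0.0 31.0; 0.0 18.0; 1.0 25.0]
toy_X_norm = standardize_non_dummies(toy_X)
println(toy_X_norm)
```

After this step the experience column has mean zero and unit sample standard deviation, while the dummy column keeps its 0/1 coding.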

In [40]:
using Random


Random.seed!(1234)
# Number of training rows (80% of the sample)
train_size = floor(Int, 0.8 * size(X_basic_norm, 1))


permuted_indices = randperm(size(X_basic_norm, 1))

# Indices for the training (80%) and test (20%) samples
train_indices = permuted_indices[1:train_size]
test_indices = permuted_indices[train_size+1:end]

# Create the training and test sets
X_basic_train = X_basic_norm[train_indices, :]
X_basic_test = X_basic_norm[test_indices, :]

println("Training size: ", size(X_basic_train))
println("Test size: ", size(X_basic_test))


Training size: (4120, 11)
Test size: (1030, 11)


In [41]:

Random.seed!(1234)

train_size_flexible = floor(Int, 0.8 * size(X_flexible_norm, 1))

permuted_indices_flexible = randperm(size(X_flexible_norm, 1))

# Indices for the training (80%) and test (20%) samples of the flexible model
train_indices_flexible = permuted_indices_flexible[1:train_size_flexible]
test_indices_flexible = permuted_indices_flexible[train_size_flexible+1:end]

# Create the training and test sets for the flexible model
X_flexible_train = X_flexible_norm[train_indices_flexible, :]
X_flexible_test = X_flexible_norm[test_indices_flexible, :]

println("Training size (flexible model): ", size(X_flexible_train))
println("Test size (flexible model): ", size(X_flexible_test))


Training size (flexible model): (4120, 50)
Test size (flexible model): (1030, 50)


In [42]:

Random.seed!(1234)


train_size_extra_flexible = floor(Int, 0.8 * size(X_extra_flexible_norm, 1))

permuted_indices_extra_flexible = randperm(size(X_extra_flexible_norm, 1))

# Indices for the training (80%) and test (20%) samples of the extra-flexible model
train_indices_extra_flexible = permuted_indices_extra_flexible[1:train_size_extra_flexible]
test_indices_extra_flexible = permuted_indices_extra_flexible[train_size_extra_flexible+1:end]

X_extra_flexible_train = X_extra_flexible_norm[train_indices_extra_flexible, :]
X_extra_flexible_test = X_extra_flexible_norm[test_indices_extra_flexible, :]

println("Training size (extra-flexible model): ", size(X_extra_flexible_train))
println("Test size (extra-flexible model): ", size(X_extra_flexible_test))


Training size (extra-flexible model): (4120, 92)
Test size (extra-flexible model): (1030, 92)
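
Because each of the three cells reseeds the generator with the same value and permutes the same number of rows, all three splits select the same observations. The outcome vector needs to be split with those same indices so that rows stay aligned. The following is a minimal sketch that draws the indices once and reuses them. The helper `split_indices` and the vector `toy_Y` are hypothetical, and the sketch uses a local `MersenneTwister` instead of `Random.seed!`, so its indices will not match those produced by the cells above.

```julia
using Random

# Split row indices 1:n into training and test sets with a fixed seed.
function split_indices(n::Integer; train_frac = 0.8, seed = 1234)
    rng = MersenneTwister(seed)                 # local generator, independent of global state
    perm = randperm(rng, n)
    n_train = floor(Int, train_frac * n)
    return perm[1:n_train], perm[n_train+1:end]
end

train_idx, test_idx = split_indices(5150)

# Hypothetical stand-in for the outcome vector Y (lwage).
toy_Y = randn(MersenneTwister(1), 5150)
Y_train, Y_test = toy_Y[train_idx], toy_Y[test_idx]
println((length(Y_train), length(Y_test)))
```

The same `train_idx` and `test_idx` can then index the rows of each design matrix, which keeps predictors and outcome matched across the basic, flexible and extra-flexible models.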
