In [None]:
using RData, DataFrames  # RData provides load() for .RData files
rdata_read = load("../../../data/wage2015_subsample_inference.RData")
data = rdata_read["data"]
names(data)
println("Number of rows: ", size(data, 1), "\n", "Number of columns: ", size(data, 2))  # rows and columns

# An inferential problem: The Gender Wage Gap

In the previous lab, we analyzed data from the March Supplement of the U.S. Current Population Survey (2015) and asked how to use job-relevant characteristics, such as education and experience, to best predict wages. Now, we focus on the following inference question:

What is the difference in predicted wages between men and women with the same job-relevant characteristics?

Thus, we analyze whether men and women are paid differently (*gender wage gap*). The gender wage gap may partly reflect *discrimination* against women in the labor market and may partly reflect a *selection effect*, namely that women are relatively more likely to take on occupations that pay somewhat less (for example, school teaching).

To investigate the gender wage gap, we consider the following log-linear regression model

\begin{align}
\log(Y) &= \beta'X + \epsilon\\
&= \beta_1 D  + \beta_2' W + \epsilon,
\end{align}

where $D$ is the indicator of being female ($1$ if female and $0$ otherwise) and the
$W$'s are controls explaining variation in wages. Because wages enter in logarithms, $\beta_1$ measures the relative (approximately percentage) difference in pay between women and men.
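
To make this interpretation concrete, here is a short worked step under the model above, holding $W$ and $\epsilon$ fixed and writing $Y_1$ and $Y_0$ for the wages with $D=1$ and $D=0$ (notation introduced here only for illustration):

\begin{align}
\log(Y_1) - \log(Y_0) &= \beta_1, \\
\frac{Y_1 - Y_0}{Y_0} &= e^{\beta_1} - 1 \approx \beta_1 \quad \text{for small } |\beta_1|.
\end{align}

For example, a hypothetical value $\beta_1 = -0.07$ corresponds to an exact relative difference of $e^{-0.07} - 1 \approx -0.068$, that is, wages about $6.8$\% lower.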

In [None]:
using Pkg  # must be loaded before calling Pkg.add
Pkg.add("DataFrames")
Pkg.add("Dates")
Pkg.add("Plots")
Pkg.add("CategoricalArrays")
Pkg.add("RData")  # needed to read .RData files

In [None]:
using Pkg
using DataFrames
using Dates
using Plots
using Statistics, RData  # RData reads data stored in R format
using CategoricalArrays  # categorical data

***Variable description***

- occ: occupational classification
- ind: industry classification
- lwage: log hourly wage
- sex: gender (1 = female, 0 = male)
- shs: some high school
- hsg: high school graduate
- scl: some college
- clg: college graduate
- ad: advanced degree
- ne: Northeast
- mw: Midwest
- so: South
- we: West
- exp1: experience

### **1.  Analysis in the subset of workers with advanced college education (variables scl, clg, ad).**

Next, we will conduct an analysis for the subset of workers with advanced college education. To do this, we will restrict our data and keep only those who have Some College, College Graduate, or Advanced Degree. 

In [None]:
# Keep workers with some college, a college degree, or an advanced degree
college = (data[!, :scl] .== 1) .| (data[!, :clg] .== 1) .| (data[!, :ad] .== 1)
vars = [:lwage, :sex, :shs, :hsg, :scl, :clg, :ad, :ne, :mw, :so, :we, :exp1]

sub_data = data[college, :]
Z = sub_data[:, vars]

# Filter data for women
data_female = sub_data[sub_data[!, :sex] .== 1, :]
Z_female = data_female[!, vars]

# Filter data for men
data_male = sub_data[sub_data[!, :sex] .== 0, :]
Z_male = data_male[!, vars]

means = DataFrame( variables = names(Z), All = describe(Z, :mean)[!,2], Men = describe(Z_male,:mean)[!,2], Female = describe(Z_female,:mean)[!,2])


In particular, the table above shows that the difference in average log wage between men and women is equal to $0.0750$.

In [None]:
mean(Z_female[:,:lwage]) - mean(Z_male[:,:lwage])

Thus, the unconditional gender wage gap is about $7.5$\% for workers with at least some college education (women get paid less on average in our sample).

#### 1.1 OLS estimation without controls

In [None]:
# Install the packages we may need
#Pkg.add("GLM") # package to fit regression models
#Pkg.add("StatsPlots")
#Pkg.add("MLBase")
#Pkg.add("Tables")
#Pkg.add("CovarianceMatrices") # robust standard errors
# Load the installed packages
using DataFrames
using Tables
using GLM
using CovarianceMatrices


In [None]:
nocontrol_model = lm(@formula(lwage ~ sex), sub_data)
nocontrol_est = GLM.coef(nocontrol_model)[2]
# Classical (non-robust) OLS standard error taken from the coefficient table
nocontrol_se = GLM.coeftable(nocontrol_model).cols[2][2]

# Robust (HC1) alternative, left disabled:
# nocontrol_se1 = stderror(HC1(), nocontrol_model)[2]
# Confidence interval for the coefficient on sex
CI1upper = confint(nocontrol_model)[2, 2]
CI1low = confint(nocontrol_model)[2, 1]

println("The estimated gender coefficient is ", nocontrol_est, " and the corresponding standard error is ", nocontrol_se)

#### 1.2 OLS estimation with controls

Next, we run an OLS regression of $\log(Y)$ on $(D,W)$ to control for the effect of the covariates summarized in $W$:

\begin{align}
\log(Y) &=\beta_1 D  + \beta_2' W + \epsilon.
\end{align}

Here, we are considering the flexible model from the previous lab. Hence, $W$ controls for experience, education, region, and occupation and industry indicators plus transformations and two-way interactions.

In [None]:
flex = @formula(lwage ~ sex + (exp1+exp2+exp3+exp4) * (shs+hsg+scl+clg+occ2+ind2+mw+so+we))
control_model = lm(flex, sub_data)
control_est = GLM.coef(control_model)[2]
# Classical (non-robust) OLS standard error; the robust version is left disabled
control_se = GLM.coeftable(control_model).cols[2][2]
#control_se1 = stderror( HC0(), control_model)[2]
CI2upper = confint(control_model)[2, 2]
CI2low = confint(control_model)[2, 1]

println("Coefficient for OLS with controls: ", control_est, ", standard error: ", control_se)

#### 1.3 Partialling-Out using OLS

In [None]:
# Models
# Model for Y (log wage) and model for D (sex), each regressed on the controls
flex_y = @formula(lwage ~ (exp1+exp2+exp3+exp4) * (shs+hsg+scl+clg+occ2+ind2+mw+so+we))
flex_d = @formula(sex ~ (exp1+exp2+exp3+exp4) * (shs+hsg+scl+clg+occ2+ind2+mw+so+we))

# Residuals of Y and D after partialling out the controls
t_Y = residuals(lm(flex_y, sub_data))
t_D = residuals(lm(flex_d, sub_data))

# Regress residualized Y on residualized D
data_res = DataFrame(t_Y = t_Y, t_D = t_D )
partial_fit = lm(@formula(t_Y ~ t_D), data_res)
partial_est = GLM.coef(partial_fit)[2]

# Standard error (classical; the robust version is left disabled)
partial_se = GLM.coeftable(partial_fit).cols[2][2]

#partial_se1 = stderror( HC0(), partial_fit)[2]

# Confidence interval
CI3upper = confint(partial_fit)[2, 2]
CI3low = confint(partial_fit)[2, 1]
println("Coefficient for D via partialling-out: ", partial_est, ", standard error: ", partial_se)

The partialling-out approach works well when the dimension of $W$ is low
relative to the sample size $n$. When the dimension of $W$ is relatively high, we need
variable selection or penalization for regularization purposes.
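
In the low-dimensional case, the coefficient from partialling-out coincides with the coefficient from the full regression. The following is a minimal sketch of that equivalence on synthetic data, not on the wage sample; the variable names (`res_y`, `res_D`, `beta_full`, `beta_partial`) and the data-generating values are assumptions made only for this illustration.

```julia
using LinearAlgebra, Random

Random.seed!(1)                            # reproducible synthetic data (illustrative only)
n = 500
W = hcat(ones(n), randn(n, 3))             # controls, including an intercept column
D = Float64.(W[:, 2] .+ randn(n) .> 0)     # binary regressor correlated with the controls
y = 0.5 .* D .+ W * [1.0, 0.3, -0.2, 0.1] .+ randn(n)

# Full regression of y on (D, W); the first coefficient belongs to D
beta_full = hcat(D, W) \ y

# Partialling-out: residualize y and D on W, then regress residual on residual
res_y = y - W * (W \ y)
res_D = D - W * (W \ D)
beta_partial = dot(res_D, res_y) / dot(res_D, res_D)

println("full regression: ", beta_full[1])
println("partialling-out: ", beta_partial)
```

The two printed numbers should agree up to floating-point error, which is what the OLS comparison in sections 1.2 and 1.3 relies on.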

In the following, we illustrate the partialling-out approach using lasso instead of OLS.

In [None]:
DataFrame(Model = ["Without controls", "Full regression", "Partial regression"],
Estimate = [nocontrol_est, control_est, partial_est],
StdError = [nocontrol_se, control_se, partial_se])

### **2. Descriptive statistics for the subset of workers with advanced college education (variables scl, clg, ad).**


#### 2.1 Wage and log wage

In [None]:
using StatsPlots  # provides density() and boxplot()
plot_density = density(data.wage, fillcolor = :blue, fillalpha = 0.5,
                        legend = false, title = "Figure 1: Density of wages",
                        xlabel = "Wage", ylabel = "Density")

In [None]:
plot_density = density(data.lwage, fillcolor = :blue, fillalpha = 0.5,
                        legend = false, title = "Figure 2: Density of log wages",
                        xlabel = "Log Wage", ylabel = "Density")

#### 2.2 Descriptive data by gender

In [None]:
Bplot = boxplot(data.lwage, title = "BoxPlot - Log Wage by Gender",
    ylabel = "Log Wage", xlabel = "Gender",label = "All", legend =:outerright )
boxplot!(Z_female.lwage, label = "Female" )
boxplot!(Z_male.lwage, label = "Male")

### **3. Confidence intervals for the coefficient on sex across models**


In [None]:
CIdf = DataFrame(
    x = ["No Control", "With Control", "Partialling Out"],
    y = [nocontrol_est, control_est, partial_est],
    lower = [CI1low, CI2low, CI3low],
    upper = [CI1upper, CI2upper, CI3upper]
)

In [None]:
using Gadfly
plot_density =  Gadfly.plot(CIdf, x=:x, y=:y, ymin=:lower, ymax=:upper,
                    Geom.point(), Geom.errorbar(),
                    Guide.title("Figure 9: C.I. of the sex variable according to different estimations."),
                    Theme(panel_fill="white", panel_stroke=colorant"black", default_color=colorant"darkred"))
display(plot_density)


### **4. Replication of the figure: Experience Profiles and Wage Gap for High School Graduates**


In [None]:
data_hsg = data[data.hsg .== 1, :]
data_clg = data[data.clg .== 1, :]
data_scl = data[data.scl .== 1, :]

data_clgm = data_clg[data_clg.sex .== 0, :]  # Men
data_clgf = data_clg[data_clg.sex .== 1, :]  # Women

In [None]:
# Mean log wage by experience level (exp2), plus a cumulative mean ("PromMov"):
# the mean log wage of all workers at or below each experience level
function experience_table(df)
    tabla = sort!(combine(groupby(df, :exp2), :lwage => mean => :Promlwageo), :exp2)
    tabla[!, :PromMov] = [mean(df[df.exp2 .<= nivel, :lwage]) for nivel in tabla.exp2]
    return tabla
end

# Tabla_hsg
Tabla_hsg = experience_table(data_hsg)
println(first(Tabla_hsg, 5))

# Tabla_clg: all college graduates, then men and women separately
Tabla_clg = experience_table(data_clg)
Tabla_clgm = experience_table(data_clgm)
Tabla_clgf = experience_table(data_clgf)
println(first(Tabla_clg, 5))



In [None]:
# Define the model formula
formula_flex = @formula(lwage ~ sex + (exp1 + exp2 + exp3 + exp4) * (shs + hsg + scl + clg + occ2 + ind2 + mw + so + we))

# Fit the model
control_fit1 = lm(formula_flex, data)

# Compute fitted values (named `predicted` so the `predict` function is not shadowed)
predicted = predict(control_fit1)

# Add the fitted values to the original DataFrame
data[!, :Predict] = predicted

# Filter data by education group (scl, clg, hsg)
data_sclP = filter(row -> row.scl == 1, data)
data_clgP = filter(row -> row.clg == 1, data)
data_hsgP = filter(row -> row.hsg == 1, data)

data_clgPm = filter(row -> row.clg == 1 && row.sex == 0, data)  # Men
data_clgPf = filter(row -> row.clg == 1 && row.sex == 1, data)  # Women


In [None]:
using Statistics

# Mean fitted log wage by experience level (exp2), plus the cumulative mean
# over all workers at or below each level, stored in column `cumcol`
function predicted_table(df, cumcol)
    tabla = sort!(combine(groupby(df, :exp2), :Predict => mean => :Predict), :exp2)
    tabla[!, cumcol] = [mean(df[df.exp2 .<= nivel, :Predict]) for nivel in tabla.exp2]
    return tabla
end

# High school graduates ("hsgP")
Tabla_hsgP = predicted_table(data_hsgP, :PromMovP)
println(first(Tabla_hsgP, 5))

# College graduates ("clgP"): all, women, and men
Tabla_clgP = predicted_table(data_clgP, :PromMov)
Tabla_clgPf = predicted_table(data_clgPf, :PromMov)
Tabla_clgPm = predicted_table(data_clgPm, :PromMov)

println(first(Tabla_clgP, 5))

In [None]:
using Plots

# Data: actual cumulative mean (all college graduates) vs. fitted values (men)
x = Tabla_clg.exp2
x_3 = Tabla_clgPm.exp2
y = Tabla_clg.PromMov
y_3 = Tabla_clgPm.PromMov

# Create the plot
plot(x, y, color=:navy, linestyle=:solid, label="Actual CLG")
plot!(x_3, y_3, color=:darkred, linestyle=:dash, label="Fitted CLG (male)")

# Plot settings
ylims!(3, 3.2)
xlims!(0, 15)
xlabel!("Years of Potential Experience")
ylabel!("Log Wage (or Wage Gap)")
title!("Comparison between actual and fitted for CLG Male")

# Grid style, axis ticks, and legend (set through plot attributes)
plot!(gridstyle=:dash, xticks=0:5:15, legend=:topright, legendfontsize=8)

# Show the plot
display(current())

# Data: actual cumulative mean (all college graduates) vs. fitted values (women)
x = Tabla_clg.exp2
x_3 = Tabla_clgPf.exp2
y = Tabla_clg.PromMov
y_3 = Tabla_clgPf.PromMov

# Create the plot
plot(x, y, color=:navy, linestyle=:solid, label="Actual CLG")
plot!(x_3, y_3, color=:darkred, linestyle=:dash, label="Fitted CLG (female)")

# Plot settings
ylims!(3, 3.2)
xlims!(0, 15)
xlabel!("Years of Potential Experience")
ylabel!("Log Wage (or Wage Gap)")
title!("Comparison between actual and fitted for CLG Female")

# Grid style, axis ticks, and legend (set through plot attributes)
plot!(gridstyle=:dash, xticks=0:5:15, legend=:topright, legendfontsize=8)

# Show the plot
display(current())

In [None]:
# Add the required packages
using Pkg
Pkg.add("GLM")
Pkg.add("DataFrames")
Pkg.add("DelimitedFiles")
Pkg.add("RData")
Pkg.add("Random")
Pkg.add("Lasso")
Pkg.add("DataStructures")
Pkg.add("NamedArrays")
Pkg.add("PrettyTables")
Pkg.add("Plots")


In [None]:
using GLM, DataFrames, Random, RData, Lasso, LinearAlgebra, Statistics, DataStructures, NamedArrays, PrettyTables, Plots

In [None]:
# Working directory
pwd()

In [None]:
#1. Reading the Data
Data = load("wage2015_subsample_inference.Rdata")
Data = Data["data"]

In [None]:
#2. Setting the Alpha Vector
# Note: in Lasso.jl, α is the elastic-net mixing parameter (α = 1 is pure lasso)
Alpha_Vector = [0.1, 0.2, 0.3, 0.4, 0.5]



In [None]:
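#3. Shuffle and Folds Generation
Random.seed!(123)  # Set a random seed for reproducibility
Data_Shuffled = shuffle!(Data)
k_folds = 10
Total_Observations = nrow(Data_Shuffled)
Fold_Size = div(Total_Observations, k_folds)  # any remainder rows are left out
Fold_List = []

# Cut the shuffled data into k_folds consecutive blocks of Fold_Size rows
for number in 1:k_folds
    Fold_Generation = (Data_Shuffled[1+Fold_Size*(number-1):Fold_Size*(number),:])
    push!(Fold_List, Fold_Generation)
end

# Hold out the last fold for validation; the other nine are used for training
Validation_fold = Fold_List[10]
Training_Folds = Fold_List[1:9,:]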


In [None]:
Validation_fold

In [None]:
#4. Lasso Function

# Fit a lasso model for wage on `dataframe` and return the MSE on `tested`
function lasso_implementation(dataframe, tested, alpha_value)
    excluded = ["lwage", "wage", "sex", "occ", "occ2", "ind", "ind2"]

    # Training data: outcome Y, gender indicator D, and controls W
    Y = DataFrame([dataframe[!, "wage"]], [:Y])
    D = DataFrame([dataframe[!, "sex"]], [:D])
    W = select(dataframe, Not(excluded))
    data = [Y D W]
    # Regress Y on the controls W only (D is not part of the formula)
    lasso_model = fit(LassoModel, term(:Y) ~ sum(term.(names(data[!, Not(["Y", "D"])]))), data; α = alpha_value)

    # Test data: same columns, used only for prediction
    D = DataFrame([tested[!, "sex"]], [:D])
    W = select(tested, Not(excluded))
    data_tested = [D W]
    Y_predict = predict(lasso_model, data_tested)

    # Out-of-sample mean squared error
    Y = tested[!, "wage"]
    MSE = mean((Y_predict .- Y).^2)
    return MSE, lasso_model
end





In [None]:
#5. Cross Validation
MSE_Averages = []
for alpha in Alpha_Vector
    MSE_List = []
    for i in 1:k_folds-1
        # Hold out training fold i and fit on the remaining training folds
        tested_fold = Training_Folds[i]
        Current_Folds = [Training_Folds[z] for z in 1:length(Training_Folds) if z != i]
        Combined_dataframe = vcat(Current_Folds...)
        MSE_current = lasso_implementation(Combined_dataframe, tested_fold, alpha)
        push!(MSE_List, MSE_current[1])
    end
    # Average out-of-fold MSE for this alpha
    MSE_Mean = mean(MSE_List)
    println("MSE for alpha ", alpha, ": ", MSE_Mean)
    Alpha_MSE = [MSE_Mean, alpha]
    push!(MSE_Averages, Alpha_MSE)
end



In [None]:
#6. Selecting Optimal Alpha

MSE_Averages_Final = []
for i in MSE_Averages
    push!(MSE_Averages_Final, i[1])
end
min_value = minimum(MSE_Averages_Final)
index = findfirst(x -> x == min_value, MSE_Averages_Final)

Optimal_Alpha = MSE_Averages[index][2]
println("Optimal Alpha: ",Optimal_Alpha)
println("Optimal MSE: ",MSE_Averages[index][1])

In [None]:
#7. Final Training Set
Combined_dataframe = vcat(Training_Folds...)
Last_Alpha = lasso_implementation(Combined_dataframe, Validation_fold, Optimal_Alpha)
MSE = Last_Alpha[1]
println("Final MSE: ", MSE)

# If the final MSE on the held-out validation fold is close to (or below) the cross-validated MSE,
# the selected model generalizes to unseen data and is useful for prediction

In [None]:
#8. Plotting
plot(Alpha_Vector, MSE_Averages_Final, xlabel="Alpha", ylabel="MSE", title="Cross-Validation", label="Data", marker=:circle)




# **1. Frisch-Waugh-Lovell (FWL) Theorem Proof**

Given a linear regression model, we aim to demonstrate the FWL theorem using the following elements:

- **$y$**: dependent variable vector ($n \times 1$)
- **$D$**: matrix of independent variables of interest ($n \times k_1$)
- **$\beta_1$**: coefficient vector for $D$ ($k_1 \times 1$)
- **$W$**: matrix of control variables ($n \times k_2$)
- **$\beta_2$**: coefficient vector for $W$ ($k_2 \times 1$)
- **$u$**: error term vector ($n \times 1$)

The model is represented as:

$$ y = D\beta_1 + W\beta_2 + u$$

---

## **Objective**

To prove that the OLS estimate $\hat{\Psi}$ from the residual regression $e_y = e_D \Psi + \varepsilon$ coincides with the OLS estimate $\hat{\beta}_1$ from the full model, as stated by the FWL theorem.

---

## **Proof**

### **Step 1: Control for Variables in $W$**

First, we calculate the residuals after controlling for $W$:

- **Regress $D$ on $W$:** Aim to determine the component of $D$ that is orthogonal to $W$. This is achieved by calculating the residuals $e_D$, using the projection matrix:
  
  $$M_W = I - W(W'W)^{-1}W'$$
  
  Thus, the residuals for $D$ are:
  
  $$e_D = M_W D$$ 

- **Regress $y$ on $W$:** Similarly, find the component of $y$ not explained by $W$:
  
  $$e_y = M_W y$$

### **Step 2: Estimate $\Psi$**

With the residuals obtained, we proceed to estimate $\Psi$:

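- **Regress $e_y$ on $e_D$ by OLS:** 

  $$ e_y = e_D \Psi + \varepsilon $$
  
  Solving for $\Psi$, we get:
  
  $$ \hat{\Psi} = (e_D'e_D)^{-1}e_D'e_y $$
  
  Substituting the expressions for the residuals into $\hat{\Psi}$ yields:
  
  $$ \hat{\Psi} = (D'M_W'M_WD)^{-1}D'M_W'M_Wy $$
  
  Because $M_W$ is symmetric ($M_W' = M_W$) and idempotent ($M_WM_W = M_W$), this simplifies to:
  
  $$ \hat{\Psi} = (D'M_WD)^{-1}D'M_Wy $$

### **Step 3: Compare with the Full Regression**

Write the OLS fit of the full model as $y = D\hat{\beta}_1 + W\hat{\beta}_2 + \hat{u}$, where the residual vector satisfies $D'\hat{u} = 0$ and $W'\hat{u} = 0$. Premultiplying by $M_W$ and using $M_WW = 0$ and $M_W\hat{u} = \hat{u}$ gives:

$$ M_Wy = M_WD\hat{\beta}_1 + \hat{u} $$

Premultiplying by $D'$ and using $D'\hat{u} = 0$:

$$ D'M_Wy = D'M_WD\hat{\beta}_1 \quad \Longrightarrow \quad \hat{\beta}_1 = (D'M_WD)^{-1}D'M_Wy = \hat{\Psi} $$

This proves that $\hat{\Psi} = \hat{\beta}_1$.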
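
The simplification of $\hat{\Psi}$ in Step 2 relies on $M_W$ being symmetric and idempotent, and the residuals from Step 1 rely on $M_WW = 0$. A minimal numerical check of these three properties, assuming an arbitrary synthetic $W$ with full column rank (the dimensions and seed are chosen only for illustration):

```julia
using LinearAlgebra, Random

Random.seed!(7)                          # arbitrary synthetic controls
n, k2 = 50, 4
W = hcat(ones(n), randn(n, k2 - 1))      # n × k2 matrix of controls
M_W = I - W * inv(W' * W) * W'           # residual-maker matrix from Step 1

# Each check compares two matrices up to floating-point error
is_symmetric  = isapprox(M_W, M_W'; atol = 1e-8)
is_idempotent = isapprox(M_W * M_W, M_W; atol = 1e-8)
annihilates_W = isapprox(M_W * W, zeros(n, k2); atol = 1e-8)

println("symmetric:     ", is_symmetric)
println("idempotent:    ", is_idempotent)
println("annihilates W: ", annihilates_W)
```

Forming $(W'W)^{-1}$ explicitly mirrors the formula in the proof; in applied work, solving the least-squares problem directly (for example with `W \ y`) is usually preferred for numerical stability.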







# **2. Conditional Expectation Function Minimizes Expected Squared Error Proof**

## **Problem Statement**

Given a random variable $Y$ and a conditioning variable $X$, we consider a relationship of the form:
$$ Y = m(X) + \epsilon $$
where:
- $m(X) = E[Y | X]$ is the Conditional Expectation Function (CEF) of $Y$ given $X$.
- $\epsilon$ is the error term, representing the deviation of $Y$ from its conditional mean.

## **Objective**

Our goal is to prove that the CEF minimizes the expected squared error among all functions $g(X)$:
$$ m(X) = \text{arg}\min_{g(X)} E[(Y - g(X))^2] $$

and that the minimum value attained is:
$$ E[(Y - m(X))^2] = E[\epsilon^2] $$

## **Proof**

### **Step 1: Expanding the Expected Squared Error**

We start by expanding the expected squared difference as follows:
$$ E[(Y - g(X))^2] = E[(Y - E[Y|X] + E[Y|X] - g(X))^2] $$

By applying the expansion for the square of a sum $(a + b)^2 = a^2 + b^2 + 2ab$, where $a = Y - E[Y|X]$ and $b = E[Y|X] - g(X)$, we obtain:
$$ E[(Y - g(X))^2] = E[(Y - E[Y|X])^2] + E[(E[Y|X] - g(X))^2] + 2E[(Y - E[Y|X])(E[Y|X] - g(X))] $$

### **Step 2: Simplifying Using the Law of Iterated Expectations**

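Applying the Law of Iterated Expectations to the cross term:
$$ 2E[(Y - E[Y|X])(E[Y|X] - g(X))] = 2E\big[(E[Y|X] - g(X))\,E[\epsilon | X]\big] = 0 $$

This follows because $E[Y|X] - g(X)$ is a function of $X$ only, so it can be taken outside the inner conditional expectation, and $E[\epsilon | X] = E[Y - E[Y|X] \mid X] = 0$ by the definition of the CEF. Note that the conditional statement $E[\epsilon | X] = 0$ is what is needed here; $E[\epsilon] = 0$ alone would not be enough.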

### **Step 3: Final Reduction**

After removing the cross-term, we are left with:
$$ E[(Y - g(X))^2] = E[(Y - E[Y|X])^2] + E[(E[Y|X] - g(X))^2] $$

Since the second term $E[(E[Y|X] - g(X))^2]$ is always non-negative, it follows that:
$$ E[(Y - g(X))^2] \geq E[(Y - E[Y|X])^2] $$

### **Conclusion**

The expected squared error is minimized when $g(X) = E[Y|X]$, demonstrating that the Conditional Expectation Function (CEF) $m(X)$ minimizes the expected squared error:
$$ m(X) = \text{arg}\min_{g(X)} E[(Y - g(X))^2] $$
This conclusively proves that the CEF is the function that minimizes the expected squared error between the predicted values and the actual values of $Y$.
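
The result can also be illustrated by simulation. The sketch below assumes a synthetic data-generating process, $Y = X^2 + \epsilon$ with $X$ and $\epsilon$ independent standard normal, so that the CEF is $m(X) = X^2$; the competing predictors and the helper `mse` are hypothetical choices made only for this example.

```julia
using Random, Statistics

Random.seed!(42)                  # illustrative synthetic data, not the wage sample
n = 100_000
X = randn(n)
m(x) = x^2                        # assumed true CEF: E[Y | X] = X^2
Y = m.(X) .+ randn(n)             # epsilon ~ N(0, 1), independent of X

# Sample analogue of the expected squared error E[(Y - g(X))^2]
mse(g) = mean((Y .- g.(X)) .^ 2)

mse_cef      = mse(m)                  # approximates E[epsilon^2] = 1
mse_shifted  = mse(x -> x^2 + 0.5)     # the CEF plus a constant
mse_constant = mse(x -> 1.0)           # the constant predictor E[Y] = 1

println("CEF:      ", mse_cef)
println("shifted:  ", mse_shifted)
println("constant: ", mse_constant)
```

In line with the decomposition in Step 3, each alternative $g$ adds the non-negative term $E[(E[Y|X] - g(X))^2]$ to the error of the CEF, so the CEF should show the smallest value of the three.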
