Members:
* Stephy Riega - 20171426
* Jesus Soto - 20172738
* Franco Caceres - 2016615

## 1. What is data splitting?

Data splitting consists of dividing the data *randomly* into at least two samples, training our models on one and testing their predictive capacity on the other. It is the first step in evaluating a model's predictive performance. The proportions vary by researcher, but an 80%/20% split is common: the first part (80%) is called "train" and the second (20%) "test". It is important that the split is done *randomly*. The second step is to estimate the candidate models on the training sample and then predict the dependent variable on the test sample ($Y_{test-estimated}$) using the estimates from the training sample ($\beta_{train}$). The third step is to compute the out-of-sample $MSE$ as the average squared difference between the observed ($Y_{test}$) and the predicted ($Y_{test-estimated}$) values, and the out-of-sample $R^2$ from it. These can then be compared with the in-sample $R^2$ and $MSE$ to see whether a model predicts as well out of sample as in sample, and which model performs better out of sample.
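
The three steps can be sketched on synthetic data as follows. This is only a minimal illustration, not the procedure used later in this notebook: the simulated data, the seeds and the names `split_and_evaluate`, `X` and `y` are assumptions made for the example, and OLS is computed with `np.linalg.lstsq`.

```python
import numpy as np

def split_and_evaluate(X, y, train_share=0.8, seed=0):
    """Randomly split (X, y), fit OLS on the training part, score on the test part."""
    rng = np.random.default_rng(seed)
    n = len(y)
    order = rng.permutation(n)                      # step 1: random split
    n_train = int(np.floor(train_share * n))
    train_idx, test_idx = order[:n_train], order[n_train:]

    # step 2: estimate beta on the training sample only, then predict on the test sample
    beta_train, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    y_test_hat = X[test_idx] @ beta_train

    # step 3: out-of-sample MSE and R^2
    mse_oos = np.mean((y[test_idx] - y_test_hat) ** 2)
    r2_oos = 1 - mse_oos / np.var(y[test_idx])
    return mse_oos, r2_oos, len(train_idx), len(test_idx)

# Hypothetical data: intercept plus two regressors, Gaussian noise
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.5, size=n)

mse_oos, r2_oos, n_train, n_test = split_and_evaluate(X, y)
print(n_train, n_test, mse_oos, r2_oos)
```

Drawing a permutation of the row positions guarantees that every observation lands in exactly one of the two samples.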

# 2. Replicate the PM1_Notebook1_Prediction_newdata (R and Python) JN but with restricted data:

As seen, Y  is the hourly wage of a worker and  X  is a vector of worker's characteristics, e.g., education, experience, gender. We focus on how to use job-relevant characteristics, such as education and experience, to best predict wages.

# Data

The data set we consider is from the March Supplement of the U.S. Current Population Survey, year 2015. We select white non-hispanic individuals, aged 25 to 64 years, and working more than 35 hours per week during at least 50 weeks of the year. We exclude self-employed workers; individuals living in group quarters; individuals in the military, agricultural or private household sectors; individuals with inconsistent reports on earnings and employment status; individuals with allocated or missing information in any of the variables used in the analysis; and individuals with hourly wage below \$3.

The variable of interest  Y  is the hourly wage rate constructed as the ratio of the annual earnings to the total number of hours worked, which is constructed in turn as the product of number of weeks worked and the usual number of hours worked per week. In our analysis, we also focus on single (never married) workers. The final sample is of size  **n=5150** .

# Data analysis


We start by loading the data set.

In [None]:
# Import relevant packages
import pandas as pd
import numpy as np
import pyreadr

In [None]:
#import data 
rdata_read = pyreadr.read_r("../data/wage2015_subsample_inference.Rdata")
data_raw = rdata_read[ 'data' ]
type(data_raw)
data_raw.shape 

It's a data frame with 5150 observations and 20 variables. However, **since we want only the subsample of people who did not go to college**, we restrict the data set using the variables *shs* and *hsg* (which identify people whose highest level of education is some high school or a completed high school degree).

In [None]:
data = data_raw.loc[(data_raw["shs"] == 1) | (data_raw["hsg"] == 1)].copy()  # copy, so that adding columns later does not trigger a SettingWithCopyWarning
data.shape #now we have 1376 observations

In [None]:
data.info() # info of the new subsample

In [None]:
data.describe()

As expected, there are no observations of people with education beyond high school (college or advanced degrees, since college is required for an MA or PhD).

We are constructing the output variable  **Y**  and the matrix  **Z**  which includes the characteristics of workers that are given in the data.

In [None]:
Y = np.log(data['wage'])  # natural log, consistent with the lwage column
n = len(Y)
z = data.loc[:, ~data.columns.isin(['wage', 'lwage','Unnamed: 0'])]
z # the covariates 

In [None]:
p = z.shape[1] #number of regressors

print("Number of observation:", n, '\n')
print( "Number of raw regressors:", p)

For the outcome variable *wage* and a subset of the raw regressors, we calculate the empirical mean to get familiar with the data.

In [None]:
Z_subset = data.loc[:, data.columns.isin(["lwage","sex","shs","hsg","scl","clg","ad","mw","so","we","ne","exp1"])]
table = Z_subset.mean(axis=0)
table

In [None]:
table = table.to_frame(name="Sample mean")  # one-column data frame with the sample means
index1 = list(table.index)  # current row labels (the variable names), saved in a list
index2 = ["Log Wage","Sex","Some High School","High School Graduate",\
          "Some College","College Graduate", "Advanced Degree","Midwest",\
          "South","West","Northeast","Experience"]
table = table.rename(index=dict(zip(index1,index2))) # names of the constructed data frame and the average values
table

The share of female workers in our sample is ~32% ($sex=1$ if female). **This share is lower than the one found in the complete data set.**

Alternatively, we can also print the table as latex.

In [None]:
print(table.to_latex())

## Prediction Question

We will construct a prediction rule for hourly wage $Y$, which depends linearly on job-relevant characteristics $X$:

\begin{equation}\label{decompose}
Y = \beta'X+ \epsilon.
\end{equation}

With this, we will:

* Predict wages  using various characteristics of workers.

* Assess the predictive performance using the (adjusted) sample MSE, the (adjusted) sample $R^2$ and the out-of-sample MSE and $R^2$.
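
As a reference for the in-sample measures in the list above, here is a small sketch of how the adjusted versions relate to the unadjusted ones. The function name `in_sample_measures` and the toy numbers are hypothetical; `p` is assumed to count all estimated coefficients, including the intercept, and the adjusted $R^2$ uses the usual degrees-of-freedom correction.

```python
import numpy as np

def in_sample_measures(y, y_hat, p):
    """Return MSE, adjusted MSE, R^2 and adjusted R^2 for p estimated coefficients."""
    n = len(y)
    mse = np.mean((y - y_hat) ** 2)
    mse_adj = n / (n - p) * mse                 # penalises the number of coefficients
    r2 = 1 - mse / np.var(y)
    r2_adj = 1 - (n - 1) / (n - p) * (1 - r2)   # degrees-of-freedom correction
    return mse, mse_adj, r2, r2_adj

# Toy example with 4 observations and 2 coefficients (intercept and one slope)
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])
mse, mse_adj, r2, r2_adj = in_sample_measures(y, y_hat, p=2)
print(mse, mse_adj, r2, r2_adj)
```

Both adjustments grow with `p`, so a model with many regressors needs a clearly better raw fit to come out ahead.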


We employ two different specifications for prediction:


1. Basic Model:   $X$ consists of a set of raw regressors (e.g. gender, experience, education indicators,  occupation and industry indicators, regional indicators). That is,  sex + exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2.


2. Flexible Model:  $X$ consists of all raw regressors from the basic model (except for sex) plus transformations of experience (e.g., ${exp}^2$, ${exp}^3$ and ${exp}^4$) and all two-way interactions. That is, $(exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)^2$


Using the **Flexible Model** enables us to approximate the real relationship by a more complex regression model and therefore to reduce the bias. The **Flexible Model** increases the range of potential shapes of the estimated regression function. In general, flexible models often deliver good prediction accuracy but give models which are harder to interpret. *Also, flexible models face a trade-off between bias and variance, which makes them more prone to overfitting.*
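
The overfitting risk can be illustrated with a small simulation. This is a sketch under assumptions of our own (a synthetic sine relationship, polynomial regressors and the hypothetical helper names `poly_design` and `mse_in_and_out`), not part of the wage analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
x = rng.uniform(-1, 1, size=n)
y = np.sin(2 * x) + rng.normal(scale=0.3, size=n)

def poly_design(x, degree):
    # columns 1, x, x^2, ..., x^degree
    return np.vander(x, degree + 1, increasing=True)

# x is drawn i.i.d., so cutting at a fixed position is already a random split
train, test = np.arange(0, 40), np.arange(40, n)

def mse_in_and_out(degree):
    X = poly_design(x, degree)
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    resid_in = y[train] - X[train] @ beta
    resid_out = y[test] - X[test] @ beta
    return np.mean(resid_in ** 2), np.mean(resid_out ** 2)

for degree in (1, 4, 12):
    mse_in, mse_out = mse_in_and_out(degree)
    print(degree, mse_in, mse_out)
```

Because the models are nested, the in-sample MSE can only fall as the degree grows. The out-of-sample MSE has no such guarantee, which is why we look at it separately.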

Now, let us fit both models to our data by running ordinary **least squares (ols):**

In [None]:
# Import packages for OLS regression
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [None]:
# 1. basic model with 51 regressors
basic = 'lwage ~ sex + exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2'
basic_results = smf.ols(basic , data=data).fit()
print(basic_results.summary()) # estimated coefficients
print( "Number of regressors in the basic model:",len(basic_results.params), '\n')  # number of regressors in the Basic Model

##### Note that the basic model consists of $51$ regressors.

In [None]:
# 2. flexible model
flex = 'lwage ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'
flex_results_0 = smf.ols(flex , data=data)
flex_results = smf.ols(flex , data=data).fit()
print(flex_results.summary()) # estimated coefficients
print( "Number of regressors in the flexible model:",len(flex_results.params), '\n') # number of regressors in the Flexible Model

##### Note that the flexible model consists of $246$ regressors (because the formula expands to the raw variables plus all of their pairwise interactions).
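
To see how a formula of the form `(a + b + c)**2` grows, here is a rough numpy sketch that stacks the main effects with all pairwise products. The helper `main_effects_and_interactions` and the toy matrix `Z` are hypothetical, and the sketch only handles numeric columns: categorical terms such as `occ2` and `ind2` are expanded into dummies first in the formula interface, which is not reproduced here.

```python
from itertools import combinations
import numpy as np

def main_effects_and_interactions(Z):
    """Stack the columns of Z with all pairwise products Z[:, i] * Z[:, j], i < j."""
    k = Z.shape[1]
    pairs = [Z[:, i] * Z[:, j] for i, j in combinations(range(k), 2)]
    return np.column_stack([Z] + pairs)

Z = np.arange(12, dtype=float).reshape(4, 3)   # 4 observations, 3 numeric regressors
W = main_effects_and_interactions(Z)
print(W.shape)   # 3 main effects + 3 interactions -> (4, 6)
```

With $k$ numeric columns this gives $k + k(k-1)/2$ columns, so the count rises quickly with $k$.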

## Try Lasso next

In [None]:
# Import relevant packages for lasso 
from sklearn.linear_model import LassoCV
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

In [None]:
# Get exogenous variables from flexible model
X = flex_results_0.exog # design matrix built from the formula, because scikit-learn does not parse formulas
X.shape # the subsample data of the people who didnt go to college

In [None]:
# Set endogenous variable
lwage = data["lwage"]
lwage.shape

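In scikit-learn, $\alpha$ plays the role of $\lambda$: it is the weight of the penalty on the absolute size of the coefficients, so larger values set more coefficients to zero and leave fewer regressors in the model.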
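
To make the role of the penalty concrete, here is a small coordinate-descent sketch that minimises $\frac{1}{2n}\|y - Xw\|^2 + \alpha\|w\|_1$, which is the objective scikit-learn documents for `Lasso`. It is an illustration on simulated, centred data and not a replacement for the library: the names `soft_threshold` and `lasso_coordinate_descent` are hypothetical and no intercept is fitted.

```python
import numpy as np

def soft_threshold(value, threshold):
    # shrink towards zero, and set to exactly zero inside [-threshold, threshold]
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1 / (2 n)) * ||y - X w||^2 + alpha * ||w||_1 (no intercept)."""
    n, p = X.shape
    w = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            partial_resid = y - X @ w + X[:, j] * w[j]   # residual ignoring regressor j
            rho = X[:, j] @ partial_resid / n
            w[j] = soft_threshold(rho, alpha) / col_scale[j]
    return w

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
y = X @ np.array([1.0, 0.0, 0.0, -0.5, 0.0]) + rng.normal(scale=0.1, size=n)
y -= y.mean()

alpha_max = np.max(np.abs(X.T @ y)) / n   # penalties above this zero every coefficient
for alpha in (0.0, 0.1, 1.01 * alpha_max):
    w = lasso_coordinate_descent(X, y, alpha)
    print(alpha, np.round(w, 3), "non-zero:", int(np.sum(w != 0)))
```

With $\alpha = 0$ the loop converges to the OLS solution, and as $\alpha$ grows more coefficients are set exactly to zero.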

#### Extra1:

If we use $\alpha$=0.1, we get a warning that the algorithm did not converge. Increasing the penalty value also reduces the number of regressors, so we next tried $\alpha$ = 0.2, 0.3, 0.4 (we leave the loop commented out so that it does not interrupt the run of the script):

In [None]:
#sequence = np.array( [0, 0.1, 0.2, 0.3 , 0.4, 0.5, 0.6 ] )

#for element in sequence:
#    reg = linear_model.Lasso(alpha = element)
#
#    # LASSO regression for flexible model
#    reg.fit(X, lwage)
#    lwage_lasso_fitted = reg.predict( X )
#
#    # coefficients and in-sample fit for this alpha
#    print(element, reg.coef_)
#    print('Lasso Regression: R^2 score', reg.score(X, lwage))

After that, we find that the model converges for $\alpha$ from about 0.4 onwards; we use 0.5:

In [None]:
alpha=0.5
#however, we find that 0.39579 is the smallest value of alpha that makes the model converge

In [None]:
reg = linear_model.Lasso(alpha = alpha)

# LASSO regression for flexible model
reg.fit(X, lwage)
lwage_lasso_fitted = reg.predict( X )  # in-sample fitted values

# coefficients 
reg.coef_
print('Lasso Regression: R^2 score', reg.score(X, lwage))

In [None]:
# Check predicted values
lwage_lasso_fitted

Now, we can evaluate the performance of both models based on the (adjusted) $R^2_{sample}$ and the (adjusted) $MSE_{sample}$:

In [None]:
# Basic Model
basic = 'lwage ~ sex + exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2'
basic_results = smf.ols(basic , data=data).fit()

# Flexible model 
flex = 'lwage ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'
flex_results = smf.ols(flex , data=data).fit()

In [None]:
# Assess the predictive performance
R2_1 = basic_results.rsquared
print("R-squared for the basic model: ", R2_1, "\n")
R2_adj1 = basic_results.rsquared_adj
print("adjusted R-squared for the basic model: ", R2_adj1, "\n")


R2_2 = flex_results.rsquared
print("R-squared for the flexible model: ", R2_2, "\n")
R2_adj2 = flex_results.rsquared_adj
print("adjusted R-squared for the flexible model: ", R2_adj2, "\n")

R2_L = reg.score(flex_results_0.exog, lwage)
print("R-squared for LASSO: ", R2_L, "\n")
R2_adjL = 1 - (1-R2_L)*(len(lwage)-1)/(len(lwage)-X.shape[1]-1)
print("adjusted R-squared for LASSO: ", R2_adjL, "\n")

In [None]:
# calculating the MSE
MSE1 =  np.mean(basic_results.resid**2)
print("MSE for the basic model: ", MSE1, "\n")
p1 = len(basic_results.params) # number of regressors
n = len(lwage)
MSE_adj1  = (n/(n-p1))*MSE1
print("adjusted MSE for the basic model: ", MSE_adj1, "\n")

MSE2 =  np.mean(flex_results.resid**2)
print("MSE for the flexible model: ", MSE2, "\n")
p2 = len(flex_results.params) # number of regressors
n = len(lwage)
MSE_adj2  = (n/(n-p2))*MSE2
print("adjusted MSE for the flexible model: ", MSE_adj2, "\n")


MSEL = mean_squared_error(lwage, lwage_lasso_fitted)
print("MSE for the LASSO model: ", MSEL, "\n")
pL = reg.coef_.shape[0] # number of regressors
n = len(lwage)
MSE_adjL  = (n/(n-pL))*MSEL
print("adjusted MSE for LASSO model: ", MSE_adjL, "\n")

In [None]:
# Package for latex table 
#import array_to_latex as a2l

table = np.zeros((3, 5))
table[0,0:5] = [p1, R2_1, MSE1, R2_adj1, MSE_adj1]
table[1,0:5] = [p2, R2_2, MSE2, R2_adj2, MSE_adj2]
table[2,0:5] = [pL, R2_L, MSEL, R2_adjL, MSE_adjL]
table

In [None]:
table = pd.DataFrame(table, columns = ["p","$R^2_{sample}$","$MSE_{sample}$","$R^2_{adjusted}$", "$MSE_{adjusted}$"], \
                      index = ["basic reg","flexible reg", "lasso flex"])
table

**Considering all measures above, the flexible model performs better than the basic model and the lasso flex model**.

Also, it is worth noting that the results in the R script won't match the ones found here for the lasso flexible model because the lambda/alpha is different.

In-sample measures tend to flatter models with many regressors, since they are computed on the same data used for estimation. One procedure to circumvent this issue is to use **data splitting**, which is described and applied in the following.

## Data Splitting

Measure the prediction quality of the two models via data splitting:

- Randomly split the data into one training sample and one testing sample. Here we just use a simple method.
- Use the training sample for estimating the parameters of the Basic Model and the Flexible Model.
- Use the testing sample for evaluation. Predict the $\mathtt{lwage}$  of every observation in the testing sample based on the estimated parameters in the training sample.
- Calculate the Mean Squared Prediction Error $MSE_{test}$ based on the testing sample for both prediction models. 

In [None]:
# Import relevant packages for splitting data
import math

In [None]:
# Set Seed
# to make the results replicable (generating random numbers)
np.random.seed(0)
random_draws = np.random.randint(0, n, size=n)  # one random integer per observation
data["random"] = random_draws
random_draws    # with the seed fixed, the array does not change across runs

In [None]:
data_2 = data.sort_values(by=['random'])
data_2.head() #sort the random numbers in order

In [None]:
# Create training and testing sample 
train = data_2[ : math.floor(n*4/5)]    # training sample (80% of the data)
test =  data_2[ math.floor(n*4/5) : ]   # testing sample
print(train.shape)
print(test.shape)

In [None]:
# Basic Model
basic = 'lwage ~ sex + exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2'

# Flexible model 
flex = 'lwage ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'
# Both formulas are estimated on the training sample in the cells that follow

In [None]:
# basic model
# estimating the parameters in the training sample
basic_results = smf.ols(basic , data=train).fit()
print(basic_results.summary())

In [None]:
lwage_test = test["lwage"].values
#test = test.drop(columns=['wage', 'lwage', 'random'])
lwage_test

In [None]:
# calculating the out-of-sample MSE
test = sm.add_constant(test)   #add constant 

lwage_pred =  basic_results.predict(test) # predict out of sample
print(lwage_pred)

In [None]:
MSE_test1 = np.sum((lwage_test-lwage_pred)**2)/len(lwage_test)
R2_test1  = 1 - MSE_test1/np.var(lwage_test)

print("Test MSE for the basic model: ", MSE_test1, " ")
print("Test R2 for the basic model: ", R2_test1)

In the basic model, the $MSE_{test}$ is *fairly close* to the unadjusted $MSE_{sample}$. **However, it is slightly further from the $MSE_{adjusted}$**.

Note that this wont be equal to the estimation in 

In [None]:
# Flexible model
# estimating the parameters in the training sample
flex_results = smf.ols(flex , data=train).fit()

# calculating the out-of-sample MSE
lwage_flex_pred =  flex_results.predict(test) # predict out of sample
lwage_test = test["lwage"].values

MSE_test2 = np.sum((lwage_test-lwage_flex_pred)**2)/len(lwage_test)
R2_test2  = 1 - MSE_test2/np.var(lwage_test)

print("Test MSE for the flexible model: ", MSE_test2, " ")
print("Test R2 for the flexible model: ", R2_test2)

In the flexible model **using ols**, the discrepancy between the $MSE_{test}$ and the $MSE_{sample}$ is **large**: the out-of-sample $R^2$ is negative, which means the model predicts the test sample worse than simply using the test-sample mean would.
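
A tiny numeric example may help to read a negative out-of-sample $R^2$. It uses the same formula as the cells above, $1 - MSE_{test}/Var(Y_{test})$, on made-up numbers; `y_test`, `pred_mean`, `pred_bad` and `r2_out_of_sample` are hypothetical names.

```python
import numpy as np

def r2_out_of_sample(y, y_hat):
    mse = np.mean((y - y_hat) ** 2)
    return 1 - mse / np.var(y)

y_test = np.array([2.0, 3.0, 4.0, 5.0])
pred_mean = np.full_like(y_test, y_test.mean())   # always predict the test mean
pred_bad = np.array([5.0, 2.0, 6.0, 1.0])         # made-up poor predictions

print(r2_out_of_sample(y_test, pred_mean))  # 0.0
print(r2_out_of_sample(y_test, pred_bad))   # -5.0
```

Predicting the mean gives exactly zero, so any value below zero signals predictions that are worse than that benchmark.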

It is worth noting that the $MSE_{test}$ varies across different data splits. Hence, it is a good idea to average the out-of-sample MSE over different data splits to get more reliable results.
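
Averaging over several splits could look like the sketch below. It is not applied to the wage data here: the simulated data, the number of splits and the helper name `average_test_mse` are assumptions made for the illustration.

```python
import numpy as np

def average_test_mse(X, y, n_splits=20, train_share=0.8, seed=0):
    """Mean and standard deviation of the OLS test MSE over repeated random splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(np.floor(train_share * n))
    mses = []
    for _ in range(n_splits):
        order = rng.permutation(n)                 # a fresh random split each time
        tr, te = order[:n_train], order[n_train:]
        beta, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
        mses.append(np.mean((y[te] - X[te] @ beta) ** 2))
    return np.mean(mses), np.std(mses)

# Hypothetical data with noise variance 0.25
rng = np.random.default_rng(1)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(scale=0.5, size=n)

mean_mse, sd_mse = average_test_mse(X, y)
print(mean_mse, sd_mse)
```

The standard deviation across splits gives a sense of how much a single split can mislead.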

Nevertheless, we observe that, based on the out-of-sample $MSE$, the basic model using ols regression performs about as well as (or slightly better than) the flexible model. *Let's remember that the basic model has fewer regressors*.

*Also, let's remember that these results won't necessarily be the same as the ones found in the R script because the random data split won't be the same.*

Next, let us use lasso regression in the flexible model instead of ols regression. Lasso (*least absolute shrinkage and selection operator*) is a penalized regression method that can be used to reduce the complexity of a regression model when the number of regressors $p$ is relatively large in relation to $n$. 

Note that the out-of-sample $MSE$ on the test sample can be computed for any other black-box prediction method as well. Thus, let us finally compare the performance of lasso regression in the flexible model to ols regression.

In [None]:
# flexible model using lasso
# get exogenous variables from training data used in flex model
flex_results_0 = smf.ols(flex , data=train)
X_train = flex_results_0.exog
print(X_train.shape)

# Get endogenous variable 
lwage_train = train["lwage"]
print(lwage_train.shape)

In [None]:
# flexible model using lasso

# get exogenous variables from testing data used in flex model
flex_results_1 = smf.ols(flex , data=test)
X_test = flex_results_1.exog
print(X_test.shape)

# Get endogenous variable 
lwage_test = test["lwage"]
print(lwage_test.shape) # the test set holds 20% of the observations

In [None]:
reg = linear_model.Lasso(alpha=0.5)
lwage_lasso_fitted = reg.fit(X_train, lwage_train).predict( X_test )

MSE_lasso = np.sum((lwage_test-lwage_lasso_fitted)**2)/len(lwage_test)
R2_lasso  = 1 - MSE_lasso/np.var(lwage_test)

print("Test MSE for the lasso flexible model: ", MSE_lasso, " ")
print("Test R2 for the lasso flexible model: ", R2_lasso)

Finally, let us summarize the results:

In [None]:
# Package for latex table 
#import array_to_latex as a2l

table2 = np.zeros((3, 2))
table2[0,0] = MSE_test1
table2[1,0] = MSE_test2
table2[2,0] = MSE_lasso
table2[0,1] = R2_test1
table2[1,1] = R2_test2
table2[2,1] = R2_lasso

table2 = pd.DataFrame(table2, columns = ["$MSE_{test}$", "$R^2_{test}$"], \
                      index = ["basic reg","flexible reg","lasso regression"])
table2

We can see now that lasso regression on the flexible model predicts slightly better than OLS on the same model did. However, overall, **the basic model is better at predicting the test set.**

In [None]:
# print the table as LaTeX
print(table2.to_latex())

## 2. In addition, do two cases of partialling-out using lasso. Remember that we want to find the beta associated with sex.

### Example 1: Matrix W = 'exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2'

In [None]:
# import packages
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [None]:
# model for Y
flex_y = 'lwage ~  exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2'

# model for D
flex_d = 'sex ~ exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2'

### Using Sklearn

In [None]:
from sklearn import linear_model
# flex_y

#define the lasso model
lasso_model = linear_model.Lasso( alpha = 0.5 )

#extract the covariates to use
flex_y_covariables = smf.ols(formula = flex_y, data = data)

#fit the model and predict the value of the outcome
Y_lasso_fitted = lasso_model.fit( flex_y_covariables.exog, data[[ 'lwage' ]] ).predict( flex_y_covariables.exog )

#save the residuals
t_Y = data[[ 'lwage' ]] - Y_lasso_fitted.reshape( Y_lasso_fitted.size, 1)

In [None]:
# extraflex_d

#extract the covariates to use
flex_d_covariables = smf.ols( flex_d, data=data)

#fit the model and predict the value of the outcome
D_lasso_fitted = lasso_model.fit( flex_d_covariables.exog, data[[ 'sex' ]] ).predict( flex_d_covariables.exog )

#save the residuals
t_D = data[[ 'sex' ]] - D_lasso_fitted.reshape( D_lasso_fitted.size, 1)

In [None]:
#create a data frame with the residuals of both the regressions
data_res = pd.DataFrame( np.hstack(( t_Y , t_D )) , columns = [ 't_Y', 't_D' ] )

# regression of the residuals Y on D after partialling-out the effect of W
partial_lasso_fit = smf.ols( formula = 't_Y ~ t_D' , data = data_res ).fit()
partial_lasso_est = partial_lasso_fit.summary2().tables[1]['Coef.']['t_D']

print( f"Coefficient for D via partialling-out using lasso {partial_lasso_est}" )

# standard error
HCV_coefs = partial_lasso_fit.cov_HC0
partial_lasso_se = np.power( HCV_coefs.diagonal() , 0.5)[1]

Using lasso to partial out the controls of the basic specification gives a gender gap of $12.14\%$, which differs from the $7\%$ gap found in the basic regression using OLS. This is probably because lasso suppresses some of the regressors but not all of them, and because partialling-out controls for the covariates in both regressions.
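
The logic behind partialling-out can be checked with plain OLS, where the result is exact (the Frisch-Waugh-Lovell theorem). The sketch below uses simulated data and the hypothetical helper `ols_residuals`. With lasso in the first stage, as in this notebook, the equality holds only approximately.

```python
import numpy as np

def ols_residuals(W, v):
    """Residuals of an OLS regression of v on the columns of W."""
    coef, *_ = np.linalg.lstsq(W, v, rcond=None)
    return v - W @ coef

rng = np.random.default_rng(0)
n = 400
W = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # controls, incl. intercept
D = (W[:, 1] + rng.normal(size=n) > 0).astype(float)          # binary regressor correlated with W
Y = -0.1 * D + W @ np.array([2.0, 0.3, -0.2, 0.1]) + rng.normal(scale=0.5, size=n)

# Route 1: full regression of Y on D and W
full_coef, *_ = np.linalg.lstsq(np.column_stack([D, W]), Y, rcond=None)
beta_full = full_coef[0]

# Route 2: partial W out of both Y and D, then regress residual on residual
res_Y = ols_residuals(W, Y)
res_D = ols_residuals(W, D)
beta_partial = (res_D @ res_Y) / (res_D @ res_D)

print(beta_full, beta_partial)
```

Both routes return the same coefficient on `D`, which is why the residual-on-residual regression can be read as the effect of `D` after controlling for `W`.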

Next we try the flexible model, where we take all pairwise interactions of the controls.

### Example 2: Matrix W = '(exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'

Let's remember that sklearn does not parse formula terms such as $(exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)^2$, so we reuse the previously saved matrix of exogenous variables called "X", which already contains the expanded terms.

In [None]:
# model for Y
flex_y = 'lwage ~  X'

# model for D
flex_d = 'sex ~ X'

### Using Sklearn

In [None]:
from sklearn import linear_model
# flex_y

#define the lasso model
lasso_model = linear_model.Lasso( alpha = 0.5 )

#extract the covariates to use
flex_y_covariables = smf.ols(formula = flex_y, data = data)

#fit the model and predict the value of the outcome
Y_lasso_fitted = lasso_model.fit( flex_y_covariables.exog, data[[ 'lwage' ]] ).predict( flex_y_covariables.exog )

#save the residuals
t_Y = data[[ 'lwage' ]] - Y_lasso_fitted.reshape( Y_lasso_fitted.size, 1)

In [None]:
# extraflex_d

#extract the covariates to use
flex_d_covariables = smf.ols( flex_d, data=data)

#fit the model and predict the value of the outcome
D_lasso_fitted = lasso_model.fit( flex_d_covariables.exog, data[[ 'sex' ]] ).predict( flex_d_covariables.exog )

#save the residuals
t_D = data[[ 'sex' ]] - D_lasso_fitted.reshape( D_lasso_fitted.size, 1)

In [None]:
#create a data frame with the residuals of both the regressions
data_res = pd.DataFrame( np.hstack(( t_Y , t_D )) , columns = [ 't_Y', 't_D' ] )

# regression of the residuals Y on D after partialling-out the effect of W
partial_lasso_fit = smf.ols( formula = 't_Y ~ t_D' , data = data_res ).fit()
partial_lasso_est = partial_lasso_fit.summary2().tables[1]['Coef.']['t_D']

print( f"Coefficient for D via partialling-out using lasso {partial_lasso_est}" )

# standard error
HCV_coefs = partial_lasso_fit.cov_HC0
partial_lasso_se = np.power( HCV_coefs.diagonal() , 0.5)[1]

Using lasso to partial out the controls of the flexible specification gives a gender gap of $11.40\%$, which differs from the $7\%$ gap found in the basic regression using OLS but is **pretty similar to the $12\%$ found in the basic regression model with partialling-out.** Beyond the possible reasons already mentioned, the flexible model has more regressors, but their number is also controlled by the penalty value $\alpha$.

### Extra2: flexible model with sex

In [None]:
sex_flex = 'lwage ~ sex + (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'

control_fit = smf.ols( formula = sex_flex, data=data).fit()

print(control_fit.summary()) # estimated coefficients

**The $R^2$ of this new flexible model, estimated on the whole sample, is $0.51$, which is higher than that of the flexible model without sex ($0.50$). This is due to the addition of one more regressor (sex).**