## GROUP 8

#### *Members:*

1. Gianfranco Soria (20163509)
2. Erick Morales (20163041)
3. Andrea Clavo (20176040)
4. Sandra Martínez (20173026)

## Question 1
Briefly explain the idea of sample splitting to evaluate the performance
of prediction rules to a fellow student and show how to use it on the wage data.

Sample splitting divides the data into two representative subsamples: a
training subsample (TR) and a test subsample (TS). The model is fitted
(and, if needed, selected) on TR and then evaluated on TS, which the
fitting step never sees. A common choice is to assign 80% of the
observations to TR and the remaining 20% to TS, although the proportion
varies with the application. Evaluating on held-out data guards against
two problems: underfitting, where the model is too simple or the training
subsample too small to capture the relationship, so the model has little
predictive value; and overfitting, where the model has learned the
randomness of one particular data set and therefore does not generalize
to another data set.

An example of the use of splitting: suppose that two models can be used
to predict wages from worker characteristics, a basic model and a
flexible model. How do we know which model predicts better? We estimate
the parameters of both models on the training subsample. We then predict
the wage of each observation in the test subsample using the parameters
estimated on the training subsample. Finally, we compute the test mean
squared error, $MSE_{test}$, on TS for both models and prefer the model
with the lower value.

Note: in Python, the split can be made by drawing one uniform random number
per observation, so that every observation has the same probability of
landing in the training subsample.

## Question 2
Replicate the PM1_Notebook1_Prediction_newdata Jupyter notebook (R and Python), following these instructions:
- Focus on people who did not go to college (use the variables shs and hsg)
- Basic model: 'lwage ~ sex + exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2'
- Flexible model: 'lwage ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'

## Introduction

In labor economics an important question is what determines the wage of workers. This is a causal question,
but we could begin to investigate from a predictive perspective.

In the following wage example, $Y$ is the hourly wage of a worker and $X$ is a vector of worker's characteristics, e.g., education, experience, gender. Two main questions here are:


* How to use job-relevant characteristics, such as education and experience, to best predict wages?

* What is the difference in predicted wages between men and women with the same job-relevant characteristics?

In this lab, we focus on the prediction question first.

## Data

The data set we consider is from the March Supplement of the U.S. Current Population Survey, year 2015.  We select white non-hispanic individuals, aged 25 to 64 years, and working more than 35 hours per week during at least 50 weeks of the year. We exclude self-employed workers; individuals living in group quarters; individuals in the military, agricultural or private household sectors;  individuals with inconsistent reports on earnings and employment status; individuals with allocated or missing information in any of the variables used in the analysis; and individuals with hourly wage below $3$. 

The variable of interest $Y$ is the hourly wage rate constructed as the ratio of the annual earnings to the total number of hours worked, which is constructed in turn as the product of number of weeks worked and the usual number of hours worked per week. In our analysis, we also focus on single (never married) workers. The final sample is of size $n=5150$.
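
Written out, the construction of the wage variable described above is the following (the worked numbers are hypothetical and only illustrate the arithmetic):

\begin{equation}
\text{wage} = \frac{\text{annual earnings}}{\text{weeks worked} \times \text{usual hours per week}},
\qquad
\text{lwage} = \log(\text{wage}).
\end{equation}

For instance, annual earnings of 23,000 over 52 weeks at 40 hours per week would give $23000 / (52 \times 40) \approx 11.06$ per hour and a log wage of about $2.40$.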

## Data analysis

We start by loading the data set.

In [2]:
import pandas as pd
import numpy as np
import pyreadr

In [48]:
rdata_read = pyreadr.read_r("../data/wage2015_subsample_inference.Rdata")
data = rdata_read[ 'data' ]
data.shape

(5150, 20)

In [40]:
data.columns

Index(['wage', 'lwage', 'sex', 'shs', 'hsg', 'scl', 'clg', 'ad', 'mw', 'so',
       'we', 'ne', 'exp1', 'exp2', 'exp3', 'exp4', 'occ', 'occ2', 'ind',
       'ind2'],
      dtype='object')

In [41]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5150 entries, 10 to 32643
Data columns (total 20 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   wage    5150 non-null   float64 
 1   lwage   5150 non-null   float64 
 2   sex     5150 non-null   float64 
 3   shs     5150 non-null   float64 
 4   hsg     5150 non-null   float64 
 5   scl     5150 non-null   float64 
 6   clg     5150 non-null   float64 
 7   ad      5150 non-null   float64 
 8   mw      5150 non-null   float64 
 9   so      5150 non-null   float64 
 10  we      5150 non-null   float64 
 11  ne      5150 non-null   float64 
 12  exp1    5150 non-null   float64 
 13  exp2    5150 non-null   float64 
 14  exp3    5150 non-null   float64 
 15  exp4    5150 non-null   float64 
 16  occ     5150 non-null   category
 17  occ2    5150 non-null   category
 18  ind     5150 non-null   category
 19  ind2    5150 non-null   category
dtypes: category(4), float64(16)
memory usage: 736.3+ KB


In [42]:
data.describe()

,wage,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4
count,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0
mean,23.41041,2.970787,0.444466,0.023301,0.243883,0.278058,0.31767,0.137087,0.259612,0.296505,0.216117,0.227767,13.760583,3.018925,8.235867,25.118038
std,21.003016,0.570385,0.496955,0.150872,0.429465,0.448086,0.465616,0.343973,0.438464,0.456761,0.411635,0.419432,10.609465,4.000904,14.488962,53.530225
min,3.021978,1.105912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,13.461538,2.599837,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.25,0.125,0.0625
50%,19.230769,2.956512,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,1.0,1.0,1.0
75%,27.777778,3.324236,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,21.0,4.41,9.261,19.4481
max,528.845673,6.270697,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,47.0,22.09,103.823,487.9681


We focus on people who did not go to college (using the variables `shs` and `hsg`).

In [51]:
data = data[(data['shs'] == 1) | (data['hsg'] == 1)]
print(data.shape) 
data

(1376, 20)


rownames,wage,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind,ind2
15,11.057692,2.403126,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,18.0,3.24,5.832,10.4976,6260,19,770,4
43,19.230769,2.956512,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,42.0,17.64,74.088,311.1696,5120,17,7280,14
44,19.230769,2.956512,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,37.0,13.69,50.653,187.4161,5240,17,5680,9
47,12.000000,2.484907,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,9.61,29.791,92.3521,4040,13,8590,19
73,17.307692,2.851151,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,7.0,0.49,0.343,0.2401,4020,13,8270,18
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32580,12.980769,2.563469,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,15.0,2.25,3.375,5.0625,2010,6,9370,22
32590,13.461538,2.599837,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,8.0,0.64,0.512,0.4096,4720,16,8590,19
32599,22.596154,3.117780,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,15.0,2.25,3.375,5.0625,9620,22,5390,9
32603,16.826923,2.822980,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,11.0,1.21,1.331,1.4641,7150,20,8770,21


In [50]:
(data['shs'].value_counts()), (data['hsg'].value_counts())

(0.0    1256
 1.0     120
 Name: shs, dtype: int64,
 1.0    1256
 0.0     120
 Name: hsg, dtype: int64)

We now construct the outcome variable **Y** and the matrix **Z** (stored as `z` in the code), which contains the worker characteristics available in the data.

In [52]:
Y = np.log(data['wage'])  # natural log, consistent with the lwage column
n = len(Y)
z = data.loc[:, ~ data.columns.isin(['wage', 'lwage','Unnamed: 0'])]
p = z.shape[1]

print("Number of observations:", n, '\n')
print( "Number of raw regressors:", p)

Number of observations: 1376 

Number of raw regressors: 18


## Prediction Question

Now, we will construct a prediction rule for the (log) hourly wage $Y$, which depends linearly on job-relevant characteristics $X$:

\begin{equation}\label{decompose}
Y = \beta'X+ \epsilon.
\end{equation}

Our goals are

* Predict wages  using various characteristics of workers.

* Assess the predictive performance using the (adjusted) sample MSE, the (adjusted) sample $R^2$ and the out-of-sample MSE and $R^2$.
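
The performance measures listed above can be computed as in the sketch below. The function names are hypothetical, and the degrees-of-freedom conventions are an assumption: adjusted MSE divides the residual sum of squares by $n - p$, and adjusted $R^2$ uses the $(n-1)/(n-p)$ correction, where $p$ counts the fitted parameters including the intercept.

```python
import numpy as np

def in_sample_metrics(y, y_hat, p):
    """MSE, adjusted MSE, R^2 and adjusted R^2 for a fit with p parameters."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = y.size
    ssr = np.sum((y - y_hat) ** 2)      # residual sum of squares
    sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
    return {
        "mse": ssr / n,
        "mse_adj": ssr / (n - p),
        "r2": 1.0 - ssr / sst,
        "r2_adj": 1.0 - (ssr / (n - p)) / (sst / (n - 1)),
    }

def out_of_sample_metrics(y_test, y_pred):
    """Test-sample MSE and R^2; no degrees-of-freedom correction is needed."""
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_test - y_pred) ** 2)
    return {"mse": mse, "r2": 1.0 - mse / np.var(y_test)}

def adjusted_r2(r2, n, p):
    """Adjusted R^2 from a reported R^2, sample size n and parameter count p."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p)

# Tiny worked example with made-up numbers
m_in = in_sample_metrics([1, 2, 3, 4], [1.5, 1.5, 3.5, 3.5], p=2)
m_out = out_of_sample_metrics([1, 2, 3, 4], [1.5, 1.5, 3.5, 3.5])
print(m_in, m_out)
```

Under these conventions, `adjusted_r2(0.180, 1376, 48)` gives about 0.151, which agrees with the `Adj. R-squared` that the OLS summary reports for the basic model.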


We employ two different specifications for prediction:


1. Basic Model:   $X$ consists of a set of raw regressors (e.g. gender, experience, education indicators,  occupation and industry indicators, regional indicators).


2. Flexible Model:  $X$ consists of all raw regressors from the basic model plus occupation and industry indicators, transformations (e.g., ${exp}^2$ and ${exp}^3$) and additional two-way interactions of polynomial in experience with other regressors. An example of a regressor created through a two-way interaction is *experience* times the indicator of having a *college degree*.

Using the **Flexible Model** enables us to approximate the real relationship with a more complex regression model and therefore to reduce the bias. The **Flexible Model** widens the range of shapes that the estimated regression function can take. In general, flexible models often deliver good prediction accuracy, but they are harder to interpret.
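
To make the two-way interactions concrete, the sketch below builds them by hand for three regressors. The data are synthetic and the column-naming scheme `a:b` is only illustrative; in a formula, `(exp1 + shs + mw)**2` requests the same main effects and pairwise products.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(1)
n = 6

# Three raw regressors (synthetic values, illustration only)
raw = {
    "exp1": rng.uniform(0, 40, size=n),
    "shs": rng.integers(0, 2, size=n).astype(float),
    "mw": rng.integers(0, 2, size=n).astype(float),
}

# Main effects plus the product of every pair of regressors
columns = dict(raw)
for a, b in combinations(raw, 2):
    columns[f"{a}:{b}"] = raw[a] * raw[b]

names = list(columns)
X = np.column_stack(list(columns.values()))
print(names)
print(X.shape)
```

With $k$ numeric regressors this yields $k + k(k-1)/2$ columns. Categorical variables such as `occ2` and `ind2` expand into many indicator columns each, which is why the flexible specification grows so quickly.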

#### Basic regression

Now, let us fit both models to our data by running ordinary least squares (OLS):

In [53]:
# Import packages for OLS regression

import statsmodels.api as sm
import statsmodels.formula.api as smf

In [54]:
basic = 'lwage ~ sex + exp1 + shs + hsg + scl + clg + mw + so + we + occ2+ ind2'
basic_results = smf.ols(basic , data=data).fit()
print(basic_results.summary()) # estimated coefficients
print( "Number of regressors in the basic model:",len(basic_results.params), '\n')  # number of regressors in the Basic Model

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.180
Model:                            OLS   Adj. R-squared:                  0.151
Method:                 Least Squares   F-statistic:                     6.212
Date:                Thu, 16 Sep 2021   Prob (F-statistic):           9.07e-33
Time:                        19:57:48   Log-Likelihood:                -872.87
No. Observations:                1376   AIC:                             1842.
Df Residuals:                    1328   BIC:                             2093.
Df Model:                          47                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.0216      0.062     32.368      0.0

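The basic model consists of $51$ regressors, counting the intercept. In this subsample `scl` and `clg` appear to be identically zero and `hsg = 1 - shs`, so three of those columns are redundant. This matches the summary above, which reports Df Model $= 47$, that is, $48$ linearly independent columns including the intercept.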

#### Flexible regression

In [55]:
flex = 'lwage ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'
flex_results_0 = smf.ols(flex , data=data)
flex_results = smf.ols(flex , data=data).fit()
print(flex_results.summary()) # estimated coefficients
print( "Number of regressors in the flexible model:", len(flex_results.params), '\n') # number of regressors in the Flexible Model

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.507
Model:                            OLS   Adj. R-squared:                  0.232
Method:                 Least Squares   F-statistic:                     1.840
Date:                Thu, 16 Sep 2021   Prob (F-statistic):           2.24e-15
Time:                        19:58:40   Log-Likelihood:                -522.96
No. Observations:                1376   AIC:                             2034.
Df Residuals:                     882   BIC:                             4616.
Df Model:                         493                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                 3.63

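The flexible model consists of $979$ regressors. Many of them are redundant in this subsample: the summary above reports Df Model $= 493$, that is, only $494$ linearly independent columns including the intercept.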

#### Lasso 

## Question 3

In addition, run two cases of partialling-out using lasso. Remember that we want to find the coefficient (beta) associated with sex.
Y = log(wage), D = sex
- Case 1: Partialling-out using lasso 1: matrix W = 'exp1 + shs + hsg + scl + clg + mw + so + we + occ2 + ind2'
- Case 2: Partialling-out using lasso 2: matrix W = '(exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2'
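
The partialling-out procedure asked for above has three steps: residualize Y on W with lasso, residualize D on W with lasso, then regress the first residual on the second. The sketch below shows those steps on synthetic data. It is not the lasso routine used in the course notebook: `lasso_residuals` is a hand-written cyclic coordinate-descent lasso, the penalty `lam` is hand-picked rather than data-driven, and the simulated effect of `D` (set to -0.1) is an assumption made so the result can be checked.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_residuals(y, W, lam, n_iter=100):
    """Residuals of y after a lasso fit on W (cyclic coordinate descent).

    Minimizes (1/(2n)) * ||y - W b||^2 + lam * ||b||_1 on centered data,
    so the intercept is handled by the centering step.
    """
    Wc = W - W.mean(axis=0)
    r = y - y.mean()              # current residual, starting from b = 0
    n, p = Wc.shape
    col_ss = (Wc ** 2).sum(axis=0) / n
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:  # skip constant columns
                continue
            rho = Wc[:, j] @ r / n + col_ss[j] * b[j]
            b_new = soft_threshold(rho, lam) / col_ss[j]
            r += Wc[:, j] * (b[j] - b_new)
            b[j] = b_new
    return r

# Synthetic data: D plays the role of sex, W the role of the controls
rng = np.random.default_rng(0)
n, p = 2000, 20
W = rng.normal(size=(n, p))
D = (W[:, 0] + rng.normal(size=n) > 0).astype(float)
Y = -0.1 * D + 0.5 * W[:, 0] + 0.3 * W[:, 1] + rng.normal(scale=0.5, size=n)

res_Y = lasso_residuals(Y, W, lam=0.02)   # step 1: partial W out of Y
res_D = lasso_residuals(D, W, lam=0.02)   # step 2: partial W out of D
beta_sex = (res_D @ res_Y) / (res_D @ res_D)  # step 3: OLS of res_Y on res_D
print("Estimated coefficient on D:", beta_sex)
```

For Case 1 and Case 2, `W` would be replaced by the design matrix built from the corresponding formula, with `Y` the log wage and `D` the `sex` column.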