# Lasso Regression

Lasso regression is a **penalized regression method**, often used in machine learning to **select a subset of variables**. It is a supervised machine learning method. Specifically, LASSO is a **shrinkage and variable selection** method for linear regression models. <br>
<br>
LASSO is an acronym for **Least Absolute Shrinkage and Selection Operator**.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Shrinkage**<br>
The LASSO imposes a constraint on the sum of the absolute values of the model parameters, where the sum has a specified constant as an upper bound. This constraint causes regression coefficients for some variables to shrink towards zero. 
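
In symbols, the constraint described above is usually written as follows. This is the standard textbook form. The notation ($y_i$ for the response, $x_{ij}$ for the predictors, $\beta_j$ for the coefficients, $t$ for the upper bound, $n$ observations, $p$ predictors) is chosen here for illustration and is not taken from the figures.

$$
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t
$$

A smaller bound $t$ forces more shrinkage. The same problem is often stated in penalized form, where $\lambda \sum_{j=1}^{p} |\beta_j|$ with $\lambda \ge 0$ is added to the residual sum of squares instead of imposing the bound.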

### Why Lasso

![image.png](attachment:image.png)
1. If the true relationship between the response variable and the predictors is approximately linear and you have a large number of observations, then the OLS regression parameter estimates will have low bias and low variance. However, if you have a relatively small number of observations and a large number of predictors, then the variance of the OLS parameter estimates will be higher. In this case, Lasso Regression is useful because **shrinking the regression coefficients can reduce variance without a substantial increase in bias**. 
2. Often, at least some of the explanatory variables in an OLS multiple regression analysis are not really associated with the response variable. As a result, we often end up with a model that is overfitted and more difficult to interpret. With Lasso Regression, the regression coefficients for **unimportant variables are reduced to zero**, which effectively removes them from the model and produces a simpler model that keeps only the most important predictors. 
3. lambda: controls the strength of the penalty. The larger lambda is, the more the coefficients are shrunk and the more of them are set to zero.
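
The effect of lambda can be illustrated with a minimal Python sketch. It assumes the special case of orthonormal predictors, where the lasso estimate is the OLS estimate pulled toward zero by lambda and set to exactly zero once lambda exceeds its absolute value (soft-thresholding). The function name `soft_threshold` and the coefficient values are hypothetical and are not part of the SAS analysis.

```python
def soft_threshold(b, lam):
    """Shrink b toward zero by lam; return exactly 0.0 when |b| <= lam."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


# Hypothetical OLS estimates, for illustration only.
ols_coefs = [3.0, -1.5, 0.4, -0.2]

for lam in (0.0, 0.5, 2.0):
    shrunk = [soft_threshold(b, lam) for b in ols_coefs]
    print(f"lambda={lam}: {shrunk}")
# lambda=0.0: [3.0, -1.5, 0.4, -0.2]
# lambda=0.5: [2.5, -1.0, 0.0, 0.0]
# lambda=2.0: [1.0, 0.0, 0.0, 0.0]
```

With lambda equal to zero the OLS estimates are unchanged. As lambda grows, the small coefficients drop out first, which is the variable selection behavior described in point 2.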

![image.png](attachment:image.png)

# SAS Program
![image.png](attachment:image.png)

![image.png](attachment:image.png)

1. srs: splits the data by using simple random sampling.
2. outall option: tells SAS to include both the training and test observations in a single output data set, with a new variable called selected that indicates whether an observation belongs to the training set or the test set.
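
As a rough Python analogue of this split (not the SAS procedure itself), the sketch below draws a simple random sample without replacement and returns a 0/1 flag per observation, like the selected variable. The function `srs_split` is hypothetical. Rounding the training size up is an assumption chosen because it gives the 1032/441 split of 1473 observations reported in the Result Analysis. Python's random generator will not pick the same rows as SAS, even with the same seed.

```python
import math
import random


def srs_split(n_obs, train_rate, seed):
    """Return one flag per observation: 1 = training, 0 = test."""
    rng = random.Random(seed)
    # Assumption: the training sample size is rounded up.
    n_train = math.ceil(n_obs * train_rate)
    # Simple random sampling without replacement.
    train_ids = set(rng.sample(range(n_obs), n_train))
    return [1 if i in train_ids else 0 for i in range(n_obs)]


selected = srs_split(n_obs=1473, train_rate=0.7, seed=123)
n_train = sum(selected)
n_test = len(selected) - n_train
print(n_train, n_test)  # 1032 441
```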

![image.png](attachment:image.png)

1. partition role: observations with a value of one on the selected variable are assigned the role of training observation, and observations with a value of zero are assigned the role of test observation. 
2. /selection=lar: the selection option tells SAS which method to use to compute the parameters for variable selection. Here we use the LAR algorithm.
3. choose=cv: asks SAS to use cross validation to choose the final statistical model. 
4. stop=none: ensures that the selection process doesn't stop until each of the candidate predictor variables has been tested. 
5. cvmethod=random(10): specifies a K-fold cross-validation method with ten randomly selected folds.
<br>
<br>
At each step of the estimation process, a **new predictor is entered into the model**. The model is fit on nine of the folds and the mean square error is calculated on the remaining validation fold. This is repeated so that each of the ten folds serves once as the validation fold, and the ten errors are then averaged. The model with the **lowest average mean square error** is selected by SAS as the **best model**.<br>
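
The K-fold idea can be sketched in plain Python. This is a toy illustration, not what the glmselect procedure runs internally: the "model" simply predicts the mean of the training folds, and the names `random_folds` and `cv_mse_mean_model` and the response values are hypothetical.

```python
import random


def random_folds(n_obs, k, seed):
    """Randomly assign each observation to one of k folds of near-equal size."""
    rng = random.Random(seed)
    folds = [i % k for i in range(n_obs)]
    rng.shuffle(folds)
    return folds


def cv_mse_mean_model(y, k, seed):
    """Average validation MSE of a toy model that predicts the training mean."""
    folds = random_folds(len(y), k, seed)
    fold_mses = []
    for held_out in range(k):
        # Fit on the other k - 1 folds, validate on the held-out fold.
        train = [v for v, f in zip(y, folds) if f != held_out]
        valid = [v for v, f in zip(y, folds) if f == held_out]
        pred = sum(train) / len(train)
        fold_mses.append(sum((v - pred) ** 2 for v in valid) / len(valid))
    return sum(fold_mses) / k


y = [float(i % 7) for i in range(50)]  # hypothetical response values
print(cv_mse_mean_model(y, k=10, seed=123))
```

In a real selection run, this average would be computed for the candidate model at every step, and the step with the lowest value would be kept.
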
<br>
In lasso regression, the penalty term is **not fair** if the predictor variables are **not on the same scale**, meaning that not all the predictors will get the same penalty. The SAS **glmselect procedure** handles this by **automatically standardizing the predictor variables**, so that they all have a mean equal to zero and a standard deviation equal to one, which places them all **on the same scale**.
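
What standardizing does can be shown with a small Python sketch. The function `standardize` and the two example variables are hypothetical. Using the sample standard deviation is an assumption made here, not a statement about the exact formula SAS uses.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Two hypothetical predictors on very different scales.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
income = [10000.0, 20000.0, 30000.0, 40000.0, 50000.0]

z_hours = standardize(hours)
z_income = standardize(income)
print([round(z, 3) for z in z_hours])
print([round(z, 3) for z in z_income])
```

After rescaling, both predictors have the same values, so a given penalty constrains their coefficients equally.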

#### LAR algorithm
LAR stands for **Least Angle Regression**. This algorithm starts with no predictors in the model and adds a predictor at each step. 

It first adds the predictor that is most correlated with the response variable and moves its coefficient towards the least squares estimate until there is another predictor that is equally correlated with the model residual. It adds this predictor to the model and starts the least squares estimation process over again with both variables. The LAR algorithm continues with this process until it has tested all the predictors. Parameter estimates at any step are shrunk, and predictors with coefficients that are shrunk to zero are removed from the model, and the process starts all over again. 
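
Only the first step of this process is sketched below in Python: choosing the predictor with the largest absolute correlation with the response. It is not a full LAR implementation, and the helper names `pearson` and `first_lar_predictor` and the data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def first_lar_predictor(predictors, response):
    """Name of the predictor most correlated (in absolute value) with the response."""
    return max(predictors, key=lambda name: abs(pearson(predictors[name], response)))


# Hypothetical data, for illustration only.
predictors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 5.0],
    "x3": [5.0, 3.0, 4.0, 1.0, 2.0],
}
response = [1.1, 1.9, 3.2, 3.9, 5.1]

print(first_lar_predictor(predictors, response))  # x1
```

In later steps the same comparison is made against the current model residual rather than the raw response.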

# Result Analysis:


Table 1 shows the output of the SURVEYSELECT procedure, which we used to split the observations in the total data set into training and test data. 70% of the observations in the total sample are assigned to the training data set, and the seed is 123.

![image.png](attachment:image.png)

Table 2 shows that the target is "times skip school without an excuse" (H1ED2) and that the selection method is LAR. The criterion for choosing the best model is 10-fold cross validation with random assignment of observations to the folds. Of the 6504 observations read, 1473 are used: 1032 in the training data set and 441 in the test data set.
The number of parameters to be estimated is 12: the intercept plus the 11 predictors.

![image.png](attachment:image.png)

Table 3 summarizes the LAR selection. It shows the steps in the analysis and the variable that is entered at each step. "paying attention in school" (H1ED16) is the first predictor added to the model, so it is the most important predictor of "times skip school without an excuse". Before any predictor is added to the model, the ASE for the training and test data is 59.38 and 42.37 respectively, which is very high. At the final step, the ASE is 53.17 and 38.79. The ASE decreases as variables are added to the model, indicating that the prediction accuracy improves as each variable is added. There is an asterisk at step 8. This is the model selected as the best model by the procedure, because it has the lowest residual sum of squares summed across the folds (CV PRESS), which is also shown in Table 4.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

The Table 4 plot shows the change in the regression coefficients at each step, and the vertical line represents the selected model, which is model 8. This plot shows the relative importance of the predictor selected at each step of the selection process, how the regression coefficients changed with the addition of a new predictor at each step, and the step at which each variable entered the model. For example, as also indicated in Table 3, "paying attention in school" (H1ED16) and "expelled from school" (H1ED9) had the largest regression coefficients, followed by "feel close to people at your school" (H1ED19). They are all positively associated with "times skip school without excuse".

The Table 5 plot shows at which step in the selection process different selection criteria would choose the best model. The AIC, AICC, and Adj R-Sq criteria selected more complex models than the criterion based on cross validation, so they possibly selected an overfitted model. 

![image.png](attachment:image.png)

The Table 6 plot shows the change in the ASE at each step in the process. Per Table 3, the test ASE is lower than the training ASE (38.79 versus 53.17 at the final step) rather than higher, as one would usually expect, and both decrease as predictors are added. The test ASE follows the training ASE from step to step, which suggests that prediction accuracy was fairly stable across the two data sets. 

![image.png](attachment:image.png)

Table 7 shows the R-Square (0.0917) and adjusted R-Square (0.0846) for the selected model, and the mean square error for both the training data (53.93) and the test data (38.92). It also shows the estimated regression coefficients for the selected model. For example, the estimated regression coefficient of H1TO16 is 0.069358. 
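
As a quick arithmetic check, the adjusted R-Square can be recomputed from the R-Square with the usual formula. The sketch below assumes that the selected model (step 8) contains 8 predictors and that the fit uses the 1032 training observations. Both are assumptions inferred from the tables, and `adjusted_r_square` is a hypothetical helper.

```python
def adjusted_r_square(r_square, n_obs, n_predictors):
    """Adjusted R-square: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r_square) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Assumed inputs: 1032 training observations, 8 predictors in the model.
adj = adjusted_r_square(r_square=0.0917, n_obs=1032, n_predictors=8)
print(round(adj, 4))  # 0.0846
```

Under these assumptions the result agrees with the adjusted R-Square reported in Table 7.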

![image.png](attachment:image.png)

# Lasso Regression Limitations


![image.png](attachment:image.png)

![image.png](attachment:image.png)