# Notes on the SAS Course
## Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression

These notes are based on the course materials; some of the code and images are the property of the course provider. I made a Jupyter Notebook using JupyterLab with SAS University Edition.

## 0. Script to set up the session
Run this script at the beginning of each session so that the data is accessed correctly.

In [21]:
%let InicioCurso=/folders/myfolders/Statistics1_ANOVA_Regression_LogisticRegression;
%include "&InicioCurso/inicio.sas";

## 6: Model Building for Scoring and Prediction

* Explaining relationships and predicting future values of a response variable.
* How to transition from inferential statistics to predictive modeling.
* How to assess models using honest assessment.
* After we choose the best-performing model, we'll discuss ways to deploy the model to predict new data.

### Introduction to Predictive Modeling
...

#### Predictive Modeling Terminology
* The process begins with partitioning a data set into separate training and validation data sets.
* The model is built using the training data and then assessed using the validation data.
* After a best model is chosen, the model is deployed to make predictions on new data using a process called scoring.
* A predictive model consists of either a formula or rules.
* Predictive regression models are parametric and have formulas.
* Predictive models based on nonparametric methods, such as decision trees and random forests, consist of a sequence of decisions, or rules, based on the values of the inputs.

#### Model Complexity

* The important concept is that there isn't one perfect model. There's always a balance between overfitting and underfitting.

#### Building a Predictive Model

* You start by fitting a variety of models, and then you assess their performance and select the best model. 
* **The key is to not overfit the training data set.**
* How can you select the best model? Use honest assessment: partition the available data into a training data set and a validation data set, and sometimes a third data set for testing. All partitions contain the predictors and the response.
* The **training data** set is used to fit a variety of different models.
* The **validation data** set is a holdout sample that's used to compare model performance and select the best performing model. Using a holdout sample is a way of assessing how well the models generalize to new data. 
* The **test data** set is used to give a final honest estimate of generalization for the chosen model. In practice, many analysts see no need for a final assessment. Instead, the model assessment measured on the validation data is reported as an upper bound on the performance that is expected when the model is deployed.
* Common partitions include 70% training and 30% validation.
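
As a rough illustration of the partitioning idea above, the following DATA step sketches a manual 70/30 split. The output names `work.train` and `work.valid` and the seed value 12345 are hypothetical, and because each row is assigned at random the split is only approximately 70/30.

```sas
/* Hypothetical sketch: split STAT1.ameshousing3 into training and validation sets. */
/* work.train, work.valid, and the seed 12345 are illustrative names and values.    */
data work.train work.valid;
    set STAT1.ameshousing3;
    /* Seed the random-number stream so that the split is reproducible. */
    call streaminit(12345);
    /* Send roughly 70% of the rows to training and the rest to validation. */
    if rand('uniform') < 0.7 then output work.train;
    else output work.valid;
run;
```

Data sets created this way could then be passed to a modeling procedure as separate training and validation inputs, for example through the DATA= and VALDATA= options of PROC GLMSELECT.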

#### Demo: Building a Predictive Model Using PROC GLMSELECT
We use the GLMSELECT procedure to build a predictive linear regression model of SalePrice from both categorical and continuous predictors.



In [25]:
%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area 
         Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom;
%let categorical=House_Style2 Overall_Qual2 Overall_Cond2 Fireplaces 
         Season_Sold Garage_Type_2 Foundation_2 Heating_QC 
         Masonry_Veneer Lot_Shape_2 Central_Air;

ods graphics;

proc glmselect data=STAT1.ameshousing3
               plots=all 
               valdata=STAT1.ameshousing4;
    class &categorical / param=glm ref=first;
    model SalePrice=&categorical &interval / 
               selection=backward
               select=sbc 
               choose=validate;
    store out=STAT1.amesstore;
    title "Selecting the Best Model using Honest Assessment";
run;

| | |
|---|---|
| Data Set | STAT1.AMESHOUSING3 |
| Validation Data Set | STAT1.AMESHOUSING4 |
| Dependent Variable | SalePrice |
| Selection Method | Backward |
| Select Criterion | SBC |
| Stop Criterion | SBC |
| Choose Criterion | Validation ASE |
| Effect Hierarchy Enforced | |

| Observation Profile for Analysis Data | |
|---|---|
| Number of Observations Read | 300 |
| Number of Observations Used | 294 |
| Number of Observations Used for Training | 294 |

| Observation Profile for Validation Data | |
|---|---|
| Number of Observations Read | 300 |
| Number of Observations Used | 293 |

Class Level Information

| Class | Levels | Values |
|---|---|---|
| House_Style2 | 5 | 1Story 2Story SFoyer SLvl 1.5Fin |
| Overall_Qual2 | 3 | 5 6 4 |
| Overall_Cond2 | 3 | 5 6 4 |
| Fireplaces | 3 | 1 2 0 |
| Season_Sold | 4 | 2 3 4 1 |
| Garage_Type_2 | 3 | Detached NA Attached |
| Foundation_2 | 3 | Cinder Block Concrete/Slab Brick/Tile/Stone |
| Heating_QC | 4 | Fa Gd TA Ex |
| Masonry_Veneer | 2 | Y N |
| Lot_Shape_2 | 2 | Regular Irregular |

| Dimensions | |
|---|---|
| Number of Effects | 20 |
| Number of Parameters | 43 |

Backward Selection Summary

| Step | Effect Removed | Number Effects In | Number Parms In | SBC | ASE | Validation ASE |
|---|---|---|---|---|---|---|
| 0 | | 20 | 32 | 5779.6460 | 185773538 | 252878776 |
| 1 | Season_Sold | 19 | 29 | 5762.6753 | 185824120 | 252480746 |
| 2 | House_Style2 | 18 | 25 | 5750.8247 | 192832172 | 248469026 |
| 3 | Foundation_2 | 17 | 23 | 5740.3830 | 193440101 | 248951925 |
| 4 | Garage_Type_2 | 16 | 21 | 5730.0735 | 194137231 | 247966687 |
| 5 | Central_Air | 15 | 20 | 5724.5490 | 194242334 | 247854963\* |
| 6 | Heating_QC | 14 | 17 | 5721.3123 | 203586891 | 259432895 |
| 7 | Masonry_Veneer | 13 | 16 | 5718.5873 | 205646000 | 263660934 |
| 8 | Lot_Shape_2 | 12 | 15 | 5717.9317\* | 209193215 | 265159474 |

\* Optimal Value of Criterion

Selection stopped at a local minimum of the SBC criterion.

Stop Details

| Candidate For | Effect | Candidate SBC | | Compare SBC |
|---|---|---|---|---|
| Removal | Deck_Porch_Area | 5718.6683 | > | 5717.9317 |

Effects: Intercept Overall_Qual2 Overall_Cond2 Fireplaces Heating_QC Masonry_Veneer Lot_Shape_2 Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value |
|---|---|---|---|---|
| Model | 19 | 356645200000 | 18770797693 | 90.06 |
| Error | 274 | 57107246191 | 208420607 | |
| Corrected Total | 293 | 413752400000 | | |

| | |
|---|---|
| Root MSE | 14437 |
| Dependent Mean | 137179 |
| R-Square | 0.862 |
| Adj R-Sq | 0.8524 |
| AIC | 5946.87742 |
| AICC | 5950.27448 |
| SBC | 5724.54902 |
| ASE (Train) | 194242334 |
| ASE (Validate) | 247854963 |

Parameter Estimates

| Parameter | DF | Estimate | Standard Error | t Value |
|---|---|---|---|---|
| Intercept | 1 | 51207 | 7079.121457 | 7.23 |
| Overall_Qual2 5 | 1 | 6782.080263 | 3104.469941 | 2.18 |
| Overall_Qual2 6 | 1 | 13659 | 3414.565419 | 4.00 |
| Overall_Qual2 4 | 0 | 0 | . | . |
| Overall_Cond2 5 | 1 | 8996.61802 | 4137.937302 | 2.17 |
| Overall_Cond2 6 | 1 | 15909 | 4025.283609 | 3.95 |
| Overall_Cond2 4 | 0 | 0 | . | . |
| Fireplaces 1 | 1 | 9716.205925 | 2044.560791 | 4.75 |
| Fireplaces 2 | 1 | 7235.661619 | 4540.159269 | 1.59 |
| Fireplaces 0 | 0 | 0 | . | . |


#### Partitioning a Data Set Using PROC GLMSELECT


`PROC GLMSELECT DATA=training-data-set <SEED=number>;
        MODEL targets=inputs < / options>;
        PARTITION FRACTION(<TEST=fraction> <VALIDATE=fraction>) ;
RUN;`

* The statement below requests two partitions (training and validation). 25% of the observations are written to the validation data set. The remaining three quarters, or 75%, are written to the training data set.

`PARTITION FRACTION(VALIDATE=.25);`
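
The syntax above also allows a test partition. A minimal sketch of a three-way split follows, assuming that a short list of interval predictors is enough for illustration. The fractions, the predictor subset, and the selection options are illustrative choices, not taken from the course.

```sas
/* Hypothetical sketch: 50% training, 25% validation, 25% test. */
proc glmselect data=STAT1.ameshousing3 seed=8675309;
    /* A small, illustrative subset of the interval inputs. */
    model SalePrice=Gr_Liv_Area Basement_Area Age_Sold / 
          selection=stepwise(select=sbc choose=validate);
    /* Observations not assigned to validation or test are used for training. */
    partition fraction(validate=0.25 test=0.25);
run;
```

With a test fraction requested, the validation data still drives the CHOOSE= decision, while the test data is held back for a final assessment of the chosen model.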


#### Practice: Building a Predictive Model Using PROC GLMSELECT
Use the ameshousing3 data set to build a model that predicts the sale prices of homes in Ames, Iowa, that are 1500 square feet or below, based on various home characteristics.

1. Write a PROC GLMSELECT step that predicts the values of SalePrice. Partition the stat1.ameshousing3 data set into a training data set of approximately 2/3 and a validation data set of approximately 1/3. Specify the seed 8675309. Define the Interval and Categorical macro variables as shown below, and use them to specify the inputs. Use stepwise regression as the selection method, Akaike's information criterion (AIC) to add or remove effects, and average squared error for the validation data to select the best model. Add the REF=FIRST option in the CLASS statement. Submit the code and examine the results.

`%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area 
         Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom;`
         
         
`%let categorical=House_Style2 Overall_Qual2 Overall_Cond2 Fireplaces 
         Season_Sold Garage_Type_2 Foundation_2 Heating_QC 
         Masonry_Veneer Lot_Shape_2 Central_Air;`

In [22]:
%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area 
         Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom;
%let categorical=House_Style2 Overall_Qual2 Overall_Cond2 Fireplaces 
         Season_Sold Garage_Type_2 Foundation_2 Heating_QC 
         Masonry_Veneer Lot_Shape_2 Central_Air;


/*In this example, the data set ameshousing3 is divided into */
/*training and validation using the PARTITION statement, */
/*along with the SEED= option in the PROC GLMSELECT statement.*/

proc glmselect data=STAT1.ameshousing3
               plots=all 
               seed=8675309;
   class &categorical / param=ref ref=first;
   model SalePrice=&categorical &interval / 
                   selection=stepwise
                   (select=aic 
                   choose=validate) hierarchy=single;
   partition fraction(validate=0.3333);
   title "Selecting the Best Model using Honest Assessment";
run;

| | |
|---|---|
| Data Set | STAT1.AMESHOUSING3 |
| Dependent Variable | SalePrice |
| Selection Method | Stepwise |
| Select Criterion | AIC |
| Stop Criterion | AIC |
| Choose Criterion | Validation ASE |
| Effect Hierarchy Enforced | Single |
| Random Number Seed | 8675309 |

| | |
|---|---|
| Number of Observations Read | 300 |
| Number of Observations Used | 294 |
| Number of Observations Used for Training | 197 |
| Number of Observations Used for Validation | 97 |

Class Level Information

| Class | Levels | Values |
|---|---|---|
| House_Style2 | 5 | 1.5Fin 1Story 2Story SFoyer SLvl |
| Overall_Qual2 | 3 | 4 5 6 |
| Overall_Cond2 | 3 | 4 5 6 |
| Fireplaces | 3 | 0 1 2 |
| Season_Sold | 4 | 1 2 3 4 |
| Garage_Type_2 | 3 | Attached Detached NA |
| Foundation_2 | 3 | Brick/Tile/Stone Cinder Block Concrete/Slab |
| Heating_QC | 4 | Ex Fa Gd TA |
| Masonry_Veneer | 2 | N Y |
| Lot_Shape_2 | 2 | Irregular Regular |

| Dimensions | |
|---|---|
| Number of Effects | 20 |
| Number of Parameters | 32 |

Stepwise Selection Summary

| Step | Effect Entered | Effect Removed | Number Effects In | Number Parms In | AIC | ASE | Validation ASE |
|---|---|---|---|---|---|---|---|
| 0 | Intercept | | 1 | 1 | 4335.7651 | 1303938780 | 1656501303 |
| 1 | Basement_Area | | 2 | 2 | 4222.6053 | 726746007 | 767937080 |
| 2 | Gr_Liv_Area | | 3 | 3 | 4153.7335 | 507157741 | 590152215 |
| 3 | Age_Sold | | 4 | 4 | 4070.6947 | 329360476 | 379123329 |
| 4 | Garage_Area | | 5 | 5 | 4040.9787 | 280383339 | 349351979 |
| 5 | Overall_Cond2 | | 6 | 7 | 4017.8121 | 244265684 | 348031039 |
| 6 | Fireplaces | | 7 | 9 | 4001.1755 | 219972414 | 328829426 |
| 7 | Overall_Qual2 | | 8 | 11 | 3991.0799 | 204782951 | 328466410 |
| 8 | House_Style2 | | 9 | 15 | 3981.7659 | 187553153 | 302046363 |
| 9 | Deck_Porch_Area | | 10 | 16 | 3975.3902 | 179746298 | 298786920 |

Selection stopped at a local minimum of the AIC criterion.

Stop Details

| Candidate For | Effect | Candidate AIC | | Compare AIC |
|---|---|---|---|---|
| Entry | Masonry_Veneer | 3959.1313 | > | 3958.4479 |
| Removal | Total_Bathroom | 3961.481 | > | 3958.4479 |

Effects: Intercept House_Style2 Overall_Qual2 Overall_Cond2 Fireplaces Heating_QC Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area Age_Sold

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value |
|---|---|---|---|---|
| Model | 18 | 223176500000 | 12398695049 | 65.49 |
| Error | 178 | 33699428801 | 189322634 | |
| Corrected Total | 196 | 256875900000 | | |

| | |
|---|---|
| Root MSE | 13759 |
| Dependent Mean | 133582 |
| R-Square | 0.8688 |
| Adj R-Sq | 0.8555 |
| AIC | 3971.63597 |
| AICC | 3976.4087 |
| SBC | 3835.01684 |
| ASE (Train) | 171063090 |
| ASE (Validate) | 290197323 |

Parameter Estimates

| Parameter | DF | Estimate | Standard Error | t Value |
|---|---|---|---|---|
| Intercept | 1 | 27334 | 10120 | 2.7 |
| House_Style2 1Story | 1 | 12267 | 4203.159135 | 2.92 |
| House_Style2 2Story | 1 | 2456.477699 | 4386.235156 | 0.56 |
| House_Style2 SFoyer | 1 | 20779 | 7050.033468 | 2.95 |
| House_Style2 SLvl | 1 | 17117 | 5527.649598 | 3.1 |
| Overall_Qual2 5 | 1 | 7841.596393 | 3417.138088 | 2.29 |
| Overall_Qual2 6 | 1 | 14024 | 3806.928311 | 3.68 |
| Overall_Cond2 5 | 1 | 12475 | 4949.669709 | 2.52 |
| Overall_Cond2 6 | 1 | 17766 | 4841.031305 | 3.67 |
| Fireplaces 1 | 1 | 5832.276234 | 2471.249968 | 2.36 |


2. Which model did PROC GLMSELECT choose?: *PROC GLMSELECT chose the model at Step 10, which has the following effects: Intercept, Basement_Area, Gr_Liv_Area, Age_Sold, Garage_Area, Overall_Cond2, Fireplaces, Overall_Qual2, House_Style2, Deck_Porch_Area, and Heating_QC.*

3. Resubmit the PROC GLMSELECT step. Do not make any changes to it. Does it produce the same results as before?: *The results are the same. Every time you run a specific PROC GLMSELECT step using the same seed value, the pseudo-random selection process is replicated and you get the same results.*

4. In the PROC GLMSELECT statement, change the value of SEED= and submit the modified code. Does it produce the same results as before?: *Because you used a different seed, the results are almost certainly different from the previous results.*

### Scoring Predictive Models

* Before you score new data, some preparation of the data might be required.
* It's essential for the scoring data to be comparable to the training and validation data that were used to build the model.
* The process of preparing the data for scoring can be time- and resource-intensive.

#### Methods of Scoring

* To score, you apply the score code (that is, the equations or rules obtained from the final model) to the new data.

**Methods of scoring your data**
* Use a SCORE statement in PROC GLMSELECT to build and score a model in one step.
* Use a STORE statement in PROC GLMSELECT to build an item store, and then a SCORE statement in PROC PLM to score from the item store.
* Use a STORE statement in PROC GLMSELECT to build an item store, a CODE statement in PROC PLM to generate scoring code based on the item store, and a DATA step to run the scoring code.

#### Demo: Scoring Data Using PROC PLM

In the previous demonstration, we built a predictive model and created an item store. Now, we'll use the item store to score data with two different methods, and then compare the results to show equivalence between the two methods. For demonstration purposes, we'll score ameshousing4, the validation data set from the previous demonstration. Remember that in a business environment, you would score data that was not used in either training or validation.

In [31]:
proc plm restore=STAT1.amesstore;
    score data=STAT1.ameshousing4 out=scored;
    code file="&homefolder/scoring.sas";
run;

| Store Information | |
|---|---|
| Item Store | STAT1.AMESSTORE |
| Data Set Created From | STAT1.AMESHOUSING3 |
| Created By | PROC GLMSELECT |
| Date Created | 19MAY20:05:09:57 |
| Response Variable | SalePrice |
| Class Variables | House_Style2 Overall_Qual2 Overall_Cond2 Fireplaces Season_Sold Garage_Type_2 Foundation_2 ... |
| Model Effects | Intercept Overall_Qual2 Overall_Cond2 Fireplaces Heating_QC Masonry_Veneer Lot_Shape_2 Gr_Liv_Are.. |


In [32]:
data scored2;
    set STAT1.ameshousing4;
    %include "&homefolder/scoring.sas";
run;

proc compare base=scored compare=scored2 criterion=0.0001;
    var Predicted;
    with P_SalePrice;
run;
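
Because ameshousing4 contains the actual SalePrice values, the scored output can also be used to compute the average squared error (ASE) by hand. The sketch below assumes that the `scored` data set from the PROC PLM step still exists and that its prediction column has the default name `Predicted`. The names `work.sqerr` and `SqErr` are hypothetical.

```sas
/* Hypothetical sketch: compute the average squared error on the scored data. */
data work.sqerr;
    set scored;   /* output of the SCORE statement in PROC PLM */
    /* Squared prediction error; missing if either value is missing. */
    SqErr = (SalePrice - Predicted)**2;
run;

proc means data=work.sqerr n mean;
    var SqErr;    /* the mean of SqErr is the ASE over non-missing rows */
    title "Average Squared Error on the Scored Data";
run;
title;
```

Since ameshousing4 served as the validation data when the item store was built, the mean reported here should be comparable to the validation ASE that PROC GLMSELECT reported for the chosen model.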

**Conclusion:**

We built a predictive model on training data, chose a best fitting and generalizable model according to validation data, and now we've seen multiple ways to deploy our predictive model. We can now predict new cases after we measure the model inputs by passing the new data to PROC PLM or a DATA step using score code. That is, we can predict home prices after we measure the home attributes that are needed as inputs in our predictive model. After predicting sale prices of homes in Ames, Iowa, we'll have some idea of the future commission for our real estate firm.

#### Practice: Using the SCORE Statement in PROC GLMSELECT

You want to re-create the model that was built in the previous practice (based on stat1.ameshousing3), create an item store, and then use the item store to score the new cases in stat1.ameshousing4. You'll score the data in two ways (using PROC GLMSELECT and PROC PLM) and compare the results.

1. Open the solution program from the previous practice, st106s01.sas. There is no need to examine the results, so make the following changes to the code:


* Remove the PLOTS= option.
* Add the NOPRINT option to the PROC GLMSELECT statement.
* Remove the TITLE statement.

Here is the code:


`proc glmselect data=STAT1.ameshousing3
               seed=8675309
               noprint;
   class &categorical / param=ref ref=first;
   model SalePrice=&categorical &interval / 
               selection=stepwise
               (select=aic 
               choose=validate) hierarchy=single;
   partition fraction(validate=0.3333);
run;`

2. In the PROC GLMSELECT step,


* Add a STORE statement to create an item store named store1, and a SCORE statement to score the data in stat1.ameshousing4.
* Add a PROC PLM step that uses the item store, store1, to score the data in stat1.ameshousing4.
* Note: Be sure to use different names for the two scored data sets.
* Add a PROC COMPARE step to compare the scoring results from PROC GLMSELECT and PROC PLM.

In [41]:
proc glmselect data=STAT1.ameshousing3
               seed=8675309
               noprint;
   class &categorical / param=ref ref=first;
   model SalePrice=&categorical &interval / 
               selection=stepwise
               (select=aic 
               choose=validate) hierarchy=single;
   partition fraction(validate=0.3333);
   score data=STAT1.ameshousing4 out=score1;
   store out=store1;
run;

proc plm restore=store1;
   score data=STAT1.ameshousing4 out=score2;
run;

proc compare base=score1 compare=score2 criterion=0.0001;
   var P_SalePrice;
   with Predicted;
run;

| Store Information | |
|---|---|
| Item Store | WORK.STORE1 |
| Data Set Created From | STAT1.AMESHOUSING3 |
| Created By | PROC GLMSELECT |
| Date Created | 19MAY20:05:54:33 |
| Response Variable | SalePrice |
| Class Variables | House_Style2 Overall_Qual2 Overall_Cond2 Fireplaces Season_Sold Garage_Type_2 Foundation_2 ... |
| Model Effects | Intercept House_Style2 Overall_Qual2 Overall_Cond2 Fireplaces Heating_QC Gr_Liv_Area Basement_Are.. |


In [43]:
%showLog

3. Does the PROC COMPARE output indicate any differences between the predictions produced by the two scoring methods?

The two scoring methods produce the same predictions. 

Note: Depending on the version of SAS and SAS/STAT that you are using, your results might look somewhat different from the output shown here. However, the results should indicate that these data sets do not differ.