# Notes on the SAS Course
# Predictive Modeling Using Logistic Regression (15.1)

This course covers predictive modeling using SAS/STAT software with emphasis on the LOGISTIC procedure. This course also discusses selecting variables and interactions, recoding categorical variables based on the smooth weight of evidence, assessing models, treating missing values, and using efficiency techniques for massive data sets. These notes are based on the course materials; some code and images are copyrighted by SAS Institute. I made a Jupyter Notebook using JupyterLab with SAS University Edition.

```sas
/* Run this script to configure the session */

%let InicioCurso=/folders/myfolders/Cursos/EPMLR51;
%include "&InicioCurso/setup.sas";
```

## Lesson 4: Measuring Model Performance
You know how to prepare your input variables by addressing common problems and how to then use the prepared data to develop a family of increasingly complex models. Now you need a way to measure how well your models generalize to new data, and then you need to choose the best model.
In this lesson, you learn how to do an honest assessment of your model and use a variety of metrics to assess model performance and select the best model. In addition to the most common metrics, you learn about profit-based metrics, the Kolmogorov-Smirnov statistic, and model selection plots.

### 4.1 Honest Assessment of the Model

After you prepare input variables and fit a model, you need to ask yourself, "Does my model generalize well?" You need to do an honest assessment of how well your model performs on a different sample of data than you used to develop the model.

In this topic, you learn to do the following:

* explain the benefit of comparing the training and validation data fit statistics versus model complexity
* prepare the input variables in the validation data set

#### Fit versus Complexity
After you create a family of increasingly complex models, you need to compare and evaluate them on the training and validation data. You can use a variety of metrics to measure model performance. Here is a graph that illustrates model fit versus complexity on both the training and validation data sets. Fit is on the Y axis and complexity is on the X axis.

First, let's plot the line for a model of increasing complexity that is fit to the training data set. The fit statistics tend to increase as complexity increases. Some of this increase happens because the model is capturing relevant trends in the data. However, some of the increase is due to overfitting. In other words, the model is identifying peculiarities of the training data set. To see the point at which overfitting begins, you compare the training fit line to the validation fit line. As you would expect, the model's fit statistics for the validation data tend to be lower than the fit statistics for the training data. Initially, the validation fit line increases with complexity, as more complex models detect more usable data patterns. Then the line tends to plateau, indicating more complicated models that do not increase fit. When the model becomes very complex, the line starts to decrease. The decrease in fit is due to overfitting.

Notice that the most complex model has the greatest difference between the training fit line and the validation fit line. This difference, known as shrinkage, is another statistic that some modelers use when measuring a model's overall predictive power. So, when comparing models, you might use a rule that says "Choose the simplest model that has the highest validation fit measure, with no more than 10% shrinkage from the training to the validation results."
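The selection rule above can be sketched in a few lines of SAS. This is only an illustration: the data set names, the model labels, and the fit values are made up, and I am assuming that "10% shrinkage" means the relative drop from the training fit to the validation fit.

```sas
/* Hypothetical fit statistics (for example, c statistics) for four
   candidate models of increasing complexity. All values are made up. */
data work.fit_stats;
   input model $ n_inputs train_fit valid_fit;
   /* Assumption: shrinkage is the relative drop from training to validation */
   shrinkage = (train_fit - valid_fit) / train_fit;
   datalines;
M1 3 0.70 0.69
M2 6 0.76 0.74
M3 12 0.80 0.74
M4 25 0.90 0.72
;
run;

/* Keep the models within the 10% shrinkage limit, then order them by
   highest validation fit, breaking ties in favor of fewer inputs. */
proc sort data=work.fit_stats out=work.candidates;
   where shrinkage <= 0.10;
   by descending valid_fit n_inputs;
run;

/* The first row is the model chosen by the rule. */
proc print data=work.candidates(obs=1) noobs;
run;
```

With these made-up numbers, M4 has the best training fit but is dropped because its shrinkage is 20%. M2 and M3 tie on validation fit, and the rule prefers M2 because it is simpler.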

If the measure of model fit is some sort of error rate, then the plot looks like the one in the previous example but flipped about the horizontal axis. If there is no profit or cost information, the Mean Squared Error (MSE) is one such fit statistic that measures how poorly a model fits. That is, smaller is better.

#### Assessing Models when Target Event Data Is Rare
Data splitting is a simple technique. However, when the target event is rare, you might not be able to afford to split your data because you want to use all of the target event cases to fit the model. Furthermore, when the test set is small, the performance measures might be unreliable because of high variability. In this situation, you can use other honest assessment approaches, such as bootstrapping and k-fold cross validation.

One approach that is frugal with the data is to assess the model on the same data set that was used for training but to penalize the assessment for optimism (Ripley 1996). The appropriate penalty can be determined theoretically or by using computationally intensive methods such as the bootstrap method. Bootstrapping is repeated sampling with replacement. A model is fit to each sample, the assessment statistics are calculated for each model, and the average of the assessment statistics is calculated. It is possible to write a macro to do bootstrapping. However, bootstrapping is not covered in this course.
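Although the course does not cover it, the bootstrap loop described above (sample with replacement, fit a model to each sample, average the assessment statistic) might look roughly like this. It is a sketch on simulated data, with hypothetical data set and variable names, and it only averages the c statistic across samples. It does not compute an optimism penalty.

```sas
/* Toy data: a binary target y that depends on one input x.
   Names and coefficients are hypothetical. */
data work.toy;
   call streaminit(2718);
   do id = 1 to 500;
      x = rand('normal');
      y = rand('bernoulli', 1 / (1 + exp(-(-2 + x))));
      output;
   end;
run;

/* Draw 50 bootstrap samples. METHOD=URS samples with replacement,
   and OUTHITS writes one row per selection. */
proc surveyselect data=work.toy out=work.boot seed=2718
                  method=urs samprate=1 reps=50 outhits noprint;
run;

/* Fit one model per sample and capture the association statistics. */
ods exclude all;
proc logistic data=work.boot;
   by replicate;
   model y(event='1') = x;
   ods output Association=work.assoc;
run;
ods exclude none;

/* Average the c statistic over the bootstrap samples. */
proc means data=work.assoc mean std;
   where Label2 = 'c';
   var nValue2;
run;
```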

For small and moderate data sets, k-fold cross validation, also called v-fold cross validation (Breiman et al. 1984; Ripley 1996; Hand 1997), is a better strategy than data splitting. In k-fold cross validation, you split your data into k parts, also called folds. The benefit of using k-fold cross validation is that you use all of the data for both training and validation. Let's look at an example.

Suppose you split your data into five folds: A, B, C, D, and E. First, you train your model on parts B, C, D, and E, and then validate the model on part A. Next, you train your model using the data in parts A, C, D, and E and then validate the model on part B. You repeat this process so that you get validation statistics for each of the remaining parts. Because there are five parts, you get a total of five validation statistics, one for each part. Finally, you calculate the average of the five validation statistics. You use this average as the overall honest assessment of the model's ability to generalize. K-fold cross validation gives you accurate validation statistics, but it doesn't give you a final model. You get your final model by fitting the model to the entire development data set.
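Here is one way the five-fold procedure could be written in SAS. This is a sketch on simulated data, not code from the course: the macro name, the data set names, and the variables are hypothetical. I use the mean squared error on each held-out fold as the validation statistic, and I assign folds from the row number only because the toy rows are already in random order.

```sas
/* Toy data with a fold number from 1 to 5 (hypothetical names). */
data work.toy;
   call streaminit(1984);
   do id = 1 to 1000;
      x = rand('normal');
      y = rand('bernoulli', 1 / (1 + exp(-(-1 + x))));
      fold = mod(id, 5) + 1;
      output;
   end;
run;

%macro kfold(k=5);
   %local f;
   %do f = 1 %to &k;
      /* Train on every fold except fold &f, then score fold &f. */
      proc logistic data=work.toy(where=(fold ne &f)) noprint;
         model y(event='1') = x;
         score data=work.toy(where=(fold = &f)) out=work.scored&f;
      run;
   %end;
%mend kfold;
%kfold(k=5)

/* Squared error of the predicted event probability on held-out rows. */
data work.cv_scored;
   set work.scored1-work.scored5;
   sq_err = (y - P_1)**2;
run;

/* One validation MSE per fold ... */
proc means data=work.cv_scored nway noprint;
   class fold;
   var sq_err;
   output out=work.fold_mse mean=mse;
run;

/* ... and their average as the overall honest assessment. */
proc means data=work.fold_mse mean;
   var mse;
run;
```

The final model would still be fit separately on all of work.toy, because the five fold models are used only for assessment.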

#### Preparing the Validation Data
Note: The demonstrations in this course build on each other. If you want to perform this demonstration in your own SAS software and you started a new SAS session after you performed the previous demonstration, open l4_demos.sas. It contains the solution code for all demos in Lessons 1, 2, 3, and 4. Locate the code for the previous demos, review the comments to see if any modifications are needed, and then submit the code.

Early in the target marketing project, we split the data into training and validation data sets. Now we are ready to prepare the validation data set that we'll score later to assess our model. Remember that the validation data needs to be prepared for scoring the same way that the training data was prepared for model building. Data preparation includes imputing missing values, creating new inputs, and applying any necessary transformations. In this demonstration, we do the following:

* identify inputs that need imputation using PROC MEANS
* create an output data set with the medians of the inputs that have missing values using PROC UNIVARIATE
* impute values, create inputs, and apply a transformation using the DATA step

Let's look at the code. We have: proc means data=work.valid (our validation data set), and we're asking for the number of missing values. The VAR statement lists the variables that were in our training data set model. And run.
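The step just described has roughly this shape. To keep the sketch runnable on its own, it uses a tiny made-up data set (work.toy_valid) instead of work.valid, and the variable names CC, CCBal, Inv, and SavBal are my assumption about how the inputs are named. The real VAR statement lists every input in the model.

```sas
/* Toy stand-in for the validation data (values are made up). */
data work.toy_valid;
   input CC CCBal Inv SavBal;
   datalines;
1 250.50 0 1200
. . . 300
0 0 . 0
1 98.10 1 4500
;
run;

/* NMISS reports the number of missing values for each listed variable. */
proc means data=work.toy_valid nmiss;
   var CC CCBal Inv SavBal;
run;
```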

So we'll submit the code. And it looks like credit card balance has some missing data, and investment, and credit card. So in the validation data set, missing values should be replaced with the medians from the training data set. PROC UNIVARIATE is used to create an output data set with the medians of those variables.

So we have: proc univariate data=work.train_imputed_swoe_bins (swoe stands for smoothed weight of evidence) with the NOPRINT option. The VAR statement lists the three variables that had missing values. The OUTPUT statement creates an output data set called work.medians. The PCTLPTS= option requests the 50th percentile, and the PCTLPRE= option specifies a prefix for the variable names in the output data set. In this case, it'll be CC, CCBAL, and Investment. Let's run the PROC UNIVARIATE code.
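A self-contained sketch of this PROC UNIVARIATE step is shown below. It runs on a small made-up training table rather than work.train_imputed_swoe_bins, and the variable names CC, CCBal, and Inv are assumptions. Each output variable is named by joining the prefix and the percentile, which gives CC50, CCBal50, and Inv50 here.

```sas
/* Toy stand-in for the training data (values are made up). */
data work.toy_train;
   input CC CCBal Inv;
   datalines;
0 0 0
1 120 0
1 300 1
0 0 0
1 80 1
;
run;

/* Write the median (50th percentile) of each variable to one row. */
proc univariate data=work.toy_train noprint;
   var CC CCBal Inv;
   output out=work.toy_medians
          pctlpts=50
          pctlpre=CC CCBal Inv;
run;

proc print data=work.toy_medians noobs;
run;
```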

The DATA step first combines values from a single observation in one data set (in this case, work.medians) with all the observations in another data set, work.valid. So we have: data work.valid_imputed_swoe_bins. We're dropping the medians and their indicator. if _N_=1 then set work.medians; set work.valid; And that does that 1 to N merge.

We have two arrays. We have array x that has the three variables that have missing values, and we have array med that has our three medians. The DO loop simply replaces the missing values with the medians. So, do i=1 to dim(x); (in this case, 3) if x(i)=. (missing) then x(i)=med(i);

So let's look at the first one. x(1) would be credit card, so if credit card is missing, then credit card equals med(1), which is credit card 50, which is the median for credit card. So if credit card is missing, it's replaced with the median from credit card. And the END goes with the DO loop.
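Putting the pieces described so far together, the imputation part of the DATA step might look like the sketch below. It uses toy data sets with hypothetical names and values so that it runs on its own, and it leaves out the %INCLUDE statements because they depend on scoring code created in earlier demos.

```sas
/* One row of training medians (made-up values). */
data work.toy_medians;
   CC50 = 1; CCBal50 = 80; Inv50 = 0;
run;

/* Toy validation data with missing values. */
data work.toy_valid;
   input CC CCBal Inv;
   datalines;
1 250.5 0
. . .
0 0 .
;
run;

data work.toy_valid_imputed(drop=CC50 CCBal50 Inv50 i);
   /* Read the medians once. Variables read with SET are retained,
      so the medians are available on every later row. */
   if _N_ = 1 then set work.toy_medians;
   set work.toy_valid;
   array x(*) CC CCBal Inv;
   array med(*) CC50 CCBal50 Inv50;
   /* Replace each missing input with its training median. */
   do i = 1 to dim(x);
      if x(i) = . then x(i) = med(i);
   end;
run;

proc print data=work.toy_valid_imputed noobs;
run;
```

The order of the two arrays matters: element i of x must line up with element i of med, or a variable is imputed with the wrong median.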

This DO loop goes three times, replacing the missing values for the three variables. The Branch smoothed weight of evidence variable and the bins for checking account balance, which is the rank transformed input, are added with the %INCLUDE statements. So we're including the scoring code that created the Branch smoothed weight of evidence variable, and we include the scoring code that created the bins for the checking account balance.

And if you don't have a checking account, your checking account balance is the overall mean. So let's submit that DATA step, and we have prepared our validation data for scoring.

Instead of using PROC UNIVARIATE and the DATA step to replace the missing values in the validation data set with the medians from the training data set, you can use PROC STDIZE. For more information, see Imputation with PROC STDIZE in the Resources section.
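As I understand the PROC STDIZE approach, it takes two calls: one that stores the training medians in an OUTSTAT= data set, and one that reads them back with METHOD=IN() to fill the validation data. The sketch below uses toy data and hypothetical names, so treat it as an illustration rather than the course's code.

```sas
/* Toy training and validation data (values are made up). */
data work.toy_train;
   input CC CCBal Inv;
   datalines;
0 0 0
1 120 0
1 300 1
0 0 0
1 80 1
;
run;

data work.toy_valid;
   input CC CCBal Inv;
   datalines;
1 250.5 0
. . .
0 0 .
;
run;

/* Step 1: compute the training medians and store them in OUTSTAT=.
   REPONLY replaces missing values without standardizing the data. */
proc stdize data=work.toy_train out=work.toy_train_imputed
            method=median reponly outstat=work.toy_med;
   var CC CCBal Inv;
run;

/* Step 2: fill missing validation values with the stored training medians. */
proc stdize data=work.toy_valid out=work.toy_valid_imputed
            method=in(work.toy_med) reponly;
   var CC CCBal Inv;
run;

proc print data=work.toy_valid_imputed noobs;
run;
```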

The question arises: What metrics can we use to measure model performance? You'll learn about that next.