# Notes on the SAS Course
# Predictive Modeling Using Logistic Regression (15.1)

This course covers predictive modeling using SAS/STAT software, with emphasis on the LOGISTIC procedure. It also discusses selecting variables and interactions, recoding categorical variables based on the smooth weight of evidence, assessing models, treating missing values, and using efficiency techniques for massive data sets. These notes are based on the course materials; some of the code and images are copyrighted by SAS Institute. I wrote them as a Jupyter Notebook in JupyterLab with SAS University Edition.

In [1]:
/* Run this script to configure the session */

%let InicioCurso=/folders/myfolders/Cursos/EPMLR51/;
%include "&InicioCurso/setup.sas";

SAS Connection established. Subprocess id is 2225



## Lesson 3: Fitting the Model


#### Fitting a Basic Logistic Regression Model, Part 1

First create the training and validation data.

#### Practice: Imputing Missing Values
For the veterans' organization project, impute missing values for several variables in the pmlr.pva_train data set.

Note: If you started a new SAS session after you performed the previous practice, do the following before you continue:

- Make sure you have set up your practice files as described in the Course Overview.
- Open l3_all.sas. It contains the solution code for all practices in Lessons 1, 2, and 3. Locate the code for the previous practice(s), review the comments to see if any modifications are needed, and then submit the code.


1. Write a DATA step that creates missing value indicators for the following inputs in the pmlr.pva_train data set: Donor_Age, Income_Group, and Wealth_Rating. Also add a cumulative count of the missing values. Name the output data set pmlr.pva_train_mi.

In [2]:
/* Create pva_train and pva_valid: a 50/50 split stratified by target_b */
proc sort data=pmlr.pva out=work.pva_sort;
   by target_b;
run;

proc surveyselect noprint data=work.pva_sort 
                  samprate=0.5 out=pva_sample seed=27513 
                  outall stratumseed=restore;
   strata target_b;
run;

data pmlr.pva_train(drop=selected SelectionProb SamplingWeight)
     pmlr.pva_valid(drop=selected SelectionProb SamplingWeight);
   set work.pva_sample;
   if selected then output pmlr.pva_train;
   else output pmlr.pva_valid;
run;
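
Because the split is stratified by target_b, the two partitions should show nearly the same response rate. A minimal check, assuming the DATA step above ran without errors and both data sets exist in the pmlr library:

```sas
/* Compare the target distribution in the two partitions */
proc freq data=pmlr.pva_train;
   tables target_b;
run;

proc freq data=pmlr.pva_valid;
   tables target_b;
run;
```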

In [3]:
data pmlr.pva_train_mi(drop=i);
   set pmlr.pva_train;
   /* name the missing indicator variables */
   array mi{*} mi_DONOR_AGE mi_INCOME_GROUP 
               mi_WEALTH_RATING;
   /* select variables with missing values */
   array x{*} DONOR_AGE INCOME_GROUP WEALTH_RATING;
   do i=1 to dim(mi);
      mi{i}=(x{i}=.);
      nummiss+mi{i};
   end;
run;

The log indicates that the pmlr.pva_train_mi data set has 62 variables.
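
One way to sanity-check the indicators is to compare the number of missing values in each input with the total of the matching indicator. This is only a sketch and assumes pmlr.pva_train_mi was created by the step above. Note that nummiss is built with a sum statement, so it is retained across observations and acts as a running total rather than a per-row count.

```sas
/* NMISS for each input should equal the SUM of its indicator */
proc means data=pmlr.pva_train_mi n nmiss;
   var DONOR_AGE INCOME_GROUP WEALTH_RATING;
run;

proc means data=pmlr.pva_train_mi sum;
   var mi_DONOR_AGE mi_INCOME_GROUP mi_WEALTH_RATING;
run;
```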


2. Open l3p1.sas in your SAS software. This program uses PROC RANK to group the values of the variables Recent_Response_Prop and Recent_Avg_Gift_Amt into three groups each. Note that this code creates an output data set named work.pva_train_rank.

In [4]:
proc rank data=pmlr.pva_train_mi out=work.pva_train_rank groups=3;
   var recent_response_prop recent_avg_gift_amt;
   ranks grp_resp grp_amt;
run;

The log indicates that the work.pva_train_rank data set has 64 variables.

Sort the work.pva_train_rank data set by Grp_Resp and Grp_Amt. Name the output data set work.pva_train_rank_sort.
Submit the code and check the log to verify that the code ran without errors.


In [5]:
proc sort data=work.pva_train_rank out=work.pva_train_rank_sort;
   by grp_resp grp_amt;
run;
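
With GROUPS=3, PROC RANK assigns the group values 0, 1, and 2, so the two rank variables define up to nine BY groups. Before imputing, it can help to see how many observations fall in each combination. A small sketch, assuming work.pva_train_rank_sort exists:

```sas
/* Cross-tabulate the two rank groups; show cell counts only */
proc freq data=work.pva_train_rank_sort;
   tables grp_resp*grp_amt / norow nocol nopercent;
run;
```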

To impute missing values in the work.pva_train_rank_sort data set for each BY group and create an output data set named pmlr.pva_train_imputed, add a PROC STDIZE step with a BY statement.

In [6]:
proc stdize data=work.pva_train_rank_sort method=median
            reponly out=pmlr.pva_train_imputed;
   by grp_resp grp_amt;
   var DONOR_AGE INCOME_GROUP WEALTH_RATING;
run;

The log shows that the pmlr.pva_train_imputed data set was created with 9687 observations and 64 variables.

Use PROC MEANS to determine the values that were used to replace the missing values in the pmlr.pva_train_imputed data set. Add OPTIONS statements to display variable names instead of labels in the output from PROC MEANS (using the NOLABEL option) and then to reset the display of labels. Submit the code and look at the results.

For Grp_Resp=0 and Grp_Amt=0, what value replaced the missing value of Donor_Age?

In [7]:
options nolabel;
proc means data=pmlr.pva_train_imputed median;
   class grp_resp grp_amt;
   var DONOR_AGE INCOME_GROUP WEALTH_RATING;
run;
options label;

| grp_resp | grp_amt | N Obs | Median DONOR_AGE | Median INCOME_GROUP | Median WEALTH_RATING |
|---|---|---|---|---|---|
| 0 | 0 | 487 | 65.0 | 4.0 | 5.0 |
| 0 | 1 | 1147 | 58.0 | 4.0 | 5.0 |
| 0 | 2 | 1612 | 58.0 | 4.0 | 6.0 |
| 1 | 0 | 671 | 65.0 | 4.0 | 4.5 |
| 1 | 1 | 1270 | 59.0 | 4.0 | 5.0 |
| 1 | 2 | 1202 | 57.0 | 4.0 | 5.0 |
| 2 | 0 | 2155 | 63.0 | 4.0 | 5.0 |
| 2 | 1 | 733 | 61.0 | 4.0 | 6.0 |
| 2 | 2 | 410 | 58.5 | 4.0 | 6.0 |


The results indicate that, for Grp_Resp=0 and Grp_Amt=0, the missing value for Donor_Age was replaced with the value 65.
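
The BY-group median replacement used in this practice is easier to see on a tiny data set. The sketch below uses a made-up data set, work.toy, with hypothetical variables grp and x; it is not part of the course files. REPONLY tells PROC STDIZE to replace missing values only, without standardizing the nonmissing ones.

```sas
/* Hypothetical toy data, already sorted by grp, with one missing x per group */
data work.toy;
   input grp x;
   datalines;
1 10
1 .
1 30
2 5
2 .
2 7
2 9
;
run;

/* Replace each missing x with the median of its own BY group */
proc stdize data=work.toy method=median reponly out=work.toy_imputed;
   by grp;
   var x;
run;

proc print data=work.toy_imputed noobs;
run;
```

In this toy data the nonmissing values of x are 10 and 30 in group 1 and 5, 7, and 9 in group 2, so the missing values should be replaced with 20 and 7 respectively, while the observed values stay unchanged.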