# Notes on the SAS Course
# Predictive Modeling Using Logistic Regression (15.1)

This course covers predictive modeling using SAS/STAT software with emphasis on the LOGISTIC procedure. It also discusses selecting variables and interactions, recoding categorical variables based on the smooth weight of evidence, assessing models, treating missing values, and using efficiency techniques for massive data sets. These notes are based on the course materials; some code and images are copyrighted by SAS Institute. I wrote them as a Jupyter notebook in JupyterLab with SAS University Edition.

In [6]:
/* Run this script to configure the session */

%let InicioCurso=/folders/myfolders/Cursos/EPMLR51/;
%include "&InicioCurso/setup.sas";

## Lesson 1: Understanding Predictive Modeling

### 1.1 Predictive Modeling Fundamentals

#### Goals of Predictive Modeling
* The main goal in predictive modeling is generalization, the ability to predict outcomes for new data.
* The aim is to maximize predictive power, as assessed by relevant metrics.
* In most business settings, the primary emphasis in predictive modeling is the empirical quality of the predictions. The secondary emphasis is on understanding the relationships between the predictor variables and the response variable.
* In predictive modeling, we focus on fitting models that empirically maximize predictive power to meet the business needs of your organization.

#### Basic Steps of Predictive Modeling
* **The first step** in predictive modeling is building a model on historic data where the outcome variable value is known. 
* When the outcome variable is known and discrete, the task is called supervised classification. The term supervised indicates that the class label is known for each case.
* The goal is to correctly classify cases into groups (or classes). The target variable is a class label. 
* When only two possible classes of outcomes exist, you refer to the target as binary. 
* A predictive model assigns a score to each case. When you have a binary target, each score measures the probability that the case belongs to a particular class. 

* **The second step** in predictive modeling is generalization. First, you build a predictive model on a subset of cases in which the target classification is known, and then you apply the predictive model to cases in which the target classification is unknown.

* The main tasks involved in the predictive modeling process for supervised classification are:
  * Prepare the input variables: handle missing values, deal with categorical inputs, remove redundancy, and perform variable screening.
  * Select the most predictive inputs and start fitting models.
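
The two steps described above (build on known cases, then apply to unknown cases) can be sketched with PROC LOGISTIC. This is only a minimal illustration, not the course's own program: it assumes the `pmlr` libref has been assigned by the setup script, the three inputs are an arbitrary subset chosen for brevity, and `work.new_cases` is a hypothetical data set that has the same inputs but no known target.

```sas
/* Step 1: fit the model on historical cases where the target (ins) is known. */
/* The inputs dda, ddabal, and dep are an arbitrary subset for illustration.  */
proc logistic data=pmlr.develop;
   model ins(event='1')=dda ddabal dep;
   /* Step 2: score new cases. work.new_cases is a hypothetical data set */
   /* with the same inputs but an unknown target.                        */
   score data=work.new_cases out=work.scored;
run;
```

In the scored output, the variable `P_1` should hold the predicted probability that each case belongs to the class `ins=1`.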

#### Applications of Predictive Modeling
A few typical examples are:

* target marketing.
* attrition prediction.
* credit scoring.
* fraud detection. 

Supervised classification also has less business-oriented uses. Image classification has applications in areas such as astronomy, nuclear medicine, and molecular genetics (McLachlan 1992; Ripley 1996; Hand 1997).

#### Activity: Exploring the Bank Data for the Target Marketing Project

In [7]:
data work.develop;
   set pmlr.develop;
run;

%global inputs;
%let inputs=ACCTAGE DDA DDABAL DEP DEPAMT CASHBK 
            CHECKS DIRDEP NSF NSFAMT PHONE TELLER 
            SAV SAVBAL ATM ATMAMT POS POSAMT CD 
            CDBAL IRA IRABAL LOC LOCBAL INV 
            INVBAL ILS ILSBAL MM MMBAL MMCRED MTG 
            MTGBAL CC CCBAL CCPURC SDB INCOME 
            HMOWN LORES HMVAL AGE CRSCORE MOVED 
            INAREA;

proc means data=work.develop n nmiss mean min max;
   var &inputs;
run;

proc freq data=work.develop;
   tables ins branch res;
run;

Variable,Label,N,N Miss,Mean,Minimum,Maximum
AcctAge,Age of Oldest Account,30194,2070,5.9086772,0.3000000,61.5000000
DDA,Checking Account,32264,0,0.8156459,0,1.0000000
DDABal,Checking Balance,32264,0,2170.02,-774.8300000,278093.83
Dep,Checking Deposits,32264,0,2.1346082,0,28.0000000
DepAmt,Amount Deposited,32264,0,2232.76,0,484893.67
CashBk,Number Cash Back,32264,0,0.0159621,0,4.0000000
Checks,Number of Checks,32264,0,4.2599182,0,49.0000000
DirDep,Direct Deposit,32264,0,0.2955616,0,1.0000000
NSF,Number Insufficient Fund,32264,0,0.0870630,0,1.0000000
NSFAmt,Amount NSF,32264,0,2.2905464,0,666.8500000
Phone,Number Telephone Banking,28131,4133,0.4056024,0,30.0000000
Teller,Teller Visits,32264,0,1.3652678,0,27.0000000
Sav,Saving Account,32264,0,0.4668981,0,1.0000000
SavBal,Saving Balance,32264,0,3170.60,0,700026.94
ATM,ATM,32264,0,0.6099368,0,1.0000000
ATMAmt,ATM Withdrawal Amount,32264,0,1235.41,0,427731.26
POS,Number Point of Sale,28131,4133,1.0756816,0,54.0000000
POSAmt,Amount Point of Sale,28131,4133,48.9261782,0,3293.49
CD,Certificate of Deposit,32264,0,0.1258368,0,1.0000000
CDBal,CD Balance,32264,0,2530.71,0,1053900.00
IRA,Retirement Account,32264,0,0.0532792,0,1.0000000
IRABal,IRA Balance,32264,0,617.5704550,0,596497.60
LOC,Line of Credit,32264,0,0.0633833,0,1.0000000
LOCBal,Line of Credit Balance,32264,0,1175.22,-613.0000000,523147.24
Inv,Investment,28131,4133,0.0296826,0,1.0000000
InvBal,Investment Balance,28131,4133,1599.17,-2214.92,8323796.02
ILS,Installment Loan,32264,0,0.0495909,0,1.0000000
ILSBal,Loan Balance,32264,0,517.5692344,0,29162.79
MM,Money Market,32264,0,0.1148959,0,1.0000000
MMBal,Money Market Balance,32264,0,1875.76,0,120801.11
MMCred,Money Market Credits,32264,0,0.0563786,0,5.0000000
MTG,Mortgage,32264,0,0.0493429,0,1.0000000
MTGBal,Mortgage Balance,32264,0,8081.74,0,10887573.28
CC,Credit Card,28131,4133,0.4830969,0,1.0000000
CCBal,Credit Card Balance,28131,4133,9586.55,-2060.51,10641354.78
CCPurc,Credit Card Purchases,28131,4133,0.1541716,0,5.0000000
SDB,Safety Deposit Box,32264,0,0.1086660,0,1.0000000
Income,Income,26482,5782,40.5889283,0,233.0000000
HMOwn,Owns Home,26731,5533,0.5418802,0,1.0000000
LORes,Length of Residence,26482,5782,7.0056642,0.5000000,19.5000000
HMVal,Home Value,26482,5782,110.9121290,67.0000000,754.0000000
Age,Age,25907,6357,47.9283205,16.0000000,94.0000000
CRScore,Credit Score,31557,707,666.4935197,509.0000000,820.0000000
Moved,Recent Address Change,32264,0,0.0296305,0,1.0000000
InArea,Local Address,32264,0,0.9602963,0,1.0000000

Ins,Frequency,Percent,Cumulative Frequency,Cumulative Percent
0,21089,65.36,21089,65.36
1,11175,34.64,32264,100.0

Branch of Bank
Branch,Frequency,Percent,Cumulative Frequency,Cumulative Percent
B1,2819,8.74,2819,8.74
B10,273,0.85,3092,9.58
B11,247,0.77,3339,10.35
B12,549,1.7,3888,12.05
B13,535,1.66,4423,13.71
B14,1072,3.32,5495,17.03
B15,2235,6.93,7730,23.96
B16,1534,4.75,9264,28.71
B17,850,2.63,10114,31.35
B18,541,1.68,10655,33.02

Area Classification
Res,Frequency,Percent,Cumulative Frequency,Cumulative Percent
R,8077,25.03,8077,25.03
S,11506,35.66,19583,60.7
U,12681,39.3,32264,100.0


**Note:** To build the develop data set, the bank included all cases that have an Ins variable value of 1 and a representative sample of cases that have an Ins variable value of 0. This oversampling of the events increases the efficiency of the analysis because you are using a smaller sample and therefore have fewer cases to process. However, this oversampling also biases the results. You learn more about oversampling events, and how to adjust the model for it, later in the course. 

#### Practice: Exploring the Veterans' Organization Data Used in the Practices

In [8]:
data pmlr.pva(drop=control_number 
                   MONTHS_SINCE_LAST_PROM_RESP 
                   FILE_AVG_GIFT 
                   FILE_CARD_GIFT);
   set pmlr.pva_raw_data;
   STATUS_FL=RECENCY_STATUS_96NK in("F","L");
   STATUS_ES=RECENCY_STATUS_96NK in("E","S");
   home01=(HOME_OWNER="H");
   nses1=(SES="1");
   nses3=(SES="3");
   nses4=(SES="4");
   nses_=(SES="?");
   nurbr=(URBANICITY="R");
   nurbu=(URBANICITY="U");
   nurbs=(URBANICITY="S");
   nurbt=(URBANICITY="T");
   nurb_=(URBANICITY="?");
run;

2. To examine the contents of pmlr.pva, write a PROC CONTENTS step, submit it, and review the results. How many character variables are in the data set?

In [12]:
proc contents data=pmlr.pva;
run;

Data Set Name,PMLR.PVA,Observations,19372
Member Type,DATA,Variables,58
Engine,V9,Indexes,0
Created,05/25/2020 01:52:22,Observation Length,432
Last Modified,05/25/2020 01:52:22,Deleted Observations,0
Protection,,Compressed,NO
Data Set Type,,Sorted,NO
Label,,,
Data Representation,"SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64",,
Encoding,utf-8 Unicode (UTF-8),,

Engine/Host Dependent Information
Data Set Page Size,65536
Number of Data Set Pages,129
First Data Page,1
Max Obs per Page,151
Obs in First Data Page,129
Number of Data Set Repairs,0
Filename,/folders/myfolders/Cursos/EPMLR51/data/pva.sas7bdat
Release Created,9.0401M6
Host Created,Linux
Inode Number,32552

Alphabetic List of Variables and Attributes
#,Variable,Type,Len
42,CARD_PROM_12,Num,8
8,CLUSTER_CODE,Char,2
4,DONOR_AGE,Num,8
10,DONOR_GENDER,Char,3
26,FREQUENCY_STATUS_97NK,Num,8
9,HOME_OWNER,Char,3
11,INCOME_GROUP,Num,8
5,IN_HOUSE,Num,8
41,LAST_GIFT_AMT,Num,8
37,LIFETIME_AVG_GIFT_AMT,Num,8


3. Write a PROC MEANS step that generates the following descriptive statistics for the numeric variables in the pmlr.pva data set: mean, number of missing values, maximum value, and minimum value. To specify only the numeric variables in the input data set, use the special SAS name list _NUMERIC_ in the VAR statement.

In [13]:
proc means data=pmlr.pva mean nmiss max min;
   var _numeric_;
run;

Variable,Mean,N Miss,Maximum,Minimum
TARGET_B,0.2500000,0,1.0000000,0
TARGET_D,15.6243444,14529,200.0000000,1.0000000
MONTHS_SINCE_ORIGIN,73.4099732,0,137.0000000,5.0000000
DONOR_AGE,58.9190506,4795,87.0000000,0
IN_HOUSE,0.0731984,0,1.0000000,0
INCOME_GROUP,3.9075434,4392,7.0000000,1.0000000
PUBLISHED_PHONE,0.4977287,0,1.0000000,0
MOR_HIT_RATE,3.3616560,0,241.0000000,0
WEALTH_RATING,5.0053967,8810,9.0000000,0
MEDIAN_HOME_VALUE,1079.87,0,6000.00,0
MEDIAN_HOUSEHOLD_INCOME,341.9702147,0,1500.00,0
PCT_OWNER_OCCUPIED,69.6989986,0,99.0000000,0
PCT_MALE_MILITARY,1.0290109,0,97.0000000,0
PCT_MALE_VETERANS,30.5739211,0,99.0000000,0
PCT_VIETNAM_VETERANS,29.6032934,0,99.0000000,0
PCT_WWII_VETERANS,32.8524675,0,99.0000000,0
PEP_STAR,0.5044394,0,1.0000000,0
RECENT_STAR_STATUS,0.9311377,0,22.0000000,0
FREQUENCY_STATUS_97NK,1.9839975,0,4.0000000,1.0000000
RECENT_RESPONSE_PROP,0.1901275,0,1.0000000,0
RECENT_AVG_GIFT_AMT,15.3653959,0,260.0000000,0
RECENT_CARD_RESPONSE_PROP,0.2308077,0,1.0000000,0
RECENT_AVG_CARD_GIFT_AMT,11.6854703,0,300.0000000,0
RECENT_RESPONSE_COUNT,3.0431034,0,16.0000000,0
RECENT_CARD_RESPONSE_COUNT,1.7305389,0,9.0000000,0
LIFETIME_CARD_PROM,18.6680776,0,56.0000000,2.0000000
LIFETIME_PROM,47.5705141,0,194.0000000,5.0000000
LIFETIME_GIFT_AMOUNT,104.4257165,0,3775.00,15.0000000
LIFETIME_GIFT_COUNT,9.9797646,0,95.0000000,1.0000000
LIFETIME_AVG_GIFT_AMT,12.8583383,0,450.0000000,1.3600000
LIFETIME_GIFT_RANGE,11.5878758,0,997.0000000,0
LIFETIME_MAX_GIFT_AMT,19.2088081,0,1000.00,5.0000000
LIFETIME_MIN_GIFT_AMT,7.6209323,0,450.0000000,0
LAST_GIFT_AMT,16.5841988,0,450.0000000,0
CARD_PROM_12,5.3671278,0,17.0000000,0
NUMBER_PROM_12,12.9018687,0,64.0000000,2.0000000
MONTHS_SINCE_LAST_GIFT,18.1911522,0,27.0000000,4.0000000
MONTHS_SINCE_FIRST_GIFT,69.4820875,0,260.0000000,15.0000000
PER_CAPITA_INCOME,15857.33,0,174523.00,0
STATUS_FL,0.0833161,0,1.0000000,0
STATUS_ES,0.2399339,0,1.0000000,0
home01,0.5474912,0,1.0000000,0
nses1,0.3058022,0,1.0000000,0
nses3,0.1715362,0,1.0000000,0
nses4,0.0199773,0,1.0000000,0
nses_,0.0234359,0,1.0000000,0
nurbr,0.2067417,0,1.0000000,0
nurbu,0.1267809,0,1.0000000,0
nurbs,0.2318294,0,1.0000000,0
nurbt,0.2035928,0,1.0000000,0
nurb_,0.0234359,0,1.0000000,0


Is the proportion of events in the sample equal to the proportion of events in the population? (Hint: To find the proportion of events in the population, read the description of pmlr.pva_raw_data.) **No. The mean of TARGET_B shows that the proportion of events in the sample is 0.25, whereas in the population it is 0.05 (see the description of pmlr.pva_raw_data).**


What is the average number of months since the last gift to the organization? **18.2**


How many numeric variables have missing values? **4**
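
Rather than scanning the N Miss column by eye, the variables with missing values can be listed directly. The sketch below assumes a SAS release that supports the `STACKODSOUTPUT` option of PROC MEANS (9.3 or later); `work.nmiss_summary` is a hypothetical name for the captured output table.

```sas
/* Capture one row per numeric variable with its count of missing values. */
proc means data=pmlr.pva nmiss stackodsoutput;
   var _numeric_;
   ods output summary=work.nmiss_summary;
run;

/* Keep only the variables that have at least one missing value. */
proc print data=work.nmiss_summary;
   where nmiss > 0;
run;
```

For pmlr.pva this listing should contain the four variables whose N Miss value in the table above is greater than zero.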

4. Write a PROC FREQ step that generates frequency tables for the character variables in the pmlr.pva data set. To specify only the character variables in the input data set, add the special SAS name list _CHARACTER_ to the TABLES statement. To display the Number of Variable Levels table for each variable specified in the TABLES statement, include the NLEVELS option in the PROC FREQ statement.

In [14]:
proc freq data=pmlr.pva nlevels;
   tables _character_;
run;

Number of Variable Levels
Variable,Levels
URBANICITY,6
SES,5
CLUSTER_CODE,54
HOME_OWNER,2
DONOR_GENDER,4
OVERLAY_SOURCE,4
RECENCY_STATUS_96NK,6

URBANICITY,Frequency,Percent,Cumulative Frequency,Cumulative Percent
?,454,2.34,454,2.34
C,4022,20.76,4476,23.11
R,4005,20.67,8481,43.78
S,4491,23.18,12972,66.96
T,3944,20.36,16916,87.32
U,2456,12.68,19372,100.0

SES,Frequency,Percent,Cumulative Frequency,Cumulative Percent
1,5924,30.58,5924,30.58
2,9284,47.92,15208,78.51
3,3323,17.15,18531,95.66
4,387,2.0,18918,97.66
?,454,2.34,19372,100.0

CLUSTER_CODE,Frequency,Percent,Cumulative Frequency,Cumulative Percent
.,454,2.34,454,2.34
01,239,1.23,693,3.58
02,380,1.96,1073,5.54
03,300,1.55,1373,7.09
04,113,0.58,1486,7.67
05,199,1.03,1685,8.7
06,123,0.63,1808,9.33
07,184,0.95,1992,10.28
08,378,1.95,2370,12.23
09,153,0.79,2523,13.02

HOME_OWNER,Frequency,Percent,Cumulative Frequency,Cumulative Percent
H,10606,54.75,10606,54.75
U,8766,45.25,19372,100.0

DONOR_GENDER,Frequency,Percent,Cumulative Frequency,Cumulative Percent
A,1,0.01,1,0.01
F,10401,53.69,10402,53.7
M,7953,41.05,18355,94.75
U,1017,5.25,19372,100.0

OVERLAY_SOURCE,Frequency,Percent,Cumulative Frequency,Cumulative Percent
B,8732,45.08,8732,45.08
M,1480,7.64,10212,52.72
N,4392,22.67,14604,75.39
P,4768,24.61,19372,100.0

RECENCY_STATUS_96NK,Frequency,Percent,Cumulative Frequency,Cumulative Percent
A,11918,61.52,11918,61.52
E,427,2.2,12345,63.73
F,1521,7.85,13866,71.58
L,93,0.48,13959,72.06
N,1192,6.15,15151,78.21
S,4221,21.79,19372,100.0


5. Which character variable has the highest number of levels? **CLUSTER_CODE (54 levels)**
6. How many dummy variables would need to be created for the character variable that has the highest number of levels? **53 (54 - 1)**
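
The count of 53 can be checked by letting PROC LOGISTIC build the design variables itself. This is a sketch for illustration only, not a model recommended by the course: with `PARAM=REF` in the CLASS statement, a variable with k levels is coded with k - 1 design variables, and one level serves as the reference.

```sas
/* Reference-cell coding: CLUSTER_CODE (54 levels) gets 54 - 1 = 53 design variables. */
proc logistic data=pmlr.pva;
   class cluster_code / param=ref;
   model target_b(event='1')=cluster_code;
run;
```

The degrees of freedom reported for CLUSTER_CODE in the Type 3 analysis of effects should equal the number of design variables.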

### 1.2 Predictive Modeling Challenges
As a predictive modeler, you face a number of challenges. Some of your challenges are due to problems with the data. Other challenges are related to the analytic process of predictive modeling. Fortunately, techniques are available to overcome these challenges.

#### Data Challenges

The main data challenges are observational data, mixed measurement scales, high dimensionality, and rare target events.

The data used for predictive modeling is typically **observational**. (It can also be called operational or opportunistic.) In other words, the data was collected for operational purposes (such as tax or accounting purposes) unrelated to statistical analysis (Huber 1997). Observational data is frequently massive, and often consists of millions of cases. You might have hundreds of input variables. Many are redundant or irrelevant. Observational data sets typically contain errors and missing values, so preparing data for predictive modeling is often difficult. 


When your data includes large numbers of input variables, you usually have to handle **mixed measurement scales**. Some input variables might be interval, as in amounts. Others might be nominal, as in class names; ordinal, as in grades; or counts, as in number of boxes. Because regression analysis requires numeric inputs, you need to convert the levels of each nominal input variable to numeric dummy variables, also known as design variables. When a variable has many levels, a better solution is to collapse the levels.
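
As a small illustration of dummy coding, the nominal input `res` in the bank data has three levels (R, S, U), so two dummy variables are enough. The sketch below assumes `work.develop` from the earlier activity; the names `work.develop_dummies`, `resr`, and `resu`, and the choice of S as the reference level, are arbitrary.

```sas
/* A comparison in parentheses evaluates to 1 when true and 0 when false. */
data work.develop_dummies;
   set work.develop;
   resr=(res="R");   /* 1 for area class R, 0 otherwise */
   resu=(res="U");   /* 1 for area class U, 0 otherwise */
   /* Cases with res="S" have resr=0 and resu=0 (reference level). */
run;
```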

**High dimensionality**. The dimension refers to the number of input variables (more precisely, input degrees of freedom). Predictive modelers consider large numbers of input variables, typically hundreds, and a high number of input variables gives you high dimensionality. You might think, "Isn't it good to have a lot of input variables?" Not necessarily: as the dimension grows, the data become sparse in the input space, which makes a model harder to fit reliably. The remedy for high dimensionality is dimension reduction: you ignore irrelevant and redundant variables without throwing out important ones.
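
One possible way to look for redundant inputs is variable clustering with PROC VARCLUS. The text above does not name a specific technique, so this is only a sketch of one option: it assumes `work.develop` and the `&inputs` macro variable from the earlier activity, and the `MAXEIGEN=0.7` threshold is an arbitrary illustrative value.

```sas
/* Group correlated inputs into clusters; one representative per cluster */
/* can then be kept. Rows with missing values are not used.              */
proc varclus data=work.develop maxeigen=0.7 short hi;
   var &inputs;
run;
```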

**Rare target events**. The event of interest is often relatively rare compared to the non-event. For example, if the event represents less than 1% of the data set, it might be considered rare. When the target event is rare, a representative sample is unlikely to have enough target events to build a good predictive model. With rare events, the effective sample size for building a reliable prediction model is **closer to three times the number of event cases than to the nominal size of the data set (Harrell 1997)**. So a smaller sample data set could have the predictive potential of a massive data set. This means that it is possible to use a sample of the original data set to obtain (on average) a model of similar predictive power. However, this sample should be a non-representative sample.

One widespread strategy for predicting rare events is to build a model on a sample that disproportionately over-represents the event cases (for example, an equal number of events and non-events). This is typically done when the original data set is very large and the ratio of events to non-events is very small. This type of sampling is typically called oversampling or separate sampling and the sample is called a biased sample. 
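
A biased sample of this kind can be drawn with a simple DATA step. This is a minimal sketch under stated assumptions: `work.population` is a hypothetical large data set with a binary target named `event`, and the 5% retention rate for non-events and the seed are arbitrary.

```sas
data work.biased_sample;
   set work.population;                      /* hypothetical full data set */
   if _n_=1 then call streaminit(27513);     /* fixed seed for a reproducible draw */
   /* Keep every event case and roughly 5% of the non-event cases. */
   if event=1 or rand('uniform') < 0.05;
run;
```

Because the events are over-represented, the proportion of events in `work.biased_sample` no longer matches the population, which is the bias that has to be adjusted for later.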