In [222]:
#load ISLR, which provides the Default dataset
library(ISLR)

#creates an object for the dataset
Default <- data.frame(Default)

#print names of the columns
names(Default)

## Logistic Regression 
![](logisticR.png)
The picture above simplifies the ML strategy of logistic regression. If we want to predict a variable Y, we give a logistic regression model the inputs $X_1$, $X_2$, ..., $X_n$, and the model fits an equation for the probability that Y happens. It works best when Y is a yes-or-no answer, but it can be extended to non-binary responses. We will first show how to create this model in R and then, in the next section, test it using a common ML method.
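
For reference, the equation that the model fits can be written in the standard textbook form below. The coefficients $\beta_0, \beta_1, \ldots, \beta_n$ are notation introduced here for illustration and do not appear in the original text; they are the quantities that `glm` estimates.

$$
P(Y = 1 \mid X_1, \ldots, X_n) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n}}
$$

Equivalently, the log-odds are linear in the inputs:

$$
\log\frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)} = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n
$$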

In [223]:
#fits a logistic regression of default on income and balance, using all the data
glm.fit <- glm(default ~ income + balance, data = Default, family = binomial)

In [224]:
#summary of logistic regression for all data
summary(glm.fit)


Call:
glm(formula = default ~ income + balance, family = binomial, 
    data = Default)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4725  -0.1444  -0.0574  -0.0211   3.7245  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1579.0  on 9997  degrees of freedom
AIC: 1585

Number of Fisher Scoring iterations: 8
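
To make the coefficient table more concrete, here is a minimal sketch of how the estimates turn into a predicted probability. The coefficients are copied (rounded) from the summary output above; the person's `income` and `balance` values are hypothetical, chosen only for illustration.

```r
# Coefficient estimates copied (rounded) from the summary output above
b0        <- -1.154e+01  # intercept
b_income  <-  2.081e-05  # change in log-odds per unit of income
b_balance <-  5.647e-03  # change in log-odds per unit of balance

# Hypothetical person, chosen only for illustration
income  <- 40000
balance <- 1500

# Linear predictor (log-odds), then the logistic function to get a probability
log_odds <- b0 + b_income * income + b_balance * balance
prob     <- plogis(log_odds)  # same as 1 / (1 + exp(-log_odds))

# Multiplicative change in the odds of default for a 100-unit increase
# in balance, holding income fixed
or_100 <- exp(b_balance * 100)

print(c(log_odds = log_odds, probability = prob, odds_ratio_100 = or_100))
```

With the fitted object available, `predict(glm.fit, newdata = data.frame(income = 40000, balance = 1500), type = "response")` should give nearly the same probability, differing only through the rounding of the coefficients.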


### Now we will use the validation set approach to test a logistic model. To do this, we split our dataset into two halves: one for training and the other for validation. We fit the model on the training data and then use it to predict whether each person in the validation set defaults. If the predicted probability of default is higher than 50%, we predict that the person defaults. Comparing those predictions with the actual default data shows how often the prediction was wrong; that fraction is our validation error rate.

In [225]:
#randomly selects 50% of the rows as training data (no seed is set, so the split changes on every run)
train = sample(nrow(Default), nrow(Default)/2)
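
Because the split is random, the validation error rate will differ slightly from run to run. A minimal sketch of one way to make the split reproducible is shown here; the seed value and the names `n`, `train_idx`, and `train_again` are assumptions for illustration, and fixing a seed will not necessarily reproduce the error rate printed in this notebook.

```r
# Arbitrary seed, an assumption for illustration
set.seed(1)

# Number of rows in Default (10,000, matching the degrees of freedom in the summary)
n <- 10000

# Hypothetical name; plays the role of `train` in the notebook
train_idx <- sample(n, n / 2)

# Re-running with the same seed reproduces exactly the same split
set.seed(1)
train_again <- sample(n, n / 2)

print(identical(train_idx, train_again))
```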

In [226]:
#creates fit using training data
glm.fit <- glm(default ~ income + balance , data=Default ,family=binomial, subset = train)

In [227]:
#creates validation set
Validation = Default[-train,]

#predicts the probability of default for each person in the validation set using the logistic model
probs = predict(glm.fit, Validation, type = "response")

#makes a vector of "No" predictions, one per row of the validation set
glm.pred = rep("No", nrow(Validation))

#if the probability of defaulting is higher than 50% then we predict that the person defaults
glm.pred[probs > 0.5] = "Yes"

In [228]:
#prints the validation error rate: the number of incorrect predictions divided by the total number of predictions
cat("Validation error rate: ", mean(glm.pred != Validation$default)*100, "%")

Validation error rate:  2.4 %

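### We just used supervised ML to predict whether a person defaults based on their income and balance. The validation error rate was low, but it needs context: only about 3.3% of the people in `Default` actually default, so always predicting "No" would already give an error rate near 3.3%. The model improves on that baseline, and the exact rate will vary with the random split.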
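
One way to put an error rate in context is to compare it with a trivial classifier that always predicts "No", and to look at a confusion matrix. The sketch below uses small toy vectors that are made up purely for illustration (they are not taken from `Default`); the names `actual`, `pred`, `model_error`, `baseline_error`, and `conf_mat` are hypothetical.

```r
# Toy vectors, made up purely for illustration (not from Default)
actual <- factor(c("No", "No", "No", "No", "No", "No", "No", "Yes", "Yes", "No"),
                 levels = c("No", "Yes"))
pred   <- c("No", "No", "No", "No", "No", "No", "No", "Yes", "No", "No")

# Error rate of the predictions, computed the same way as in the notebook
model_error <- mean(pred != actual)

# Error rate of a trivial classifier that always predicts "No"
baseline_error <- mean(actual != "No")

# Confusion matrix: rows are predictions, columns are the true labels
conf_mat <- table(predicted = factor(pred, levels = c("No", "Yes")),
                  actual = actual)

print(conf_mat)
cat("Model error:", model_error, " Baseline error:", baseline_error, "\n")
```

In the notebook, `pred` would be replaced by `glm.pred` and `actual` by `Validation$default`.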