## Predicting Purchase of an Insurance Policy

This dataset contains 5,822 people and 85 predictors. Only 6% of the people purchase a policy.

In [3]:
#Import Intro to Stat Learning Library
library(ISLR)

"package 'ISLR' was built under R version 3.6.1"

In [4]:
#Shape of data
dim(Caravan)

In [6]:
#Attach Caravan so columns such as Purchase can be referenced by name
attach(Caravan)
#summary of purchase column
summary(Purchase)

In [7]:
#About 6% of people purchased insurance: 348 of 5,822
348/5822
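
The same proportion can be checked from the class counts alone. This is a minimal sketch, assuming 5,474 "No" and 348 "Yes" responses (the `summary(Purchase)` output is not shown here, and the name `purchase_counts` is illustrative):

```r
# Class counts for Purchase; together they cover all 5,822 rows
purchase_counts <- c(No = 5474, Yes = 348)

# Share of each class; the "Yes" share is the purchase rate
purchase_share <- prop.table(purchase_counts)
print(round(purchase_share, 3))
```

With real data loaded, `prop.table(table(Purchase))` should give the same shares without typing the counts by hand.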

<b>Scale Variables</b>

In [11]:
#Standardize variables to have a Standard Deviation of 1 and Mean of 0
#Exclude column 86 from standard scaler since it is "Purchase" variable
standardized.X=scale(Caravan[,-86])
var(Caravan[,1])

In [12]:
var(Caravan[,2])

In [13]:
var(standardized.X[,1])

In [14]:
var(standardized.X[,2])
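
To illustrate what `scale()` does, here is a small sketch on made-up data (the `salary` and `age` columns are hypothetical, not Caravan variables). Before scaling, the larger-valued column has a far larger variance and would dominate the distances KNN relies on; after scaling, every column has mean 0 and variance 1.

```r
# Hypothetical toy data: two predictors on very different scales
toy <- cbind(salary = c(30000, 45000, 52000, 61000, 80000),
             age    = c(25, 31, 38, 44, 59))

# Variances before scaling differ by orders of magnitude
raw_var <- apply(toy, 2, var)

# scale() centers each column to mean 0 and divides by its standard deviation
toy_scaled  <- scale(toy)
scaled_mean <- colMeans(toy_scaled)
scaled_var  <- apply(toy_scaled, 2, var)

print(raw_var)
print(round(scaled_var, 6))
```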

In [15]:
#Use the first 1,000 observations as the test set
test=1:1000

In [16]:
#Place remaining observations into the training set
train.X=standardized.X[-test,]

In [17]:
#Rows 1 to 1,000 of the standardized predictors form the test set
test.X=standardized.X[test,]
#Negative indexing drops rows 1 to 1,000, leaving the training responses
train.Y=Purchase[-test]
test.Y=Purchase[test]
#Set seed so that ties among nearest neighbors are broken reproducibly
set.seed(1)
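
The positive and negative indexing used for the split can be sketched on a small matrix. The names `toy_X` and `toy_test` are hypothetical stand-ins for `standardized.X` and `test`:

```r
# Hypothetical 10-row matrix standing in for standardized.X
toy_X    <- matrix(1:20, nrow = 10)
toy_test <- 1:3                      # indices of the held-out rows

toy_test_X  <- toy_X[toy_test, ]     # keeps rows 1 to 3
toy_train_X <- toy_X[-toy_test, ]    # keeps every row except 1 to 3

print(nrow(toy_test_X))
print(nrow(toy_train_X))
```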

In [20]:
#Import library for Classification
library(class)

In [22]:
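#Fit K Nearest Neighbors model with K=1
knn.pred=knn(train.X,test.X,train.Y,k=1)
#Test error rate of the KNN predictions
mean(test.Y!=knn.pred)
#Error rate of always predicting "No" (share of actual purchasers)
mean(test.Y!="No")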

<b>KNN error rate on the 1,000 test observations = 11.4%</b>

However, always predicting "No" would give an error rate of only about 6%, because just 5.9% of the test customers purchase.

In [24]:
#Confusion Matrix
table(knn.pred,test.Y)

        test.Y
knn.pred  No Yes
     No  877  50
     Yes  64   9

In [27]:
9/(64+9)

Among the 73 customers KNN predicts will purchase, 9 (12.3%) actually do. That is about double the 6% rate expected from random guessing.
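
The quantities discussed for K=1 can all be read off the confusion matrix. The sketch below rebuilds the matrix from the counts printed above (rather than rerunning `knn()`) and computes the overall error rate, the error of always predicting "No", and the share of predicted purchasers who actually buy. The variable names are illustrative.

```r
# Confusion matrix reported above for K = 1 (rows: predicted, columns: actual)
cm <- matrix(c(877, 64, 50, 9), nrow = 2,
             dimnames = list(knn.pred = c("No", "Yes"),
                             test.Y   = c("No", "Yes")))

# Overall error rate: off-diagonal counts over the total
error_rate <- 1 - sum(diag(cm)) / sum(cm)

# Error rate of always predicting "No": share of actual "Yes" cases
baseline_error <- sum(cm[, "Yes"]) / sum(cm)

# Among customers predicted "Yes", the fraction who actually purchase
precision_yes <- cm["Yes", "Yes"] / sum(cm["Yes", ])

print(c(error_rate = error_rate,
        baseline_error = baseline_error,
        precision_yes = precision_yes))
```

The overall error rate is worse than the trivial baseline, yet the fraction of correct "Yes" predictions is the number that matters if the goal is to target likely buyers.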

In [26]:
#New model for k=2
knn.pred=knn(train.X,test.X,train.Y,k=2)
table(knn.pred,test.Y)

        test.Y
knn.pred  No Yes
     No  885  51
     Yes  56   8

In [28]:
8/(56+8)

With K=2, 12.5% of the customers predicted to purchase actually do (8 of 64), a slight improvement over K=1.

In [29]:
#New model for k=3
knn.pred=knn(train.X,test.X,train.Y,k=3)
table(knn.pred,test.Y)

        test.Y
knn.pred  No Yes
     No  920  54
     Yes  21   5

In [30]:
5/(21+5)

With K=3, 19.2% of the customers predicted to purchase actually do (5 of 26).

In [32]:
#New model for k=4
knn.pred=knn(train.X,test.X,train.Y,k=4)
table(knn.pred,test.Y)

        test.Y
knn.pred  No Yes
     No  920  52
     Yes  21   7

In [33]:
7/(21+7)

With K=4, 25% of the customers predicted to purchase actually do (7 of 28).

In [35]:
#New model for k=5
knn.pred=knn(train.X,test.X,train.Y,k=5)
table(knn.pred,test.Y)

        test.Y
knn.pred  No Yes
     No  930  55
     Yes  11   4

In [36]:
4/(11+4)

<b>With K=5, 26.7% of the customers predicted to purchase actually do (4 of 15).</b>
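
The five results can be lined up in one table. This sketch uses only the counts printed in the confusion matrices above (it does not refit the models), and the column names are illustrative:

```r
# Counts read off the confusion matrices above, for K = 1 to 5
k         <- 1:5
true_yes  <- c(9, 8, 5, 7, 4)        # predicted Yes, actually Yes
false_yes <- c(64, 56, 21, 21, 11)   # predicted Yes, actually No

precision_by_k <- data.frame(
  k             = k,
  predicted_yes = true_yes + false_yes,
  precision     = round(true_yes / (true_yes + false_yes), 3)
)
print(precision_by_k)
```

The table makes the trade-off visible: the hit rate among predicted purchasers rises with K, while the number of customers flagged falls from 73 to 15.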

## Logistic Regression Model

In [37]:
glm.fits=glm(Purchase~.,data=Caravan,family=binomial, subset=-test)

"glm.fit: fitted probabilities numerically 0 or 1 occurred"

In [38]:
#Use a 0.5 predicted-probability cut-off for the classifier
glm.probs=predict(glm.fits,Caravan[test,],type="response")
glm.pred=rep("No",1000)
glm.pred[glm.probs>.5]="Yes"
table(glm.pred,test.Y)

        test.Y
glm.pred  No Yes
     No  934  59
     Yes   7   0

Only 7 people are predicted to purchase insurance, and all 7 predictions are wrong: none of them actually purchased.

In [39]:
#Lower the cut-off to 0.25 so that more customers are predicted to purchase
glm.pred=rep("No",1000)
glm.pred[glm.probs>.25]="Yes"
table(glm.pred,test.Y)

        test.Y
glm.pred  No Yes
     No  919  48
     Yes  22  11

In [40]:
11/(22+11)

<b>Lowering the predicted-probability cut-off to 0.25 yields 33 predicted purchasers, of whom 11 (33%) actually buy. This is over five times better than random guessing.</b>
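
The effect of the cut-off can be sketched with a small helper. `classify_at` and `toy_probs` are hypothetical names, and the probabilities are made up for illustration rather than taken from the fitted model. The final lines compute the improvement over random guessing from the counts reported above.

```r
# Hypothetical helper: turn predicted probabilities into class labels
classify_at <- function(probs, cutoff) {
  ifelse(probs > cutoff, "Yes", "No")
}

# Made-up probabilities for illustration only (not model output)
toy_probs <- c(0.05, 0.30, 0.10, 0.60, 0.27)
print(classify_at(toy_probs, 0.5))    # one "Yes" at the stricter cut-off
print(classify_at(toy_probs, 0.25))   # three "Yes" at the looser cut-off

# Counts reported above: 11 of 33 predicted purchasers buy at cut-off 0.25,
# against 59 purchasers among the 1,000 test customers
precision_025 <- 11 / (22 + 11)
base_rate     <- 59 / 1000
lift          <- precision_025 / base_rate
print(round(lift, 2))
```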