In [2]:
rm(list=ls())
library(MASS)
library(psych)

In [3]:
data<-read.csv("../input/caravan/Caravan.csv")
data<- subset(data, select=-1)
# Recode Purchase as 1 ("Yes") / 0 ("No"); this works whether read.csv
# returned the column as character or as factor
data$Purchase <- as.integer(data$Purchase == "Yes")
dim(data)
head(data)

In [4]:
barplot(table(data$Purchase),ylab="Frequency",xlab="Purchase")

In [5]:
#Split the data into test and train sets
set.seed(42)
work<-sample(nrow(data),1000)  # 1000 random rows are held out as the test set
test<-data[work,]
train<-data[-work,]

In [6]:
#Fit a logistic regression model using the train set 
mod_freq <- glm(Purchase ~ ., family=binomial, data=train)
summary(mod_freq)

In [35]:
#Refine the model 
mod_freq2 <- glm(Purchase ~ PPERSAUT+APLEZIER+PWAPART+MKOOPKLA+PBRAND+MOPLLAAG+MINKGEM, family=binomial, data=train)
summary(mod_freq2)

In [15]:
#Comparing model predictions to train data (refined model, the same one scored on the test set)
yhat_freq_train <- predict(mod_freq2, type="response")
plot(yhat_freq_train ~ train$Purchase)


In [16]:
#Comparing model predictions to test data 
yhat_freq_test<-predict(mod_freq2,test,type='response')
plot(yhat_freq_test ~ test$Purchase)

In [17]:
#Compare correlation between predicted purchases and actual purchases
cor(yhat_freq_train,train$Purchase)
cor(yhat_freq_test,test$Purchase)

In [18]:
#Generate a confusion table with a threshold of 0.5 (test and train rows combined)
yhat<-c(yhat_freq_test,yhat_freq_train)
dat<-rbind(test,train)

table(actual=dat$Purchase,predicted=(yhat>0.5))

In [19]:
#Calculating sensitivity and specificity
TPR<- function(y,yhat) {sum(y==1 & yhat==1)/ sum(y==1)}
TNR<- function(y,yhat) {sum(y==0 & yhat==0)/ sum(y==0)}

sens<- TPR(dat$Purchase, (yhat>0.5))
spec<- TNR(dat$Purchase, (yhat>0.5))

cat("Sensitivity and specificity of the logistic regression model are",sens,"and",spec,"respectively.\n")

The logistic regression model is accurate in predicting non-purchases (0), as it has high specificity. However, because its sensitivity is low, its predictions of purchases (1) are not reliable: the model underestimates the occurrence of purchases (1). This may be mitigated by lowering the classification threshold below 0.5, at the cost of some specificity.
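
To see how the cutoff trades sensitivity against specificity, the same two rates can be computed at several thresholds. The snippet below is only a sketch: `sens_spec_at` is a hypothetical helper that is not part of this notebook, and the toy labels and probabilities are made up for illustration, not taken from the Caravan data.

```r
# Hypothetical helper (not part of the original notebook): sensitivity and
# specificity of 0/1 labels `y` against predicted probabilities `p`,
# evaluated at each cutoff in `thresholds`.
sens_spec_at <- function(y, p, thresholds) {
  rows <- lapply(thresholds, function(th) {
    pred <- as.integer(p > th)  # predict a purchase when p exceeds the cutoff
    data.frame(
      threshold   = th,
      sensitivity = sum(y == 1 & pred == 1) / sum(y == 1),
      specificity = sum(y == 0 & pred == 0) / sum(y == 0)
    )
  })
  do.call(rbind, rows)
}

# Toy data for illustration only: 2 purchasers among 10 customers,
# with made-up predicted probabilities.
y_toy <- c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
p_toy <- c(0.02, 0.03, 0.05, 0.05, 0.08, 0.10, 0.12, 0.30, 0.15, 0.40)

res <- sens_spec_at(y_toy, p_toy, thresholds = c(0.5, 0.2, 0.1))
print(res)

# With the objects built earlier in this notebook, the call would be:
# sens_spec_at(dat$Purchase, as.vector(yhat), seq(0.05, 0.5, by = 0.05))
```

On the toy data, moving the cutoff from 0.5 to 0.1 raises sensitivity from 0 to 1 while specificity drops from 1 to 0.75. Whether a similar trade-off is acceptable for the Caravan model depends on the relative cost of missed purchasers versus false alarms.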