### Logistic Regression

In this exercise, we will use the white wine quality data set to build a model that predicts the quality of a white wine from the available variables. Let's read the data from 'wine quality/winequality-white.csv'.

In [None]:
wq <- read.csv("../../../datasets/wine quality/winequality-white.csv",header=TRUE)
head(wq)
str(wq)

Let's look at the distribution of the quality variable. 

In [None]:
# distribution of quality variable
table(wq$quality)
# or we can do the same like this (a bit fancier)
library(plyr)
freq = count(wq,'quality')
freq

As we can see, the value 6 for quality dominates the distribution; let's remove that value and label the rest as 'good' or 'bad' to create a binary variable for quality. If the quality is larger than 6, we'll call it 'good' wine, otherwise 'bad' wine. 

In [None]:
# remove 6: create subset that has quality values larger or smaller than (but not equal to) 6.
wq_sub <- subset(wq, <what goes in here>)

# Now create a new column named 'good' with initially all zeros. 
wq_sub$good <- 0

# assign 1 to good if quality is larger than 6 (same as larger than 5 here, since the 6s were removed)
wq_sub$good[<what goes in here>] <- 1

# Now remove the 'quality' column; we don't want that in the model any more.
wq_sub$quality <- NULL
str(wq_sub)
table(wq_sub$good)

So there are now 1640 'bad' white wines and 1060 'good' white wines in the data set. Let's fit a logistic regression model to predict the variable 'good'. We'll start with the whole data set; later we'll split it into training and testing sets.

In [None]:
wq_log = glm(good ~ ., data=<what goes in here>, family=binomial)
summary(wq_log)
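
The coefficient table that `summary()` prints can also be pulled out programmatically, which makes it easier to see which predictors look useful. The following is a minimal sketch on simulated data, not the wine data: the data frame `toy` and its columns `x1`, `x2`, and `y` are made up for illustration, and the 0.05 cutoff is only the conventional choice.

```r
# Simulated data (hypothetical): x1 drives the outcome, x2 is pure noise
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(1.5 * x1))
toy <- data.frame(y = y, x1 = x1, x2 = x2)

# Same model form as in the notebook: outcome against all other columns
fit <- glm(y ~ ., data = toy, family = binomial)

# coef(summary(fit)) is a matrix with one row per model term
coef_table <- coef(summary(fit))
p_values   <- coef_table[, "Pr(>|z|)"]
print(round(p_values, 4))

# Terms whose p-value falls below the conventional 0.05 cutoff
significant <- names(p_values)[p_values < 0.05]
print(significant)
```

The same two lines (`coef(summary(...))` and the `"Pr(>|z|)"` column) should work on any fitted binomial `glm` object.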

Most of the variables are useful for predicting the quality of the wine, except sulfur dioxide and citric acid. Let's see if we can create a model that generalizes well. A model's generalization refers to its ability to predict the outcome accurately for unseen data. We will now create a training set to fit a model, and then test it on testing data the model hasn't 'seen' yet.
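
To illustrate the idea of generalization, here is a minimal sketch on simulated data that compares accuracy on the training rows with accuracy on held-out rows. It uses base R's `sample()` for a simple random split rather than `caTools::sample.split`, which additionally preserves the class ratio in both parts. The data frame `toy` and the helper `accuracy()` are hypothetical names introduced only for this example.

```r
# Simulated data (hypothetical, not the wine data)
set.seed(42)
n   <- 1000
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(2 * toy$x1 - toy$x2))

# Simple random 70/30 split by row index
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

# Fit on the training rows only
fit <- glm(y ~ ., data = train_toy, family = binomial)

# Hypothetical helper: share of correct predictions at a 0.5 threshold
accuracy <- function(model, data) {
  p <- predict(model, newdata = data, type = "response")
  mean(ifelse(p > 0.5, 1, 0) == data$y)
}

train_acc <- accuracy(fit, train_toy)
test_acc  <- accuracy(fit, test_toy)
print(c(train = train_acc, test = test_acc))
```

If the two numbers are close, the model is doing about as well on data it has not seen as on the data it was fit to.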

In [None]:
library(caTools)
set.seed(1000)
split = sample.split(wq_sub$good, SplitRatio=0.7)
train_data = subset(wq_sub, <what goes in here>)
test_data  = subset(wq_sub, <what goes in here>)
# Now fit a model to the training data
wq_log2 =  glm(good ~ ., data=<what goes in here>, family=binomial)

# now predict on the test data
probs = predict(wq_log2, type = "response", newdata=<what goes in here>)

# Now let's use a threshold of 0.5 to turn probabilities into actual predictions
preds <- ifelse(probs > 0.5,1,0)

# Now, compare this to the correct values for 'good' and compute the accuracy.
misClassificError <- mean(preds != test_data$good)
print(paste('Accuracy',1-misClassificError))

In [None]:
table(test_data$good,probs>0.5)

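Treating 'good' (1) as the positive class, let TP, FN, TN, and FP be the counts of true positives, false negatives, true negatives, and false positives in the table above. Then:

Sensitivity = TP/(TP+FN)

Specificity = TN/(TN+FP)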
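
As a worked example of these two formulas, the sketch below builds a confusion table from ten made-up labels and reads the four counts out of it. The vectors `actual` and `predicted` are hypothetical. Note that a table built from `probs > 0.5`, as in the cell above, has columns named "FALSE" and "TRUE" rather than "0" and "1".

```r
# Hypothetical labels: 4 actual positives, 6 actual negatives
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1)

# Rows are the true classes, columns are the predicted classes
cm <- table(actual = actual, predicted = predicted)
print(cm)

TP <- cm["1", "1"]  # actual 1, predicted 1
FN <- cm["1", "0"]  # actual 1, predicted 0
TN <- cm["0", "0"]  # actual 0, predicted 0
FP <- cm["0", "1"]  # actual 0, predicted 1

sensitivity <- TP / (TP + FN)  # 3 / (3 + 1)
specificity <- TN / (TN + FP)  # 4 / (4 + 2)
print(paste('sens =', sensitivity))
print(paste('spec =', round(specificity, 3)))
```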



In [None]:
print(paste('sens =', <what goes in here>))
print(paste('spec =', <what goes in here>))

# Let's also find the baseline model accuracy.
# There are 1640 'bad' wines and 1060 'good' wines, so the baseline model predicts 'bad' all the time.
print(paste('baseline accuracy =', 1640/(1640+1060)))

The accuracy of the model on unseen data is about 82%. The baseline, which always predicts 'bad', is about 61%, so the model is a clear improvement.
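
The baseline can also be computed from the label vector instead of typing the class counts by hand. This is a minimal sketch with a made-up vector `labels`; the same two lines would apply to any 0/1 outcome column.

```r
# Hypothetical labels: six 0s ('bad') and four 1s ('good')
labels <- c(rep(0, 6), rep(1, 4))

class_counts <- table(labels)

# The baseline model always predicts the most frequent class,
# so its accuracy is that class's share of the data
majority_class    <- names(class_counts)[which.max(class_counts)]
baseline_accuracy <- max(class_counts) / sum(class_counts)

print(paste('majority class =', majority_class))
print(paste('baseline accuracy =', baseline_accuracy))
```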

Now find the accuracy of the first model, which was fit on all of the data. You'll see it's also about 82%. So the model reaches about the same level of accuracy on unseen data as on data it was fit to, which suggests it is not overfitting.

In [None]:
# This is the accuracy for the first model where we have used all available data to fit.
probs1 = predict(wq_log, type = "response", newdata=wq_sub)
preds1 <- ifelse(probs1 > 0.5,1,0)
misClassificError1 <- mean(preds1 != wq_sub$good)
print(paste('Accuracy',1-misClassificError1))