This Jupyter notebook demonstrates, in R, how to fit a logistic regression model that predicts whether a candy is chocolate-based, and how to measure the model's prediction accuracy on held-out data. The data comes from Walt Hickey's "What's the Best Halloween Candy?" experiment.

More information on the experiment and dataset can be found here: https://fivethirtyeight.com/features/the-ultimate-halloween-candy-power-ranking/     

First, import the data file "candy-data" into R. To do so, set your working directory to the folder that contains the data file.

In [161]:
#Insert your working directory path as the argument of the setwd function.
#"path/to/your/folder" is only a placeholder; replace it with your own path
setwd("path/to/your/folder")

In [None]:
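# Import the data file "candy-data" into R
# (assumes the file is saved as "candy-data.csv" in the working directory;
# stringsAsFactors = TRUE reproduces the Factor columns shown below on R 4.0.0 and later)
candy <- read.csv("candy-data.csv", stringsAsFactors = TRUE)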
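
As an alternative to setwd, the file can be read by passing its full path to read.csv. The sketch below is a minimal, self-contained illustration: it writes two rows copied from the dataset to a temporary file and reads them back. The temporary folder, the file name "candy-data.csv" and the object name "candy_demo" are assumptions made for the demo.

```r
# Hypothetical location: a temporary folder stands in for wherever
# your copy of the data file lives
data_dir <- tempdir()
csv_path <- file.path(data_dir, "candy-data.csv")  # assumed file name

# Write a two-row stand-in for the real file so the sketch runs on its own
writeLines(c(
  "competitorname,chocolate,pricepercent,winpercent",
  "100 Grand,1,0.86,66.97173",
  "Air Heads,0,0.511,52.34146"
), csv_path)

# Read the file by full path; no setwd call is needed.
# stringsAsFactors = TRUE stores text columns as factors
# (this was the default before R 4.0.0)
candy_demo <- read.csv(csv_path, stringsAsFactors = TRUE)
str(candy_demo)
```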

Next, let's take a look at the dataset itself.

In [159]:
#Viewing the structure of the dataset
str(candy)
head(candy)

'data.frame':	85 obs. of  13 variables:
 $ competitorname  : Factor w/ 85 levels "100 Grand","3 Musketeers",..: 1 2 45 46 3 4 5 6 7 8 ...
 $ chocolate       : int  1 1 0 0 0 1 1 0 0 0 ...
 $ fruity          : int  0 0 0 0 1 0 0 0 0 1 ...
 $ caramel         : int  1 0 0 0 0 0 1 0 0 1 ...
 $ peanutyalmondy  : int  0 0 0 0 0 1 1 1 0 0 ...
 $ nougat          : int  0 1 0 0 0 0 1 0 0 0 ...
 $ crispedricewafer: int  1 0 0 0 0 0 0 0 0 0 ...
 $ hard            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ bar             : int  1 1 0 0 0 1 1 0 0 0 ...
 $ pluribus        : int  0 0 0 0 0 0 0 1 1 0 ...
 $ sugarpercent    : num  0.732 0.604 0.011 0.011 0.906 ...
 $ pricepercent    : num  0.86 0.511 0.116 0.511 0.511 ...
 $ winpercent      : num  67 67.6 32.3 46.1 52.3 ...


competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
100 Grand,1,0,1,0,0,1,0,1,0,0.732,0.86,66.97173
3 Musketeers,1,0,0,0,1,0,0,1,0,0.604,0.511,67.60294
One dime,0,0,0,0,0,0,0,0,0,0.011,0.116,32.26109
One quarter,0,0,0,0,0,0,0,0,0,0.011,0.511,46.1165
Air Heads,0,1,0,0,0,0,0,0,0,0.906,0.511,52.34146
Almond Joy,1,0,0,1,0,0,0,1,0,0.465,0.767,50.34755


The goal of this notebook is to determine whether a given candy is chocolate or not, which is a binary outcome. The "candy-data" dataset already contains a "chocolate" column, so we will use it as the categorical response variable to predict. Because the response has only two outcomes, "0" (not chocolate) and "1" (is chocolate), logistic regression is a natural choice of model. Finally, as the previous cell shows, "chocolate" is stored as data type "int". Since the variable is categorical, we convert it to data type "Factor".

In [160]:
#Changing the values for chocolate from data type 'int' to 'Factor'
candy$chocolate <- as.factor(candy$chocolate)
str(candy)

'data.frame':	85 obs. of  13 variables:
 $ competitorname  : Factor w/ 85 levels "100 Grand","3 Musketeers",..: 1 2 45 46 3 4 5 6 7 8 ...
 $ chocolate       : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 2 1 1 1 ...
 $ fruity          : int  0 0 0 0 1 0 0 0 0 1 ...
 $ caramel         : int  1 0 0 0 0 0 1 0 0 1 ...
 $ peanutyalmondy  : int  0 0 0 0 0 1 1 1 0 0 ...
 $ nougat          : int  0 1 0 0 0 0 1 0 0 0 ...
 $ crispedricewafer: int  1 0 0 0 0 0 0 0 0 0 ...
 $ hard            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ bar             : int  1 1 0 0 0 1 1 0 0 0 ...
 $ pluribus        : int  0 0 0 0 0 0 0 1 1 0 ...
 $ sugarpercent    : num  0.732 0.604 0.011 0.011 0.906 ...
 $ pricepercent    : num  0.86 0.511 0.116 0.511 0.511 ...
 $ winpercent      : num  67 67.6 32.3 46.1 52.3 ...
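
The conversion can be checked on a small vector. The sketch below uses made-up values rather than the dataset, and its object names are illustrative. It shows that as.factor keeps "0" as the first level and "1" as the second. For a factor response in a binomial glm, R treats the first level as the failure and the other level as the success, so the model estimates the probability of "1".

```r
# Toy stand-in for the chocolate column (values are illustrative)
chocolate_int <- c(1L, 1L, 0L, 0L, 0L, 1L)

chocolate_fac <- as.factor(chocolate_int)

levels(chocolate_fac)      # level order: "0" first, then "1"
as.integer(chocolate_fac)  # internal codes are 1 and 2, not 0 and 1
table(chocolate_fac)       # counts per class
```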


In [121]:
#Viewing the summary statistics of the candy dataset
summary(candy)

            competitorname chocolate     fruity          caramel      
 100 Grand         : 1     0:48      Min.   :0.0000   Min.   :0.0000  
 3 Musketeers      : 1     1:37      1st Qu.:0.0000   1st Qu.:0.0000  
 Air Heads         : 1               Median :0.0000   Median :0.0000  
 Almond Joy        : 1               Mean   :0.4471   Mean   :0.1647  
 Baby Ruth         : 1               3rd Qu.:1.0000   3rd Qu.:0.0000  
 Boston Baked Beans: 1               Max.   :1.0000   Max.   :1.0000  
 (Other)           :79                                                
 peanutyalmondy       nougat        crispedricewafer       hard       
 Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000  
 Median :0.0000   Median :0.00000   Median :0.00000   Median :0.0000  
 Mean   :0.1647   Mean   :0.08235   Mean   :0.08235   Mean   :0.1765  
 3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.0000  
 Max. 

Next, we split the data into a training set, used to fit the logistic regression model, and a testing set, used to evaluate the model's accuracy. We want roughly 80% of the rows in the training set and 20% in the testing set. Because each row is assigned at random with those probabilities, the actual split is only approximately 80/20.

In [169]:
#Creating the training and testing data
set.seed(100)
ind <- sample(2, nrow(candy), replace = TRUE, prob = c(0.8, 0.2))
train <- candy[ind==1,]
test <- candy[ind==2,]
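
Since sample assigns each row independently, it can help to check the sizes of the two partitions. The sketch below uses a placeholder data frame with 85 rows (the name "toy" and its contents are assumptions) rather than the candy data.

```r
# Placeholder data frame with the same number of rows as the candy data
toy <- data.frame(id = 1:85)

set.seed(100)
# Draw a 1 (train) or 2 (test) for every row with probabilities 0.8 and 0.2
ind <- sample(2, nrow(toy), replace = TRUE, prob = c(0.8, 0.2))
toy_train <- toy[ind == 1, , drop = FALSE]
toy_test  <- toy[ind == 2, , drop = FALSE]

# Partition sizes and the realised proportions
table(ind)
prop.table(table(ind))
```

If an exact 80/20 split is preferred, one option is to draw row indices without replacement, for example with sample(nrow(toy), size = floor(0.8 * nrow(toy))).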

Now that the training and testing sets have been created, let's fit our logistic regression model on the training set. We will use "chocolate" as the response variable, with "pricepercent" and "winpercent" as the predictors.

In [164]:
#Creating a logistic regression model on the training set
chocolate_logistic_model <- glm(chocolate ~  pricepercent + winpercent, 
                                 data = train, family = 'binomial')
summary(chocolate_logistic_model)


Call:
glm(formula = chocolate ~ pricepercent + winpercent, family = "binomial", 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4791  -0.4903  -0.2213   0.3311   2.8018  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -8.57937    1.93765  -4.428 9.52e-06 ***
pricepercent  3.78913    1.37634   2.753 0.005904 ** 
winpercent    0.12578    0.03334   3.773 0.000161 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 93.190  on 68  degrees of freedom
Residual deviance: 48.137  on 66  degrees of freedom
AIC: 54.137

Number of Fisher Scoring iterations: 5
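
The coefficients in the summary are on the log-odds scale. As a rough sketch of how to read them, the code below copies the three estimates from the printed summary, converts the slopes to odds ratios, and computes the predicted probability for one candy (100 Grand, using the "pricepercent" and "winpercent" values shown in the first rows of the dataset). The variable names are illustrative.

```r
# Estimates copied from the printed model summary
b0      <- -8.57937   # intercept
b_price <-  3.78913   # pricepercent
b_win   <-  0.12578   # winpercent

# Odds ratio for a one-unit increase in winpercent
exp(b_win)
# The pricepercent values shown earlier lie between 0 and 1,
# so a 0.1 increase is a more natural step than a full unit
exp(b_price * 0.1)

# Predicted probability of chocolate for 100 Grand
log_odds <- b0 + b_price * 0.86 + b_win * 66.97173
p_chocolate <- plogis(log_odds)  # inverse logit: 1 / (1 + exp(-log_odds))
p_chocolate
```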


After fitting the model, we create a "predictTest" variable that holds the predicted probability of being chocolate for each candy in the testing set. A candy is classified as chocolate when its predicted probability is greater than 0.5, and as not chocolate otherwise.

In [166]:
#Making predictions on the testing set and setting a prediction cutoff
predictTest <- predict(chocolate_logistic_model, newdata=test, type = "response")
prediction_cutoff <- ifelse(predictTest>0.5,1,0)
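
The cutoff step can be tried on a few hand-picked probabilities. The values below are made up for illustration and the object names are hypothetical. Note that a probability of exactly 0.5 is labelled 0, because the comparison is strictly greater than.

```r
# Made-up predicted probabilities, not output from the model
probs <- c(0.10, 0.50, 0.51, 0.93)

# 1 = predicted chocolate, 0 = predicted not chocolate
labels <- ifelse(probs > 0.5, 1, 0)
labels

# Storing the labels as a factor with both levels keeps a confusion
# matrix 2 x 2 even if one class is never predicted
labels_fac <- factor(labels, levels = c(0, 1))
table(labels_fac)
```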

Next, we create a confusion matrix, a table that counts how often the model's predictions agree or disagree with the actual values recorded in the testing set.

In [152]:
#Creating the confusion matrix
confusion_matrix <- table(Predicted = prediction_cutoff, Actual = test$chocolate)
confusion_matrix 

         Actual
Predicted 0 1
        0 6 2
        1 1 7
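
Beyond the raw counts, the confusion matrix also yields per-class rates. The sketch below rebuilds the matrix from the printed counts and computes sensitivity and specificity, treating "1" (chocolate) as the positive class. The object names are illustrative.

```r
# Rebuild the confusion matrix from the printed counts
# (rows = predicted, columns = actual); matrix() fills column by column
cm <- matrix(c(6, 1, 2, 7), nrow = 2,
             dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))

true_neg  <- cm["0", "0"]  # not chocolate, predicted not chocolate
false_pos <- cm["1", "0"]  # not chocolate, predicted chocolate
false_neg <- cm["0", "1"]  # chocolate, predicted not chocolate
true_pos  <- cm["1", "1"]  # chocolate, predicted chocolate

sensitivity <- true_pos / (true_pos + false_neg)  # share of chocolate candies found
specificity <- true_neg / (true_neg + false_pos)  # share of non-chocolate candies found
c(sensitivity = sensitivity, specificity = specificity)
```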

Finally, we calculate the accuracy of the model on the testing set: the number of correct predictions (the diagonal of the confusion matrix) divided by the total number of predictions.

In [163]:
#Determining the accuracy of our model as a proportion of correct predictions
accuracy <- sum(diag(confusion_matrix))/sum(confusion_matrix)
accuracy

Our model scored an accuracy of about 81% (13 of the 16 testing-set candies classified correctly). This is not bad, especially given that the dataset is very small to begin with and the training set is smaller still.
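
With only 16 candies in the testing set, the accuracy estimate is itself uncertain. One way to gauge this, offered here as a sketch rather than as part of the original analysis, is an exact binomial confidence interval from binom.test in base R, using the 13 correct predictions out of 16 from the confusion matrix.

```r
correct <- 6 + 7   # diagonal of the confusion matrix
total   <- 16      # all testing-set predictions

accuracy <- correct / total
accuracy

# Exact (Clopper-Pearson) 95% confidence interval for the accuracy
ci <- binom.test(correct, total)$conf.int
ci
```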