# Predicting if a candy contains chocolate or not

Using this dataset: https://www.kaggle.com/datasets/fivethirtyeight/the-ultimate-halloween-candy-power-ranking

In this project, I'll use logistic regression to predict whether a candy contains chocolate, based on its other features. Let's dive right in:

In [4]:
# Uncomment and run the commands below if imports fail
# !conda install numpy pytorch torchvision cpuonly -c pytorch -y
# !pip install matplotlib --upgrade --quiet
!pip install jovian --upgrade --quiet

In [5]:
# Imports
import torch
import jovian
import torchvision
import torch.nn as nn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch.nn.functional as F
import torchvision.transforms as transforms
from torchvision.datasets.utils import download_url
from torch.utils.data import random_split
from torch.utils.data import DataLoader, TensorDataset

# Downloading, exploring and cleaning data

In [6]:
dataframe = pd.read_csv("candy-data.csv")

In [7]:
dataframe.head()

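| | competitorname | chocolate | fruity | caramel | peanutyalmondy | nougat | crispedricewafer | hard | bar | pluribus | sugarpercent | pricepercent | winpercent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100 Grand | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0.732 | 0.86 | 66.971725 |
| 1 | 3 Musketeers | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.604 | 0.511 | 67.602936 |
| 2 | One dime | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.011 | 0.116 | 32.261086 |
| 3 | One quarter | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.011 | 0.511 | 46.116505 |
| 4 | Air Heads | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.906 | 0.511 | 52.341465 |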


In [8]:
rows, columns = dataframe.shape
print("Number of rows:" + str(rows))
print("Number of columns:" + str(columns))


Number of rows:85
Number of columns:13


In [9]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   competitorname    85 non-null     object 
 1   chocolate         85 non-null     int64  
 2   fruity            85 non-null     int64  
 3   caramel           85 non-null     int64  
 4   peanutyalmondy    85 non-null     int64  
 5   nougat            85 non-null     int64  
 6   crispedricewafer  85 non-null     int64  
 7   hard              85 non-null     int64  
 8   bar               85 non-null     int64  
 9   pluribus          85 non-null     int64  
 10  sugarpercent      85 non-null     float64
 11  pricepercent      85 non-null     float64
 12  winpercent        85 non-null     float64
dtypes: float64(3), int64(9), object(1)
memory usage: 8.8+ KB


Right off the bat, we can see that there are no null values in this dataset. Let's confirm this:

In [10]:
dataframe.isnull().sum()

competitorname      0
chocolate           0
fruity              0
caramel             0
peanutyalmondy      0
nougat              0
crispedricewafer    0
hard                0
bar                 0
pluribus            0
sugarpercent        0
pricepercent        0
winpercent          0
dtype: int64

Now let's prepare our dataset for training.

## Preparing dataset for training

To prepare the dataset for training, we first convert the data from our dataframe to NumPy arrays, which can then be converted to PyTorch tensors.

In [12]:
inputcols = ['fruity','caramel','peanutyalmondy','nougat','crispedricewafer','hard','bar','pluribus','sugarpercent','pricepercent']

In [13]:
outputcols = ['chocolate']

In [14]:
dfcopy = dataframe.copy(deep = True)
inputarray = dfcopy[inputcols].to_numpy()
outputarray = dfcopy[outputcols].to_numpy()

In [15]:
inputarray, outputarray

(array([[0.        , 1.        , 0.        , 0.        , 1.        ,
         0.        , 1.        , 0.        , 0.73199999, 0.86000001],
        [0.        , 0.        , 0.        , 1.        , 0.        ,
         0.        , 1.        , 0.        , 0.60399997, 0.51099998],
        [0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.011     , 0.116     ],
        [0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.011     , 0.51099998],
        [1.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.90600002, 0.51099998],
        [0.        , 0.        , 1.        , 0.        , 0.        ,
         0.        , 1.        , 0.        , 0.465     , 0.76700002],
        [0.        , 1.        , 1.        , 1.        , 0.        ,
         0.        , 1.        , 0.        , 0.60399997, 0.76700002],
        [0.        , 0.    

Next, we convert these NumPy arrays to tensors:

In [16]:
inputs = torch.tensor(inputarray,dtype = torch.float32)
outputs = torch.tensor(outputarray,dtype = torch.float32)

In [17]:
dataset = TensorDataset(inputs, outputs)

Now we need to split our dataset into training and validation sets. I'll hold out 20% of the dataset as the validation set:

In [18]:
validation_size = int(rows * 0.2)
training_size = rows - validation_size
train_ds, val_ds = random_split(dataset, [training_size,validation_size])
len(train_ds), len(val_ds)

(68, 17)
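
Because `random_split` shuffles the indices randomly, the samples that end up in the validation set change from run to run. If a repeatable split is wanted, one option is to pass a seeded generator. The sketch below uses random stand-in tensors with the same shape as the candy data; the `split` helper and the seed value are assumptions for illustration, not part of the original notebook.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in data with the same shape as the candy dataset (85 rows, 10 features).
inputs = torch.rand(85, 10)
targets = torch.randint(0, 2, (85, 1)).float()
dataset = TensorDataset(inputs, targets)

validation_size = int(len(dataset) * 0.2)        # 17
training_size = len(dataset) - validation_size   # 68

def split(seed):
    # A seeded generator makes the random permutation repeatable.
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [training_size, validation_size], generator=generator)

train_ds, val_ds = split(42)
train_again, val_again = split(42)

print(len(train_ds), len(val_ds))            # 68 17
print(val_ds.indices == val_again.indices)   # True: same seed, same split
```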

## Data loaders for training and validation

In [19]:
batch = 15
trainloader = DataLoader(train_ds, batch, shuffle = True)
valloader = DataLoader(val_ds, batch)


In [20]:
for x,y in trainloader:
    print(x)
    print(y)
    break

tensor([[0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0340,
         0.2790],
        [0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.9880,
         0.6510],
        [1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 1.0000, 0.0930,
         0.5110],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.3130,
         0.8600],
        [0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.5930,
         0.6510],
        [0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.1860,
         0.2670],
        [0.0000, 1.0000, 1.0000, 0.0000, 1.0000, 0.0000, 1.0000, 0.0000, 0.6040,
         0.6510],
        [0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.8600,
         0.8600],
        [1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.7320,
         0.0340],
        [0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.4060,
         0.6510],
        [1

In [21]:
for x,y in valloader:
    print(x)
    print(y)
    break

tensor([[0.0000, 0.0000, 1.0000, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.4650,
         0.7670],
        [0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.9650,
         0.7670],
        [1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.2200,
         0.3250],
        [1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.9410,
         0.2200],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.4650,
         0.3250],
        [1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.4650,
         0.4650],
        [1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.2200,
         0.1160],
        [0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.3130,
         0.9180],
        [1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0690,
         0.1160],
        [1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.9060,
         0.5110],
        [0

In [22]:
jovian.commit(filename='logistic_regression', environment=None)

[jovian] Updating notebook "akkuvilas/mnist-logistic-minimal" on https://jovian.com
[jovian] Committed successfully! https://jovian.com/akkuvilas/mnist-logistic-minimal


'https://jovian.com/akkuvilas/mnist-logistic-minimal'

## Model

In [23]:
class ChocolateModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(len(inputcols), len(outputcols))
        
    def forward(self, xb):
        out = self.linear(xb)
        return out
    
    def training_step(self, batch):
        inputs, outputs = batch 
        out = self(inputs)                  # Generate predictions
        loss = F.mse_loss(out, outputs) # Calculate loss
        return loss
    
    def validation_step(self, batch):
        inputs, outputs = batch 
        out = self(inputs)                    # Generate predictions
        loss = F.mse_loss(out, outputs)   # Calculate loss
        return {'val_loss': loss.detach()}
        
    def validation_epoch_end(self, outputs):
        batch_losses = [x['val_loss'] for x in outputs]
        epoch_loss = torch.stack(batch_losses).mean()   # Combine losses
        return {'val_loss': epoch_loss.item()}
    
    def epoch_end(self, epoch, result):
        print("Epoch [{}], val_loss: {:.4f}".format(epoch, result['val_loss']))
    
model = ChocolateModel()

In [24]:
list(model.parameters())

[Parameter containing:
 tensor([[ 0.0911,  0.1571,  0.0890, -0.1193, -0.1330, -0.1821, -0.0152, -0.2174,
           0.1299,  0.1332]], requires_grad=True),
 Parameter containing:
 tensor([-0.2261], requires_grad=True)]
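
Note that the model above trains a single linear layer with mean squared error on the 0/1 labels. A more conventional logistic regression would treat the linear output as a logit, pass it through a sigmoid to get a probability, and train with binary cross-entropy. Below is a minimal sketch of that variant. `ChocolateLogisticModel`, `predict_proba`, and the random stand-in batch are hypothetical names and data, not part of the original notebook, and the losses reported in the rest of this notebook come from the MSE version.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChocolateLogisticModel(nn.Module):
    def __init__(self, n_features=10):
        super().__init__()
        self.linear = nn.Linear(n_features, 1)

    def forward(self, xb):
        # Return raw logits; the sigmoid is applied inside the loss.
        return self.linear(xb)

    def training_step(self, batch):
        inputs, targets = batch
        logits = self(inputs)
        # Binary cross-entropy on logits (numerically stabler than sigmoid + BCE).
        return F.binary_cross_entropy_with_logits(logits, targets)

    def predict_proba(self, inputs):
        # Probability that each candy contains chocolate.
        with torch.no_grad():
            return torch.sigmoid(self(inputs))

model = ChocolateLogisticModel()
xb = torch.rand(4, 10)                            # stand-in features
yb = torch.tensor([[1.0], [0.0], [0.0], [1.0]])   # stand-in labels
loss = model.training_step((xb, yb))
probs = model.predict_proba(xb)
print(loss.item(), probs.shape)
```

With this variant, a prediction would be `probs >= 0.5` rather than a rounded regression output.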

## Training

In [25]:
def evaluate(model, val_loader):
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        # Training Phase 
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Validation phase
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result)
        history.append(result)
    return history

In [26]:
evaluate(model, valloader)

{'val_loss': 0.2904931306838989}

In [27]:
epochs = 15
lr = 1e-2
history1 = fit(epochs, lr, model, trainloader, valloader)

Epoch [0], val_loss: 0.2222
Epoch [1], val_loss: 0.1813
Epoch [2], val_loss: 0.1558
Epoch [3], val_loss: 0.1426
Epoch [4], val_loss: 0.1347
Epoch [5], val_loss: 0.1308
Epoch [6], val_loss: 0.1291
Epoch [7], val_loss: 0.1286
Epoch [8], val_loss: 0.1284
Epoch [9], val_loss: 0.1273
Epoch [10], val_loss: 0.1273
Epoch [11], val_loss: 0.1274
Epoch [12], val_loss: 0.1266
Epoch [13], val_loss: 0.1270
Epoch [14], val_loss: 0.1267


In [28]:
epochs = 15
lr = 1e-3
history2 = fit(epochs, lr, model, trainloader, valloader)

Epoch [0], val_loss: 0.1266
Epoch [1], val_loss: 0.1266
Epoch [2], val_loss: 0.1265
Epoch [3], val_loss: 0.1264
Epoch [4], val_loss: 0.1263
Epoch [5], val_loss: 0.1262
Epoch [6], val_loss: 0.1262
Epoch [7], val_loss: 0.1260
Epoch [8], val_loss: 0.1259
Epoch [9], val_loss: 0.1257
Epoch [10], val_loss: 0.1256
Epoch [11], val_loss: 0.1255
Epoch [12], val_loss: 0.1254
Epoch [13], val_loss: 0.1253
Epoch [14], val_loss: 0.1252


In [29]:
epochs = 15
lr = 1e-4
history3 = fit(epochs, lr, model, trainloader, valloader)

Epoch [0], val_loss: 0.1252
Epoch [1], val_loss: 0.1252
Epoch [2], val_loss: 0.1252
Epoch [3], val_loss: 0.1252
Epoch [4], val_loss: 0.1251
Epoch [5], val_loss: 0.1251
Epoch [6], val_loss: 0.1251
Epoch [7], val_loss: 0.1251
Epoch [8], val_loss: 0.1251
Epoch [9], val_loss: 0.1251
Epoch [10], val_loss: 0.1251
Epoch [11], val_loss: 0.1251
Epoch [12], val_loss: 0.1251
Epoch [13], val_loss: 0.1251
Epoch [14], val_loss: 0.1251


In [30]:
epochs = 15
lr = 1e-5
history4 = fit(epochs, lr, model, trainloader, valloader)

Epoch [0], val_loss: 0.1251
Epoch [1], val_loss: 0.1251
Epoch [2], val_loss: 0.1251
Epoch [3], val_loss: 0.1251
Epoch [4], val_loss: 0.1251
Epoch [5], val_loss: 0.1251
Epoch [6], val_loss: 0.1251
Epoch [7], val_loss: 0.1251
Epoch [8], val_loss: 0.1251
Epoch [9], val_loss: 0.1251
Epoch [10], val_loss: 0.1251
Epoch [11], val_loss: 0.1251
Epoch [12], val_loss: 0.1251
Epoch [13], val_loss: 0.1251
Epoch [14], val_loss: 0.1251


In [31]:
epochs = 15
lr = 1e-6
history5 = fit(epochs, lr, model, trainloader, valloader)

Epoch [0], val_loss: 0.1251
Epoch [1], val_loss: 0.1251
Epoch [2], val_loss: 0.1251
Epoch [3], val_loss: 0.1251
Epoch [4], val_loss: 0.1251
Epoch [5], val_loss: 0.1251
Epoch [6], val_loss: 0.1251
Epoch [7], val_loss: 0.1251
Epoch [8], val_loss: 0.1251
Epoch [9], val_loss: 0.1251
Epoch [10], val_loss: 0.1251
Epoch [11], val_loss: 0.1251
Epoch [12], val_loss: 0.1251
Epoch [13], val_loss: 0.1251
Epoch [14], val_loss: 0.1251


Thus, our final validation loss is 0.1251; it stops decreasing even as we lower the learning rate further.
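
The "no further decrease" observation can be made explicit with a small plateau check. The sketch below is one possible way to do it; `has_plateaued`, `patience`, and `min_delta` are hypothetical names and thresholds chosen for illustration. The loss values are copied from the `lr = 1e-2` and `lr = 1e-4` runs printed above.

```python
def has_plateaued(losses, patience=5, min_delta=1e-4):
    """Return True if the last `patience` losses improved on the
    earlier best by less than `min_delta`."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_before - best_recent < min_delta

# Validation losses from the lr = 1e-2 run (still improving at the end).
run_1e2 = [0.2222, 0.1813, 0.1558, 0.1426, 0.1347, 0.1308, 0.1291, 0.1286,
           0.1284, 0.1273, 0.1273, 0.1274, 0.1266, 0.1270, 0.1267]
# Validation losses from the lr = 1e-4 run (flat).
run_1e4 = [0.1252] * 4 + [0.1251] * 11

print(has_plateaued(run_1e2))  # False
print(has_plateaued(run_1e4))  # True
```

In the notebook, the list of losses for a run could be built with `[r['val_loss'] for r in history3]`.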

## Making predictions using the trained model

In [84]:
def predict(inputs,output,model):
    preds = model(inputs)
    print("Input:", inputs)
    print("Prediction:", torch.round(preds))
    print("Output:", output)
inputs, output = val_ds[0]
predict(inputs,output,model)


Input: tensor([0.0000, 0.0000, 1.0000, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.4650,
        0.7670])
Prediction: tensor([1.], grad_fn=<RoundBackward0>)
Output: tensor([0.])


The model predicts that the candy contains chocolate, but it does not.

In [85]:
inputs, output = val_ds[5]
predict(inputs,output,model)

Input: tensor([1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.4650,
        0.4650])
Prediction: tensor([0.], grad_fn=<RoundBackward0>)
Output: tensor([0.])


Here our model predicts that the candy does not contain chocolate, and indeed it does not.

In [86]:
inputs, output = val_ds[15]
predict(inputs,output,model)

Input: tensor([1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.2670,
        0.2790])
Prediction: tensor([0.], grad_fn=<RoundBackward0>)
Output: tensor([0.])


Again, our model predicts that the candy does not contain chocolate, and it does not.

In [87]:
inputs, output = val_ds[4]
predict(inputs,output,model)

Input: tensor([0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.4650,
        0.3250])
Prediction: tensor([1.], grad_fn=<RoundBackward0>)
Output: tensor([1.])


In [88]:
inputs, output = val_ds[9]
predict(inputs,output,model)

Input: tensor([1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.9060,
        0.5110])
Prediction: tensor([1.], grad_fn=<RoundBackward0>)
Output: tensor([0.])


In [89]:
inputs, output = val_ds[6]
predict(inputs,output,model)

Input: tensor([1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.2200,
        0.1160])
Prediction: tensor([0.], grad_fn=<RoundBackward0>)
Output: tensor([0.])


In [90]:
inputs, output = val_ds[7]
predict(inputs,output,model)

Input: tensor([0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.3130,
        0.9180])
Prediction: tensor([1.], grad_fn=<RoundBackward0>)
Output: tensor([1.])


In [91]:
inputs, output = val_ds[11]
predict(inputs,output,model)

Input: tensor([0.0000, 1.0000, 1.0000, 1.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.6040,
        0.7670])
Prediction: tensor([1.], grad_fn=<RoundBackward0>)
Output: tensor([1.])


The model predicts that the candy contains chocolate, and it does.

In [92]:
inputs, output = val_ds[10]
predict(inputs,output,model)

Input: tensor([0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 1.0000, 0.0000, 0.3130,
        0.7670])
Prediction: tensor([1.], grad_fn=<RoundBackward0>)
Output: tensor([1.])


In [93]:
inputs, output = val_ds[2]
predict(inputs,output,model)

Input: tensor([1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.2200,
        0.3250])
Prediction: tensor([0.], grad_fn=<RoundBackward0>)
Output: tensor([0.])


In [94]:
inputs, output = val_ds[14]
predict(inputs,output,model)

Input: tensor([1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.1270,
        0.0340])
Prediction: tensor([0.], grad_fn=<RoundBackward0>)
Output: tensor([0.])
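
Of the eleven validation samples inspected above, nine are predicted correctly. Rather than checking samples one at a time, the same comparison can be run over a whole data loader. The helper below is a sketch: `accuracy` and its `threshold` parameter are hypothetical, and the toy "model" and batches exist only to keep the example self-contained.

```python
import torch

def accuracy(model, loader, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    correct = 0
    total = 0
    with torch.no_grad():  # no gradients needed for evaluation
        for inputs, targets in loader:
            preds = (model(inputs) >= threshold).float()
            correct += (preds == targets).sum().item()
            total += targets.numel()
    return correct / total

# Toy stand-in: the "model" just returns the first feature as its output.
toy_model = lambda xb: xb[:, :1]
toy_loader = [
    (torch.tensor([[0.9, 0.0], [0.1, 0.0]]), torch.tensor([[1.0], [0.0]])),
    (torch.tensor([[0.7, 0.0], [0.2, 0.0]]), torch.tensor([[0.0], [0.0]])),
]
print(accuracy(toy_model, toy_loader))  # 0.75 (3 of 4 correct)
```

In the notebook, this would be called as `accuracy(model, valloader)`.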


## Save and upload

In [95]:
torch.save(model.state_dict(), 'mnist-logistic.pth')
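
To reuse the saved weights later, the state dict can be loaded back into a fresh instance of the same model class. The sketch below shows a save-and-restore round trip. It redefines a minimal `ChocolateModel` and writes to a temporary file named `chocolate-model.pth` (an assumed name) so that it runs on its own.

```python
import os
import tempfile

import torch
import torch.nn as nn

class ChocolateModel(nn.Module):
    def __init__(self, n_features=10):
        super().__init__()
        self.linear = nn.Linear(n_features, 1)

    def forward(self, xb):
        return self.linear(xb)

trained = ChocolateModel()  # stands in for the trained model

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "chocolate-model.pth")
    torch.save(trained.state_dict(), path)

    # Recreate the architecture, then load the saved parameters into it.
    restored = ChocolateModel()
    restored.load_state_dict(torch.load(path))

restored.eval()  # switch to evaluation mode before making predictions

same = all(
    torch.equal(a, b)
    for a, b in zip(trained.state_dict().values(), restored.state_dict().values())
)
print(same)  # True: the restored weights match the saved ones
```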

In [96]:
jovian.commit(filename='logistic_regression', environment=None)


[jovian] Updating notebook "akkuvilas/mnist-logistic-minimal" on https://jovian.com
[jovian] Committed successfully! https://jovian.com/akkuvilas/mnist-logistic-minimal


'https://jovian.com/akkuvilas/mnist-logistic-minimal'