# Context
What’s the best (or at least the most popular) Halloween candy? That was the question this dataset was collected to answer. Data was collected through a website where participants were shown two fun-sized candies and asked to click on the one they would prefer to receive. In total, more than 269 thousand votes were collected from 8,371 different IP addresses.

# Content
candy-data.csv includes attributes for each candy along with its ranking. For binary variables, 1 means yes, 0 means no. The data contains the following fields:

- chocolate: Does it contain chocolate?
- fruity: Is it fruit flavored?
- caramel: Is there caramel in the candy?
- peanutalmondy: Does it contain peanuts, peanut butter or almonds?
- nougat: Does it contain nougat?
- crispedricewafer: Does it contain crisped rice, wafers, or a cookie component?
- hard: Is it a hard candy?
- bar: Is it a candy bar?
- pluribus: Is it one of many candies in a bag or box?
- sugarpercent: The percentile of sugar it falls under within the data set.
- pricepercent: The unit price percentile compared to the rest of the set.
- winpercent: The overall win percentage according to 269,000 matchups.
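To make the `winpercent` field concrete, here is a minimal sketch of how a win percentage could be derived from pairwise votes. The source does not describe the exact calculation, so this assumes the simplest reading: wins divided by the number of matchups a candy appeared in. The candy names and the `votes` list are hypothetical.

```python
from collections import Counter

# Hypothetical matchup log: one (winner, loser) pair per vote.
votes = [
    ("Candy A", "Candy B"),
    ("Candy A", "Candy C"),
    ("Candy B", "Candy C"),
    ("Candy C", "Candy A"),
]

def win_percentages(votes):
    """Return each candy's share of won matchups, as a percentage."""
    wins = Counter()
    appearances = Counter()
    for winner, loser in votes:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    # Assumption: win percentage = wins / matchups the candy appeared in.
    return {name: 100.0 * wins[name] / appearances[name] for name in appearances}

win_pct = win_percentages(votes)
for name, pct in sorted(win_pct.items()):
    print(f"{name}: {pct:.1f}")
```

In this toy log, a candy that wins two of its three matchups ends up with a win percentage of about 66.7.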

# Classification

Can you predict whether a candy contains chocolate based on its other features?


In [None]:
import pandas as pd
import numpy as np

Load the candy dataset into a DataFrame

In [None]:
df = pd.read_csv('~/notebooks/data/regression/halloween_candy/candy-data.csv')
df.head()

Clean up the data

In [None]:
# Fill missing values with 0 rather than '' so the numeric feature columns stay numeric
df = df.fillna(0)
df.head()

Remove any unwanted features and split into the independent and dependent sets

In [None]:
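# Drop the candy name (an identifier) and the target column from the features
X = df.drop(columns=['competitorname', 'chocolate'])
# Use a 1-D Series as the target, the shape scikit-learn expects for y
y = df['chocolate']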

Let's train a simple logistic regression model to see what results we can obtain

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [None]:
reg = LogisticRegression(solver='lbfgs')
 
# Obtain the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
reg.fit(X_train, y_train)

# Note: this score is computed on the full dataset, which includes the training rows
print('Initial score is: ', reg.score(X, y))

What is the accuracy of our initial estimate on the held-out test set?

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
# accuracy_score expects the true labels first, then the predictions
accuracy_score(y_test, reg.predict(X_test))
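To put the accuracy figure in context, the sketch below computes accuracy by hand as the fraction of matching labels and compares it with a majority-class baseline (always predicting the most common label). The label lists and the helper names `accuracy` and `majority_baseline` are hypothetical stand-ins for illustration, not values from the candy data.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Hypothetical labels: 1 = chocolate, 0 = not chocolate.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

print('Accuracy:', accuracy(y_true, y_pred))
print('Majority-class baseline:', majority_baseline(y_true))
```

A model is only informative if its test accuracy clearly exceeds this baseline. Accuracy itself is symmetric in its two arguments, so swapping them does not change the number, though other metrics such as precision and recall are not symmetric.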