## **Importing Libraries**

In [6]:
import pandas as pd
import numpy as np

## **Context**

What’s the best (or at least the most popular) Halloween candy? That was the question this dataset was collected to answer. The data was collected through a website that [presented participants with two fun-sized candies and asked them to click on the one they would prefer to receive](https://walthickey.com/2017/10/18/whats-the-best-halloween-candy/).

In total, more than `269,000` votes were collected from 8,371 different IP addresses.

<br>

## **Dataset**

This dataset was curated by the prominent statistics group FiveThirtyEight and housed on Kaggle at this link: <a href="https://www.kaggle.com/datasets/fivethirtyeight/the-ultimate-halloween-candy-power-ranking">The Ultimate Halloween Candy Power Ranking</a>

The file **candy-data.csv** includes attributes for each candy along with its ranking. For binary variables, `1` means yes and `0` means no. The data contains the following fields:

- **`chocolate`**: Does it contain chocolate?
- **`fruity`**: Is it fruit flavored?
- **`caramel`**: Is there caramel in the candy?
- **`peanutyalmondy`**: Does it contain peanuts, peanut butter or almonds?
- **`nougat`**: Does it contain nougat?
- **`crispedricewafer`**: Does it contain crisped rice, wafers, or a cookie component?
- **`hard`**: Is it a hard candy?
- **`bar`**: Is it a candy bar?
- **`pluribus`**: Is it one of many candies in a bag or box?
- **`sugarpercent`**: The percentile of sugar it falls under within the dataset, stored on a `0` to `1` scale. This value indicates the sweetness of a candy compared to all other candies in the dataset. For example, a candy with a `sugarpercent` of `0.80` is sweeter than roughly `80%` of the candies in the dataset.
- **`pricepercent`**: The unit price percentile compared to the rest of the set (same concept and scale as `sugarpercent`).
- **`winpercent`**: The overall win percentage according to `269,000` matchups.

<br>

## **Objective**

The objective is to identify which product characteristics drive customer sentiment towards candies in general. By analyzing the dataset, we aim to determine the factors that contribute to a candy's popularity (winpercent) and make a data-driven recommendation for a new candy product that maximizes positive customer sentiment.

<br>


## **Sample Sugar Percent Calculation**

Suppose we have 5 candies with the following sugar contents:

- Candy A: 15 grams
- Candy B: 20 grams
- Candy C: 10 grams
- Candy D: 25 grams
- Candy E: 5 grams

To calculate the `sugarpercent` for each candy, follow these steps:

1. **List the sugar contents in ascending order**:

   $$
   \text{E: 5 grams} < \text{C: 10 grams} < \text{A: 15 grams} < \text{B: 20 grams} < \text{D: 25 grams}
   $$

2. **Determine the rank of each candy**:
   - Candy E (5 grams): Rank 1
   - Candy C (10 grams): Rank 2
   - Candy A (15 grams): Rank 3
   - Candy B (20 grams): Rank 4
   - Candy D (25 grams): Rank 5

3. **Calculate the percentile for each candy** using the formula:

   $$
   \text{Percentile} = \frac{\text{Rank} - 1}{\text{Total number of candies} - 1} \times 100
   $$

Let's calculate the percentiles:

- **Candy E**:

   $$
   \frac{1 - 1}{5 - 1} \times 100 = \frac{0}{4} \times 100 = 0\%
   $$

- **Candy C**:

   $$
   \frac{2 - 1}{5 - 1} \times 100 = \frac{1}{4} \times 100 = 25\%
   $$

- **Candy A**:

   $$
   \frac{3 - 1}{5 - 1} \times 100 = \frac{2}{4} \times 100 = 50\%
   $$

- **Candy B**:

   $$
   \frac{4 - 1}{5 - 1} \times 100 = \frac{3}{4} \times 100 = 75\%
   $$

- **Candy D**:

   $$
   \frac{5 - 1}{5 - 1} \times 100 = \frac{4}{4} \times 100 = 100\%
   $$

So, the sugarpercent values for the candies are:

- Candy E: 0%
- Candy C: 25%
- Candy A: 50%
- Candy B: 75%
- Candy D: 100%
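
The same arithmetic can be sketched in a few lines of pandas. This is only an illustration of the rank-based formula used in this section, applied to the five hypothetical candies above; it is not necessarily how the `sugarpercent` column in the dataset was produced.

```python
import pandas as pd

# Hypothetical sugar contents in grams, taken from the worked example above.
sugar_grams = pd.Series({"A": 15, "B": 20, "C": 10, "D": 25, "E": 5})

# Rank 1 = least sugar, rank 5 = most sugar.
ranks = sugar_grams.rank(method="min")

# Percentile = (rank - 1) / (number of candies - 1) * 100
percentiles = (ranks - 1) / (len(sugar_grams) - 1) * 100

print(percentiles.sort_values())
```

With tied sugar contents, `method="min"` gives tied candies the same (lowest) rank; that tie-breaking rule is an assumption, since the section does not cover ties.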

<br>

****
## **Read the Data and Perform basic EDA**

In [7]:
candy_data = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv')
candy_data

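| | competitorname | chocolate | fruity | caramel | peanutyalmondy | nougat | crispedricewafer | hard | bar | pluribus | sugarpercent | pricepercent | winpercent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100 Grand | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0.732 | 0.860 | 66.971725 |
| 1 | 3 Musketeers | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.604 | 0.511 | 67.602936 |
| 2 | One dime | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.011 | 0.116 | 32.261086 |
| 3 | One quarter | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.011 | 0.511 | 46.116505 |
| 4 | Air Heads | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.906 | 0.511 | 52.341465 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 80 | Twizzlers | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.220 | 0.116 | 45.466282 |
| 81 | Warheads | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0.093 | 0.116 | 39.011898 |
| 82 | Welch's Fruit Snacks | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.313 | 0.313 | 44.375519 |
| 83 | Werther's Original Caramel | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0.186 | 0.267 | 41.904308 |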


**Check for any Missing Values**

In [8]:
candy_data.isnull().sum()

competitorname      0
chocolate           0
fruity              0
caramel             0
peanutyalmondy      0
nougat              0
crispedricewafer    0
hard                0
bar                 0
pluribus            0
sugarpercent        0
pricepercent        0
winpercent          0
dtype: int64

**Check for the Data types**

In [9]:
candy_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   competitorname    85 non-null     object 
 1   chocolate         85 non-null     int64  
 2   fruity            85 non-null     int64  
 3   caramel           85 non-null     int64  
 4   peanutyalmondy    85 non-null     int64  
 5   nougat            85 non-null     int64  
 6   crispedricewafer  85 non-null     int64  
 7   hard              85 non-null     int64  
 8   bar               85 non-null     int64  
 9   pluribus          85 non-null     int64  
 10  sugarpercent      85 non-null     float64
 11  pricepercent      85 non-null     float64
 12  winpercent        85 non-null     float64
dtypes: float64(3), int64(9), object(1)
memory usage: 8.8+ KB


**The Characteristics of the Binary Features (all features except `sugarpercent`, `pricepercent`, and `winpercent`):**

In [10]:
candy_data[candy_data.columns[1:-3]].agg(['sum','count'])

| | chocolate | fruity | caramel | peanutyalmondy | nougat | crispedricewafer | hard | bar | pluribus |
|---|---|---|---|---|---|---|---|---|---|
| sum | 37 | 38 | 14 | 14 | 7 | 7 | 15 | 21 | 44 |
| count | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85 |


**The Characteristics of the Non-Binary Features `sugarpercent` and `pricepercent`:**

In [11]:
candy_data[['sugarpercent','pricepercent']].describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| sugarpercent | 85.0 | 0.478647 | 0.282778 | 0.011 | 0.22 | 0.465 | 0.732 | 0.988 |
| pricepercent | 85.0 | 0.468882 | 0.28574 | 0.011 | 0.255 | 0.465 | 0.651 | 0.976 |


**Adjust `winpercent` to match the `sugarpercent` and `pricepercent` format (a `0` to `1` scale)**

In [5]:
candy_data['winpercent'] = candy_data['winpercent'] / 100

<br>

****
## **Clustering Similar Candies into Groups**

For this problem, clustering seems to be a good option for categorizing various types of candy into distinct groups based on their similarities and differences. By grouping candies with similar characteristics together, we can better understand which features are associated with higher popularity and positive customer sentiment. This can help us identify trends and make informed recommendations for a new candy product.

<br>


#### **The Steps to apply Clustering are:**

1. **Preprocessing**:
   - We will prepare the dataset by cleaning and normalizing the data to ensure all features contribute equally to the analysis.

2. **Dimensionality Reduction**:
   - We will use PCA, a popular dimensionality reduction technique. This step simplifies the data while retaining most of the essential information.

3. **Apply a Clustering Algorithm**:
   - We will apply K-Means, a popular and simple clustering algorithm. It partitions the data into `k` clusters, where each candy belongs to the cluster with the nearest mean (centroid).

4. **Analyze Clusters**:
   - Examine the formed clusters to identify common characteristics within each group. Determine which features are most influential in driving customer sentiment.

**1. Preprocessing**

Fortunately, the data is already **clean** (no missing values) and **normalized** (all the feature values are between `0` and `1`).

In [12]:
candy_data.iloc[:, 1:12]

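| | chocolate | fruity | caramel | peanutyalmondy | nougat | crispedricewafer | hard | bar | pluribus | sugarpercent | pricepercent |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0.732 | 0.860 |
| 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.604 | 0.511 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.011 | 0.116 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.011 | 0.511 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.906 | 0.511 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 80 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.220 | 0.116 |
| 81 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0.093 | 0.116 |
| 82 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.313 | 0.313 |
| 83 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0.186 | 0.267 |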


**2. Dimensionality Reduction**

When performing `Principal Component Analysis (PCA)`, the `Explained Variance` represents how much of the total variance in the dataset is captured by each principal component. Each component attempts to capture as much of the variability in the data as possible.

- To keep things simple, we will keep only the first `2` components, which together account for `57 %` of the total variance.

- We will plot these components `PC1` and `PC2` in order to reveal clusters, trends, and patterns that might not be visible in the higher-dimensional space.

In [14]:
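# Both packages must be installed in the active environment,
# e.g. `pip install scikit-learn seaborn`; otherwise the import fails.
from sklearn.decomposition import PCA
import seaborn as sns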

ModuleNotFoundError: No module named 'sklearn'

In [None]:
pca_columns = candy_data.iloc[:, 1:12]
pca = PCA(n_components=2)
candy_pca = pca.fit_transform(pca_columns)

In [None]:
explained_variance = np.round(pca.explained_variance_ratio_*100, decimals=2)
explained_variance_df = pd.DataFrame(explained_variance.reshape(1, -1), columns=["PC1", "PC2"])
print(explained_variance_df)

**3. Applying K-Means Clustering**

The K-Means algorithm is an iterative algorithm that aims to partition a dataset into `K` clusters.

Each cluster is represented by its centroid (the mean of the points assigned to it), and data points are assigned to the cluster with the nearest centroid. The algorithm optimizes the placement of centroids and the assignment of data points to minimize the sum of squared distances between data points and their respective cluster centroids.

<br>

The algorithm works in the following steps:

1. **Initialization**:

    - Randomly choose `K` data points as initial centroids. These centroids serve as the starting points for the clusters.

2. **Assignment**:

    - Calculate the distance (usually Euclidean distance) between each data point and all K centroids.
    - Assign each data point to the cluster with the nearest centroid. This forms K initial clusters.

3. **Update Centroids**:

    - Recalculate the centroid of each cluster by taking the mean of all data points assigned to that cluster. These new centroids may have shifted from their initial positions.

4. **Iteration**:

    - Repeat steps `2` and `3` until a stopping criterion is met. Common stopping criteria include:
        - A maximum number of iterations is reached.
        - The centroids' positions do not change significantly.

5. **Final Clusters**:

    - Once the algorithm converges (stops iterating), the final set of `K` clusters is obtained. Each data point is assigned to one of these clusters.
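
The five steps above can be sketched directly in NumPy. This is a minimal illustration of the textbook procedure, not the implementation used in this notebook; the function name `kmeans`, its parameters, and the toy `demo_points` array are all hypothetical.

```python
import numpy as np

def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Plain K-Means on an (n_samples, n_features) array (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment: Euclidean distance from every point to every centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update: mean of the points in each cluster (keep the old centroid if empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Iteration: stop once the centroids barely move.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # 5. Final clusters: assign each point to its nearest final centroid.
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1), centroids

# Hypothetical toy data: two well-separated groups of three points each.
demo_points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans(demo_points, k=2)
print(labels)
print(centroids)
```

Which group receives label `0` depends on the random initialization, so only the grouping itself is meaningful. On less cleanly separated data, K-Means can settle in a local optimum, which is why library implementations typically rerun it from several initializations.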

**4. Analysis of Clusters**

Once the clusters are formed, we can explore the average values of different candy attributes within each cluster.

This analysis will help in understanding the predominant characteristics of each cluster.
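
A minimal sketch of that per-cluster summary, assuming the cluster labels have been stored in a column named `cluster` (a hypothetical name) and using a small made-up table rather than the real candy data:

```python
import pandas as pd

# Hypothetical mini-dataset: a few binary attributes plus an assumed
# `cluster` column holding the K-Means label of each candy.
clustered = pd.DataFrame({
    "chocolate": [1, 1, 0, 0, 1, 0],
    "fruity":    [0, 0, 1, 1, 0, 1],
    "bar":       [1, 1, 0, 0, 0, 0],
    "cluster":   [0, 0, 1, 1, 0, 1],
})

# The mean of a 0/1 column is the share of candies in the cluster
# that have that attribute.
cluster_profile = clustered.groupby("cluster").mean()
print(cluster_profile)
```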

<br>

****
## **Further Analysis on Clusters**

To recommend features that will maximize the customer sentiment, we will:

1. **Highlight** and **Extract** the **Feature Combinations** that are predominantly present in each Cluster.
2. Select the **Winning Clusters** based on the `winpercent`
3. Further narrow down the **Potential Combinations**, which could enhance the appeal, using **Statistical Hypothesis Testing**.


**1. Highlight and Extract the Predominant Feature Combinations**

We will set the threshold to `40 %` which means that for each cluster we will consider only those Features whose mean value is greater than 40 %.

This will result in `2 Features per Cluster` and will set a starting point for further analysis.
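
One way to apply such a threshold is sketched below. The `cluster_means` table holds invented numbers purely for illustration; only the `40 %` cut-off comes from the text.

```python
import pandas as pd

# Hypothetical per-cluster feature means (rows = clusters, columns = features).
cluster_means = pd.DataFrame(
    {"chocolate": [0.95, 0.05, 0.60],
     "fruity":    [0.00, 0.90, 0.10],
     "bar":       [0.80, 0.00, 0.05],
     "pluribus":  [0.10, 0.70, 0.85]},
    index=pd.Index([0, 1, 2], name="cluster"),
)

THRESHOLD = 0.40  # the 40 % cut-off, expressed on the 0-1 scale

# For each cluster, keep the names of the features whose mean exceeds the cut-off.
predominant = {
    cluster: row[row > THRESHOLD].index.tolist()
    for cluster, row in cluster_means.iterrows()
}
print(predominant)
```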


**2. Select Winning Clusters**

To integrate the concept of `Winning` and `Losing`, we can simply calculate the `Average Win Percentage` for each Cluster.

This categorization is pivotal in understanding which combinations of candy attributes are more successful in terms of popularity and consumer preference.
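
A short sketch of this step on made-up numbers. The rule used here, labelling a cluster as winning when its average `winpercent` exceeds the overall average, is an assumption chosen for illustration; the text does not fix a specific cut-off.

```python
import pandas as pd

# Hypothetical candies with an assumed `cluster` label and winpercent on a 0-1 scale.
clustered = pd.DataFrame({
    "cluster":    [0, 0, 1, 1, 2, 2],
    "winpercent": [0.70, 0.66, 0.40, 0.44, 0.58, 0.62],
})

# Average win percentage per cluster, best cluster first.
avg_win = (
    clustered.groupby("cluster")["winpercent"]
    .mean()
    .sort_values(ascending=False)
)

# Assumed rule: a cluster "wins" if it beats the overall average.
overall_avg = clustered["winpercent"].mean()
winning_clusters = avg_win[avg_win > overall_avg].index.tolist()

print(avg_win)
print("Winning clusters:", winning_clusters)
```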

**3. Narrow down the Potential Combinations using Statistical Hypothesis Testing**

Once we know the basic winning combination of features (`Chocolate and Bar` and `Chocolate and Pluribus`), we can further narrow down the potential combinations that could enhance the appeal of these winning candy types.

We will `iteratively add each additional feature to the base characteristics` of the winning combinations and assess its impact on the `winpercent` using a `t-test`.
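
A single iteration of that test might look like the sketch below. The two samples are invented `winpercent` values, and the choice of Welch's variant (`equal_var=False`) and a `0.05` significance level are assumptions, since the text only says "t-test".

```python
from scipy import stats

# Hypothetical winpercent values (0-1 scale) for candies sharing the base
# characteristics, split by whether they also have the extra feature.
with_feature = [0.77, 0.73, 0.82, 0.67, 0.71]
without_feature = [0.60, 0.55, 0.66, 0.50, 0.57]

# Welch's two-sample t-test: does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(with_feature, without_feature, equal_var=False)

ALPHA = 0.05  # assumed significance level
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant: {p_value < ALPHA}")
```

In the real data some feature combinations cover only a handful of candies, so the resulting p-values should be read with caution, especially when many features are tested one after another.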