# Ice Cream Dataset Exploration
_Author: Robert Dibble_

_**Purpose**_

Imagine that you have just joined a well-funded ice cream start-up as their data scientist. Your task is to find a unique selling point and/or competitive advantage that will ensure their success.

_**Steps**_

Using the [Ice Cream Dataset](https://www.kaggle.com/datasets/tysonpo/ice-cream-dataset) on Kaggle:
1. Perform EDA to understand the market/consumer
1. Identify possible use cases
1. Select a use case and develop POC
1. Provide recommendation with justification

_N.B. This analysis focuses on the combined products dataset, as it is the simplest of the combined sets._

## Import libraries

In [None]:
import os
import re
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shap
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MultiLabelBinarizer

# fix the random seed for reproducibility
np.random.seed(42)

## Load data

In [None]:
# load from csv into pandas
products_raw = pd.read_csv(os.path.join("data", "combined", "products.csv"))

## Explore the dataset

In [None]:
# show size, dtype and nullness of data
products_raw.info()

In [None]:
# show range and skew of numerical features
products_raw.describe()

In [None]:
# display random sample of data
products_raw.sample(10)

In [None]:
# check if Ben and Jerry's is the only brand to populate subhead
products_raw[["brand", "subhead"]].groupby("brand").nunique()

In [None]:
# drop columns with no value - 'key' is just for joining to other datasets and 'subhead' is minimally populated
products_raw.drop(columns=["key", "subhead"], inplace=True)

The dataset contains a range of features:
- The name of the brand
- The name of the flavour
- A description of the flavour
- The ingredients list
- The average rating of the flavour
- The number of reviews used to create the average rating

The majority of these features are text based but could provide information about what contributes towards a flavour with a good rating. The assumption is that higher-rated products will have higher sales and profits.

## Possible use cases

1. Identify ingredients that correlate with high customer ratings
    - Use Shapley values to calculate the impact of an ingredient on a rating
1. Repeat the above analysis for other columns
    - What words in the name or description lead to a good rating
1. Repeat the above analysis for the reviews dataset
    - Which descriptive characteristics lead to good reviews

## Selected proof of concept
Identify which ingredients correlate with high customer ratings. This would allow product teams to focus their attention on the most popular ingredients when designing flavours. The relationships identified will be correlations, not causal effects: they do not guarantee that adding an ingredient will _cause_ ratings to increase, although a causal relationship could be tested using A/B experimentation and/or user testing.

_**Steps**_
1. Clean and tokenise ingredients list
1. Select and train model on data
1. Calculate Shapley values to identify the impact of the presence of an ingredient on the rating
1. Analyse impact of ingredients

### Clean and tokenise ingredients list
Process the string of comma-separated ingredients into a collection of individual ingredients, removing:
- Secondary information in brackets
- Allergen warnings
- Special characters

In [None]:
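# function to clean and split ingredients
def process_ingredients_string(x):
    """Return the set of individual ingredients found in a raw ingredients string."""

    # remove bracketed secondary information
    # note: '.*' is greedy, so this removes everything from the first opening
    # bracket to the last closing bracket in the string
    brackets_regex = re.compile(r"[\[({].*[\])}]")
    x = re.sub(brackets_regex, " ", x)

    # remove 'contains' warning (everything from 'CONTAIN' to the end of the string)
    contains_regex = re.compile(r"CONTAIN.*")
    x = re.sub(contains_regex, " ", x)

    # remove special characters
    x = x.replace("†", " ").replace("/", " ").replace("\\", " ")

    # replace and/or with comma
    x = x.replace(" AND ", ",").replace(" OR ", ",")

    # split with comma, full stop or colon as delimiter
    x = x.replace(".", ",").replace(":", ",").split(",")

    # strip surrounding white space, drop empty tokens and de-duplicate
    x = {i.strip() for i in x if len(i.strip()) > 0}

    return x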

In [None]:
# convert ingredients from comma separated string to set of ingredients
products_raw["ingredients"] = products_raw["ingredients"].apply(
    process_ingredients_string
)
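The bracket-removal pattern in `process_ingredients_string` uses a greedy `.*`, which matches from the first opening bracket to the last closing bracket in the string, so any top-level ingredients listed between two bracketed asides are removed as well. A minimal sketch of one possible alternative, which strips innermost bracket groups repeatedly so that nested brackets are also handled; the names `INNERMOST_BRACKETS` and `strip_bracketed` and the example string are illustrative assumptions, not part of the original analysis:

```python
import re

# matches one bracket group that contains no further brackets (an innermost group)
INNERMOST_BRACKETS = re.compile(r"[\[({][^\[\](){}]*[\])}]")


def strip_bracketed(text):
    """Remove bracketed asides, innermost first, until none remain."""
    previous = None
    while previous != text:
        previous = text
        text = INNERMOST_BRACKETS.sub(" ", text)
    return text


# made-up ingredients string with a nested bracket group
example = "CREAM, SUGAR (CANE), CHOCOLATE CHIPS (SUGAR, COCOA (PROCESSED WITH ALKALI)), MILK"
tokens = [t.strip() for t in strip_bracketed(example).split(",") if t.strip()]
print(tokens)
```

Adopting this variant would change the token set, so the downstream results would need to be recomputed.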

In [None]:
# display all ingredients to check tokenisation
{val for sublist in products_raw["ingredients"] for val in sublist}

Most ingredients have been split out sufficiently. Additional processing could cover:
- Differences between UK and US spelling
- Grouping variants of an ingredient, e.g. 'ALMOND EXTRACT', 'ALMONDS', 'ALMONDS ROASTED IN VEGETABLE OIL'
- Singular vs plural, e.g. 'ARTIFICIAL FLAVOR', 'ARTIFICIAL FLAVORS'
- Word endings, e.g. 'ARTIFICIAL FLAVOR', 'ARTIFICIAL FLAVORING'
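One simple way to group such variants would be an explicit alias table applied after tokenisation. The sketch below is only illustrative: `ALIASES` and `canonicalise` are hypothetical names, and which variants should actually be merged (for example whether 'ALMOND EXTRACT' belongs with 'ALMONDS') is a modelling decision that is not made here.

```python
# hypothetical alias table: raw token -> canonical ingredient name
ALIASES = {
    "ARTIFICIAL FLAVORS": "ARTIFICIAL FLAVOR",
    "ARTIFICIAL FLAVORING": "ARTIFICIAL FLAVOR",
    "ALMONDS": "ALMOND",
    "ALMONDS ROASTED IN VEGETABLE OIL": "ALMOND",
}


def canonicalise(ingredients):
    """Map each token to its canonical name, leaving unknown tokens unchanged."""
    return {ALIASES.get(item, item) for item in ingredients}


grouped = canonicalise(
    {"ARTIFICIAL FLAVORS", "ARTIFICIAL FLAVORING", "ALMONDS", "CREAM"}
)
print(sorted(grouped))
```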

In [None]:
# vectorise ingredients

# initialise and fit multilabel binarizer
mlb = MultiLabelBinarizer()
mlb.fit(products_raw["ingredients"])

# save results to a boolean dataframe with one column per ingredient
ingredients = pd.DataFrame(
    mlb.transform(products_raw["ingredients"]).astype(bool),
    columns=mlb.classes_,
)
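To make the shape of the vectorised data concrete, here is a small sketch on made-up ingredient sets (the `toy_*` names and values are assumptions for illustration only). Each row is a product and each column records whether that ingredient is present:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# three made-up products, each described by a set of ingredients
toy_ingredients = pd.Series(
    [
        {"CREAM", "SUGAR", "MINT"},
        {"CREAM", "SUGAR", "COCOA"},
        {"CREAM", "MINT"},
    ]
)

# one boolean column per distinct ingredient, in sorted order
toy_mlb = MultiLabelBinarizer()
toy_matrix = pd.DataFrame(
    toy_mlb.fit_transform(toy_ingredients).astype(bool),
    columns=toy_mlb.classes_,
    index=toy_ingredients.index,
)
print(toy_matrix)
```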

### Select and train model on data
The purpose of this model is to reflect the data as closely as possible. Therefore, a k-nearest neighbours (KNN) regressor was selected, as this is a minimally parametrised model architecture. A KNN model would normally require the features to be scaled - due to the distance based nature of the model - but as the features are all boolean, this is not required.
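The reason scaling is unnecessary can be checked directly: with boolean features, every feature contributes either 0 or 1 to the squared Euclidean distance, so the distance is just the number of ingredients on which two products differ. A small sketch with made-up vectors:

```python
import numpy as np

# presence flags for two made-up products over five ingredients
a = np.array([True, True, False, True, False])
b = np.array([True, False, False, False, True])

# squared Euclidean distance on the 0/1 encoding
squared_euclidean = np.sum((a.astype(float) - b.astype(float)) ** 2)

# number of ingredients present in exactly one of the two products
differing = np.sum(a != b)

print(squared_euclidean, differing)
```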

In [None]:
# initialise and fit KNN regressor
knn = KNeighborsRegressor(n_neighbors=1, algorithm="brute")
knn.fit(ingredients.values, products_raw["rating"])
feature_names = ingredients.columns.to_list()

There is no value in evaluating the performance of the model as defined here. Firstly, the model is fitted on the full dataset with a single neighbour, so each product's nearest neighbour is itself and the model simply reproduces the training ratings (products with identical ingredient sets being the only exception). Secondly, the model is not used to make predictions on new products, so over-fitting is inconsequential.
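This memorisation behaviour can be illustrated on a toy dataset. The values below are made up, and the rows are deliberately unique; with duplicated rows the prediction for a duplicate would be the rating of whichever matching row is returned as the neighbour.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# four made-up products with unique ingredient vectors (0/1 encoded) and ratings
X = np.array(
    [
        [1, 0, 0],
        [0, 1, 0],
        [1, 1, 0],
        [0, 0, 1],
    ],
    dtype=float,
)
y = np.array([4.5, 3.0, 4.0, 2.5])

# fit a single-neighbour regressor and predict on the training data itself
toy_knn = KNeighborsRegressor(n_neighbors=1, algorithm="brute")
toy_knn.fit(X, y)
predictions = toy_knn.predict(X)

# each row is its own nearest neighbour, so the ratings are reproduced
print(np.allclose(predictions, y))
```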

### Calculate Shapley values to identify the impact of the presence of an ingredient on the rating

In [None]:
# set summary of data for baseline and sample instances to examine shap values of
with warnings.catch_warnings():  # silence warnings due to deprecations of sklearn components not resolved in SHAP
    warnings.filterwarnings("ignore")
    ingredients_cluster = shap.kmeans(ingredients, 10)

In [None]:
# calculate shap values
explainer = shap.KernelExplainer(model=knn.predict, data=ingredients_cluster)
shap_values = pd.DataFrame(
    data=explainer.shap_values(X=ingredients),
    columns=ingredients.columns,
)
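`KernelExplainer` approximates Shapley values against a background dataset (here, the ten k-means centroids). For intuition about what is being approximated, the sketch below computes exact Shapley values for a tiny hypothetical model by enumerating every subset of features. It simplifies the setting by using a single all-absent baseline rather than a background dataset, and `toy_model` and `exact_shapley` are illustrative names, not part of the SHAP library.

```python
from itertools import combinations
from math import factorial


def toy_model(x):
    """Hypothetical rating model over three binary ingredient flags."""
    return 3.0 + 1.0 * x[0] - 0.5 * x[1] + 0.5 * x[0] * x[2]


def exact_shapley(model, instance, baseline):
    """Exact Shapley values of `instance` relative to a single `baseline`."""
    n = len(instance)

    def value(subset):
        # features in `subset` take the instance's value, the rest the baseline's
        mixed = [instance[j] if j in subset else baseline[j] for j in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi


phi = exact_shapley(toy_model, instance=[1, 1, 1], baseline=[0, 0, 0])
print(phi)
```

The values sum to the difference between the model output for the instance and for the baseline, which is the property that lets each value be read as that ingredient's share of the rating.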

In [None]:
# calculate the impact of having an ingredient vs not having it

# initialise series object
rating_impact = pd.Series(dtype=float)

# loop over each ingredient, pairing its presence flags with its shap values
# (ingredient_shap avoids shadowing the imported shap module)
for (ingredient, presence), (ingredient_comp, ingredient_shap) in zip(
    ingredients.items(), shap_values.items()
):

    # check ingredient names match
    assert ingredient == ingredient_comp

    # rename series prior to concatenation
    presence = presence.rename("presence")
    ingredient_shap = ingredient_shap.rename("rating_impact")

    # calculate the average shap value with vs without the ingredient
    impact = (
        pd.concat([presence, ingredient_shap], axis=1)
        .groupby("presence")
        .mean()["rating_impact"]
    )

    # store the improvement in average rating due to the presence of the ingredient
    rating_impact[ingredient] = impact[True] - impact[False]
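The quantity computed for each ingredient is the mean Shapley value over products that contain it minus the mean over products that do not. A vectorised sketch of the same calculation on made-up data (`presence_gap` and the toy frames are assumptions for illustration):

```python
import pandas as pd

# made-up presence flags and Shapley values for four products, two ingredients
toy_presence = pd.DataFrame(
    {"MINT": [True, False, True, False], "RUM": [False, True, True, False]}
)
toy_shap = pd.DataFrame(
    {"MINT": [0.2, -0.1, 0.4, -0.1], "RUM": [0.0, -0.3, -0.1, 0.1]}
)


def presence_gap(presence, contributions):
    """Mean contribution where present minus mean contribution where absent."""
    present_mean = contributions.where(presence).mean()
    absent_mean = contributions.where(~presence).mean()
    return present_mean - absent_mean


toy_impact = presence_gap(toy_presence, toy_shap)
print(toy_impact)
```

An ingredient that is present in every product, or in none, yields a missing value here because one of the two groups is empty.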

### Analyse impact of ingredients

In [None]:
# identify most significant ingredients

# set number of ingredients
n = 25

# ingredients to include
top = (
    rating_impact.nlargest(n)
    .to_frame()
    .reset_index()
    .rename(columns={"index": "ingredient", 0: "rating_impact"})
)
top.columns = pd.MultiIndex.from_tuples(
    (("include", item) for item in top.columns), names=[None] + top.columns.names
)

# ingredients to avoid
bottom = (
    rating_impact.nsmallest(n)
    .to_frame()
    .reset_index()
    .rename(columns={"index": "ingredient", 0: "rating_impact"})
)
bottom.columns = pd.MultiIndex.from_tuples(
    (("exclude", item) for item in bottom.columns), names=[None] + bottom.columns.names
)

# display results
pd.concat([top, bottom], axis=1)

These results suggest flavours for product developers to consider (✅) or avoid (❌):
- ✅ Reese's Peanut Butter Cups
- ✅ Banana
- ✅ Mint
- ✅ Pineapple
- ✅ Pistachio
- ✅ Toffee
- ✅ Mango
- ✅ Apple
- ❌ Peaches
- ❌ Green tea
- ❌ Raisins
- ❌ Plum
- ❌ Rum

It also highlights the importance of using higher-quality variants of particular flavours:
- ✅ Ground vanilla beans vs ❌ Vanilla extract
- ✅ Coffee vs ❌ Coffee extract
- ✅ Chocolate vs ❌ Cocoa