# Ice Cream Dataset Exploration
_Author: Robert Dibble_

_**Purpose**_

Imagine that you have just joined a well-funded ice cream start-up as their data scientist. Your task is to find a unique selling point and/or competitive advantage that will help ensure their success.

_**Steps**_

Using the [Ice Cream Dataset](https://www.kaggle.com/datasets/tysonpo/ice-cream-dataset) on Kaggle:
1. Perform EDA to understand the market/consumer
1. Identify possible use cases
1. Select a use case and develop a proof of concept (POC)
1. Provide a recommendation with justification

_N.B. This analysis focuses on the combined products dataset, as it is the simplest of the combined sets._

## Import libraries

In [None]:
from datavizml import (
    ExploratoryDataAnalysis,
)  # home-made EDA library available at https://github.com/dibble07/datavizml/
import os
import pandas as pd
import re
from sklearn.preprocessing import MultiLabelBinarizer

## Load data

In [None]:
# load from csv into pandas
products_raw = pd.read_csv(os.path.join("data", "combined", "products.csv"))

## Explore the dataset

In [None]:
# show size, dtype and nullness of data
products_raw.info()

In [None]:
# display random sample of data
products_raw.sample(10)

In [None]:
# check if Ben and Jerry's is the only brand to populate subhead
products_raw[["brand", "subhead"]].groupby("brand").nunique()
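The check above relies on `nunique` ignoring missing values by default, so a brand that never fills in `subhead` shows a count of 0. A minimal sketch on a toy frame (the brand labels and subhead text are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# hypothetical toy data: only brand_a ever populates 'subhead'
toy = pd.DataFrame(
    {
        "brand": ["brand_a", "brand_a", "brand_b", "brand_b"],
        "subhead": ["Limited Batch", None, None, None],
    }
)

# count distinct non-null subheads per brand (NaN/None is ignored by default)
counts = toy.groupby("brand")["subhead"].nunique()
print(counts)
```

A count of 0 for every brand except one would support treating `subhead` as a brand-specific field.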

In [None]:
# drop columns that add no analytical value: 'key' is only used for joining to other datasets and 'subhead' is sparsely populated
products_raw.drop(columns=["key", "subhead"], inplace=True)

In [None]:
# function to clean an ingredients string and split it into a set of tokens
def process_string_list(x):

    # remove bracketed text
    # (greedy match: everything from the first opening bracket to the last closing bracket is removed)
    brackets_regex = re.compile(r"[\[({].*[\])}]")
    x = brackets_regex.sub(" ", x)

    # remove 'contains' warning and everything that follows it
    contains_regex = re.compile(r"CONTAIN.*")
    x = contains_regex.sub(" ", x)

    # remove special characters
    x = x.replace("†", " ").replace("/", " ").replace("\\", " ")

    # replace and/or with comma
    x = x.replace(" AND ", ",").replace(" OR ", ",")

    # split with comma, full stop or colon as delimiter
    x = x.replace(".", ",").replace(":", ",").split(",")

    # strip whitespace, drop empty tokens and de-duplicate
    x = {i.strip() for i in x if i.strip()}

    return x
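One detail worth checking when reviewing the tokenisation is the bracket pattern: because `.*` is greedy, a string with two bracketed groups loses everything between them. The sketch below compares the pattern used in `process_string_list` with a non-greedy variant on a made-up ingredient string (the string and the `lazy` pattern are illustrative assumptions, not part of the original analysis):

```python
import re

# hypothetical ingredient string, not taken from the dataset
sample = "MILK, COCOA (PROCESSED WITH ALKALI), SUGAR, VANILLA EXTRACT (WATER, ALCOHOL), SALT"

greedy = re.compile(r"[\[({].*[\])}]")  # same pattern as process_string_list
lazy = re.compile(r"[\[({].*?[\])}]")  # non-greedy alternative (assumption)


def tokens(pattern, text):
    # remove bracketed text, then split on commas and strip whitespace
    return {t.strip() for t in pattern.sub(" ", text).split(",") if t.strip()}


greedy_tokens = tokens(greedy, sample)
lazy_tokens = tokens(lazy, sample)

# ingredients that only survive with the non-greedy pattern
print(sorted(lazy_tokens - greedy_tokens))
```

The non-greedy variant is not a complete fix either, since nested brackets would leave a stray closing bracket behind, so inspecting the full token set remains the safest check.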

In [None]:
# convert ingredients from a comma separated string to a set of cleaned tokens
products_raw["ingredients"] = products_raw["ingredients"].apply(process_string_list)

In [None]:
# display all unique ingredients to check tokenisation
{val for sublist in products_raw["ingredients"] for val in sublist}

In [None]:
# vectorise ingredients

# initialise and fit multilabel binarizer
mlb = MultiLabelBinarizer()
mlb.fit(products_raw["ingredients"])

# save results to dataframe and drop list version
products_raw[mlb.classes_] = mlb.transform(products_raw["ingredients"]).astype(bool)
products_raw.drop(columns="ingredients", inplace=True)
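To make the vectorisation step concrete, here is a minimal sketch on a two-row toy frame. The product names and ingredient sets are invented for illustration, and building the indicator columns as a separate frame before joining is just one possible layout, not necessarily the approach used above:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# hypothetical toy data: each product has a set of ingredient tokens
toy = pd.DataFrame(
    {
        "name": ["Vanilla", "Chocolate"],
        "ingredients": [{"MILK", "SUGAR", "VANILLA"}, {"MILK", "SUGAR", "COCOA"}],
    }
)

# one boolean column per distinct ingredient; classes_ is sorted alphabetically
mlb_toy = MultiLabelBinarizer()
flags = pd.DataFrame(
    mlb_toy.fit_transform(toy["ingredients"]).astype(bool),
    columns=mlb_toy.classes_,
    index=toy.index,
)

# replace the set column with the indicator columns
toy_encoded = pd.concat([toy.drop(columns="ingredients"), flags], axis=1)
print(toy_encoded)
```

Each row ends up with `True` for the ingredients it contains and `False` for the rest, which is the shape a downstream rating model would consume.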

## Possible use cases

1. Identify ingredients that correlate with high customer ratings
    - Use Shapley values to calculate the impact of an ingredient on a rating
    - ~~Use the Apriori algorithm to identify frequent ingredient combinations~~ - frequent doesn't mean good or unique
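
As a rough illustration of the Shapley idea in the first use case, the sketch below computes exact Shapley values by enumerating every ingredient coalition. The `toy_rating` function and its numbers are entirely made up; in a real analysis it would be replaced by a model fitted to the ratings, and with many ingredients an approximation library would be needed because the enumeration grows exponentially.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating all coalitions (small inputs only)."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # weight of a coalition of size k that p then joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # marginal contribution of p to this coalition
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result


def toy_rating(ingredients):
    # hypothetical rating model with an interaction between two ingredients
    rating = 3.0
    if "CARAMEL" in ingredients:
        rating += 0.5
    if "SEA SALT" in ingredients:
        rating += 0.1
    if {"CARAMEL", "SEA SALT"} <= ingredients:
        rating += 0.4
    return rating


phi = shapley_values(["CARAMEL", "SEA SALT", "VANILLA"], toy_rating)
print(phi)
```

The interaction bonus is shared equally between the two ingredients that create it, and an ingredient that never changes the rating receives a value of zero, which is the behaviour that makes Shapley values attractive for attributing a rating to individual ingredients.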