## Introduction

In this project, we examine what makes an ice cream flavor more enjoyable to eat. We found an ice cream rating dataset on Kaggle that covers flavors from four ice cream producers, and we want to understand why some flavors are rated higher than others. Throughout this analysis, we test hypotheses about the types of ingredients in each ice cream and the toppings used, and we look for the mix of toppings and ingredients that could potentially produce the highest-rated flavor.

## Data Set Description

The dataset we chose is from Kaggle and contains reviews of multiple ice cream flavors across 4 major brands. Reviews comprise star ratings as well as descriptive text collected from the brand websites.

Link: [https://www.kaggle.com/tysonpo/ice-cream-dataset](https://www.kaggle.com/tysonpo/ice-cream-dataset)

- **products.csv**: contains information about each flavor
    - 242 observations
    - Variables: Ice cream key, brand name, name, subhead, description, rating, rating counts, ingredients
- **reviews.csv**: contains reviews for each flavor of ice cream
    - 21,674 observations
    - Variables: Date of the review, star rating, title, helpful review indicator, review text

Additional dataset:
- **dairy.txt**: contains dairy keywords
    - Source: https://www.godairyfree.org/dairy-free-grocery-shopping-guide/dairy-ingredient-list-2

The dataset had to be cleaned to facilitate the statistical analysis. This includes extracting the ingredient list, tokenizing it, and creating indicator variables. Additionally, we restrict the ingredients we track to the following: raspberries, peanuts, almonds, coffee, chocolate, and strawberries. More will be added to this list following analysis of ingredient prevalence across flavors and brands.
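
The cleaning steps described above can be sketched on a toy example. This is a minimal illustration rather than the notebook's exact pipeline: the two sample rows, the `TARGETS` list, and the helper `tokenize` are assumptions made up for demonstration.

```python
import re
import pandas as pd

# Hypothetical sample rows; the real data comes from products.csv.
sample = pd.DataFrame({
    "key": ["a_demo", "b_demo"],
    "ingredients": [
        "CREAM, SKIM MILK, SUGAR, PEANUTS, COCOA (PROCESSED WITH ALKALI)",
        "MILK, SUGAR, STRAWBERRIES, ALMONDS",
    ],
})

# A subset of the tracked ingredients, chosen for illustration.
TARGETS = ["PEANUTS", "ALMONDS", "STRAWBERRIES"]

def tokenize(ingredient_string):
    """Split on commas or periods that are not inside parentheses."""
    parts = re.split(r"[.,]\s*(?![^()]*\))", ingredient_string)
    return [p.strip() for p in parts if p.strip()]

tokens = sample["ingredients"].apply(tokenize)

# One boolean indicator column per tracked ingredient.
for target in TARGETS:
    sample[target.lower()] = tokens.apply(lambda toks, t=target: t in toks)

print(sample[["key"] + [t.lower() for t in TARGETS]])
```

Exact token matching is used here, so `PEANUTS` would not match a token such as `PEANUT OIL`; whether substring matching is preferable depends on the hypothesis being tested.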

## Data Preparation

In [1]:
import pandas as pd
import numpy as np
import os
import re

In [2]:
#read csv files

#Data from kaggle:
#https://www.kaggle.com/tysonpo/ice-cream-dataset

df = pd.read_csv("products.csv")
df_reviews = pd.read_csv("reviews.csv")

#dairy free keywords
#https://www.godairyfree.org/dairy-free-grocery-shopping-guide/dairy-ingredient-list-2
dairy = pd.read_csv("dairy.txt", header = None)


In [3]:
def get_ingredients_dict(split_regex, ingredient_series):
    
    """
    Split ingredients series by chr string specified by regex
    Parent- separate by commas, ignore parenthesis 
    Child- separate by commas
    Return dictionary with keys as ingredients, values as frequency
    
    """
    
    ingredient_dict = {}
    for i in range(len(ingredient_series)):
        ingredient_split = [x.lstrip() for x in re.split(split_regex, ingredient_series[i]) if x != ""]
        for element in ingredient_split:
            if element in ingredient_dict.keys():
                ingredient_dict[element] = ingredient_dict[element] + 1
            else:
                ingredient_dict[element] = 1
    return ingredient_dict

In [104]:
def split_ingredients(split_regex):
    
    """
    Split ingredients series by chr string specified by regex
    Parent- separate by commas, ignore parenthesis 
    Child- separate by commas
    Return dictionary with keys as flavor keys, values as split ingredients
    """
    
    ingredient_dict = {}
    
    for i in range(len(df)):
        ingredient_split = [x.lstrip() for x in re.split(split_regex, df.iloc[i]["ingredients"]) if x != ""]
        ingredient_dict[df.iloc[i]["key"]] = ingredient_split
        
    return ingredient_dict

In [86]:
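def indicator_df(split_regex):
    
    """
    Build a one-column DataFrame listing each unique ingredient
    found when splitting df["ingredients"] with split_regex
    """
    
    ing_dict = get_ingredients_dict(split_regex, df["ingredients"])
    #local name differs from the function name to avoid confusion
    unique_ingredients = pd.DataFrame({"ingredients" : list(ing_dict.keys())})
    
    return unique_ingredients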

In [74]:
def ingredient_indicator(ing_dict, ing_df, columns = True):
    
    """
    ing_dict: dictionary mapping each flavor key to its list of split ingredients
    ing_df: target DataFrame with one row per unique ingredient (column "ingredients")
    columns: if True (default), return flavors as rows and ingredients as columns;
             if False, return ingredients as rows and flavors as columns
    Returns DataFrame of boolean indicators
    """
    
    indicator_dict = {}
    
    for key in ing_dict.keys():
        indicator_dict[key] = [item in ing_dict[key] for item in ing_df["ingredients"]]
    
    indicator_df = ing_df.join(pd.DataFrame(data = indicator_dict))
    
    if columns:
        indicator_df = indicator_df.set_index("ingredients").transpose().reset_index().rename(columns = {"index" : "key"})
    
    return indicator_df

In [99]:
def contains_ingredient(target_ing_list, split_regex):
    
    """
    target_ing_list: List of ingredients to tokenize (dairy-free, etc.)
    split_regex: Syntax to split ingredients (parent, child)
    Returns DataFrame with tokenized ingredients given the ingredients list
    
    """
    split_ingredients_list = split_ingredients(split_regex)
    ind_df = indicator_df(split_regex)
    df_token = ingredient_indicator(split_ingredients_list, ind_df)
    
    df_target_ing = df_token[target_ing_list]
    
    return df_target_ing

In [41]:
def test_indicator(test_dict, test_df):
    
    indicator_dict = {}
    for key in test_dict.keys():
        print(key)
        indicator_dict[key] = [item in test_dict[key] for item in test_df["ingredients"]]

In [111]:
child_ing_indicator_df = ingredient_indicator(child_ing, indicator_df(child_regex), columns = False)

In [112]:
[item in list(dairy[0]) for item in child_ing_indicator_df["ingredients"]]

[False,
 False,
 False,
 ...
 False,


In [109]:
list(dairy[0])

['Acidophilus Milk',
 'Ammonium Caseinate',
 'Butter',
 'Butter Esters',
 'Butter Fat',
 'Butter Oil',
 'Butter Solids',
 'Buttermilk',
 'Buttermilk Powder',
 'Calcium Caseinate',
 'Casein',
 'Caseinate',
 'Cheese',
 'Condensed Milk',
 'Cottage Cheese',
 'Cream',
 'Cream Cheese',
 'Curds',
 'Custard',
 'Delactosed Whey',
 'Demineralized Whey',
 'Dry Milk Powder',
 'Dry Milk Solids',
 'Evaporated Milk',
 'Ghee',
 'Goat Cheese',
 'Goat Milk',
 'Half & Half',
 'Heavy Cream',
 'Hydrolyzed Casein',
 'Hydrolyzed Milk Protein',
 'Ice Cream',
 'Iron Caseinate',
 'Lactalbumin',
 'Lactoferrin',
 'Lactoglobulin',
 'Lactose',
 'Lactulose',
 'Low-Fat Milk',
 'Magnesium Caseinate',
 'Malted Milk',
 'Milk',
 'Milk Chocolate',
 'Milk Derivative',
 'Milk Fat',
 'Milk Powder',
 'Milk Protein',
 'Milk Protein Concentrate',
 'Milk Solids',
 'Milkfat',
 'Natural Butter Flavor',
 'Nonfat Dry Milk',
 'Nonfat Milk',
 'Nonfat Milk Solids',
 'Nougat',
 'Paneer',
 'Potassium Caseinate',
 'Pudding',
 'Recaldent',
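
Every value shown by the membership check above is `False`. The most likely cause is a case mismatch: the ingredient tokens are uppercase (for example `CREAM`) while the dairy keywords are title case (`Cream`), and `in` compares strings exactly. The sketch below shows a case-insensitive comparison; the two short lists are made-up stand-ins for the real columns.

```python
# Hypothetical stand-ins for dairy[0] and the "ingredients" column.
dairy_keywords = ["Butter", "Cream", "Milk", "Whole Milk Powder"]
ingredient_tokens = ["CREAM", "SUGAR", "WHOLE MILK POWDER", "COCOA"]

# Exact comparison: nothing matches because the casing differs.
exact = [tok in dairy_keywords for tok in ingredient_tokens]

# Normalize both sides to uppercase before comparing.
dairy_upper = {kw.upper() for kw in dairy_keywords}
normalized = [tok.strip().upper() in dairy_upper for tok in ingredient_tokens]

print(exact)       # [False, False, False, False]
print(normalized)  # [True, False, True, False]
```

Storing the normalized keywords in a set also avoids rebuilding a list for every lookup.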

In [7]:
parents_regex = r'[.,]\s*(?![^()]*\))'
child_regex = r'[.,:()]'
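
To make the difference between the two patterns concrete, the sketch below applies both to a single made-up ingredient string. The sample string is an assumption, and the child split uses `strip()` instead of the notebook's `lstrip()` so that the space before an opening parenthesis is also removed.

```python
import re

parents_regex = r'[.,]\s*(?![^()]*\))'
child_regex = r'[.,:()]'

# Hypothetical ingredient string, made up for illustration.
sample = "CREAM, LIQUID SUGAR (SUGAR, WATER), BUTTER (CREAM, SALT)."

# Parent split: commas inside parentheses are ignored.
parents = [x.lstrip() for x in re.split(parents_regex, sample) if x != ""]

# Child split: every comma, period, colon, and parenthesis is a delimiter.
children = [x.strip() for x in re.split(child_regex, sample) if x.strip() != ""]

print(parents)
# ['CREAM', 'LIQUID SUGAR (SUGAR, WATER)', 'BUTTER (CREAM, SALT)']
print(children)
# ['CREAM', 'LIQUID SUGAR', 'SUGAR', 'WATER', 'BUTTER', 'CREAM', 'SALT']
```

The parent pattern relies on a negative lookahead: a delimiter is rejected when a closing parenthesis follows it with no intervening parentheses, which is what keeps `(SUGAR, WATER)` intact.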

In [8]:
parents_ing = split_ingredients(parents_regex)
child_ing = split_ingredients(child_regex)

In [58]:
#use a new name so the indicator_df function is not overwritten
parents_indicator_df = indicator_df(parents_regex)

In [73]:
tokenize_df = ingredient_indicator(parents_ing, parents_indicator_df, columns = False)
tokenize_df

Unnamed: 0,ingredients,0_bj,1_bj,2_bj,3_bj,4_bj,5_bj,6_bj,7_bj,8_bj,...,59_breyers,60_breyers,61_breyers,62_breyers,63_breyers,64_breyers,65_breyers,66_breyers,67_breyers,68_breyers
0,CREAM,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
1,SKIM MILK,True,True,True,True,True,True,True,True,True,...,True,False,False,False,False,False,False,False,False,False
2,"LIQUID SUGAR (SUGAR, WATER)",True,True,True,True,True,True,True,True,True,...,False,False,False,False,False,False,False,False,False,False
3,WATER,True,True,True,True,True,True,True,True,True,...,True,False,True,False,True,True,True,True,False,True
4,BROWN SUGAR,True,False,False,False,False,True,False,False,False,...,False,False,False,True,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
446,CREAM (MILK),False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
447,THIAMIN MONONITRATE,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
448,SKIM MILK POWDER,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
449,SOY LECITHIN (EMULSIFIER),False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True


The cells below are a work in progress.

In [81]:
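#Collect parent ingredients that match a dairy keyword
#`ingredients` is assumed to hold the parent ingredient strings; it is not defined in the cells shown

dairy_keywords = set(dairy[0])
is_dairy = []
for ing_parent in ingredients:
    if ing_parent.title() in dairy_keywords:
        is_dairy.append(ing_parent)
    #also check each child ingredient inside the parent string
    ing_par_list = re.split(r'[.,:()]', ing_parent)
    for ing_child in ing_par_list:
        if ing_child.strip().title() in dairy_keywords:
            is_dairy.append(ing_parent)
dairy_list = list(set(is_dairy))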

In [82]:
dairy_list

['WEET CREAM ICE CREAM: CREAM',
 'BUTTER (MILK)',
 'CREAM (MILK)',
 'MILK CHOCOLATE',
 'CARAMEL ICE CREAM: CREAM',
 'MILK FAT',
 'BUTTER',
 'CREAM CHEESE (CREAM, MILK, CHEESE CULTURE, SALT, GUAR GUM, CAROB BEAN GUM, XANTHAN GUM)',
 'CREAM CHEESE',
 'SWEETENED CONDENSED MILK (CONDENSED MILK SUGAR)',
 'CONDENSED MILK',
 'HEAVY CREAM',
 'CONTAINS: MILK',
 'WHITE CHOCOLATE ICE CREAM: CREAM',
 'MILK',
 'CREAM CHEESE (MILK, CREAM, CHEESE CULTURE, SALT, CAROB GUM, GUAR GUM, XANTHAN GUM)',
 'ULCE DE LECHE ICE CREAM: CREAM',
 'BUTTER (CREAM)',
 'SWEETENED CONDENSED MILK',
 'ORGANIC BUTTER (CREAM, SALT)',
 'CHOCOLATE ICE CREAM: CREAM',
 'WHEY PROTEIN CONCENTRATE',
 'LACTOSE',
 'CARAMEL SWIRL: SWEETENED CONDENSED MILK (MILK, SUGAR)',
 'BUTTER (CREAM, SALT)',
 'MILK CHOCOLATE AND VEGETABLE OIL COATING: MILK CHOCOLATE (SUGAR, WHOLE MILK POWDER, CHOCOLATE, COCOA BUTTER, SOY LECITHIN, VANILLA EXTRACT)',
 'SWEETENED CONDENSED SKIM MILK (SKIM MILK, SUGAR)',
 'CREAM CHEESE (PASTEURIZED MILK, CREAM, CHEE

In [66]:
#pd.DataFrame(data = ingredients).sort_values(by = 0).to_csv("testdf.csv")

In [84]:
nondairy = [x for x in ingredients if x not in dairy_list]
pd.DataFrame(data = nondairy).to_csv("nondairy.csv")

In [91]:
df.sort_values(by = "rating").tail()

Unnamed: 0,brand,key,name,subhead,description,rating,rating_count,ingredients,ingredients_count
101,hd,44_hd,Peppermint Bark Ice Cream Bar,,Our peppermint bark ice cream bars start with ...,5.0,8,"WHITE CHOCOLATE ICE CREAM: CREAM, SKIM MILK, S...",11
7,bj,7_bj,Chocolate Peanut Butter Split,Chocolate & Banana Ice Creams with Mini Peanut...,We’ve loaded our banana and chocolate ice crea...,5.0,7,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),...",20
29,bj,29_bj,Ice Cream Sammie,Vanilla Ice Cream with Chocolate Sandwich Cook...,To capture the great taste of the classic ice ...,5.0,31,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),...",22
71,hd,14_hd,Chocolate Fudge Non-Dairy Bar,,Enjoy an indulgent non-dairy and vegan bar mad...,5.0,22,"WATER, SUGAR, CORN SYRUP, CHOCOLATE, COCONUT O...",18
120,hd,63_hd,Vanilla Caramel White Chocolate TRIO CRISPY LA...,,This incredible combination of tastes and text...,5.0,32,"CREAM, SKIM MILK, SUGAR, SWEETENED CONDENSED M...",38


In [86]:
nondairy

['ARTIFICIAL FLAVORS',
 'RAISINS',
 'PINEAPPLE',
 'CANE SUGAR',
 'POTATO FLOUR',
 'WHEAT BRAN',
 'ALMONDS ROASTED IN VEGETABLE OIL',
 'VEGETABLE JUICE (FOR COLOR), PEPPERMINT OIL, SOY LECITHIN)',
 'RASPBERRY PUREE',
 'PECANS',
 'SODIUM ACID PYROPHOSPHATE',
 'UNBLEACHED UNENRICHED WHEAT FLOUR',
 'ORANGE JUICE CONCENTRATE',
 'CHOCOLATY COATING: SUGAR',
 'GRAHAM CRUMB',
 'PUMPKIN PUREE',
 'ORGANIC CARAMELIZED SUGAR (ORGANIC CANE SUGAR, WATER)',
 'SORBITOL',
 'BLUEBERRY PUREE CONCENTRATE',
 'SEMI-SWEET CHOCOLATE CHUNKS',
 'OREO COOKIE PIECES',
 'CHERRY PUREE',
 'CITRIC ACID (TO MAINTAIN FRESHNESS)',
 'LEMON JUICE CONCENTRATE',
 'PGPR (EMULSIFIER)',
 'MAY CONTAIN OTHER TREE NUTS',
 'TBHQ AND CITRIC ACID (TO MAINTAIN FRESHNESS)',
 'COCOA PROCESSED WITH ALKALI',
 'SOYBEAN LECITHIN',
 'THIAMIN MONONITRATE',
 'COCONUT AND SOYBEAN OIL',
 'CHOCOLATE (PROCESSED WITH ALKALI)',
 'VEGETABLE GUMS',
 'YELLOW 6',
 'CHOCOLATE COOKIE PIECES',
 'WHOLE EGG',
 'VEGETABLE GUM (GUAR)',
 'LOCUST BEAN GUM',
 "RE

In [85]:
dairy

Unnamed: 0,0
0,Acidophilus Milk
1,Ammonium Caseinate
2,Butter
3,Butter Esters
4,Butter Fat
...,...
73,Whipped Topping
74,Whole Milk
75,Whole Milk Powder
76,Yogurt


In [21]:
##Food order

"""
Q. How are ingredients listed on a product label?
A. Food manufacturers are required to list all ingredients in the food on the label. 
On a product label, the ingredients are listed in order of predominance, 
with the ingredients used in the greatest amount first, followed in descending order by those in smaller amounts. 
The label must list the names of any FDA-certified color additives (e.g., FD&C Blue No. 1 or the abbreviated name, Blue 1). 
But some ingredients can be listed collectively as "flavors," "spices," "artificial flavoring,"
or in the case of color additives exempt from certification, "artificial colors", without naming each one. 
Declaration of an allergenic ingredient in a collective or single color, flavor, or spice 
could be accomplished by simply naming the allergenic ingredient in the ingredient list.

https://www.fda.gov/food/food-ingredients-packaging/overview-food-ingredients-additives-colors


"""



In [12]:
#Try to separate out sub keywords?

sub_ing_dict = {}
sub_ing_list = []
for i in range(len(ingredients)):
    sub_ing_i = ingredients[i]
    sub_ing_split = sub_ing_i.split(",")
    for ele in sub_ing_split:
        if ele in sub_ing_dict.keys():
            sub_ing_dict[ele] = sub_ing_dict[ele] + 1
        else:
            sub_ing_dict[ele] = 1
    sub_ing_list.append(sub_ing_split)
sub_ing_flat = [ing for sublist in sub_ing_list for ing in sublist]
sub_ingredients = list(set(sub_ing_flat))

In [13]:
#append ingredients_count column to dataframe

ingCount = []
top20_dict = {}
for i in range(len(df)):
    ing_split = re.split(r'[.,]\s*(?![^()]*\))', df["ingredients"][i])
    ingCount.append(len(ing_split))
    top20_dict[df["key"][i]] = ing_split[0:20]
df["ingredients_count"] = ingCount
df.sort_values(by = "ingredients_count", ascending = False)

Unnamed: 0,brand,key,name,subhead,description,rating,rating_count,ingredients,ingredients_count
204,breyers,32_breyers,SNICKERS® & M&M'S® 2in1,,SNICKERS® or M&M'S®? When it comes to your fav...,3.9,52,"SKIM MILK, SUGAR, SNICKERS PIECES*, MILK CHOCO...",77
217,breyers,45_breyers,New York Style Cheesecake,,"Can’t get enough of rich, creamy cheesecake? Y...",3.3,36,"NONFAT MILK, STRAWBERRY SWIRL, WATER, SUGAR, C...",69
240,breyers,68_breyers,Layered Dessert Brownie Cheesecake,,Love brownie cheesecake? What about Breyers®? ...,2.8,25,"MILK, CORN SYRUP, SUGAR, ENRICHED WHEAT FLOUR,...",67
203,breyers,31_breyers,SNICKERS®,,Breyers® joins forces with America’s favorite ...,4.4,109,"SKIM MILK, SUGAR, CARAMEL SWIRL, CORN SYRUP, W...",62
201,breyers,29_breyers,REESE'S & REESE'S PIECES 2in1,,REESE'S PIECES or REESE'S Peanut Butter Cups? ...,3.3,88,"MILK, CORN SYRUP, REESE'S PEANUT BUTTER CUP PI...",59
...,...,...,...,...,...,...,...,...,...
66,hd,9_hd,Chocolate Ice Cream,,"Rich, creamy, and totally indulgent. Made from...",4.9,90,"CREAM, SKIM MILK, CANE SUGAR, COCOA PROCESSED ...",5
116,hd,59_hd,Vanilla Ice Cream,,Vanilla is the essence of elegance and sophist...,3.0,228,"CREAM, SKIM MILK, CANE SUGAR, EGG YOLKS, VANIL...",5
172,breyers,0_breyers,Natural Vanilla,,Our Original Vanilla Ice Cream. The way vanill...,4.1,467,"MILK, CREAM, SUGAR, VEGETABLE GUM (TARA), NATU...",5
78,hd,21_hd,Coffee Ice Cream,,We roast the finest Brazilian coffee beans and...,4.6,173,"CREAM, SKIM MILK, CANE SUGAR, EGG YOLKS, COFFEE",5
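
The FDA note quoted in an earlier cell says that ingredients are listed in descending order of predominance, so an ingredient's position in the split list can serve as a rough ordinal proxy for how much of it a flavor contains. The sketch below assumes a hypothetical helper `ingredient_rank` (not part of the notebook) and uses the Coffee Ice Cream ingredient string from the table above.

```python
import re

PARENTS_REGEX = r'[.,]\s*(?![^()]*\))'

def ingredient_rank(ingredient_string, target):
    """Return the 1-based position of `target` among top-level ingredients, or None."""
    parts = [p.strip() for p in re.split(PARENTS_REGEX, ingredient_string) if p.strip()]
    for position, part in enumerate(parts, start=1):
        if part == target:
            return position
    return None

label = "CREAM, SKIM MILK, CANE SUGAR, EGG YOLKS, COFFEE"
print(ingredient_rank(label, "COFFEE"))   # 5
print(ingredient_rank(label, "PEANUTS"))  # None
```

Position is only an ordinal signal: it says that cream outweighs coffee in this flavor, not by how much.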


In [16]:
pd.DataFrame.from_dict(top20_dict, orient = "index")

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0_bj,CREAM,SKIM MILK,"LIQUID SUGAR (SUGAR, WATER)",WATER,BROWN SUGAR,SUGAR,MILK,WHEAT FLOUR,EGG YOLKS,CORN SYRUP,EGGS,"BUTTER (CREAM, SALT)",BUTTEROIL,PECTIN,SEA SALT,SOYBEAN OIL,VANILLA EXTRACT,GUAR GUM,SOY LECITHIN,"BAKING POWDER (SODIUM ACID PYROPHOSPHATE, SODI..."
1_bj,CREAM,SKIM MILK,"LIQUID SUGAR (SUGAR, WATER)",WATER,SUGAR,PEANUTS,WHEAT FLOUR,CANOLA OIL,EGG YOLKS,CORN STARCH,PEANUT OIL,COCOA POWDER,SALT,SOYBEAN OIL,INVERT CANE SUGAR,MILK FAT,EGGS,EGG WHITES,GUAR GUM,SOY LECITHIN
2_bj,CREAM,"LIQUID SUGAR (SUGAR, WATER)",SKIM MILK,WATER,SUGAR,COCOA (PROCESSED WITH ALKALI),POTATO,COCONUT OIL,CORN SYRUP SOLIDS,SOYBEAN OIL,EGG YOLKS,RICE STARCH,SUNFLOWER OIL,BARLEY MALT,COCOA POWDER,WHEAT FLOUR,MILK,SALT,SOY LECITHIN,YEAST EXTRACT
3_bj,CREAM,SKIM MILK,"LIQUID SUGAR (SUGAR, WATER)",WATER,CORN SYRUP,COCONUT OIL,SUGAR,DRIED CANE SYRUP,EGG YOLKS,WHEAT FLOUR,MILK,COCOA,NATURAL FLAVOR,GUAR GUM,SOY LECITHIN,BUTTER OIL,NATURAL FLAVORS,LOCUST BEAN GUM,SALT,CITRIC ACID
4_bj,CREAM,SKIM MILK,WATER,"LIQUID SUGAR (SUGAR, WATER)",SUGAR,CANOLA OIL,SOYBEAN OIL,EGG YOLKS,CORN SYRUP,WHEAT FLOUR,COCONUT OIL,CORN STARCH,COCOA (PROCESSED WITH ALKALI),CORN SYRUP SOLIDS,COCOA,GRAHAM FLOUR,SALT,EGG WHITES,BUTTEROIL,TAPIOCA STARCH
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64_breyers,MILK,CORN SYRUP,SUGAR,BROWN SUGAR,SOYBEAN OIL,WATER,BUTTER,CREAM,SALT,CORN SYRUP,SPICE,SALT,SOYBEAN LECITHIN,VANILLA EXTRACT,DRIED CANE SYRUP,UNBLEACHED UNENRICHED WHEAT FLOUR,COCONUT OIL,WATER,BUTTER,CREAM
65_breyers,MILK,WATER,CARAMEL SWIRL,SUGAR,WATER,CORN SYRUP,HIGH FRUCTOSE CORN SYRUP,NONFAT MILK SOLIDS,BUTTER,CREAM,SALT,SALT,MOLASSES,PECTIN,SOY LECITHIN,NATURAL FLAVOR,POTASSIUM SORBATE (PRESERVATIVE),SODIUM CITRATE,LACTIC ACID,MALTITOL SYRUP
66_breyers,MILK,CORN SYRUP,SUGAR,WHEAT FLOUR,BUTTER,CREAM (MILK),SALT,PALM OIL,CORN SYRUP,NONFAT MILK,WATER,RICE FLOUR,NATURAL FLAVORS,SALT,WHEAT FLOUR,SUGAR,PALM OIL,MOLASSES,SPICES,SALT
67_breyers,MILK,CORN SYRUP,ENRICHED WHEAT FLOUR,WHEAT FLOUR,NIACIN,REDUCED IRON,THIAMIN MONONITRATE,RIBOFLAVIN,FOLIC ACID,SUGAR,BUTTER,CREAM,SALT,CANOLA OIL,SKIM MILK POWDER,SALT,SODIUM BICARBONATE,PEACHES,FRUCTOSE,COCONUT OIL
