## Introduction

In this project, we examine what makes an ice cream flavor more enjoyable to eat. We use an ice cream rating dataset from Kaggle, which covers the flavors of four ice cream producers, to understand why some flavors are rated higher than others. Throughout the analysis, we test hypotheses about the ingredients a flavor contains and the toppings used, and we look for the mix of toppings and ingredients that could potentially produce the best-rated flavor.

## Data Set Description

The dataset we chose is from Kaggle and contains reviews of multiple ice cream flavors across 4 major brands. Reviews comprise star ratings as well as descriptive text, which was collected from the brand websites.

Link: [https://www.kaggle.com/tysonpo/ice-cream-dataset](https://www.kaggle.com/tysonpo/ice-cream-dataset)

- **products.csv**: contains information about each flavor
    - 242 observations
    - Variables: Ice cream key, brand name, name, subhead, description, rating, rating counts, ingredients
- **reviews.csv**: contains reviews for each flavor of ice cream
    - 21,674 observations
    - Variables: Date of the review, star rating, title, helpful review indicator, review text

Additional dataset:
- **dairy.txt**: contains dairy keywords
    - Source: https://www.godairyfree.org/dairy-free-grocery-shopping-guide/dairy-ingredient-list-2

The dataset had to be cleaned to facilitate the statistical analysis. This includes extracting the ingredient list, tokenizing it, and creating indicator variables. Additionally, we restrict the ingredients we track to the following: raspberries, peanuts, almonds, coffee, chocolate, and strawberries. More will be added to this list following analysis of ingredient prevalence across flavors and brands.
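
The cleaning steps described above (tokenize the ingredient text, then derive indicator variables) can be sketched in plain Python. This is only a minimal illustration: the `sample` string and the `tracked` list are made up for the example and are not rows or settings from `products.csv`.

```python
import re

# Hypothetical ingredient string, invented for illustration only.
sample = "CREAM, SKIM MILK, SUGAR, PEANUTS, COCOA POWDER."

# Tokenize: split on commas and periods, trim whitespace, drop empty tokens.
tokens = [t.strip() for t in re.split(r"[.,]", sample) if t.strip()]

# Indicator variables: True when a tracked ingredient appears as an exact token.
tracked = ["PEANUTS", "ALMONDS", "COFFEE"]
indicators = {name: name in tokens for name in tracked}

print(tokens)
print(indicators)
```

Because the indicators rely on exact token matches, `COCOA POWDER` would not count as `COCOA` in this sketch.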

In [1]:
import pandas as pd
import numpy as np
import os
import re

In [2]:
#read csv files

#Data from kaggle:
#https://www.kaggle.com/tysonpo/ice-cream-dataset

df = pd.read_csv("products.csv")
df_reviews = pd.read_csv("reviews.csv")

#dairy free keywords
#https://www.godairyfree.org/dairy-free-grocery-shopping-guide/dairy-ingredient-list-2
dairy = pd.read_csv("dairy.txt", header = None)


## Functions

In [89]:
def split_ingredients(split_regex, ing, ignore_organic = True):
    
    """
    Split ingredients string given specified regex
    Ignores blanks
    Can ignore Organics label (i.e. Milk = Organic Milk), default TRUE
    Returns list of elements
    """
    
    split_list = [x.lstrip() for x in re.split(split_regex, ing) if x != ""]
    
    if ignore_organic:
        return_list = [re.sub("ORGANIC", "", item).lstrip() for item in split_list]
    else:
        return_list = split_list
    
    return return_list

def get_ingredients_count(split_regex, df, ignore_organic = True):
    
    """
    Split ingredients series by chr string specified by regex
    Parent- separate by commas, ignore parenthesis 
    Child- separate by commas
    Return dictionary with keys as ingredients, values as frequency
    
    """
    
    ingredient_dict = {}
    
    #Iterate over values rather than positions so a non-default index still works
    for ingredients in df["ingredients"]:
        for element in split_ingredients(split_regex, ingredients, ignore_organic):
            ingredient_dict[element] = ingredient_dict.get(element, 0) + 1
    return ingredient_dict

def get_ingredients(split_regex, df, ignore_organic = True):
    
    """
    Split ingredients series by chr string specified by regex
    Parent- separate by commas, ignore parenthesis 
    Child- separate by commas
    Return dictionary with keys as flavor keys, values as split ingredients
    """
    
    ingredient_dict = {}
    
    for i in range(len(df)):
        ingredient_split = split_ingredients(split_regex, df.iloc[i]["ingredients"], ignore_organic)
        ingredient_dict[df.iloc[i]["key"]] = ingredient_split
        
    return ingredient_dict


def get_indicator_df(split_regex, target_ing, df, ingCols = True, ignore_organic = True):
    
    """
    Split ingredients specified by regex
    Returns dataFrame with flavors and True/False values for all ingredients in target_ing
    ingCols- returns the ingredients as columns, flavors as rows (default)
    
    """
    
    ing_dict = get_ingredients(split_regex, df, ignore_organic)
    ing_list = list(get_ingredients_count(split_regex, df, ignore_organic).keys())
    target_ing = [ing.upper() for ing in target_ing if ing.upper() in ing_list]
    
    indicator_dict = {}
    target_ing_df = pd.DataFrame({"ingredients" : target_ing})
    
    for key in ing_dict.keys():
        indicator_dict[key] = [item in ing_dict[key] for item in target_ing_df["ingredients"]]
    
    indicator_df = target_ing_df.join(pd.DataFrame(data = indicator_dict))
    
    if ingCols:
        indicator_df = indicator_df.set_index("ingredients").transpose() 
        #\
        #.reset_index().rename(columns = {"index" : "key"}).rename_axis(None, axis = 1)
    
    return indicator_df



def merge_indicator_df(split_regex, target_ing, df, ignore_organic = True):
    """
    Split ingredients specified by regex
    Appends an indicator column for each of the target ingredients in list,
    plus ANY / ALL summary columns
    Returns appended dataFrame
    """

    #Pass ignore_organic through so the caller's setting is respected
    df_indicator = get_indicator_df(split_regex, target_ing, df, ignore_organic = ignore_organic)
    df_indicator["ANY"] = df_indicator.any(axis = 1)
    df_indicator["ALL"] = df_indicator.all(axis = 1)
    df_merge = df.merge(df_indicator.reset_index().rename(columns = {"index" : "key"}), on = "key")

    #flavors = list(df_indicator.loc[df_indicator["any"] == False].index)
    #subset_df = df.loc[df["key"].isin(flavors)]
    
    return df_merge


def get_flavors(split_regex, target_ing, df, criteria_any = True, ignore_organic = True):
    
    """
    Split ingredients specified by regex
    Returns dataFrame of only those flavors that meet the criteria
    Criteria can be either any or all of the target ingredients (default: any)
    """
    
    df_merge = merge_indicator_df(split_regex, target_ing, df, ignore_organic)
    
    if criteria_any:
        df_filter = df_merge.loc[df_merge["ANY"] == True]
    else:
        df_filter = df_merge.loc[df_merge["ALL"] == True]
    
    return df_filter
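
The frequency count built by `get_ingredients_count` can also be expressed with `collections.Counter`. The sketch below is a hedged alternative and not the notebook's implementation: the two ingredient strings are invented for illustration, and it strips whitespace on both sides of each token, which differs slightly from `split_ingredients`.

```python
import re
from collections import Counter

# Hypothetical ingredient strings, invented for illustration only.
ingredient_strings = [
    "CREAM, SUGAR, ORGANIC MILK",
    "CREAM, WATER, SUGAR",
]

counts = Counter()
for text in ingredient_strings:
    # Split on the same characters as the child regex, dropping blank tokens.
    tokens = [t.strip() for t in re.split(r"[.,:()]", text) if t.strip()]
    # Treat "ORGANIC MILK" as "MILK", mirroring ignore_organic = True.
    tokens = [re.sub("ORGANIC", "", t).strip() for t in tokens]
    counts.update(tokens)

print(counts.most_common())
```

`Counter.update` adds one per occurrence, so an ingredient listed twice in the same flavor is counted twice, as in the dictionary version.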

In [83]:
#Example functions

target_ing = ["chocolate", "cream", "mango", "sodium"]
get_indicator_df(r'[.,:()]', target_ing, df)

ingredients,CHOCOLATE,CREAM,MANGO
0_bj,False,True,False
1_bj,False,True,False
2_bj,False,True,False
3_bj,False,True,False
4_bj,False,True,False
...,...,...,...
64_breyers,False,True,False
65_breyers,False,True,False
66_breyers,False,True,False
67_breyers,False,True,False


## Prepare indicator dataframes

In [64]:
parents_regex = r'[.,]\s*(?![^()]*\))' #Ignore parentheses: i.e. Chocolate (Cocoa, Sugar, ...etc) stays one token
child_regex = r'[.,:()]' #Separates at parentheses: i.e. Chocolate, Cocoa, Sugar are considered separately
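
To make the difference between the two patterns concrete, the sketch below applies both to a single made-up ingredient string. The `sample` text is an assumption for illustration, and the tokens are stripped on both sides, which is slightly stricter than `split_ingredients`.

```python
import re

parents_regex = r'[.,]\s*(?![^()]*\))'  # split on . or , unless inside parentheses
child_regex = r'[.,:()]'                # split on every delimiter, parentheses included

# Hypothetical ingredient string, invented for illustration only.
sample = "CREAM, CHOCOLATE (COCOA, SUGAR), SALT."

def tokenize(pattern, text):
    """Split text on pattern, trim whitespace, and drop empty tokens."""
    return [t.strip() for t in re.split(pattern, text) if t.strip()]

parents = tokenize(parents_regex, sample)
children = tokenize(child_regex, sample)

print(parents)   # ['CREAM', 'CHOCOLATE (COCOA, SUGAR)', 'SALT']
print(children)  # ['CREAM', 'CHOCOLATE', 'COCOA', 'SUGAR', 'SALT']
```

The negative lookahead in `parents_regex` rejects a delimiter whenever a closing parenthesis follows with no opening parenthesis in between, which is what keeps the nested list attached to its parent ingredient.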

## Case 1: Dairy-Free Indicators

In [84]:
dairy_keywords = list(dairy[0].apply(lambda x: x.upper())) #From dairy.txt

dairy_df = get_indicator_df(child_regex, dairy_keywords, df) 

df_dairy_keys = list(get_flavors(child_regex, dairy_keywords, df)["key"])

#Keep flavors whose key is not among those containing any dairy keyword
dairy_free = df.loc[~df["key"].isin(df_dairy_keys)]

In [85]:
dairy_free.name #Check flavor names

71                    Chocolate Fudge Non-Dairy Bar
73         Chocolate Salted Fudge Truffle Non-Dairy
75                        Coconut Caramel Non-Dairy
77     Coconut Caramel Dark Chocolate Non-Dairy Bar
91                                     Lemon Sorbet
94                                     Mango Sorbet
98          Peanut Butter Chocolate Fudge Non-Dairy
99      Peanut Butter Chocolate Fudge Non-Dairy Bar
104                                Raspberry Sorbet
127                         ALPHONSO MANGO SORBETTO
139                        COCONUT CHOCOLATE COOKIE
142                       COLD BREW COFFEE SORBETTO
144                         DARK CHOCOLATE SORBETTO
159                    PEANUT BUTTER FUDGE SORBETTO
164                        ROMAN RASPBERRY SORBETTO
168                    STRAWBERRY HIBISCUS SORBETTO
198                 Non-Dairy Vanilla Peanut Butter
222                 Non-Dairy OREO® Cookies & Cream
Name: name, dtype: object

This returns only the flavors that are dairy free.
We also want the full flavor list with a dairy indicator for each flavor.

In [86]:
#Merge with original dataFrame, drop unnecessary columns.

dairy_indicator_df = merge_indicator_df(child_regex, dairy_keywords, df)

keep_cols = [
    "brand",
    "key",
    "name",
    "rating",
    "rating_count",
    "ANY"
]

dairy_cleaned = dairy_indicator_df[keep_cols].rename(columns = {"ANY" : "contains_dairy"})

#Export to csv

dairy_cleaned.to_csv("dairy_indicators.csv")

## Case 2: Organic Ice Cream

In [90]:
#target_ing must be a list; a bare string would be iterated character by character
organic_df = get_indicator_df(child_regex, ["ORGANIC"], df, ignore_organic = False)
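
`get_indicator_df` matches whole tokens, so it can only flag `ORGANIC` when a token is exactly that word. If the label instead appears as a prefix (for example `ORGANIC MILK`, as the `split_ingredients` docstring suggests), a substring test may be closer to what is needed. The sketch below illustrates that idea under this assumption; the `ing_by_flavor` dictionary and its flavor keys are hypothetical and not taken from the dataset.

```python
# Hypothetical tokenized ingredients keyed by flavor, invented for illustration.
ing_by_flavor = {
    "flavor_a": ["ORGANIC MILK", "ORGANIC CANE SUGAR", "SALT"],
    "flavor_b": ["CREAM", "SUGAR"],
}

# Flag a flavor as organic when any token contains the word ORGANIC.
contains_organic = {
    key: any("ORGANIC" in token for token in tokens)
    for key, tokens in ing_by_flavor.items()
}

print(contains_organic)
```

Whether a single organic ingredient should make the whole flavor count as organic is a modeling choice that this sketch does not settle.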


In [96]:
ing_dict = get_ingredients(child_regex, df, ignore_organic = False)

In [105]:
ing_list = list(get_ingredients_count(child_regex, df, ignore_organic = False).keys())
target_ing = [ing.upper() for ing in target_ing if ing.upper() in ing_list]

In [106]:
target_ing

[]

In [111]:
test_dict_count = get_ingredients_count(child_regex, df, False)
test_dict_ing = get_ingredients(child_regex, df, False)

In [None]:
test_ing_list = list(test_dict_count.keys())
target_ing_test = "ORGANIC"



In [112]:
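#Scratch copy of the get_indicator_df body; these names are function parameters there,
#so they are set explicitly here to let the cell run at top level
split_regex, ignore_organic, ingCols = child_regex, False, True

ing_dict = get_ingredients(split_regex, df, ignore_organic)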
ing_list = list(get_ingredients_count(split_regex, df, ignore_organic).keys())
target_ing = [ing.upper() for ing in target_ing if ing.upper() in ing_list]

indicator_dict = {}
target_ing_df = pd.DataFrame({"ingredients" : target_ing})

for key in ing_dict.keys():
    indicator_dict[key] = [item in ing_dict[key] for item in target_ing_df["ingredients"]]

indicator_df = target_ing_df.join(pd.DataFrame(data = indicator_dict))

if ingCols:
    indicator_df = indicator_df.set_index("ingredients").transpose() 
    #\
    #.reset_index().rename(columns = {"index" : "key"}).rename_axis(None, axis = 1)

{'0_bj': ['CREAM',
  'SKIM MILK',
  'LIQUID SUGAR ',
  'SUGAR',
  'WATER',
  'WATER',
  'BROWN SUGAR',
  'SUGAR',
  'MILK',
  'WHEAT FLOUR',
  'EGG YOLKS',
  'CORN SYRUP',
  'EGGS',
  'BUTTER ',
  'CREAM',
  'SALT',
  'BUTTEROIL',
  'PECTIN',
  'SEA SALT',
  'SOYBEAN OIL',
  'VANILLA EXTRACT',
  'GUAR GUM',
  'SOY LECITHIN',
  'BAKING POWDER ',
  'SODIUM ACID PYROPHOSPHATE',
  'SODIUM BICARBONATE',
  'CORN STARCH',
  'MONOCALCIUM PHOSPHATE',
  'BAKING SODA',
  'SALT',
  'CARRAGEENAN',
  'LACTASE'],
 '1_bj': ['CREAM',
  'SKIM MILK',
  'LIQUID SUGAR ',
  'SUGAR',
  'WATER',
  'WATER',
  'SUGAR',
  'PEANUTS',
  'WHEAT FLOUR',
  'CANOLA OIL',
  'EGG YOLKS',
  'CORN STARCH',
  'PEANUT OIL',
  'COCOA POWDER',
  'SALT',
  'SOYBEAN OIL',
  'INVERT CANE SUGAR',
  'MILK FAT',
  'EGGS',
  'EGG WHITES',
  'GUAR GUM',
  'SOY LECITHIN',
  'TAPIOCA STARCH',
  'BAKING SODA',
  'CARRAGEENAN',
  'VANILLA EXTRACT',
  'BARLEY MALT',
  'MALTED BARLEY FLOUR'],
 '2_bj': ['CREAM',
  'LIQUID SUGAR ',
  'SUGAR'

In [95]:
ingCols = True #parameter of get_indicator_df; set here so the cell runs at top level
indicator_dict = {}
target_ing_df = pd.DataFrame({"ingredients" : target_ing})

for key in ing_dict.keys():
    indicator_dict[key] = [item in ing_dict[key] for item in target_ing_df["ingredients"]]

indicator_df = target_ing_df.join(pd.DataFrame(data = indicator_dict))

if ingCols:
    indicator_df = indicator_df.set_index("ingredients").transpose() 

ingredients
0_bj
1_bj
2_bj
3_bj
4_bj
...
64_breyers
65_breyers
66_breyers
67_breyers
