## Predicting Ice Cream Rating

Given *data about various ice creams and their ingredients*, let's try to predict the **average user rating** of a given ice cream.

We will use linear regression, first without regularization and then with L2 (Ridge) and L1 (Lasso) penalties.

Data source: https://www.kaggle.com/datasets/tysonpo/ice-cream-dataset

### Importing Libraries

In [33]:
import numpy as np
import pandas as pd

import re
from nltk.stem import PorterStemmer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression, Ridge, Lasso

In [34]:
data = pd.read_csv('products.csv')
data.head()

Unnamed: 0,brand,key,name,subhead,description,rating,rating_count,ingredients
0,bj,0_bj,Salted Caramel Core,Sweet Cream Ice Cream with Blonde Brownies & a...,Find your way to the ultimate ice cream experi...,3.7,208,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),..."
1,bj,1_bj,Netflix & Chilll'd™,Peanut Butter Ice Cream with Sweet & Salty Pre...,There’s something for everyone to watch on Net...,4.0,127,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),..."
2,bj,2_bj,Chip Happens,A Cold Mess of Chocolate Ice Cream with Fudge ...,Sometimes “chip” happens and everything’s a me...,4.7,130,"CREAM, LIQUID SUGAR (SUGAR, WATER), SKIM MILK,..."
3,bj,3_bj,Cannoli,Mascarpone Ice Cream with Fudge-Covered Pastry...,As a Limited Batch that captured the rapture o...,3.6,70,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),..."
4,bj,4_bj,Gimme S’more!™,Toasted Marshmallow Ice Cream with Chocolate C...,It’s a gimme: there’s always room for s’more. ...,4.5,281,"CREAM, SKIM MILK, WATER, LIQUID SUGAR (SUGAR, ..."


### Preprocessing

In [35]:
data = data.drop(['key', 'name', 'subhead', 'description'], axis=1)
data

Unnamed: 0,brand,rating,rating_count,ingredients
0,bj,3.7,208,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),..."
1,bj,4.0,127,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),..."
2,bj,4.7,130,"CREAM, LIQUID SUGAR (SUGAR, WATER), SKIM MILK,..."
3,bj,3.6,70,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),..."
4,bj,4.5,281,"CREAM, SKIM MILK, WATER, LIQUID SUGAR (SUGAR, ..."
...,...,...,...,...
236,breyers,4.0,28,"MILK, CORN SYRUP, SUGAR, BROWN SUGAR, SOYBEAN ..."
237,breyers,4.7,18,"MILK, WATER, CARAMEL SWIRL, SUGAR, WATER, CORN..."
238,breyers,2.5,31,"MILK, CORN SYRUP, SUGAR, WHEAT FLOUR, BUTTER, ..."
239,breyers,3.2,38,"MILK, CORN SYRUP, ENRICHED WHEAT FLOUR, WHEAT ..."


In [36]:
data = data.drop(data.query('rating_count < 10').index, axis=0).reset_index(drop=True)
data

Unnamed: 0,brand,rating,rating_count,ingredients
0,bj,3.7,208,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),..."
1,bj,4.0,127,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),..."
2,bj,4.7,130,"CREAM, LIQUID SUGAR (SUGAR, WATER), SKIM MILK,..."
3,bj,3.6,70,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),..."
4,bj,4.5,281,"CREAM, SKIM MILK, WATER, LIQUID SUGAR (SUGAR, ..."
...,...,...,...,...
231,breyers,4.0,28,"MILK, CORN SYRUP, SUGAR, BROWN SUGAR, SOYBEAN ..."
232,breyers,4.7,18,"MILK, WATER, CARAMEL SWIRL, SUGAR, WATER, CORN..."
233,breyers,2.5,31,"MILK, CORN SYRUP, SUGAR, WHEAT FLOUR, BUTTER, ..."
234,breyers,3.2,38,"MILK, CORN SYRUP, ENRICHED WHEAT FLOUR, WHEAT ..."


In [37]:
data = data.drop('rating_count', axis=1)
data

Unnamed: 0,brand,rating,ingredients
0,bj,3.7,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),..."
1,bj,4.0,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),..."
2,bj,4.7,"CREAM, LIQUID SUGAR (SUGAR, WATER), SKIM MILK,..."
3,bj,3.6,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),..."
4,bj,4.5,"CREAM, SKIM MILK, WATER, LIQUID SUGAR (SUGAR, ..."
...,...,...,...
231,breyers,4.0,"MILK, CORN SYRUP, SUGAR, BROWN SUGAR, SOYBEAN ..."
232,breyers,4.7,"MILK, WATER, CARAMEL SWIRL, SUGAR, WATER, CORN..."
233,breyers,2.5,"MILK, CORN SYRUP, SUGAR, WHEAT FLOUR, BUTTER, ..."
234,breyers,3.2,"MILK, CORN SYRUP, ENRICHED WHEAT FLOUR, WHEAT ..."


In [38]:
# Collect every unique raw ingredient string (split on commas only)
all_ingredients = set()

for ingredients in data['ingredients']:
    all_ingredients.update(ingredients.split(','))

In [39]:
all_ingredients

{'  WATER',
 ' ACESULFAME POTASSIUM',
 ' ALMOND EXTRACT',
 ' ALMONDS',
 ' ALMONDS ROASTED IN VEGETABLE OIL',
 ' AND/OR BAKING SODA',
 ' AND/OR CALCIUM PHOSPHATE',
 ' AND/OR CANOLA OIL',
 ' AND/OR PALM OIL',
 ' AND/OR SUNFLOWER OIL)',
 ' ANHYDROUS MILKFAT',
 ' ANNATTO (COLOR)',
 ' ANNATTO (FOR COLOR)',
 ' APPLE JUICE',
 ' ARTIFICIAL COLOR',
 ' ARTIFICIAL FLAVOR',
 ' ARTIFICIAL FLAVORING',
 ' ARTIFICIAL FLAVORS',
 ' ASCORBIC ACID',
 ' BAKING POWDER',
 ' BAKING POWDER (SODIUM ACID PYROPHOSPHATE',
 ' BAKING SODA',
 ' BAKING SODA AND/OR CALCIUM PHOSPHATE',
 ' BAKING SODA. CONTAINS MILK',
 ' BALSAMIC VINEGAR (RED WINE VINEGAR',
 ' BANANA PUREE',
 ' BANANAS',
 ' BARLEY MALT',
 ' BEET JUICE (FOR COLOR)',
 ' BELGIAN CHOCOLATE',
 ' BLACK CARROT CONCENTRATE (FOR COLOR)',
 ' BLACK CHERRIES',
 ' BLACK RASPBERRIES',
 ' BLACK RASPBERRY PUREE',
 ' BLACKBERRY JUICE CONCENTRATE',
 ' BLEACHED WHEAT FLOUR',
 ' BLUE 1',
 ' BLUE 1 LAKE',
 ' BLUE 2',
 ' BLUE 2 LAKE',
 ' BLUEBERRIES',
 ' BLUEBERRY PUREE CONCE
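
The raw set above shows why a plain comma split is not enough: entries keep their leading spaces, and a parenthesized sub-list such as `BAKING POWDER (SODIUM ACID PYROPHOSPHATE, ...)` is cut in the middle. A small sketch on a shortened, made-up ingredient string (not a row from the dataset) makes both problems visible:

```python
# Hypothetical, shortened ingredient string used only for illustration.
raw = "CREAM, LIQUID SUGAR (SUGAR, WATER), BAKING SODA"

naive = raw.split(',')
print(naive)

# Leading spaces survive the split, so ' WATER' and 'WATER' would be
# counted as different ingredients.
has_leading_space = [item for item in naive if item.startswith(' ')]
print(has_leading_space)

# The parenthesized sub-list is split in the middle, leaving fragments
# with unbalanced parentheses.
unbalanced = [item for item in naive if item.count('(') != item.count(')')]
print(unbalanced)
```

Both effects inflate the number of distinct "ingredients", which is what the cleaning function in the next cell is meant to address.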

In [40]:
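def process_ingredients(ingredients):
    """Turn a raw ingredient string into a list of normalized ingredient names."""
    ps = PorterStemmer()
    # Drop parenthesized sub-ingredient lists, e.g. "LIQUID SUGAR (SUGAR, WATER)"
    new_ingredients = re.sub(r'\(.*?\)', '', ingredients)
    # Drop everything from "CONTAINS:" to the end of the string
    new_ingredients = re.sub(r'CONTAINS:.*$', '', new_ingredients)
    # Treat periods, " AND/OR " and " AND " as separators, like commas
    new_ingredients = re.sub(r'\.', ',', new_ingredients)
    new_ingredients = re.sub(r' AND/OR ', ',', new_ingredients)
    new_ingredients = re.sub(r' AND ', ',', new_ingredients)
    new_ingredients = new_ingredients.split(',')
    for i in range(len(new_ingredients)):
        # Remove stray symbols and any "LABEL:" prefix
        new_ingredients[i] = new_ingredients[i].replace('+', '').replace('*', ' ').replace(')', '').replace('/', ' ')
        new_ingredients[i] = re.sub(r'^.+:', '', new_ingredients[i])
        # Stem the stripped phrase; stem() also lowercases it
        new_ingredients[i] = ps.stem(new_ingredients[i].strip())
        if new_ingredients[i] == 'milk fat':
            new_ingredients[i] = 'milkfat'
    return new_ingredients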
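
To see what each substitution contributes, the regular-expression steps can be traced on a single string. The sketch below uses a made-up ingredient string and leaves out the Porter stemming step (it lowercases with `str.lower()` instead) so that it runs with only the standard library. `clean_only` is a hypothetical helper name, not part of the notebook.

```python
import re

def clean_only(ingredients):
    """Regex cleanup in the style of process_ingredients, without stemming."""
    text = re.sub(r'\(.*?\)', '', ingredients)   # drop "(...)" sub-lists
    text = re.sub(r'CONTAINS:.*$', '', text)     # drop trailing "CONTAINS: ..."
    text = re.sub(r'\.', ',', text)              # periods act as separators
    text = re.sub(r' AND/OR ', ',', text)
    text = re.sub(r' AND ', ',', text)
    parts = []
    for part in text.split(','):
        part = part.replace('+', '').replace('*', ' ').replace(')', '').replace('/', ' ')
        part = re.sub(r'^.+:', '', part)         # drop any "LABEL:" prefix
        parts.append(part.strip().lower())
    return parts

# Hypothetical input, not a row from the dataset.
sample = "CREAM, LIQUID SUGAR (SUGAR, WATER), SOYBEAN AND/OR CANOLA OIL, SALT. CONTAINS: MILK"
cleaned = clean_only(sample)
print(cleaned)
# ['cream', 'liquid sugar', 'soybean', 'canola oil', 'salt', '']
```

The trailing empty string comes from the text left after the final separator; entries like it are why the empty string is later removed from `all_ingredients`.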

In [41]:
data.loc[0, 'ingredients']

'CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER), WATER, BROWN SUGAR, SUGAR, MILK, WHEAT FLOUR, EGG YOLKS, CORN SYRUP, EGGS, BUTTER (CREAM, SALT), BUTTEROIL, PECTIN, SEA SALT, SOYBEAN OIL, VANILLA EXTRACT, GUAR GUM, SOY LECITHIN, BAKING POWDER (SODIUM ACID PYROPHOSPHATE, SODIUM BICARBONATE, CORN STARCH, MONOCALCIUM PHOSPHATE), BAKING SODA, SALT, CARRAGEENAN, LACTASE'

In [42]:
process_ingredients(data.loc[0, 'ingredients'])

['cream',
 'skim milk',
 'liquid sugar',
 'water',
 'brown sugar',
 'sugar',
 'milk',
 'wheat flour',
 'egg yolk',
 'corn syrup',
 'egg',
 'butter',
 'butteroil',
 'pectin',
 'sea salt',
 'soybean oil',
 'vanilla extract',
 'guar gum',
 'soy lecithin',
 'baking powd',
 'baking soda',
 'salt',
 'carrageenan',
 'lactas']

In [43]:
# Rebuild all_ingredients from the processed (cleaned and stemmed) lists
all_ingredients = set()

for ingredients in data['ingredients']:
    all_ingredients.update(process_ingredients(ingredients))

# Drop the empty string left behind by empty fragments, if present
all_ingredients.discard('')

In [44]:
all_ingredients

{'acesulfame potassium',
 'almond',
 'almond extract',
 'almond milk',
 'almonds roasted in vegetable oil',
 'anhydrous milkfat',
 'annatto',
 'apple juic',
 'artificial color',
 'artificial flavor',
 'ascorbic acid',
 'baking powd',
 'baking soda',
 'balsamic vinegar',
 'banana',
 'banana pure',
 'barley malt',
 'beet juic',
 'belgian chocol',
 'black carrot concentr',
 'black cherri',
 'black raspberri',
 'black raspberry pure',
 'blackberry juice concentr',
 'bleached wheat flour',
 'blue 1',
 'blue 1 lak',
 'blue 2',
 'blue 2 lak',
 'blueberri',
 'blueberry puree concentr',
 'bourbon',
 'brown sugar',
 'brown sugar†',
 'butter',
 'butter oil',
 'butteroil',
 'calcium carbon',
 'calcium phosph',
 'cane sugar',
 'canola oil',
 'caramel',
 'caramel color',
 'caramel flavor',
 'caramel swirl',
 'caramelized sugar',
 'caramelized sugar syrup',
 'carnauba wax',
 'carob bean',
 'carob bean gum',
 'carob gum',
 'carrageenan',
 'carrot juice concentr',
 'carrot powd',
 'cheese cultur',
 'ch

In [59]:
y = data.loc[:, 'rating']
X = data.drop('rating', axis=1)

In [60]:
X

Unnamed: 0,brand,ingredients
0,bj,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),..."
1,bj,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),..."
2,bj,"CREAM, LIQUID SUGAR (SUGAR, WATER), SKIM MILK,..."
3,bj,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),..."
4,bj,"CREAM, SKIM MILK, WATER, LIQUID SUGAR (SUGAR, ..."
...,...,...
231,breyers,"MILK, CORN SYRUP, SUGAR, BROWN SUGAR, SOYBEAN ..."
232,breyers,"MILK, WATER, CARAMEL SWIRL, SUGAR, WATER, CORN..."
233,breyers,"MILK, CORN SYRUP, SUGAR, WHEAT FLOUR, BUTTER, ..."
234,breyers,"MILK, CORN SYRUP, ENRICHED WHEAT FLOUR, WHEAT ..."


In [61]:
def onehot_encode(df, column, prefix):
    df = df.copy()
    dummies = pd.get_dummies(df[column], prefix=prefix)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
    return df

In [62]:
X = onehot_encode(X, 'brand', 'b')
X

Unnamed: 0,ingredients,b_bj,b_breyers,b_hd,b_talenti
0,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),...",True,False,False,False
1,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),...",True,False,False,False
2,"CREAM, LIQUID SUGAR (SUGAR, WATER), SKIM MILK,...",True,False,False,False
3,"CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),...",True,False,False,False
4,"CREAM, SKIM MILK, WATER, LIQUID SUGAR (SUGAR, ...",True,False,False,False
...,...,...,...,...,...
231,"MILK, CORN SYRUP, SUGAR, BROWN SUGAR, SOYBEAN ...",False,True,False,False
232,"MILK, WATER, CARAMEL SWIRL, SUGAR, WATER, CORN...",False,True,False,False
233,"MILK, CORN SYRUP, SUGAR, WHEAT FLOUR, BUTTER, ...",False,True,False,False
234,"MILK, CORN SYRUP, ENRICHED WHEAT FLOUR, WHEAT ...",False,True,False,False
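
The `b_` columns above are plain indicator columns, one per brand. As a rough illustration of what `pd.get_dummies` produces here, the same encoding can be sketched in plain Python; `onehot_rows` is a hypothetical helper and the short brand list is made up.

```python
def onehot_rows(values, prefix):
    """Return (column_names, rows) with one True/False column per category."""
    categories = sorted(set(values))
    columns = [f"{prefix}_{category}" for category in categories]
    rows = [[value == category for category in categories] for value in values]
    return columns, rows

# Made-up brand codes for illustration.
columns, rows = onehot_rows(['bj', 'hd', 'bj', 'talenti'], 'b')
print(columns)
for row in rows:
    print(row)
```

Because every row has exactly one `True`, the indicator columns always sum to 1, which duplicates the intercept of a linear model. `pd.get_dummies` accepts `drop_first=True` to drop one level if that redundancy matters.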


In [63]:
X['ingredients'] = X['ingredients'].apply(process_ingredients)

In [64]:
X

Unnamed: 0,ingredients,b_bj,b_breyers,b_hd,b_talenti
0,"[cream, skim milk, liquid sugar, water, brown ...",True,False,False,False
1,"[cream, skim milk, liquid sugar, water, sugar,...",True,False,False,False
2,"[cream, liquid sugar, skim milk, water, sugar,...",True,False,False,False
3,"[cream, skim milk, liquid sugar, water, corn s...",True,False,False,False
4,"[cream, skim milk, water, liquid sugar, sugar,...",True,False,False,False
...,...,...,...,...,...
231,"[milk, corn syrup, sugar, brown sugar, soybean...",False,True,False,False
232,"[milk, water, caramel swirl, sugar, water, cor...",False,True,False,False
233,"[milk, corn syrup, sugar, wheat flour, butter,...",False,True,False,False
234,"[milk, corn syrup, enriched wheat flour, wheat...",False,True,False,False


In [65]:
ingredient_columns = []

for ingredient_list in X['ingredients']:
    for ingredient in ingredient_list:
        if ingredient not in ingredient_columns:
            ingredient_columns.append(ingredient)

ingredient_columns

['cream',
 'skim milk',
 'liquid sugar',
 'water',
 'brown sugar',
 'sugar',
 'milk',
 'wheat flour',
 'egg yolk',
 'corn syrup',
 'egg',
 'butter',
 'butteroil',
 'pectin',
 'sea salt',
 'soybean oil',
 'vanilla extract',
 'guar gum',
 'soy lecithin',
 'baking powd',
 'baking soda',
 'salt',
 'carrageenan',
 'lactas',
 'peanut',
 'canola oil',
 'corn starch',
 'peanut oil',
 'cocoa powd',
 'invert cane sugar',
 'milkfat',
 'egg whit',
 'tapioca starch',
 'barley malt',
 'malted barley flour',
 'cocoa',
 'potato',
 'coconut oil',
 'corn syrup solid',
 'rice starch',
 'sunflower oil',
 'yeast extract',
 'natural flavor',
 'enzym',
 'contains milk',
 'wheat',
 'soy',
 'dried cane syrup',
 'butter oil',
 'locust bean gum',
 'citric acid',
 'vanilla bean se',
 'lactic acid',
 'graham flour',
 'molass',
 'honey',
 'caramelized sugar syrup',
 'chocolate liquor',
 'tapioca flour',
 'peanut flour',
 'peanut extract',
 'cocoa butt',
 'roasted almond',
 'blackberry juice concentr',
 'invert suga

In [66]:
ingredients_df = X['ingredients']

In [67]:
mlb = MultiLabelBinarizer()
ingredients_df = pd.DataFrame(mlb.fit_transform(ingredients_df))

In [68]:
ingredients_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,347,348,349,350,351,352,353,354,355,356
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
231,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
232,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
233,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
234,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
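
The numbered columns carry no ingredient names because the array returned by `fit_transform` was wrapped in a DataFrame without column labels. `MultiLabelBinarizer` keeps the labels, in column order, in its `classes_` attribute, so passing `columns=mlb.classes_` to `pd.DataFrame` would be one way to retain them. The encoding itself is simple enough to sketch in plain Python (`multi_hot` is a hypothetical helper, and the three short lists are made up):

```python
def multi_hot(label_lists):
    """Return (classes, matrix): one 0/1 column per distinct label, sorted."""
    classes = sorted({label for labels in label_lists for label in labels})
    position = {label: j for j, label in enumerate(classes)}
    matrix = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[position[label]] = 1
        matrix.append(row)
    return classes, matrix

classes, matrix = multi_hot([
    ['cream', 'sugar'],
    ['milk', 'sugar', 'water'],
    ['cream', 'water', 'water'],   # a repeated label still yields a single 1
])

# Print each label next to its column of indicators.
for name, column in zip(classes, zip(*matrix)):
    print(name, column)
```

Each row records only whether an ingredient appears, not its position in the list or how often it is repeated.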


In [69]:
X = pd.concat([X, ingredients_df], axis=1)
X = X.drop('ingredients', axis=1)
X.columns = X.columns.astype(str)
X

Unnamed: 0,b_bj,b_breyers,b_hd,b_talenti,0,1,2,3,4,5,...,347,348,349,350,351,352,353,354,355,356
0,True,False,False,False,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,True,False,False,False,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,True,False,False,False,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,True,False,False,False,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,True,False,False,False,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
231,False,True,False,False,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
232,False,True,False,False,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
233,False,True,False,False,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
234,False,True,False,False,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [70]:
y

0      3.7
1      4.0
2      4.7
3      3.6
4      4.5
      ... 
231    4.0
232    4.7
233    2.5
234    3.2
235    2.8
Name: rating, Length: 236, dtype: float64

### Training

In [71]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=100)
X_train

Unnamed: 0,b_bj,b_breyers,b_hd,b_talenti,0,1,2,3,4,5,...,347,348,349,350,351,352,353,354,355,356
149,False,False,False,True,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
201,False,True,False,False,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
191,False,True,False,False,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,True,False,False,False,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
28,True,False,False,False,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87,False,False,True,False,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
103,False,False,True,False,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
67,False,False,True,False,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
24,True,False,False,False,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Without Regularization

In [72]:
model = LinearRegression()

model.fit(X_train, y_train)

In [73]:
model.score(X_test, y_test) 

-17.452585340920987
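
`score` on a scikit-learn regressor returns the coefficient of determination, `R^2 = 1 - SS_res / SS_tot`, so it is not bounded below by zero: a model whose test predictions are worse than always predicting the mean test rating gets a negative value. With more than 350 one-hot columns and only 70% of the 236 rows used for training, the unregularized fit has enough freedom to reproduce the training ratings exactly, which is a likely reason for the large negative score here. A minimal sketch with made-up numbers (not taken from the dataset):

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up ratings for illustration only.
y_true = [3.0, 4.0, 5.0]

mean_baseline = [4.0, 4.0, 4.0]   # always predict the mean
wild_guess = [6.0, 1.0, 8.0]      # far-off predictions

print(r2_score(y_true, mean_baseline))  # 0.0
print(r2_score(y_true, wild_guess))     # -12.5
```

A score of 0 therefore corresponds to the "predict the mean" baseline, and anything below it is worse than that baseline on the test set.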

### With Regularization

In [83]:
l2_model = Ridge(alpha=1000.0)

l2_model.fit(X_train, y_train)

In [84]:
l2_model.score(X_test, y_test)

-0.06339632798077366
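
`Ridge` adds an L2 penalty, `alpha * ||w||^2`, to the least-squares objective, which pulls the coefficients toward zero. With a single feature and no intercept the solution has the closed form `w = sum(x*y) / (sum(x*x) + alpha)`, which makes the effect of a large `alpha` such as 1000 easy to see. The numbers below are made up, and `ridge_1d` is a hypothetical helper:

```python
def ridge_1d(x, y, alpha):
    """Closed-form ridge coefficient for one feature and no intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]   # exactly y = 2x

for alpha in (0.0, 1.0, 1000.0):
    print(alpha, ridge_1d(x, y, alpha))
```

With `alpha=1000` the coefficient is almost zero even though the true slope is 2. In the full model the same shrinkage pushes predictions toward a constant, which is consistent with the test score near 0 above.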

In [93]:
l1_model = Lasso(alpha=0.1)

l1_model.fit(X_train, y_train)

In [94]:
l1_model.score(X_test, y_test)

-0.06882219995590577
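
scikit-learn's `Lasso` minimizes `(1 / (2 * n)) * ||y - Xw||^2 + alpha * ||w||_1`. Unlike the L2 penalty, the L1 penalty can set coefficients exactly to zero. For a single feature without an intercept the solution is a soft-thresholding rule, sketched below with made-up numbers (`lasso_1d` and `soft_threshold` are hypothetical helpers):

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, clipping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_1d(x, y, alpha):
    """Lasso coefficient for one feature and no intercept."""
    n = len(x)
    rho = sum(a * b for a, b in zip(x, y)) / n   # feature-target correlation term
    scale = sum(a * a for a in x) / n            # mean squared feature value
    return soft_threshold(rho, alpha) / scale

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]   # exactly y = 2x

for alpha in (0.0, 0.1, 10.0):
    print(alpha, lasso_1d(x, y, alpha))
```

Once `alpha` exceeds the correlation term the coefficient is exactly 0. If the same happens to most of the sparse ingredient columns in the notebook, the model would predict a near-constant rating, which may explain a score so close to the ridge result.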