<a href="https://colab.research.google.com/github/avivis/cookie-rating-predictor/blob/main/cookie_rating_predictor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This is a multiple linear regression model that predicts the allrecipes.com rating (on a scale of 3 to 5) of a chocolate chip cookie recipe from the proportions of its ingredients: fat (oil, butter, etc.), sugar, brown sugar, eggs, vanilla extract, all-purpose flour, baking soda, salt, and chocolate chips.**

In [26]:
#importing libraries
import numpy as np
import pandas as pd
import math

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

import matplotlib.pyplot as plt

First, we import the dataset. 

I created this dataset by finding chocolate chip cookie recipes on allrecipes.com and recording each recipe's measurements of fat (oil, butter, etc.), sugar, brown sugar, egg, vanilla extract, flour, baking soda, salt, and chocolate chips. I only used recipes for regular chocolate chip cookies (not peanut butter chocolate chip, toffee chocolate chip, etc.) so that ingredients I wasn't recording wouldn't skew the ratios of the ones I was, and I only used recipes with more than 10 reviews. I couldn't find any recipes with fewer than 3 stars and more than 10 reviews, so I kept the output scale of the model between 3 and 5. To avoid skewing the model, I collected the same number of recipes for each rating group (3/3.5, 4/4.5, 5).

After getting all the measurements down, I needed to standardize the unit. Most of the measurements were already in cups, so I converted them all to cups by multiplying teaspoon measurements by 0.02 and egg measurements by 0.2.
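
As a rough illustration of the conversion described above, here is a minimal sketch that uses the two factors from the text (0.02 cups per teaspoon and 0.2 cups per egg). The dictionary, the function name, and the sample amounts are made up for this example and are not taken from the original notebook.

```python
# Conversion factors to cups, as stated in the text.
# CUPS_PER_UNIT and to_cups are illustrative names, not from the original notebook.
CUPS_PER_UNIT = {
    "cup": 1.0,
    "teaspoon": 0.02,  # approximation used in the text (1 tsp is about 1/48 cup)
    "egg": 0.2,        # one egg is treated as 0.2 cups
}


def to_cups(amount, unit):
    """Convert an ingredient amount in the given unit to cups."""
    return amount * CUPS_PER_UNIT[unit]


# Hypothetical recipe lines, made up for illustration
vanilla_cups = to_cups(1, "teaspoon")
egg_cups = to_cups(2, "egg")
print(vanilla_cups, egg_cups)
```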

Then, I ran a script to rescale each recipe from cup measurements to proportions of the whole recipe, so that every row adds up to 1.
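
The scaling script itself isn't shown here, so the following is only a sketch of one way such a step could work: divide each row by its own total. The function name and the two sample recipes are assumptions made up for illustration.

```python
import numpy as np


def to_proportions(cups):
    """Divide each row by its total so every recipe's ingredients sum to 1."""
    cups = np.asarray(cups, dtype=float)
    return cups / cups.sum(axis=1, keepdims=True)


# Two made-up recipes (rows) with three ingredients each, measured in cups
example = [[1.0, 0.5, 2.5],
           [0.5, 0.5, 1.0]]
proportions = to_proportions(example)
print(proportions)
print(proportions.sum(axis=1))  # each row total should be 1 (up to float rounding)
```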

In [2]:
#importing the dataset
dataset = pd.read_csv('cookiesheet.csv')

Here's what some of the dataset looks like after the scaling.

In [3]:
dataset.head()

| | fat | sugar | brown sugar | egg | vanilla extract | flour | baking soda | salt | choc chips | rating |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.255977 | 0.127989 | 0.127989 | 0.051195 | 0.00512 | 0.255977 | 0.00256 | 0.00256 | 0.17065 | 3.0 |
| 1 | 0.109572 | 0.146082 | 0.164358 | 0.043829 | 0.0 | 0.383503 | 0.004383 | 0.002191 | 0.146082 | 3.0 |
| 2 | 0.075226 | 0.100292 | 0.150453 | 0.060181 | 0.006018 | 0.300906 | 0.003009 | 0.003009 | 0.300906 | 3.0 |
| 3 | 0.140978 | 0.105734 | 0.105734 | 0.028196 | 0.00282 | 0.328944 | 0.00282 | 0.00282 | 0.281956 | 3.0 |
| 4 | 0.167364 | 0.167364 | 0.167364 | 0.066946 | 0.006695 | 0.251046 | 0.003347 | 0.00251 | 0.167364 | 3.0 |


Next, I define the dependent and independent variables of the model: the dependent variable is the recipe rating, and the independent variables are the ratios of the various ingredients.

In [4]:
#setting the dv to rating
dependent_variable = 'rating'

In [27]:
#setting the ivs to everything else
independent_variables = dataset.columns.tolist()
independent_variables.remove(dependent_variable)
independent_variables

['fat',
 'sugar',
 'brown sugar',
 'egg',
 'vanilla extract',
 'flour',
 'baking soda',
 'salt',
 'choc chips']

Now, it's time to train the model and fit the linear regression. I used an 80:20 split between the training group and the testing group.
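
For reference, the model being fit has the standard multiple linear regression form, with one coefficient per ingredient ratio plus an intercept (scikit-learn's `LinearRegression` fits an intercept by default). The symbols are notation introduced here for illustration and do not appear in the original notebook:

$$
\hat{y} = \beta_0 + \sum_{j=1}^{9} \beta_j x_j
$$

Here $\hat{y}$ is the predicted rating, $x_j$ is the proportion of the $j$-th ingredient, and the $\beta$ values are chosen to minimize the sum of squared differences between the predicted and actual ratings on the training set.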

In [17]:
X = dataset[independent_variables].values
y = dataset[dependent_variable].values

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [19]:
X_train[0:10]

array([[0.07522643, 0.10029188, 0.15045286, 0.06018115, 0.00601811,
        0.30090573, 0.00300906, 0.00300906, 0.30090573],
       [0.16492669, 0.08246335, 0.12369502, 0.06597068, 0.00824633,
        0.32985338, 0.00329853, 0.00164927, 0.21989676],
       [0.1607717 , 0.12057878, 0.12057878, 0.06430868, 0.00643087,
        0.36173633, 0.00321543, 0.00160772, 0.1607717 ],
       [0.1344086 , 0.10080645, 0.10080645, 0.05376344, 0.00268817,
        0.30241935, 0.00268817, 0.03360215, 0.2688172 ],
       [0.13066293, 0.05806661, 0.13066293, 0.03484345, 0.00348434,
        0.29036091, 0.00174217, 0.00174217, 0.34843448],
       [0.15479876, 0.07739938, 0.15479876, 0.0619195 , 0.00309598,
        0.23219814, 0.00309598, 0.00309598, 0.30959752],
       [0.27901786, 0.05580357, 0.11160714, 0.04464286, 0.00446429,
        0.27901786, 0.00223214, 0.        , 0.22321429],
       [0.15432099, 0.07716049, 0.15432099, 0.0617284 , 0.00617284,
        0.38580247, 0.00308642, 0.00308642, 0.15432099],


In [28]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()

Now that we have our model, we can input our testing group.

In [29]:
#these are the true ratings of our testing group inputs
y_test

array([4. , 4. , 4. , 3.5, 5. , 4.5, 4. , 3.5])

In [30]:
#this is my model's prediction given the testing group inputs
y_pred = regressor.predict(X_test)
y_pred

array([3.17670885, 4.09507255, 3.67174102, 4.20810311, 4.13360872,
       5.09609026, 4.8684483 , 3.73154549])

Here's a written-out mapping of the actual ratings of the recipes to the model's respective predictions, rounded to two decimal places (actual -> predicted):

*   **4 -> 3.18**
*   **4 -> 4.10**
*   **4 -> 3.67**
*   **3.5 -> 4.21**
*   **5 -> 4.13**
*   **4.5 -> 5.10**
*   **4 -> 4.87**
*   **3.5 -> 3.73**
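
One way to put a single number on how close these predictions are is the mean absolute error, the average size of the miss in stars. The sketch below is not part of the original analysis. It re-enters the values from the `y_test` and `y_pred` outputs above by hand, so it runs on its own.

```python
import numpy as np

# Values copied from the y_test and y_pred outputs above
y_test = np.array([4.0, 4.0, 4.0, 3.5, 5.0, 4.5, 4.0, 3.5])
y_pred = np.array([3.17670885, 4.09507255, 3.67174102, 4.20810311,
                   4.13360872, 5.09609026, 4.8684483, 3.73154549])

# Size of each miss, ignoring direction
abs_errors = np.abs(y_test - y_pred)
mae = abs_errors.mean()

print(f"mean absolute error: {mae:.2f}")
print(f"largest miss: {abs_errors.max():.2f}")
```

For these eight test recipes the mean absolute error comes to roughly 0.56 stars, and the largest miss is about 0.87 stars.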


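This model isn't perfect by any means, but it does make some interestingly close predictions: it lands within a quarter star of the true rating on two of the eight test recipes, though it misses by more than 0.8 stars on three others. I think a variety of things could improve this model. Here are a few: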

*   way more recipe data
*   recipe data for recipes with ratings under 3 stars
*   more data from recipes with few or no ingredients outside the independent variables (e.g., cinnamon or almond extract)









