In [None]:
!pip install --upgrade numpy pandas matplotlib scikit-learn

In [77]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

In [85]:
data = pd.read_json('https://cdn.c18l.org/full_format_recipes.json')
data.head()

Unnamed: 0,directions,fat,date,categories,calories,desc,protein,rating,title,ingredients,sodium
0,"[1. Place the stock, lentils, celery, carrot, ...",7.0,2006-09-01 04:00:00+00:00,"[Sandwich, Bean, Fruit, Tomato, turkey, Vegeta...",426.0,,30.0,2.5,"Lentil, Apple, and Turkey Wrap","[4 cups low-sodium vegetable or chicken stock,...",559.0
1,[Combine first 9 ingredients in heavy medium s...,23.0,2004-08-20 04:00:00+00:00,"[Food Processor, Onion, Pork, Bake, Bastille D...",403.0,This uses the same ingredients found in boudin...,18.0,4.375,Boudin Blanc Terrine with Red Onion Confit,"[1 1/2 cups whipping cream, 2 medium onions, c...",1439.0
2,[In a large heavy saucepan cook diced fennel a...,7.0,2004-08-20 04:00:00+00:00,"[Soup/Stew, Dairy, Potato, Vegetable, Fennel, ...",165.0,,6.0,3.75,Potato and Fennel Soup Hodge,"[1 fennel bulb (sometimes called anise), stalk...",165.0
3,[Heat oil in heavy large skillet over medium-h...,,2009-03-27 04:00:00+00:00,"[Fish, Olive, Tomato, Sauté, Low Fat, Low Cal,...",,The Sicilian-style tomato sauce has tons of Me...,,5.0,Mahi-Mahi in Tomato Olive Sauce,"[2 tablespoons extra-virgin olive oil, 1 cup c...",
4,[Preheat oven to 350°F. Lightly grease 8x8x2-i...,32.0,2004-08-20 04:00:00+00:00,"[Cheese, Dairy, Pasta, Vegetable, Side, Bake, ...",547.0,,20.0,3.125,Spinach Noodle Casserole,"[1 12-ounce package frozen spinach soufflé, th...",452.0


In [86]:
print(data.dtypes)

directions                  object
fat                        float64
date           datetime64[ns, UTC]
categories                  object
calories                   float64
desc                        object
protein                    float64
rating                     float64
title                       object
ingredients                 object
sodium                     float64
dtype: object


##Data Prep and Cleaning
I chose to omit most of the non-numeric columns, partly because I do not know an easy way to convert them into quantitative variables, and partly because I do not think they play a big part in predicting the rating. I omitted categories and desc. I also removed date because it is stored as a datetime (which is different from the date we converted a few assignments ago) and because I do not think it plays a major role in predicting the rating of a recipe.

In [87]:
data = data.loc[:, ['directions', 'fat', 'calories', 'protein', 'rating', 'title', 'ingredients', 'sodium']]
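
For reference, a datetime column like date can be turned into a plain number if it is ever needed as a feature. The snippet below is only a sketch of one option (extracting the year with the pandas .dt accessor); the two timestamps are copied from the head() output above, and the variable names are made up for illustration.

```python
import pandas as pd

# Two timestamps copied from the head() output; stored as timezone-aware datetimes.
dates = pd.Series(pd.to_datetime(
    ['2006-09-01 04:00:00+00:00', '2004-08-20 04:00:00+00:00'], utc=True))

# .dt.year pulls out the calendar year as an integer column.
years = dates.dt.year
print(years.tolist())  # [2006, 2004]
```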

In [88]:
data.head()

Unnamed: 0,directions,fat,calories,protein,rating,title,ingredients,sodium
0,"[1. Place the stock, lentils, celery, carrot, ...",7.0,426.0,30.0,2.5,"Lentil, Apple, and Turkey Wrap","[4 cups low-sodium vegetable or chicken stock,...",559.0
1,[Combine first 9 ingredients in heavy medium s...,23.0,403.0,18.0,4.375,Boudin Blanc Terrine with Red Onion Confit,"[1 1/2 cups whipping cream, 2 medium onions, c...",1439.0
2,[In a large heavy saucepan cook diced fennel a...,7.0,165.0,6.0,3.75,Potato and Fennel Soup Hodge,"[1 fennel bulb (sometimes called anise), stalk...",165.0
3,[Heat oil in heavy large skillet over medium-h...,,,,5.0,Mahi-Mahi in Tomato Olive Sauce,"[2 tablespoons extra-virgin olive oil, 1 cup c...",
4,[Preheat oven to 350°F. Lightly grease 8x8x2-i...,32.0,547.0,20.0,3.125,Spinach Noodle Casserole,"[1 12-ounce package frozen spinach soufflé, th...",452.0


In [89]:
data.describe()

Unnamed: 0,fat,calories,protein,rating,sodium
count,15908.0,15976.0,15929.0,20100.0,15974.0
mean,346.0975,6307.857,99.946199,3.71306,6211.474
std,20431.02,358585.1,3835.616663,1.343144,332890.3
min,0.0,0.0,0.0,0.0,0.0
25%,7.0,198.0,3.0,3.75,80.0
50%,17.0,331.0,8.0,4.375,294.0
75%,33.0,586.0,27.0,4.375,711.0
max,1722763.0,30111220.0,236489.0,5.0,27675110.0


I dropped every row that had a NaN in any column.

In [90]:
data = data.dropna()

In [92]:
data.describe()

Unnamed: 0,fat,calories,protein,rating,sodium
count,15896.0,15896.0,15896.0,15896.0,15896.0
mean,346.3496,6338.838,100.145823,3.759476,6241.392
std,20438.73,359486.1,3839.59368,1.287856,333705.8
min,0.0,0.0,0.0,0.0,0.0
25%,7.0,199.0,3.0,3.75,81.0
50%,17.0,333.0,8.0,4.375,296.0
75%,33.0,587.0,27.0,4.375,713.0
max,1722763.0,30111220.0,236489.0,5.0,27675110.0


I kept only recipes with protein at or below 50, since the third quartile is 27 while the max is 236489. I kept only recipes with calories at or below 1000: the third quartile is 587, but I think 1000 is a better cutoff than ~700. I kept only recipes with sodium at or below 1000, since the third quartile is 713. These filters removed about 3,600 recipes (15896 rows down to 12296), but I think the remaining data is much cleaner.

In [93]:
data = data.loc[data['protein'] <= 50]
data = data.loc[data['calories'] <= 1000]
data = data.loc[data['sodium'] <= 1000]

data.describe()

Unnamed: 0,fat,calories,protein,rating,sodium
count,12296.0,12296.0,12296.0,12296.0,12296.0
mean,17.306034,321.592632,10.053513,3.709895,271.502765
std,15.338553,202.925005,11.347284,1.338333,261.673865
min,0.0,0.0,0.0,0.0,0.0
25%,6.0,175.0,2.0,3.75,51.0
50%,14.0,277.0,6.0,4.375,186.0
75%,24.0,439.0,13.0,4.375,430.0
max,108.0,1000.0,50.0,5.0,1000.0
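
The three upper limits can also be kept in one place and applied as a single mask, which makes it easy to count how many rows each run removes. This is a minimal sketch on a made-up four-row table (the values in toy are hypothetical); only the limits themselves come from the notebook.

```python
import pandas as pd

# Hypothetical toy rows standing in for the recipe table.
toy = pd.DataFrame({
    'protein':  [10.0, 60.0, 20.0, 5.0],
    'calories': [300.0, 400.0, 1500.0, 250.0],
    'sodium':   [200.0, 100.0, 300.0, 5000.0],
})

# Upper limits used in the notebook.
limits = {'protein': 50, 'calories': 1000, 'sodium': 1000}

# A row is kept only if every listed column is at or below its limit.
mask = pd.Series(True, index=toy.index)
for column, upper in limits.items():
    mask &= toy[column] <= upper

filtered = toy.loc[mask]
removed = len(toy) - len(filtered)
print(filtered)
print('rows removed:', removed)  # rows removed: 3
```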


In [94]:
data.head()

Unnamed: 0,directions,fat,calories,protein,rating,title,ingredients,sodium
0,"[1. Place the stock, lentils, celery, carrot, ...",7.0,426.0,30.0,2.5,"Lentil, Apple, and Turkey Wrap","[4 cups low-sodium vegetable or chicken stock,...",559.0
2,[In a large heavy saucepan cook diced fennel a...,7.0,165.0,6.0,3.75,Potato and Fennel Soup Hodge,"[1 fennel bulb (sometimes called anise), stalk...",165.0
4,[Preheat oven to 350°F. Lightly grease 8x8x2-i...,32.0,547.0,20.0,3.125,Spinach Noodle Casserole,"[1 12-ounce package frozen spinach soufflé, th...",452.0
10,[Heat oil in heavy large skillet over medium-h...,5.0,256.0,4.0,3.75,"Yams Braised with Cream, Rosemary and Nutmeg","[4 teaspoons olive oil, 1/2 cup finely chopped...",30.0
12,[Preheat oven to 350°F. Coat cake pans with no...,48.0,766.0,12.0,4.375,Banana-Chocolate Chip Cake With Peanut Butter ...,"[Nonstick vegetable oil spray, 3 cups all-purp...",439.0


##Feature Engineering

I am creating three new columns: mins (cook time), is_soup, and has_chicken. Cook time is self-explanatory and, to me, has a big effect on how willing someone is to try a recipe. The is_soup and has_chicken columns are 0/1 indicators of whether the recipe is a soup and whether the recipe includes chicken. Though I could make many more similar columns for other types of meals and other ingredients, I stopped at these because I thought these categories (soup, chicken) were ones that people rating recipes would likely consider important.

Calculating cook time

In [95]:
import re

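# For each step in the directions, find the first mention of
# "<1-2 digits> min" or "<1-2 digits> hour" and keep the matched text.
data['mins'] = data['directions'].apply(lambda x:
         [re.search(r'([0-9]{1,2} (?:min|hour))', elem)
         for elem in x]
).apply(lambda x: [elem.group() for elem in x if elem is not None])
# Split each match into [number, unit], convert hours to minutes,
# and add everything up to get one total per recipe.
data['mins'] = data['mins'].apply(
    lambda x: [elem.split() for elem in x]
).apply(
    lambda x: sum([
        int(elem[0]) if elem[1] == 'min'
        else int(elem[0])*60
        for elem in x
    ])
)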
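
To check what the cook time extraction does on a single recipe, the same regular expression can be wrapped in a small function. This is a sketch that mirrors the cell above; the name estimate_minutes and the example steps are made up for illustration.

```python
import re

# Same pattern as the notebook cell: 1-2 digits, a space, then "min" or "hour".
TIME_PATTERN = re.compile(r'([0-9]{1,2} (?:min|hour))')

def estimate_minutes(steps):
    """Sum the first time mention in each step, converting hours to minutes."""
    total = 0
    for step in steps:
        match = TIME_PATTERN.search(step)
        if match is None:
            continue  # this step mentions no time
        amount, unit = match.group().split()
        total += int(amount) if unit == 'min' else int(amount) * 60
    return total

steps = ['Simmer 30 minutes.', 'Chill 1 hour.', 'Serve warm.']
print(estimate_minutes(steps))  # 90
```

Two limits of this approach are worth knowing. Only the first time in each step is counted, so estimate_minutes(['Bake 5 minutes, then broil 10 minutes.']) gives 5. Numbers with three digits are cut off by the {1,2} quantifier, so estimate_minutes(['Braise 100 minutes.']) matches "00 min" and gives 0.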

Creating a column indicating if the recipe is for soup.

In [96]:
data['is_soup'] = data['title'].str.contains('Soup').astype(int)

Creating a column indicating whether chicken is an ingredient (I first need to convert the ingredients from a list to a string).

In [97]:
data['ingredients'] = [','.join(map(str, l)) for l in data['ingredients']]

In [98]:
data['has_chicken'] = data['ingredients'].str.contains('chicken').astype(int)
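
Both indicator columns rely on str.contains, which is case-sensitive by default and matches any substring. The sketch below shows the effect on a few rows; the title 'Chilled tomato soup' is a made-up example, while the other strings are taken from the outputs above.

```python
import pandas as pd

titles = pd.Series([
    'Potato and Fennel Soup Hodge',
    'Chilled tomato soup',          # hypothetical title with a lowercase "soup"
    'Spinach Noodle Casserole',
])

# Default matching is case-sensitive, so the lowercase "soup" is missed.
case_sensitive = titles.str.contains('Soup').astype(int)
# case=False matches regardless of capitalization.
case_insensitive = titles.str.contains('soup', case=False).astype(int)
print(case_sensitive.tolist())    # [1, 0, 0]
print(case_insensitive.tolist())  # [1, 1, 0]

# Substring matching also flags chicken stock as chicken.
ingredients = pd.Series(['4 cups low-sodium vegetable or chicken stock'])
stock_flag = ingredients.str.contains('chicken').astype(int)
print(stock_flag.tolist())        # [1]
```

This is why the Lentil, Apple, and Turkey Wrap row has has_chicken equal to 1: its first ingredient mentions chicken stock.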

In [99]:
data.head()

Unnamed: 0,directions,fat,calories,protein,rating,title,ingredients,sodium,mins,is_soup,has_chicken
0,"[1. Place the stock, lentils, celery, carrot, ...",7.0,426.0,30.0,2.5,"Lentil, Apple, and Turkey Wrap","4 cups low-sodium vegetable or chicken stock,1...",559.0,30,0,1
2,[In a large heavy saucepan cook diced fennel a...,7.0,165.0,6.0,3.75,Potato and Fennel Soup Hodge,"1 fennel bulb (sometimes called anise), stalks...",165.0,10,1,1
4,[Preheat oven to 350°F. Lightly grease 8x8x2-i...,32.0,547.0,20.0,3.125,Spinach Noodle Casserole,"1 12-ounce package frozen spinach soufflé, tha...",452.0,45,0,0
10,[Heat oil in heavy large skillet over medium-h...,5.0,256.0,4.0,3.75,"Yams Braised with Cream, Rosemary and Nutmeg","4 teaspoons olive oil,1/2 cup finely chopped s...",30.0,3,0,1
12,[Preheat oven to 350°F. Coat cake pans with no...,48.0,766.0,12.0,4.375,Banana-Chocolate Chip Cake With Peanut Butter ...,"Nonstick vegetable oil spray,3 cups all-purpos...",439.0,101,0,0


##Model Building

The model I chose is linear regression, partly because I understand the coding process for it, but also because rating is a continuous numerical variable.

From my newly created dataset, I used all the numerical columns (fat, calories, protein, sodium, mins, is_soup, has_chicken) as features in my model.

In [100]:
new_data = data.loc[:, ['fat', 'calories', 'protein', 'rating', 'sodium', 'mins', 'is_soup', 'has_chicken']]

In [101]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(
    new_data,
    train_size=0.8,
    random_state=48)

model = LinearRegression().fit(
    X=train_data.loc[:, [
        'fat', 'calories', 'protein',
        'sodium', 'mins', 'is_soup', 'has_chicken']],
    y=train_data['rating']
)
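
Because the same list of feature columns is needed for both fitting and scoring, it can help to store it once in a variable. The sketch below shows that pattern on randomly generated data; the shortened FEATURES list, the toy DataFrame, and the random values are all assumptions for illustration and are not the recipe data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

FEATURES = ['fat', 'calories', 'protein']  # shortened, illustrative feature list
TARGET = 'rating'

# Hypothetical random data: 50 rows of features and unrelated random ratings.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.uniform(0, 100, size=(50, 3)), columns=FEATURES)
toy[TARGET] = rng.uniform(0, 5, size=50)

# Same split settings as the notebook cell.
train, test = train_test_split(toy, train_size=0.8, random_state=48)

# The feature list is written once and reused for fit and score.
model = LinearRegression().fit(train[FEATURES], train[TARGET])
print(dict(zip(FEATURES, model.coef_)))
print(model.score(test[FEATURES], test[TARGET]))
```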

In [102]:
model

##Model Evaluation

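My model performed horribly. I evaluated my model with the model.score() function, which is what we learned to use to rate linear regressions. For a linear regression, model.score() returns R^2, the fraction of the variance in the target that the model explains. On the test set it returned about 0.03, meaning my chosen columns/variables explain only about 3% of the variance in the ratings.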

In [103]:
model.score(
    X=test_data.loc[:, [
        'fat', 'calories', 'protein',
        'sodium', 'mins', 'is_soup', 'has_chicken']],
    y=test_data['rating']
)

0.03115144967966832
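
For context on the number above, the score of a LinearRegression is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below computes it by hand on a few made-up ratings to show the two reference points: always predicting the mean rating scores 0, and perfect predictions score 1. The function name r_squared and the sample values are illustrative only.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

ratings = [2.5, 3.75, 4.375, 5.0, 3.125]  # hypothetical ratings
mean_rating = sum(ratings) / len(ratings)
baseline = [mean_rating] * len(ratings)

print(r_squared(ratings, baseline))  # 0.0
print(r_squared(ratings, ratings))   # 1.0
```

A score of about 0.03 is therefore only slightly better than predicting the average rating for every recipe.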

A possible conclusion from my model and its score is that none of the chosen columns (variables) are good predictors of the rating of a recipe. In fact, there is possibly no good predictor of the rating in this dataset, because the rating likely depends on how well the user/chef followed the recipe and how good the resulting food tasted to that person. Thus, it is possible that the rating is based only on the human effect of execution.