A notebook for gathering non-celebrity-chef recipe data via the Yummly REST API.

#0. Setup

In [1]:
from nltk.stem import WordNetLemmatizer
import pymongo
import requests
from tqdm import tqdm
import pickle
import caffeine  # importing this keeps the computer from sleeping during the long download

In [2]:
client = pymongo.MongoClient()
chefs = client.chefs_db
celebrity_recipes = client.chefs_db.celebrity_recipes
yummly_recipes = client.chefs_db.yummly_recipes
yummly_recipes2 = client.chefs_db.yummly_recipes2
yummly_recipes3 = client.chefs_db.yummly_recipes3

#1. NLP

Stopwords:

In [3]:
specs = ['dash', 'pinch', 'teaspoon', 'tablespoon', 'cup', 'scoop', 'pound', 'ounce', 'oz', 
         'quart', 'pint', 'gallon', 'milliliter', 'ml', 'liter', 'small', 'medium', 'large', 
         'freshly', 'ground', 'piece', 'clove', 'boneless', 'cube', 'dice', 'finely', 
         'grated', 'to', 'inch', 'each', 'whole', 'about', 'as', 'thawed', 'by', 'all', 'a',
         'chopped', 'crushed', 'plus', 'minus', 'such', 'the', 'an', 'slice', 'approximately',
         'and', 'or', 'weight', 'of', 'recipe', 'basic', 'slab', 'stick', 'pure', 'melt',
         'melted'] 
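
As a rough illustration of how this list is used, the sketch below drops stopword tokens from an ingredient string. It works on a small subset of `specs`, and `scrub` is a hypothetical name introduced here, not the helper defined in this notebook. It is written in Python 3 syntax.

```python
# A small subset of the stopword list above, for illustration only.
specs_subset = {'cup', 'teaspoon', 'large', 'chopped', 'melted', 'of', 'a'}


def scrub(ingredient, stopwords=specs_subset):
    """Drop stopword tokens from one ingredient string (hypothetical helper)."""
    kept = [token for token in ingredient.split() if token not in stopwords]
    return ' '.join(kept)


print(scrub('melted butter'))
print(scrub('a cup of chopped walnuts'))
```

Matching in this sketch is exact and case-sensitive, so a token such as `cups` or `Cup` would pass through unless it is normalized first.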

Helper function to parse the title, scrubbed ingredients, and flavors from a given JSON recipe document:

In [4]:
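# Create the lemmatizer once instead of once per ingredient.
lemmatizer = WordNetLemmatizer()

def parse_recipe(recipe):
    # Title:
    title = recipe['recipeName'].encode('utf-8')
    # Ingredients: drop non-ASCII characters, then lemmatize each ingredient
    # string as a whole (not word by word).
    lemmatized = [lemmatizer.lemmatize(
                   ingredient.encode('ascii', 'ignore').decode('latin-1'))
                   for ingredient in recipe['ingredients']]
    # Split each ingredient into words and drop the stopwords listed in specs.
    split = [ingredient.split() for ingredient in lemmatized]
    filtered = [[i.encode('utf-8') for i in line if i not in specs] for line in split]
    ingredients = [' '.join(ingredient) for ingredient in filtered]
    # Flavors:
    flavors = recipe['flavors']
    return title, ingredients, flavors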

An example:

In [5]:
# Unpickle a saved sample of Yummly results (pickle files are opened in binary mode)
with open('yummly_sample.pkl', 'rb') as picklefile:
    sample = pickle.load(picklefile)

In [9]:
example = {u'attributes': {u'course': [u'Breads']},
   u'flavors': {u'bitter': 0.6666666666666666,
    u'meaty': 0.3333333333333333,
    u'piquant': 0.0,
    u'salty': 1.0,
    u'sour': 0.6666666666666666,
    u'sweet': 0.8333333333333334},
   u'id': u'Beer-bread-369136',
   u'imageUrlsBySize': {u'90': u'http://lh4.ggpht.com/XOlUW2enMRRVKfzFY1IP0uu5TnmDXj7XOMsIH-U_PA_sFOMLKP1qswyYvo-rzpFvAwa8AB48Ds74Mws52j0B=s90-c'},
   u'ingredients': [u'self rising flour',
    u'sugar',
    u'kosher salt',
    u'beer',
    u'melted butter'],
   u'rating': 5,
   u'recipeName': u'Beer Bread',
   u'smallImageUrls': [u'http://lh3.ggpht.com/YgheDnFweX-4mE5zILhtZB20AhSvfrl6j8fdwN3KJK6P7WIR9hJyFzz0a_SYtFDQ3Dku327VhVRGYfIJpuat02Y=s90'],
   u'sourceDisplayName': u'My Baking Addiction',
   u'totalTimeInSeconds': 4200}

In [11]:
title, ingredients, flavors = parse_recipe(example)
print title
print ingredients
print flavors

Beer Bread
['self rising flour', 'sugar', 'kosher salt', 'beer', 'butter']
{u'piquant': 0.0, u'sour': 0.6666666666666666, u'salty': 1.0, u'sweet': 0.8333333333333334, u'bitter': 0.6666666666666666, u'meaty': 0.3333333333333333}


#2. Querying

Setup: collect the title of each celebrity chef recipe to use as a search term, paired with its chef so that every result can be tied back to that chef.

In [7]:
titles_and_chefs = []
for recipe in celebrity_recipes.find({}, {"title": 1, "chef": 1}):
    titles_and_chefs.append((recipe["chef"], recipe["title"]))

In [8]:
len(titles_and_chefs)

8135

The process:

1. For every celebrity chef recipe, make a call to the Yummly API for up to 400 recipes with the recipe title as the search term.

2. For each recipe returned, parse out the title, ingredients, and flavors.

3. Make a dictionary of (a) Yummly recipe title, (b) Yummly recipe ingredients, (c) corresponding celebrity chef dish title, (d) name of that celebrity chef, and (e) Yummly flavor scores.

4. Insert the dictionary as a document in the `yummly_recipes3` collection.
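
For step 1, the recipe title has to become part of a query string. Below is a minimal sketch of one way to build that URL with the standard library, in Python 3 syntax. `build_search_url` is a hypothetical helper, the credentials are placeholders, and the parameter names and default values are copied from the request used in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = 'http://api.yummly.com/v1/api/recipes'


def build_search_url(title, app_id, app_key, max_result=400, start=2):
    """Return a search URL for `title` (hypothetical helper)."""
    # A list of pairs keeps the parameters in a fixed order.
    params = [('_app_id', app_id), ('_app_key', app_key),
              ('q', title), ('maxResult', max_result), ('start', start)]
    # urlencode turns spaces into '+' and escapes characters such as '&'.
    return BASE_URL + '?' + urlencode(params)


print(build_search_url('Mac & Cheese', 'YOUR_APP_ID', 'YOUR_APP_KEY'))
```

Encoding the title this way keeps an ampersand inside a dish name from being read as a parameter separator.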

**N.B., this process will take 2-2.5 hours to run.** So, we `import caffeine` to keep the computer from sleeping. When we're done, `caffeine.off()` will take things back to normal.

In [6]:
errorcount = 0
for chef, title in tqdm(titles_and_chefs):
    # First, check whether this title is already in Mongo; if not, download from the API.
    count = yummly_recipes3.find({"celebrity_recipe": title}).count()
    if count == 0:
        try:
            # Pass the query via params so that requests URL-encodes the title
            # (a raw '&' or '#' in a title could otherwise cut the query short).
            yummly_data = requests.get(
                'http://api.yummly.com/v1/api/recipes',
                params={'_app_id': 'fd0752d7',
                        '_app_key': 'c2b47c97e29091d99a7d02d5861c9e27',
                        'q': title, 'maxResult': 400, 'start': 2})
            response = yummly_data.json()
            # The search results are the list stored under the 'matches' key.
            for recipe in response['matches']:
                yummly_title, ingredients, flavors = parse_recipe(recipe)
                document = {'yummly_recipe': yummly_title, 'yummly_ingredients': ingredients,
                            'celebrity_recipe': title, 'chef': chef, 'flavors': flavors}
                yummly_recipes3.save(document)
        except Exception:
            # Count and report the failure, then move on to the next title.
            errorcount += 1
            print chef, title, errorcount
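
Because the `try` block in the loop above wraps a whole response, one malformed recipe stops processing of the remaining results for that title. The sketch below shows one possible alternative that skips only the bad recipe. It is not what this notebook ran: `build_documents` and `simple_parse` are hypothetical names, `simple_parse` is a stand-in for `parse_recipe`, the chef name is a placeholder, and the code is in Python 3 syntax.

```python
def build_documents(matches, chef, title, parse):
    """Turn a list of search results into documents, skipping bad ones.

    `parse` is any function with the same return shape as parse_recipe.
    Returns (documents, number_skipped).
    """
    documents, skipped = [], 0
    for recipe in matches:
        try:
            yummly_title, ingredients, flavors = parse(recipe)
        except (KeyError, TypeError, AttributeError):
            skipped += 1  # malformed recipe: count it and carry on
            continue
        documents.append({'yummly_recipe': yummly_title,
                          'yummly_ingredients': ingredients,
                          'celebrity_recipe': title, 'chef': chef,
                          'flavors': flavors})
    return documents, skipped


def simple_parse(recipe):
    """Stand-in for parse_recipe, used only to make this sketch runnable."""
    return recipe['recipeName'], list(recipe['ingredients']), recipe['flavors']


matches = [
    {'recipeName': 'Beer Bread', 'ingredients': ['beer'], 'flavors': None},
    {'ingredients': ['sugar']},  # no recipeName, so parsing fails
]
docs, skipped = build_documents(matches, 'Some Chef', 'Beer Bread', simple_parse)
print(len(docs), skipped)
```

Keeping the document assembly in a function with no network or database calls also makes it possible to check the logic without touching the API.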

In [10]:
caffeine.off()