# Recipes PCA
### By Brian Kitano

Okay, I'm going to use the Epicurious dataset to identify palates common across their recipes. 

### Naming conventions of variables
1. The raw dataset is loaded as an array called `data`, but we will exclusively use the dictionary `titleToRawRecipe` which contains a mapping of recipe titles to their data.
2. There are three kinds of ingredient objects: `rawIngredient` is the original ingredient string as it is found in `titleToRawRecipe`; `cleanIngredient` is the ingredient after NLP cleaning; and `cookedIngredient` is the dictionary object returned by the CRF.


## Introduction / Hypothesis

## Materials

## Procedure
1. Download the JSON data
2. Parse the JSON to extract recipe names and ingredients with their quantities.
3. Create the data matrix M, where each column is a recipe and each row is an ingredient; each entry is that ingredient's quantity in the recipe, converted to a single standardized unit (grams?)
4. PCA
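
Here's a rough sketch of what steps 3 and 4 could look like. It assumes NumPy (the notebook doesn't say which library will do the PCA), and the matrix values are made up for illustration. PCA is done by centering the data and taking an SVD.

```python
import numpy as np

# Hypothetical toy matrix: rows are ingredients, columns are recipes,
# entries are quantities in one common unit (made-up numbers).
M = np.array([
    [200.0, 180.0,   0.0,   0.0],   # flour
    [100.0, 120.0,   0.0,  10.0],   # sugar
    [  0.0,   0.0, 300.0, 250.0],   # tomato
    [  0.0,   5.0,  20.0,  30.0],   # garlic
])

# Treat each recipe (column of M) as one observation, so transpose.
X = M.T
# Center every ingredient (feature) at zero mean.
X_centered = X - X.mean(axis=0)

# SVD of the centered data; the rows of Vt are the principal axes.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Share of the total variance captured by each component.
explained_variance = (S ** 2) / (X.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Coordinates of each recipe along the principal axes.
scores = X_centered.dot(Vt.T)
```

Each row of `Vt` is a weighted combination of ingredients, which is the closest thing here to a "palate"; `scores` says how strongly each recipe expresses it.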

Bonus: construct the bipartite graph of ingredients to recipes, and then project it down onto a unipartite graph of ingredients where the weight of each edge is the frequency of connections. 
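
A minimal sketch of the bonus projection, assuming each recipe has already been reduced to a list of ingredient names. The toy recipes and the function name `project_onto_ingredients` are hypothetical. The weight of an edge is the number of recipes in which both ingredients appear.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: recipe title -> list of ingredient names.
recipe_to_ingredients = {
    "lasagne": ["garlic", "milk", "butter"],
    "garlic bread": ["garlic", "butter"],
    "custard": ["milk", "sugar"],
}

def project_onto_ingredients(recipes):
    """Count, for each unordered ingredient pair, how many recipes contain both."""
    edge_weights = Counter()
    for ingredients in recipes.values():
        # sorted(set(...)) counts each pair once per recipe, in a canonical order
        for a, b in combinations(sorted(set(ingredients)), 2):
            edge_weights[(a, b)] += 1
    return edge_weights

edges = project_onto_ingredients(recipe_to_ingredients)
```

Here `edges[("butter", "garlic")]` is 2, because butter and garlic share two recipes.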

In [1]:
# 1. Parse the ingredients to extract recipe names and ingredients with their quantities
import json

# load in the epicurious set
with open('full_format_recipes.json') as f:
    # the with block closes the file on exit
    data = json.load(f)
    
# print data[0]['ingredients']
print len(data)

# need to filter out the recipes without a title
data = list(filter(lambda recipe: 'title' in recipe.keys(), data)) 
print len(data)

# need to filter out the recipes that don't have ingredients listed
data = list(filter(lambda recipe: 'ingredients' in recipe.keys(), data))
print len(data)

# as a simple means of cleaning everything, let's strip whitespace from all the listings
# (rebuild each list: assigning to the loop variable would leave the stored strings unchanged)
for recipe in data:
    recipe['ingredients'] = [ingredient.strip() for ingredient in recipe['ingredients']]

20130
20111
20111


In [4]:
# there are duplicate recipes lol
# start by making the hash map of title to recipe
titleToRawRecipe = dict()

# for a recipe in the dataset
for recipe in data:
    title = recipe['title']
    # if we haven't seen that recipe before, add it to the dictionary
    if title not in titleToRawRecipe:
        titleToRawRecipe[title] = recipe
        # otherwise, doesn't matter, ignore it

# from then on out, we can only work with the dictionary
print len(titleToRawRecipe)

17775


In [5]:
# get all of the ingredient lists as a list of lists
rawIngredientsLists = [ recipe['ingredients'] for recipe in titleToRawRecipe.values() ]

# flatten this list, which might contain duplicates
rawIngredients = [ ingredient for ingredientList in rawIngredientsLists for ingredient in ingredientList]
print len(rawIngredients)

180467


We need to deal with the redundancy neatly. What we can do is map each original listing to a number and then, in a second structure, map that number to a processed listing. Then we only have to work with the processed listings and never touch the original mapping. We'll need to make a temporary reverse mapping from the deduplicated list to its indices.

In [10]:
# a deduplicated list of ingredients
uniqueRawIngredients = list(set(rawIngredients))
print len(uniqueRawIngredients)

# a temporary map from unique raw ingredient to index
uniqueRawIngredientToIndex = dict(zip(uniqueRawIngredients, range(len(uniqueRawIngredients))))

# now create a map from the original ingredients to these indices
rawIngredientToIndex = dict()
for ingredient in rawIngredients:
    rawIngredientToIndex[ingredient] = uniqueRawIngredientToIndex[ingredient]
    
print len(rawIngredientToIndex)

82097
82097


Now there's a mapping of the original listing to the unique listing, so we can safely process the unique list without losing track of where the original ones came from.

### Data Cleaning
Before we write all of the ingredients to a file, we should do some NLP cleaning. Looking at the results of the first model run, the conservative choice is to remove all text inside parentheses, since it seems to really mess up the CRF's ability to identify units. One unfortunate consequence of the index mapping is that we can no longer drop listings by filtering with lambdas, since that would shift every later index; instead we replace dropped listings with empty strings.
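
To see why dropped listings get replaced with empty strings instead of being filtered out, here's a tiny sketch with made-up listings (the variable names are hypothetical):

```python
# Hypothetical toy listings and a listing -> index map, like the one built above.
listings = ["1 cup sugar", "2 cups flour plus more for dusting", "1 pork tenderloin"]
index_of = dict((text, i) for i, text in enumerate(listings))

# Filtering shortens the list, so later entries slide to new positions.
filtered = [text for text in listings if "plus" not in text]

# Blanking keeps the length, so every stored index still points at its own entry.
blanked = [text if "plus" not in text else "" for text in listings]

i = index_of["1 pork tenderloin"]  # 2
```

After filtering, index 2 no longer exists; after blanking, `blanked[2]` is still the pork tenderloin.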

#### remove things in parentheses (use regex)

In [11]:
import re

# remove all the text that is inside a parenthesis
noParenthesisIngredients = [re.sub('\s*\([^)]*\)', '', ingredient) for ingredient in uniqueRawIngredients]

print len(noParenthesisIngredients)

82097


#### dealing with the word "plus"

More complex problem. There are lots of ways that "plus" is used. Some examples:

##### when quantities don't add nicely
- "1/2 cup plus 1 1/2 tablespoons red wine vinegar"
- "1/2 cup plus 2 tablespoons granola"
- "1/4 cup plus 1 tablespoon warm water"
- "1/4 teaspoon plus 1/3 cup sugar"
- "2 tablespoons plus 1/2 cup chopped fresh dill"
- "1 tablespoon plus 1/2 teaspoon Dijon mustard"
- "2/3 cup plus 6 tablespoons coarsely chopped pecans"
- "1 cup plus 2 tablespoons whole milk"
- "1 1/2 cups plus 2 tablespoons sugar"
- "1 1/2 cups plus 2 tablespoons water"
- "1 tablespoon plus 3/4 teaspoon ground cinnamon"
- "1 tablespoon plus one teaspoon fresh lemon juice"

These are in a consistent format of QUANTITY UNIT PLUS QUANTITY UNIT INGREDIENT. If we add PLUS as a label, then over the ~3k samples we have we might improve, but we might also tag some things as PLUS when we don't want them to be.
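
Another option for this group, sketched below, is to add the two quantities before the model ever sees them. This isn't what the notebook ends up doing. The pattern only handles cups, tablespoons, and teaspoons with numeric quantities, and the names `PLUS_PATTERN` and `combine_plus` are hypothetical. It relies on 1 tablespoon = 3 teaspoons and 1 cup = 16 tablespoons.

```python
import re
from fractions import Fraction

# Teaspoons per unit (US customary).
TEASPOONS_PER_UNIT = {"teaspoon": 1, "tablespoon": 3, "cup": 48}

QTY = r"((?:\d+\s+)?\d+/\d+|\d+)"        # "1/2", "1 1/2", or "2"
UNIT = r"(cup|tablespoon|teaspoon)s?"
PLUS_PATTERN = re.compile(
    r"^" + QTY + r"\s+" + UNIT + r"\s+plus\s+" + QTY + r"\s+" + UNIT + r"\s+(.+)$"
)

def parse_quantity(text):
    """Turn '1 1/2' into Fraction(3, 2)."""
    return sum(Fraction(part) for part in text.split())

def combine_plus(listing):
    """Return (total teaspoons, ingredient text), or None if the pattern doesn't fit."""
    match = PLUS_PATTERN.match(listing)
    if match is None:
        return None
    qty1, unit1, qty2, unit2, name = match.groups()
    total = (parse_quantity(qty1) * TEASPOONS_PER_UNIT[unit1]
             + parse_quantity(qty2) * TEASPOONS_PER_UNIT[unit2])
    return total, name

result = combine_plus("1/2 cup plus 2 tablespoons granola")  # (Fraction(30, 1), 'granola')
```

Spelled-out numbers such as "1 tablespoon plus one teaspoon fresh lemon juice" don't match and come back as `None`.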

##### when there's a suggestion for more on the side (not a lot of errors there)
- "1/4 cup olive oil, plus more for grilling"
- "5 teaspoons all-purpose flour plus more for dusting"
- "2 tablespoons drained capers plus more for serving"
- "1/2 cup freshly grated Parmesan cheese plus additional for passing"
- "12 rice-paper rounds, plus more in case some tear"
- "1 can whole tomatoes, plus juice"
- "1 tablespoon chile oil containing sesame oil plus some of sediment from jar"

For these, it seems like I can just remove all the words after the "plus".
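
A sketch of that truncation, assuming it is only applied to this group (it would wrongly cut the compound quantities from the previous group). The name `drop_after_plus` is hypothetical.

```python
import re

# An optional comma, whitespace, the whole word "plus", and everything after it.
TRAILING_PLUS = re.compile(r",?\s+plus\b.*$")

def drop_after_plus(listing):
    """Cut a listing off just before the word 'plus'."""
    return TRAILING_PLUS.sub("", listing)

examples = [
    "1/4 cup olive oil, plus more for grilling",
    "5 teaspoons all-purpose flour plus more for dusting",
    "1 can whole tomatoes, plus juice",
]
trimmed = [drop_after_plus(text) for text in examples]
```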

##### other, stupid ones
- "8 cornichons, finely chopped, plus 2 pickled onions from the jar, minced"
- "1/2 cup oil-packed sun-dried tomatoes, chopped, plus 2 tablespoons tomato oil"
- "1 tablespoon fresh rosemary leaves or 1 teaspoon crumbled dried, plus rosemary sprigs for garnish"
- "Juice of 1/4 lime, plus 1 lime wedge for garnish"
- "1 1/2 cups sugar, plus 1/4 cup mixed with 1 tablespoon cinnamon, on a plate"
- "1/2 fennel bulb, finely chopped, plus 1 tablespoon finely chopped fronds"
- "1/4 cup chopped fresh cilantro plus 32 whole fresh cilantro leaves"
- "6 large celery stalks, thickly sliced, plus 2 1/2 cups 1/2-inch-thick slices"
- "6 fresh mint leaves plus 1 mint sprig for garnish"

So also there's like a utility function that might need to be taken into account: we really want our data to fit the format nicely of having a name, a unit, and a quantity. 

A really, really easy way to deal with all of this is just to get rid of all the "plus" ingredient listings, which are only ~3000 out of the 82k samples. It might mess up the data, but it's easier. Also, none of this is training or testing data; this is "I actually need this" data, so it's convenient if I just scrap the messy stuff. It will also probably have come up in other sections. 

In [12]:
# we use a regex to flag an ingredient any time "plus" appears as a whole word, with or without a comma after it
def removePlus(ingredient):
    # \b keeps words that merely contain "plus" from matching
    if re.search(r"\bplus\b", ingredient) is None:
        return ingredient
    else:
        return ""

noPlusListings = [removePlus(ingredient) for ingredient in noParenthesisIngredients]

len(noPlusListings)

82097

#### dashes, commas, and other grammar thingies
might be worth removing all of that, but not going to yet. 

##### Asterisks (*)
Asterisks appear in two variants:
"2 1/2 pounds Jerusalem artichokes *" where the asterisk is at the end, and "*seedless red grapes" where it's indicating that this is the start of a comment. We can thus remove anything after an asterisk, since it doesn't matter in either case.

In [13]:
# removing all the text after an asterisk
asteriskFreeListings = [ re.sub("\*.*\n*",'',ingredient) for ingredient in noPlusListings ]
print len(asteriskFreeListings)

82097


#### "a" and "an"
This probably maps to the number 1 right?
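
If so, a sketch of the substitution could look like this. The example listings are made up, and `article_to_one` is a hypothetical name.

```python
import re

# A leading "a" or "an" followed by whitespace, in any letter case.
LEADING_ARTICLE = re.compile(r"^\s*(?:an|a)\s+", re.IGNORECASE)

def article_to_one(listing):
    """Replace a leading article with the quantity 1."""
    return LEADING_ARTICLE.sub("1 ", listing)

converted = [article_to_one(text) for text in
             ["a pinch of salt", "An egg", "anchovy fillets"]]
# ['1 pinch of salt', '1 egg', 'anchovy fillets']
```

Requiring whitespace after the article is what keeps words like "anchovy" untouched.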

#### typos

Not sure yet how to even check for typos here. 

#### "or"
we could remove all the tokens after the word "or", since it's optional.

examples:
- 1 cup fresh or frozen cranberries (about 4 ounces)


In [14]:
# remove the word "or" and everything after it
# ([A-Za-z] rather than [A-z], which also matches [ \ ] ^ _ and `)
noOrListings = [ re.sub(r"[^A-Za-z]\.*,*\s*(or|OR|Or)+,*\s+.*",'', ingredient) for ingredient in asteriskFreeListings ]

print len(noOrListings)

82097


Let's make sure that our mapping methods are still valid.

In [16]:
# let's make a function to make our lives easier for doing lookup
def getCleanedIngredientFromRawIngredient(rawIngredient, uniqueList):
    index = rawIngredientToIndex[rawIngredient]
    return uniqueList[index]

for i in range(20,25):
    print uniqueRawIngredients[i]
    print getCleanedIngredientFromRawIngredient( uniqueRawIngredients[i] , noOrListings)
    print '\n'

4 oz extra-sharp reduced-fat Cheddar (made from 2% milk), coarsely grated
4 oz extra-sharp reduced-fat Cheddar, coarsely grated


12 large fennel bulbs, trimmed, halved lengthwise, cored, sliced crosswise
12 large fennel bulbs, trimmed, halved lengthwise, cored, sliced crosswise


1 pork tenderloin
1 pork tenderloin


2 lb medium shrimp in shell (31 to 35 per pound), peeled and deveined
2 lb medium shrimp in shell, peeled and deveined


2 tablespoon white wine vinegar
2 tablespoon white wine vinegar




In [22]:
# now we make a hash map from the cleaned inputs to their index
cleanedIngredientToIndex = dict(zip(noOrListings, range(len(noOrListings))))

# check whether the raw index and the cleaned index agree (duplicate cleaned strings keep only their last index)
for i in range(20,25):
    rawIngredient = uniqueRawIngredients[i]
    cleanedIngredient = getCleanedIngredientFromRawIngredient(rawIngredient, noOrListings)
    print "raw: " + rawIngredient + ", " + str(rawIngredientToIndex[rawIngredient])
    print "cleaned: " + cleanedIngredient + ", " + str(cleanedIngredientToIndex[cleanedIngredient])

raw: 4 oz extra-sharp reduced-fat Cheddar (made from 2% milk), coarsely grated, 20
cleaned: 4 oz extra-sharp reduced-fat Cheddar, coarsely grated, 20
raw: 12 large fennel bulbs, trimmed, halved lengthwise, cored, sliced crosswise, 21
cleaned: 12 large fennel bulbs, trimmed, halved lengthwise, cored, sliced crosswise, 21
raw: 1 pork tenderloin, 22
cleaned: 1 pork tenderloin, 61194
raw: 2 lb medium shrimp in shell (31 to 35 per pound), peeled and deveined, 23
cleaned: 2 lb medium shrimp in shell, peeled and deveined, 23
raw: 2 tablespoon white wine vinegar, 24
cleaned: 2 tablespoon white wine vinegar, 24
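
The pork tenderloin row doesn't agree (22 versus 61194), presumably because a different raw listing cleans to the same string and `dict(zip(...))` keeps only the last index for a repeated key. A small sketch with made-up listings shows the effect, plus one way to keep every index (`all_indices` is a hypothetical name):

```python
from collections import defaultdict

# Hypothetical cleaned listings; positions 0 and 2 are identical after cleaning.
cleaned = ["1 pork tenderloin", "2 cups flour", "1 pork tenderloin"]

# Same construction as cleanedIngredientToIndex: later duplicates overwrite earlier ones.
last_index = dict(zip(cleaned, range(len(cleaned))))

# Alternative: remember every position at which a cleaned string occurs.
all_indices = defaultdict(list)
for i, text in enumerate(cleaned):
    all_indices[text].append(i)
```

Here `last_index["1 pork tenderloin"]` is 2, while `all_indices["1 pork tenderloin"]` is `[0, 2]`.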


In [24]:
i = cleanedIngredientToIndex["1 cup fresh"]
print uniqueRawIngredients[i]
print noParenthesisIngredients[i]
print noPlusListings[i]
print asteriskFreeListings[i]
print noOrListings[i]

1 cup fresh or frozen cranberries (about 4 ounces)
1 cup fresh or frozen cranberries
1 cup fresh or frozen cranberries
1 cup fresh or frozen cranberries
1 cup fresh


Okay, so how will we get from the modeled stuff to the original recipe?

1. map json to listing index

1a. map model json to input to model aka cleanedListing

1b. map cleanedListing to index

2. map original listing to listing index (done)
3. reverse map listing index to json

and then i think we're good

Okay, now let's write this clean stuff to a file.

In [12]:
# write the ingredients to a file, which we'll then feed to a model
# ('w' rather than 'a' so re-running this cell doesn't append a second copy)
with open('ingredientsList.txt', 'w') as the_file:
    for ingredient in noOrListings:
        if ingredient != "":
            asciiOnlyIngredient = "".join(i for i in ingredient if ord(i)<128)
            ingredientString = asciiOnlyIngredient + "\n"
            the_file.write(ingredientString)


In [25]:
# okay, the model ran and i've got the sauce

# load in the labeled stuff
with open('results.json') as g:
    # the with block closes the file on exit
    cookedIngredients = json.load(g)
    
print cookedIngredients[0]['name']
print cookedIngredients[0]['unit']
print cookedIngredients[0]['qty']

print len(cookedIngredients)

lemon juice
cup
1 1/4
78688


In [14]:
# now we need to normalize all of the units and measures. We'll use milliliters for volume and grams for mass.

# first we'll get the indices of all the cooked ingredients that have a unit
# (a list comprehension, so that index 0 isn't added as a placeholder for the ones without a unit)
unitContainingIndices = [i for i in range(len(cookedIngredients)) if 'unit' in cookedIngredients[i]]

# get all of the units
unitList = [cookedIngredients[i]['unit'] for i in unitContainingIndices]

# remove duplicates
uniqueUnitList = list(set(unitList))

print len(uniqueUnitList)

93
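
Here's a rough sketch of the normalization itself for the standard units, using US customary conversions. It assumes `qty` strings look like "1 1/4", and it treats "ounce" as a mass unit even though some listings may mean fluid ounces. The names `ML_PER_UNIT`, `G_PER_UNIT`, and `normalize` are hypothetical.

```python
from fractions import Fraction

# US customary volume units in milliliters.
ML_PER_UNIT = {"teaspoon": 4.929, "tablespoon": 14.787, "cup": 236.588,
               "pint": 473.176, "quart": 946.353}
# Mass units in grams ("ounce" is assumed to be weight, not fluid ounces).
G_PER_UNIT = {"ounce": 28.3495, "pound": 453.592, "gram": 1.0}

def parse_qty(text):
    """Turn a quantity string such as '1 1/4' into a float."""
    return float(sum(Fraction(part) for part in text.split()))

def normalize(cooked):
    """Return (amount, 'ml' or 'g') for a cooked ingredient, or None if it can't be converted."""
    unit = cooked.get("unit")
    qty = cooked.get("qty")
    if unit is None or qty is None:
        return None
    try:
        amount = parse_qty(qty)
    except (ValueError, ZeroDivisionError):
        return None  # quantity text that isn't a plain number or fraction
    if unit in ML_PER_UNIT:
        return (amount * ML_PER_UNIT[unit], "ml")
    if unit in G_PER_UNIT:
        return (amount * G_PER_UNIT[unit], "g")
    return None  # count-like units such as 'sprig' or 'clove'

lemon = normalize({"name": "lemon juice", "unit": "cup", "qty": "1 1/4"})
```

Count-like units (slice, clove, sprig, ...) have no fixed volume or mass, so they come back as `None` here and would need per-ingredient handling.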


### Pre and Post Modeling Cleaning
What cleaning should be done before we feed the model, and what cleaning should be done after? Also, should we change our factor functions? 

Well, let's think quantitatively about what cleaning means now. We've identified the units from the model, and they're obviously not perfect. We should look at whether we can just cut the bad ones out now.

In [15]:
# make a dictionary mapping unit to ingredients
sortedIngredientsByUnit = dict()

for ingredient in cookedIngredients:
    # skip anything that isn't a cooked-ingredient dictionary
    if not isinstance(ingredient, dict):
        continue
    # ingredients without a unit are grouped under 'na'
    unit = ingredient.get('unit', 'na')
    # create the list for an unseen unit, then append to it
    sortedIngredientsByUnit.setdefault(unit, []).append(ingredient)

In [16]:
unitByCount = dict()

for unit in sortedIngredientsByUnit.keys():
    unitByCount[unit] = len(sortedIngredientsByUnit[unit])
    
unitByCountSorted = sorted(unitByCount.items(), key=lambda item: (item[1], item[0]), reverse=True)

print unitByCountSorted

[('na', 28250), (u'cup', 20551), (u'tablespoon', 8040), (u'pound', 5805), (u'teaspoon', 5375), (u'ounce', 3670), (u'slice', 1042), (u'clove', 802), (u'bunch', 686), (u'head', 589), (u'piece', 483), (u'can', 447), (u'sprig', 436), (u'stick', 305), (u'stalk', 274), (u'package', 267), (u'pint', 234), (u'quart', 162), (u'pinch', 154), (u'fillet', 142), (u'strip', 140), (u'bottle', 109), (u'ear', 103), (u'dash', 76), (u'jar', 53), (u'bag', 50), (u'handful', 47), (u'loaf', 40), (u'gram', 35), (u'dozen', 31), (u'bulb', 25), (u'sheet', 23), (u'envelope', 23), (u'cup sprig', 22), (u'box', 19), (u'gallon', 11), (u'cube', 11), (u'batch', 10), (u'clove teaspoon', 9), (u'knob', 8), (u'square', 7), (u'rack', 7), (u'pound fillet', 7), (u'ounce slice', 7), (u'wedge', 6), (u'ball', 6), (u'cup tablespoon', 5), (u'chunk', 5), (u'12-ounce', 5), (u'twist', 4), (u'splash', 4), (u'liter', 4), (u'drop', 4), (u'tablespoon tablespoon', 3), (u'log', 3), (u'cup slice', 3), (u'can fillet', 3), (u'teaspoon teaspoon

So I think since the first 25 units account for ~99% of the ingredients in the set, I'm just gonna drop the remaining ones. 

In [17]:
unitByCountTruncated = dict(unitByCountSorted[:25])
print unitByCountTruncated
print sum(unitByCountTruncated.values())

{u'pound': 5805, u'dash': 76, u'strip': 140, u'bunch': 686, u'clove': 802, u'slice': 1042, u'cup': 20551, 'na': 28250, u'jar': 53, u'fillet': 142, u'teaspoon': 5375, u'stalk': 274, u'pint': 234, u'head': 589, u'tablespoon': 8040, u'quart': 162, u'stick': 305, u'ear': 103, u'package': 267, u'pinch': 154, u'ounce': 3670, u'sprig': 436, u'can': 447, u'bottle': 109, u'piece': 483}
78195
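
The ~99% figure checks out against the numbers above: 78,195 of the 78,688 cooked ingredients is about 99.4%. A small sketch of the same check as a function, with a made-up toy dictionary (`top_k_coverage` is a hypothetical name):

```python
def top_k_coverage(counts, k):
    """Fraction of all items that fall under the k most frequent keys."""
    ordered = sorted(counts.values(), reverse=True)
    return float(sum(ordered[:k])) / sum(ordered)

# Hypothetical toy counts, just to exercise the function.
toy_counts = {"cup": 60, "tablespoon": 30, "knob": 6, "twist": 4}
toy_coverage = top_k_coverage(toy_counts, 2)  # 0.9

# The figure from this notebook: top-25 total over all cooked ingredients.
notebook_coverage = 78195.0 / 78688
```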


At this point, I'm not really sure how helpful my previous work is. Anyway, what I need to do now is reassociate each recipe with its ingredients, now modeled. I think I'll have to make a new dictionary where the keys are titles and the values are the modeled ingredients.

In [26]:
# make a mapping from cleaned listing to model
cleanedIngredientToCookedIngredient = dict()
for cookedIngredient in cookedIngredients:
    cleanedIngredient = cookedIngredient['input']
    cleanedIngredientToCookedIngredient[cleanedIngredient] = cookedIngredient

okay so that dictionary is definitely working

In [27]:
# titleToCookedRecipe is a mapping from title to recipe containing cooked ingredients
titleToCookedRecipe = dict()

# for every recipe title
for recipe in titleToRawRecipe.values():
    title = recipe['title']
    print "recipe title: " + title
    # create an empty list to store the ingredients
    # (named recipeCookedIngredients so it doesn't overwrite the global cookedIngredients list)
    recipeCookedIngredients = list()
    # for every ingredient in that recipe
    for rawIngredient in recipe['ingredients']:
        print "raw ingredient: " + rawIngredient

        # get the cleaned listing
        cleanedIngredient = getCleanedIngredientFromRawIngredient(rawIngredient, noOrListings)
        print "cleaned ingredient: " + cleanedIngredient
        
        # get the model based on the cleaned listing
        cookedIngredient = cleanedIngredientToCookedIngredient[cleanedIngredient]
        print cookedIngredient
        
        # append that model to the list
        recipeCookedIngredients.append(cookedIngredient)
    print recipeCookedIngredients
    # enter the title and the list into the dictionary as key value pairs
    titleToCookedRecipe[title] = recipeCookedIngredients

recipe title: Roasted Butternut Squash, Rosemary, and Garlic Lasagne 
raw ingredient: 3 pounds butternut squash, quartered, seeded, peeled, and cut into 1/2-inch dice (about 9 1/2 cups)
cleaned ingredient: 3 pounds butternut squash, quartered, seeded, peeled, and cut into 1/2-inch dice
{u'comment': u'quartered seeded peeled and cut into 1/2-inch dice', u'name': u'butternut squash', u'qty': u'3', u'other': u',, , ,', u'input': u'3 pounds butternut squash, quartered, seeded, peeled, and cut into 1/2-inch dice', u'display': u"<span class='qty'>3</span><span class='unit'>pounds</span><span class='name'>butternut squash</span><span class='other'>,</span><span class='comment'>quartered</span><span class='other'>,</span><span class='comment'>seeded</span><span class='other'>,</span><span class='comment'>peeled</span><span class='other'>,</span><span class='comment'>and cut into 1/2-inch dice</span>", u'unit': u'pound'}
raw ingredient: 3 tablespoons vegetable oil
cleaned ingredient: 3 tablespo

KeyError: u'2 ounces Manchego cheese,'
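
The KeyError presumably means some cleaned listings have no matching `input` in results.json. Two causes are visible in the cells above: blanked listings ("") were never written to ingredientsList.txt, and non-ASCII characters were stripped before writing, so those keys no longer match. The model may also normalize its `input`; I can't tell from this output which of these hit the Manchego line. Here's a sketch of a more forgiving lookup under those assumptions (`ascii_only` and `lookup_cooked` are hypothetical helpers, and the toy dictionary is made up):

```python
def ascii_only(text):
    # Same filter that is applied before writing ingredientsList.txt.
    return "".join(ch for ch in text if ord(ch) < 128)

def lookup_cooked(cleaned, cleaned_to_cooked):
    """Return the model output for a cleaned listing, or None if there isn't one."""
    if cleaned == "":
        return None  # blanked listings were never sent to the model
    return cleaned_to_cooked.get(ascii_only(cleaned))

# Hypothetical model output keyed by the ASCII-stripped input it was given.
toy_cooked = {u"1 cup crme frache": {"name": u"crme frache"}}

found = lookup_cooked(u"1 cup cr\u00e8me fra\u00eeche", toy_cooked)
blank = lookup_cooked(u"", toy_cooked)
missing = lookup_cooked(u"2 ounces Manchego cheese,", toy_cooked)
```

With a lookup like this, the recipe loop can skip ingredients that come back as `None` instead of stopping at the first miss.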

In [29]:
for cookedIngredient in titleToCookedRecipe['Roasted Butternut Squash, Rosemary, and Garlic Lasagne ']:
    print cookedIngredient['name']

butternut squash
vegetable oil
milk
rosemary
garlic
unsalted butter
all-purpose flour
nine 7- by 3 1/2-inch sheets dry no-boil lasagne pasta
Parmesan
heavy cream
salt
rosemary
