# Recipes PCA
### By Brian Kitano

Okay, I'm going to use the Epicurious dataset to identify palates common across their recipes. 

## Introduction / Hypothesis

## Materials

## Procedure
1. Download the JSON data
2. Parse the JSON to extract recipe names and ingredients with their quantities.
3. Create the data matrix M, where each column is a recipe and each row is an ingredient; the entry is the quantity in a normalized and standardized quantity (grams?)
4. PCA

Bonus: construct the bipartite graph of ingredients to recipes, and then project it down onto a unipartite graph of ingredients where the weight of each edge is the frequency of connections. 

In [1]:
# 1. Parse the ingredients to extract recipe names and ingredients with their quantities
import json

# load in the epicurious set
with open('full_format_recipes.json') as f:
    data = json.load(f)
    f.close()
    
# print data[0]['ingredients']
# len(data) = 20130

# need to filter out the recipes that don't have ingredients listed
# extractIngredients = lambda i: data[i].keys


In [2]:
def containsIngredients(index):
    if 'ingredients' in data[i].keys():
        return i
    else:
        return 0

# get all of the indices which contain ingredients
cleanedIndices = [ containsIngredients(i) for i in range(len(data)) ]

# get all of the ingredient lists as a list of lists
cleanedIngredientsLists = [ data[i]['ingredients'] for i in cleanedIndices ]

# flatten this list, which might contain duplicates
cleanedIngredients = [ ingredient for ingredientList in cleanedIngredientsLists for ingredient in ingredientList]
print len(cleanedIngredients)

# remove any duplicates
uniqueCleanedIngredients = list(set(cleanedIngredients))
print len(uniqueCleanedIngredients)

199315
83465


### Data Cleaning
Before we write all of the ingredients to a file, we should do some NLP cleaning. In looking at the results of the first model run, it seems like to be conservative we should remove all the text that occurs in parentheses, as this seems to really mess up the CRF's ability to identify units.

#### remove things in parentheses (use regex)

In [3]:
import re

# remove all the text that is inside a parenthesis
noParenthesisIngredients = [re.sub(r'\([^)]*\)', '', ingredient) for ingredient in uniqueCleanedIngredients]

print len(noParenthesisIngredients)

83465


#### dealing with the word "plus"

More complex problem. There are lots of ways that "plus" is used. Some examples:

##### when quantities don't add nicely
- "1/2 cup plus 1 1/2 tablespoons red wine vinegar"
- "1/2 cup plus 2 tablespoons granola"
- "1/4 cup plus 1 tablespoon warm water"
- "1/4 teaspoon plus 1/3 cup sugar"
- "2 tablespoons plus 1/2 cup chopped fresh dill"
- "1 tablespoon plus 1/2 teaspoon Dijon mustard"
- "2/3 cup plus 6 tablespoons coarsely chopped pecans"
- "1 cup plus 2 tablespoons whole milk"
- "1 1/2 cups plus 2 tablespoons sugar"
- "1 1/2 cups plus 2 tablespoons water"
- "1 tablespoon plus 3/4 teaspoon ground cinnamon"
- "1 tablespoon plus one teaspoon fresh lemon juice"

These are in a consistent format of UNIT QUANTITY PLUS UNIT QUANTITY INGREDIENT. If we add PLUS as a label, then over the 3k samples we have we might improve, but we might also tag some things as being PLUS when we don't want them to be.

##### when there's a suggestion for more on the side (not a lot of errors there)
- "1/4 cup olive oil, plus more for grilling"
- "5 teaspoons all-purpose flour plus more for dusting"
- "2 tablespoons drained capers plus more for serving"
- "1/2 cup freshly grated Parmesan cheese plus additional for passing"
- "12 rice-paper rounds, plus more in case some tear"
- "1 can whole tomatoes, plus juice"
- "1 tablespoon chile oil containing sesame oil plus some of sediment from jar"

These ones seem like i can just remove all the words after the plus.

##### other, stupid ones
- "8 cornichons, finely chopped, plus 2 pickled onions from the jar, minced"
- "1/2 cup oil-packed sun-dried tomatoes, chopped, plus 2 tablespoons tomato oil"
- "1 tablespoon fresh rosemary leaves or 1 teaspoon crumbled dried, plus rosemary sprigs for garnish"
- "Juice of 1/4 lime, plus 1 lime wedge for garnish"
- "1 1/2 cups sugar, plus 1/4 cup mixed with 1 tablespoon cinnamon, on a plate"
- "1/2 fennel bulb, finely chopped, plus 1 tablespoon finely chopped fronds"
- "1/4 cup chopped fresh cilantro plus 32 whole fresh cilantro leaves"
- "6 large celery stalks, thickly sliced, plus 2 1/2 cups 1/2-inch-thick slices"
- "6 fresh mint leaves plus 1 mint sprig for garnish"

So also there's like a utility function that might need to be taken into account: we really want our data to fit the format nicely of having a name, a unit, and a quantity. 

A really, really easy way to deal with all of this is just to get rid of all the "plus" ingredient listings, which are only ~3000 out of the 83k samples. It might mess up the data but it's easier. Also none of this is training or testing data, this is like actually "I need this" data so it's convenient if I just scrap the shitty stuff. It will also probably have come up in other sections. 

In [4]:
# we use a regex to tag an igredient any time "plus" appears as a word with or without a comma on its own
noPlusListings = list(filter(lambda ingredient: (re.search("\s*(plus)\,*\s*", ingredient) == None), noParenthesisIngredients))

len(noPlusListings)

80480

#### dashes, commas, and other grammar thingies
might be worth removing all of that, but not going to yet. 

##### Asterisks (*)
Asterisks appear in two variants:
"2 1/2 pounds Jerusalem artichokes *" where the asterisk is at the end, and "*seedless red grapes" where it's indicating that this is the start of a comment. We can thus remove anything after an asterisk, since it doesn't matter in either case.

In [26]:
# removing all the text after an asterisk
asteriskFreeListings = [ re.sub("\*.*\n*",'',ingredient) for ingredient in noPlusListings ]

#### typos

like fam what how is that even ugh how do i check for typos here. 

#### "or"
we could remove all the tokens after the word "or", since it's optional.

In [27]:
# remove all the things after an or
noOrListings = [ re.sub("[^A-z]\.*\,*\s*(or|OR|Or)+.*",'', ingredient) for ingredient in asteriskFreeListings ]

Okay, now let's write this clean stuff to a file.

In [29]:
# write the ingredients to a file, which we'll then feed to a model
with open('ingredientsList.txt', 'a') as the_file:
    for ingredient in noOrListings:
        if ingredient != "":
            asciiOnlyIngredient = "".join(i for i in ingredient if ord(i)<128)
            ingredientString = asciiOnlyIngredient + "\n"
            the_file.write(ingredientString)


In [48]:
# okay, the model ran and i've got the sauce

# load in the labeled stuff
with open('results.json') as g:
    labeledIngredients = json.load(g)
    g.close()
    
print labeledIngredients[0]['name']
print labeledIngredients[0]['unit']
print labeledIngredients[0]['qty']

lemon juice
cup
1 1/4


In [49]:
# now we need to normalize all of the units and measures. We'll use milliliters for volume and grams for mass.

# first we'll get a list of all the units
def containsUnit(i):
    if 'unit' in labeledIngredients[i].keys():
        return i
    else:
        return 0
    
# get all of the indices which contain units
unitContainingIndices = [containsUnit(i) for i in range(len(labeledIngredients))]
unitContainingIndices = list(set(unitContainingIndices))

# get all of the units
unitList = [labeledIngredients[i]['unit'] for i in list(set(unitContainingIndices))]

# remove duplicates
uniqueUnitList = list(set(unitList))

print len(uniqueUnitList) # 764 bruh

for unit in uniqueUnitList:
    print unit

103
pound ounce
bunch head
fifth
bunch
cup
stalk tablespoon
clove clove
jar
tablespoon stalk
teaspoon
knob
liter
pound cup
tablespoon cup
pound fillet
piece fillet
pinch
drop
cup teaspoon
teaspoon teaspoon
bag
gram
pound pound
steak
rack
square
chunk
cup cup
packet
teaspoon bag
12-ounce
teaspoon tablespoon
cup cup tablespoon
slice
fillet
can fillet
slice slice
handful
stem
quart
tablespoon sprig
12-ounce bottle
pound stalk
cup sprig
box
ounce ounce
package
12-ounce bag
cup piece
ounce
sprig
head
loaf
wedge
cup tablespoon cup
cup ounce
strip
pound slice
cup cup cup
bulb tablespoon
sheet
log
splash
tablespoon tablespoon
ounce fillet
envelope
cup slice
stick
bulb
ear
twist
bunch sprig
batch
tablespoon teaspoon
stalk
can
ounce cup
piece
cup tablespoon
pound
ounce can
dash
head clove
gallon
dozen
clove
clove teaspoon
branch
pint
cube
ball
tablespoon
bunch bunch
cup stalk
pinch sprig
pair
segment
ounce tablespoon
cup strip
12-ounce bunch
fillet teaspoon
ounce slice
bottle


### Pre and Post Modeling Cleaning
What cleaning should be done before we feed the model, and what cleaning should be done after? Also, should we change our factor functions? 
- "5 flat anchovy filets" the model reads with "flat" being the unit

I can probably safely ignore where the unit is registered as not a real unit. In this case, I should create some sort of lookup for real units. 