The spidering and data handling is in spiderlings.ipynb; this notebook is for cleaning up the data and normalizing the JSON format.

The reason I put this all in a notebook is that I expect a lot of the stuff to vary wildly from file to file. There are a couple of reasons for this. One is that I didn't do a good job of keeping things clean while developing the spiders over multiple years. The other is that the datasets themselves admit different levels of parsing.

For a first pass, I want to clean up ingredients. I'm going to want to normalize a couple of different aspects of the representation:

- All liquids in the same units
    - Might be centiliters
    - dashes, barspoons, etc. need conversion
- All names of cocktails in title case (Gin Martini, not GIN MARTINI)
- Cocktails with matching names assigned a unique ID as well (Martini-01, etc.)
- Ingredients normalized so largest amount is 1, others proportional to that
    - Not so they sum to one, that loses relative scale
- Some way of handling garnishes and muddled ingredients
    - Eggs, muddled ingredients tend to get counted rather than measured. 
- Language modeling to normalize instructions
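
To make the unit and proportion goals concrete, here is a minimal sketch of what I have in mind. The US customary factors are standard, but "dash" and "barspoon" have no fixed definition, so those two numbers are assumptions, and the martini amounts are made up for the example. `CL_PER_UNIT`, `to_centiliters`, and `scale_to_largest` are hypothetical names, not anything that exists in the spiders.

```python
# Centiliters per unit. The US customary values are standard; "dash" and
# "barspoon" are rough conventions, so those two numbers are assumptions.
CL_PER_UNIT = {
    "cl": 1.0,
    "ml": 0.1,
    "ounce": 2.95735,       # US fluid ounce
    "tablespoon": 1.47868,
    "teaspoon": 0.492892,
    "cup": 23.6588,
    "dash": 0.09,           # assumed: roughly 1/32 fl oz
    "barspoon": 0.5,        # assumed: roughly one teaspoon
}


def to_centiliters(amount, unit):
    """Convert an amount in a known unit to centiliters."""
    return amount * CL_PER_UNIT[unit]


def scale_to_largest(amounts_cl):
    """Scale a {name: centiliters} dict so the largest amount is 1.0."""
    largest = max(amounts_cl.values())
    return {name: cl / largest for name, cl in amounts_cl.items()}


# Made-up recipe, just to exercise the two functions
martini = {
    "gin": to_centiliters(2.5, "ounce"),
    "dry vermouth": to_centiliters(0.5, "ounce"),
    "orange bitters": to_centiliters(1, "dash"),
}
print(scale_to_largest(martini))
```

Scaling to the largest ingredient keeps the ratios between ingredients readable (vermouth is 0.2 of the gin), which is the point of not normalizing to a sum of one.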

For no reason other than that I picked it at random, I'm going for the Martha Stewart data first. The file is, unfortunately, not well-formatted JSON, because I created it... fuuuuck. 4 years ago. Time flies. 

Anyway, issue one is that the data is, for each line in the file, a JSON dictionary of the form:

```json
{"name": "Strawberry-Cucumber Gin-Elderflower Spritz", "ingredients": ["12 strawberries, hulled and sliced (1 1/4 cups), plus whole berries for serving", "12 thin cucumber slices, halved (3/4 cup), plus whole rounds for serving", "2 tablespoons superfine sugar", "3 ounces fresh lemon juice", "9 ounces gin, such as Citadelle, chilled", "6 ounces St-Germain, chilled", "Club soda, chilled; and Peychaud's bitters, for serving"], "instructions": ["Muddle sliced strawberries, halved cucumber slices, sugar, and lemon juice in the bottom of a pitcher until fruits break down and release most of their juices and sugar has dissolved. Stir in gin and St-Germain to combine. Fill 6 glasses halfway with ice. Divide fruit-and-gin mixture evenly among glasses. Top each with 1 to 2 ounces club soda; stir once. Top each with a few dashes of bitters, whole strawberries, and cucumber rounds; serve immediately."]}
```

What I actually want is a list of these dicts.

In [2]:
import json

recipes = []
with open("./spiders/data/martha_stewart.json", 'r') as infile:
    for line in infile:
        data = json.loads(line)
        recipes.append(data)

with open("./spiders/data/martha_stewart_cleaned.json", 'w') as outfile:
    json.dump(recipes, outfile, indent=4)

Ok, that's way cleaner. Now the drinks are in a proper list, and the ingredients are, frankly, looking like parsing them is AGI-complete. 

```json
    {
        "name": "Sour-Cherry Mojitos",
        "ingredients": [
            "1 1/4 cups sugar",
            "2/3 cup fresh lemon juice (from about 3 lemons)",
            "3 pounds frozen pitted sour cherries, partially thawed with juices",
            "1 cup fresh basil leaves, plus more for serving",
            "2 to 3 cups vodka",
            "6 cups sparkling water"
        ],
        "instructions": [
            "Bring sugar and 1 1/4 cups water to a boil in a small saucepan, stirring until sugar is dissolved, 3 minutes. Remove from heat; let cool 15 minutes. Syrup can be refrigerated for up to 1 month.",
            "Combine lemon juice, fruit, and basil in a bowl. Add syrup; mash lightly to release juices. Refrigerate at least 1 day and up to 4 days.",
            "Combine fruit mixture and vodka in a pitcher or punch bowl; ladle about 1/3 cup into each glass. Fill with ice. Top with sparkling water, garnish with more basil, and serve."
        ]
    },
```

That one isn't bad, in the sense that everything is, more or less, a number, a unit, and an ingredient. However, some of them have stuff like "Club soda, chilled; and Peychaud's bitters, for serving", which is actually two ingredients and no amounts. There's also "Licorice Ice Cubes", which is not further explained. Some of the ingredients also have a link. The link text is usually the ingredient name, although there's also stuff like "Simple Syrup for Whiskey Sours".

So let's do something simple: for every ingredient that starts with a number, count what the next token is, and graph that. 

In [22]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/ams/nltk_data...


True

In [6]:
import re
from nltk.stem.wordnet import WordNetLemmatizer
lemma = WordNetLemmatizer()

def update(d, k):
    """Count one occurrence of token k in dict d, after lemmatizing it."""
    # Lemmatize rather than stem (no more toy stemmer)
    k = k.strip(',')
    k = lemma.lemmatize(k)
    d[k] = d.get(k, 0) + 1

In [None]:


counts = {}
total = 0
unhandled = 0

for recipe in recipes:
    for ing in recipe['ingredients']:
        total += 1
        tokens = ing.split()
        try:
            if re.match(r"[1-9] [1-9]/[1-9]", ing):
                # Whole number followed by a fraction, e.g. "1 1/4 cups"
                update(counts, tokens[2])
            elif re.match(r"[1-9]/[1-9] to [1-9] [1-9]/[1-9]", ing):
                # Range from a fraction to a mixed number, e.g. "1/2 to 1 1/2 cups"
                update(counts, tokens[4])
            elif re.match(r"[1-9] to [1-9]", ing):
                # Variable amount, e.g. "2 to 3 cups"
                update(counts, tokens[3])
            elif re.match(r"[1-9]", ing):
                # Single number or bare fraction, e.g. "3 ounces" or "2/3 cup"
                update(counts, tokens[1])
            else:
                unhandled += 1
        except IndexError as e:
            # The line ended before the expected token; report it and keep going
            print(e)
            print(ing)

# Prettyprint the counts
print(f"There are {len(counts.keys())} things that could be ingredients") 
print(sorted([[k, v] for k, v in counts.items()], key = lambda x: x[1], reverse=True))
print(f"Didn't handle {unhandled} of {total} ingredients ({unhandled/total * 100:.3f}%)")

There are 79 things that could be ingredients
[['ounce', 554], ['cup', 529], ['tablespoon', 171], ['teaspoon', 129], ['lime', 30], ['bottle', 28], ['dash', 19], ['thin', 14], ['cinnamon', 13], ['pound', 12], ['small', 12], ['large', 11], ['can', 11], ['orange', 10], ['sprig', 9], ['whole', 8], ['strip', 8], ['lemon', 7], ['cucumber', 6], ['pint', 6], ['blackberry', 5], ['fresh', 5], ['750-ml', 5], ['strawberry', 4], ['quart', 4], ['mint', 4], ['bunch', 4], ['organic', 4], ['black', 4], ['slice', 4], ['rosemary', 4], ['glass', 3], ['vanilla', 3], ['maraschino', 3], ['to', 3], ['sliced', 3], ['sugar', 3], ['medium', 3], ['serrano', 3], ['basil', 3], ['(2', 3], ['ruby-red', 2], ['egg', 2], ['edible', 2], ['plum', 2], ['thyme', 2], ['piece', 2], ['cherry', 2], ['stalk', 2], ['ripe', 2], ['canned', 2], ['star-anise', 1], ['pimento-stuffed', 1], ['1/4-inch-thick', 1], ['mango', 1], ['jalapeno', 1], ['inch', 1], ['thick', 1], ['quarter-inch-thick', 1], ['one-inch', 1], ['packet', 1], ['hibisc

Lemmatizing the (suspected) units gets very clean results. Stemming was doing stuff like "bottl" and "ounc". Variations of "cup" and "ounce" account for fully half of the ingredients (1,083 out of 2,139), and the basic handler I wrote handles a hair over 80% of the items.

Of the 79 things that could be units, the ones that actually are units are:

```python
dict_keys(['tablespoon', 'ounce', 'cup', 'pound', 'teaspoon', 'quart', 'dash', 'glass', 'can', 'bottle', 'pint', 'jar'])
```

In [52]:
units = ['tablespoon', 'ounce', 'cup', 'pound', 'teaspoon', 'quart', 'dash', 'can', 'bottle', 'pint', 'jar']

# Have to rebuild the data structure, since we can't mutate the lists as we edit them
cleaned = []

for recipe in recipes:
    # Copy name and instructions over
    clean_recipe = {"name": recipe["name"], "instructions": recipe["instructions"], "ingredients": []}

    for ing in recipe['ingredients']:
        # Split and then find the first unit
        tokens = ing.split()
        replaced = False
        tmp_ingred = {}
        for t in tokens:
            t_clean = t.strip(',')
            t_clean = lemma.lemmatize(t_clean)
            if t_clean in units:
                unit_idx = tokens.index(t)
                tmp_ingred["ingred_amount"] = " ".join(tokens[:unit_idx])
                tmp_ingred["ingred_unit"] = t_clean
                tmp_ingred["ingred_name"] = " ".join(tokens[unit_idx+1:])
                replaced = True
                break
        if replaced:
            clean_recipe["ingredients"].append(tmp_ingred)
        else:
            clean_recipe["ingredients"].append(ing)
        replaced = False
            
    cleaned.append(clean_recipe)

with open("./spiders/data/martha_stewart_cleaned.json", 'w') as outfile:
    json.dump(cleaned, outfile, indent=4)             
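
The `ingred_amount` values written above are still strings like "1 1/4" or "2 to 3", so any later math needs them as numbers. A minimal sketch of that conversion, assuming the amounts only ever look like whole numbers, fractions, mixed numbers, or "X to Y" ranges. `parse_amount` is a hypothetical helper, and collapsing a range to its midpoint is an arbitrary choice.

```python
from fractions import Fraction


def parse_amount(text):
    """Turn '3', '2/3', '1 1/4' or '2 to 3' into a float; None if unparseable."""
    text = text.strip()
    if not text:
        return None
    if " to " in text:
        # Collapse a range to its midpoint (an arbitrary choice for this sketch)
        low, high = (parse_amount(part) for part in text.split(" to ", 1))
        if low is None or high is None:
            return None
        return (low + high) / 2
    try:
        # Summing the pieces handles mixed numbers: '1 1/4' -> 1 + 1/4
        return float(sum(Fraction(part) for part in text.split()))
    except (ValueError, ZeroDivisionError):
        return None


for raw in ["3", "2/3", "1 1/4", "2 to 3", "a few"]:
    print(repr(raw), "->", parse_amount(raw))
```

Returning `None` instead of raising keeps the unparseable amounts visible, so they can be dumped and inspected the same way as the unhandled ingredients.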

The Martha Stewart data was created with a very particular presentation in mind, which is to say entertaining. One of the recipes is a can of ginger ale with a shot of bourbon and a squeeze of lime in it, which is fine, but it's presented as serving four people, and so calls for four of everything. The recipes also list ice as an ingredient, both when drinks are shaken with it and when they're served over it.

There are a lot of ingredients that are not handled yet, and they're represented in a number of different ways. Let's dump the remaining ingredients and see what we get.

In [57]:
unhandled = []
with open("./spiders/data/martha_stewart_cleaned.json", 'r') as infile:
    clean_data = json.load(infile)
    for recipe in clean_data:
        for ingred in recipe["ingredients"]:
            if type(ingred) == str:
                unhandled.append(ingred)

unhandled = list(set(unhandled))
print(f"Got {len(unhandled)} items")
for ingred in sorted(unhandled):
    print(ingred)

Got 303 items
1 750ml-bottle sparkling rose
1 black licorice twist, for serving
1 bunch fresh tarragon
1 bunch mint, tough stems removed, 6 sprigs reserved for serving
1 cinnamon stick
1 cinnamon stick, for garnish
1 cucumber spear, for garnish
1 egg white
1 jalapeno chile, seeded and chopped
1 large English cucumber or 2 standard 6-inch-long cucumbers, peeled, seeded, and thinly sliced (3 cups), plus more slices for garnish
1 large English cucumber, peeled and cut into chunks
1 large honeydew melon, seeds and rind removed, cut into large chunks
1 lemon wedge
1 lemon, sliced into 1/8-inch rounds
1 lime
1 lime wedge
1 lime wedge, for garnish
1 lime wedge, plus more for serving
1 lime, cut into wedges
1 lime, sliced into 1/8-inch rounds
1 lime, thinly sliced
1 lime, zested and cut in half
1 maraschino cherry, for serving (optional)
1 medium English cucumber, peeled and chopped
1 medium green apple (about 8 ounces), peeled (peels reserved)
1 navel orange, thinly sliced into half-moons
1 n

There are about 300 unhandled ingredients (303 after removing exact duplicates), and some of them are near-duplicates of each other. It looks like there are some strings that we can simply delete, like "for serving" or "for garnish", since those notes are repeated in the instructions.
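
A rough sketch of that deletion, using a regex for trailing serving notes. The set of phrases it recognizes is a guess based on the dump above, and `strip_serving_note` is a hypothetical name. It deliberately leaves anything it doesn't recognize alone.

```python
import re

# Trailing notes that only repeat what the instructions say. The exact set
# of phrases is a guess based on the dumped ingredients.
SERVING_NOTE = re.compile(
    r",?\s*(?:plus more )?for (?:serving|garnish)(?: \(optional\))?\s*$"
)


def strip_serving_note(ingredient):
    """Remove a trailing 'for serving' / 'for garnish' note from a raw string."""
    if not isinstance(ingredient, str):
        # Already-parsed ingredients pass through unchanged
        return ingredient
    return SERVING_NOTE.sub("", ingredient)


for raw in [
    "1 cinnamon stick, for garnish",
    "1 lime wedge, plus more for serving",
    "1 maraschino cherry, for serving (optional)",
    "1 lime wedge",
]:
    print(strip_serving_note(raw))
```

This only handles notes at the very end of the string. Something like "plus whole berries for serving" keeps its "plus whole berries" part and would need its own rule.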

Single dashes sometimes didn't have a 1 in front of them. Finding those instances can fix a few things up, but there were only three of them, so I did it by hand.

The "Juice of one Lime" is about an ounce, so we can do that substitution.

The recipes call out a lemon as being 3 Tbsp of juice, or 1.5 oz.

The juice of 1/2 grapefruit is variable with the size, one source has "A small grapefruit yields around 1/4 cup of juice, a medium grapefruit yields around 1/3 cup of juice, and a large grapefruit yields around 1/2 cup of juice" (https://thejuiceryworld.com/how-much-juice-in-a-grapefruit/). 

A highball glass can contain 240 to 350 millilitres.

A rocks glass usually holds 180–300 ml / 6-10 US fluid ounces.

That's kind of interesting, because a highball glass and a rocks glass hold about the same amount, with maybe a little more in the highball. This helps with "top with" ingredients: I can take the amount of the other ingredients, subtract it from the capacity, and get a value, _but I still need to know the glass type_.
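
The subtraction itself is simple once the glass is known. A minimal sketch, using the capacity ranges quoted above and ignoring the volume of ice entirely (a big simplification). `GLASS_CAPACITY_ML` and `top_up_range_ml` are hypothetical names, and the example drink is made up.

```python
# Capacities in millilitres, taken from the ranges quoted above.
GLASS_CAPACITY_ML = {
    "highball": (240, 350),
    "rocks": (180, 300),
}


def top_up_range_ml(glass, other_ingredients_ml):
    """Return the (min, max) millilitres left for a 'top with' ingredient.

    Ignores the volume taken up by ice, which is a big simplification.
    """
    low, high = GLASS_CAPACITY_ML[glass]
    used = sum(other_ingredients_ml)
    # Never report a negative amount if the glass is already full
    return max(low - used, 0), max(high - used, 0)


# Made-up highball: 60 ml of spirit and 15 ml of juice already in the glass
print(top_up_range_ml("highball", [60, 15]))  # (165, 275)
```

The result is a range rather than a single value because the glass sizes themselves are ranges, so any "top with soda" amount derived this way is only a rough estimate.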

--- 

Cleaning up Mr Boston. This file is also one JSON dictionary per line, so clean that up first.

In [None]:
import json

recipes = []
with open("./spiders/data/mr_boston.json", 'r') as infile:
    for line in infile:
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            # Show the bad line and skip it rather than reusing stale data
            print(line)
            continue
        # Ignore bad recipes (no ingredients)    
        if len(data["ingredients"]) > 0:
            # Convert names to title case
            data['name'] = data['name'].title()
            recipes.append(data) 



In [121]:
cleaned = []
# Purge recipes with no title
for r in recipes:
    if len(r["name"]) > 0:
        cleaned.append(r)

In [131]:
with open("./spiders/data/mr_boston_cleaned.json", 'r') as infile:
    recipes = json.load(infile)

    units = ['Ounce(s)', 'Dash(es)', "Teaspoon(s)", "Liter", "Quart(s)", "Cup", "Tablespoon(s)", "Pint(s)", "Bottle(s)"]

    # Have to rebuild the data structure, since we can't mutate the lists as we edit them
    cleaned = []

    for recipe in recipes:
        # Copy name and instructions over
        clean_recipe = {"name": recipe["name"], "instructions": recipe["instructions"], "ingredients": []}

        for ing in recipe['ingredients']:
            # Already-parsed ingredients (dicts) pass through unchanged
            if not isinstance(ing, str):
                clean_recipe["ingredients"].append(ing)
                continue
            # Split and then find the first unit
            tokens = ing.split()
            tmp_ingred = {}
            for unit_idx, t in enumerate(tokens):
                if t in units:
                    tmp_ingred["ingred_amount"] = " ".join(tokens[:unit_idx])
                    tmp_ingred["ingred_unit"] = t
                    tmp_ingred["ingred_name"] = " ".join(tokens[unit_idx+1:])
                    break
            # Keep the raw string when no unit was found
            clean_recipe["ingredients"].append(tmp_ingred or ing)
                
        cleaned.append(clean_recipe)

The Mr Boston drinks have "Amount Juice of Orange" and similar, so if we get an ounces amount for each kind of fruit, we can calculate these. 

In [89]:
fruits = {}
amounts = {}

for recipe in cleaned:
    for ing in recipe['ingredients']:
        if type(ing) == str:
            if re.match("[0-9/]* Juice of", ing):
                tok = ing.split()
                idx = tok.index("Juice")
                amount = " ".join(tok[:idx])
                fruit = " ".join(tok[idx+2:])
                update(fruits, fruit)
                update(amounts, amount)
                
print(fruits)
print(amounts)

{'a Lime': 219, 'a Lemon': 388, 'Orange': 91, 'a small orange': 6}
{'1': 174, '1/2': 442, '1/4': 68, '2': 18, '12': 2}


In [123]:
# Converts juice to an amount of juice
juice_amts = {'a Lime': 1, 'a Lemon': 1.5, 'Orange': 2.6, 'a small orange': 1.5}
juice_names = {'a Lime': 'lime juice', 'a Lemon': 'lemon juice', 'Orange': 'orange juice', 'a small orange': 'orange juice'}
# Converts an amount to a scale ('12' is treated as a '1/2' that lost its slash)
scales = {'1': 1, '1/2': 0.5, '1/4': 0.25, '2': 2, '12': 0.5}

fruit_converted = []

for recipe in cleaned:
    converted = {"name": recipe["name"], "instructions": recipe["instructions"], "ingredients": []}
    for ing in recipe['ingredients']:
        tmp_ingred = {}

        if type(ing) == str:
            if re.match("[0-9/]* Juice of", ing):
                tok = ing.split()
                idx = tok.index("Juice")
                amount = scales[" ".join(tok[:idx])]
                fruit_amount = juice_amts[" ".join(tok[idx+2:])]
                fruit_name = juice_names[" ".join(tok[idx+2:])]
                tmp_ingred["ingred_amount"] = str(amount * fruit_amount)
                tmp_ingred["ingred_unit"] = "Ounce(s)"
                tmp_ingred["ingred_name"] = fruit_name
                converted["ingredients"].append(tmp_ingred)
            else:
                converted["ingredients"].append(ing)
        else:
            converted["ingredients"].append(ing)
    fruit_converted.append(converted)



In [132]:
with open("./spiders/data/mr_boston_cleaned.json", 'w') as outfile:
    json.dump(fruit_converted, outfile, indent=4)

What I really should be doing here is implementing the handling of ingredients as a filter: a function that takes an ingredient and returns whatever should replace it. If the filter doesn't know what to do with the input, it returns the input unchanged.

Then I can have a second recipe iterator that just takes a file and an ingredient filter and outputs the new JSON for the cleaned up version. Then I can implement a set of filters for e.g. ingredients, call them successively on the file, and get the results. 

In [1]:
import copy
def recipe_processor(infile, ingredient_filter):
    processed = []
    with open(infile, 'r') as recipes:
        recipe_data = json.load(recipes)
        for r in recipe_data:
            # Skip anything that doesn't have ingredients
            if "ingredients" not in r.keys():
                continue
            r_clean = copy.deepcopy(r)
            r_clean["ingredients"] = []
            for i in r['ingredients']:
                clean_i = ingredient_filter(i)
                # A filter returns None to delete an ingredient
                if clean_i is not None:
                    r_clean["ingredients"].append(clean_i)
            processed.append(r_clean)
    return processed

In [155]:
def ing_printer(ingredient):
    print(ingredient)
    # Doesn't really do anything to the data
    return ingredient
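
Calling filters successively on a file means one pass (and one intermediate file) per filter. One alternative is to combine the filters into a single function first. This is only a sketch: `chain_filters`, `drop_ice`, and `collapse_spaces` are made-up names for illustration. It keeps the convention that a filter returns `None` to delete an ingredient.

```python
def chain_filters(*filters):
    """Combine several ingredient filters into one.

    Filters run left to right. If any of them returns None (meaning
    "delete this ingredient"), the rest are skipped and None is returned.
    """
    def chained(ingredient):
        for f in filters:
            ingredient = f(ingredient)
            if ingredient is None:
                return None
        return ingredient
    return chained


# Two toy filters, made up for this example
def drop_ice(ingredient):
    if isinstance(ingredient, str) and ingredient.strip().lower() == "ice":
        return None
    return ingredient


def collapse_spaces(ingredient):
    if isinstance(ingredient, str):
        return " ".join(ingredient.split())
    return ingredient


clean = chain_filters(collapse_spaces, drop_ice)
print(clean("2  oz   gin"))  # 2 oz gin
print(clean(" Ice "))        # None
```

The combined function has the same shape as any single filter, so it can be passed straight to `recipe_processor` as the `ingredient_filter` argument.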

In [2]:
units = {}
def unit_finder(ingredient):
    match = re.search("[0-9.-]+( )+([^ ]*)", ingredient)
    if match:
        unit = lemma.lemmatize(match[2]).lower()
        update(units, unit)
    return ingredient

In [20]:
_ = recipe_processor("./data/raw/beth_skwarecki_cocktails.json", unit_finder)
print(f"There are {len(units.keys())} things that could be ingredients") 
print(sorted([[k, v] for k, v in units.items()], key = lambda x: x[1], reverse=True))

There are 82 things that could be ingredients
[['ounce', 2105], ['1/2', 362], ['teaspoon', 304], ['dash', 86], ['tablespoon', 59], ['egg', 42], ['lemon', 36], ['cup', 32], ['1/4', 30], ['lime', 26], ['to', 24], ['scoop', 19], ['whole', 10], ['sugar', 10], ['mint', 9], ['or', 9], ['small', 8], ['orange', 7], ['part', 7], ['drop', 7], ['slice', 6], ['banana', 6], ['ice', 5], ['1/3', 5], ['1/8', 5], ['3/4', 5], ['splash', 5], ['cinnamon', 5], ['fresh', 4], ['peach', 4], ['bottle', 4], ['ripe', 3], ['quart', 3], ['fifth', 3], ['dry', 2], ['medium', 2], ['red', 2], ['chilled', 2], ['large', 2], ['brandied', 2], ['of', 2], ['sliced', 2], ['pint', 2], ['crushed', 2], ['can', 2], ['ripened', 1], ['peeled', 1], ['if', 1], ['liter', 1], ['46-ounce', 1], ['5-ounce', 1], ['blue', 1], ['champagne', 1], ['sweet', 1], ['unsweetened', 1], ['sprig', 1], ['pear', 1], ['pitted', 1], ['almond', 1], ['paper-thin', 1], ['raspberry', 1], ['milky', 1], ['starlight', 1], ['campari', 1], ['cherry', 1], ['rum-so

Oz is far and away the leader. There's also a set of four "oz)", but that should be straightforward to deal with.

Dash is a contender. "Pc" is "pcs", but lemmatized, which is actually pretty brilliant of it. 

There are three spoons in here, "bsp" for barspoon, "tbsp" for tablespoon, and "tsp" for teaspoon. 

The units that are actually units are:
```python
['oz', 'drop', 'pc', 'dash', 'cup', 'tsp', 'tbsp', 'oz)', 'pinch', 'bsp', 'splash', 'ounce', 'shot', 'spoon', 'cl', 'ml']
```
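
Several of these are the same unit under different spellings ("oz", "oz)", and "ounce", or the three abbreviated spoons). A small sketch of folding them together, assuming stray punctuation can just be stripped. `UNIT_ALIASES` and `canonical_unit` are hypothetical names, and which spelling counts as canonical is an arbitrary choice.

```python
# Map alternate spellings to one canonical unit. Which name is canonical
# is an arbitrary choice for this sketch.
UNIT_ALIASES = {
    "ounce": "oz",
    "bsp": "barspoon",
    "tbsp": "tablespoon",
    "tsp": "teaspoon",
}


def canonical_unit(unit):
    """Lowercase a unit, drop stray punctuation, and resolve known aliases."""
    unit = unit.lower().strip("().,")
    return UNIT_ALIASES.get(unit, unit)


for raw in ["oz", "oz)", "ounce", "Tbsp", "dash"]:
    print(raw, "->", canonical_unit(raw))
```

Unknown units fall through unchanged, so new spellings show up in the counts instead of silently disappearing.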

In [232]:
unit_list = ['oz', 'drop', 'pc', 'dash', 'cup', 'tsp', 'tbsp', 'oz)', 'pinch', 'bsp', 'splash', 'ounce', 'shot', 'spoon', 'cl', 'ml']

def ingredient_parser(ingredient):
    match = re.search(r"([0-9.-]+)( )+(\S*) (.*)", ingredient)
    if match:
        unit = lemma.lemmatize(match[3]).lower()
        if unit in unit_list:
            tmp_i = {}
            tmp_i["ingred_amount"] = match[1]
            tmp_i["ingred_unit"] = unit
            tmp_i["ingred_name"] = match[4].strip()
            return tmp_i
        else:
            return ingredient
    else:
        return ingredient

In [233]:
cleaned = recipe_processor("./spiders/data/cocktail_society.json", ingredient_parser)
with open("./spiders/data/cocktail_society_cleaned.json", 'w') as outfile:
    json.dump(cleaned, outfile, indent=4)

In [None]:
# Convert the data from Beth Skwarecki into a JSON doc 
# Her formatting is tidy enough that it's basically a JSON doc already. 

ingredients = []
instructions = []
name = None
all_cocktails = []
with open("spiders/data/cocktails.txt", 'r') as infile:
    for line in infile:
        if line.startswith("##"):
            continue #Comment, ignore it
        if line.startswith("==="):
            if(len(ingredients) > 0):
                # We're done with one cocktail and onto the next
                all_cocktails.append({'name':name, 'ingredients':ingredients, 'instructions': instructions})
                # Reset everything
                ingredients = []
                instructions = []
            name = line.strip('= \n').title()
        if line.startswith(' *'):
            ingredients.append(line.strip('* \n'))
        if line.startswith(' -'):
            instructions.append(line.strip('- \n'))
    # Flush the last one
    all_cocktails.append({'name':name, 'ingredients':ingredients, 'instructions': instructions})

with open("spiders/data/beth_skwarecki_cocktails.json", 'w') as outfile:
    json.dump(all_cocktails, outfile, indent=4)

ignored comment
ignored comment


In [248]:
# Convert to properly formatted JSON
recipes = []
name = 'liquor_com_first_run.json'
with open(f'spiders/data/{name}', 'r') as infile:
    for line in infile:
        recipes.append(json.loads(line))
with open(f'data/raw/{name}', 'w') as outfile:
    json.dump(recipes, outfile, indent=4)

In [None]:
with open('spiders/data/themixer_drinks.json', 'r') as infile:
    data = json.load(infile)
    with open('data/raw/themixer_drinks.json', 'w') as outfile:
        json.dump(data, outfile, indent=4)

In [7]:
# making sure the Kindred spider isn't saving dupe cocktails
import json
names = set()
with open("spiders/kindred.json", 'r') as infile:
    data = json.load(infile)
    print(f"Got {len(data)} in file")
    for recipe in data:
        if recipe['name'] in names:
            print(f'{recipe["name"]} is not unique')
        names.add(recipe['name'])
    print(f"Got {len(names)} unique")

Got 8439 in file
Last Caress is not unique
Martinez is not unique
Adair Hook is not unique
Brooklyn Cocktail is not unique
Lion's Tail is not unique
Negroni is not unique
Aviation Cocktail is not unique
Ramona Flowers is not unique
Penicillin is not unique
Honeymusk is not unique
Honeymusk is not unique
Insanely Good Gin & Tonic is not unique
Arracknaphobia is not unique
Last Word is not unique
Martinez is not unique
Ramona Flowers is not unique
Penicillin is not unique
Adair Hook is not unique
Bitter Elder is not unique
White Negroni is not unique
Colonel Carpano is not unique
Aviation Cocktail is not unique
Insanely Good Gin & Tonic is not unique
Negroni is not unique
Ramos Gin Fizz is not unique
Bitter Elder is not unique
Last Word is not unique
Colonel Carpano is not unique
Maple Leaf is not unique
White Negroni is not unique
Lion's Tail is not unique
John the Baptist is not unique
Brooklyn Cocktail is not unique
Jasmine is not unique
The Riviera is not unique
John the Baptist is n
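
Since repeated names clearly do happen, this is roughly how the per-name IDs from the goals at the top (Martini-01, etc.) could be assigned. It's a sketch only: the `id` key and the `assign_ids` helper are hypothetical, and the sample list is made up rather than taken from kindred.json.

```python
from collections import defaultdict


def assign_ids(recipes):
    """Give every recipe an 'id' like 'Martinez-01', counting per name."""
    seen = defaultdict(int)
    for recipe in recipes:
        seen[recipe["name"]] += 1
        recipe["id"] = f'{recipe["name"]}-{seen[recipe["name"]]:02d}'
    return recipes


# Made-up sample with one repeated name
sample = [{"name": "Martinez"}, {"name": "Negroni"}, {"name": "Martinez"}]
for r in assign_ids(sample):
    print(r["id"])
```

Numbering follows file order, so the IDs are only stable as long as the input file's ordering is.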

In [None]:
with open("data/raw/cocktaillove.json", 'r') as infile:
    data = json.load(infile)

    with open('data/raw/cocktaillove_unesc.json', 'w') as outfile:
        json.dump(data, outfile, indent=4, ensure_ascii=False)

In [7]:
import json
import re
_ = recipe_processor("./data/raw/cocktaillove.json", unit_finder)
print(f"There are {len(units.keys())} things that could be ingredients") 
print(sorted([[k, v] for k, v in units.items()], key = lambda x: x[1], reverse=True))

There are 52 things that could be ingredients
[['oz', 1793], ['dash', 278], ['tsp', 182], ['orange', 65], ['lemon', 60], ['mint', 47], ['lime', 45], ['brandied', 25], ['grapefruit', 22], ['egg', 19], ['strawberry', 18], ['cucumber', 17], ['rinse', 15], ['pinch', 12], ['raspberry', 10], ['apple', 9], ['leaf', 8], ['cherry', 7], ['drop', 6], ['fuji', 6], ['pineapple', 6], ['cinnamon', 6], ['blackberry', 5], ['white', 4], ['slice', 4], ['curry', 4], ['bartlett', 4], ['nutmeg', 3], ['candied', 3], ['c', 3], ['granny', 2], ['celery', 2], ['fee', 2], ['coffee', 2], ['green', 2], ['cilantro', 2], ['cardamom', 2], ['anjou', 2], ['thai', 2], ['kaffir', 2], ['tangerine', 1], ['dried', 1], ['basil', 1], ['nectarine', 1], ['braeburn', 1], ['death', 1], ['dark', 1], ['luxardo', 1], ['peach', 1], ['ground', 1], ['thyme', 1], ['angostura', 1]]


In [9]:
unit_list = ['oz', 'dash', 'tsp', 'pinch', 'drop', 'slice', 'c']

def ingredient_parser(ingredient):
    match = re.search(r"([0-9.-]+)( )+(\S*) (.*)", ingredient)
    if match:
        unit = lemma.lemmatize(match[3]).lower()
        if unit in unit_list:
            tmp_i = {}
            tmp_i["ingred_amount"] = match[1]
            tmp_i["ingred_unit"] = unit
            tmp_i["ingred_name"] = match[4].strip()
            return tmp_i
        else:
            return ingredient
    else:
        return ingredient

In [10]:
cleaned = recipe_processor("./data/raw/cocktaillove.json", ingredient_parser)
with open("./data/cleaned/cocktaillove.json", 'w') as outfile:
    json.dump(cleaned, outfile, indent=4)