This notebook is a cleaned-up version of the work to split the units out of the ingredients in a cocktail data set. The data sets are JSON, and each cocktail has a name, a set of ingredients, and a set of instructions. Each ingredient has a name, an amount, and a unit. The instructions are just a list of strings.

In [38]:
import re
import nltk
import json
import copy
from nltk.stem.wordnet import WordNetLemmatizer
lemma = WordNetLemmatizer()

# Convenience function for keeping a count of strings, lemmatized
# to be the base (so teaspoons becomes teaspoon)
def update_lemma(d, k):
    k = k.strip(',')
    k = lemma.lemmatize(k)
    if k in d.keys():
        d[k] += 1
    else:
        d[k] = 1

# Runs through an entire file and applies a filter to each ingredient
# Filters return the new JSON representation for the ingredient, or 
# None if they decided to delete the ingredient. 
def recipe_processor(infile, ingredient_filter):
    processed = []
    with open(infile, 'r') as recipes:
        recipe_data = json.load(recipes)
        for r in recipe_data:
            # Skip anything that doesn't have ingredients
            if "ingredients" not in r.keys():
                continue
            r_clean = copy.deepcopy(r)
            r_clean["ingredients"] = []
            for i in r['ingredients']:
                clean_i = ingredient_filter(i)
                if clean_i is not None:
                    # This is how we handle deletion of an ingredient
                    r_clean["ingredients"].append(clean_i)
            processed.append(r_clean)
    return processed

It's probably better to parse things that look like amounts into a numeric representation first, and then handle unit detection. The regexes in the previous implementation of `unit_finder()` were intended to do that, but missed two instances of '1/2'.

First, let's look at the first two tokens of each ingredient. There will be a lot that are a number followed by a fraction, a lot that are a number followed by something else, and a few that are a number followed by "to".
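The three patterns above can also be told apart with a single regular expression. This is only a minimal sketch, not the approach used in the cells below, which handle each pattern in its own pass. `parse_leading_amount` and `AMOUNT_RE` are hypothetical names, and the sketch assumes the amount is always followed by whitespace.

```python
import re
from fractions import Fraction

# Hypothetical pattern: a mixed number ("1 1/2"), a bare fraction ("1/2"),
# or a whole number ("2") at the start of the string, followed by whitespace.
AMOUNT_RE = re.compile(r"^(?:(\d+) (\d+)/(\d+)|(\d+)/(\d+)|(\d+))\s+")

def parse_leading_amount(text):
    """Return (amount, rest), or (None, text) if no amount leads the string."""
    m = AMOUNT_RE.match(text)
    if m is None:
        return None, text
    whole, num, den, bare_num, bare_den, integer = m.groups()
    if whole is not None:
        amount = int(whole) + Fraction(int(num), int(den))
    elif bare_num is not None:
        amount = Fraction(int(bare_num), int(bare_den))
    else:
        amount = Fraction(int(integer))
    # Convert to float at the end so the result matches the decimal amounts
    # stored elsewhere in this notebook.
    return float(amount), text[m.end():]

print(parse_leading_amount("1 1/2 ounces gin"))      # (1.5, 'ounces gin')
print(parse_leading_amount("1/2 ounce lime juice"))  # (0.5, 'ounce lime juice')
print(parse_leading_amount("2 dashes bitters"))      # (2.0, 'dashes bitters')
print(parse_leading_amount("Club soda"))             # (None, 'Club soda')
```

A string like "1 to 2 dashes bitters" comes back as `(1.0, 'to 2 dashes bitters')`, so the "to" case still needs its own handling.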

In [32]:
# Given a string ingredient that starts with a mixed number (e.g. "1 1/2"),
# convert that amount into a decimal value and replace the ingredient.
def fraction_handler(i):
    if type(i) is str:
        frac_match = re.match("([0-9]+) ([0-9]+)/([0-9]+)", i)
        if frac_match:
            amount = int(frac_match[1]) + float(frac_match[2])/float(frac_match[3])
            return {'ingred_amount': amount, 'ingred_name':i[len(frac_match[0])+1:]}
    return i
            
fracs = recipe_processor("./data/raw/beth_skwarecki_cocktails.json", fraction_handler)

with open ("./data/cleaned/beth_skwarecki_cocktails.json", 'w') as outfile:
    json.dump(fracs, outfile, indent=4)

In [33]:
# For fractions with no integer part
def fraction_handler(i):
    if type(i) is str:
        frac_match = re.match("([0-9]+)/([0-9]+)", i)
        if frac_match:
            amount = float(frac_match[1])/float(frac_match[2])
            return {'ingred_amount': amount, 'ingred_name':i[len(frac_match[0])+1:]}
    return i
            
fracs = recipe_processor("./data/cleaned/beth_skwarecki_cocktails.json", fraction_handler)

with open ("./data/cleaned/beth_skwarecki_cocktails.json", 'w') as outfile:
    json.dump(fracs, outfile, indent=4)

In [34]:
# And then whole numbers, but this does miss instances of "to"
def int_handler(i):
    if type(i) is str:
        frac_match = re.match("([0-9]+) ", i)
        if frac_match:
            amount = float(frac_match[1])
            return {'ingred_amount': amount, 'ingred_name':i[len(frac_match[0]):]}
    return i
            
fracs = recipe_processor("./data/cleaned/beth_skwarecki_cocktails.json", int_handler)

with open ("./data/cleaned/beth_skwarecki_cocktails.json", 'w') as outfile:
    json.dump(fracs, outfile, indent=4)

The file now has ingredients broken into amounts and names. The names could include amounts, so first let's handle all the ingredients that have remained as strings.

In [41]:
# Get a count of the first word of every ingredient that is still a string, which could be a unit
units = {}

def unit_finder(i):
    if type(i) is str:
        tok = i.split()
        update_lemma(units, tok[0].lower())
    # Doesn't actually do anything to the recipe
    return i

In [42]:
_ = recipe_processor("./data/cleaned/beth_skwarecki_cocktails.json", unit_finder)
print(f"There are {len(units.keys())} things that could be units") 
print(sorted([[k, v] for k, v in units.items()], key = lambda x: x[1], reverse=True))

There are 104 things that could be units
[['dash', 154], ['splash', 75], ['juice', 71], ['club', 48], ['chilled', 41], ['champagne', 25], ['one', 21], ['sugar', 19], ['ginger', 18], ['orange', 9], ['salt', 9], ['lime', 7], ['hot', 7], ['pinch', 6], ['lemon-lime', 4], ['iced', 4], ['tonic', 4], ['drop', 3], ['mint', 3], ['sprinkle', 3], ['tabasco', 3], ['coffee', 3], ['squeeze', 3], ['cola', 3], ['cold', 3], ['ice', 2], ['crushed', 2], ['(this', 2], ['sparkling', 2], ['powdered', 2], ['several', 2], ['grapefruit', 2], ['raspberry', 2], ['peach', 2], ['cranberry', 2], ['lemon', 2], ['grenadine', 2], ['note:', 2], ['making', 2], ['lemonade', 2], ['dab', 1], ['tomato', 1], ['ounce', 1], ['whipping', 1], ['apple', 1], ['with', 1], ['build', 1], ['bunch', 1], ['12-ounce', 1], ['bloody', 1], ['below.)', 1], ['peel', 1], ['thin', 1], ['centrate', 1], ['thawed', 1], ['slice', 1], ['apple;', 1], ['milk', 1], ['half', 1], ['burgundy', 1], ['claret', 1], ['pineapple-grapefruit', 1], ['fresh', 1], 

There are a lot of things, but let's see what they are in the actual file. "Dash" and "splash" are both single dashes or splashes of things, so those can have the unit set to dash or splash, and the amount set to 1. I have the feeling that a dash is bigger than a splash, but this is purely vibes. "Drop" is, of course, smaller than both, and is a drop.

"Juice" is lines of the form "Juice of N fruits", which is fairly annoying because fruits are not standardized. I'll write a little lookup table to deal with the amount and the fruit; it's basically multiplying the count of fruits by the expected juice in the fruit, which you can look up online.

"Club" is club soda. This one is tricky, because it's "to fill", which depends on the amount of other stuff and the glassware you use. Again, there is a heuristic way to handle this, "to fill" = (glass volume - ingredients), so I probably just need a utility function that can do the conversion. Interestingly, Collins and rocks glasses are the same volume, but, like Bert and Ernie, one is taller and thinner, the other is shorter and rounder.

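The "to fill" heuristic for club soda could look something like the sketch below. `to_fill_ounces` is a hypothetical helper, not something implemented in this notebook. It assumes the caller supplies the glass volume, counts only ingredients already measured in ounces, and ignores the space taken up by ice. The 10-ounce glass in the example is an arbitrary illustrative value.

```python
# Hypothetical helper: estimate a "to fill" amount in ounces.
# glass_volume_oz is supplied by the caller; no glass sizes are assumed here.
def to_fill_ounces(glass_volume_oz, ingredients):
    """Glass volume minus the measured ingredients, never below zero."""
    used = 0.0
    for ing in ingredients:
        # Only dict ingredients measured in ounces are counted; strings and
        # other units (dash, part, ...) are ignored.
        if isinstance(ing, dict) and ing.get("ingred_unit") == "ounce":
            used += ing["ingred_amount"]
    return max(0.0, glass_volume_oz - used)

example = [
    {"ingred_amount": 2.0, "ingred_unit": "ounce", "ingred_name": "gin"},
    {"ingred_amount": 0.75, "ingred_unit": "ounce", "ingred_name": "lemon juice"},
    "Club soda",
]
print(to_fill_ounces(10.0, example))  # 7.25
```
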
"One" is in cases of "One part ingredient", which will get its own fix. 

In [43]:
one_units = ['dash','splash','pinch', 'drop', 'sprinkle', 'squeeze', 'dab']

def one_unitize(i):
    if type(i) is str:
        tok = i.split()
        unit = lemma.lemmatize(tok[0].lower())
        if unit in one_units:
            return {'ingred_amount': 1, 'ingred_name': " ".join(tok[1:]), 'ingred_unit':unit}
    return i

one_unitized = recipe_processor("./data/cleaned/beth_skwarecki_cocktails.json", one_unitize)
with open ("./data/cleaned/beth_skwarecki_cocktails.json", 'w') as outfile:
    json.dump(one_unitized, outfile, indent=4)

At this point, we have a lot of the amounts converted to numbers, but a bunch of the ingredients don't have their units separated from the ingredient itself, so let's look at the first word of the ingredients that have numbers but don't have units.

In [44]:
units = {}
def count_units(i):
    if type(i) is not str:
        if "ingred_unit" not in i.keys():
            update_lemma(units, i['ingred_name'].split()[0])

_ = recipe_processor("./data/cleaned/beth_skwarecki_cocktails.json", count_units)
print(f"There are {len(units.keys())} things that could be units") 
print(sorted([[k, v] for k, v in units.items()], key = lambda x: x[1], reverse=True))

There are 73 things that could be units
[['ounce', 2498], ['teaspoon', 313], ['dash', 86], ['tablespoon', 60], ['egg', 42], ['cup', 35], ['to', 24], ['scoop', 19], ['whole', 10], ['sugar', 10], ['mint', 9], ['or', 9], ['small', 7], ['part', 7], ['slice', 6], ['drop', 6], ['banana', 6], ['ice', 5], ['splash', 5], ['cinnamon', 5], ['lemon', 4], ['peach', 4], ['bottle', 4], ['ripe', 3], ['quart', 3], ['fifth', 3], ['lime', 3], ['fresh', 3], ['dry', 2], ['medium', 2], ['red', 2], ['chilled', 2], ['large', 2], ['brandied', 2], ['of', 2], ['sliced', 2], ['pint', 2], ['crushed', 2], ['can', 2], ['ripened', 1], ['peeled', 1], ['liter', 1], ['46-ounce', 1], ['5-ounce', 1], ['blue', 1], ['champagne', 1], ['sweet', 1], ['unsweetened', 1], ['sprig', 1], ['pear', 1], ['pitted', 1], ['almond', 1], ['paper-thin', 1], ['raspberry', 1], ['Milky', 1], ['Starlight', 1], ['Campari', 1], ['cherry', 1], ['rum-soaked', 1], ['littleneck', 1], ['raisin', 1], ['peanut', 1], ['wineglass', 1], ['thin', 1], ['lump

The ones that are actually units are:
```python
['ounce', 'teaspoon', 'dash', 'tablespoon', 'cup', 'scoop', 'part', 'drop', 'splash', 'bottle', 'quart', 'fifth', 'pint', 'can', 'liter', 'sprig', 'pound']
```

"To" is a special case that I'll address later.

In [None]:
units = ['ounce', 'teaspoon', 'dash', 'tablespoon', 'cup', 'scoop', 'part', 'drop', 'splash', 'bottle', 'quart', 'fifth', 'pint', 'can', 'liter', 'sprig', 'pound']

def replace_units(i):
    if type(i) is not str:
        if "ingred_unit" not in i.keys():
            p_unit = i['ingred_name'].split()[0]
            l_unit = lemma.lemmatize(p_unit.lower())
            if l_unit in units:
                i["ingred_unit"] = l_unit
                # Remove the unit, then strip the space it leaves behind
                i["ingred_name"] = i["ingred_name"].replace(p_unit, '', 1).strip()
    return i

unitized = recipe_processor("./data/cleaned/beth_skwarecki_cocktails.json", replace_units)
with open ("./data/cleaned/beth_skwarecki_cocktails.json", 'w') as outfile:
    json.dump(unitized, outfile, indent=4)

In [46]:
units = {}
def count_units(i):
    if type(i) is not str:
        if "ingred_unit" not in i.keys():
            update_lemma(units, i['ingred_name'].split()[0])

_ = recipe_processor("./data/cleaned/beth_skwarecki_cocktails.json", count_units)
print(f"There are {len(units.keys())} things that could be units") 
print(sorted([[k, v] for k, v in units.items()], key = lambda x: x[1], reverse=True))

There are 56 things that could be units
[['egg', 42], ['to', 24], ['whole', 10], ['sugar', 10], ['mint', 9], ['or', 9], ['small', 7], ['slice', 6], ['banana', 6], ['ice', 5], ['cinnamon', 5], ['lemon', 4], ['peach', 4], ['ripe', 3], ['lime', 3], ['fresh', 3], ['dry', 2], ['medium', 2], ['red', 2], ['chilled', 2], ['large', 2], ['brandied', 2], ['of', 2], ['sliced', 2], ['crushed', 2], ['ripened', 1], ['peeled', 1], ['46-ounce', 1], ['5-ounce', 1], ['blue', 1], ['champagne', 1], ['sweet', 1], ['unsweetened', 1], ['pear', 1], ['pitted', 1], ['almond', 1], ['paper-thin', 1], ['raspberry', 1], ['Milky', 1], ['Starlight', 1], ['Campari', 1], ['cherry', 1], ['rum-soaked', 1], ['littleneck', 1], ['raisin', 1], ['peanut', 1], ['wineglass', 1], ['thin', 1], ['lump', 1], ['gallon', 1], ['.Garnish', 1], ['peel', 1], ['curacao', 1], ['Garnish', 1], ['inch', 1], ['grapefruit', 1]]


Ok, with that count of things that could be units rerun, let's look at the other cases:
- "To", as in "to N", which is to say, a case where there was some variation. I'm content to drop this, because it is usually "1 to 2 dashes", and if you want more of it, you know to add more dashes.
- "Or", as in "or N", but this is basically the same as the previous case.
- "Of", as in "of ingredient".

These are all simple to find/replace with a text editor, so I didn't bother doing anything clever to them. 
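For reference, the same cleanup could be scripted instead of done in a text editor. The sketch below is only an assumption about how that might look; the actual fix here was done by hand. `strip_range_words` and `LEADING_RE` are hypothetical names, and the filter only touches ingredients that have already been split into dicts.

```python
import re

# Hypothetical pattern: a leading "to N" or "or N" (N a whole number or a
# simple fraction), or a leading "of", followed by whitespace.
LEADING_RE = re.compile(r"^\s*(?:(?:to|or)\s+\d+(?:/\d+)?|of)\s+", re.IGNORECASE)

def strip_range_words(i):
    if isinstance(i, dict) and "ingred_name" in i:
        cleaned = dict(i)  # copy so the input is left untouched
        cleaned["ingred_name"] = LEADING_RE.sub("", i["ingred_name"], count=1)
        return cleaned
    return i

print(strip_range_words({"ingred_amount": 1.0, "ingred_name": "to 2 dashes bitters"}))
# {'ingred_amount': 1.0, 'ingred_name': 'dashes bitters'}
```

Names such as "orange juice" or "tonic water" are left alone, because the pattern requires whitespace right after "to", "or", or "of". A filter like this could be passed to `recipe_processor`, after which `replace_units` would need another run to pick up the unit that is now at the front of the name.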

In [71]:
units = {}

def update(d, k):
    if k in d.keys():
        d[k] += 1
    else:
        d[k] = 1

def remaining_str(i):
    if type(i) is str:
        update(units, i)
    return i

_ = recipe_processor("./data/cleaned/beth_skwarecki_cocktails.json", remaining_str)
print(sorted([[k, v] for k, v in units.items()], key = lambda x: x[1], reverse=True))

[['Club soda', 44], ['Chilled champagne', 33], ['Champagne', 25], ['Ginger ale', 17], ['Sugar', 16], ['Orange juice', 6], ['Hot coffee', 6], ['Salt', 5], ['Lime wedge', 4], ['Lemon-lime soda', 4], ['Tonic water', 4], ['Salt and pepper to taste', 3], ['Sugar to taste', 3], ['Iced club soda', 3], ['Chilled brut champagne', 3], ['Orange peel', 3], ['Coffee', 3], ['Crushed ice', 2], ['Mint sprigs', 2], ['Grapefruit juice', 2], ['Club Soda', 2], ['Lime slice', 2], ['Peach slice', 2], ['Cola soda', 2], ['Cranberry juice', 2], ['Lemon slice', 2], ['Grenadine', 2], ['Note: Keep all ingredients refrigerated in advance of', 2], ['making this drink.', 2], ['Lemonade', 2], ['Cold club soda', 2], ['Ice cubes to fill blender', 1], ['Tomato juice', 1], ['(This drink can be made frothier by adding 1 1/2', 1], ['ounces heavy whipping cream. If adding heavy', 1], ['whipping cream, use a goblet as glassware.)', 1], ['Apple juice', 1], ['with this tangy one.', 1], ['Build in a collins glass with ice.', 1]

What remains is the "Juice of whatever" lines, and some "One part" ratio stuff.

In [62]:
for k, count in units.items():
    if k.startswith("Juice of"):
        print(f"{k}\t{count}")

Juice of 2 oranges	1
Juice of 2 lemons	1
Juice of 12 lemons	1
Juice of 1/2 lemon	19
Juice of 1/2 lime	12
Juice of 1 lime	9
Juice of 1 orange	4
Juice of 1/4 lemon	6
Juice of 1/4 orange	3
Juice of 1 lemon	6
Juice of 2 limes	2
Juice of 1 passion fruit	1
Juice of 1/4 lime	1


Ok, this is not as consistent as I would have hoped, so I'm going to go through, pick out the less consistent ones, and normalize them by hand.

In [66]:
juice_amts = {'lime': 1, 'lemon': 1.5, 'orange': 2.5, 'passionfruit': 1}

# The passionfruit is pretty much a guess based on looking at how to juice them online

def juice_normalize(i):
    if type(i) is str and i.startswith("Juice of"):
        tok = i.split()
        amount = tok[2]
        fruit = lemma.lemmatize(tok[3])
        total = None
        if "/" in amount:
            a, b = amount.split("/")
            total = float(a)/float(b) * juice_amts[fruit]
        else:
            total = float(amount) * juice_amts[fruit]
        print(f"{total} ounces {fruit} juice")
        return {"ingred_name": f"{fruit} juice", "ingred_unit": "ounce", "ingred_amount": total}
    else:
        return i

data = recipe_processor("./data/cleaned/beth_skwarecki_cocktails.json", juice_normalize)
with open ("./data/cleaned/beth_skwarecki_cocktails.json", 'w') as outfile:
    json.dump(data, outfile, indent=4)

5.0 ounces orange juice
3.0 ounces lemon juice
18.0 ounces lemon juice
0.75 ounces lemon juice
0.5 ounces lime juice
1.0 ounces lime juice
2.5 ounces orange juice
0.75 ounces lemon juice
0.75 ounces lemon juice
0.375 ounces lemon juice
0.625 ounces orange juice
0.625 ounces orange juice
1.0 ounces lime juice
0.5 ounces lime juice
0.75 ounces lemon juice
0.375 ounces lemon juice
0.625 ounces orange juice
0.75 ounces lemon juice
0.75 ounces lemon juice
0.75 ounces lemon juice
0.5 ounces lime juice
0.75 ounces lemon juice
0.75 ounces lemon juice
0.5 ounces lime juice
0.75 ounces lemon juice
0.75 ounces lemon juice
0.375 ounces lemon juice
0.5 ounces lime juice
0.75 ounces lemon juice
1.5 ounces lemon juice
1.5 ounces lemon juice
1.0 ounces lime juice
0.5 ounces lime juice
2.0 ounces lime juice
1.5 ounces lemon juice
0.5 ounces lime juice
0.75 ounces lemon juice
0.75 ounces lemon juice
1.0 ounces lime juice
0.5 ounces lime juice
1.0 ounces lime juice
1.0 ounces lime juice
2.5 ounces orange

In [70]:
# Handle "parts" cocktails
def unpart(i):
    if type(i) is str and i.startswith("One part"):
        tok = i.split()
        amount = 1
        unit = "part"
        ingred = " ".join(tok[2:])
        return {"ingred_unit": unit, "ingred_amount": amount, "ingred_name": ingred}
    else:
        return i
    
data  = recipe_processor("./data/cleaned/beth_skwarecki_cocktails.json", unpart)
with open ("./data/cleaned/beth_skwarecki_cocktails.json", 'w') as outfile:
    json.dump(data, outfile, indent=4)

At this point, I think I'm pretty much done, in that everything that has a specified amount has been split into an amount, a unit, and a name. I do still have to go through and clean up places where there are things that are not ingredients in the ingredients list, but that's straightforward.
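One possible starting point for that last cleanup is a filter that returns `None`, which `recipe_processor` treats as a deletion. The sketch below is a guess based on the leftover strings printed earlier, not the cleanup that was actually done. `drop_notes` is a hypothetical name, and the rules (starts with "Note:" or "(", or ends with a period) are assumptions that would need checking against the file.

```python
# Hypothetical filter: return None for strings that look like notes rather
# than ingredients, which recipe_processor treats as a deletion.
def drop_notes(i):
    if isinstance(i, str):
        text = i.strip()
        looks_like_note = (
            text.startswith(("Note:", "("))
            or text.endswith((".", ".)"))
        )
        if looks_like_note:
            return None
    return i

print(drop_notes("Note: Keep all ingredients refrigerated in advance of"))  # None
print(drop_notes("making this drink."))                                     # None
print(drop_notes("Club soda"))                                              # Club soda
```

These rules will miss wrapped lines that neither start nor end like a note, so the result still deserves a pass by eye.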