## Example: Recipe Database

These vectorized string operations become most useful in the process of cleaning up messy, real-world data.
Here I'll walk through an example of that, using an open recipe database compiled from various sources on the Web.
Our goal will be to parse the recipe data into ingredient lists, so we can quickly find a recipe based on some ingredients we have on hand.

The scripts used to compile this can be found at https://github.com/fictivekin/openrecipes, and the link to the current version of the database is found there as well.

As of Spring 2016, this database is about 30 MB, and can be downloaded and unzipped with these commands:

In [None]:
# !curl https://s3.amazonaws.com/openrecipes/20170107-061401-recipeitems.json.gz --output 20170107-061401-recipeitems.json.gz
# !gunzip 20170107-061401-recipeitems.json.gz

We assume you have used the cell above to get it on your home machine (or your Google Drive if you are working
in Google Colab, and we further assume you know how to read files from your Google Drive if you're on Google Colab).

The database is in JSON format, so we will try ``pd.read_json`` to read it in as a `pandas` `DataFrame`.

In [3]:
import pandas as pd
path = '/Users/gawron/Desktop/src/sphinx/python_for_ss_extras/colab_notebooks/'\
     'python-for-social-science/pandas/datasets/20170107-061401-recipeitems.json'
try:
    recipes = pd.read_json(path)
except ValueError as e:
    print("ValueError:", e)

ValueError: Trailing data


Oops! We get a ``ValueError`` mentioning that there is "trailing data."
Searching for the text of this error on the Internet, it seems that it's due to using a file in which *each line* is itself a valid JSON document, but the full file is not.
Let's check if this interpretation is true:

JMG: The underlying problem comes into focus if we try
to process a single line as JSON.
It turns out that a single JSON line does not produce the partial `DataFrame`
we want:

In [2]:
import json
with open(path,'r') as fh:
    line = fh.readline().strip()
json_dict = json.loads(line)

In [3]:
json_dict

{'_id': {'$oid': '5160756b96cc62079cc2db15'},
 'name': 'Drop Biscuits and Sausage Gravy',
 'ingredients': 'Biscuits\n3 cups All-purpose Flour\n2 Tablespoons Baking Powder\n1/2 teaspoon Salt\n1-1/2 stick (3/4 Cup) Cold Butter, Cut Into Pieces\n1-1/4 cup Butermilk\n SAUSAGE GRAVY\n1 pound Breakfast Sausage, Hot Or Mild\n1/3 cup All-purpose Flour\n4 cups Whole Milk\n1/2 teaspoon Seasoned Salt\n2 teaspoons Black Pepper, More To Taste',
 'url': 'http://thepioneerwoman.com/cooking/2013/03/drop-biscuits-and-sausage-gravy/',
 'image': 'http://static.thepioneerwoman.com/cooking/files/2013/03/bisgrav.jpg',
 'ts': {'$date': 1365276011104},
 'cookTime': 'PT30M',
 'source': 'thepioneerwoman',
 'recipeYield': '12',
 'datePublished': '2013-03-11',
 'prepTime': 'PT10M',
 'description': 'Late Saturday afternoon, after Marlboro Man had returned home with the soccer-playing girls, and I had returned home with the...'}

In [68]:
pd.DataFrame(json_dict)

Unnamed: 0,_id,name,ingredients,url,image,ts,cookTime,source,recipeYield,datePublished,prepTime,description
$oid,5160756b96cc62079cc2db15,Drop Biscuits and Sausage Gravy,Biscuits\n3 cups All-purpose Flour\n2 Tablespo...,http://thepioneerwoman.com/cooking/2013/03/dro...,http://static.thepioneerwoman.com/cooking/file...,,PT30M,thepioneerwoman,12,2013-03-11,PT10M,"Late Saturday afternoon, after Marlboro Man ha..."
$date,,Drop Biscuits and Sausage Gravy,Biscuits\n3 cups All-purpose Flour\n2 Tablespo...,http://thepioneerwoman.com/cooking/2013/03/dro...,http://static.thepioneerwoman.com/cooking/file...,1365276000000.0,PT30M,thepioneerwoman,12,2013-03-11,PT10M,"Late Saturday afternoon, after Marlboro Man ha..."


JMG: Note that the `DataFrame` built from this JSON dictionary has two rows, and the index is garbled.

This is due to a `pandas`
convention. A dictionary-valued entry is treated as a column whose row labels are the dictionary's keys, and scalar entries are repeated for every row label. So the "row" `A B {a: C} {b: D}` is read as an abbreviation for two rows,
differing only in their third and fourth column values and
in their index entries. In particular, `A B {a: C} {b: D}` becomes

```
a  A B C  NaN
b  A B NaN D
```
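The convention can be reproduced on a toy dictionary. This is a minimal sketch; the column names `x`, `y`, `p`, and `q` are made up for illustration and are not part of the recipe data.

```python
import pandas as pd

# Two scalar entries and two single-key dict entries, mirroring
# the shape of one recipe record (names here are hypothetical).
toy_record = {'x': 'A', 'y': 'B', 'p': {'a': 'C'}, 'q': {'b': 'D'}}

toy_df = pd.DataFrame(toy_record)

# The dict keys 'a' and 'b' become the row index; scalars are repeated
# on every row, and each dict column is NaN where its key is absent.
print(toy_df)
```

The same thing happens with the real record: the keys `$oid` and `$date` become index labels, and every other field is copied onto both rows.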

The rather awful solution
that worked was to read the file line by line as raw JSON and throw away the single-key dictionary wrappers (such as `{'$oid': ...}`), which were intended to provide typing information.

In [4]:
import json
def fix_dd(dd):
    """Replace single-key dict values such as {'$oid': x} with the bare value x."""
    to_do = []
    for key, val in dd.items():
        if not isinstance(val, dict):
            continue
        if len(val) == 1:
            # Unwrap the typing wrapper, keeping only its sole value
            to_do.append((key, next(iter(val.values()))))
        elif len(val) == 0:
            print('***Zero-keyed value found***')
        else:
            print('***Multi-keyed value found***')
    for key, new_val in to_do:
        # Overwrite key -> {k: v} to be key -> v
        dd[key] = new_val
    return dd

with open(path) as f:        
    recipes = pd.DataFrame(fix_dd(json.loads(line.strip())) for line in f)
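As a possible alternative, `pd.read_json` accepts a `lines=True` argument for files with one JSON document per line. The sketch below runs on a tiny in-memory stand-in for the file (the records and the `unwrap` helper are hypothetical), and assumes the wrapper dictionaries survive as plain `dict` objects in the resulting column, so they still need to be unwrapped afterwards.

```python
import io
import pandas as pd

# Two JSON documents, one per line, imitating the recipe file's layout.
toy_text = (
    '{"_id": {"$oid": "abc"}, "name": "Toast"}\n'
    '{"_id": {"$oid": "def"}, "name": "Tea"}\n'
)

# lines=True tells read_json to treat each line as a separate record.
toy_recipes = pd.read_json(io.StringIO(toy_text), lines=True)

def unwrap(val):
    """Return the sole value of a single-key dict; leave anything else alone."""
    if isinstance(val, dict) and len(val) == 1:
        return next(iter(val.values()))
    return val

toy_recipes['_id'] = toy_recipes['_id'].map(unwrap)
print(toy_recipes)
```

Whether this is preferable to the line-by-line loop depends on how many columns carry wrappers; the loop above handles every column in one pass.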

In [5]:
recipes.shape

(173278, 17)

We see there are about 173,000 recipes, and 17 columns.
Let's take a look at one row to see what we have:

In [12]:
# JMG modified output
recipes.iloc[0]

_id                                            5160756b96cc62079cc2db15
name                                    Drop Biscuits and Sausage Gravy
ingredients           Biscuits\n3 cups All-purpose Flour\n2 Tablespo...
url                   http://thepioneerwoman.com/cooking/2013/03/dro...
image                 http://static.thepioneerwoman.com/cooking/file...
ts                                                        1365276011104
cookTime                                                          PT30M
source                                                  thepioneerwoman
recipeYield                                                          12
datePublished                                                2013-03-11
prepTime                                                          PT10M
description           Late Saturday afternoon, after Marlboro Man ha...
totalTime                                                           NaN
creator                                                         

There is a lot of information there, but much of it is in a very messy form, as is typical of data scraped from the Web.
In particular, the ingredient list is in string format; we're going to have to carefully extract the information we're interested in.
Let's start by taking a closer look at the ingredients:

In [125]:
recipes.ingredients.str.len().describe()

count    173278.000000
mean        244.617926
std         146.705285
min           0.000000
25%         147.000000
50%         221.000000
75%         314.000000
max        9067.000000
Name: ingredients, dtype: float64

The ingredient lists average about 245 characters long, with a minimum of 0 and a maximum of just over 9,000 characters!

Just out of curiosity, let's see which recipe has the longest ingredient list:

In [6]:
import numpy as np
test = np.argmax(recipes.ingredients.str.len())
testname = recipes.name[test]
testname

'Carrot Pineapple Spice &amp; Brownie Layer Cake with Whipped Cream &amp; Cream Cheese Frosting and Marzipan Carrots'

That certainly looks like an involved recipe.
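Note that `np.argmax` returns a position, which matches the row label here only because `recipes` has the default integer index. A label-based variant is sketched below on toy data (the frame and its contents are made up), using `idxmax`, which returns the index label directly.

```python
import pandas as pd

# Toy frame with a non-default index, so position and label differ.
toy = pd.DataFrame(
    {'name': ['Toast', 'Layer Cake', 'Tea'],
     'ingredients': ['bread', 'flour\nsugar\neggs\nbutter', 'tea leaves']},
    index=[10, 20, 30],
)

# idxmax gives the label of the longest ingredient string.
longest_label = toy['ingredients'].str.len().idxmax()
print(toy.loc[longest_label, 'name'])
```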

#### Cleaning up the ingredients list

There is a problem with the longest ingredients list.

Here's what the first two lines of the longest ingredients list look like:

In [9]:
print(*recipes.ingredients[test].split('\n')[:2],sep='\n\n')

1 cup carrot juice2 cups (280 g) all purpose flour1/2 cup (65 g) almond meal or almond flour1 tablespoon baking powder1 teaspoon baking soda3/4 teaspoon fine sea salt1 teaspoon ground cinnamon1/2 teaspoon ground ginger1/2 teaspoon ground nutmeg1/4 teaspoon ground allspice1/4 teaspoon ground cloves1/4 teaspoon ground cardamom1/4 teaspoon fresh ground black pepper2 cups (400 g) white granulated sugar3/4 cup (175 mL) neutral flavored cooking oil (like sunflower)1 lbs (450 g) carrots, finely freshly grated1 cup (160 g) finely chopped fresh pineapple3 large eggs2 teaspoon vanilla extract1/2 teaspoon almond extract

2 cups (280 g) all purpose flour1/2 cup (65 g) almond meal or almond flour1 tablespoon baking powder1 teaspoon baking soda3/4 teaspoon fine sea salt1 teaspoon ground cinnamon1/2 teaspoon ground ginger1/2 teaspoon ground nutmeg1/4 teaspoon ground allspice1/4 teaspoon ground cloves1/4 teaspoon ground cardamom1/4 teaspoon fresh ground black pepper2 cups (400 g) white granulated suga

Note that the second line is a **partial** repeat of the first. That is, it is the same as the first after the first ingredient ("1 cup carrot juice") has been stripped away.

Similarly, the third line is the same as the second after the first ingredient in the second line has been stripped
away, and so on. This is probably due to a bug in some code that was supposed to parse the ingredients.

The problem is pretty weird; the fix is pretty ad hoc.

In [18]:
test33 = recipes.ingredients[test]

def check_index (i,ingreds_list):
    l_1,l_2 = ingreds_list[i],ingreds_list[i+1]
    len1, len2 = len(l_1), len(l_2)
    # if the following line is a partial copy of this one
    if (len1 > len2) and (l_1[-len2:] == l_2):
        # return the unduplicated part of this line.
        return l_1[:len1-len2]
    else:
        return l_1
    
def remove_dupes (container):
    # Some duplicates remain in the ingredient lists even after the trailing-line
    # problem is fixed. We can't use `set` alone to remove them, because the order
    # of a well-formed ingredient list often reflects internal structure (e.g.,
    # ingredients listed in order of use, or all spices grouped together),
    # so we keep the first occurrence of each element in its original position.
    seen,res = set(), []
    for elem in container:
        if elem not in seen:
            res.append(elem)
            seen.add(elem)
    return res

def remove_line_trailing_dupes (ingreds):
    ingreds_list = ingreds.split('\n')
    n= len(ingreds_list) -1
    pre = [check_index(i,ingreds_list) for i in range(n)] + ingreds_list[-1:]
    return '\n'.join(remove_dupes(pre))
  
print(test33)
print()
new_test33 = remove_line_trailing_dupes(test33)
print(new_test33)

1 cup carrot juice2 cups (280 g) all purpose flour1/2 cup (65 g) almond meal or almond flour1 tablespoon baking powder1 teaspoon baking soda3/4 teaspoon fine sea salt1 teaspoon ground cinnamon1/2 teaspoon ground ginger1/2 teaspoon ground nutmeg1/4 teaspoon ground allspice1/4 teaspoon ground cloves1/4 teaspoon ground cardamom1/4 teaspoon fresh ground black pepper2 cups (400 g) white granulated sugar3/4 cup (175 mL) neutral flavored cooking oil (like sunflower)1 lbs (450 g) carrots, finely freshly grated1 cup (160 g) finely chopped fresh pineapple3 large eggs2 teaspoon vanilla extract1/2 teaspoon almond extract
2 cups (280 g) all purpose flour1/2 cup (65 g) almond meal or almond flour1 tablespoon baking powder1 teaspoon baking soda3/4 teaspoon fine sea salt1 teaspoon ground cinnamon1/2 teaspoon ground ginger1/2 teaspoon ground nutmeg1/4 teaspoon ground allspice1/4 teaspoon ground cloves1/4 teaspoon ground cardamom1/4 teaspoon fresh ground black pepper2 cups (400 g) white granulated sugar

This trims down the size considerably.

In [19]:
print(len(test33))
print(len(new_test33))

9067
1284
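The logic is easier to follow on a small case. The sketch below is a compact restatement of the same idea, not the code used above: `strip_cascade` and the toy string are hypothetical names, and `dict.fromkeys` stands in for the order-preserving duplicate removal.

```python
def strip_cascade(text):
    """Undo a layout where each line is followed by a copy of its own tail."""
    lines = text.split('\n')
    cleaned = []
    for cur, nxt in zip(lines, lines[1:]):
        # If the next line is a proper suffix of this one,
        # keep only the part that the next line does not repeat.
        if len(cur) > len(nxt) and cur.endswith(nxt):
            cleaned.append(cur[:len(cur) - len(nxt)])
        else:
            cleaned.append(cur)
    cleaned.extend(lines[-1:])  # the last line has no successor to compare with
    # dict.fromkeys drops repeats while keeping first-seen order
    return '\n'.join(dict.fromkeys(cleaned))

toy = '1 cup juice2 cups flour3 eggs\n2 cups flour3 eggs\n3 eggs'
print(strip_cascade(toy))
```

Each line loses the tail that the following line repeats, leaving one ingredient per line.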


Applied to the whole column, the cleanup strips off about 2.3 million characters.

In [None]:
new_col = recipes.ingredients.apply(remove_line_trailing_dupes)

In [20]:
print(f'{recipes.ingredients.str.len().sum():,}')
print(f'{new_col.str.len().sum():,}')

42,386,905
40,114,633


#### End cleaning up ingredients

We can also search for particular kinds of recipes, such as how many recipes have the word "breakfast" in their description.

In [115]:
recipes.description.str.contains('[Bb]reakfast').sum()

3524

Or how many of the recipes list cinnamon as an ingredient:

In [36]:
recipes.ingredients.str.contains('[Cc]innamon').sum()

10537

We could even look to see whether any recipes misspell the ingredient as "cinamon":

In [117]:
recipes.ingredients.str.contains('[Cc]inamon').sum()

11

This is the type of essential data exploration that is possible with Pandas string tools.
It is data munging like this that Python really excels at.
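The two spelling searches can also be folded into one pattern. This is a small sketch on a made-up `Series`: the regular expression `cinn?amon` makes the second "n" optional, and `case=False` replaces the `[Cc]` character class.

```python
import pandas as pd

toy = pd.Series(['1 tsp Cinnamon', '2 tsp cinamon', '1 cup sugar'])

# 'n?' matches zero or one 'n', so both spellings are caught.
either_spelling = toy.str.contains(r'cinn?amon', case=False)
print(either_spelling.sum())

# The original patterns each catch only one of the two spellings.
correct_only = toy.str.contains('[Cc]innamon')
misspelled_only = toy.str.contains('[Cc]inamon')
```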

### A simple recipe recommender

Let's go a bit further, and start working on a simple recipe recommendation system: given a list of ingredients, find a recipe that uses all those ingredients.
While conceptually straightforward, the task is complicated by the heterogeneity of the data: there is no easy operation, for example, to extract a clean list of ingredients from each row.
So we will cheat a bit: we'll start with a list of common ingredients, and simply search to see whether they are in each recipe's ingredient list.
For simplicity, let's just stick with herbs and spices for the time being:

In [39]:
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
              'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']

We can then build a Boolean ``DataFrame`` consisting of True and False values, indicating whether each ingredient appears in each recipe's ingredient list:

In [40]:
import re
spice_df = pd.DataFrame(dict((spice, recipes.ingredients.str.contains(spice, re.IGNORECASE))
                             for spice in spice_list))
spice_df.head()

Unnamed: 0,salt,pepper,oregano,sage,parsley,rosemary,tarragon,thyme,paprika,cumin
0,False,False,False,True,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False
2,True,True,False,False,False,False,False,False,False,True
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False
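A side note on the call above: in the `Series.str.contains` signature the second positional parameter is `case`, with `flags` coming after it, so a regex flag passed positionally is read as a truthy `case` value and the match stays case-sensitive. That is consistent with row 0, where `salt` and `pepper` are False even though the recipe lists "Salt" and "Black Pepper", while `sage` is True because "Sausage" contains the letters "sage". The sketch below, on made-up strings, shows keyword forms that ignore case, plus a word-boundary pattern that avoids the substring match.

```python
import re
import pandas as pd

toy = pd.Series(['1/2 teaspoon Salt', 'pinch of salt', 'Breakfast Sausage'])

# Default matching is case-sensitive: 'Salt' is missed.
sensitive = toy.str.contains('salt')

# Passing the flag by keyword makes the match case-insensitive.
insensitive = toy.str.contains('salt', flags=re.IGNORECASE)

# A plain substring search finds 'sage' inside 'Sausage' ...
substring_hit = toy.str.contains('sage')

# ... while \b word boundaries require 'sage' to stand alone.
whole_word = toy.str.contains(r'\bsage\b', flags=re.IGNORECASE)

print(sensitive.tolist(), insensitive.tolist())
print(substring_hit.tolist(), whole_word.tolist())
```

Switching the real column to case-insensitive matching would change the counts shown in this section, so the outputs here should be read as case-sensitive results.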


Now, as an example, let's say we'd like to find a recipe that uses parsley, paprika, and tarragon.
We can compute this very quickly using the ``query()`` method of ``DataFrame``s, discussed in [High-Performance Pandas: ``eval()`` and ``query()``](03.12-Performance-Eval-and-Query.ipynb):

In [41]:
selection = spice_df.query('parsley & paprika & tarragon')
len(selection)

10

In [42]:
selection

Unnamed: 0,salt,pepper,oregano,sage,parsley,rosemary,tarragon,thyme,paprika,cumin
2069,False,True,False,False,True,False,True,False,True,False
74964,False,False,False,False,True,False,True,False,True,False
93768,True,True,False,True,True,False,True,False,True,False
113926,True,True,False,False,True,False,True,False,True,False
137686,True,True,False,False,True,False,True,False,True,False
140530,True,True,False,False,True,False,True,True,True,False
158475,True,True,False,False,True,False,True,False,True,True
158486,True,True,False,False,True,False,True,False,True,False
163175,True,True,True,False,True,False,True,False,True,False
165243,True,True,False,False,True,False,True,False,True,False


We find only 10 recipes with this combination; let's use the index returned by this selection to discover the names of the recipes that have this combination:

In [None]:
recipes.name[selection.index]

2069      All cremat with a Little Gem, dandelion and wa...
74964                         Lobster with Thermidor butter
93768      Burton's Southern Fried Chicken with White Gravy
113926                     Mijo's Slow Cooker Shredded Beef
137686                     Asparagus Soup with Poached Eggs
140530                                 Fried Oyster Po’boys
158475                Lamb shank tagine with herb tabbouleh
158486                 Southern fried chicken in buttermilk
163175            Fried Chicken Sliders with Pickles + Slaw
165243                        Bar Tartine Cauliflower Salad
Name: name, dtype: object

Now that we have narrowed down our recipe selection by a factor of more than 17,000 (from 173,278 recipes to 10), we are in a position to make a more informed decision about what we'd like to cook for dinner.
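The same selection can be driven by an arbitrary list of ingredients rather than a hand-written query string. This is a minimal sketch on a toy Boolean frame (the data and the `wanted` list are made up); it builds the query string from the list and checks it against an equivalent `all(axis=1)` mask.

```python
import pandas as pd

# Toy stand-in for spice_df: one Boolean column per spice.
toy_df = pd.DataFrame({
    'parsley':  [True, False, True],
    'paprika':  [True, True, True],
    'tarragon': [False, True, True],
})

wanted = ['parsley', 'paprika', 'tarragon']

# Option 1: assemble the query string 'parsley & paprika & tarragon'.
by_query = toy_df.query(' & '.join(wanted))

# Option 2: keep rows where every wanted column is True.
by_mask = toy_df[toy_df[wanted].all(axis=1)]

print(by_query.index.tolist(), by_mask.index.tolist())
```

Building the query string this way only works when every ingredient name is a valid Python identifier; the mask form has no such restriction.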

### Going further with recipes

Hopefully this example has given you a bit of a flavor (ba-dum!) for the types of data cleaning operations that are efficiently enabled by Pandas string methods.
Of course, building a very robust recipe recommendation system would require a *lot* more work!
Extracting full ingredient lists from each recipe would be an important piece of the task; unfortunately, the wide variety of formats used makes this a relatively time-consuming process.
This points to the truism that in data science, cleaning and munging of real-world data often comprises the majority of the work, and Pandas provides the tools that can help you do this efficiently.