# Scrape recipes and extract KG triples using OpenAI and LlamaIndex

This notebook demonstrates how the data in `./recipe_data/` was extracted. Extraction is generally easier to run as a script; see `scrape_recipes.py`.

In [1]:
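import os
import sys

import nest_asyncio

# Replace '...' with your OpenAI API key.
open_ai_key = '...'
os.environ['OPENAI_API_KEY'] = open_ai_key

# Make the project modules importable and point to the data directory.
sys.path = ['/Users/walder2/kg_uq/'] + sys.path
path_to_data = '/Users/walder2/kg_uq/recipe_data'

# Allow nested event loops so async calls can run inside the notebook.
nest_asyncio.apply()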

In [9]:
from recipe_data import * 
from kg_extraction import *


### Enter websites from 'allrecipes.com' to be scraped

You should also include the recipe titles in the list `recipes`, in the same order as `websites`. Try to follow the format provided if possible. Note that each title in `recipes` is passed to the LLM together with the extracted ingredients and directions, so that the model knows which recipe it is reading.

In [3]:
websites = [
    'https://www.allrecipes.com/recipe/9174/peanut-butter-pie/',
    'https://www.allrecipes.com/recipe/12506/coconut-pie/',
    'https://www.allrecipes.com/twix-pie-recipe-7563548',
    'https://www.allrecipes.com/recipe/8487044/brownie-pie/',
    'https://www.allrecipes.com/recipe/23439/perfect-pumpkin-pie/',
    'https://www.allrecipes.com/recipe/12151/banana-cream-pie-i/'
]

recipes = [
    'peanut butter pie', 
    'cocnut pie', 
    'twix pie', 
    'brownie pie', 
    'perfect pumpkin pie', 
    'banana cream pie'
]
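
Because `websites` and `recipes` are matched by position, a quick check before scraping can catch a missing or duplicated title. The sketch below is not part of the repository: `title_to_stem` and `check_recipe_inputs` are hypothetical helpers, and the underscore naming is only an assumption based on the file names printed in the scraping output.

```python
from urllib.parse import urlparse


def title_to_stem(title: str) -> str:
    """Turn a recipe title into a file stem, e.g. 'twix pie' -> 'twix_pie'."""
    return "_".join(title.lower().split())


def check_recipe_inputs(websites: list[str], recipes: list[str]) -> dict[str, str]:
    """Pair each title with its URL and fail early on obvious mistakes."""
    if len(websites) != len(recipes):
        raise ValueError(
            f"{len(websites)} websites but {len(recipes)} recipe titles"
        )
    pairs = {}
    for url, title in zip(websites, recipes):
        # Only allrecipes.com pages are expected by this notebook.
        host = urlparse(url).netloc
        if not host.endswith("allrecipes.com"):
            raise ValueError(f"{url} is not an allrecipes.com page")
        stem = title_to_stem(title)
        if stem in pairs:
            raise ValueError(f"duplicate recipe title: {title!r}")
        pairs[stem] = url
    return pairs


example = check_recipe_inputs(
    ["https://www.allrecipes.com/twix-pie-recipe-7563548"],
    ["twix pie"],
)
print(example)
```

Running a check like this on the full lists before the scraping cell surfaces a length mismatch immediately rather than partway through the scrape.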

### Scrape the websites listed and write the html contents to txt files. 

This data will be cleaned up with a call to an LLM. For now we just grab it all. 

In [4]:
html_files = scrape_recipe_websites(websites=websites,
                                    recipes=recipes,
                                    data_dir=path_to_data,
                                    verbose=True,
                                    return_files=True)

 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
'./recipe_data/html_files/peanut_butter_pie.txt'
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Already extracted html content for peanut butter pie.

 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
'./recipe_data/html_files/cocnut_pie.txt'
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Already extracted html content for cocnut pie.

 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
'./recipe_data/html_files/twix_pie.txt'
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Already extracted html content for twix pie.

 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
'./recipe_data/html_files/brownie_pie.txt'
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Already extracted html content for brownie pie.

 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
'./recipe_data/html_files/perfect_pumpkin_pie.txt'
 - - - - - - - - - - - - - - - - - - 
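
The "Already extracted" messages above suggest that scraping is skipped when the output file exists. A minimal sketch of that pattern follows; it is an illustration under assumptions, not the code of `scrape_recipe_websites`. The names `fetch_html` and `scrape_once` are hypothetical, and the `html_files/<title>.txt` layout is inferred from the printed paths.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def fetch_html(url: str) -> str:
    """Download a page and decode it as UTF-8 (needs network access)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_once(url: str, title: str, data_dir: str,
                fetch: Callable[[str], str] = fetch_html) -> Path:
    """Write the raw page for `title` to html_files/<title>.txt unless it exists."""
    out_dir = Path(data_dir) / "html_files"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / (title.replace(" ", "_") + ".txt")
    if out_file.exists():
        # Cached copy found: do not hit the website again.
        print(f"Already extracted html content for {title}.")
    else:
        out_file.write_text(fetch(url), encoding="utf-8")
    return out_file
```

Passing the downloader in as `fetch` keeps the caching logic testable without network access; some sites may also reject the default `urllib` user agent, in which case a different downloader can be swapped in.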

### Now we are going to extract the ingredients and directions for cooking.

To see the prompts used in the LLM calls for the ingredients and directions, look at the function `extract_ingredients_directions`. You can change the prompts to refine the extraction by passing a `PromptTemplate` to the arguments `directions_template` or `ingredients_template`. Passing `verbose = True` prints the content as it is extracted.

**Note**: Extractions are written to './recipe_data/recipe_title.txt'. If the file already exists, extraction is skipped, so make sure you clear the folder or file if you want to try a new extraction under the same name.

In [5]:
txt_files = extract_recipe_content(data_dir=path_to_data, html_files=html_files, return_files=True)

Extracting content from html docs...
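
As the note above says, an existing output file causes extraction to be skipped. One way to force a fresh run for a single recipe is to delete its cached file first. The helper below is a hypothetical sketch: `clear_extraction` is not part of the repository, and it assumes the file is named after the recipe title with spaces replaced by underscores.

```python
from pathlib import Path


def clear_extraction(data_dir: str, title: str) -> bool:
    """Delete the cached extraction for `title`; return True if a file was removed."""
    target = Path(data_dir) / (title.replace(" ", "_") + ".txt")
    if target.is_file():
        target.unlink()
        return True
    return False
```

For example, `clear_extraction(path_to_data, 'twix pie')` would remove `twix_pie.txt` from the data directory, if present, so that the next call to `extract_recipe_content` regenerates it.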


### Specify the entity types and relations. 

Below you can specify information for entity types and relation types. Have a look at https://schema.org/ for details on the entities I defined below. 


In [6]:
entity_types = {
    "recipe": "https://schema.org/Recipe",
    "ingredient": "https://schema.org/recipeIngredient",
    "measurement": "https://schema.org/QuantitativeValue",
    "nutrition": "https://schema.org/nutrition",
}

relation_types = {
    "hasCharacteristic": "https://schema.org/additionalProperty",
    "hasColor": "https://schema.org/color",
    "hasMeasurement": "https://schema.org/hasMeasurement",
    "cookTime": "https://schema.org/cookTime",
    "recipeInstruction": "https://schema.org/recipeInstructions",
}

### Extract the triples and context.

Below is a function for extracting the entity and relation types specified above. The results are dumped to `./recipe_data/kg_files/recipe_title.json`. Note that you can pass in your own `user_prompt`. This prompt tells the LLM what its task is and gives an example of it. It must include a formatting call that uses `entity_types` and `relation_types`. The `system_prompt` tells the LLM how it should extract information. You can see the specifications of both in `./kg_extraction/recipe_prompts.py`.
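
To illustrate what a "formatting call" with `entity_types` and `relation_types` could look like, here is a minimal sketch using plain `str.format`. The template wording is made up for illustration and is not the prompt in `recipe_prompts.py`; whether `extract_recipe_kg` expects a plain string or a `PromptTemplate` is also an assumption to check against the source.

```python
import json

# Shortened copies of the dictionaries defined earlier in the notebook.
entity_types = {
    "recipe": "https://schema.org/Recipe",
    "ingredient": "https://schema.org/recipeIngredient",
}
relation_types = {
    "hasMeasurement": "https://schema.org/hasMeasurement",
    "cookTime": "https://schema.org/cookTime",
}

# Hypothetical user prompt with placeholders for the two type dictionaries.
user_prompt = (
    "Extract knowledge-graph triples from the recipe text.\n"
    "Allowed entity types:\n{entity_types}\n"
    "Allowed relation types:\n{relation_types}\n"
    "Return each triple as (head, relation, tail).\n"
)

# Fill the placeholders; json.dumps gives the LLM a readable listing.
formatted = user_prompt.format(
    entity_types=json.dumps(entity_types, indent=2),
    relation_types=json.dumps(relation_types, indent=2),
)
print(formatted)
```

Keeping the type dictionaries out of the template text means the same prompt can be reused when the schema changes.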

In [7]:
extract_recipe_kg(entity_types=entity_types, relation_types=relation_types, data_dir=path_to_data,
                  txt_files=txt_files)

 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
'./recipe_data/kg_files/peanut_butter_pie.json'
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Already extracted peanut_butter_pie.

 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
'./recipe_data/kg_files/cocnut_pie.json'
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Already extracted cocnut_pie.

 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
'./recipe_data/kg_files/twix_pie.json'
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Already extracted twix_pie.

 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
'./recipe_data/kg_files/brownie_pie.json'
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Already extracted brownie_pie.

 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
'./recipe_data/kg_files/perfect_pumpkin_pie.json'
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Already extracted perfect_pumpkin_pie.

 - - - - -

### Take a look at the extracted KG

The value returned by `get_recipe_kg` is a tuple containing a DataFrame.

In [10]:
kg = get_recipe_kg(data_dir=path_to_data) 

In [11]:
kg

| Unnamed: 0 | head | head_type | relation | tail | tail_type | kg_idx | kg_name |
|---|---|---|---|---|---|---|---|
| 0 | apple spinach salad | recipe | hasIngredient | 2 cups baby spinach leaves | ingredient | 0 | apple spinach salad |
| 1 | apple spinach salad | recipe | hasIngredient | 1 medium apple, sliced | ingredient | 0 | apple spinach salad |
| 2 | apple spinach salad | recipe | hasIngredient | 2 tablespoons chopped celery | ingredient | 0 | apple spinach salad |
| 3 | apple spinach salad | recipe | hasIngredient | 2 tablespoons toasted PLANTERS Pecans, chopped | ingredient | 0 | apple spinach salad |
| 4 | apple spinach salad | recipe | hasIngredient | 2 tablespoons KRAFT LIGHT DONE RIGHT! House It... | ingredient | 0 | apple spinach salad |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 796 | vanilla frozen yogurt | recipe | hasMeasurement | ⅔ cup | measurement | 43 | vanilla frozen yogurt |
| 797 | vanilla frozen yogurt | recipe | hasMeasurement | 1 teaspoon | measurement | 43 | vanilla frozen yogurt |
| 798 | vanilla frozen yogurt | recipe | hasIngredient | nonfat Greek yogurt | ingredient | 43 | vanilla frozen yogurt |
| 799 | vanilla frozen yogurt | recipe | hasIngredient | 3 cups white sugar | ingredient | 43 | vanilla frozen yogurt |
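
Once the triples are loaded, a common next step is to group them by recipe or to count relation types. The sketch below uses only the standard library on a few rows copied from the output above; `triples_by_recipe` and `relation_counts` are hypothetical helpers. Assuming `kg` is a pandas DataFrame (or that the DataFrame is taken out of the returned tuple), rows in this shape can be obtained with `kg.to_dict("records")`.

```python
from collections import Counter, defaultdict

# A few rows copied from the displayed output, as plain dictionaries.
rows = [
    {"head": "apple spinach salad", "relation": "hasIngredient",
     "tail": "2 cups baby spinach leaves", "kg_name": "apple spinach salad"},
    {"head": "apple spinach salad", "relation": "hasIngredient",
     "tail": "2 tablespoons chopped celery", "kg_name": "apple spinach salad"},
    {"head": "vanilla frozen yogurt", "relation": "hasMeasurement",
     "tail": "1 teaspoon", "kg_name": "vanilla frozen yogurt"},
    {"head": "vanilla frozen yogurt", "relation": "hasIngredient",
     "tail": "nonfat Greek yogurt", "kg_name": "vanilla frozen yogurt"},
]


def triples_by_recipe(rows):
    """Group (head, relation, tail) triples under their recipe name."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["kg_name"]].append(
            (row["head"], row["relation"], row["tail"])
        )
    return dict(grouped)


def relation_counts(rows):
    """Count how often each relation type occurs."""
    return Counter(row["relation"] for row in rows)


grouped = triples_by_recipe(rows)
counts = relation_counts(rows)
print(len(grouped["apple spinach salad"]), counts["hasIngredient"])
```

Grouping by `kg_name` mirrors the `kg_idx`/`kg_name` columns, which identify the per-recipe graph each triple belongs to.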


To see how the entire dataset was scraped, check out `scripts/scrape_recipes.py`.