# Scrape recipes and extract KG triples using OpenAI and LlamaIndex

This notebook demonstrates how the data in `./recipe_data/` was extracted. Extraction is generally easier to run as a script; see `scrape_recipes.py`.

In [None]:
import os, sys

open_ai_key = '...'
os.environ['OPENAI_API_KEY'] = open_ai_key
sys.path = ['/Users/walder2/kg_uq/'] + sys.path
path_to_data = '/Users/walder2/kg_uq/recipe_data'

In [2]:
from kg_extraction import *


### Enter websites from 'allrecipes.com' to be scraped

You should also include the recipe titles in the list `recipes`, in the same order as `websites`. Try to follow the format provided if possible: each title in `recipes` is passed to the LLM together with the extracted ingredients and directions, to tell it which recipe is being read.

In [3]:
websites = [
    "https://www.allrecipes.com/sweet-potato-dump-cake-recipe-8363654",
    "https://www.allrecipes.com/chocolate-stout-cake-recipe-8426369",
    "https://www.allrecipes.com/sweet-potato-sheet-cake-recipe-8405784",
    "https://www.allrecipes.com/recipe/16607/cheesecake-brownies/",
    "https://www.allrecipes.com/recipe/211165/pumpkin-brownies/",
    "https://www.allrecipes.com/recipe/9566/fudge-brownies-i/",
    "https://www.allrecipes.com/recipe/9827/chocolate-chocolate-chip-cookies-i/",
    "https://www.allrecipes.com/recipe/26237/double-chocolate-chip-cookies/",
]
recipes = [
    'sweet potato dump cake',
    'chocolate stout cake',
    'sweet potato sheet cake',
    'cheesecake brownies',
    'pumpkin brownies',
    'fudge brownies',
    'chocolate chip cookies',
    'double chocolate chip cookies'
]
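
Because the two lists are matched by position, a quick check before scraping can catch a missing title. The sketch below also assumes, based on the file names printed later in this notebook, that a title becomes a file name by replacing spaces with underscores. The helpers `title_to_stem` and `pair_sites_with_titles` are hypothetical and are not part of `kg_extraction`.

```python
def title_to_stem(title: str) -> str:
    """Hypothetical helper: 'fudge brownies' -> 'fudge_brownies'."""
    return "_".join(title.lower().split())


def pair_sites_with_titles(websites, recipes):
    """Pair each URL with a file stem, failing early on mismatched lists."""
    if len(websites) != len(recipes):
        raise ValueError(f"{len(websites)} URLs but {len(recipes)} titles")
    stems = [title_to_stem(t) for t in recipes]
    duplicates = {s for s in stems if stems.count(s) > 1}
    if duplicates:
        # Two titles with the same stem would overwrite each other on disk.
        raise ValueError(f"titles map to the same file name: {sorted(duplicates)}")
    return list(zip(websites, stems))


# Two entries taken from the lists above, for illustration only.
example_sites = [
    "https://www.allrecipes.com/recipe/9566/fudge-brownies-i/",
    "https://www.allrecipes.com/recipe/211165/pumpkin-brownies/",
]
example_titles = ["fudge brownies", "pumpkin brownies"]

pairs = pair_sites_with_titles(example_sites, example_titles)
for url, stem in pairs:
    print(f"{stem}.txt <- {url}")
```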

### Scrape the websites listed and write the html contents to txt files. 

This data will be cleaned up with a call to an LLM. For now we just grab it all. 

In [4]:
scrape_recipe_websites(websites, recipes, data_dir = path_to_data)
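
For readers who want a picture of what "grab it all" can look like, here is a minimal sketch using only the standard library. It is not the implementation of `scrape_recipe_websites`: the function names, the `User-Agent` header, and the output file naming are all assumptions made for illustration.

```python
from pathlib import Path
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 30.0) -> str:
    """Download a page and return its raw HTML as text."""
    # Assumption: a browser-like User-Agent is sent; some sites reject the default one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def save_raw_html(url: str, title: str, out_dir, fetch=fetch_html) -> Path:
    """Write the unprocessed HTML of `url` to `<out_dir>/<title_with_underscores>.txt`."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / ("_".join(title.lower().split()) + ".txt")
    # No cleaning happens here; that step is left to the later LLM call.
    out_path.write_text(fetch(url), encoding="utf-8")
    return out_path
```

The `fetch` argument is there so the download step can be swapped out, for example with a stub when working offline.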

### Now we are going to extract the ingredients and directions for cooking.

To see the prompts used in the LLM calls for the ingredients and directions, look at the function `extract_ingredients_directions`. You can change the prompts to refine the extraction by passing a `PromptTemplate` to the `directions_template` or `ingredients_template` argument. Passing `verbose = True` prints the content as it is extracted.

**Note**: Extractions are written to `./recipe_data/txt_files/recipe_title.txt`. If the file already exists, extraction is skipped, so clear the file or folder if you want to try a new extraction under the same name.

In [5]:
extract_ingredients_directions(data_dir = path_to_data)

Extracting ingredients and directions from html data .... 

File already exists: /Users/walder2/kg_uq/recipe_data/txt_files/cheesecake_brownies.txt
File already exists: /Users/walder2/kg_uq/recipe_data/txt_files/chocolate_chip_cookies.txt
File already exists: /Users/walder2/kg_uq/recipe_data/txt_files/chocolate_stout_cake.txt
File already exists: /Users/walder2/kg_uq/recipe_data/txt_files/double_chocolate_chip_cookies.txt
File already exists: /Users/walder2/kg_uq/recipe_data/txt_files/fudge_brownies.txt
File already exists: /Users/walder2/kg_uq/recipe_data/txt_files/pumpkin_brownies.txt
File already exists: /Users/walder2/kg_uq/recipe_data/txt_files/sweet_potato_dump_cake.txt
File already exists: /Users/walder2/kg_uq/recipe_data/txt_files/sweet_potato_sheet_cake.txt
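
The "File already exists" lines above come from the skip rule described in the note. A minimal sketch of that pattern follows, assuming the check is a plain existence test on the output path. `extract_once` and `clear_extraction` are hypothetical names, not functions from `kg_extraction`.

```python
from pathlib import Path


def extract_once(out_path, extract) -> bool:
    """Run `extract()` and save its text unless `out_path` already exists.

    Returns True if a new file was written and False if the step was skipped.
    """
    out_path = Path(out_path)
    if out_path.exists():
        print(f"File already exists: {out_path}")
        return False
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(extract(), encoding="utf-8")
    return True


def clear_extraction(out_path) -> None:
    """Delete a previous result so the next run extracts again."""
    Path(out_path).unlink(missing_ok=True)
```

Passing the expensive step as a callable means the LLM call is only made when the file is missing.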


### Specify the entity types and relations.

Specify the entity types and relation types in the cell below. Have a look at https://schema.org/ for details on the entities I defined.


In [6]:
entity_types = {
    "recipe": 'https://schema.org/Recipe',
    "ingredient": "https://schema.org/recipeIngredient",
    "measurement": "https://schema.org/QuantitativeValue",  
}

relation_types = {
    "hasCharacteristic": "https://schema.org/additionalProperty",
    "hasColor": "https://schema.org/color",
    "hasMeasurement": "https://schema.org/hasMeasurement",
    "cookTime": "https://schema.org/cookTime",
    "recipeInstruction": "https://schema.org/recipeInstructions",
}

### Extract the triples and context.

Below is a function for extracting the entity and relation types specified above. The results are dumped to `./recipe_data/kg_files/recipe_title.json`. Note that you can pass in your own `user_prompt`. This prompt tells the LLM what its task is and gives an example of it. It must include a formatting call that uses `entity_types` and `relation_types`. The `system_prompt` tells the LLM how it should extract information. You can see the specifications of both in `./kg_extraction/recipe_prompts.py`.
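
To illustrate the formatting requirement, here is a rough sketch of a prompt template filled with `str.format`. The prompt wording, the placeholder names `{entity_types}` and `{relation_types}`, and the helper `build_user_prompt` are assumptions for illustration; the actual prompts are the ones in `./kg_extraction/recipe_prompts.py`.

```python
import json

# Hypothetical template. Any literal braces in the text would need to be
# doubled ({{ and }}) so that str.format does not treat them as placeholders.
USER_PROMPT_TEMPLATE = (
    "Extract knowledge-graph triples from the recipe text.\n"
    "Allowed entity types:\n{entity_types}\n"
    "Allowed relation types:\n{relation_types}\n"
    "Return one triple per line as: head | relation | tail\n"
)


def build_user_prompt(template: str, entity_types: dict, relation_types: dict) -> str:
    """Fill the two placeholders with JSON renderings of the schema dicts."""
    return template.format(
        entity_types=json.dumps(entity_types, indent=2),
        relation_types=json.dumps(relation_types, indent=2),
    )


prompt = build_user_prompt(
    USER_PROMPT_TEMPLATE,
    {"recipe": "https://schema.org/Recipe"},
    {"cookTime": "https://schema.org/cookTime"},
)
print(prompt)
```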

In [7]:
extract_recipe_kg(entity_types=entity_types, relation_types=relation_types, data_dir=path_to_data)

File already exists: /Users/walder2/kg_uq/recipe_data/kg_files/cheesecake_brownies.json
File already exists: /Users/walder2/kg_uq/recipe_data/kg_files/chocolate_chip_cookies.json
File already exists: /Users/walder2/kg_uq/recipe_data/kg_files/chocolate_stout_cake.json
File already exists: /Users/walder2/kg_uq/recipe_data/kg_files/double_chocolate_chip_cookies.json
File already exists: /Users/walder2/kg_uq/recipe_data/kg_files/fudge_brownies.json
File already exists: /Users/walder2/kg_uq/recipe_data/kg_files/pumpkin_brownies.json
File already exists: /Users/walder2/kg_uq/recipe_data/kg_files/sweet_potato_dump_cake.json
File already exists: /Users/walder2/kg_uq/recipe_data/kg_files/sweet_potato_sheet_cake.json


### Take a look at the extracted KG

The value returned by `get_recipe_kg` is a tuple containing a DataFrame of the extracted triples, one row per triple.

In [8]:
kg = get_recipe_kg(data_dir=path_to_data) 

In [9]:
kg

| Unnamed: 0 | head | head_type | relation | tail | tail_type | kg_idx |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | cheesecake brownies | recipe | hasIngredient | brownie mix | ingredient | 0 |
| 1 | cheesecake brownies | recipe | hasIngredient | water | ingredient | 0 |
| 2 | cheesecake brownies | recipe | hasIngredient | vegetable oil | ingredient | 0 |
| 3 | cheesecake brownies | recipe | hasIngredient | eggs | ingredient | 0 |
| 4 | cheesecake brownies | recipe | hasIngredient | cream cheese | ingredient | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 136 | sweet potato sheet cake | recipe | recipeInstruction | add powdered sugar and beat until well combined | instruction | 7 |
| 137 | sweet potato sheet cake | recipe | recipeInstruction | spread the frosting evenly over the cooled cak... | instruction | 7 |
| 138 | sweet potato sheet cake | recipe | hasIngredient | sweet potatoes | ingredient | 7 |
| 139 | sweet potato sheet cake | recipe | hasIngredient | box mix | ingredient | 7 |
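
Once the triples are in a DataFrame, ordinary pandas filtering is enough to query them. The sketch below rebuilds a few of the rows shown above by hand so that it runs on its own; with the real data you would use the DataFrame from `get_recipe_kg` instead. The helper `ingredients_of` is hypothetical.

```python
import pandas as pd

# A handful of rows copied from the output above, for illustration only.
triples = pd.DataFrame(
    [
        ("cheesecake brownies", "recipe", "hasIngredient", "brownie mix", "ingredient", 0),
        ("cheesecake brownies", "recipe", "hasIngredient", "water", "ingredient", 0),
        ("cheesecake brownies", "recipe", "hasIngredient", "vegetable oil", "ingredient", 0),
        ("cheesecake brownies", "recipe", "hasIngredient", "eggs", "ingredient", 0),
        ("cheesecake brownies", "recipe", "hasIngredient", "cream cheese", "ingredient", 0),
        ("sweet potato sheet cake", "recipe", "recipeInstruction",
         "add powdered sugar and beat until well combined", "instruction", 7),
        ("sweet potato sheet cake", "recipe", "hasIngredient", "sweet potatoes", "ingredient", 7),
        ("sweet potato sheet cake", "recipe", "hasIngredient", "box mix", "ingredient", 7),
    ],
    columns=["head", "head_type", "relation", "tail", "tail_type", "kg_idx"],
)


def ingredients_of(df: pd.DataFrame, recipe: str) -> list:
    """Return the tails of all hasIngredient triples whose head is `recipe`."""
    mask = (df["head"] == recipe) & (df["relation"] == "hasIngredient")
    return df.loc[mask, "tail"].tolist()


# Number of triples per recipe and relation, a quick way to spot thin extractions.
relation_counts = triples.groupby(["head", "relation"]).size()

print(ingredients_of(triples, "sweet potato sheet cake"))
print(relation_counts)
```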
