# **Retrieval-Augmented Cooking Assistant**
## **Phase 1: Recipe Extraction and Vector Store Creation**

In this phase, we process a cookbook PDF to extract individual recipes and prepare the data for retrieval-augmented generation. We will use Python libraries such as pdfplumber, PyMuPDF (fitz), and pandas to parse the PDF content, extract the table of contents, and link each recipe to its corresponding text. The recipes are then segmented into manageable chunks and converted into LangChain documents. Finally, these documents are embedded using HuggingFace embeddings and stored in a Chroma vector store for efficient retrieval.

## Steps:

1. **Extract PDF Content**  
   - Open the cookbook PDF using pdfplumber to extract text on a page-by-page basis.  
   - Store each page’s number and text in a dictionary and convert it into a pandas DataFrame.

2. **Extract the Table of Contents**  
   - Use PyMuPDF (fitz) to extract the table of contents from the PDF.  
   - Filter the table to isolate level 2 entries, which correspond to the recipes.

3. **Merge and Organize Recipe Data**  
   - Isolate the pages containing recipes based on the identified page range (pages 23 to 449).  
   - Merge the PDF text DataFrame with the table of contents using an as-of merge to assign each page to the most recent recipe title.  
   - Group the pages by recipe title and join their text into a single, comprehensive description.

4. **Convert Recipes into LangChain Documents**  
   - Split each recipe description into smaller chunks using the RecursiveCharacterTextSplitter (at most 1000 characters per chunk, with a 100-character overlap, i.e. 10%) to handle lengthy recipes.  
   - Convert each chunk into a LangChain Document, tagging it with the corresponding recipe metadata.

5. **Build a Chroma Vector Store**  
   - Generate embeddings for each document using the HuggingFace MiniLM model.  
   - Store the documents and their embeddings in a Chroma vector store, enabling efficient retrieval for downstream question-answering tasks.

---

In [1]:
import pdfplumber
import fitz

import pandas as pd

In [None]:
pdf_path = "source/yotam-ottolenghi-ottolenghi-simple-a-cookbook-2018-pdf-free.pdf"

# store the page number and the text in that page in a dictionary
book_pdf = {
    'page_number': [],
    'text': []
}

# open the pdf and get the text inside it, divided by page
with pdfplumber.open(pdf_path) as pdf:
    
    book_pdf['page_number'] = [page.page_number for page in pdf.pages]
    book_pdf['text'] = [page.extract_text() for page in pdf.pages]

In [3]:
# convert to a dataframe
book_df = pd.DataFrame(book_pdf)
book_df

Unnamed: 0,page_number,text
0,1,
1,2,
2,3,
3,4,
4,5,Copyright © 2018 by Yotam Ottolenghi\nPhotogra...
...,...,...
485,486,Acknowledgments\nIt is my name that appears on...
486,487,Cornelia Staeubli and Sami Tamimi.\nI would al...
487,488,Tara Wigley\nEsme would like to thank: her hus...
488,489,YOTAM OTTOLENGHI is the author of the New York...


In [4]:
# read the book's table of contents
# each entry has a level, a title and a page number
doc = fitz.open(pdf_path)
doc_toc = pd.DataFrame(doc.get_toc(),columns=['level', 'title', 'page_number'])
doc_toc

Unnamed: 0,level,title,page_number
0,1,Title Page,2
1,1,Copyright,5
2,1,Contents,6
3,1,Introduction: Ottolenghi Simple,8
4,1,Brunch,21
...,...,...,...
161,2,Winter Feast,462
162,1,“Ottolenghi” Ingredients,463
163,1,Index,476
164,1,Acknowledgments,486


In [None]:
# inspecting the book shows that all the recipes are level 2 entries
recipes_toc = doc_toc[doc_toc['level'] == 2].reset_index(drop=True)
recipes_toc

Unnamed: 0,level,title,page_number
0,2,Braised Eggs with Leek and Za’atar,23
1,2,Harissa and Manchego Omeletes,26
2,2,Zucchini and Ciabatta Frittata,29
3,2,Portobello Mushrooms with Brioche and Poached ...,32
4,2,Scrambled Harissa Tofu,35
...,...,...,...
143,2,Tapas Feast,458
144,2,Middle Eastern Feast,459
145,2,Spring Lamb Feast,460
146,2,Summer Vegetarian Feast,461
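
The output above suggests that a level filter alone is not quite enough: the last level 2 entries ("Tapas Feast", "Middle Eastern Feast", ...) start after page 449, outside the recipe range used in this notebook. The sketch below shows, on a few entries copied from the table of contents, how a level filter and a page-range filter combine. It uses plain Python tuples instead of a DataFrame, and `select_recipes` and the two page constants are hypothetical names introduced only for illustration.

```python
# A few (level, title, page_number) entries copied from the outputs above
toc_entries = [
    (1, "Brunch", 21),
    (2, "Braised Eggs with Leek and Za'atar", 23),
    (2, "Harissa and Manchego Omeletes", 26),
    (2, "Tapas Feast", 458),
    (1, "Index", 476),
]

# Page range that the notebook treats as "recipe pages"
FIRST_RECIPE_PAGE = 23
LAST_RECIPE_PAGE = 449


def select_recipes(entries, level=2,
                   first_page=FIRST_RECIPE_PAGE, last_page=LAST_RECIPE_PAGE):
    """Keep entries at the given level whose start page is in the recipe range."""
    return [
        (title, page)
        for lvl, title, page in entries
        if lvl == level and first_page <= page <= last_page
    ]


recipe_entries = select_recipes(toc_entries)
```

In the notebook itself the page range is applied to the page DataFrame rather than to the table of contents, which has the same effect after the as-of merge: entries that start beyond the last recipe page never receive any pages.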


In [6]:
# inspecting the book shows that the recipes span pages 23 to 449
# keep only these pages (between() is inclusive on both ends)
recipes_df = book_df[book_df['page_number'].between(23, 449)].reset_index(drop=True)
recipes_df

Unnamed: 0,page_number,text
0,23,Braised eggs with leek and za’atar
1,24,This is a quick way to get a very comforting m...
2,25,"for 4–5 minutes, until most of the stock has e..."
3,26,Harissa and Manchego omeletes
4,27,I like to eat this either for brunch or for a ...
...,...,...
422,445,Anyone from Switzerland will tell you that the...
423,446,"2. Place the almond meal, granulated sugar, co..."
424,447,No-churn raspberry ice cream
425,448,This is the same recipe for ice cream used in ...


In [7]:
# combine the two dataframes so that the pages belonging to each recipe can be grouped together
# merge_asof requires both dataframes to be sorted on the merge key
recipes_df = recipes_df.sort_values('page_number')
recipes_toc = recipes_toc.sort_values('page_number')

# Use merge_asof to assign each page to the most recent recipe from the table of contents
merged = pd.merge_asof(recipes_df, recipes_toc, on='page_number', direction='backward')

# Now group by the recipe title and join all text pieces into a single description per recipe.
# With sort=False, recipes keep their order of appearance in the book (not alphabetical order)
recipes_combined = (
    merged.groupby('title', sort=False)['text']
    .apply(lambda texts: ' '.join(texts))
    .reset_index()
    .rename(columns={'title': 'recipe', 'text': 'description'})
)

# Display the resulting dataframe
recipes_combined

Unnamed: 0,recipe,description
0,Braised Eggs with Leek and Za’atar,Braised eggs with leek and za’atar This is a q...
1,Harissa and Manchego Omeletes,Harissa and Manchego omeletes I like to eat th...
2,Zucchini and Ciabatta Frittata,Zucchini and ciabatta frittata This is a regul...
3,Portobello Mushrooms with Brioche and Poached ...,Portobello mushrooms with brioche and\npoached...
4,Scrambled Harissa Tofu,Scrambled harissa tofu This was brought onto o...
...,...,...
135,Spiced Apple Cake,Spiced apple cake This can either be eaten as ...
136,"Nutella, Sesame, and Hazelnut Rolls","Nutella, sesame, and hazelnut rolls Two assump..."
137,Mint and Pistachio Chocolate Fridge Cake,Mint and pistachio chocolate fridge cake This ...
138,Brunsli Chocolate Cookies,Brunsli chocolate cookies Anyone from Switzerl...
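
To make the as-of merge less of a black box, here is a minimal standard-library sketch of the same idea: each page is assigned to the last table-of-contents entry whose start page is less than or equal to the page number, which is what `direction='backward'` does. The three start pages and titles come from the outputs above; the page texts and the `assign_recipe` helper are hypothetical placeholders, and this is an illustration of the logic, not how pandas implements it.

```python
from bisect import bisect_right

# (start_page, title) pairs, sorted by start page
toc = [
    (23, "Braised Eggs with Leek and Za'atar"),
    (26, "Harissa and Manchego Omeletes"),
    (29, "Zucchini and Ciabatta Frittata"),
]
starts = [start for start, _ in toc]

# Placeholder page texts standing in for the extracted PDF text
pages = {n: f"page {n} text" for n in range(23, 31)}


def assign_recipe(page_number):
    """Return the title of the most recent recipe starting at or before this page."""
    idx = bisect_right(starts, page_number) - 1
    return toc[idx][1] if idx >= 0 else None


# Group page texts by recipe, preserving the order of first appearance
grouped = {}
for page_number in sorted(pages):
    title = assign_recipe(page_number)
    if title is not None:
        grouped.setdefault(title, []).append(pages[page_number])

recipes = {title: " ".join(texts) for title, texts in grouped.items()}
```

A page that precedes every entry gets no title here (`None`); in `pd.merge_asof` the corresponding rows would have missing values in the columns coming from the table of contents.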


---

### Convert each recipe to a LangChain Document

To optimise retrieval, let's split the dataframe so that each recipe is a self-contained chunk of information. A recipe might be long, so to be safe let's limit each chunk to 1000 characters (the splitter measures length in characters by default, not tokens) and allow an overlap of 10%.

In [8]:
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [9]:
# first split each recipe into chunks no longer than chunk_size characters,
# then convert each chunk into a Document and append it to the list
docs = []

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

for _, row in recipes_combined.iterrows():
    chunks = splitter.split_text(row['description'])
    for chunk in chunks:
        docs.append(
            Document(
                page_content=chunk,
                metadata={"recipe": row['recipe']}
            )
        )
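
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sliding-window chunker written with the standard library only. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which first tries to split on separators such as paragraph breaks, newlines and spaces so that chunks end at natural boundaries. The function name `chunk_by_characters` is hypothetical; the point is only that sizes are counted in characters and that consecutive chunks share `chunk_overlap` characters.

```python
def chunk_by_characters(text, chunk_size=1000, chunk_overlap=100):
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# A 2500-character stand-in for a long recipe description
sample_text = "".join(str(i % 10) for i in range(2500))
sample_chunks = chunk_by_characters(sample_text)
```

With these settings a 2500-character description yields three windows, and the last 100 characters of one chunk are repeated at the start of the next, so an instruction that straddles a boundary is still fully contained in at least one chunk.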

### Create the embeddings and the vectorstore

In [10]:
from langchain_community.vectorstores import Chroma
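# HuggingFaceEmbeddings is imported from langchain_community, like Chroma above;
# the langchain.embeddings import path is deprecated in recent LangChain releases
from langchain_community.embeddings import HuggingFaceEmbeddings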

In [12]:
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory="data/recipes_vectorstore") # save the vectorstore locally
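
Conceptually, the vector store answers a query by embedding it and returning the stored chunks whose vectors are closest to the query vector. The toy sketch below shows that idea with cosine similarity over hand-written three-dimensional vectors. The vectors, the `toy_store` list and the `top_k` helper are all hypothetical; real MiniLM embeddings have far more dimensions, and Chroma uses its own index rather than a linear scan like this one.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# (recipe title, made-up embedding) pairs; titles come from the outputs above
toy_store = [
    ("Scrambled Harissa Tofu", [0.9, 0.1, 0.0]),
    ("Spiced Apple Cake", [0.0, 0.2, 0.9]),
    ("Brunsli Chocolate Cookies", [0.1, 0.1, 0.95]),
]


def top_k(query_vector, store, k=2):
    """Return the titles of the k entries most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [title for title, _ in ranked[:k]]


# Made-up query vectors pointing towards the "dessert" and "savoury" directions
dessert_hits = top_k([0.0, 0.1, 1.0], toy_store, k=2)
savoury_hits = top_k([1.0, 0.0, 0.0], toy_store, k=1)
```

Against the real store built above, the equivalent lookup would go through the vector store's own query methods, for example `vectorstore.similarity_search(query, k=...)`, with the query text embedded by the same MiniLM model.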