## Conversational Recipe Bot with RAG (with Pinecone Vector DB and LangChain Vector Summary)

#### Helpful resources:

https://www.pinecone.io/learn/retrieval-augmented-generation/#RAG-is-the-most-cost-effective-easy-to-implement-and-lowest-risk-path-to-higher-performance-for-GenAI-applications.

https://www.pinecone.io/learn/vector-database/

https://docs.pinecone.io/guides/get-started/build-a-rag-chatbot

    Set up the environment

In [1]:
%pip install langchain langchain_community scikit-learn langchain-ollama sentence-transformers tiktoken pinecone python-dotenv gradio pypdf

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pinecone
import os
from dotenv import load_dotenv
import gradio as gr
from langchain.vectorstores import Pinecone
from langchain.schema import BaseRetriever
from typing import List, Dict, Any, Tuple
# from langchain.chains import RetrievalQA
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# from langchain.embeddings import HuggingFaceEmbeddings
from langchain_ollama import ChatOllama
# from langchain.prompts import PromptTemplate
from langchain.memory import VectorStoreRetrieverMemory
from langchain.chains import ConversationalRetrievalChain
from sentence_transformers import SentenceTransformer

What we are importing:

*  pinecone for the vector store,

*  PyPDFLoader for document loading,

*  RecursiveCharacterTextSplitter for text splitting,

*  SentenceTransformer for embeddings and ChatOllama for the conversational LLM,

*  VectorStoreRetrieverMemory for storing conversation history,

*  ConversationalRetrievalChain for conversational question answering,

*  gradio for the chat UI and load_dotenv for reading the API key from a .env file.

RetrievalQA, HuggingFaceEmbeddings and PromptTemplate are left commented out because the final pipeline does not use them.

    Load and prepare documents
Documents can be anything: we can load a PDF or use web pages as the source.

In [None]:
# List of PDF file paths to load documents from (the below mentioned book is 102 pages)
pdf_paths = [
    "/home/recipe-sample.pdf"
]

     Split documents

In [4]:
docs = [PyPDFLoader(pdf_path).load() for pdf_path in pdf_paths]
docs_list = [item for sublist in docs for item in sublist]

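# from_tiktoken_encoder measures chunk_size and chunk_overlap in tokens, not characters:
# 1000 tokens for better context understanding, 200 tokens of overlap to avoid missing context
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
    is_separator_regex=True,  # without this the look-behind pattern is matched as literal text
    add_start_index=True  # Helps track document position
)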
doc_splits = text_splitter.split_documents(docs_list)
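
To make the effect of `chunk_overlap` concrete, here is a simplified sliding-window sketch. It is not LangChain's actual algorithm (the real splitter also honours the separators and merges pieces, so its boundaries will differ), and `sliding_chunks` is a hypothetical helper used only for illustration.

```python
from typing import List


def sliding_chunks(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = sliding_chunks(tokens, chunk_size=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last two tokens of the previous one, which is why a sentence cut at a chunk boundary still appears whole in at least one chunk.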

     Using Pinecone as the vector store

More details here about why we chose Pinecone: https://docs.google.com/spreadsheets/d/19Usb8_hIsGk61SLxQewUZ4rWexClSa26n1jfoNekCvE/edit?gid=506308235#gid=506308235

We are creating a new instance of the Pinecone client, passing in our API key.

We are also defining the name of the index that we will be using for our conversational bot.

In [None]:
load_dotenv(dotenv_path="/home/chefly/.env")

api_key=os.getenv("PINECONE_API_KEY")
pc = pinecone.Pinecone(api_key=api_key)
index_name = "rag-conversational-bot"

Set up a multilingual sentence transformer model to convert text into embeddings (vectors of numbers)

This specific model is called "intfloat/multilingual-e5-large" and it turns text into vectors of 1024 numbers

In [None]:
import os
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "/home/chefly/ai/cache"
from sentence_transformers import SentenceTransformer

embed_model = SentenceTransformer("intfloat/multilingual-e5-large")

Custom E5 Embedding Function
* This function is used to convert text into embeddings
* We use the e5 model to do this, which is a powerful multilingual model
* The function takes a list of text strings and an optional boolean parameter, `is_query`
* If `is_query` is true, each string gets the `query: ` prefix that E5 expects for search queries
* Otherwise each string gets the `passage: ` prefix used for documents
* Before the prefix is added, the text is lowercased and stripped of surrounding whitespace
* The e5 model then encodes the formatted strings, and the embeddings are returned normalized to unit length

In [7]:
def e5_embed(texts, is_query=False):
    prefix = "query: " if is_query else "passage: "
    formatted_texts = [prefix + text.lower().strip() for text in texts]
    return embed_model.encode(formatted_texts, normalize_embeddings=True)
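
The two steps around the model call, prefixing and normalization, can be checked without downloading the model. The sketch below uses only the standard library; `add_e5_prefix` and `l2_normalize` are hypothetical helper names, and the small vector stands in for a real 1024-dimensional embedding.

```python
import math
from typing import List


def add_e5_prefix(texts: List[str], is_query: bool = False) -> List[str]:
    """Apply the same formatting as e5_embed, without calling the model."""
    prefix = "query: " if is_query else "passage: "
    return [prefix + text.lower().strip() for text in texts]


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (what normalize_embeddings=True asks for)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


formatted = add_e5_prefix(["  How do I cook Rice? "], is_query=True)
unit = l2_normalize([3.0, 4.0])  # toy 2-d vector instead of a real embedding

print(formatted)
print(unit, dot(unit, unit))
```

Because the vectors have unit length, a plain dot product equals cosine similarity, which matches the `cosine` metric used for the index.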

Create E5 embeddings for the documents
* The E5 model is a special kind of language model that can be used to generate embeddings for text documents.
* These embeddings are a way of representing the text documents as vectors in a high-dimensional space.
* The embeddings are then used to search for the most similar documents to the user's query.

In [8]:
texts = [doc.page_content for doc in doc_splits]
metadatas = [doc.metadata for doc in doc_splits]
embeddings = e5_embed(texts, is_query=False)
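
Conceptually, the similarity search mentioned above ranks stored vectors by their similarity to the query vector. A minimal exhaustive version is sketched below, assuming unit-length vectors so that the dot product is the cosine similarity. Pinecone uses an index rather than a full scan, and `top_k` here is a hypothetical helper, not a Pinecone call.

```python
from typing import List, Tuple


def top_k(query_vec: List[float], doc_vecs: List[List[float]], k: int = 3) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k highest dot products."""
    scores = [
        (i, sum(q * d for q, d in zip(query_vec, vec)))
        for i, vec in enumerate(doc_vecs)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy unit-length vectors standing in for chunk embeddings
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
query_vec = [0.8, 0.6]

ranked = top_k(query_vec, doc_vecs, k=2)
print(ranked)
```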

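    Upsert data with metadata into a Pinecone index.

* Upsert inserts a record, or overwrites it if a record with the same ID already exists, so re-running this cell does not create duplicates.
* The vectors are the E5 embeddings computed above. The index does not create embeddings for us; the chunk text is stored in the metadata so it can be returned at query time.
* The index must already exist. On a fresh project, run the "Optional" cell further down first.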

In [9]:
vectors = [{
    "id": f"doc_{i}",
    "values": emb.tolist(),
    "metadata": {**metadatas[i], "text": texts[i]}
} for i, emb in enumerate(embeddings)]
pc.Index(index_name).upsert(vectors=vectors)

{'upserted_count': 19}

In [10]:
# # Upsert to Pinecone (optimized batch format)
# batch_size = 100
# for i in range(0, len(texts), batch_size):
#     batch_vectors = [{
#         "id": f"doc_{i+j}",
#         "values": embeddings[i+j].tolist(),
#         "metadata": {**metadatas[i+j], "text": texts[i+j]}
#     } for j in range(min(batch_size, len(texts)-i))]
#     index.upsert(vectors=batch_vectors)
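
The batching idea in the commented cell can be separated from the Pinecone call, which makes it easy to test offline. This is a sketch with hypothetical helpers (`build_records`, `batched`) and made-up sample data; in the notebook each batch would be passed to `index.upsert(vectors=batch)`.

```python
from typing import Dict, Iterator, List, Sequence


def build_records(texts: Sequence[str], metadatas: Sequence[Dict], embeddings: Sequence[Sequence[float]]) -> List[Dict]:
    """Build records in the same shape the notebook upserts."""
    return [
        {
            "id": f"doc_{i}",
            "values": list(emb),
            "metadata": {**metadatas[i], "text": texts[i]},
        }
        for i, emb in enumerate(embeddings)
    ]


def batched(records: List[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Made-up sample data standing in for doc_splits and their embeddings
texts = [f"chunk {i}" for i in range(5)]
metadatas = [{"page": i} for i in range(5)]
embeddings = [[0.1 * i, 0.2] for i in range(5)]

records = build_records(texts, metadatas, embeddings)
batches = list(batched(records, batch_size=2))
print([len(batch) for batch in batches])
```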

     Optional: create the index if it does not exist yet

In [11]:
# Create index if needed
dimension = 1024  # output size of intfloat/multilingual-e5-large
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=dimension,
        metric="cosine",
        spec=pinecone.ServerlessSpec(cloud="aws", region="us-east-1")
    )

index = pc.Index(index_name)
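
An index only accepts vectors whose length matches its dimension, so a quick pre-flight check before upserting can save a failed request. The helper below is a hypothetical sketch that works on the same record dictionaries the notebook builds; the sample records are invented.

```python
from typing import Dict, List


def check_dimensions(vectors: List[Dict], expected_dim: int) -> List[str]:
    """Return the IDs of records whose vector length differs from expected_dim."""
    return [v["id"] for v in vectors if len(v["values"]) != expected_dim]


# Invented sample: the second record has the wrong length on purpose
sample = [
    {"id": "doc_0", "values": [0.0] * 1024},
    {"id": "doc_1", "values": [0.0] * 768},
]

bad_ids = check_dimensions(sample, expected_dim=1024)
print(bad_ids)
```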

     Load model

In [12]:
llm = ChatOllama(model="mistral-small3.1")

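* The commented-out cell below sets up a memory buffer to store the conversation history (LangChain memory method). It is kept for reference; the notebook uses vector memory instead.
* The buffer keeps track of the conversation and lets the model access previous messages.
* The intent is to store at most 2000 tokens and trim the oldest messages once this limit is exceeded. Note that token-based trimming is what `ConversationTokenBufferMemory` provides; plain `ConversationBufferMemory` keeps the full history.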

In [13]:
# memory = ConversationBufferMemory(
#     memory_key="chat_history",
#     return_messages=True,
#     max_token_limit=2000,  # Trims oldest messages if exceeded (NEW)
#     input_key = "question",
#     output_key = "answer"
# )
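
The trimming behaviour described above can be illustrated with a few lines of plain Python. This is a toy sketch, not LangChain's implementation: `TrimmedBuffer` is a hypothetical class, and it counts whitespace-separated words as a crude stand-in for real tokens.

```python
from collections import deque


class TrimmedBuffer:
    """Keep recent messages, dropping the oldest once a token budget is exceeded."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.messages = deque()

    @staticmethod
    def count_tokens(text: str) -> int:
        return len(text.split())  # crude stand-in for a real tokenizer

    def total(self) -> int:
        return sum(self.count_tokens(text) for _, text in self.messages)

    def add(self, role: str, text: str) -> None:
        self.messages.append((role, text))
        # Drop from the front until the budget is met; always keep the newest message
        while self.total() > self.max_tokens and len(self.messages) > 1:
            self.messages.popleft()


buf = TrimmedBuffer(max_tokens=6)
buf.add("human", "how do I boil rice")
buf.add("ai", "simmer for ten minutes")
buf.add("human", "thanks")
print(list(buf.messages))
```

The first question is dropped as soon as the answer pushes the buffer over its budget, which is the main weakness of a buffer compared with the vector memory used later: old but relevant turns are lost.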

     Prompt

Using few-shot prompting for now (this first version is kept commented out for reference)

In [14]:
# from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate

# system_prompt = """You are a friendly and practical cooking assistant trained on a collection of student-friendly recipes.
# Answer user questions accurately and concisely using the knowledge from this recipe book.
# If a question is outside the book’s scope, politely respond that it's not covered in the source material.

# Use these rules:
# 1. Answer ONLY from the provided context
# 2. For memory questions, use the exact chat history below
# 3. If unsure, say "I don't know"

# Here are some example questions and answers:

# Q: I don’t like beef. Are there vegetarian options in the book?
# A: Yes, the recipe collection includes vegetarian rice and several egg-based dishes like omelette and egg fried rice.

# Q: Can I make Thai Green Curry easily?
# A: Yes. Thai green curry is made by cooking curry paste with chicken, onion, and aubergine, then adding coconut milk and simmering until cooked. It’s a simple and delicious recipe ideal for students.

# Context: {context}
# Chat History: {chat_history}"""

# prompt = ChatPromptTemplate.from_messages([
#     SystemMessagePromptTemplate.from_template(system_prompt),
#     ("human", "{question}"),
# ])

     Prompt 2: with proper instructions

In [15]:
from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate

system_prompt = """You are Recipe Bot, an expert cooking assistant specializing in student-friendly recipes.
Follow these guidelines strictly:

1. Source Knowledge:
- Answer ONLY using the recipe book context
- Never invent recipes or ingredients
- For measurements, be precise (e.g., "200g mushrooms")

2. Conversation Flow:
- Maintain natural, friendly tone
- Reference previous answers when appropriate
- Acknowledge preferences from chat history
- If context is missing, say: "This isn't covered in my recipe book"

3. Special Cases:
- For substitution questions, suggest closest alternatives
- For timing questions, specify preparation vs cooking time

Examples:
Q: Can I substitute X with Y?
A: "Yes, Y works well. Use 25% less as it's more potent."

Q: How long does this take?
A: "Preparation: 15 mins, Cooking: 30 mins (total 45 mins)"

Q: I don’t like beef. Are there vegetarian options in the book?
A: Yes, the recipe collection includes vegetarian rice and several egg-based dishes like omelette and egg fried rice.

Q: Can I make Thai Green Curry easily?
A: Yes. Thai green curry is made by cooking curry paste with chicken, onion, and aubergine, then adding coconut milk and simmering until cooked. It’s a simple and delicious recipe ideal for students.

Current Context: {context}
Chat History: {chat_history}"""

prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(system_prompt),
    ("human", "{question}"),
])
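
To see what the model eventually receives, the placeholder substitution can be imitated with `str.format`. This is a rough sketch: the template is shortened to its last two lines, `render` is a hypothetical helper, and joining retrieved chunks with blank lines is an assumption about how a "stuff"-style chain fills `{context}`.

```python
from typing import List, Tuple

# Shortened stand-in for the end of system_prompt
template = (
    "Current Context: {context}\n"
    "Chat History: {chat_history}"
)


def render(context_chunks: List[str], history: List[Tuple[str, str]], question: str) -> List[Tuple[str, str]]:
    """Fill the placeholders and return (role, text) pairs."""
    context = "\n\n".join(context_chunks)  # assumed separator between chunks
    chat_history = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)
    system = template.format(context=context, chat_history=chat_history)
    return [("system", system), ("human", question)]


messages = render(
    context_chunks=["Egg fried rice: fry cooked rice with egg."],
    history=[("I don't eat beef.", "Noted, I will avoid beef recipes.")],
    question="What can I cook tonight?",
)
for role, text in messages:
    print(f"[{role}]\n{text}\n")
```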

      Create retriever and chain

Resource: https://medium.com/@3rdSon/how-to-build-rag-applications-with-pinecone-serverless-openai-langchain-and-python-d4eb263424f1#ca4f

In [16]:
from langchain.schema import Document
from langchain.schema.retriever import BaseRetriever
from langchain.embeddings.base import Embeddings
from typing import List
import pinecone

class E5Retriever(BaseRetriever):
    """Custom retriever that properly handles Pinecone index"""

    def __init__(self, index: pinecone.Index):
        # Bypass Pydantic validation by setting attribute after initialization
        super().__init__()
        object.__setattr__(self, "_index", index)

    def get_relevant_documents(self, query: str) -> List[Document]:
        query_embedding = e5_embed([query], is_query=True).tolist()[0]
        results = self._index.query(
            vector=query_embedding,
            top_k=3,
            include_metadata=True
        )
        return [
            Document(
                page_content=match.metadata["text"],
                metadata=match.metadata
            ) for match in results.matches
        ]

    async def aget_relevant_documents(self, query: str) -> List[Document]:
        return self.get_relevant_documents(query)

# Initialize with your existing index
e5_retriever = E5Retriever(index=index)  # Your actual Pinecone index


# 1. First create an Embeddings wrapper class for your E5 function
class E5EmbeddingsWrapper(Embeddings):
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """For documents/passages"""
        return e5_embed(texts, is_query=False).tolist()

    def embed_query(self, text: str) -> List[float]:
        """For queries"""
        return e5_embed([text], is_query=True).tolist()[0]

# 2. Initialize the embeddings wrapper
e5_embeddings = E5EmbeddingsWrapper()


In [17]:
# 3. Memory-specific vector store (separate from the main index)
memory_index = pc.Index("conversation-memory")  # connects to a separate index; it must already exist with dimension 1024
memory_vectorstore = Pinecone(
    index=memory_index,
    embedding=e5_embeddings,
    text_key="text"
)


    Vector memory
* better for longer conversations
* stores the meaning and context needed to continue the conversation
* takes one extra parameter, a retriever, which fetches the most relevant past exchanges

In [18]:
memory = VectorStoreRetrieverMemory(
    retriever=memory_vectorstore.as_retriever(search_kwargs={"k": 3}),
    memory_key="chat_history",
    input_key="question",
    output_key="answer",
    return_messages=True,
    return_docs=True
)
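
The difference from a buffer is that vector memory recalls the past exchanges most relevant to the current question, not simply the most recent ones. The toy sketch below shows the idea with word overlap as a stand-in for embedding similarity; `ToyVectorMemory` and `jaccard` are hypothetical names and this is not how `VectorStoreRetrieverMemory` is implemented.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Word-overlap score, used here instead of embedding similarity."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


class ToyVectorMemory:
    def __init__(self, k: int = 3):
        self.k = k
        self.exchanges: List[str] = []

    def save(self, question: str, answer: str) -> None:
        self.exchanges.append(f"question: {question}\nanswer: {answer}")

    def load(self, query: str) -> List[str]:
        """Return the k stored exchanges most similar to the query."""
        ranked = sorted(self.exchanges, key=lambda e: jaccard(query, e), reverse=True)
        return ranked[: self.k]


memory_demo = ToyVectorMemory(k=1)
memory_demo.save("I am vegetarian", "Noted, I will skip meat recipes")
memory_demo.save("How long to boil pasta", "About ten minutes")

recalled = memory_demo.load("vegetarian dinner ideas")
print(recalled)
```

Even though the pasta exchange is more recent, the dietary preference is what gets recalled, because it shares vocabulary with the new question.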

In [19]:
# Create the conversational chain. ConversationalRetrievalChain replaces RetrievalQA because it supports conversational memory.
qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=e5_retriever,
    memory=memory,
    chain_type="stuff",
    combine_docs_chain_kwargs={"prompt": prompt},    # same prompt from above
    # verbose=True,  # debugging here
    rephrase_question=True,  # Helps with follow-up questions
    get_chat_history=lambda h: h
)

In [20]:
from langchain.schema import HumanMessage, AIMessage

class RAGApplication:
    def __init__(self, qa_chain):
        self.qa_chain = qa_chain
        self.chat_history = []  # This will store (question, answer) tuples

    def run(self, question: str) -> str:
        # Convert your chat history to LangChain's expected format
        lc_history = []
        for q, a in self.chat_history:
            lc_history.append(HumanMessage(content=q))
            lc_history.append(AIMessage(content=a))

        # Call your chain with properly formatted history
        result = self.qa_chain({
            "question": question,
            "chat_history": lc_history  # Now in correct format
        })

        # Store the new interaction
        self.chat_history.append((question, result["answer"]))
        return result["answer"]
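
The wrapper's bookkeeping can be exercised offline by swapping the chain for a stand-in. In this sketch `EchoChain`, `ChatSession` and `to_messages` are hypothetical, and plain dictionaries take the place of `HumanMessage` and `AIMessage`. A `reset` method is added as a suggestion, since clearing the chat window in a UI does not by itself clear history kept on the Python side.

```python
from typing import Dict, List, Tuple


def to_messages(history: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Flatten (question, answer) pairs into alternating role-tagged messages."""
    messages = []
    for question, answer in history:
        messages.append({"role": "human", "content": question})
        messages.append({"role": "ai", "content": answer})
    return messages


class EchoChain:
    """Stand-in for the real chain: reports the turn number and echoes the question."""

    def __call__(self, inputs: Dict) -> Dict[str, str]:
        turns = len(inputs["chat_history"]) // 2
        return {"answer": f"(turn {turns + 1}) you asked: {inputs['question']}"}


class ChatSession:
    def __init__(self, chain):
        self.chain = chain
        self.history: List[Tuple[str, str]] = []

    def run(self, question: str) -> str:
        result = self.chain({"question": question, "chat_history": to_messages(self.history)})
        self.history.append((question, result["answer"]))
        return result["answer"]

    def reset(self) -> None:
        self.history.clear()


session = ChatSession(EchoChain())
first = session.run("What can I cook with eggs?")
second = session.run("And with rice?")
print(first)
print(second)
```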

In [21]:
# Initialize your RAG application (use your existing initialization)
rag_app = RAGApplication(qa_chain)

    Simple Gradio Chat template for UI

In [22]:
def chat(message: str, history: List[Tuple[str, str]]) -> Tuple[str, List[Tuple[str, str]]]:
    """Handle chat messages"""
    response = rag_app.run(message)
    history.append((message, response))
    return "", history

with gr.Blocks(title="Recipe Bot") as demo:
    gr.Markdown("# 🍳 Recipe Bot")
    gr.Markdown("Ask me anything about recipes from docs!")

    chatbot = gr.Chatbot(height=500)
    msg = gr.Textbox(label="Your question", placeholder="Type your question here...")
    clear = gr.Button("Clear Chat")

    msg.submit(
        chat,
        inputs=[msg, chatbot],
        outputs=[msg, chatbot]
    )
    clear.click(lambda: None, None, chatbot, queue=False)

demo.launch(share=True)

Running on local URL:  http://127.0.0.1:7860


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Running on public URL: https://62e4528c03f9bd8f31.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


