<a href="https://colab.research.google.com/github/dodeeric/langchain-ai-assistant-with-hybrid-rag/blob/main/BMAE_AI_Assistant_with_hybrid_RAG_v13.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI Assistant (LLM Chatbot) with Hybrid RAG -- With chat history
v1: Hybrid RAG: keyword search (BM25) and semantic search (vector DB)

v2: With memory: 1) Reformulate the question for the RAG query (contextualize_q_prompt); 2) Add the previous questions and answers to the prompt sent to the LLM.

v3: With PDF indexing
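
The hybrid search in v1 has to merge two ranked result lists into one. A common way to do this is weighted Reciprocal Rank Fusion (RRF), which, as far as I know, is also what LangChain's `EnsembleRetriever` applies. The sketch below is a pure-Python illustration only, not the library's code: the function name `weighted_rrf`, the document ids and the constant `c=60` are assumptions made for the example.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document ids into one ranking."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # A document scores higher when it appears near the top of a list
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical BM25 results
vector_hits = ["doc_b", "doc_d", "doc_a"]   # hypothetical vector results

fused = weighted_rrf([keyword_hits, vector_hits], weights=[0.5, 0.5])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

With weights `[1, 0]` the same function returns all keyword hits first, followed by the vector-only hits, because the second list contributes a score of zero.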

In [None]:
!pip install --upgrade --quiet jq bs4 langchain langchain-community langchain-openai langchain-chroma langchainhub rank_bm25 pypdf

import requests, json, jq, time, bs4
from bs4 import BeautifulSoup
from google.colab import userdata
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader, JSONLoader, PyPDFLoader
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

OPENAI_API_KEY = userdata.get("OPENAI_API_KEY") # To access OpenAI LLM and embedding model via API
LANGCHAIN_API_KEY = userdata.get("LANGCHAIN_API_KEY") # To trace LangChain runs on LangSmith

%env OPENAI_API_KEY = $OPENAI_API_KEY
%env LANGCHAIN_API_KEY = $LANGCHAIN_API_KEY
%env LANGCHAIN_TRACING_V2 = "true"

# import dotenv
# dotenv.load_dotenv()

## Scrape

In [None]:
# Function to scrape the text and the metadata of a web page

def scrape_web_page(url):
    """
    Name: swp
    Scrape the text and the metadata of a web page
    Input: URL of the page
    Output: dictionary with: url: url, metadata: metadata, text: text
    """

    # CSS class(es) of the HTML element holding the text (named class_filter to avoid shadowing the built-in filter())
    class_filter = "two-third last" # balat / irpa
    #class_filter = "notice_corps media" #  belgica / kbr
    #class_filter = "hproduct commons-file-information-table" # commons / wikimedia: summary or description section

    # Get the page content
    loader = WebBaseLoader(
        web_paths=(url,),
        bs_kwargs=dict(
            parse_only=bs4.SoupStrainer(
                class_=class_filter
            )
        ),
    )
    text = loader.load()
    # Convert Document type into string type
    text = text[0].page_content

    # Get the metadata (Open Graph tags introduced by Facebook, og:xxx)
    # Get the HTML code (the timeout prevents the scrape from hanging on an unresponsive server)
    response = requests.get(url, timeout=30)
    # Parse the HTML from the Response object into a BeautifulSoup object so it can be searched
    soup = BeautifulSoup(response.text, "html.parser")
    # Get the metadata fields
    metadata = {} # Empty dictionary
    # Find all the meta tags in the HTML
    meta_tags = soup.find_all("meta")
    # Loop through the meta tags
    for tag in meta_tags:
        prop = tag.get("property") # Named prop to avoid shadowing the built-in property()
        content = tag.get("content")
        # Add the property-content pair to the dictionary
        if prop and content:
            metadata[prop] = content

    # Build a dictionary with: url: url, metadata: metadata, text: summary text
    # (it is serialized to JSON later, when the list of pages is saved)
    page = {
        "url": url, # String
        "metadata": metadata, # Dictionary
        "text": text # String
    }

    return page # Dictionary
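
To see what the metadata loop collects without touching the network, the sketch below runs the same "keep every meta tag that has both a property and a content attribute" rule on an inline HTML string. It deliberately swaps BeautifulSoup for the standard library's `html.parser` so that it runs with no extra packages. The class name `OpenGraphParser` and the sample HTML are made up for the illustration.

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collect <meta property="..." content="..."> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property")
        content = attributes.get("content")
        # Same rule as in scrape_web_page: keep only complete pairs
        if prop and content:
            self.metadata[prop] = content

# Hypothetical page head, not taken from any real site
sample_html = """
<html><head>
<meta charset="utf-8">
<meta property="og:title" content="Sample artwork">
<meta property="og:image" content="https://example.org/image.jpg">
<meta name="description" content="skipped: no property attribute">
</head><body></body></html>
"""

parser = OpenGraphParser()
parser.feed(sample_html)
og_metadata = parser.metadata
print(og_metadata)
```

Only the two `og:` tags end up in the dictionary. The `charset` and `name="description"` tags are skipped because they have no `property` attribute.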

In [None]:
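# Scrape the URLs and save the results in a JSON file
# Uncomment exactly one file_path below before running this cell (otherwise file_path is undefined)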

#ds1:

#file_path = "/content/drive/MyDrive/colab/balat-urls-ds2" # Apify Website Content Crawler (wcc) used instead
#file_path = "/content/drive/MyDrive/colab/belgica-urls-ds2" # Apify Website Content Crawler (wcc) used instead
#file_path = "/content/drive/MyDrive/colab/commons-urls-ds2"

#ds2:
#file_path = "/content/drive/MyDrive/colab/balat-urls-ds2"
#file_path = "/content/drive/MyDrive/colab/commons-urls-ds2"

#ds-test:
#file_path = "/content/drive/MyDrive/colab/commons-urls-test"
#file_path = "/content/drive/MyDrive/colab/balat-urls-test"
#file_path = "/content/drive/MyDrive/colab/belgica-urls-test"

with open(f"{file_path}.txt", "r") as urls_file:
    items = []
    for line in urls_file:
        url = line.strip()
        url = url.replace("\ufeff", "")  # Remove BOM
        item = scrape_web_page(url)
        print(item)
        items.append(item)
        #time.sleep(1)

# Save the Python list in a JSON file
# json.dump expects Python objects, not an already serialized JSON string. See docs.python.org/3/library/json.html.
# The "with" block closes the file automatically, so no explicit close() is needed.
with open(f"{file_path}-swp.json", "w") as json_file:
    json.dump(items, json_file) # Escapes accented characters (e.g. é) as \uXXXX sequences (e.g. \u00e9)
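
The escaping mentioned in the comment above comes from the default `ensure_ascii=True` of the `json` module. A minimal sketch, using a made-up one-item list, shows both forms and confirms that they load back to the same data:

```python
import json

items = [{"text": "école"}]  # hypothetical scraped item

escaped = json.dumps(items)                       # default: ensure_ascii=True
readable = json.dumps(items, ensure_ascii=False)  # keeps accented characters

print(escaped)   # [{"text": "\u00e9cole"}]
print(readable)  # [{"text": "école"}]

# Both strings decode to the same Python objects
assert json.loads(escaped) == json.loads(readable) == items
```

If the file is written with `ensure_ascii=False`, it should also be opened with an explicit `encoding="utf-8"`.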

In [None]:
# Open the JSON file to check its content (will raise an error if it's not a correctly formatted JSON file)
with open(f"{file_path}-swp.json", "r") as input_file:
    items_read = json.load(input_file)

## Index

In [None]:
# Open the JSON files and load each JSON item one by one in the "documents" variable (type: Document)

file_path1 = "/content/drive/MyDrive/colab/commons-urls-ds1-swp.json"
file_path2 = "/content/drive/MyDrive/colab/balat-ds1c-wcc-cheerio-ex_2024-04-06_09-05-15-262.json"
file_path3 = "/content/drive/MyDrive/colab/belgica-ds1c-wcc-cheerio-ex_2024-04-06_08-30-26-786.json"
file_path4 = "/content/drive/MyDrive/colab/commons-urls-ds2-swp.json"
file_path5 = "/content/drive/MyDrive/colab/balat-urls-ds2-swp.json"
file_paths = [file_path1, file_path2, file_path3, file_path4, file_path5]

documents = []
for file_path in file_paths:
    loader = JSONLoader(file_path=file_path, jq_schema=".[]", text_content=False)
    docs = loader.load() # Chunks (JSON items) from the JSON files; list of Documents
    documents = documents + docs # This variable contains all the JSON items
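
The `jq_schema=".[]"` expression selects every top-level item of the JSON array, so each scraped page becomes one chunk. The sketch below mimics that split with the standard library only. It assumes that, with `text_content=False`, the loader stores each item as a serialized JSON string; I have not checked this against the loader's source. The helper name `split_json_array` and the sample items are made up.

```python
import json

def split_json_array(json_text):
    """Return one serialized JSON string per top-level array item."""
    items = json.loads(json_text)
    return [json.dumps(item) for item in items]

# Hypothetical file content with the same keys as the scraped pages
sample = json.dumps([
    {"url": "https://example.org/1", "metadata": {"og:title": "First"}, "text": "one"},
    {"url": "https://example.org/2", "metadata": {"og:title": "Second"}, "text": "two"},
])

chunks = split_json_array(sample)
print(len(chunks))                    # 2
print(json.loads(chunks[0])["url"])   # https://example.org/1
```

Because each chunk keeps the `url` and `metadata` fields next to the text, the LLM can later quote the page link and the `og:image` field from the retrieved context.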

In [None]:
# Open the PDF files and load each page one by one in the "documents" variable (type: Document)

file_path1 = "/content/drive/MyDrive/colab/BPEB31_DOS4_42-55_FR_LR.pdf"
file_path2 = "/content/drive/MyDrive/colab/MD-vol1-2-3.pdf"
file_paths = [file_path1, file_path2]

for file_path in file_paths:
    loader = PyPDFLoader(file_path)
    pages = loader.load_and_split() # Roughly 1 PDF page per chunk; pages longer than the default text splitter's chunk size are split further
    documents = documents + pages

In [None]:
documents[2100].metadata

Run step 1 or step 2:

In [None]:
# STEP 1: Instantiate a Chroma DB and load the data from disk.
collection_name = "bmae-json"
embedding_model = OpenAIEmbeddings(model="text-embedding-3-large") # 3072-dimensional vectors used to embed the chunks and the questions
vector_db = Chroma(embedding_function=embedding_model, collection_name=collection_name, persist_directory="/content/drive/MyDrive/colab/chromadb2")

In [None]:
# STEP 2: ONLY TO EMBED! Instantiate a Chroma DB, embed the JSON items (documents), then save to disk.
collection_name = "bmae-json"
embedding_model = OpenAIEmbeddings(model="text-embedding-3-large") # 3072-dimensional vectors used to embed the chunks and the questions
vector_db = Chroma.from_documents(documents, embedding_model, collection_name=collection_name, persist_directory="/content/drive/MyDrive/colab/chromadb2")
# To check the Chroma vector db (sqlite3):
# $ sqlite3 chroma.sqlite3
# sqlite> .tables ===> List of the tables
# sqlite> select * from collections; ===> Name of the collection & size of the vectors
# sqlite> select * from embeddings; ===> Number of records in the db
# sqlite> select * from embedding_metadata; ===> Display json items

## Retrieve and generate

In [None]:
# LLM chatbot with a hybrid RAG chain:
# (To embed the question, the same model is used as for the data; the model is given in "vector_db".)

llm = ChatOpenAI(model="gpt-4-turbo-2024-04-09", temperature=0)

# Semantic search (vector retriever)
vector_retriever = vector_db.as_retriever(search_type="similarity", search_kwargs={"k": 3}) # Chroma DB

# Keyword search (bm25 retriever)
keyword_retriever = BM25Retriever.from_documents(documents)
keyword_retriever.k = 3

# Ensemble retriever (mix of both retrievers). The weights control the order of the fused results: [1, 0] means all BM25 results first, then all vector results.
ensemble_retriever = EnsembleRetriever(retrievers=[keyword_retriever, vector_retriever], weights=[0.5, 0.5])

"""
# Without memory:

# Download prompt template: system prompt + inputs (rag_output + chat_history + question)
prompt = hub.pull("dodeeric/rag-prompt-bmae-with-history")

# Take the text content of each doc, and concatenate them in one string to pass to the prompt (context)
def format_docs_clear_text(docs):
    return "\n\n".join(doc.page_content.encode('utf-8').decode('unicode_escape') for doc in docs)

# Function to display the text content of the prompt in ai_assistant_chain
def print_and_pass(data):
    print(f"Prompt content sent to the LLM: {data}")
    return data

# LangChain chain: the LLM chatbot with hybrid RAG. Type: RunnableSequence (chain). Open question: how/where is the question passed to the RAG? In LangSmith, we can see the input (question) of the 3 retrievers
ai_assistant_chain = ({"rag_output": ensemble_retriever | format_docs_clear_text, "chat_history": RunnablePassthrough(), "question": RunnablePassthrough()}
    | prompt
    #| print_and_pass
    | llm
    | StrOutputParser() # Convert to string
)
"""

from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.messages import HumanMessage

chat_history = []

contextualize_q_system_prompt = """
Given a chat history and the latest user question \
which might reference context in the chat history, formulate a standalone question \
which can be understood without the chat history. Do NOT answer the question, \
just reformulate it if needed and otherwise return it as is.
"""

contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

history_aware_retriever = create_history_aware_retriever(
    llm, ensemble_retriever, contextualize_q_prompt
)

qa_system_prompt = """
You are an artwork specialist. You must assist the users in finding, describing, and displaying artworks related to the Belgian monarchy. \
You first have to search answers in the "Knowledge Base". If no answers are found in the "Knowledge Base", then answer with your own knowledge. \
You have to answer in the same language as the question.
At the end of the answer:
- give a link to a web page about the artwork (see the "url" field).
- display an image of the artwork (see the "og:image" field).

Knowledge Base:

{context}
"""

qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", qa_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)

ai_assistant_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)
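
The history-aware retriever only needs the LLM when there is history to resolve. According to the LangChain documentation, an empty `chat_history` means the question goes straight to the retriever; otherwise the contextualize prompt and the LLM produce a standalone query first. The sketch below reproduces that control flow with plain functions. The names `history_aware_search`, `fake_reformulate` and `fake_retrieve` are hypothetical stand-ins, not LangChain APIs.

```python
def history_aware_search(question, chat_history, reformulate, retrieve):
    """Reformulate the question only when there is chat history."""
    if not chat_history:
        query = question  # First turn: nothing to resolve
    else:
        query = reformulate(chat_history, question)
    return retrieve(query)

# Hypothetical stand-in for the LLM reformulation step
def fake_reformulate(chat_history, question):
    return f"{question} (about: {chat_history[0]})"

# Hypothetical stand-in for the ensemble retriever
def fake_retrieve(query):
    return [f"result for: {query}"]

first = history_aware_search("Who painted it?", [], fake_reformulate, fake_retrieve)
follow_up = history_aware_search(
    "Who painted it?", ["La revue des écoles"], fake_reformulate, fake_retrieve
)
print(first)      # ['result for: Who painted it?']
print(follow_up)  # ['result for: Who painted it? (about: La revue des écoles)']
```

This is why a follow-up such as "Qui a peint ce tableau ?" can still retrieve the right artwork: the search query sent to both retrievers is the reformulated question, not the raw one.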

Query the AI Assistant:

In [None]:
question = "Pouvez-vous me montrer le tableau 'La revue des écoles' ?"

In [None]:
question = "Qui a peint ce tableau ?"

In [None]:
question = "Quelle est la dimension du tableau ?"

In [None]:
question = "Pouvez-vous me montrer un tableau de Charles Porion ?"

In [None]:
question = "Quel est la date de naissance du peintre ?"

In [None]:
question = "Quel est la date de naissance de Guy de Greef ?"

In [None]:
question = "Que possède Gertrude Baelde ?"

In [None]:
question = "Camille Van Camp a-t-il fait des croquis pour sa peinture 'La fête patriotique ' ?"

In [None]:
#answer = ai_assistant_chain.invoke(question) # Without memory

In [None]:
output = ai_assistant_chain.invoke({"input": question, "chat_history": chat_history}) # output is a dictionary. output["answer"] is in markdown format.

In [None]:
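from langchain_core.messages import AIMessage
# Store the answer as an AIMessage so the model can tell its own turns from the user's
chat_history.extend([HumanMessage(content=question), AIMessage(content=output["answer"])])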

In [None]:
print(output["answer"])

Le tableau "La revue des écoles en 1878" de Jan Verhas représente un événement marquant où environ 23.000 élèves des écoles bruxelloises ont défilé devant le roi Léopold II et la reine Marie-Henriette à l'occasion de leurs noces d'argent. Cette œuvre, achevée en 1880, a été exposée lors de l'Exposition historique de l'art belge et a connu un grand succès public. Le tableau montre la place des Palais à Bruxelles et inclut des portraits de figures notables de l'époque.

Pour plus d'informations sur l'œuvre, vous pouvez visiter le lien suivant : [BALaT KIK-IRPA](https://balat.kikirpa.be/object/130731)

Voici une image du tableau :
![La revue des écoles en 1878](http://balat.kikirpa.be/image/thumbnail/B213530.jpg)


In [None]:
print(chat_history)

In [None]:
# Query the vector RAG only
docs = vector_db.similarity_search(question, k=2) # List of Documents; page_content of a Document: string
print(docs)