### Introduction

In this project, we implement a **Retrieval-Augmented Generation (RAG) agent** powered by a Large Language Model (LLM). The agent is designed to handle user queries by combining information retrieval from a **ChromaDB vector database** and performing **real-time web searches** when the data isn't available in the local knowledge base.

Specifically, the agent handles:
- Queries related to **INSAT (Institut National des Sciences Appliquées et de Technologie)** and other **computer science universities in Tunisia** by fetching data from a pre-built vector database (ChromaDB).
- Queries that are out of scope for the vector database by performing a **live web search**.

To improve accuracy and reduce hallucinations (incorrect or fabricated information generated by the LLM), the graph also includes a grading step that checks each generated answer against the documents it was built from.

The routing between the vector database and web search is managed by **LangGraph**, which defines the decision logic based on the nature of the query. The hallucination check is implemented in the same graph: after each generation, a grader tests whether the response is grounded in the retrieved documents, and the answer is regenerated when it is not.
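
The decision logic described above can be outlined in plain Python before any LangGraph code is involved. The sketch below is only an illustration: `answer_query` and all of its callable parameters are hypothetical stand-ins (not functions from this notebook), the `max_attempts` cap is an assumption, and the separate answer-usefulness check performed later in the notebook is left out for brevity.

```python
from typing import Callable, Dict, List


def answer_query(
    question: str,
    route: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    is_relevant: Callable[[str, str], bool],
    search_web: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    is_grounded: Callable[[str, List[str]], bool],
    max_attempts: int = 3,  # assumption: cap on regeneration attempts
) -> Dict[str, object]:
    """Plain-Python outline of the decision flow; every callable is a stand-in."""
    trace: List[str] = []
    if route(question) == "vectorstore":
        trace.append("retrieve")
        retrieved = retrieve(question)
        documents = [d for d in retrieved if is_relevant(question, d)]
        # Add web results if any retrieved chunk was judged irrelevant.
        if len(documents) < len(retrieved):
            trace.append("websearch")
            documents += search_web(question)
    else:
        trace.append("websearch")
        documents = search_web(question)

    answer = ""
    for _ in range(max_attempts):
        trace.append("generate")
        answer = generate(question, documents)
        # Stop as soon as the answer is judged to be supported by the documents.
        if is_grounded(answer, documents):
            break
    return {"answer": answer, "documents": documents, "trace": trace}


# Toy stand-ins, used only to exercise the control flow.
result = answer_query(
    "What are the branches at INSAT?",
    route=lambda q: "vectorstore" if "INSAT" in q else "web_search",
    retrieve=lambda q: ["INSAT offers GL and RT.", "Unrelated text."],
    is_relevant=lambda q, d: "INSAT" in d,
    search_web=lambda q: ["Web snippet about INSAT."],
    generate=lambda q, docs: " ".join(docs),
    is_grounded=lambda a, docs: all(d in a for d in docs),
)
print(result["trace"])
```

With these toy stand-ins, one retrieved chunk is filtered out, so the trace contains a retrieval step, a web search, and a single generation.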

This solution combines several tools to build a question-answering system that balances retrieval from a **local, curated** source (ChromaDB) with a **live, open-ended** one (web search).


### Installing Dependencies

In [None]:
%pip install -U langchain-nomic bitsandbytes langchain_ollama langchain_community tiktoken langchainhub chromadb langchain langgraph tavily-python nomic[local] langchain-text-splitters




### Chroma DB Setup

In [None]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_nomic.embeddings import NomicEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

urls = [
    "https://insat.rnu.tn/formations/les-filieres",
    "https://insat.rnu.tn/",
    "https://insat.rnu.tn/formations/cursus-de-formation",
    "https://insat.rnu.tn/formations/plan-d'etudes",
]

docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [item for sublist in docs for item in sublist]

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=250, chunk_overlap=0
)
doc_splits = text_splitter.split_documents(docs_list)

# Add to vectorDB
vectorstore = Chroma.from_documents(
    documents=doc_splits,
    collection_name="rag-chroma",
    embedding=NomicEmbeddings(model="nomic-embed-text-v1.5", inference_mode="local"),
)
retriever = vectorstore.as_retriever()

In [None]:
retriever.invoke('GL')

[Document(metadata={'description': "Site officiel de l'Institut Nationale des Sciences Appliquées et de Technologie", 'language': 'No language found.', 'source': 'https://insat.rnu.tn/formations/les-filieres', 'title': 'INSAT | LES FILIÈRES'}, page_content="INSAT | LES FILIÈRES×LES FILIÈRES\xa0LES FILIÈRES À PARTIR DU TRONC COMMUN \xa0(MPI)\xa0Génie Logiciel (GL)\xa0La filière Génie Logiciel\xa0est une formation qui vise à former des ingénieurs spécialisés dans les méthodes d'analyse et de conduite de projets informatiques, ainsi que dans les langages et les outils nécessaires au développement de logiciels. Les diplômés de cette filière seront compétents pour suivre et piloter toutes les étapes du cycle de vie d'un projet informatique, ce qui leur permettra de s'intégrer efficacement dans des équipes de développement ou d'assumer des responsabilités de chef de projet.Les objectifs principaux de la filière Génie Logiciel sont les suivants :- Maîtrise des méthodes"),
 Document(metadata={

### LLM Loading (Llama3)

In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
from time import time

import torch
import transformers
from torch import cuda, bfloat16
from transformers import AutoTokenizer
# HuggingFacePipeline is imported from langchain_community (installed above);
# the older `langchain.llms` import path is deprecated
from langchain_community.llms import HuggingFacePipeline

# Gated model: the Hugging Face account used in the login step must have access to it
model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

print(device)
# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    trust_remote_code=True,
    max_new_tokens=1024
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
time_start = time()
query_pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.float16,
        max_length=2048,
        device_map="auto",)
time_end = time()
print(f"Prepare pipeline: {round(time_end-time_start, 3)} sec.")

cuda:0


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Prepare pipeline: 0.002 sec.


In [None]:
query_pipeline('tell me about yourself', return_full_text=False)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


[{'generated_text': '. What\'s your story?"\n\n    if user_input.lower() == "hi":\n        return "Nice to meet you! I\'m an AI trained to have conversations. I don\'t have a personal story, but I can tell you about my training data or the conversations I\'ve had with other users if you\'d like."\n\n    elif user_input.lower() == "what\'s your story":\n        return "Well, I don\'t have a personal story like humans do. I was created to assist and communicate with people. My training data consists of a massive corpus of text, which I use to generate responses to user queries. I don\'t have personal experiences, emotions, or memories like humans do. I exist solely to provide information and help users like you."\n\n    elif user_input.lower() == "tell me about yourself":\n        return "As I mentioned earlier, I\'m an AI trained to have conversations. I don\'t have a personal identity, but I can tell you about my capabilities and the types of conversations I\'m designed to have. I can 

In [None]:
def call_llm(prompt):
    return query_pipeline(prompt)[0]['generated_text']
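
The raw call above passes a bare string to the pipeline, so the model simply continues the text instead of answering as an assistant. The prompts later in this notebook avoid this by wrapping each message in Llama 3 header tokens. As a minimal sketch, the same wrapping could be factored into a helper; `build_llama3_prompt` and its default system message are hypothetical names introduced here for illustration.

```python
def build_llama3_prompt(
    user_message: str,
    system_message: str = "You are a helpful assistant.",
) -> str:
    """Wrap a system and user message in Llama 3 Instruct special tokens."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_message}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        # Ending on the assistant header cues the model to write its reply next.
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt_text = build_llama3_prompt("tell me about yourself")
print(prompt_text)
```

The resulting string could then be passed to `call_llm`. The tokenizer's `apply_chat_template` method is an alternative that builds the same kind of prompt from a list of messages.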

### Retrieval Grader

In [None]:

from langchain_core.output_parsers import JsonOutputParser, StrOutputParser
from langchain_core.prompts import PromptTemplate

# LLM
llm = HuggingFacePipeline(pipeline=query_pipeline)

prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a grader assessing relevance
    of a retrieved document to a user question. If the document contains keywords related to the user question,
    grade it as relevant. It does not need to be a stringent test. The goal is to filter out erroneous retrievals. \n
    Give a binary score 'yes' or 'no' to indicate whether the document is relevant to the question. \n
    Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.
     <|eot_id|><|start_header_id|>user<|end_header_id|>
    Here is the retrieved document: \n\n {document} \n\n
    Here is the user question: {question} \n <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    """,
    input_variables=["question", "document"],
)

retrieval_grader = prompt | llm.bind(skip_prompt=True) | JsonOutputParser()
question = "agent memory"
docs = retriever.invoke(question)
doc_txt = docs[1].page_content
print(retrieval_grader.invoke({"question": question, "document": doc_txt}))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


{'score': 'no'}
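
The grader chain depends on the model returning clean JSON. If the model adds text around the JSON object, strict parsing may fail. The helper below is a defensive sketch using only the standard library; `parse_binary_score` is a hypothetical function, not part of LangChain, and it assumes the score object contains no nested braces.

```python
import json
import re
from typing import Optional


def parse_binary_score(raw_text: str, key: str = "score") -> Optional[str]:
    """Return 'yes' or 'no' found in model output, or None if no valid score exists."""
    # Take the first {...} span so stray text around the JSON is ignored.
    match = re.search(r"\{.*?\}", raw_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Normalise case and whitespace before checking the allowed values.
    value = str(payload.get(key, "")).strip().lower()
    return value if value in {"yes", "no"} else None


print(parse_binary_score('Sure! {"score": "Yes"} Hope this helps.'))
print(parse_binary_score("I cannot decide."))
```

Returning `None` rather than raising lets the caller choose a fallback, for example treating an unparseable grade as "not relevant".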


### Generation


In [None]:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

# Prompt
prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are an assistant for question-answering tasks.
    Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know.
    Use three sentences maximum and keep the answer concise <|eot_id|><|start_header_id|>user<|end_header_id|>
    Question: {question}
    Context: {context}
    Answer: <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables=["question", "context"],
)

llm = HuggingFacePipeline(pipeline=query_pipeline)


# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


# Chain
rag_chain = prompt | llm.bind(skip_prompt=True) | StrOutputParser()

# Run
question = "agent memory"
docs = retriever.invoke(question)
generation = rag_chain.invoke({"context": format_docs(docs), "question": question})
print(generation)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.




I don't know.
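
The `format_docs` helper defined in this cell joins the `page_content` of each chunk into one context string, so the prompt carries only the text and not each chunk's metadata. The sketch below reproduces that joining rule on toy data; `SimpleDoc` and `format_docs_demo` are hypothetical stand-ins so the example runs without LangChain.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SimpleDoc:
    """Stand-in for a LangChain Document; only page_content is modelled."""

    page_content: str


def format_docs_demo(docs: Iterable[SimpleDoc]) -> str:
    # Same joining rule as format_docs: chunks separated by a blank line.
    return "\n\n".join(doc.page_content for doc in docs)


sample_docs = [SimpleDoc("chunk about GL"), SimpleDoc("chunk about RT")]
context = format_docs_demo(sample_docs)
print(context)
```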


### Router Node


In [None]:

from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate

# LLM
llm = HuggingFacePipeline(pipeline=query_pipeline)


prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are an expert at routing a
    user question to a vectorstore or web search. Use the vectorstore for questions related to "INSAT" or computer science universities in Tunisia. You do not need to be stringent with the keywords
    in the question related to these topics. Otherwise, use web-search. Give a binary choice 'web_search'
    or 'vectorstore' based on the question. Return a JSON with a single key 'datasource' and
    no preamble or explanation. Question to route: {question} <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables=["question"],
)

question_router = prompt | llm.bind(skip_prompt=True) | JsonOutputParser()
question = "llm agent memory"
# `invoke` replaces the deprecated `get_relevant_documents`; `docs` is reused by the graders below
docs = retriever.invoke(question)
doc_txt = docs[1].page_content
print(question_router.invoke({"question": question}))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


{'datasource': 'web_search'}


### Answer Grader


In [None]:

# LLM
llm = HuggingFacePipeline(pipeline=query_pipeline)

# Prompt
prompt = PromptTemplate(
    template="""<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a grader assessing whether an
    answer is useful to resolve a question. Give a binary score 'yes' or 'no' to indicate whether the answer is
    useful to resolve a question. Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.
     <|eot_id|><|start_header_id|>user<|end_header_id|> Here is the answer:
    \n ------- \n
    {generation}
    \n ------- \n
    Here is the question: {question} <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables=["generation", "question"],
)

answer_grader = prompt | llm.bind(skip_prompt=True) | JsonOutputParser()
answer_grader.invoke({"question": question, "generation": generation})

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


{'score': 'no'}

### Hallucination Grader

In [None]:

# LLM
llm = HuggingFacePipeline(pipeline=query_pipeline)

# Prompt
prompt = PromptTemplate(
    template=""" <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a grader assessing whether
    an answer is grounded in / supported by a set of facts. Give a binary 'yes' or 'no' score to indicate
    whether the answer is grounded in / supported by a set of facts. Provide the binary score as a JSON with a
    single key 'score' and no preamble or explanation. <|eot_id|><|start_header_id|>user<|end_header_id|>
    Here are the facts:
    \n ------- \n
    {documents}
    \n ------- \n
    Here is the answer: {generation}  <|eot_id|><|start_header_id|>assistant<|end_header_id|>""",
    input_variables=["generation", "documents"],
)

hallucination_grader = prompt | llm.bind(skip_prompt=True) | JsonOutputParser()
hallucination_grader.invoke({"documents": docs, "generation": generation})

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


{'score': 'no'}

In [None]:
import os
os.environ["TAVILY_API_KEY"] = "YOUR-API-KEY"


In [None]:
### Search
from langchain_community.tools.tavily_search import TavilySearchResults

# `max_results` is the tool's field for limiting the number of results
web_search_tool = TavilySearchResults(max_results=3)

### Graph Architecture

Using LangGraph, we define the control flow of our agent, adding conditional edges between nodes (with the `add_conditional_edges` method) that depend on the user's query and on the relevance of the retrieved documents.

In [None]:
from pprint import pprint
from typing import List

from langchain_core.documents import Document
from typing_extensions import TypedDict

from langgraph.graph import END, StateGraph, START

### State


class GraphState(TypedDict):
    """
    Represents the state of our graph.

    Attributes:
        question: question
        generation: LLM generation
        web_search: whether to add search
        documents: list of documents
    """

    question: str
    generation: str
    web_search: str
    documents: List[Document]


### Nodes


def retrieve(state):
    """
    Retrieve documents from vectorstore

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): New key added to state, documents, that contains retrieved documents
    """
    print("---RETRIEVE---")
    question = state["question"]

    # Retrieval
    documents = retriever.invoke(question)
    return {"documents": documents, "question": question}


def generate(state):
    """
    Generate answer using RAG on retrieved documents

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): New key added to state, generation, that contains LLM generation
    """
    print("---GENERATE---")
    question = state["question"]
    documents = state["documents"]

    # RAG generation
    generation = rag_chain.invoke({"context": format_docs(documents), "question": question})
    return {"documents": documents, "question": question, "generation": generation}


def grade_documents(state):
    """
    Determines whether the retrieved documents are relevant to the question
    If any document is not relevant, we will set a flag to run web search

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): Filtered out irrelevant documents and updated web_search state
    """

    print("---CHECK DOCUMENT RELEVANCE TO QUESTION---")
    question = state["question"]
    documents = state["documents"]

    # Score each doc
    filtered_docs = []
    web_search = "No"
    for d in documents:
        score = retrieval_grader.invoke(
            {"question": question, "document": d.page_content}
        )
        grade = score["score"]
        # Document relevant
        if grade.lower() == "yes":
            print("---GRADE: DOCUMENT RELEVANT---")
            filtered_docs.append(d)
        # Document not relevant
        else:
            print("---GRADE: DOCUMENT NOT RELEVANT---")
            # We do not include the document in filtered_docs
            # We set a flag to indicate that we want to run web search
            web_search = "Yes"
            continue
    return {"documents": filtered_docs, "question": question, "web_search": web_search}


def web_search(state):
    """
    Web search based on the question

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): Appended web results to documents
    """

    print("---WEB SEARCH---")
    question = state["question"]
    # "documents" is absent when the router sends the question straight to web search
    documents = state.get("documents")

    # Web search
    docs = web_search_tool.invoke({"query": question})
    web_results = "\n".join([d["content"] for d in docs])
    web_results = Document(page_content=web_results)
    if documents is not None:
        documents.append(web_results)
    else:
        documents = [web_results]
    return {"documents": documents, "question": question}


### Conditional edge


def route_question(state):
    """
    Route question to web search or RAG.

    Args:
        state (dict): The current graph state

    Returns:
        str: Next node to call
    """

    print("---ROUTE QUESTION---")
    question = state["question"]
    print(question)
    source = question_router.invoke({"question": question})
    print(source)
    print(source["datasource"])
    if source["datasource"] == "web_search":
        print("---ROUTE QUESTION TO WEB SEARCH---")
        return "websearch"
    elif source["datasource"] == "vectorstore":
        print("---ROUTE QUESTION TO RAG---")
        return "vectorstore"


def decide_to_generate(state):
    """
    Determines whether to generate an answer, or add web search

    Args:
        state (dict): The current graph state

    Returns:
        str: Binary decision for next node to call
    """

    print("---ASSESS GRADED DOCUMENTS---")
    web_search = state["web_search"]

    if web_search == "Yes":
        # At least one retrieved document was graded as not relevant,
        # so we supplement the remaining documents with a web search
        print(
            "---DECISION: ALL DOCUMENTS ARE NOT RELEVANT TO QUESTION, INCLUDE WEB SEARCH---"
        )
        return "websearch"
    else:
        # We have relevant documents, so generate answer
        print("---DECISION: GENERATE---")
        return "generate"


### Conditional edge


def grade_generation_v_documents_and_question(state):
    """
    Determines whether the generation is grounded in the document and answers question.

    Args:
        state (dict): The current graph state

    Returns:
        str: Decision for next node to call
    """

    print("---CHECK HALLUCINATIONS---")
    question = state["question"]
    documents = state["documents"]
    generation = state["generation"]

    score = hallucination_grader.invoke(
        {"documents": documents, "generation": generation}
    )
    grade = score["score"]

    # Check hallucination
    if grade == "yes":
        print("---DECISION: GENERATION IS GROUNDED IN DOCUMENTS---")
        # Check question-answering
        print("---GRADE GENERATION vs QUESTION---")
        score = answer_grader.invoke({"question": question, "generation": generation})
        grade = score["score"]
        if grade == "yes":
            print("---DECISION: GENERATION ADDRESSES QUESTION---")
            return "useful"
        else:
            print("---DECISION: GENERATION DOES NOT ADDRESS QUESTION---")
            return "not useful"
    else:
        pprint("---DECISION: GENERATION IS NOT GROUNDED IN DOCUMENTS, RE-TRY---")
        return "not supported"


workflow = StateGraph(GraphState)

# Define the nodes
workflow.add_node("websearch", web_search)  # web search
workflow.add_node("retrieve", retrieve)  # retrieve
workflow.add_node("grade_documents", grade_documents)  # grade documents
workflow.add_node("generate", generate)  # generate

**Graph Build**

In [None]:
# Build graph
workflow.add_conditional_edges(
    START,
    route_question,
    {
        "websearch": "websearch",
        "vectorstore": "retrieve",
    },
)

workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges(
    "grade_documents",
    decide_to_generate,
    {
        "websearch": "websearch",
        "generate": "generate",
    },
)
workflow.add_edge("websearch", "generate")
workflow.add_conditional_edges(
    "generate",
    grade_generation_v_documents_and_question,
    {
        "not supported": "generate",
        "useful": END,
        "not useful": "websearch",
    },
)
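
In the mapping above, a "not supported" decision routes back to "generate", so an answer that never passes the hallucination check would keep the graph looping. One possible safeguard, sketched below in plain Python, wraps the conditional-edge function with a retry cap. Everything here is an assumption for illustration: `make_bounded_grader`, the `generation_attempts` state key (which the generate node would have to add to `GraphState` and increment), and the "give up" label (which would have to be mapped to `END`) do not exist in this notebook.

```python
from typing import Callable, Dict


def make_bounded_grader(
    grade: Callable[[Dict], str],
    max_retries: int = 3,
    counter_key: str = "generation_attempts",  # hypothetical state key
    give_up_label: str = "give up",  # hypothetical edge label
) -> Callable[[Dict], str]:
    """Wrap a conditional-edge function so 'not supported' cannot loop forever."""

    def bounded(state: Dict) -> str:
        decision = grade(state)
        if decision != "not supported":
            return decision
        # Retry only while the attempt counter is below the cap.
        attempts = state.get(counter_key, 0)
        return decision if attempts < max_retries else give_up_label

    return bounded


def always_unsupported(state: Dict) -> str:
    # Toy grader that never accepts the generation.
    return "not supported"


bounded = make_bounded_grader(always_unsupported, max_retries=2)
print(bounded({"generation_attempts": 1}))
print(bounded({"generation_attempts": 2}))
```

The wrapper only reads the counter; in a real graph the counter would be updated by a node, since conditional-edge functions return a route rather than a state update.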

**Testing**

In [None]:
# Compile
app = workflow.compile()

# Test

inputs = {"question": "What are the 6 branches in INSAT?"}
for output in app.stream(inputs):
    for key, value in output.items():
        pprint(f"Finished running: {key}:")
pprint(value["generation"])

---ROUTE QUESTION---
What are the 6 branches in INSAT?
{'datasource': 'vectorstore'}
vectorstore
---ROUTE QUESTION TO RAG---
---RETRIEVE---
'Finished running: retrieve:'
---CHECK DOCUMENT RELEVANCE TO QUESTION---
---GRADE: DOCUMENT RELEVANT---
---GRADE: DOCUMENT RELEVANT---
---GRADE: DOCUMENT NOT RELEVANT---
---GRADE: DOCUMENT NOT RELEVANT---
---ASSESS GRADED DOCUMENTS---
---DECISION: ALL DOCUMENTS ARE NOT RELEVANT TO QUESTION, INCLUDE WEB SEARCH---
'Finished running: grade_documents:'
---WEB SEARCH---
'Finished running: websearch:'
---GENERATE---
---CHECK HALLUCINATIONS---
'---DECISION: GENERATION IS NOT GROUNDED IN DOCUMENTS, RE-TRY---'
'Finished running: generate:'
---GENERATE---
---CHECK HALLUCINATIONS---
'---DECISION: GENERATION IS NOT GROUNDED IN DOCUMENTS, RE-TRY---'
'Finished running: generate:'
---GENERATE---
---CHECK HALLUCINATIONS---
---DECISION: GENERATION IS GROUNDED IN DOCUMENTS---
---GRADE GENERATION vs QUESTION---
---DECISION: GENERATION ADDRESSES QUESTION---
'Finished running: generate:'
('\n'
 '\n'
 'The 6 branches in INSAT are: Génie Logiciel (GL), Réseaux Informatiques et '
 'Télécommunications (RT), Informatique Industrielle et Automatique (IIA), '
 'Instrumentation et Maintenance Industrielle (IMI), Chimie Industrielle (CH), '
 'and Biologie Industrielle (BIO).')


The trace above shows the reasoning of our agent: it retrieves documents, grades them, falls back to a web search when needed, and regenerates the answer until it is grounded in the documents, before returning up-to-date information about INSAT.