# Introduction

A simple **chat-with-multiple-PDFs/docs** app built on **Retrieval-Augmented Generation (RAG)**.  
Ask questions about the contents of your PDF and DOC files, and the app answers from the most relevant passages.

![Chatbot Diagram](../src/images/GraghWorkFlow.jpg)

This app is built using [**LangGraph**](https://www.langchain.com/langgraph) and [**LangChain**](https://www.langchain.com/).


------


## 🚀 Features covered

- **Multi-Document Support**: Reads and processes both PDF and DOC files.  
- **Text Chunking**: Splits extracted text into manageable overlapping chunks.  
- **Semantic Embeddings**: Generates vector embeddings using state-of-the-art models.  
- **Vector Database**: Stores embeddings in a scalable vector DB for fast similarity search.  
- **Detection of Language and Dialect**: Automatically detects language and Arabic dialects if applicable.  
- **Retriever Tool**: Fetches the most relevant chunks using semantic search.  
- **Graders**: Evaluate the relevance of retrieved documents before generating answers.  
- **Translation & Reasoning**: Provides translation support and reasoning over queries.  
- **Query Handling**: Processes user queries for context-aware responses.  
- **LLM-Powered Answering**: Combines retrieved context with queries to generate factual, context-aware answers.  
- **Streamlit Integration**: Fully interactive interface for uploading documents and chatting with the AI assistant.
-----



In [None]:
# Configure settings: load the API key from .env and initialize the chat model
import os 
import dotenv
from langchain.chat_models import init_chat_model
os.environ['GOOGLE_API_KEY'] = dotenv.get_key(key_to_get='GOOGLE_API_KEY', dotenv_path='.env')
llm = init_chat_model(model = 'gemini-2.5-flash' , model_provider='google-genai')

# A. Preprocess documents

- In this step we read the files to get their full text, split it into small chunks, and embed the chunks for storage in a vector database.


##### 1) Read files, using PyPDF2 to extract text from PDFs

In [None]:
from PyPDF2 import PdfReader

def get_content_pages(files:  list[str]) -> str:
    """Read PDF or TXT files and return their combined text"""
    ftext = ""
    for file_path in files:
        if file_path.lower().endswith(".pdf"):
            pdf_loader = PdfReader(file_path)
            for page in pdf_loader.pages:
                ftext += page.extract_text() or ""  
        else:
            with open(file_path, "r", encoding="utf-8") as f:
                ftext += f.read()
        ftext += "\n"
    return ftext
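
The reader above treats every path that does not end in `.pdf` as UTF-8 text, so a binary file in the input folder would fail to decode. As a minimal sketch, assuming only `.pdf` and `.txt` inputs are wanted, the paths could be filtered first. `SUPPORTED_SUFFIXES` and `supported_files` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Assumption: only the formats the reader above can parse as-is.
SUPPORTED_SUFFIXES = {".pdf", ".txt"}


def supported_files(paths: list[str]) -> list[str]:
    """Keep only the paths whose extension is in SUPPORTED_SUFFIXES (case-insensitive)."""
    return [p for p in paths if Path(p).suffix.lower() in SUPPORTED_SUFFIXES]


candidates = ["data/notes.PDF", "data/readme.txt", "data/slides.docx", "data/.DS_Store"]
kept = supported_files(candidates)
print(kept)
```

The filtered list can then be passed to `get_content_pages` in place of the raw directory listing.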

##### 2) Split Full text into small chunks

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter


def split_chunks(text: str, chunk_size: int = 600, chunk_overlap: int = 50) -> list[str]:
    """Split text into small, overlapping chunks."""
    # `separators` expects a list of strings; split on newlines as before
    splitter = RecursiveCharacterTextSplitter(
        separators=["\n"], chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    chunks = splitter.split_text(text)

    return chunks
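
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of fixed-width overlapping chunking. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to cut at separators). `sliding_chunks` is a hypothetical helper used only for illustration.

```python
def sliding_chunks(text: str, chunk_size: int = 600, chunk_overlap: int = 50) -> list[str]:
    """Cut text into fixed-width windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, which helps keep a sentence that straddles a boundary retrievable from at least one chunk.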

##### 3) Initialize Vector Database

1. Generate embeddings from the chunks using the **[all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)** model from SentenceTransformers.

2. Store the embeddings in a vector database using **InMemoryVectorStore** from LangChain.

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings  # `langchain.embeddings` is a deprecated import path
from langchain_community.vectorstores import InMemoryVectorStore


def create_vectorDB(chunks : list[str] ,
                    embedding_name  : str = "sentence-transformers/all-MiniLM-L6-v2" ) : 
    """Generate and store embedding of chunks into vector database"""
    embedding_model = HuggingFaceEmbeddings(model_name = embedding_name)
    vectordb = InMemoryVectorStore.from_texts(chunks , embedding_model)
    return vectordb
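
Conceptually, a similarity search ranks the stored vectors by how close they are to the query vector. The sketch below shows that idea with cosine similarity on made-up 3-dimensional vectors. The real embeddings come from the model and are much larger, and `cosine`, `top_k`, and `toy_store` are hypothetical names, not part of LangChain.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k texts whose vectors are most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Made-up vectors standing in for real embeddings.
toy_store = [
    ("supervised learning uses labels", [0.9, 0.1, 0.0]),
    ("clustering groups unlabeled data", [0.1, 0.9, 0.0]),
    ("football match results", [0.0, 0.0, 1.0]),
]
hits = top_k([1.0, 0.2, 0.0], toy_store, k=2)
print(hits)
```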

##### 4) Combine the components


In [None]:
def ProcessDocuments(files:list[str]) :
    """Run the full preprocessing pipeline: read files, split into chunks, and build the vector DB."""
    
    content = get_content_pages(files=files)
    chunks = split_chunks(content)
    vectorDB = create_vectorDB(chunks)
    return vectorDB , chunks

In [None]:
# testing functions
pdfs = [os.path.join('data' , x) for x in os.listdir("data")]  #pdf paths
vector_db ,chunks = ProcessDocuments(pdfs) 

In [None]:
query = "what is supervised learning?" 
similar_docs = vector_db.similarity_search(query, k=3)
for i, doc in enumerate(similar_docs):
    print("---" * 15, f"*{i+1}*", "---" * 15)
    print(doc.page_content)

# B. Work-Flow-Graph

## 1) Define StateClass

In [None]:
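from typing import TypedDict, List, Optional
from langchain_core.messages import BaseMessage

class MyState(TypedDict):
    messages: List[BaseMessage]   # conversation messages passed between nodes
    detected_lang: str            # ISO 639-1 code of the query language
    dialect: Optional[str]        # Arabic dialect label, or None for other languages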
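
Because `messages` is declared without a reducer, my understanding of LangGraph's default behaviour is that a node returning this key replaces the stored list instead of appending to it. The sketch below mimics that merge with plain dictionaries. `apply_update` is a hypothetical helper, not a LangGraph function.

```python
def apply_update(state: dict, update: dict) -> dict:
    """Return a new state in which every key present in `update` overwrites the old value."""
    merged = dict(state)
    merged.update(update)
    return merged


state = {"messages": ["ما هي أنواع Transformer؟"], "detected_lang": "ar", "dialect": "MSA"}
# A node that returns only `messages` replaces the list and leaves the other keys untouched.
state = apply_update(state, {"messages": ["What are the types of Transformer?"]})
print(state)
```

Under this assumption, a node that returns a new `messages` list hands only that list to the next node, while `detected_lang` and `dialect` persist.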

## 2) Generate a Retriever Tool using LangChain

- Use `create_retriever_tool` from **LangChain** to generate a retriever tool from the vector database.

##### 2.1) retriever_tool Function

In [None]:
from langchain.tools.retriever import create_retriever_tool 

retriver = vector_db.as_retriever()
retrivertool = create_retriever_tool(retriver , 
                                     "retriever_tool" ,
                                     "Search and return information about input context" )

print(retrivertool.invoke({"query": query}))


##### 2.2) Retriever Agent

In [None]:
from langgraph.prebuilt import create_react_agent

RetriverAgent = create_react_agent(
    llm,
    tools=[retrivertool],
    name="RetriverAgent",
    prompt=(
        "You are a retriever agent.\n"
        "- Always call the retriever_tool with the user query.\n"
        "- Return ONLY the page_content of the retrieved documents.\n"
        "- Do not summarize or rephrase.\n"
        "- Don't return repeated retrieved chunks.\n"
        "- If you didn't find similar text return 'I can't find it'."
    )
)


## 3) Detect Language and Dialect

##### 3.1) Detect the language using a prompt template with structured output, and the dialect if the query language is Arabic

#####  Detect Arabic dialect

- Using [IbrahimAmin/marbertv2-arabic-written-dialect-classifier](https://huggingface.co/IbrahimAmin/marbertv2-arabic-written-dialect-classifier)  
- The model predicts one of **5 Arabic dialects**:

| Code | Dialect | Region / Notes |
|------|---------|----------------|
| MAGHREB | Maghreb dialect | Northwest Africa (Morocco, Algeria, Tunisia, Libya, Mauritania) |
| LEV     | Levantine dialect | Lebanon, Syria, Jordan, Palestine |
| MSA     | Modern Standard Arabic | Formal Arabic (books, news, official use) |
| GLF     | Gulf dialect | Saudi Arabia, UAE, Kuwait, Bahrain, Qatar, Oman |
| EGY     | Egyptian dialect | Egypt |



In [None]:
from pydantic import Field, BaseModel
from langgraph.graph import MessagesState
from langchain_core.prompts import PromptTemplate
from langchain_core.messages import HumanMessage
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

class LanguageDetector(BaseModel):
    language: str = Field(
        description="Detected language of the question, represented in a two-character ISO 639-1 code."
    )


dialect_model_name = "IbrahimAmin/marbertv2-arabic-written-dialect-classifier"

dialect_pipeline = pipeline(
    "text-classification",
    model=AutoModelForSequenceClassification.from_pretrained(dialect_model_name),
    tokenizer=AutoTokenizer.from_pretrained(dialect_model_name),
)



def detecting_language(state: MyState):
    """Detect the query language with the LLM and, for Arabic, the dialect with the classifier."""
    question = state['messages'][0].content
    
    detectionmodel = llm.with_structured_output(LanguageDetector)

    LANGUAGE_DETECTOR_TEMPLATE = "\n\n".join([
        "You are a language detector. Return the language of the user's question.",
        "Here is the user question: {question}",
        "# Instructions:",
        "- Return only the two-character ISO 639-1 code for the language.",
        "- Base detection on the language of the question itself (its structure and wording), not on individual foreign words inside it.",
        "- Focus especially on the interrogative word (e.g., what, how, من, ماذا) and the main verb or auxiliary verb."
    ])


    detection_prompt = PromptTemplate(
        template=LANGUAGE_DETECTOR_TEMPLATE,
        input_variables=["question"]
    )


    prompt = detection_prompt.format(question=question)
    response: LanguageDetector = detectionmodel.invoke(prompt)


    # dialect
    dialect = None
    if response.language == "ar":
        preds = dialect_pipeline(question, top_k=None)
        if preds:
            best = max(preds, key=lambda x: x["score"])
            dialect = best["label"]
    
    
    return {
        "messages": state["messages"],
        "detected_lang": response.language,
        "dialect": dialect
    }
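
The dialect step boils down to taking the highest-scoring label from the classifier output and, if desired, mapping its code to a readable name from the table above. A minimal sketch, with made-up scores standing in for real pipeline output. `DIALECT_NAMES`, `best_dialect`, and `fake_preds` are hypothetical names.

```python
from typing import Optional

# Codes and names copied from the dialect table above.
DIALECT_NAMES = {
    "MAGHREB": "Maghreb dialect",
    "LEV": "Levantine dialect",
    "MSA": "Modern Standard Arabic",
    "GLF": "Gulf dialect",
    "EGY": "Egyptian dialect",
}


def best_dialect(preds: list[dict]) -> Optional[str]:
    """Return the label with the highest score, or None when there are no predictions."""
    if not preds:
        return None
    return max(preds, key=lambda p: p["score"])["label"]


# Made-up scores in the list-of-dicts shape that the code above iterates over.
fake_preds = [
    {"label": "EGY", "score": 0.91},
    {"label": "LEV", "score": 0.06},
    {"label": "MSA", "score": 0.03},
]
label = best_dialect(fake_preds)
print(label, "->", DIALECT_NAMES[label])
```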



In [None]:
#Test
from langchain_core.messages import HumanMessage
import time

examples = [
    ("Arabic Question", MessagesState(messages=[HumanMessage(content="ما هي أنواع Transformer؟")])),
    ("English Question", MessagesState(messages=[HumanMessage(content="What are the types of Transformer?")])),
    ("French Question", MessagesState(messages=[HumanMessage(content="Qu'est-ce qu'un transformateur?")])),
    ("Spanish Question", MessagesState(messages=[HumanMessage(content="¿Cuáles son los tipos de transformadores?")])),
    ("German Question", MessagesState(messages=[HumanMessage(content="Welche Arten von Transformatoren gibt es?")])) ,
]


for name, state in examples:
    result = detecting_language(state)
    print(f"{name}: {result['detected_lang'] ,result['dialect'] }")
    print("-" * 50)
    time.sleep(5)

In [None]:
#Test
import time

examples = [
    ("EGY Example", MessagesState(messages=[HumanMessage(content="ازيك يصحبي قولي انواع l")])),
    ("LEV Example 2", MessagesState(messages=[HumanMessage(content="عامل اي يازلمي اليوم ممكن تقولي types of ML")])),
    ("GLF Example", MessagesState(messages=[HumanMessage(content="شلونك يا طويل العمر؟ ممكن تقول لي examples of ML")])),
    ("LEV Example", MessagesState(messages=[HumanMessage(content="كيفك يا زلمي؟ شو الأخبار؟ give me types of ML")])),
    ("MSA Example", MessagesState(messages=[HumanMessage(content="ما هي أنواع   ML techniques؟")]))
]



for name, state in examples:
    result = detecting_language(state)
    print(f"{name}: {result['detected_lang'] ,result['dialect'] }")
    print("-" * 50)
    time.sleep(5)

## 4) Translate Query to English

In [None]:
from langchain_core.prompts import PromptTemplate
from langchain_core.messages import HumanMessage
from langgraph.graph import MessagesState


def TranslateQuery(state: MyState):
    """Translate the latest user query into English."""

    user_messages = [m for m in state["messages"] if isinstance(m, HumanMessage)]
    question = state.get("translated_query", user_messages[-1].content)

    # Plain (non-f) string: {question} stays a placeholder that PromptTemplate fills once.
    msg_prompt = """
                    You are a Machine Translation (MT) system.
                    Your task: translate the user question to English text.

                    Instructions:
                    1. Translate the question to English as accurately as possible.
                    2. Do not add explanations, comments, or extra content.
                    3. Do not attempt to clarify or modify the meaning.
                    4. Keep the original meaning exactly.

                    User question: "{question}"
                    """

    prompt_msg_template = PromptTemplate(
        template=msg_prompt,
        input_variables=['question']
    )

    response = llm.invoke([{'role': 'user', 'content': prompt_msg_template.format(question=question)}])

    return {'messages': [HumanMessage(content=response.content)]}
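
Prompt templates are safest when the user text is substituted exactly once. The sketch below uses plain `str.format` as a stand-in for a prompt template (an assumption: `PromptTemplate` itself is not exercised here) to show what can go wrong when text that already contains the user's question is formatted a second time.

```python
question = 'What does {"a": 1} mean in JSON?'

# Risky: the f-string embeds the user text first, so a later .format() re-parses its braces.
already_filled = f"User question: {question}"
try:
    already_filled.format(question=question)
    double_format_ok = True
except (KeyError, ValueError, IndexError):
    double_format_ok = False

# Safer: keep a literal placeholder and substitute exactly once.
template = "User question: {question}"
filled_once = template.format(question=question)

print(double_format_ok)
print(filled_once)
```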


In [None]:
from langchain_core.messages import convert_to_messages

input_state = {
    "messages": convert_to_messages([
        {"role": "user", "content": "تقدر تفرق بين انواع transformers "}
    ])
}


response = TranslateQuery(input_state)
response["messages"][-1].pretty_print()

## 5) Grader
- The grader is a binary relevance score computed by the LLM to decide whether the retrieved documents are relevant to the question.

In [None]:
from pydantic import BaseModel , Field
from typing import Literal
from langchain_core.messages import HumanMessage


class GraderDocument(BaseModel):
    """Grade documents using a binary score for relevance check."""
    binary_score : str = Field(description= "Relevance score: 'yes' if relevant, or 'no' if not relevant")


def GraderDocumentAgent(state: MyState)-> Literal['Chit-ChatAgent' ,'AnswerAgent'] :
    
    user_messages = [m for m in state["messages"] if isinstance(m, HumanMessage)]
    question = state.get("translated_query", user_messages[-1].content)
    context = state['messages'][-1].content

      
    gradermodel = llm.with_structured_output(GraderDocument)
    
    GRADE_PROMPT = "\n\n".join([
    "You are a grader assessing relevance of a retrieved document to a user question.",
    "Here is the retrieved document: \n\n {context}",
    "Here is the user question: {question}",
    "If the document contains keyword(s) or semantic meaning related to the user question, grade it as relevant.",
    "If the question is an introductory or personal question (e.g., greetings like 'how are you', or self-introduction like 'I am X'), always grade it as 'yes'.",
    "Give a binary score 'yes' or 'no' score to indicate whether the document is relevant to the question." ])
    
    prompt = PromptTemplate(template= GRADE_PROMPT , input_variables=['question', 'context'])
    
    
    prompt_template = prompt.format(question = question , context = context)
    response =  gradermodel.invoke(
        [HumanMessage(content=prompt_template) ]
    )

    score = response.binary_score
    
    if score == 'yes' :
        return "AnswerAgent"
    else : 
        return 'Chit-ChatAgent'
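
The routing decision is a plain mapping from the grader's verdict to the name of the next node. As a hedged sketch, the comparison could also tolerate differences in case and whitespace, in case the model returns `Yes` or ` yes `. `route_from_score` is a hypothetical helper, not part of the graph above.

```python
def route_from_score(raw_score: str) -> str:
    """Map a grader verdict to the next node name, tolerating case and surrounding whitespace."""
    return "AnswerAgent" if raw_score.strip().lower() == "yes" else "Chit-ChatAgent"


for verdict in ["yes", " Yes ", "no", "maybe"]:
    print(repr(verdict), "->", route_from_score(verdict))
```

Anything other than a clear "yes" falls through to the chit-chat fallback, matching the behaviour of the `else` branch above.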
    




In [None]:
# test
from langchain_core.messages import convert_to_messages

input = {
    "messages": convert_to_messages(
        [
            {
                "role": "user",
                "content": "What does machine learning?",
            },
            
            {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "id": "1",
                        "name": "retriever_tool",
                        "args": {"query": "Supervised learning, Unsupervised learning, Reinforcement learning."},
                    }
                ],
            },
            {"role": "tool", "content": "meow", "tool_call_id": "1"},
        ]
    )
}


GraderDocumentAgent(input)

In [None]:
from langchain_core.messages import convert_to_messages

input = {
    "messages": convert_to_messages(
        [
            {
                "role": "user",
                "content": "What does machine learning?",
            },
            
            {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "id": "1",
                        "name": "retriever_tool",
                        "args": {"query": "Supervised learning, Unsupervised learning, Reinforcement learning."},
                    }
                ],
            },
            {"role": "tool", 
             "content": """Machine learning is a field of artificial intelligence that focuses on building models that can learn from data and make predictions or decisions without being explicitly programmed. 
                            It is commonly used for tasks like classification, regression, and pattern recognition. """, 
            "tool_call_id": "1"},
        ]
    )
}


GraderDocumentAgent(input)

## 6) Generate answer


In [None]:
def GenerateAnswer(state: MyState) :
    """Generate an answer."""

    GENERATE_PROMPT = "\n".join([
        "You are an assistant for question-answering tasks.",
        "Use the following pieces of retrieved context to answer the question.",
        "- First make sure you understand the question and the context so that you answer correctly.",
        "- Answer every part of the question and keep the answer short and simple.",
        "- Generate English text only.",
        "- If the question is a greeting or introductory (like 'how are you', 'what's up', 'hello', 'hi', or self-introduction like 'I am X'), do not use the context. Instead, just greet back politely and say 'How can I help you?'.",
        "- If it is a normal question, base your answer on the context and supplement it where needed to make the answer complete.",
        "Question: {question}\n",
        "Context: {context}"
    ])

    user_messages = [m for m in state["messages"] if isinstance(m,HumanMessage)]
    
    question = state.get("translated_query", user_messages[-1].content)
    context = state["messages"][-1].content

    prompt = GENERATE_PROMPT.format(question=question, context=context)
    response = llm.invoke([{"role": "user", 
                            "content": prompt}])
    
    return {"messages": [response]}
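
The grader, the answer node, and the chit-chat node all read the state the same way: the question is the last human message and the context is the content of the last message in the list. A minimal sketch of that convention, using a small hypothetical `Msg` dataclass in place of LangChain message objects:

```python
from dataclasses import dataclass


@dataclass
class Msg:
    role: str      # "user", "assistant", or "tool"
    content: str


def question_and_context(messages: list[Msg]) -> tuple[str, str]:
    """Return (last user message, content of the final message in the history)."""
    question = [m for m in messages if m.role == "user"][-1].content
    context = messages[-1].content
    return question, context


history = [
    Msg("user", "What are the main types of machine learning?"),
    Msg("assistant", ""),
    Msg("tool", "Supervised, unsupervised, and reinforcement learning."),
]
q, c = question_and_context(history)
print(q)
print(c)
```

One consequence of this convention: if the history holds only the user message, the "context" is the question itself.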

In [None]:
from langchain_core.messages import convert_to_messages

input = {
    "messages": convert_to_messages(
        [
            {
                "role": "user",
                "content": "What are the main types of machine learning?",
            },
            {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "id": "1",
                        "name": "retriever_tool",
                        "args": {"query": "types of machine learning"},
                    }
                ],
            },
            {
                "role": "tool",
                "content": "Machine learning is commonly categorized into three main types: supervised learning, unsupervised learning, and reinforcement learning.",
                "tool_call_id": "1",
            },
        ]
    )
}

response = GenerateAnswer(input)
response["messages"][-1].pretty_print()

## 7) Chit-Chat Agent

In [None]:
from langchain_core.prompts import PromptTemplate
from langchain_core.messages import AIMessage

def chitChatAgent(state: MyState):
    """Reply politely when the retrieved context is unrelated to the user's question."""
    user_messages = [m for m in state['messages'] if isinstance(m, HumanMessage)]
    question = state.get("translated_query", user_messages[-1].content)

    # Plain (non-f) string: {question} stays a placeholder that PromptTemplate fills once.
    msg_prompt = """
        You are a Chit-Chat Assistant.
        Your task: reply politely when the context is not related to the user's question. Follow the instructions below to send a chit-chat message.

        Instructions:
        1. Start by apologizing that you don't fully understand the question.
        2. Try to clarify by highlighting key words from the user's question.
        3. Use short, simple sentences to suggest that the user rephrase their question.
        4. Keep the apology short and in the first sentence only.
        5. Write in English only.
        6. Do not ask the user to write the question in any particular language.
        7. Ask only a few questions to understand what the user means.

        User question: "{question}"
        """

    prompt_msg_template = PromptTemplate(
        template=msg_prompt,
        input_variables=['question']
    )

    response = llm.invoke([{'role': 'user', 'content': prompt_msg_template.format(question=question)}])

    return {'messages': [AIMessage(content=response.content)]}


In [None]:
from langchain_core.messages import convert_to_messages

input = {
    "messages": convert_to_messages(
        [
            {
                "role": "user",
                "content": "where salaj play in zamilk",
            },
            {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "id": "1",
                        "name": "retriever_tool",
                        "args": {"query": "classification of machine learning"},
                    }
                ],
            },
            {"role": "tool", "content": "Supervised, Unsupervised, Reinforcement Learning", "tool_call_id": "1"},
        ]
    )
}

response = chitChatAgent(input)
response["messages"][-1].pretty_print()

In [None]:
from langchain_core.messages import convert_to_messages

input = {
    "messages": convert_to_messages(
        [
            {
                "role": "user",
                "content": "اين كنت وين ياولا",
            }
           
        ]
    )
}

response = chitChatAgent(input)
response["messages"][-1].pretty_print()

## 8)  Translate Agent

- Translate the reasoning into the same **language** and **dialect** of the question.


In [None]:
def TranslationReasoning(state: MyState):
    """Translate the reasoning into the same **language** and **dialect** of the question."""

    context = state["messages"][-1].content
    detected_lang = state.get("detected_lang")
    dialect = state.get("dialect")


    Translatetemplate = "\n".join([
        "You are a translation agent. Your ONLY job is to translate English text into the target language below.",
        "Never answer in Spanish, French, or any other language unless it exactly matches the detected language.",
        "Languages are given as two-character ISO 639-1 codes, e.g. 'ar' for Arabic and 'en' for English.",
        "",
        f"Target language: {detected_lang} ",
        f"Target dialect: {dialect or 'standard'}",
        "",
        "# Instructions:",
        "- If the dialect is 'standard', translate to the language only.",
        "- If the language is not English, keep important keywords in English.",
        "- Don't explain, don't rephrase, just translate.",
        "- If target language is Arabic, mimic the dialect if possible; otherwise use Modern Standard Arabic.",
        "",
        "Text to translate to the target language:",
        "{context}"
    ])

    TranslatePrompt = PromptTemplate(
        template=Translatetemplate,
        input_variables=["context"],
    )

    prompt = TranslatePrompt.format(context=context)

    response = llm.invoke([
        {"role": "system", "content": "You are a strict translation agent. Respond ONLY with the translated text."},
        {"role": "user", "content": prompt}
    ])
    
    
    translated_text = response.content 
    return {"messages": [AIMessage(content=translated_text)]}

In [None]:

test_state = {
    "messages": [
        HumanMessage(content="Hello Islam! I'm doing well, thank you for asking! How can I help you?")  
    ],
    "detected_lang": "ar",    
    "dialect": "EGY"        
}

result = TranslationReasoning(test_state)

print("\n================ Translation Output ================\n")
for msg in result["messages"]:
    msg.pretty_print()


# Work Flow



### 1. Language & Dialect Detection 🌐
- Detect the language of the user query.
- Detect the dialect if the query is in Arabic.
- Translate the query to English for processing.

### 2. Database Preparation 🗂️
- Extract text from PDF/DOC files.
- Split text into smaller overlapping chunks.
- Generate embeddings and store them in the vector database.

### 3. Retrieval & Grading 🔎⚙️
- Use the retriever tool to fetch relevant chunks based on the translated query.
- Graders evaluate the relevance of the retrieved chunks.
- Decide workflow:
  - **Generate Answer** → if relevant chunks are found.
  - **Chitchat / fallback** → if no relevant context is retrieved.

### 4. Answer Generation ✨
- Combine retrieved context (if any) with the translated query.
- Pass to the LLM → produce answer in English.

### 5. Output Translation & Presentation 🌍
- Translate the generated answer back to the original language and dialect of the query.
- Display the answer in Streamlit with proper formatting.
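
The five steps above can be sketched as one plain-Python function that wires the stages together. Every stage is passed in as a callable, and the lambdas in the example are stubs invented for illustration. This shows the control flow only, not the LangGraph implementation.

```python
def run_pipeline(query, detect, translate, retrieve, grade, answer, chitchat, translate_back):
    """Detect language, translate, retrieve, grade, answer or fall back, then translate back."""
    lang, dialect = detect(query)
    english_query = translate(query)
    context = retrieve(english_query)
    if grade(english_query, context) == "yes":
        reply = answer(english_query, context)   # relevant context found
    else:
        reply = chitchat(english_query)          # fallback when nothing relevant is retrieved
    return translate_back(reply, lang, dialect)


# Stub stages (hypothetical): each one stands in for an LLM or retriever call.
reply = run_pipeline(
    "what is ML?",
    detect=lambda q: ("en", None),
    translate=lambda q: q,
    retrieve=lambda q: "ML is learning from data.",
    grade=lambda q, c: "yes" if "ML" in c else "no",
    answer=lambda q, c: f"Answer based on: {c}",
    chitchat=lambda q: "Sorry, could you rephrase?",
    translate_back=lambda text, lang, dialect: text,
)
print(reply)
```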



----

In [None]:
from langgraph.graph import START, END, StateGraph
from langgraph.checkpoint.memory import MemorySaver

workflow = StateGraph(MyState)

workflow.add_node("DetectLangAgent", detecting_language)
workflow.add_node("Translate_Query", TranslateQuery)
workflow.add_node("RetriverAgent", RetriverAgent)
workflow.add_node("Chit-ChatAgent", chitChatAgent)
workflow.add_node("AnswerAgent", GenerateAnswer)
workflow.add_node("TranslationReasoning", TranslationReasoning)

workflow.add_edge(START, "DetectLangAgent")
workflow.add_edge("DetectLangAgent", "Translate_Query")
workflow.add_edge("Translate_Query", "RetriverAgent")

workflow.add_conditional_edges(
    "RetriverAgent",
    GraderDocumentAgent,
    {
        "AnswerAgent": "AnswerAgent",
        "Chit-ChatAgent": "Chit-ChatAgent",
    }
)

workflow.add_edge("AnswerAgent", "TranslationReasoning")
workflow.add_edge("Chit-ChatAgent", "TranslationReasoning")
workflow.add_edge("TranslationReasoning", END)

graph = workflow.compile(checkpointer=MemorySaver())



In [None]:
from langchain_core.messages import HumanMessage

user_message = HumanMessage(content="اذيك ياصحبي انا اسلام عامل اي")

stream = graph.stream(
    {"messages": [user_message]},
    config ={"configurable": {"thread_id": "6"}},
    stream_mode="values"  
)

for chunk in stream:
    if "messages" in chunk:
        for msg in chunk["messages"]:
            msg.pretty_print()


In [None]:
output = graph.invoke( {"messages": [
            HumanMessage(content=" يعني اي decoding") 
        ]} ,  config={"configurable": {"thread_id": "6"}})

output['messages'][-1].pretty_print()

In [None]:
output = graph.invoke( {"messages": [
            HumanMessage(content="what is encoder decoder") 
        ]} ,  config={"configurable": {"thread_id": "6"}})

output['messages'][-1].pretty_print()

## 🔹 Conclusion  

- This notebook demonstrates how a **Chat with Multiple PDFs / Docs** pipeline can be built using **RAG (Retrieval-Augmented Generation)**.  

- Steps:  

  1. **Language & Dialect Detection** → detect the user query language and dialect, translate to English for processing.  

  2. **Database Preparation** → extract text from PDF/DOC files, split text into overlapping chunks, create embeddings, and store them in the vector database.  

  3. **Retrieval & Grading** → fetch relevant chunks using the retriever tool, grade their relevance, and decide workflow:
     - **Generate Answer** → if relevant chunks are found.
     - **Chitchat / fallback** → if no relevant context is retrieved.  

  4. **Answer Generation** → combine retrieved context with the translated query, generate the answer via LLM.  

  5. **Output Translation & Presentation** → translate the generated answer back to the original language and dialect, display it in Streamlit with proper formatting.  
---------

## 👨‍💻 Built by Eslam Sabry

Contact:
🔗 [LinkedIn](https://www.linkedin.com/in/eslamsabryai)   🔗 [Kaggle](https://www.kaggle.com/eslamsabryelsisi)  


