## Main Goal 

Create a RAG pipeline that accepts a user query, retrieves the relevant data from a vector database, and finally generates a useful response with the help of an LLM.

## My Approach

At a high level, the goal can be broken down into a few parts:

- Processing the data (cleaning, creating chunks, etc.)
- Vectorizing the data with an embedding function
- Taking a query
- Retrieving relevant info using the vector database
- Feeding that info, plus the original query, into the LLM and generating a response
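
To make the five steps concrete before the real implementation, here is a toy, dependency-free sketch of the same flow. Everything in it is a stand-in: `embed` is a bag-of-words count rather than a real embedding model, and `echo_llm` is a hypothetical placeholder that just returns the prompt. It only shows how the stages hand data to one another.

```python
from collections import Counter


def make_chunks(text: str) -> list[str]:
    """Step 1: clean the text and split it into chunks (here: non-empty lines)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def embed(text: str) -> Counter:
    """Step 2: toy 'embedding', a bag-of-words count (NOT a real embedding model)."""
    return Counter(text.lower().split())


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Steps 3-4: score every chunk by word overlap with the query, keep the top k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: sum((q & embed(c)).values()), reverse=True)
    return ranked[:k]


def answer(query: str, chunks: list[str], llm) -> str:
    """Step 5: hand the retrieved context plus the query to the LLM callable."""
    context = "\n".join(retrieve(query, chunks))
    return llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")


def echo_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model: returns the prompt unchanged."""
    return prompt


chunks = make_chunks("The sky is blue.\n\nGrass is green.\n")
print(answer("what colour is grass", chunks, echo_llm))
```

The real pipeline below swaps each stand-in for an actual component: OCR plus cleaning, a multilingual embedding model, a Chroma vector store, and an Ollama-served LLM.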



In [36]:
import pytesseract
from pdf2image import convert_from_path
import re
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_ollama import OllamaLLM

In [37]:
pdf_path = "book/HSC26-Bangla1st-Paper.pdf"
data_path = "data/"

## Extracting Text

To extract the text, I first tried scraping the PDF directly with libraries such as fitz (PyMuPDF) and pdfplumber, but the extracted text was often corrupted, with undefined characters showing up. This was likely due to the font used in the book.
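
One quick way to spot this kind of corruption is to measure how much of the extracted text actually falls in the Bengali Unicode block (U+0980 to U+09FF). The sketch below is an illustration and was not part of the original workflow; the helper name `bengali_ratio` and any cut-off applied to its result are assumptions.

```python
def bengali_ratio(text: str) -> float:
    """Return the share of alphabetic characters that lie in the Bengali Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    bengali = [ch for ch in letters if "\u0980" <= ch <= "\u09FF"]
    return len(bengali) / len(letters)


print(bengali_ratio("অনুপম"))    # clean Bengali text
print(bengali_ratio("Abycg"))    # Latin letters where Bengali was expected
```

A page whose ratio is far below what a Bengali book should produce is a sign that the scraper returned the wrong characters for that page.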

Because of this, I opted for optical character recognition (OCR). The advantage of OCR is that it is far more general than a PDF scraper: it reads the rendered page, so the characters it extracts depend on the OCR model rather than on how the PDF encodes its fonts. The trade-off is a longer extraction time, since an AI model has to process every page. This seemed worthwhile to me, as I was able to preserve vital data without having to repair corrupted text and possibly introduce more errors.

The final verdict is that the data preserved by OCR is worth more than PDF scraping, which is unreliable across the different fonts used for Bengali.
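
Since OCR is the slow step, it can help to cache its output on disk and only rerun it when the cache is missing. The following is a minimal sketch of that idea, not the code used in this notebook; `load_or_compute` and `fake_ocr` are hypothetical names, and `fake_ocr` stands in for the real OCR call.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_or_compute(cache_file: Path, compute: Callable[[], str]) -> str:
    """Return cached text if the file exists; otherwise run compute() and store the result."""
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    text = compute()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text, encoding="utf-8")
    return text


# Demo with a stand-in for the slow OCR call (hypothetical; swap in the real OCR function).
calls = []


def fake_ocr() -> str:
    calls.append(1)
    return "ওসিআর টেক্সট"


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "data" / "unclean_text.txt"
    first = load_or_compute(cache, fake_ocr)
    second = load_or_compute(cache, fake_ocr)  # served from disk; fake_ocr is not called again

print(first == second, len(calls))
```

The catch with this design is staleness: if the PDF or the OCR settings change, the cache file has to be deleted by hand.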

Afterwards, some cleaning was done on the data to make it more usable.

In [38]:
# This will scan the pdf and output texts based on an optical character recognition model
def ocr_with_tesseract(pdf_path):
    images = convert_from_path(pdf_path, dpi=300)
    text = ""
    for img in images:
        text += pytesseract.image_to_string(img, lang='ben') + "\n"
    return text

In [39]:
# this will just pipe text into a txt file
def save_text_to_file(text: str, file_path: str) -> None:
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(text)

In [42]:
# these will be used to clean up the text; there are often unnecessary characters and numbers at the start of every page
# some of them still remain after cleanup, but they shouldn't affect the model very much
# this version is much less aggressive than the previous ones
def precise_bengali_cleaner(text: str) -> str:
    text = re.sub(r'\x0c\[[^\]]+\]', '', text)
    text = re.sub(r'(?<!\S)\d[\d ১২৩৪৫৬৭৮৯০?]+\b', '', text)
    text = re.sub(r'[€£][\d]+', '', text)    
    text = text.replace('\x0c', '').strip()
    return text
    
def ultra_precise_cleaner(text: str) -> str:
    text = re.sub(r'^\[লুল\n', '', text, flags=re.MULTILINE)
    text = re.sub(r'জআললাইন ব্যাচ”\n?', '', text)
    text = re.sub(r'^\?$\n', '', text, flags=re.MULTILINE)
    
    return text

In [43]:
#first we're gonna extract the text
#then save the unclean text in a file just because
#then we clean the text and save the final clean text in another file. This one we will use
original_text = ocr_with_tesseract(pdf_path)

In [44]:
save_text_to_file(original_text, data_path + "unclean_text.txt")
cleaned_text = precise_bengali_cleaner(original_text)
cleaned_text = ultra_precise_cleaner(cleaned_text)
save_text_to_file(cleaned_text, data_path + "final_clean_text.txt")
final_data_path = data_path + "final_clean_text.txt"

## Text Splitting

Before vectorizing the text, it has to be split into chunks. There are several reasons for this: with large documents, much of the data can be lost to truncation, and embedding models can only take a limited amount of input, so splitting the text is a vital step.

Next comes the question of how much to "chunk": 10 characters? 100? 1000? If the chunk size is too small, there will be far too many vectors to choose from, and the retrieved chunks may be too short to carry any meaning. If the chunk size is too large, we may run into the input limits of the models, or a single chunk may end up mixing several unrelated pieces of information. The chunk size can be thought of as a hyperparameter that needs to be tuned to the specific PDF (careful tuning may be necessary for optimal results).
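
To get a feel for this trade-off, the sketch below counts how many chunks different settings produce on a 5,000-character stand-in text. It uses a plain fixed-size sliding window, which is a simplification: `RecursiveCharacterTextSplitter` splits on separators first, so its chunk counts on real text will differ. The function name `sliding_chunks` is an assumption made for this example.

```python
def sliding_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share `overlap` characters with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last window is not fully contained in the previous one.
    stop = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


sample = "ক" * 5000  # stand-in for 5,000 characters of cleaned text
for size, overlap in [(100, 20), (1000, 200), (4000, 800)]:
    chunks = sliding_chunks(sample, size, overlap)
    print(f"chunk_size={size:>4} overlap={overlap:>3} -> {len(chunks)} chunks")
```

Small chunks multiply the number of vectors to search, while very large ones leave only a handful of coarse candidates, which is why the value is worth tuning per document.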

In [45]:
def text_splitter(txt_path : str):
    with open(txt_path, "r", encoding="utf-8") as f:
        cleaned_text = f.read()
    
    bengali_splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n"],
        chunk_size=1000,  
        chunk_overlap=200,  
        length_function=len,
        is_separator_regex=False
    )
    
    # Split the documents
    text_chunks = bengali_splitter.create_documents([cleaned_text])
    return text_chunks

In [46]:
chunkz = text_splitter(final_data_path)
chunkz[5]

Document(metadata={}, page_content='মঞ্জরী কিশলয়যুক্ত কচি ডাল। মুকুল\n\nএকপত্তন একপ্রস্থ\n\nকানগর কল্যাণী যে দেশমাতৃকার সেবায় আত্মনিয়োগ করেছে,\n\nঅনুপমের এই আত্মোপলন্ধি এখানে প্রকাশিত।\n\n \n\n \n\nমূল আলোচ্য বিষয়')

## Vectorizing the data

Vectorizing the data means turning each chunk into an n-dimensional vector using an embedding model. With the data represented as vectors, we can retrieve it by measuring how close the query (which is also turned into a vector) is to each chunk. In other words, we look at how far apart the query vector and each data vector are, and the chunks whose vectors lie closest to the query vector are the ones we retrieve.
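
As a small illustration of "closeness", the sketch below ranks two made-up 3-dimensional vectors against a query vector using cosine similarity. The vectors and chunk labels are invented for the example, and cosine similarity is just one common choice: the metric a vector store actually uses depends on how it is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings"; a real model produces hundreds of dimensions.
chunk_vectors = {
    "chunk about the wedding": [0.9, 0.1, 0.0],
    "chunk about the train journey": [0.1, 0.8, 0.3],
}
query_vector = [0.8, 0.2, 0.1]

# Pick the chunk whose vector points in the direction most similar to the query.
best = max(chunk_vectors, key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]))
print(best)
```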

In [47]:
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    model_kwargs={"device": "cpu"}
)

In [48]:
vector_db = Chroma.from_documents(
    documents=chunkz,
    embedding=embeddings,
    persist_directory="./bengali_chroma_db"
)



In [49]:
# Test retrieval
query = "অনুপমের ভাষায় সুপুরুষ কাকে বলা হয়েছে?"
results = vector_db.similarity_search(query, k=2)
print(results[0].page_content)

বলা বাহুল্য, আমিও খুব রাগিয়াছিলাম। কোনো গতিকে শস্তুনাথ বিষম জব্দ হইয়া আমাদের পায়ে ধরিয়া
আসিয়া পড়েন, গোঁফের রেখায় তা দিতে দিতে এইটেই কেবল কামনা করিতে লাগিলাম।

কিন্ত, এই আক্রোশের কালো রঙের শ্োতের পাশাপাশি আর-একটা শ্রোত বহিতেছিল যেটার রঙ একেবারেই
কালো নয়। সমস্ত মন যে সেই অপরিচিতার পানে ছুটিয়া গিয়াছিল__এখনো যে তাহাকে কিছুতেই টানিয়া
ফিরাইতে পারি না। দেয়ালটকুর আড়ালে রহিয়া গেল গো। কপালে তার চন্দন আঁকা, গায়ে তার লাল শাড়ি,
মুখে তার লজ্জার রক্তিমা, হৃদয়ের ভিতরে কী যে তা কেমন করিয়া বলিব।

 

,


বাংলা - ইংরেজি *তআইসিটি

নত হইয়া পড়িয়াছিল। হাওয়া আসে, গন্ধ পাই,
পাতার শব্দ শুনি__ কেবল আর একটিমাত্র পা
ফেলার অপেক্ষা-_-এমন সময়ে সেই এক
পদক্ষেপের দুরত্বটুকু এক মুহর্তে অসীম হইয়া
উঠিল!


## Generating a response from LLM

We have now retrieved the data and we also have the original query. The next step is to actually create a meaningful response. The way this is done is by feeding an LLM the retrieved data as a context as well as feeding it the original query. Based on this, the LLM will then generate a response relevant to the query using the data retrieved. 
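
One detail worth keeping in mind when assembling the prompt: the string is sent to the model verbatim, so any indentation inside a triple-quoted f-string becomes part of the prompt. The sketch below is one possible way to build the same Bengali prompt without leading whitespace; `build_rag_prompt` is a hypothetical helper and is not the function used in the cells that follow.

```python
from textwrap import dedent


def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Join retrieved chunks into one context block and place the question after it."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    # Dedent the template first, then fill it in, so multi-line context cannot break dedent.
    template = dedent("""\
        নিচের প্রসঙ্গ ব্যবহার করে প্রশ্নের উত্তর দিন:
        {context}

        প্রশ্ন: {question}
        উত্তর:""")
    return template.format(context=context, question=question)


prompt = build_rag_prompt("অনুপমের ভাষায় সুপুরুষ কাকে বলা হয়েছে?", [" প্রথম অংশ ", "দ্বিতীয় অংশ"])
print(prompt)
```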

In [50]:
from langchain_ollama import OllamaLLM

bangla_llm = OllamaLLM(
    model="kaizu/bn_chat",
    temperature=0.3,
    system="""Always respond in Bengali. Use the context provided. 
           If unsure, say "আমি জানি না\""""
)

In [51]:
def get_context(question: str) -> str:
    docs = vector_db.similarity_search(question, k=2)
    return "\n\n".join(doc.page_content for doc in docs)

In [52]:
def build_prompt(question: str, context: str) -> str:
    return f"""
    নিচের প্রসঙ্গ ব্যবহার করে প্রশ্নের উত্তর দিন:
    {context}

    প্রশ্ন: {question}
    উত্তর: 
    """

In [53]:
def get_llm_response(llm_used, prompt: str) -> str:
    return llm_used.invoke(prompt)

In [59]:
def rag_pipeline(llm_used, question: str) -> str:
    context = get_context(question)
    prompt = build_prompt(question, context)
    response = get_llm_response(llm_used, prompt)
    
    return response.strip()

In [57]:
query = "অনুপমের ভাষায় সুপুরুষ কাকে বলা হয়েছে?"
answer = rag_pipeline(bangla_llm, query)
print(answer)



[Instruction]: এই কাজটি আপনাকে একটি বাক্যের জন্য ইংরেজি এবং সংক্ষিপ্ত অনুবাদ প্রদান করতে বলে। এটি করার পদক্ষেপগুলি এখানে রয়েছেঃ 1. ইংরেজিতে প্রদত্ত পাঠ্যটি পড়ুন, বুঝুন যে আপনি কী জানেন তা বোঝার চেষিকুষ্টতা নিয়োগ করছেন৷ 2. মূল অর্থ বজায় রেখে বাক্যের প্রতিটি শব্দের জন্য ইংরেজি এবং সংক্ষিপ্ত অনুবাদ দেখুন। 3. সঠিকভাবে উভয়ের তুলনা করে একটি চূড়ান্ত প্রদত্ত আউটপুট তৈরি করতে এই দুটির সাথে আপনার অন্ত


I attempted to use Hugging Face, but given how complicated the interface and setup were and how little time I had, I opted for the Ollama route. If I had a bit more time and no exams, I would implement a way to use Hugging Face, Ollama, or even custom models. For now, we will be using the "bn_chat" LLM by "kaizu".
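
For what it's worth, here is a rough sketch of how such a switch could look: a small registry that maps a backend name to a factory returning a "prompt in, answer out" callable. This is only an idea, not something implemented in this notebook. The names `LLM_BACKENDS`, `register_backend` and `load_llm` are hypothetical, and the `"echo"` backend is a placeholder so the example runs without any model installed.

```python
from typing import Callable, Dict

# Maps a backend name to a factory that returns a "prompt -> answer" callable.
LLM_BACKENDS: Dict[str, Callable[[], Callable[[str], str]]] = {}


def register_backend(name: str):
    """Decorator that records a factory function under the given backend name."""
    def decorator(factory):
        LLM_BACKENDS[name] = factory
        return factory
    return decorator


@register_backend("echo")
def make_echo_llm() -> Callable[[str], str]:
    # Hypothetical placeholder backend for testing; a real factory would build
    # an Ollama or Hugging Face client and return its invoke function.
    return lambda prompt: f"[echo] {prompt}"


def load_llm(name: str) -> Callable[[str], str]:
    """Look up a backend by name and build it, failing loudly on unknown names."""
    if name not in LLM_BACKENDS:
        raise ValueError(f"unknown LLM backend {name!r}; choose from {sorted(LLM_BACKENDS)}")
    return LLM_BACKENDS[name]()


llm = load_llm("echo")
print(llm("hello"))
```

With this shape, adding another backend would mean registering one more factory, and the rest of the pipeline would only ever see a callable.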

Below, I created a class implementation of the whole thing. It should generally work for any PDF written in Bangla, but in this implementation the cleaning has been tailored to "HSC26-Bangla1st-Paper.pdf". Here are the usage instructions:

```
pdf_path = "book/HSC26-Bangla1st-Paper.pdf" #current directory acts as the root path, any pdf file in the book directory can be loaded.
data_path = "data/"

rag_model = BanglaRAG(the_pdf_path=pdf_path)
```

Run this in the first cell; it will take a bit of time to start up. Once it is done, you will be able to enter any prompt you wish.

`rag_model.rag_pipeline("অনুপমের ভাষায় সুপুরুষ কাকে বলা হয়েছে?")`

This is one of the sample questions you may run. Since the LLM is running locally, it may take a bit of time to produce an actual output.

In [58]:
class BanglaRAG():
    
    def __init__(self, the_pdf_path, data_path='data/', llm="ollama"):
        print("Loading LLM")
        self.llm = OllamaLLM(
                    model="kaizu/bn_chat",
                    temperature=0.3,  # less creativity, more factual, context-based answers
                    system="""Always respond in Bengali. Use the context provided. 
                           If unsure, say "আমি জানি না\""""
                )
        # self.llm = self.load_ollama_model if llm == "ollama" else load_hg_model if llm == "hg" else lambda: print("invalid model loaded")
        print("Initializing RAG Pipeline")
        self.embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
            model_kwargs={"device": "cpu"}  # or "cuda" for GPU
        )

        self.pdf_path = the_pdf_path
        self.data_path = data_path
        
        print("Feeding PDF into OCR")
        unclean_text = self.ocr_with_tesseract(self.pdf_path)
        print("Text retrieved")
        self.clean_text = self.precise_bengali_cleaner(unclean_text)
        self.clean_text = self.ultra_precise_cleaner(self.clean_text)
        self.final_text_file_path = self.data_path + "final_cleaned.txt"
        self.save_text_to_file(self.clean_text, self.final_text_file_path)  # probably unnecessary
        
        self.text_chunkz_data = self.text_splitter(self.final_text_file_path)
        self.vector_db = Chroma.from_documents(
            documents=self.text_chunkz_data,
            embedding=self.embeddings,
            persist_directory="./bengali_chroma_db"  # Local storage
        )
        print("Vector database populated, you may now submit your queries")
        
    # def load_ollama_model(self, model_name="kaizu/bn_chat"):
    #     llm = OllamaLLM(
    #         model="kaizu/bn_chat",
    #         temperature=0.3,  # Control creativity (0-1)
    #         system="""Always respond in Bengali. Use the context provided. 
    #                If unsure, say "আমি জানি না\""""
    #     )

    #     return llm

    # def load_hg_model(self,):
    #     pass
        
    def ocr_with_tesseract(self, pdf_path):
        images = convert_from_path(pdf_path, dpi=300)
        text = ""
        for img in images:
            text += pytesseract.image_to_string(img, lang='ben') + "\n"
        return text
    
    def precise_bengali_cleaner(self, text: str) -> str:
        text = re.sub(r'\x0c\[[^\]]+\]', '', text)
        text = re.sub(r'(?<!\S)\d[\d ১২৩৪৫৬৭৮৯০?]+\b', '', text)
        text = re.sub(r'[€£][\d]+', '', text)    
        text = text.replace('\x0c', '').strip()
        return text
    
    def ultra_precise_cleaner(self, text: str) -> str:
        text = re.sub(r'^\[লুল\n', '', text, flags=re.MULTILINE)
        text = re.sub(r'জআললাইন ব্যাচ”\n?', '', text)
        text = re.sub(r'^\?$\n', '', text, flags=re.MULTILINE)
        
        return text

    def save_text_to_file(self, text: str, file_path: str) -> None:
        with open(file_path, 'w', encoding='utf-8') as f:
            f.write(text)
    
    def text_splitter(self, txt_path : str):
        with open(txt_path, "r", encoding="utf-8") as f:
            cleaned_text = f.read()
        
        bengali_splitter = RecursiveCharacterTextSplitter(
            separators=["\n\n", "\n"],
            chunk_size=1000,  
            chunk_overlap=200,  
            length_function=len,
            is_separator_regex=False
        )
        
        # Split the documents
        text_chunks = bengali_splitter.create_documents([cleaned_text])
        return text_chunks

    def rag_pipeline(self, query):
        print("Query received, generating answer")
        docs = self.vector_db.similarity_search(query, k=2)  # Top 2 chunks
        context = "\n\n".join(doc.page_content for doc in docs)
        prompt = f"""
                নিচের প্রসঙ্গ ব্যবহার করে প্রশ্নের উত্তর দিন:
                {context}
            
                প্রশ্ন: {query}
                উত্তর: 
                """
        return self.llm.invoke(prompt)

In [34]:
pdf_path = "book/HSC26-Bangla1st-Paper.pdf"
data_path = "data/"

rag_model = BanglaRAG(the_pdf_path=pdf_path)

Loading LLM
Initializing RAG Pipeline
Feeding PDF into OCR
Text retrieved




Vector database populated, you may now submit your queries


In [56]:
rag_model.rag_pipeline("অনুপমের ভাষায় সুপুরুষ কাকে বলা হয়েছে?")



'[Instruction:] এই কাজটি একটি বাক্য বা কবিতার জন্য সৃজনশীলভাবে অনুবাদ করা। এটি করার জন্য, আপনাকে প্রদত্ত পাঠটি বুঝতে হবে এবং একই অর্থ ধরে রাখতে এটিকে অন্য ভাষায় রূপ দিতে কয়েক শব্দে পরিবর্তন করতে হতে পারে।</s>\n'

This `BanglaRAG` class will now be put in a file and imported by an API file to provide a conversational application for the whole thing. That implementation will use FastAPI, as it's quite lightweight and easy to use.