### Namaste! This notebook demonstrates question answering on Hindi data using Indic LLM - Airavata, a multilingual embedding model and Chroma db

Lets get started by collecting dataset, if you already have the dataset, parsed and prepared, you can skip through this part. We will be taking 5 URLs related to income tax, the url comprises of faq as well as unstructured text. The topics discussed include various sections for deduction on tax, various faqs related to ITR1 for indiviudals and various forms required.

## Crawling URLs

In [42]:
urls =['https://www.incometax.gov.in/iec/foportal/hi/help/e-filing-itr1-form-sahaj-faq',
        'https://www.incometax.gov.in/iec/foportal/hi/help/e-filing-itr4-form-sugam-faq',
       'https://navbharattimes.indiatimes.com/business/budget/budget-classroom/income-tax-sections-know-which-section-can-save-how-much-tax-here-is-all-about-income-tax-law-to-understand-budget-speech/articleshow/89141099.cms',
       'https://www.incometax.gov.in/iec/foportal/hi/help/individual/return-applicable-1',
       'https://www.zeebiz.com/hindi/personal-finance/income-tax/tax-deductions-under-section-80g-income-tax-exemption-limit-how-to-save-tax-on-donation-money-to-charitable-trusts-126529'
]


I will be using one of my favorite libraries to crawl website - [Markdown Crawler](https://github.com/paulpierre/markdown-crawler). You can install it using the command mentioned below. It parses the website into markdown format and stores them in markdown files. ANd one interesting thing, although we will be crawling only the urls, but it also has the capability to parse linked urls to website(look at the depth paramaneter).
For now lets continue with what we were doing!

In [43]:
!pip install markdown-crawler
!pip install markdownify



In [44]:
from markdown_crawler import md_crawl
def crawl_urls(urls: list, storage_folder_path: str, max_depth=0):
    """Crawl a list of URLs and store results.
    Parameters:
    - urls: URLs to crawl.
    - storage_folder_path: Location for results; folder is auto-created.
    - max_depth: Link depth to crawl; 0 (recommended) means only the listed URLs.

    """
    for url in urls:
        print(f"Crawling {url}")
        md_crawl(
            url,
            max_depth=max_depth,
            base_dir=storage_folder_path,
            is_links=True
        )

In [45]:
crawl_urls(urls= urls, storage_folder_path = './incometax_documents/')
#you do not need to make a folder intitially. Md Crawler handles that for you.

Crawling https://www.incometax.gov.in/iec/foportal/hi/help/e-filing-itr1-form-sahaj-faq
Crawling https://www.incometax.gov.in/iec/foportal/hi/help/e-filing-itr4-form-sugam-faq
Crawling https://navbharattimes.indiatimes.com/business/budget/budget-classroom/income-tax-sections-know-which-section-can-save-how-much-tax-here-is-all-about-income-tax-law-to-understand-budget-speech/articleshow/89141099.cms
Crawling https://www.incometax.gov.in/iec/foportal/hi/help/individual/return-applicable-1
Crawling https://www.zeebiz.com/hindi/personal-finance/income-tax/tax-deductions-under-section-80g-income-tax-exemption-limit-how-to-save-tax-on-donation-money-to-charitable-trusts-126529


Once the urls have been crawled and data has been stored in markdown files, its time to parse the content in those files. But before that some notes on parsing:


It is important to parse, because there is still some noise left in the markdown files, and the embedding model that we are gonna use and llm has token limit, so we have to not exceed that.

Before we move on to the code, some parameters that I need to introduce. The embedding model that we are gonna use has a token limit of 512 tokens. It truncates after that. We will look more about this embedding model and why I chose this in the coming section but for now only thing we have to know is that the limit is 512 tokens so we will try to keep section less than 512 tokens.

## Parsing and Chunking Documents

Lets first write a function to extract content out of a file. We will be use python library markdown and beautifulsoup for it. Below are commands to install them

### Parsing

In [46]:
!pip install beautifulsoup4
!pip install markdown



In [49]:
# lets first write a function to extract content out of a file
import markdown
from bs4 import BeautifulSoup

def read_markdown_file(file_path):
    """Read a Markdown file and extract its sections as headers and content."""
    # Open the markdown file and read its content
    with open(file_path, 'r', encoding='utf-8') as file:
        md_content = file.read()

    # Convert markdown to HTML
    html_content = markdown.markdown(md_content)

    # Parse HTML content
    soup = BeautifulSoup(html_content, 'html.parser')

    sections = []
    current_section = None

    # Loop through HTML tags
    for tag in soup:
        # Start a new section if a header tag is found
        if tag.name and tag.name.startswith('h'):
            if current_section:
                sections.append(current_section)
            current_section = {'header': tag.text, 'content': ''}

        # Add content to the current section
        elif current_section:
            current_section['content'] += tag.get_text() + '\n'

    # Add the last section
    if current_section:
        sections.append(current_section)

    return sections

In [50]:
#lets look at the output of one of the files:

sections = read_markdown_file('./incometax_documents/business-budget-budget-classroom-income-tax-sections-know-which-section-can-save-how-much-tax-here-is-all-about-income-tax-law-to-understand-budget-speech-articleshow-89141099-cms.md')

Oh yes! the content looks cleaner now, but the problem is sometimes the section are good,and sometiems they are note, some sections with especially with empty headers are unnecessary. So lets write a funcito to parse a section, we will pass  a particular section only if the header and content both are non empty and header does not belong to any of these - ['main navigation','navigation', 'footer'].

In [51]:
def pass_section(section):
    # List of headers to ignore
    headers_to_ignore = ['main navigation', 'navigation', 'footer', 'advertisement']

    # Check if the header is not in the ignore list and both header and content are non-empty
    if section['header'].lower() not in headers_to_ignore and section['header'].strip() and section['content'].strip():
        return True
    return False

In [111]:
passed_sections = []
import os
# Iterate through all Markdown files in the folder
for filename in os.listdir('incometax_documents'):
    if filename.endswith('.md'):
        file_path = os.path.join('incometax_documents', filename)
        # Extract sections from the current Markdown file
        all_sections = read_markdown_file(file_path)
        # Filter sections based on the pass_section function
        passed_sections.extend(section for section in all_sections if pass_section(section))

print(len(passed_sections))

23


### Chunking

**Embedding Model** <br>
The embedding model that we are gonna use is [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base).  As mentioned on its huggingface page it supports 100 languages, although low-resource languages may see performance degradation But i have observed it performs fairly decent for Hindi. If you want better accuracy you can use try [BGE M3](https://huggingface.co/BAAI/bge-m3) as well but that is pretty resource intesive. Also OpenAI embeddings might perform well here, but lets stick to everything opensource for now. Hence a light weight but decent model. E5 it is.

Before diving into the function, let's discuss chunking. Every Retrieval-Augmented Generation (RAG) system uses two key models: an 'Embedding Model' for generating embeddings (used for retrieval) and a 'Language Model' for generating answers. Both models have token limits, so it's essential to split unstructured content accordingly.

<h3>Embedding Model Token Limit</h3>
The embedding model we’re using, multilingual-e5-base, has a token limit of 512. While the language model we’ll use has a larger token limit, we’ll focus on the embedding model for now.



<h3>Chunking Function</h3>
<p>The <code>chunk_text</code> function splits a section of text into smaller, manageable chunks while ensuring that no chunk exceeds the specified token limit (512 tokens by default). It splits the text by single newlines to create logical subsections, then further divides them into chunks, maintaining overlap to preserve context between chunks. This approach ensures that even large sections of text are processed correctly without losing important information.</p>

In [119]:
def chunk_text(section, max_tokens=512, overlap=50):
    header = section['header']
    text = section['content']

    # Split the content into smaller parts based on single newlines or other logical separators
    sections = text.split("\n\n")  # Split by single newlines first

    all_chunks = []
    current_chunk = ""

    # Process each line (or smaller logical part) within the section
    for line in sections:
        # Tokenize by splitting by spaces to count tokens (words)
        tokens = line.split()  # This treats each word as a token

        # If adding this line to the current chunk doesn't exceed the max limit
        if len(current_chunk.split()) + len(tokens) <= max_tokens:
            # Add the line to the current chunk
            current_chunk += " " + line
        else:
            # If the chunk exceeds the limit, create a new chunk with overlap
            if len(current_chunk.split()) > max_tokens - overlap:
                # Add the chunk with the overlap
                all_chunks.append({'header': header, 'content': current_chunk.strip()})
                current_chunk = " ".join(tokens[-overlap:])  # Start new chunk with overlap
            else:
                # Add the current chunk and reset it for the next line
                all_chunks.append({'header': header, 'content': current_chunk.strip()})
                current_chunk = line  # Reset chunk to the current line

    # Add the last chunk if it exists
    if current_chunk.strip():
        all_chunks.append({'header': header, 'content': current_chunk.strip()})

    return all_chunks


In [120]:
passed_sections = [chunk for section in passed_sections for chunk in chunk_text(section)]

In [121]:
len(passed_sections)

43

## Setting up Vector Store and Ingesting documents

I chose Chroma DB as I could use it in Google Collab without any hosting and it's good for experimentation. But you could also use vector stores of your choice. Here's how you install it.

In [122]:
# !pip install chromadb

This way to initiate Chroma DB creates an in-memory instance of Chroma. This is useful for testing and development, but not recommended for production use. For production you should host it, Please refer to its [documentation](https://docs.trychroma.com/deployment) for details.

In [123]:
import chromadb
chroma_client = chromadb.Client()

Chroma DB offers built-in support for open-source sentence transformers. Here's how to use it:

In [124]:
from chromadb.utils import embedding_functions
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="intfloat/multilingual-e5-base")

Let's create the collection! <br>
**Note -** We use `metadata={"hnsw:space": "cosine"}` because ChromaDB's default distance is Euclidean, but cosine distance is typically preferred for RAG purposes.

In [128]:
collection = chroma_client.create_collection(name="income_tax_hindi", embedding_function= sentence_transformer_ef, metadata={"hnsw:space": "cosine"})

In [129]:
#command for deletion in case you need to recreate it
# chroma_client.delete_collection(name="income_tax_hindi")

Ingesting the documents

In [130]:
collection.add(
    documents=[section['content'] for section in passed_sections],
    metadatas = [{'header': section['header']} for section in passed_sections],
    ids=[str(i) for i in range(len(passed_sections))]
)

In [131]:
#querying the results
docs = collection.query(
    query_texts=["सेक्शन 80 C की लिमिट क्या होती है"],
    n_results=3
)

In [132]:
docs

{'ids': [['31', '33', '27']],
 'embeddings': None,
 'documents': [['सेक्शन 80डी के अलावा इनकम टैक्स कानून में दो और सेक्शन हैं, जिनका आप स्वास्थ्य से जुड़े खर्च का लाभ उठा सकते हैं। सेक्शन 80डीडी आप पर आश्रित किसी विकलांग व्यक्ति के लिए चिकित्सा खर्च से संबंधित है। आश्रित में जीवनसाथी, बच्चे, पैरेंट्स, भाई या बहन हो सकते हैं। इनकम टैक्स में छूट इस बात पर निर्भर करता है कि आपके आश्रित की विकलांगता कितनी गंभीर है। अगर आश्रित 40 फीसदी तक विकलांग है तो टैक्स बचत के लिए 75,000 रुपये तक का मेडिकल खर्च कवर किया जा सकता है। अगर आश्रित 80 फीसदी तक विकलांग है तो टैक्स बचत के लिए 1,25,000 रुपये तक का मेडिकल खर्च कवर किया जा सकता है।',
   'अगर आप खुद 40 फीसदी से अधिक विकलांग हैं तो आप इस सेक्शन के तहत Income Tax छूट पा सकते हैं। हालांकि, सेक्शन 80यू और सेक्शन 80डीडी का लाभ एक साथ नहीं उठाया जा सकता। इस सेक्शन में भी टैक्स छूट का लाभ सेक्शन 80डीडी की तरह ही होता है। फर्क सिर्फ इतना है कि यह सेक्शन खुद की विकलांगता से जुड़ा है, जबकि सेक्शन 80डी आश्रितों से जुड़ा है।',
   'सामाजिक सुरक्षा योजनाओं या न

## Loading the geneartion Model Airavata

As mentioned, we will be using Airavta, and since it is open-source we will be using transformers and quantization techniques to load the model. You can check more about ways to load open-source LLMs here and here. A T4 GPU environment is needed in collab to run this.

In [133]:
# !pip install bitsandbytes>=0.39.0
# !pip install --upgrade accelerate transformers

In [134]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

cuda


Below functions are used for loading and using Airavata, I used them from it's [official huggingface page](https://huggingface.co/ai4bharat/Airavata
)

In [136]:
def create_prompt_with_chat_format(messages, bos="<s>", eos="</s>", add_bos=True):
    formatted_text = ""
    for message in messages:
        if message["role"] == "system":
            formatted_text += "<|system|>\n" + message["content"] + "\n"
        elif message["role"] == "user":
            formatted_text += "<|user|>\n" + message["content"] + "\n"
        elif message["role"] == "assistant":
            formatted_text += "<|assistant|>\n" + message["content"].strip() + eos + "\n"
        else:
            raise ValueError(
                "Tulu chat template only supports 'system', 'user' and 'assistant' roles. Invalid role: {}.".format(
                    message["role"]
                )
            )
    formatted_text += "<|assistant|>\n"
    formatted_text = bos + formatted_text if add_bos else formatted_text
    return formatted_text


def inference(input_prompts, model, tokenizer):
    input_prompts = [
        create_prompt_with_chat_format([{"role": "user", "content": input_prompt}], add_bos=False)
        for input_prompt in input_prompts
    ]

    encodings = tokenizer(input_prompts, padding=True, return_tensors="pt")
    encodings = encodings.to(device)

    with torch.inference_mode():
        outputs = model.generate(encodings.input_ids, do_sample=False, max_new_tokens=1024)

    output_texts = tokenizer.batch_decode(outputs.detach(), skip_special_tokens=True)

    input_prompts = [
        tokenizer.decode(tokenizer.encode(input_prompt), skip_special_tokens=True) for input_prompt in input_prompts
    ]
    output_texts = [output_text[len(input_prompt) :] for input_prompt, output_text in zip(input_prompts, output_texts)]
    return output_texts

In [137]:
model_name = "ai4bharat/Airavata"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(model_name,  quantization_config=quantization_config, torch_dtype=torch.bfloat16)

`low_cpu_mem_usage` was None, now default to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Now the interesting part: prompt to generate the answer. Here, we create a prompt that instructs the language model to generate answers based on specific guidelines. The instructions are simple: first, the model reads and understands the question, then reviews the context provided. It uses this information to craft a clear, concise, and accurate response. If you look at it carefully, this is the Hindi version of the typical RAG prompt.

In [138]:
prompt ='''आप एक बड़े भाषा मॉडल हैं जो दिए गए संदर्भ के आधार पर सवालों का उत्तर देते हैं। नीचे दिए गए निर्देशों का पालन करें:

1. **प्रश्न पढ़ें**:
    - दिए गए सवाल को ध्यान से पढ़ें और समझें।

2. **संदर्भ पढ़ें**:
    - नीचे दिए गए संदर्भ को ध्यानपूर्वक पढ़ें और समझें।

3. **सूचना उत्पन्न करना**:
    - संदर्भ का उपयोग करते हुए, प्रश्न का विस्तृत और स्पष्ट उत्तर तैयार करें।
    - यह सुनिश्चित करें कि उत्तर सीधा, समझने में आसान और तथ्यों पर आधारित हो।

### उदाहरण:

**संदर्भ**:
    "नई दिल्ली भारत की राजधानी है और यह देश का प्रमुख राजनीतिक और प्रशासनिक केंद्र है। यह शहर ऐतिहासिक स्मारकों, संग्रहालयों और विविध संस्कृति के लिए जाना जाता है।"

**प्रश्न**:
    "भारत की राजधानी क्या है और यह क्यों महत्वपूर्ण है?"

**प्रत्याशित उत्तर**:
    "भारत की राजधानी नई दिल्ली है। यह देश का प्रमुख राजनीतिक और प्रशासनिक केंद्र है और ऐतिहासिक स्मारकों, संग्रहालयों और विविध संस्कृति के लिए जाना जाता है।"

### निर्देश:

अब, दिए गए संदर्भ और प्रश्न का उपयोग करके उत्तर दें:

**संदर्भ**:
{docs}

**प्रश्न**:
{query}

उत्तर:

'''

Putting it all together

In [139]:
def generate_answer(query):
  docs =  collection.query(
    query_texts=[query],
    n_results=3
)
  docs = [doc for doc in docs['documents'][0]]
  docs = "\n".join(docs)
  formatted_prompt = prompt.format(docs = docs,query = query)
  answers = inference([formatted_prompt], model, tokenizer)
  return answers[0]

In [140]:
questions = [
    'सेक्शन 80डीडी के तहत विकलांग आश्रित के लिए कौन से मेडिकल खर्च पर टैक्स छूट मिल सकती है?',
    'क्या सेक्शन 80यू और सेक्शन 80डीडी का लाभ एक साथ उठाया जा सकता है?',
    'सेक्शन 80 C की लिमिट क्या होती है?'
]

for question in questions:
    answer = generate_answer(question)
    print(f"Question: {question}\nAnswer: {answer}\n")



Question: सेक्शन 80डीडी के तहत विकलांग आश्रित के लिए कौन से मेडिकल खर्च पर टैक्स छूट मिल सकती है?
Answer: आश्रित व्यक्ति के लिए टैक्स छूट के लिए पात्र होने के लिए, उन्हें 40% से अधिक विकलांग होना चाहिए। यदि वे 40% से अधिक विकलांग हैं, तो वे 40,000 रुपये तक के चिकित्सा खर्च पर कर छूट प्राप्त कर सकते हैं। यदि वे 60% से अधिक विकलांग हैं, तो वे 60,000 रुपये तक के चिकित्सा खर्च पर कर छूट प्राप्त कर सकते हैं। यदि वे 80% से अधिक विकलांग हैं, तो वे 80,000 रुपये तक के चिकित्सा खर्च पर कर छूट प्राप्त कर सकते हैं।

Question: क्या सेक्शन 80यू और सेक्शन 80डीडी का लाभ एक साथ उठाया जा सकता है?
Answer: नहीं।

Question: सेक्शन 80 C की लिमिट क्या होती है?
Answer: सेक्शन 80 सी की सीमा 1.5 लाख रुपये है।

