## Install Packages

In [3]:
!pip install PyPDF2 transformers nltk
!pip install sentence-transformers
!pip install faiss-gpu
!pip install gensim
!pip install huggingface_hub
!pip install rank_bm25
!pip install accelerate
!pip install einops
!pip install bitsandbytes
!pip install peft
!pip install nltk rank_bm25
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## HuggingFace Login 🤗

In [4]:
from huggingface_hub import login
login('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')

## Import Packages

In [5]:
import os
import PyPDF2
import torch
import re
import nltk
import gensim
import numpy as np
from gensim.models import Word2Vec
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from transformers import AutoTokenizer, AutoModel, BertTokenizer, BertModel, AutoModelForSeq2SeqLM, AutoModelForQuestionAnswering, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer
import faiss
import gensim.downloader as api
import string
from rank_bm25 import BM25Okapi

## GPU Configuration

In [6]:
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)

# Training RAG For USA Visa Inquiry

## Data Preparation

Reads all PDF files from the policy_directory folder and extracts their text into a text variable.


*   Loops through all files in policy_directory.
*   For files ending with .pdf, it reads their pages using PyPDF2.PdfReader and appends the text to the documents list.



In [7]:
#Load Data
documents_directory = '/content/'
documents = []
text = ""
for file in os.listdir(documents_directory):
    if file.endswith('.pdf'):
        pdf_path = documents_directory + file
        pdf_reader = PyPDF2.PdfReader(pdf_path)
        for page_num in range(len(pdf_reader.pages)):
           page = pdf_reader.pages[page_num]
           text += page.extract_text()
        documents.append(text)

Splits the extracted text into smaller chunks for processing.

*   Each chunk is chunk_size characters long.
*   Adjacent chunks overlap by chunk_overlap characters to retain context.



In [8]:
#Creating Chunks
chunk_size = 1000
chunk_overlap = 50
chunks = [text[i:i + chunk_size] for i in range(0, len(text) - chunk_size + 1, chunk_size - chunk_overlap)]

Generates sentence embeddings for each chunk using the SentenceTransformer model.

*   all-MiniLM-L6-v2: A lightweight pre-trained model for generating sentence embeddings.
*   Converts each chunk into a dense vector representation called embedding.



In [9]:
# Load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Create embeddings for the chunks
embeddings = [model.encode(chunk) for chunk in chunks]
# Define the dimension of the embeddings
dimension = embeddings[0].shape[0]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Creates a FAISS index to store embeddings and perform similarity searches.

*   IndexFlatL2: A type of FAISS index optimized for L2 distance (Euclidean distance).
*   Adds the embeddings to the index for retrieval.



In [10]:
# Create a FAISS index
index = faiss.IndexFlatL2(dimension)
# Add the embeddings to the index
index.add(np.array(embeddings).astype('float32'))

## Searching and Retrieving Results

Searches for the top 10 most similar chunks to a given query

*   Encodes the query using the same all-MiniLM-L6-v2 model.
*   Searches for the closest matches in the FAISS index and retrieves corresponding chunks.



In [11]:
query = "Applicant’s Interview with a Consular Officer"
embedded_query = model.encode(query)
distances, indices = index.search(np.array([embedded_query]).astype('float32'), 10)
retrieved_examples = [chunks[i] for i in indices[0]]
retrieved_examples

['ill not be allowed during the interview, nor can anyone accompany an applicant to \nhis or her personal interview. \n \nDuring the interview the consular o fficer will question the applicant about his or her planned activity in the \nU.S.  It is at this initial stage that clear and concise information about the purpose of the travel abroad \nshould be explained and any supporting documentation submitted for review by the consular officer. \n \nFor example, delegates to an ISO or IEC technical co mmittee or subcommittee meeting being hosted in the \nU.S. might wish to bring invitation letters, as well as copies of official meeting documents, such as calling \nnotices, meeting logistical information (venue, hotel info rmation, etc.) and draft agen das to the interview.   \n[NOTE:  See Section 4 of this document for additional information .] \n \nVerbal answers must match the information in the invitation letter or any other supporting documentation \nthat had been presented to the cons

## Cosine Similarity Function

In [12]:
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

## Implementing MMR (Maximal Marginal Relevance)

Improves diversity in retrieved documents using MMR.

*   Balances relevance to the query and novelty among retrieved results.
*   Iteratively selects the chunk with the highest MMR score.



In [13]:
def retrieve_mmr(query_embedding, embeddings, k=5, lambda_=0.5):
    scores = np.dot(embeddings, query_embedding)
    selected_indices = []
    for _ in range(k):
        max_mmr_score = float('-inf')
        max_mmr_index = None
        for i in range(len(scores)):
            if i in selected_indices:
                continue
            similarity = cosine_similarity(embeddings[i], query_embedding)
            max_sim = max([cosine_similarity(embeddings[i], embeddings[j]) for j in selected_indices if j != i]) if selected_indices else 0
            mmr_score = lambda_ * similarity - (1 - lambda_) * max_sim
            if mmr_score > max_mmr_score:
                max_mmr_score = mmr_score
                max_mmr_index = i
        selected_indices.append(max_mmr_index)
    return [(chunks[i], scores[i]) for i in selected_indices]

## BM25 Reranking

Uses BM25 to rerank retrieved chunks based on the query.

*   BM25 is an information retrieval algorithm that scores documents based on term frequency and document length.
*   Ranks and returns the top k passages.



In [14]:
def bm25_rerank(query, passages, k):
    tokenized_corpus = [word_tokenize(doc) for doc in passages]
    bm25 = BM25Okapi(tokenized_corpus)
    tokenized_query = word_tokenize(query)
    doc_scores = bm25.get_scores(tokenized_query)
    passage_scores = list(zip(passages, doc_scores))
    top_k_passages = [
        passage.replace('\n', ' ').replace('�', ' ').replace('\n', ' ')
        for passage, score in passage_scores[:5]
    ]
    return top_k_passages

In [15]:
print("\nQuery Results Sorted BM25 ReRanking:\n")
sorted_examples = bm25_rerank(query, retrieved_examples, 5)
sorted_examples


Query Results Sorted BM25 ReRanking:



['ill not be allowed during the interview, nor can anyone accompany an applicant to  his or her personal interview.    During the interview the consular o fficer will question the applicant about his or her planned activity in the  U.S.  It is at this initial stage that clear and concise information about the purpose of the travel abroad  should be explained and any supporting documentation submitted for review by the consular officer.    For example, delegates to an ISO or IEC technical co mmittee or subcommittee meeting being hosted in the  U.S. might wish to bring invitation letters, as well as copies of official meeting documents, such as calling  notices, meeting logistical information (venue, hotel info rmation, etc.) and draft agen das to the interview.    [NOTE:  See Section 4 of this document for additional information .]    Verbal answers must match the information in the invitation letter or any other supporting documentation  that had been presented to the consular officer.

## Integrating Language Model (Meta-LLaMA)

In [16]:
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model1 = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

tokenizer_config.json:   0%|          | 0.00/51.0k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/654 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/187 [00:00<?, ?B/s]



In [28]:
model_size = model1.get_memory_footprint()
model_size = model_size / (1024*1024*1024)
# Print model size
print(f"Model size: {model_size:.2f} GB")

Model size: 14.96 GB


## Guidelines For RAG Model

In [17]:
system_prompt = """You are an expert human resource assistant that answers questions on the given queries only from the given context.
You are given the context from the US Visa System and you are required to extract out the details required from the query. provide your answers from those visa information only.
If you don't know the answer, just say "Unable To retrieve information about the particular context which you asked."""

init_reply = "Please provide the question and extracted text for which I need to produce structured answers."

## Similarity Searching Mechanism

*   This function retrieves the top k most relevant documents for a given query using the FAISS index and includes support for Maximal Marginal Relevance (MMR).
*   Similar to search_mmr, this function retrieves relevant chunks for a query but then uses BM25 to further rerank the results.

In [18]:
def search_mmr(query: str, k: int = 2):
    """Searching MMR..."""
    embedded_query = model.encode(query)
    distances, indices = index.search(np.array([embedded_query]).astype('float32'), k)
    retrieved_examples = [chunks[i] for i in indices[0]]
    retrieved_documents = [{'text': example} for example in retrieved_examples]
    scores = [distance for distance in distances[0]]
    return scores, retrieved_documents

def search_bm25(query: str, k: int = 10):
    """Searching Applying BM25..."""
    embedded_query = model.encode(query)
    distances, indices = index.search(np.array([embedded_query]).astype('float32'), k)
    retrieved_examples = [chunks[i] for i in indices[0]]
    sorted_examples = bm25_rerank(query, retrieved_examples, 5)
    return sorted_examples

Prepares the input for a conversational language model (such as Meta-LLaMA or GPT-like models) by creating a structured prompt.

In [19]:
def create_prompt_template_mistral_phi3(system_prompt, init_reply, prompt):
  return [
          {"role": "user", "content": system_prompt},
          {"role": "assistant", "content": init_reply},
          {"role": "user", "content": prompt}
      ]

## Formatting Prompts

 Prepares the prompt by appending retrieved documents for context.

*   Creates a formatted string combining the user's query with the retrieved document snippets.



In [20]:
def format_prompt_mmr(prompt, retrieved_documents, k):
    """\nFormatting MMR Prompt...\n"""
    formatted_prompt = prompt + "\n\nRetrieved Documents:\n"
    for i, doc in enumerate(retrieved_documents):
        if i >= k:
            break
        formatted_prompt += f"{i+1}. {doc['text']}\n"
    return formatted_prompt

def format_prompt_bm25(prompt, retrieved_documents, k):
    """\nFormating BM25 Prompt...\n"""
    formatted_prompt = prompt + "\nRetrieved Documents:\n"
    for i, doc in enumerate(retrieved_documents):
        if i >= k:
            break
        if isinstance(doc, dict):
            formatted_prompt += f"{i+1}. {doc['text']}\n"
        else:
            formatted_prompt += f"{i+1}. {doc}\n"
    return formatted_prompt

## Generating Response

Generates a response using a pre-trained generative model (like Meta-LLaMA).

*   The messages (formatted prompt) are tokenized using the tokenizer to prepare the model input.
*   The inputs are transferred to the GPU (to('cuda')) for faster computation.
*   The generative model (model1) generates a response based on the input.
*   The generated tokens are decoded back into human-readable text using





In [21]:
# Define function to generate response
def generate(messages):
    with torch.no_grad():
        encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
        model_inputs = encodeds.to('cuda')
        generated_ids = model1.generate(model_inputs, max_new_tokens=300, do_sample=True)
        decoded = tokenizer.batch_decode(generated_ids)
        print(decoded[0])

## RAG Chatbot Implementation

MMR-based Retrieval Chatbot

*   Implements a RAG chatbot using MMR for document retrieval.
*   Retrieves top k documents using MMR.
*   Formats the documents and generates a response using the language model.

In [23]:
# Define RAG chatbot function
def rag_chatbot_mmr(prompt: str, k: int = 5):
    """RAG-based Question Answering Chatbot Using MMR..."""
    scores, retrieved_documents = search_mmr(prompt, k)
    formatted_prompt = format_prompt_mmr(prompt, retrieved_documents, k)
    messages = create_prompt_template_mistral_phi3(system_prompt, init_reply, formatted_prompt)
    return generate(messages)


BM25-based Retrieval Chatbot

*   Implements a RAG chatbot using BM25 for document retrieval.
*   Retrieves top k documents using BM25 reranking.
*   Formats the documents and generates a response.

In [24]:
def rag_chatbot_bm25(prompt: str, k: int = 5):
    """RAG-based Question Answering Chatbot Using BM25..."""
    retrieved_documents = search_bm25(prompt, k)
    formatted_prompt = format_prompt_bm25(prompt, retrieved_documents, k)
    messages = create_prompt_template_mistral_phi3(system_prompt, init_reply,formatted_prompt )
    return generate(messages)

# Testing RAG Model For USA Visa Inquiry

## Q1 What are US Visa types for business?

In [25]:
response = rag_chatbot_mmr("What are US Visa types for business?")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>

You are an expert human resource assistant that answers questions on the given queries only from the given context.
You are given the context from the US Visa System and you are required to extract out the details required from the query. provide your answers from those visa information only.
If you don't know the answer, just say "Unable To retrieve information about the particular context which you asked.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Please provide the question and extracted text for which I need to produce structured answers.<|eot_id|><|start_header_id|>user<|end_header_id|>

What are US Visa types for business?

Retrieved Documents:
1. information:  Ameri can Embassy in Beijing  
http://www.usembassy-china.org.cn/visa Page 5 
 2 U.S. Visa Types for Business Travel 
 
The United States issues two types of visas: Immigrant and Nonimmigrant.   
 
Within the nonimmigrant classification, multiple types

In [26]:
response1 = rag_chatbot_bm25("What are US Visa types for business?")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>

You are an expert human resource assistant that answers questions on the given queries only from the given context.
You are given the context from the US Visa System and you are required to extract out the details required from the query. provide your answers from those visa information only.
If you don't know the answer, just say "Unable To retrieve information about the particular context which you asked.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Please provide the question and extracted text for which I need to produce structured answers.<|eot_id|><|start_header_id|>user<|end_header_id|>

What are US Visa types for business?
Retrieved Documents:
1. information:  Ameri can Embassy in Beijing   http://www.usembassy-china.org.cn/visa Page 5   2 U.S. Visa Types for Business Travel    The United States issues two types of visas: Immigrant and Nonimmigrant.      Within the nonimmigrant classification, multiple types 

## Q2 What is Formal Facilitation Programs?

In [27]:
response2 = rag_chatbot_mmr("What is Formal Facilitation Programs?")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>

You are an expert human resource assistant that answers questions on the given queries only from the given context.
You are given the context from the US Visa System and you are required to extract out the details required from the query. provide your answers from those visa information only.
If you don't know the answer, just say "Unable To retrieve information about the particular context which you asked.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Please provide the question and extracted text for which I need to produce structured answers.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is Formal Facilitation Programs?

Retrieved Documents:
1. common and effective facilitation initiatives in place worldwide: 
 
Formal Facilitation Programs 
Many posts have established formal facilitation programs that enroll major companies and permit their 
employees to obtain expedited appointments, and/or expedited pr

In [None]:
response3 = rag_chatbot_bm25("What is Formal Facilitation Programs?")

## Q3 Can you give Visa Application Procedure Summary?



In [None]:
response4 = rag_chatbot_mmr("Can you give Visa Application Procedure Summary?")

In [None]:
response5 = rag_chatbot_bm25("Can you give Visa Application Procedure Summary?")