<a href="https://colab.research.google.com/github/advik-7/Deep_Learning_projects/blob/main/Basic_RAG_in_Kannada.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install faiss-gpu

Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2


In [6]:
import faiss
import numpy as np
import time
from sentence_transformers import SentenceTransformer

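def read_text_file(file_path):
    # One document per line; skip blank lines so they are not embedded and indexed.
    with open(file_path, 'r', encoding='utf-8') as file:
        return [line.strip() for line in file if line.strip()]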

def vectorize_text(text_data, model):
    return model.encode(text_data, convert_to_numpy=True)

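def create_faiss_index(vectors):
    # IndexFlatL2 performs exact (brute-force) search and reports squared L2 distances.
    index = faiss.IndexFlatL2(vectors.shape[1])
    # FAISS expects a C-contiguous float32 array of shape (n, dim).
    index.add(np.ascontiguousarray(vectors, dtype=np.float32))
    return index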

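def adjust_query_vector(query_vector, required_dim):
    # Safety net only: when the same model encodes both documents and queries,
    # the dimensions already match and the vector is returned unchanged.
    # Zero-padding or truncating does not produce a meaningful embedding if the
    # vectors come from different models.
    current_dim = query_vector.shape[1]
    if current_dim == required_dim:
        return query_vector
    elif current_dim < required_dim:
        padding = np.zeros((query_vector.shape[0], required_dim - current_dim), dtype=np.float32)
        return np.hstack((query_vector, padding))
    else:
        return query_vector[:, :required_dim]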

def query_faiss_index(index, query_vector, k):
    query_vector = np.array(query_vector, dtype=np.float32)
    if query_vector.ndim == 1:
        query_vector = query_vector.reshape(1, -1)
    distances, indices = index.search(query_vector, k)
    return distances, indices

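def retrieve_documents_batch(index, query_vector, k, text_data):
    distances, indices = query_faiss_index(index, query_vector, k)
    # FAISS fills missing results with index -1 when fewer than k vectors are stored;
    # skip those so text_data[-1] is not returned by accident.
    batch = [(text_data[idx].strip(), dist) for idx, dist in zip(indices[0], distances[0]) if idx >= 0]
    return batch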

def generate_augmented_output(query, retrieved_docs_batch):
    combined_documents = "\n".join([f"Document: '{doc}' (Distance: {distance:.4f})" for doc, distance in retrieved_docs_batch])
    response_content = " ".join([doc for doc, _ in retrieved_docs_batch])
    augmented_response = f"Based on the retrieved documents, the information provided suggests: {response_content}"
    output = (
        f"Query: '{query}'\n"
        f"Combined Retrieved Documents:\n{combined_documents}\n"
        f"Augmented Response: '{augmented_response}'\n"
    )
    yield output

if __name__ == "__main__":

    model = SentenceTransformer('sentence-transformers/LaBSE')
    file_path = "/content/Kannada_RAG_practise.txt"
    text_data = read_text_file(file_path)
    vectors = vectorize_text(text_data, model)
    faiss_index = create_faiss_index(vectors)
    query_text = input("Enter a query text in Kannada: ")
    query_vector = vectorize_text([query_text], model)
    required_dim = vectors.shape[1]
    query_vector_adjusted = adjust_query_vector(query_vector, required_dim)
    k = 3
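    # Time only the index search and output formatting (model loading and encoding are excluded).
    # perf_counter() is a monotonic, high-resolution clock suited to sub-millisecond intervals.
    start_time = time.perf_counter()
    retrieved_docs_batch = retrieve_documents_batch(faiss_index, query_vector_adjusted, k, text_data)
    print("\nGenerated augmented response:")
    for augmented_output in generate_augmented_output(query_text, retrieved_docs_batch):
        print(augmented_output)
    end_time = time.perf_counter()
    print(f"\nTime taken for retrieval and generation: {end_time - start_time:.4f} seconds")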


Enter a query text in Kannada:  ಸಂಗೀತ

Generated augmented response:
Query: ' ಸಂಗೀತ'
Combined Retrieved Documents:
Document: 'ಸಂಗೀತವು ಭಾವನೆಗಳನ್ನು ಮೂಡಿಸುತ್ತದೆ.' (Distance: 0.8708)
Document: 'ಸಂಗೀತ ಮನಸ್ಸಿಗೆ ಶಾಂತಿಯನ್ನು ನೀಡುತ್ತದೆ.' (Distance: 0.9279)
Document: 'ಸಂಗೀತವು ಭಾವನೆಗಳಿಗೆ ಜೀವ ಕೊಡುತ್ತದೆ.' (Distance: 0.9672)
Augmented Response: 'Based on the retrieved documents, the information provided suggests: ಸಂಗೀತವು ಭಾವನೆಗಳನ್ನು ಮೂಡಿಸುತ್ತದೆ. ಸಂಗೀತ ಮನಸ್ಸಿಗೆ ಶಾಂತಿಯನ್ನು ನೀಡುತ್ತದೆ. ಸಂಗೀತವು ಭಾವನೆಗಳಿಗೆ ಜೀವ ಕೊಡುತ್ತದೆ.'


Time taken for retrieval and generation: 0.0004 seconds
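
A note on reading the "Distance" values above: `faiss.IndexFlatL2` reports squared Euclidean distances. If the embeddings are unit-length (this is an assumption here; LaBSE as packaged for sentence-transformers is believed to end with a normalization layer, but that should be checked against the loaded model), then the squared distance d2 and the cosine similarity are related by cos = 1 - d2 / 2. The sketch below is pure Python and uses hypothetical helper names (`squared_l2`, `cosine`, `cosine_from_squared_l2`) that do not appear in the cell above. It checks the identity on a small pair of unit vectors and then applies it to the three distances printed in the output.

```python
import math

def squared_l2(a, b):
    # Squared Euclidean distance, the quantity IndexFlatL2 reports.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    # Plain cosine similarity, computed directly for comparison.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_from_squared_l2(d2):
    # Valid only when both vectors have unit length (assumption, see note above).
    return 1.0 - d2 / 2.0

# Sanity check on two unit vectors.
a = [0.6, 0.8]
b = [1.0, 0.0]
d2 = squared_l2(a, b)
assert abs(cosine_from_squared_l2(d2) - cosine(a, b)) < 1e-9

# Distances copied from the notebook output above.
reported = [0.8708, 0.9279, 0.9672]
similarities = [cosine_from_squared_l2(d) for d in reported]
for d, s in zip(reported, similarities):
    print(f"squared L2 = {d:.4f} -> cosine similarity = {s:.4f}")
```

Under that assumption, a smaller distance corresponds to a higher cosine similarity, so the ranking is the same either way; the conversion only makes the scores easier to compare with similarity thresholds.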
