# **AlManara RAG (Retrieval-Augmented Generation)**

**This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system
over our own data: a JSON file of prompts and responses plus a folder of PDF documents.
It uses Sentence Transformers for embeddings, FAISS for indexing, and T5 for text generation.**

In [7]:
!pip install transformers sentence-transformers faiss-cpu PyPDF2 dash plotly flask-ngrok



***Step 1: Import Libraries and Define Utility Functions***

In [8]:
# Import necessary libraries
import json
import os
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer
import faiss
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [9]:
# Function to clean text
def clean_text(text):
    """
    Cleans the text by removing newline characters and extra spaces.

    Args:
        text (str): The raw text to be cleaned.
    Returns:
        str: Cleaned text.
    """
    # str.split() with no arguments splits on any run of whitespace
    # (including "\n" and "\r"), so one join normalizes everything.
    return " ".join(text.split())

***Step 2: Load Data from JSON File***

In [13]:
# Path to the JSON file containing prompts and responses
json_path = '/content/Docs Folder/Scraping.json'

# Load prompts and responses from the JSON file
with open(json_path, 'r', encoding='utf-8') as file:
    prompts_and_responses = json.load(file)

# Extract responses and clean them to build the corpus
corpus = [clean_text(response) for response in prompts_and_responses.values()]

print(f"Loaded {len(corpus)} responses from JSON file.")

Loaded 4 responses from JSON file.
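
Because the loading code calls `.values()` on the parsed JSON, it assumes the file holds a single object that maps each prompt to its response text. The sketch below illustrates that assumed shape with made-up placeholder entries (the real contents of `Scraping.json` are not shown in this notebook), and applies the same whitespace normalization as `clean_text`.

```python
import json

# Hypothetical stand-in for the file contents: one JSON object whose keys are
# prompts and whose values are response strings. These entries are invented.
sample_json = (
    '{"example prompt 1": "First example\\nresponse.", '
    '"example prompt 2": "  Second   example response. "}'
)

sample = json.loads(sample_json)

# Same normalization as clean_text: collapse every whitespace run to one space.
sample_corpus = [" ".join(response.split()) for response in sample.values()]

print(sample_corpus)
```

If the real file nests its responses differently (for example, a list of objects), the list comprehension that builds `corpus` would need to be adapted accordingly.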


***Step 3: Extract Text from PDF Files***

In [14]:
# Function to extract text from PDF files
def extract_text_from_pdfs(pdf_folder):
    """
    Extracts text from all PDF files in a given folder.

    Args:
        pdf_folder (str): Path to the folder containing PDF files.
    Returns:
        list: A list of cleaned text extracted from each PDF.
    """
    documents = []
    for file_name in os.listdir(pdf_folder):
        if file_name.endswith('.pdf'):
            file_path = os.path.join(pdf_folder, file_name)
            reader = PdfReader(file_path)
            text = ""
            for page in reader.pages:
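                # Add a space between pages so words at page boundaries do not
                # merge; `or ""` guards against pages with no extractable text.
                text += (page.extract_text() or "") + " "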
            documents.append(clean_text(text))
    return documents

In [15]:
# Path to the folder containing PDF files
pdf_folder = '/content/Docs Folder'

# Extract text from PDFs and add to the corpus
pdf_texts = extract_text_from_pdfs(pdf_folder)
corpus.extend(pdf_texts)

print(f"Extracted text from {len(pdf_texts)} PDF files.")

Extracted text from 4 PDF files.
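
Each PDF enters the corpus as one long string. Sentence Transformers models truncate input beyond their maximum sequence length (256 word pieces by default for `all-MiniLM-L6-v2`, per its model card), so only the beginning of a long document would influence its embedding. One common workaround, sketched here under assumptions and not part of the original notebook, is to split each document into overlapping word-based chunks before embedding. The helper name `chunk_text` and its default sizes are hypothetical.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping word-based chunks (hypothetical helper)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Tiny demonstration with ten placeholder "words".
demo_words = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_text(demo_words, chunk_size=4, overlap=2)
print(demo_chunks)
```

In this notebook, one option would be to extend `corpus` with `chunk_text(doc)` for each extracted document instead of the whole document. Word counts are only a rough proxy for token counts, so the chunk size may need tuning.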


***Step 4: Create Embeddings and Build FAISS Index***




In [16]:
# Initialize SentenceTransformer model for embedding generation
embedder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Generate embeddings for the corpus
embeddings = embedder.encode(corpus, convert_to_tensor=False)

# Initialize and populate FAISS index
dimension = embeddings.shape[1]  # Dimension of each embedding vector
index = faiss.IndexFlatL2(dimension)  # Exact (brute-force) search by L2 distance; smaller means more similar
index.add(embeddings)  # Add embeddings to the FAISS index (expects a float32 array)

print("FAISS index created successfully.")

FAISS index created successfully.
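
To make the behavior of `IndexFlatL2` concrete: it performs an exhaustive comparison of the query against every stored vector and reports squared Euclidean distances, smallest first. The pure-Python sketch below mimics that idea on toy 2-D vectors. The function names are illustrative only, and this is not how FAISS is implemented internally.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(vectors, query, top_k):
    """Return (distances, indices) of the top_k nearest vectors to query."""
    scored = sorted((squared_l2(vec, query), i) for i, vec in enumerate(vectors))
    top = scored[:top_k]
    distances = [d for d, _ in top]
    indices = [i for _, i in top]
    return distances, indices


# Toy data: three 2-D "embeddings" and one query.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
toy_distances, toy_indices = brute_force_search(toy_vectors, [0.9, 0.1], top_k=2)
print(toy_indices, toy_distances)
```

Exhaustive search is a reasonable fit for a corpus of this size; approximate index types mainly pay off at much larger scales.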


***Step 5: Define a Search Function***

In [17]:
# Function to search the FAISS index
def search(query, top_k=5):
    """
    Searches the FAISS index for the most relevant documents to the query.

    Args:
        query (str): The search query.
        top_k (int): Number of top documents to retrieve.
    Returns:
        list: A list of the most relevant documents.
    """
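    query_embedding = embedder.encode([query], convert_to_tensor=False)
    distances, indices = index.search(query_embedding, top_k)
    # FAISS pads the result with -1 when fewer than top_k vectors are indexed;
    # skip those so corpus[-1] is not returned by accident.
    results = [corpus[i] for i in indices[0] if i != -1]
    return results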

***Step 6: Load and Prepare T5 Model for Text Generation***

In [18]:
# Load T5 tokenizer and model for text generation
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Function to generate a response using T5
def generate_response(context):
    """
    Generates a response using T5 summarization capabilities.

    Args:
        context (str): The context text to summarize.
    Returns:
        str: The generated response.
    """
    input_text = f"summarize: {context}"
    input_ids = tokenizer.encode(input_text, return_tensors='pt', max_length=512, truncation=True)
    output_ids = model.generate(input_ids, max_length=150, num_beams=2, length_penalty=1.2, early_stopping=True)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
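
The retrieval and generation steps can be combined into a single question-answering call. The sketch below shows one possible way to wire them together; it is an assumption about how the pieces fit, not the notebook's own code. The name `answer_query` is hypothetical, and the two stand-in functions exist only so the example runs without loading any models.

```python
def answer_query(query, search_fn, generate_fn, top_k=3):
    """Retrieve top_k documents for the query, then generate from their concatenation."""
    retrieved = search_fn(query, top_k)
    if not retrieved:
        return ""  # nothing retrieved, so there is no context to summarize
    context = " ".join(retrieved)
    return generate_fn(context)


# Stand-ins for the notebook's search() and generate_response().
def fake_search(query, top_k):
    docs = ["doc about opening hours", "doc about fees", "doc about location"]
    return docs[:top_k]


def fake_generate(context):
    return context.upper()


demo_answer = answer_query("any question", fake_search, fake_generate, top_k=2)
print(demo_answer)
```

In the notebook itself, the call would look like `answer_query(question, search, generate_response)`. Since `generate_response` truncates its input to 512 tokens, a small `top_k` or shorter documents help keep the most relevant text inside that window.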