# Flowchart of RAG-Based Question Answering System


            +-------------------------+
            |                         |
            |    Start of Program     |
            |                         |
            +------------+------------+
                         |
                         v
          +----------------------------------+
          |                                  |
          |       Step 1: Data for RAG       |
          |     (Web Scraping, PDF, etc.)    |
          +------------+---------------------+
                       |
                       v
          +----------------------------------+
          |                                  |
          |     Step 2: Chunking of Data     |
          |     (Sentence Tokenization)      |
          +------------+---------------------+
                       |
                       v
          +----------------------------------+
          |                                  |
          |      Step 3: Making Embeddings   |
          |      of Chunks (Sentence Vectors)|
          +------------+---------------------+
                       |
                       v
          +----------------------------------+
          |                                  |
          |     Step 4: Importing Large      |
          |     Language Model (LLM)         |
          |       and Passing Retrieved      |
          |        Chunks to the LLM         |
          +------------+---------------------+
                       |
                       v
          +----------------------------------+
          |                                  |
          |      Step 5: Generating Output   |
          |      (Text Generation)           |
          +------------+---------------------+
                       |
                       v
            +---------------------------+
            |                           |
            |      End of Program       |
            |                           |
            +---------------------------+


# Step 1: DATA for RAG
We can collect data by scraping web pages or by extracting text from PDFs.


# (i) Web Scraping

In [None]:
# Install the 'requests' library which allows you to send HTTP requests in Python
!pip install requests

# Install the 'beautifulsoup4' library, a popular tool for web scraping to extract data from HTML and XML files
!pip install beautifulsoup4




In [None]:
import requests
from bs4 import BeautifulSoup

def scrape_webpage(url):
    # Send a GET request to the specified URL; the timeout keeps the call from hanging indefinitely
    response = requests.get(url, timeout=10)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the HTML content of the page using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all paragraph elements (<p>) on the page
        paragraphs = soup.find_all('p')

        # Extract text from each paragraph and join them with newlines
        text_content = '\n'.join([paragraph.get_text() for paragraph in paragraphs])

        # Return the concatenated text content
        return text_content
    else:
        # Print an error message if the request failed and return None
        print("Failed to retrieve the webpage. Status code:", response.status_code)
        return None

# URL of the webpage you want to scrape
webpage_url = 'https://en.wikipedia.org/wiki/Hyderabad'

# Call the function to scrape the webpage and store the result in content1
content1 = scrape_webpage(webpage_url)


In [None]:
# Print the scraped content
print(content1)





Hyderabad (/ˈhaɪdərəbæd/ ⓘ HY-dər-ə-bad;[5] Telugu: [ˈɦaɪ̯daraːbaːd], Urdu: [ˈɦɛːdəɾaːbaːd]) is the capital and largest city of the Indian state of Telangana. It occupies 650 km2 (250 sq mi) on the Deccan Plateau along the banks of the Musi River, in the northern part of Southern India. With an average altitude of 542 m (1,778 ft), much of Hyderabad is situated on hilly terrain around artificial lakes, including the Hussain Sagar lake, predating the city's founding, in the north of the city centre. According to the 2011 census of India, Hyderabad is the fourth-most populous city in India with a population of 6.9 million residents within the city limits, and has a population of 9.7 million residents in the metropolitan region, making it the sixth-most populous metropolitan area in India. With an output of US$74 billion, Hyderabad has the fifth-largest urban economy in India.

The Qutb Shahi dynasty's Muhammad Quli Qutb Shah established Hyderabad in 1591 to extend the capital beyond 
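
The scraped output above still carries citation markers such as `[5]` and non-breaking spaces (they show up as `\xa0` once the text is tokenized). A minimal sketch of a cleanup pass follows, assuming it would run on `content1` before chunking; `clean_scraped_text` is a hypothetical helper, not part of the original notebook.

```python
import re

def clean_scraped_text(text):
    """Normalize text scraped from a wiki-style page (hypothetical helper)."""
    # Replace non-breaking spaces with plain spaces
    text = text.replace("\xa0", " ")
    # Drop bracketed numeric citation markers such as [5] or [12]
    text = re.sub(r"\[\d+\]", "", text)
    # Collapse runs of spaces and tabs, then runs of blank lines
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n", text)
    return text.strip()

sample = "\n\nHyderabad;[5] occupies 650\xa0km2.\n\n\nIt lies on the Deccan Plateau."
print(clean_scraped_text(sample))
```

Only purely numeric brackets are removed, so phonetic transcriptions in square brackets are left untouched.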

# (ii) Data from PDF


In [None]:
# Install the 'PyPDF2' library, which allows you to work with PDF files in Python,
# such as reading and manipulating PDF documents.
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [7]:
import os
import textwrap
import PyPDF2

def extract_text_from_pdf(pdf_path, start_page=4):
    """
    Extracts text from a PDF starting from a specific page.

    Args:
    pdf_path (str): The path to the PDF file.
    start_page (int): The page number to start extracting text from (default is 4).

    Returns:
    str: The extracted text from the PDF.
    """
    text = ""
    # Open the PDF file in binary read mode
    with open(pdf_path, "rb") as file:
        # Initialize the PDF reader
        reader = PyPDF2.PdfReader(file)
        # Loop through the pages starting from the specified start page
        for page_num in range(start_page - 1, len(reader.pages)):
            page = reader.pages[page_num]
            # Extract text from the current page
            page_text = page.extract_text()
            # Remove any extra newlines and spaces
            page_text = ' '.join(page_text.split())
            # Add the cleaned text to the cumulative text string
            text += page_text + '\n'
    return text

def align_text(text):
    """
    Aligns the text to a specified width using text wrapping.

    Args:
    text (str): The text to be aligned.

    Returns:
    str: The aligned text.
    """
    wrapped_lines = []
    # Split the text into lines and wrap each line to the specified width
    for line in text.split('\n'):
        wrapped_lines.extend(textwrap.wrap(line, width=80))  # Adjust the width as needed
    return '\n'.join(wrapped_lines)

def process_pdfs(pdf_paths):
    """
    Processes a list of PDF files, extracting and aligning text from each.

    Args:
    pdf_paths (list): List of paths to the PDF files.

    Returns:
    str: The processed text from all PDFs.
    """
    all_text = ""
    # Loop through the list of PDF file paths
    for pdf_path in pdf_paths:
        # Check if the PDF file exists
        if os.path.exists(pdf_path):
            # Extract text from the PDF
            extracted_text = extract_text_from_pdf(pdf_path)
            # Align the extracted text
            text = align_text(extracted_text)
            # Add the aligned text to the cumulative text string
            all_text += f"{text}\n\n"
        else:
            # Report a missing file without mixing the message into the extracted text
            print(f"PDF file not found: {pdf_path}")
    return all_text

# Example usage:
pdf_paths = ["/content/A thousand splendid suns- Khalid Hussaini.pdf"]  # Replace with paths to your PDF files
content2 = process_pdfs(pdf_paths)

# Print the processed content
print(content2)

PDF file not found: /content/A thousand splendid suns- Khalid Hussaini.pdf




# Step 2: Chunking of Data

In [8]:
# Install the 'nltk' library, a leading platform for building Python programs to work with human language data
# (natural language processing). The '--user' flag installs it for the current user,
# and '-U' ensures that the latest version is installed.
!pip install --user -U nltk



In [9]:
import nltk

# Download the 'punkt' package, which includes pre-trained models for tokenizing text into sentences.
nltk.download('punkt')

# Tokenize the text content from content1 into sentences.
chunks = nltk.tokenize.sent_tokenize(content1)

# Uncomment the following line to tokenize text content from content2 instead.
# chunks = nltk.tokenize.sent_tokenize(content2)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [10]:
# Print the list of sentences obtained from tokenizing content1
print(chunks)

['\n\n\n\nHyderabad (/ˈhaɪdərəbæd/ ⓘ HY-dər-ə-bad;[5] Telugu: [ˈɦaɪ̯daraːbaːd], Urdu: [ˈɦɛːdəɾaːbaːd]) is the capital and largest city of the Indian state of Telangana.', 'It occupies 650\xa0km2 (250\xa0sq\xa0mi) on the Deccan Plateau along the banks of the Musi River, in the northern part of Southern India.', "With an average altitude of 542\xa0m (1,778\xa0ft), much of Hyderabad is situated on hilly terrain around artificial lakes, including the Hussain Sagar lake, predating the city's founding, in the north of the city centre.", 'According to the 2011 census of India, Hyderabad is the fourth-most populous city in India with a population of 6.9 million residents within the city limits, and has a population of 9.7 million residents in the metropolitan region, making it the sixth-most populous metropolitan area in India.', 'With an output of US$74\xa0billion, Hyderabad has the fifth-largest urban economy in India.', "The Qutb Shahi dynasty's Muhammad Quli Qutb Shah established Hyderabad
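
Single sentences can be too short to carry enough context on their own. One possible refinement, which is a swapped-in technique and not what the notebook itself does, is to group neighbouring sentences into overlapping windows. The sketch below assumes a list of sentences like `chunks`; `window_sentences`, `size`, and `overlap` are hypothetical names.

```python
def window_sentences(sentences, size=3, overlap=1):
    """Group sentences into overlapping chunks of `size` sentences."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(sentences), step):
        # Join the sentences of this window into one chunk of text
        windows.append(" ".join(s.strip() for s in sentences[start:start + size]))
        # Stop once a window reaches the final sentence
        if start + size >= len(sentences):
            break
    return windows

demo = ["A.", "B.", "C.", "D.", "E."]
print(window_sentences(demo, size=3, overlap=1))
```

Larger windows give the language model more context per retrieved chunk, at the cost of less precise matching.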

# Step 3: Making Embeddings of Chunks


In [11]:
# Install the 'sentence-transformers' library, which provides tools for computing dense vector representations
# of sentences and paragraphs, useful for tasks like semantic search, clustering, and classification.
!pip install sentence_transformers

Collecting sentence_transformers
  Downloading sentence_transformers-2.7.0-py3-none-any.whl (171 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (

In [12]:
from sentence_transformers import SentenceTransformer

# Initialize a SentenceTransformer model named 'paraphrase-MiniLM-L6-v2'
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [13]:
import numpy as np

def cosine_similarity(a, b):
    """
    Compute the cosine similarity between two matrices of vectors.

    Args:
    a (numpy.ndarray): The first matrix of vectors.
    b (numpy.ndarray): The second matrix of vectors.

    Returns:
    numpy.ndarray: The cosine similarity matrix, where each element (i, j) represents
    the cosine similarity between the ith vector in 'a' and the jth vector in 'b'.
    """
    # Compute the dot product of the matrices and divide by the product of their norms
    return np.dot(a, b.T) / (np.linalg.norm(a, axis=1)[:, np.newaxis] * np.linalg.norm(b, axis=1))


In [14]:
# Encode the list of sentences using the SentenceTransformer model
embedded_data = model.encode(chunks)

# Print the embedded data
print(embedded_data)


[[ 0.3165233  -0.44545913 -0.02858389 ...  0.5207064  -0.18698347
   0.36302036]
 [ 0.2758075  -0.11801807 -0.20838661 ...  0.53775847 -0.762132
  -0.42110175]
 [ 0.4953546  -0.2165369  -0.06604673 ...  0.24094497 -0.5430151
  -0.12842937]
 ...
 [ 0.28417465  0.04141764 -0.2596645  ...  0.18713404  0.14727083
  -0.14572033]
 [ 0.45763636 -0.12355874 -0.19020283 ...  0.17763714 -0.03678588
  -0.21886373]
 [ 0.33628622  0.31005308 -0.09915163 ... -0.34929287  0.12496953
   0.5176378 ]]
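
To see how these vectors are used for retrieval, here is a small dependency-free sketch of cosine similarity followed by top-k selection. Toy two-dimensional vectors stand in for the real embeddings, and `cosine` and `top_k` are hypothetical helpers written for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k documents most similar to the query, best first."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)[:k]

toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [2.0, 0.1]
print(top_k(toy_query, toy_docs, k=2))
```

The query points almost along the first axis, so the first document ranks highest and the diagonal one comes second.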


# Step 4: Importing the Large Language Model (LLM) and passing the retrieved chunks to it

In [15]:
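import os
import google.generativeai as gen_ai

# Read the API key from an environment variable instead of hard-coding it in the notebook.
# The variable name GOOGLE_API_KEY is an assumption; use whatever name you set in your environment.
gen_ai.configure(api_key=os.environ["GOOGLE_API_KEY"])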

# Set up the model with generation and safety settings
generation_config = {
    "temperature": 0.9,
    "top_p": 1,
    "top_k": 1,
    "max_output_tokens": 2084,
}
safety_settings = [
    {
        "category": "HARM_CATEGORY_HATE_SPEECH",
        "threshold": "BLOCK_MEDIUM_AND_ABOVE"
    },
    {
        "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
        "threshold": "BLOCK_MEDIUM_AND_ABOVE"
    },
    {
        "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
        "threshold": "BLOCK_MEDIUM_AND_ABOVE"
    }
]

# Initialize the GenerativeModel with the specified parameters
Gemini = gen_ai.GenerativeModel(model_name="gemini-1.0-pro",
                                generation_config=generation_config, safety_settings=safety_settings)


# Step 5: Generating Output

In [None]:
def run_rag():
    # Take a question from the user; an empty line signals the caller to stop
    query = input("Ask a question (press Enter to quit): ").strip()
    if not query:
        return None

    # Encode the query with the same SentenceTransformer used for the chunks
    query_vec = model.encode([query])

    # Compute similarities between the query vector and every chunk embedding
    similarities = cosine_similarity(query_vec, embedded_data)

    # Get the indices of the 4 most similar sentences and join them into one context
    top_indices = np.argsort(similarities[0])[::-1][:4]
    context = "\n".join(chunks[index] for index in top_indices)

    # Construct the augmented prompt from the retrieved context and the input query
    augmented_prompt = f'''You are an expert information system about Hyderabad. I'll give you a question and context, and you'll return the answer. Your name is LAILA. Query: {query}. Context: {context}'''

    # Generate content using the Gemini model with the constructed prompt
    model_output = Gemini.generate_content(contents=augmented_prompt)

    # Print and return the generated text
    print("\n" + model_output.text)
    return model_output.text

# Keep answering queries until the user submits an empty line
while run_rag() is not None:
    pass



**LAILA:** Hyderabad is the capital and largest city of the Indian state of Telangana. It is also the seventh-most populous city in India and the fourth-most populous city in South India. The city has a population of about 6.9 million people, while the metropolitan area is home to about 12 million people. Hyderabad is located on the banks of the Musi River and is surrounded by hills and lakes. The city has a rich history and was once the capital of the Hyderabad State. Today, it is a major center for commerce, industry, and education.

I'm sorry, but the provided context does not contain any information about the airports in Hyderabad. Please provide me with more context to help me answer your query.

Sure, here is a list of some of the popular markets in Hyderabad:

* Begum Bazaar: This market is known for its wide variety of products, including textiles, jewelry, electronics, and spices.
* Charminar: This market is located near the famous Charminar monument and is known for its bang
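
The answers above depend heavily on how the retrieved chunks are placed into the prompt. As a rough sketch, prompt assembly can be pulled into a small pure function that is easy to inspect and test without calling the model. `build_prompt`, its parameters, and the instruction wording are assumptions for illustration, not the notebook's original prompt.

```python
def build_prompt(query, contexts, persona="LAILA", topic="Hyderabad"):
    """Assemble a RAG prompt from a question and retrieved chunks (hypothetical helper)."""
    # Number each retrieved chunk so the model can tell them apart
    numbered = "\n".join(f"[{i}] {c.strip()}" for i, c in enumerate(contexts, start=1))
    return (
        f"You are an expert information system about {topic}. Your name is {persona}.\n"
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {query}"
    )

prompt = build_prompt(
    "Where is Hyderabad?",
    ["  It lies on the Deccan Plateau. ", "It is in Telangana."],
)
print(prompt)
```

Asking the model to stay within the supplied context is one way to discourage answers that go beyond the retrieved text.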