# <center>**RAG Pipeline**</center>


### Importing Necessary Libraries

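This cell imports everything the pipeline needs. The two retrieval libraries are:

- `sentence_transformers`: generates sentence embeddings (dense numerical vectors) of text, used here to compare the query against chunks of text.
- `faiss`: performs efficient similarity search and clustering over dense vectors, used here to find the chunks most relevant to a query.

The remaining imports are `openai` (chat completions), `numpy` (array conversion for FAISS), `os` and `subprocess` (file handling and running `pdflatex`), and two local modules: `my_secrets`, which holds the API key, and `chunks`, which provides the text chunks.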


In [1]:

import openai  
import my_secrets
import os
import subprocess
from sentence_transformers import SentenceTransformer  
import faiss  
import numpy as np  
import chunks


### Initializing Sentence Embeddings, FAISS Index, and Setting Up Directories

In this cell, we perform several key initializations for embedding and indexing text chunks from a PDF document:

1. **Loading Sentence Transformer Model**:

   - We use the `SentenceTransformer` model `all-MiniLM-L6-v2` to generate sentence embeddings. This model is pre-trained to create dense vector representations (embeddings) of text, which can be used for tasks like similarity search and clustering.

2. **Setting Embedding Dimension**:

   - The `dimension` is set to 384, which corresponds to the number of features in the embeddings generated by the `all-MiniLM-L6-v2` model. This is the fixed size of the vector representations.

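3. **Creating FAISS Index**:
   - FAISS (Facebook AI Similarity Search) is used to search through the embeddings efficiently. We initialize an `IndexFlatL2` index, which stores the vectors uncompressed and answers nearest-neighbor queries by exact (brute-force) comparison under L2 (Euclidean) distance.

4. **Setting Up Directories and Chunks**:
   - The `extracted_images` directory is created if it does not already exist, and `MyChunks` is bound to the chunk collection exported by the local `chunks` module.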


In [3]:
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
dimension = 384
index = faiss.IndexFlatL2(dimension)
embeddings = []
image_dir = "extracted_images"
os.makedirs(image_dir, exist_ok=True)
MyChunks = chunks.Mychunks
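
To make concrete what an exact L2 index computes, here is a minimal NumPy sketch of a brute-force nearest-neighbor search. It is an illustration only, not FAISS's implementation; `flat_l2_search` and the toy vectors are hypothetical names introduced for this example. Like `IndexFlatL2`, it reports squared Euclidean distances, smallest first.

```python
import numpy as np


def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors nearest to query.

    Hypothetical helper for illustration: compares the query against every
    stored vector, which is the exhaustive strategy a flat index uses.
    """
    vectors = np.asarray(vectors, dtype="float32")
    query = np.asarray(query, dtype="float32")
    # Squared Euclidean distance from the query to each stored vector.
    squared = ((vectors - query) ** 2).sum(axis=1)
    # Indices of the k smallest distances, nearest first.
    nearest = np.argsort(squared, kind="stable")[:k]
    return squared[nearest], nearest


# Toy 3-dimensional "embeddings" (the real ones here have 384 dimensions).
stored = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
distances, indices = flat_l2_search(stored, [0.9, 0.0, 0.0], k=2)
print(indices.tolist())
```

Because every stored vector is compared against the query, the cost of a search grows linearly with the number of chunks.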


### Embedding and Indexing Chunks

1. **Embedding Text from Chunks**:

   - If the embeddings are not yet created, the function loops through each chunk in the `MyChunks` collection.
   - For each chunk, it extracts the `text` field and passes it to the `embedding_model.encode()` method, which generates an embedding (a vector representation) of the text. This embedding is then appended to the `embeddings` list.

2. **Indexing the Embeddings**:
   - Each embedding is added to the FAISS index using the `index.add()` method. FAISS expects a two-dimensional `float32` array, so the code wraps each embedding in a one-row array using `np.array([embedding]).astype("float32")`.


In [4]:
def embed_and_index_chunks():
    global embeddings
    if embeddings:
        print("Chunks already embedded and indexed.")
        return
    for chunk in MyChunks:
        text = chunk['text']
        embedding = embedding_model.encode(text)
        embeddings.append(embedding)
        index.add(np.array([embedding]).astype("float32"))
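
The loop above adds one vector per `index.add()` call. As a hedged alternative, the vectors can be stacked into a single `(n, dimension)` `float32` matrix and added in one call. The sketch below shows only the array preparation with NumPy; `stack_embeddings` is a hypothetical helper, and the random vectors stand in for real model output.

```python
import numpy as np


def stack_embeddings(vectors, dimension):
    """Stack per-chunk vectors into one C-contiguous (n, dimension) float32 matrix."""
    matrix = np.ascontiguousarray(np.vstack(vectors), dtype="float32")
    if matrix.shape[1] != dimension:
        raise ValueError(
            f"expected vectors of length {dimension}, got {matrix.shape[1]}")
    return matrix


# Stand-in for model output: three 384-dimensional float64 vectors.
rng = np.random.default_rng(0)
fake_embeddings = [rng.standard_normal(384) for _ in range(3)]

matrix = stack_embeddings(fake_embeddings, dimension=384)
print(matrix.shape, matrix.dtype)
# With a FAISS index in scope, the whole batch could then be added at once:
# index.add(matrix)
```

`SentenceTransformer.encode` also accepts a list of strings, so the chunk texts themselves could be encoded in one batch before stacking.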


### Retrieving Relevant Chunks Based on Query

1. **Generating Query Embedding**:

   - The input `query` (a string) is passed to the `embedding_model.encode()` method to generate an embedding (numerical representation) for the query.

2. **Searching for Closest Chunks**:

   - The generated query embedding is then passed to the FAISS index using the `index.search()` method. This performs a search to find the nearest vectors (embeddings) to the query embedding.
   - The `top_k` parameter determines how many closest results (chunks) will be retrieved. By default, it retrieves the top 10 most relevant chunks.

3. **Returning Relevant Chunks**:
   - The `indices` returned by FAISS are used to select the corresponding chunks from the `MyChunks` list. The function returns these top `k` chunks that are most similar to the query.


In [5]:
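def retrieve_chunks(query, top_k=10):
    # Embed the query with the same model used for the chunks.
    query_embedding = embedding_model.encode(query)
    distances, indices = index.search(
        np.array([query_embedding]).astype("float32"), top_k)
    # FAISS pads the result with -1 when the index holds fewer than top_k
    # vectors; skip those so MyChunks[-1] is never returned by accident.
    return [MyChunks[i] for i in indices[0] if i >= 0]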
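
FAISS returns `distances` and `indices` as parallel arrays with one row per query, and fills unused slots with `-1` when the index holds fewer than `top_k` vectors. The following sketch pairs each hit with its distance and drops the padding; `pair_hits` and the sample data are hypothetical, and plain lists stand in for the NumPy arrays FAISS returns.

```python
def pair_hits(chunks, distances_row, indices_row):
    """Return (chunk, distance) pairs, skipping -1 padding entries."""
    hits = []
    for distance, i in zip(distances_row, indices_row):
        if i < 0:  # -1 marks "no result" for this slot
            continue
        hits.append((chunks[i], float(distance)))
    return hits


# Hypothetical data: two chunks in the index, but three results requested.
sample_chunks = [
    {"page": 58, "text": "Inertia of a body is its property due to which it resists any change in its state of rest or motion."},
    {"page": 51, "text": "Rest and motion are always relative."},
]
sample_distances = [0.42, 1.10, float("inf")]  # placeholder value for the padded slot
sample_indices = [0, 1, -1]

for chunk, distance in pair_hits(sample_chunks, sample_distances, sample_indices):
    print(chunk["page"], round(distance, 2))
```

Keeping the distance next to each chunk makes it possible to apply a relevance cutoff later instead of always passing a fixed number of chunks to the prompt.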


### Generating Answer Based on Retrieved Chunks

In this cell, the function `generate_answer(query, retrieved_chunks)` builds a structured prompt for answering the user's question from the retrieved chunks. Despite its name, it does not call a language model itself; it only returns the prompt. Here's how it works:

1. **Creating the Context**:

   - The function begins by creating a context string that consists of the relevant chunks retrieved from the previous step. Each chunk is formatted to display the page number and the corresponding text.
   - The chunks are joined together with two newlines (`\n\n`) for better readability.

2. **Answer Generation Prompt**:
   - The function constructs a final prompt that includes:
     - The context made up of relevant chunks.
     - The user's query, which is passed as the input `query` parameter.
   - This prompt is designed to instruct a language model (e.g., OpenAI) to generate an answer based on the provided context.

The generated prompt is returned as a structured string, which will be used in subsequent steps to call a language model (like OpenAI's GPT) to generate an answer based on the content of the textbook.


In [6]:
def generate_answer(query, retrieved_chunks):
    context = "\n\n".join(
        [f"Page {chunk['page']}: {chunk['text']}" for chunk in retrieved_chunks])
    return f"Answer the question based on the following textbook content:\n\n{context}\n\nQuestion: {query}\nAnswer:"
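
For a concrete picture of the prompt layout, the sketch below applies the same format to two sample chunks. It also adds a simple character budget so the context cannot grow without bound; that budget, the name `build_prompt`, and the sample data are assumptions made for this example and are not part of the original function.

```python
def build_prompt(query, retrieved_chunks, max_context_chars=2000):
    """Format chunks as 'Page N: text' blocks, stopping at a character budget."""
    parts = []
    used = 0
    for chunk in retrieved_chunks:
        block = f"Page {chunk['page']}: {chunk['text']}"
        if parts and used + len(block) > max_context_chars:
            break  # keep at least one chunk, then respect the budget
        parts.append(block)
        used += len(block)
    context = "\n\n".join(parts)
    return (
        "Answer the question based on the following textbook content:\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )


sample = [
    {"page": 58, "text": "Inertia of a body is its property due to which it resists any change in its state of rest or motion."},
    {"page": 51, "text": "Rest and motion are always relative."},
]
print(build_prompt("What is inertia?", sample))
```

Ending the prompt with `Answer:` cues the model to continue with the answer rather than restate the question.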


### Running the RAG Pipeline

In this cell, we define the function `run_rag_pipeline(pdf_path, query)`, which retrieves the content relevant to a user's query and assembles the prompt for answering it. The `pdf_path` argument is accepted but not used in the body: the chunks come from the already-loaded `MyChunks`. The function performs the following steps:

1. **Check for Available Chunks**:

   - First, the function checks if there are any chunks in `MyChunks`. If there are no chunks, it returns the message `"No data available."`. This ensures that the pipeline doesn't run unless there is data to process.

2. **Embedding and Indexing**:

   - If chunks are available, it calls `embed_and_index_chunks()` to generate embeddings for each chunk and index them using FAISS. This prepares the chunks for efficient retrieval.

3. **Retrieve Relevant Chunks**:

   - It then calls `retrieve_chunks(query)` to fetch the most relevant chunks based on the user's query. The function uses the embeddings and FAISS indexing to perform this retrieval efficiently.

4. **Generate Answer**:
   - Finally, it calls `generate_answer(query, retrieved_chunks)` to build the answer prompt, using the relevant chunks as context.

The final output of the function is this prompt string, not a model-generated answer; it still has to be sent to a language model to obtain the answer.


In [7]:
def run_rag_pipeline(pdf_path, query):
    if not MyChunks:
        return "No data available."
    embed_and_index_chunks()
    retrieved_chunks = retrieve_chunks(query)
    return generate_answer(query, retrieved_chunks)
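
The four steps can be traced end to end without any model or index by swapping in toy components. In this sketch a normalized word-count vector replaces the sentence embedding and a NumPy distance computation replaces FAISS; every name in it (`toy_embed`, `toy_pipeline`, `TOY_CHUNKS`) is hypothetical and only mirrors the control flow described above.

```python
import numpy as np

TOY_CHUNKS = [
    {"page": 58, "text": "inertia is the property by which a body resists change in motion"},
    {"page": 51, "text": "rest and motion are always relative"},
    {"page": 45, "text": "kinematics is the study of motion"},
]


def toy_embed(text, vocabulary):
    """Stand-in for a sentence embedding: unit-length word-count vector."""
    words = text.lower().split()
    counts = np.array([words.count(w) for w in vocabulary], dtype="float32")
    norm = np.linalg.norm(counts)
    return counts / norm if norm > 0 else counts


def toy_pipeline(query, chunks, top_k=2):
    if not chunks:  # step 1: nothing to search
        return "No data available."
    vocabulary = sorted({w for c in chunks for w in c["text"].lower().split()})
    # Step 2: embed every chunk into one (n, d) matrix.
    matrix = np.stack([toy_embed(c["text"], vocabulary) for c in chunks])
    # Step 3: squared L2 distance from the query to each chunk, nearest first.
    distances = ((matrix - toy_embed(query, vocabulary)) ** 2).sum(axis=1)
    nearest = np.argsort(distances, kind="stable")[:top_k]
    retrieved = [chunks[i] for i in nearest]
    # Step 4: assemble the prompt from the retrieved chunks.
    context = "\n\n".join(f"Page {c['page']}: {c['text']}" for c in retrieved)
    return (
        "Answer the question based on the following textbook content:\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )


print(toy_pipeline("what is inertia", TOY_CHUNKS))
```

Word counts capture only exact word overlap, so this stand-in is far weaker than a sentence embedding; it is useful only for checking the plumbing.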


##### Loading the OpenAI API Key


In [8]:
def load_openai_key():
    return my_secrets.OPEN_AI_SECRET_KEY


In [9]:
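def call_openai_chat(prompt, model="gpt-3.5-turbo"):
    # NOTE: openai.ChatCompletion is the legacy interface (openai < 1.0);
    # it was removed in openai 1.0 and later.
    openai.api_key = load_openai_key()
    try:
        response = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "system", "content": "You are an expert in LaTeX and Beamer presentations."},
                      {"role": "user", "content": prompt}],
            max_tokens=4000,  # upper bound on the length of the reply
            temperature=0.7
        )
        return response['choices'][0]['message']['content'].strip()
    except Exception as e:
        # Errors are returned as text rather than raised, so callers
        # receive a string either way.
        return f"An error occurred: {e}"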
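
`openai.ChatCompletion` belongs to the pre-1.0 `openai` package. For readers on `openai` 1.0 or later, the equivalent call goes through a client object. The sketch below takes the client as a parameter so it can be exercised without network access; `chat_once` and `make_stub_client` are hypothetical names, while `client.chat.completions.create` and `response.choices[0].message.content` follow the 1.x client interface.

```python
from types import SimpleNamespace


def chat_once(client, prompt, model="gpt-3.5-turbo", system=None):
    """Send one user prompt through an openai>=1.0 style client and return the text."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()


def make_stub_client(reply):
    """Hypothetical offline stand-in that mimics the shape of the 1.x client."""
    def create(**kwargs):
        message = SimpleNamespace(content=reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])
    completions = SimpleNamespace(create=create)
    return SimpleNamespace(chat=SimpleNamespace(completions=completions))


# Offline check with the stub; with the real package this would instead be:
#   from openai import OpenAI
#   client = OpenAI(api_key=load_openai_key())
stub = make_stub_client("  \\documentclass{beamer}  ")
print(chat_once(stub, "Make one slide about inertia."))
```

Passing the client in, rather than setting a module-level key inside the function, also keeps the API key out of the function body.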


In [10]:
def save_to_file(filename, content):
    # Write as UTF-8 so non-ASCII characters in the LaTeX source survive.
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(content)


##### Compiling LaTeX Code into a PDF


In [11]:
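def compile_latex_to_pdf(tex_file):
    try:
        # nonstopmode keeps pdflatex from pausing for input on LaTeX errors.
        subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", tex_file], check=True)
        # splitext swaps only the final extension, unlike str.replace.
        pdf_file = os.path.splitext(tex_file)[0] + '.pdf'
        return pdf_file if os.path.exists(pdf_file) else None
    except FileNotFoundError:
        print("pdflatex was not found on PATH.")
        return None
    except subprocess.CalledProcessError as e:
        print(f"Error compiling LaTeX: {e}")
        return None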
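
By default `pdflatex` writes the PDF into the current working directory, which is not necessarily where the `.tex` file lives. The following is a minimal sketch of building the command and predicting the output path, assuming a hypothetical `build` directory; `pdflatex_command` and `expected_pdf_path` are names invented for this example, and the command is only constructed here, not executed.

```python
from pathlib import Path


def pdflatex_command(tex_file, output_dir="."):
    """Build a non-interactive pdflatex invocation as an argument list."""
    return [
        "pdflatex",
        "-interaction=nonstopmode",  # do not stop to ask for input on errors
        "-halt-on-error",            # exit at the first error
        f"-output-directory={output_dir}",
        str(tex_file),
    ]


def expected_pdf_path(tex_file, output_dir="."):
    """Path where pdflatex should place the PDF for the given .tex file."""
    return Path(output_dir) / Path(tex_file).with_suffix(".pdf").name


command = pdflatex_command("response.tex", output_dir="build")
print(command)
print(expected_pdf_path("response.tex", output_dir="build"))
# To run it: subprocess.run(command, check=True), with "build" created first.
```

Passing the arguments as a list, as the notebook's function already does, avoids shell quoting problems with file names that contain spaces.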


##### Extracting Valid LaTeX Code from Raw Response


In [12]:
def extract_latex_code(raw_response):
    lines = raw_response.split("\n")
    in_code_block = False
    cleaned_lines = []
    for line in lines:
        if line.strip().startswith("```latex"):
            in_code_block = True
            continue
        if line.strip() == "```":
            in_code_block = False
            continue
        if in_code_block:
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)
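
As an illustration of the input this function expects, the sketch below runs a regex-based variant on a sample response. The variant, named `extract_fenced_latex` here, is a hypothetical alternative rather than the notebook's function: it also accepts a `tex` fence label and falls back to the raw text when no fenced block is present. The sample response is invented for the example.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:latex|tex)[ \t]*\n(.*?)\n[ \t]*" + re.escape(FENCE),
    re.DOTALL | re.IGNORECASE,
)


def extract_fenced_latex(raw_response):
    """Return the bodies of all latex/tex fenced blocks, or the raw text if none."""
    blocks = _BLOCK.findall(raw_response)
    if not blocks:
        return raw_response.strip()
    return "\n".join(blocks)


sample = "\n".join([
    "Here is your presentation:",
    FENCE + "latex",
    r"\documentclass{beamer}",
    r"\begin{document}",
    r"\end{document}",
    FENCE,
    "Let me know if you need changes.",
])
print(extract_fenced_latex(sample))
```

The fallback matters because a model may return bare LaTeX without any fence, in which case a fence-only extractor yields an empty string.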


### Main Execution: Get User's Prompt, Process the Document, Generate LaTeX Beamer Code, and Compile into PDF

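This cell drives the complete process: getting the user's query, retrieving relevant chunks of text, generating LaTeX Beamer code, and compiling it into a PDF. As written, only Steps 1 and 2 run and the assembled prompt is printed; Steps 3 to 6 are commented out in the code.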

- **Step 1: Get the user's query**

- **Step 2: Retrieve relevant chunks from the textbook**

- **Step 3: Generate LaTeX Beamer code using OpenAI**

- **Step 4: Extract valid LaTeX code**

- **Step 5: Save LaTeX code to a .tex file**

- **Step 6: Compile the .tex file into a PDF**


In [13]:
user_prompt = input("Enter your prompt: ")

pdf_path = "Physics 9.pdf"
answer = run_rag_pipeline(pdf_path, user_prompt)

print(answer)


# beamer_prompt = (
#     f"Create a detailed LaTeX Beamer presentation on the following topic:\n\n"
#     f"{answer}\n\n"
#     "Include equations, bullet points, and TikZ-based diagrams to illustrate concepts. Use only TikZ to draw shapes or vectors "
#     "instead of relying on external image files. Ensure the output is ready-to-compile Beamer code."
# )
# raw_beamer_code = call_openai_chat(beamer_prompt)

# beamer_code = extract_latex_code(raw_beamer_code)

# tex_filename = "response.tex"
# save_to_file(tex_filename, beamer_code)

# pdf_filename = compile_latex_to_pdf(tex_filename)
# if pdf_filename:
#     print(f"PDF generated: {pdf_filename}")
# else:
#     print("Failed to generate PDF.")


Answer the question based on the following textbook content:

Page 58: moving then difficult to stop. Newton concluded that everybody resists to the change in its state of rest or of uniform motion in a straight line. He called this property of matter as inertia. He related the inertia of a body with its mass; greater is the mass of a body greater is its inertia. Inertia of a body is its property due to which it resists any change in its state of rest or motion. Let us perform an experiment to understand inertia. EXPERIMENT 3.1 Take a glass and cover it with a piece of cardboard. Place a coin

Page 45: Unit 2: Kinematics Physics IX

Page 191: Physics IX 191

Page 51: 51 Unit 2: Kinematics Physics IX SUMMARY A body is said to be at rest, if it does not change its position with respect to its surroundings. A body is said to be in motion, if it changes its position with respect to its surroundings. Rest and motion are always relative. There is no such thing as absolute rest or absolute mo