This notebook demonstrates a pipeline that generates LaTeX Beamer presentations from a user's query. It starts from text chunks previously extracted from a PDF (supplied by the `chunks` module), embeds them, retrieves the chunks most relevant to the query with a FAISS similarity search, and asks an OpenAI chat model to write LaTeX Beamer code from that context. Finally, the LaTeX code is compiled into a PDF presentation with `pdflatex`. The approach combines document retrieval, a language model, and LaTeX typesetting to produce educational slides on demand.


### Importing Necessary Libraries

In this cell, we import all the required libraries for the task. The libraries include:

- `openai`: Used to interact with OpenAI's API to generate responses for LaTeX Beamer code based on the provided query.
- `my_secrets`: A custom module that holds sensitive values such as the API key, keeping them out of the notebook itself.
- `os`: Provides functions for interacting with the operating system; here it creates a directory and checks whether the compiled PDF exists.
- `subprocess`: Used to run external system commands, in this case, to compile LaTeX code into a PDF.
- `sentence_transformers`: A library for generating sentence embeddings (numerical representations) of text, which is used for comparing and retrieving chunks of text.
- `faiss`: A library for efficient similarity search and clustering of dense vectors, which helps in finding the most relevant chunks of text based on a query.
- `numpy`: A library used for handling arrays, which is essential for working with the embeddings.
- `chunks`: A custom module that exposes `Mychunks`, the pre-chunked text from the PDF. Later cells treat it as a list of dictionaries with `text` and `page` keys.

These libraries together provide the tools needed to process and search through large text documents (like PDFs), interact with OpenAI's API, and generate LaTeX Beamer presentations based on user queries.


In [None]:

import openai  # type: ignore
import my_secrets
import os
import subprocess
from sentence_transformers import SentenceTransformer  # type: ignore
import faiss  # type: ignore
import numpy as np  # type: ignore
import chunks


### Initializing Sentence Embeddings, FAISS Index, and Setting Up Directories

In this cell, we perform several key initializations for embedding and indexing text chunks from a PDF document:

1. **Loading Sentence Transformer Model**:

   - We use the `SentenceTransformer` model `all-MiniLM-L6-v2` to generate sentence embeddings. This model is pre-trained to create dense vector representations (embeddings) of text, which can be used for tasks like similarity search and clustering.

2. **Setting Embedding Dimension**:

   - The `dimension` is set to 384, the length of each embedding vector produced by `all-MiniLM-L6-v2`. The FAISS index must be created with this same size; the value can also be read from the model with `embedding_model.get_sentence_embedding_dimension()` instead of being hardcoded.

3. **Creating FAISS Index**:

   - FAISS (Facebook AI Similarity Search) is used to search through the embeddings efficiently. We initialize an `IndexFlatL2` index, which stores the vectors as-is and answers queries by exact comparison using L2 (Euclidean) distance.

4. **Setting Up Embedding Storage**:

   - An empty list `embeddings` is created to store the embeddings generated for each chunk of text. This list will be used later to hold the embeddings of the text chunks from the PDF.

5. **Creating Directory for Extracted Images**:

   - We define a directory called `extracted_images` to store any images that might be extracted from the PDF. The `os.makedirs` function ensures that this directory is created if it doesn't already exist.

6. **Initializing Chunks**:

   - `MyChunks` is bound to `chunks.Mychunks`, the collection of text chunks from the PDF. Later cells iterate over it and index into it by position, reading each chunk's `text` and `page` fields.

This setup is essential for embedding the text chunks from the PDF, storing them in an index, and making them ready for querying based on the user's input.


In [16]:
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
dimension = 384
index = faiss.IndexFlatL2(dimension)
embeddings = []
image_dir = "extracted_images"
os.makedirs(image_dir, exist_ok=True)
MyChunks = chunks.Mychunks


### Embedding and Indexing Chunks

In this cell, the function `embed_and_index_chunks()` is responsible for embedding the text chunks from the PDF and indexing them for efficient retrieval. Here's a breakdown of what this function does:

1. **Check if Chunks Are Already Embedded**:

   - The function first checks if the `embeddings` list is already populated. If the embeddings are already computed, the function prints `"Chunks already embedded and indexed."` and exits early, avoiding redundant work.

2. **Embedding Text from Chunks**:

   - If the embeddings are not yet created, the function loops through each chunk in the `MyChunks` collection.
   - For each chunk, it extracts the `text` field and passes it to the `embedding_model.encode()` method, which generates an embedding (a vector representation) of the text. This embedding is then appended to the `embeddings` list.

3. **Indexing the Embeddings**:
   - Each embedding is added to the FAISS index using the `index.add()` method. Since FAISS requires the embeddings to be in the form of `float32` arrays, the code converts the embedding to the appropriate format using `np.array([embedding]).astype("float32")`.

The purpose of this function is to convert the text chunks into numerical representations (embeddings), which can then be efficiently searched using FAISS. This process prepares the data for quick retrieval based on a query.


In [17]:
def embed_and_index_chunks():
    global embeddings
    if embeddings:
        print("Chunks already embedded and indexed.")
        return
    for chunk in MyChunks:
        text = chunk['text']
        embedding = embedding_model.encode(text)
        embeddings.append(embedding)
        index.add(np.array([embedding]).astype("float32"))
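
The early-exit guard and the position-based pairing of vectors and chunks can be seen in isolation with a toy stand-in. The sketch below assumes nothing about FAISS or the real model: `toy_chunks`, `toy_index`, `toy_encode`, and `embed_and_index_toy` are hypothetical names invented for illustration, and a plain list plays the role of the index.

```python
# Toy stand-ins (hypothetical): a list plays the role of the FAISS index,
# and a letter-count function plays the role of embedding_model.encode().
toy_chunks = [
    {"page": 1, "text": "force equals mass times acceleration"},
    {"page": 2, "text": "energy is conserved"},
]
toy_embeddings = []
toy_index = []


def toy_encode(text):
    """Return a tiny fixed-size vector: counts of the letters a, e, and o."""
    return [float(text.count(ch)) for ch in "aeo"]


def embed_and_index_toy():
    """Embed every chunk once; later calls do nothing and return False."""
    if toy_embeddings:
        return False  # already indexed, nothing to do
    for chunk in toy_chunks:
        vector = toy_encode(chunk["text"])
        toy_embeddings.append(vector)
        toy_index.append(vector)  # position i in the index pairs with toy_chunks[i]
    return True


first_call = embed_and_index_toy()   # does the work
second_call = embed_and_index_toy()  # skipped by the guard
```

Because the guard only checks whether the list is non-empty, a changed chunk collection would not be picked up automatically. In the notebook that would presumably mean clearing `embeddings` and calling `index.reset()` before embedding again.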


### Retrieving Relevant Chunks Based on Query

In this cell, the function `retrieve_chunks(query, top_k=10)` is designed to retrieve the most relevant chunks from the indexed data based on a user's query. Here's how it works:

1. **Generating Query Embedding**:

   - The input `query` (a string) is passed to the `embedding_model.encode()` method to generate an embedding (numerical representation) for the query.

2. **Searching for Closest Chunks**:

   - The generated query embedding is then passed to the FAISS index using the `index.search()` method. This performs a search to find the nearest vectors (embeddings) to the query embedding.
   - The `top_k` parameter determines how many closest results (chunks) will be retrieved. By default, it retrieves the top 10 most relevant chunks.

3. **Returning Relevant Chunks**:
   - The `indices` returned by FAISS are used to select the corresponding chunks from the `MyChunks` list. The function returns these top `k` chunks that are most similar to the query.

This function enables the system to retrieve the most relevant text chunks based on the content of a user's query by comparing the query's embedding with the precomputed embeddings in the FAISS index.


In [18]:
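def retrieve_chunks(query, top_k=10):
    query_embedding = embedding_model.encode(query)
    distances, indices = index.search(
        np.array([query_embedding]).astype("float32"), top_k)
    # FAISS pads the result with -1 when the index holds fewer than top_k vectors;
    # skip those so MyChunks[-1] (the last chunk) is not returned by accident.
    return [MyChunks[i] for i in indices[0] if i != -1]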
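
To make the search step concrete, here is a minimal sketch of what an exact L2 index computes, written in plain Python with no FAISS dependency. `IndexFlatL2` compares the query against every stored vector and reports squared Euclidean distances; when fewer than `top_k` vectors are stored, it pads the returned indices with `-1`. The function name `brute_force_l2_search` and the toy vectors are hypothetical, and the `inf` placeholder distance is a choice made for this sketch only.

```python
import heapq


def brute_force_l2_search(vectors, query, top_k):
    """Return (distances, indices) of the top_k vectors nearest to query.

    Distances are squared Euclidean distances, sorted ascending. If fewer
    than top_k vectors exist, the index list is padded with -1.
    """
    scored = [
        (sum((v - q) ** 2 for v, q in zip(vector, query)), position)
        for position, vector in enumerate(vectors)
    ]
    nearest = heapq.nsmallest(top_k, scored)
    distances = [d for d, _ in nearest]
    indices = [i for _, i in nearest]
    padding = top_k - len(nearest)
    distances += [float("inf")] * padding  # placeholder used by this sketch
    indices += [-1] * padding
    return distances, indices


stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
distances, indices = brute_force_l2_search(stored, [1.0, 0.0], top_k=4)

# Drop the padding before using the positions to look up chunks.
valid = [i for i in indices if i != -1]
```

The padding matters for `retrieve_chunks()` because Python accepts `-1` as a list index, so an unfiltered `-1` would silently select the last chunk.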


### Generating Answer Based on Retrieved Chunks

In this cell, the function `generate_answer(query, retrieved_chunks)` is responsible for generating a structured prompt that can be used to answer a user's question based on the content of the retrieved chunks. Here's how it works:

1. **Creating the Context**:

   - The function begins by creating a context string that consists of the relevant chunks retrieved from the previous step. Each chunk is formatted to display the page number and the corresponding text.
   - The chunks are joined together with two newlines (`\n\n`) for better readability.

2. **Answer Generation Prompt**:
   - The function constructs a final prompt that includes:
     - The context made up of relevant chunks.
     - The user's query, which is passed as the input `query` parameter.
   - This prompt is designed to instruct a language model (e.g., OpenAI) to generate an answer based on the provided context.

The generated prompt is returned as a structured string, which will be used in subsequent steps to call a language model (like OpenAI's GPT) to generate an answer based on the content of the textbook.


In [19]:
def generate_answer(query, retrieved_chunks):
    context = "\n\n".join(
        [f"Page {chunk['page']}: {chunk['text']}" for chunk in retrieved_chunks])
    return f"Answer the question based on the following textbook content:\n\n{context}\n\nQuestion: {query}\nAnswer:"
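
As an illustration of the prompt layout, the sketch below repeats the same string construction under a different name so it runs on its own. `build_prompt` and the two sample chunks are hypothetical; the sample text is invented and does not come from the textbook.

```python
def build_prompt(query, retrieved_chunks):
    """Same construction as generate_answer(), repeated so this runs alone."""
    context = "\n\n".join(
        f"Page {chunk['page']}: {chunk['text']}" for chunk in retrieved_chunks
    )
    return (
        "Answer the question based on the following textbook content:\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )


# Invented sample chunks with the same keys the notebook reads.
sample_chunks = [
    {"page": 12, "text": "Velocity is the rate of change of displacement."},
    {"page": 13, "text": "Acceleration is the rate of change of velocity."},
]
prompt = build_prompt("What is acceleration?", sample_chunks)
print(prompt)
```

The result is a single string: an instruction line, the page-labelled chunks separated by blank lines, and the question followed by a trailing `Answer:` cue for the model to complete.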


### Running the RAG Pipeline

In this cell, we define the function `run_rag_pipeline(pdf_path, query)`, which runs retrieval for a user's query and returns a prompt built from the retrieved content. The `pdf_path` parameter is accepted but not used in the function body, since the chunks come from `MyChunks`. The function performs the following steps:

1. **Check for Available Chunks**:

   - First, the function checks if there are any chunks in `MyChunks`. If there are no chunks, it returns the message `"No data available."`. This ensures that the pipeline doesn't run unless there is data to process.

2. **Embedding and Indexing**:

   - If chunks are available, it calls `embed_and_index_chunks()` to generate embeddings for each chunk and index them using FAISS. This prepares the chunks for efficient retrieval.

3. **Retrieve Relevant Chunks**:

   - It then calls `retrieve_chunks(query)` to fetch the most relevant chunks based on the user's query. The function uses the embeddings and FAISS indexing to perform this retrieval efficiently.

4. **Generate Answer**:

   - Finally, it calls `generate_answer(query, retrieved_chunks)`, which assembles the prompt containing the relevant chunks as context followed by the user's query.

The function returns that prompt string, not a model-written answer; no language model is called at this stage.


In [20]:
def run_rag_pipeline(pdf_path, query):
    if not MyChunks:
        return "No data available."
    embed_and_index_chunks()
    retrieved_chunks = retrieve_chunks(query)
    return generate_answer(query, retrieved_chunks)


### Loading the OpenAI API Key

In this cell, we define the function `load_openai_key()` to load the API key for interacting with the OpenAI API.

- The function retrieves the API key from the `my_secrets` module, where it is securely stored in the variable `OPEN_AI_SECRET_KEY`. This key is necessary for making requests to the OpenAI API.

Keeping the key in a separate module like `my_secrets` means it is not hardcoded in the notebook. This only protects the key if `my_secrets.py` itself is kept out of version control and out of any shared copies.


In [21]:
def load_openai_key():
    return my_secrets.OPEN_AI_SECRET_KEY
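
A common alternative to a secrets module is an environment variable. The following is a minimal sketch under that assumption, not what the notebook does: `load_key_from_env` is a hypothetical helper, and the variable names and placeholder value are used for illustration only.

```python
import os


def load_key_from_env(var_name="OPENAI_API_KEY"):
    """Read the API key from an environment variable, failing loudly if unset."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# Demonstration with a placeholder value (not a real key).
os.environ["DEMO_API_KEY"] = "sk-placeholder"
demo_key = load_key_from_env("DEMO_API_KEY")
```

Raising an error when the variable is missing surfaces the problem at startup rather than as an authentication failure in the middle of the pipeline.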


### Calling OpenAI's Chat API for LaTeX Beamer Code Generation

In this cell, we define the function `call_openai_chat()`, which interacts with the OpenAI API to generate LaTeX Beamer code based on the provided prompt.

- **Input**: The function takes two parameters:
  - `prompt`: A string containing the user’s prompt that describes what kind of LaTeX Beamer code is required.
  - `model`: The model used to generate the output, which defaults to `gpt-3.5-turbo`.

- **Process**:

  - The function sets the OpenAI API key using the `load_openai_key()` function defined earlier.
  - It then sends the `prompt` through `openai.ChatCompletion.create`, the interface provided by `openai` package versions before 1.0, along with a system message stating that the assistant is an expert in LaTeX and Beamer presentations.
  - The response is expected to contain LaTeX Beamer code, which is returned after stripping leading and trailing whitespace.

- **Error Handling**: A try-except block catches any exception raised during the API call and returns the text `"An error occurred: ..."` as an ordinary string, so the caller receives a string whether or not the call succeeded.

This function is essential for automatically generating LaTeX Beamer code that can later be compiled into a presentation.


In [22]:
def call_openai_chat(prompt, model="gpt-3.5-turbo"):
    openai.api_key = load_openai_key()
    try:
        response = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "system", "content": "You are an expert in LaTeX and Beamer presentations."},
                      {"role": "user", "content": prompt}],
            max_tokens=4000,
            temperature=0.7
        )
        return response['choices'][0]['message']['content'].strip()
    except Exception as e:
        return f"An error occurred: {e}"
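
Because `call_openai_chat()` returns the error message as an ordinary string, downstream code cannot tell a failure from generated LaTeX without inspecting the text. One way to keep the two apart is sketched below. It is an assumption-laden illustration, not the notebook's method: `safe_chat`, `fake_success`, and `fake_failure` are hypothetical names, and the request function is passed in so the sketch runs without network access.

```python
def safe_chat(request_fn, prompt):
    """Return (text, error); exactly one of the two is None."""
    try:
        return request_fn(prompt).strip(), None
    except Exception as exc:  # mirrors the broad catch in call_openai_chat()
        return None, f"An error occurred: {exc}"


def fake_success(prompt):
    """Hypothetical stand-in for a working API call."""
    return "  \\documentclass{beamer}  "


def fake_failure(prompt):
    """Hypothetical stand-in for a call that times out."""
    raise TimeoutError("request timed out")


ok_text, ok_error = safe_chat(fake_success, "Make slides")
bad_text, bad_error = safe_chat(fake_failure, "Make slides")
```

A caller can then check `error is None` before extracting, saving, and compiling, instead of writing an error message into the `.tex` file.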


### Saving LaTeX Code to a File

This cell defines the function `save_to_file()`, which is responsible for saving the generated LaTeX code to a `.tex` file.

- **Input**:

  - `filename`: The name of the file where the LaTeX code will be saved.
  - `content`: The LaTeX code (as a string) that needs to be written to the file.

- **Process**:

  - The function opens the specified `filename` in write mode (`'w'`).
  - It then writes the `content` (LaTeX code) into the file. The file is created if it doesn't exist and overwritten if it does.

- **Purpose**: This function is used to save the LaTeX Beamer code generated by the OpenAI API so that it can later be compiled into a PDF.


In [23]:
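def save_to_file(filename, content):
    # Explicit UTF-8 so non-ASCII characters are written the same way on every platform
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(content)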


### Compiling LaTeX Code into a PDF

This cell defines the function `compile_latex_to_pdf()`, which is responsible for compiling the saved LaTeX `.tex` file into a PDF using `pdflatex`.

- **Input**:

  - `tex_file`: The name of the `.tex` file that needs to be compiled.

- **Process**:

  - The function attempts to run the `pdflatex` command on the provided `.tex` file using the `subprocess.run()` function.
  - If the compilation is successful, it checks whether the resulting `.pdf` file exists and returns the path to the generated PDF.
  - If the compilation fails (e.g., due to syntax errors in the LaTeX code), the function catches the error and prints an error message.

- **Purpose**: This function is used to convert the LaTeX Beamer code into a viewable PDF document, making the presentation ready for display or further processing.


In [24]:
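def compile_latex_to_pdf(tex_file):
    try:
        # nonstopmode keeps pdflatex from pausing for input on a LaTeX error
        subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", tex_file], check=True)
        # Swap only the extension, not every ".tex" substring in the name
        pdf_file = os.path.splitext(tex_file)[0] + ".pdf"
        return pdf_file if os.path.exists(pdf_file) else None
    except (subprocess.CalledProcessError, FileNotFoundError) as e:
        # FileNotFoundError covers a missing pdflatex executable
        print(f"Error compiling LaTeX: {e}")
        return None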
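
When compilation fails, `CalledProcessError` carries only the exit status; the actual reason is in the `.log` file that `pdflatex` writes. TeX marks each error in the log with a line starting with `!`, followed shortly by a line of the form `l.<number>` showing where it stopped. The sketch below pulls out the first such pair. `first_latex_error` is a hypothetical helper, and `sample_log` is a hand-written, abbreviated excerpt rather than output captured from a real run.

```python
def first_latex_error(log_text):
    """Return the first TeX error message and its source line, or None."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("!"):
            # The "l.<number>" line that follows shows where TeX stopped.
            for follow in lines[i + 1:i + 10]:
                if follow.startswith("l."):
                    return f"{line} ({follow.strip()})"
            return line
    return None


# Hand-written excerpt in the usual log layout (illustrative only).
sample_log = "\n".join([
    "This is pdfTeX (abbreviated sample)",
    "! Undefined control sequence.",
    "l.14 \\frametitel",
    "                 {Newton's Laws}",
])
summary = first_latex_error(sample_log)
print(summary)
```

In this notebook the log would presumably be `response.log`; opening it with `errors="replace"` is a reasonable precaution, since TeX logs are not guaranteed to be valid UTF-8.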


### Extracting Valid LaTeX Code from Raw Response

This cell defines the function `extract_latex_code()`, which processes a raw response from the OpenAI API to extract only the valid LaTeX code, filtering out unnecessary parts such as code block markers.

- **Input**:

  - `raw_response`: The raw text response returned by the OpenAI API. This response may contain code block markers (i.e., ` ```latex ` and ` ``` `), which need to be filtered out.

- **Process**:

  - The function iterates through the lines of the response.
  - It identifies the lines that start with ` ```latex ` to mark the beginning of LaTeX code.
  - It then collects the LaTeX code lines until it reaches the closing code block marker ` ``` `.
  - These lines are stored in a list (`cleaned_lines`) and returned as a single string. If the response contains no ` ```latex ` marker, nothing is collected and the function returns an empty string.

- **Purpose**: This function ensures that only the LaTeX code (within the designated code block) is returned, without any extra formatting or metadata.


In [25]:
def extract_latex_code(raw_response):
    lines = raw_response.split("\n")
    in_code_block = False
    cleaned_lines = []
    for line in lines:
        if line.strip().startswith("```latex"):
            in_code_block = True
            continue
        if line.strip() == "```":
            in_code_block = False
            continue
        if in_code_block:
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)
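
Model replies do not always wrap their code in a fence labelled `latex`; the fence may be unlabelled, labelled `tex`, or missing entirely. The sketch below is one possible, more tolerant variant, offered as an assumption rather than a replacement for the function above. `extract_latex_with_fallback`, `FENCE`, and the sample replies are hypothetical names and data.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# An opening fence with an optional latex/tex label, the body, a closing fence.
FENCED_BLOCK = re.compile(
    FENCE + r"(?:latex|tex)?[ \t]*\n(.*?)\n[ \t]*" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_latex_with_fallback(raw_response):
    """Prefer a fenced block; otherwise take everything from \\documentclass on."""
    match = FENCED_BLOCK.search(raw_response)
    if match:
        return match.group(1).strip()
    if "\\documentclass" in raw_response:
        start = raw_response.index("\\documentclass")
        return raw_response[start:].strip()
    return ""  # nothing that looks like LaTeX


fenced_reply = (
    "Here is the deck:\n" + FENCE + "latex\n\\documentclass{beamer}\n" + FENCE + "\nDone."
)
bare_reply = "Sure.\n\\documentclass{beamer}\n\\begin{document}\\end{document}"

from_fence = extract_latex_with_fallback(fenced_reply)
from_bare = extract_latex_with_fallback(bare_reply)
```

The empty-string result for replies with no LaTeX at all is kept deliberately, so a caller can still detect that there is nothing to compile.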


### Main Execution: Get User's Prompt, Process the Document, Generate LaTeX Beamer Code, and Compile into PDF

This cell runs the complete process starting from getting the user's query, retrieving relevant chunks of text, generating LaTeX Beamer code, and compiling it into a PDF.

- **Step 1: Get the user's query**

  - The script prompts the user for input, asking them to enter their question or query.

- **Step 2: Retrieve relevant chunks from the textbook**

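  - The `run_rag_pipeline()` function embeds and indexes the pre-extracted chunks, retrieves those most relevant to the user's query, and returns them wrapped in a question-answering prompt. The `pdf_path` value is passed in, but the PDF file itself is not read here.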

- **Step 3: Generate LaTeX Beamer code using OpenAI**

  - A prompt is formulated to request OpenAI to generate a detailed LaTeX Beamer presentation on the topic, including equations, bullet points, and TikZ-based diagrams.
  - The `call_openai_chat()` function is used to interact with the OpenAI API and generate the LaTeX code.

- **Step 4: Extract valid LaTeX code**

  - The raw response from OpenAI is processed using `extract_latex_code()` to clean and retrieve only the valid LaTeX code.

- **Step 5: Save LaTeX code to a .tex file**

  - The extracted LaTeX code is saved to a `.tex` file using `save_to_file()`.

- **Step 6: Compile the .tex file into a PDF**
  - The `.tex` file is compiled into a PDF using the `compile_latex_to_pdf()` function.
  - If successful, the generated PDF filename is printed; otherwise, a failure message is shown.

This cell encapsulates the entire RAG pipeline and the process of transforming a user's query into a LaTeX Beamer presentation ready for compilation.


In [26]:
# Cell 12: Main execution - Get user's prompt, process the document, generate LaTeX Beamer code, and compile into PDF

# Step 1: Get the user's query
user_prompt = input("Enter your prompt: ")

# Step 2: Retrieve relevant chunks from the textbook
pdf_path = "Physics 9.pdf"
answer = run_rag_pipeline(pdf_path, user_prompt)

# Step 3: Generate LaTeX Beamer code using OpenAI
beamer_prompt = (
    f"Create a detailed LaTeX Beamer presentation on the following topic:\n\n"
    f"{answer}\n\n"
    "Include equations, bullet points, and TikZ-based diagrams to illustrate concepts. Use only TikZ to draw shapes or vectors "
    "instead of relying on external image files. Ensure the output is ready-to-compile Beamer code."
)
raw_beamer_code = call_openai_chat(beamer_prompt)

# Step 4: Extract valid LaTeX code
beamer_code = extract_latex_code(raw_beamer_code)

# Step 5: Save LaTeX code to a .tex file
tex_filename = "response.tex"
save_to_file(tex_filename, beamer_code)

# Step 6: Compile the .tex file into a PDF
pdf_filename = compile_latex_to_pdf(tex_filename)
if pdf_filename:
    print(f"PDF generated: {pdf_filename}")
else:
    print("Failed to generate PDF.")


PDF generated: response.pdf
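
Several paths can reach the compile step with unusable input: `run_rag_pipeline()` may return `"No data available."`, `extract_latex_code()` may return an empty string, and a reply cut off by the token limit may lack its closing `\end{document}`. A cheap check before calling `pdflatex` can catch these cases. The sketch below is a heuristic under those assumptions, not a LaTeX parser, and `looks_like_beamer` is a hypothetical helper.

```python
def looks_like_beamer(latex_code):
    """Cheap sanity check before calling pdflatex; not a LaTeX parser."""
    required = (
        "\\documentclass",
        "beamer",
        "\\begin{document}",
        "\\end{document}",
    )
    return all(marker in latex_code for marker in required)


complete_deck = (
    "\\documentclass{beamer}\n"
    "\\begin{document}\n"
    "\\begin{frame}{Title}Hello\\end{frame}\n"
    "\\end{document}\n"
)
truncated_deck = "\\documentclass{beamer}\n\\begin{document}\n\\begin{frame}{Title}"

complete_ok = looks_like_beamer(complete_deck)
truncated_ok = looks_like_beamer(truncated_deck)
empty_ok = looks_like_beamer("")
```

In the main cell, such a check could sit between Step 4 and Step 5 so that an empty or truncated result is reported directly instead of appearing as a `pdflatex` failure.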
