# RAG with Vertex AI: A Jupyter Notebook Guide

This notebook demonstrates how to build a simple Retrieval-Augmented Generation (RAG) system using Google Cloud's Vertex AI.
The RAG pipeline involves three main steps:
1. **Data Pre-processing**: Extracting text from a PDF, chunking it into sentences, and generating vector embeddings.
2. **Indexing**: Uploading the embeddings to a Vertex AI Vector Search (formerly Matching Engine) index for fast similarity searches.
3. **Retrieval & Generation**: Given a user query, we retrieve relevant sentences from the index and use a Generative AI model (Gemini Pro) to formulate a grounded response.
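
To make the retrieval step concrete, the sketch below runs a brute-force similarity search over a few hand-made 3-dimensional vectors. The vectors, the IDs, and the `top_k` helper are illustrative assumptions, and cosine similarity is assumed only for the demonstration; Vector Search performs an approximate nearest-neighbor search at scale and may be configured with a different distance measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed, k=2):
    """Return the IDs of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), item_id)
              for item_id, vec in indexed.items()]
    scored.sort(reverse=True)  # Highest similarity first
    return [item_id for _, item_id in scored[:k]]

# Hypothetical toy "embeddings"; real ones come from the embedding model.
toy_index = {
    "s1": [1.0, 0.0, 0.0],
    "s2": [0.9, 0.1, 0.0],
    "s3": [0.0, 1.0, 0.0],
}
nearest = top_k([1.0, 0.05, 0.0], toy_index, k=2)
print(nearest)
```

The IDs returned by the search are then mapped back to sentences, which become the context for the generative model.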

---

In [None]:
print("Starting the RAG pipeline setup.")
#
# ## 1. Setup and Initialization
#
# This section installs necessary Python packages, sets up the environment, and defines key variables for the project.
#

# Install required Python packages.
# `pypdf2` is used for reading PDF files.
# `google-cloud-aiplatform` is the Vertex AI SDK for interacting with Google Cloud's machine learning services.
# `google-cloud-storage` is used for uploading files to Google Cloud Storage (GCS).
!pip install pypdf2
!pip install google-cloud-aiplatform
!pip install google-cloud-storage

# Import necessary libraries.
# These libraries are essential for interacting with Google Cloud services and handling data.
from google.cloud import storage
from vertexai.language_models import TextEmbeddingModel
from google.cloud import aiplatform
import PyPDF2

import re
import os
import random
import json
import uuid

# List the files in the current directory. This is useful for confirming
# the presence of the PDF file you intend to use.
%ls

# Initialize some project variables.
# Uncomment the next line and replace "your_GCP_project_id" with your actual Google Cloud Project ID.
# The `location` is the Google Cloud region where your resources will be created.
# project="your_GCP_project_id"
location="us-central1"

# Define file paths and resource names.
# Ensure the `pdf_path` points to your document.
# The `bucket_name` must be globally unique.
pdf_path="stats.pdf"
bucket_name = "stats-content2024"
embed_file_path = "stats_embeddings.json"
sentence_file_path = "stats_sentences.json"
index_name="stats_index"
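
Because bucket names share a single global namespace, a fixed name such as `stats-content2024` may already be taken. One option, sketched below, is to append a short random suffix. The `make_bucket_name` helper is hypothetical and not part of the original notebook; it assumes the usual naming constraints (lowercase letters, digits, and dashes, at most 63 characters).

```python
import re
import uuid

def make_bucket_name(prefix: str) -> str:
    """Append a short random suffix to `prefix` to reduce the chance of a name clash."""
    # Lowercase the prefix and replace anything other than letters, digits, and dashes.
    base = re.sub(r"[^a-z0-9-]", "-", prefix.lower()).strip("-") or "bucket"
    suffix = uuid.uuid4().hex[:8]
    # Trim the base so that base + "-" + suffix stays within 63 characters.
    return f"{base[:54]}-{suffix}"

candidate = make_bucket_name("stats-content2024")
print(candidate)
```

The result could be assigned to `bucket_name` before the bucket is created; it should then be recorded, since a new suffix is generated on every call.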

---

## 2. Helper Functions
This section defines a series of helper functions that perform the core tasks of the RAG pipeline.

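The first helper, `extract_sentences_from_pdf`, reads the PDF page by page, extracts the text, and splits it into a list of individual sentences.
The remaining helpers generate the embeddings, write them to JSON files, upload a file to GCS, and create and deploy the index.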
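
Splitting on `'. '` alone misses sentences that end in `?` or `!` and is sensitive to line breaks in the extracted text. As a possible alternative, the sketch below uses a regular expression; `split_sentences` is a hypothetical helper, and it still mis-splits abbreviations such as "e.g. ".

```python
import re

def split_sentences(text: str) -> list:
    """Split text after '.', '?' or '!' when followed by whitespace."""
    # Collapse runs of whitespace (including newlines from PDF extraction) first.
    normalized = re.sub(r"\s+", " ", text).strip()
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", normalized)
    return [part for part in parts if part]

sample = "Correlation measures association.  Is it causation?\nNo! It is not."
print(split_sentences(sample))
```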


In [None]:
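def extract_sentences_from_pdf(pdf_path):
    """Return the text of the PDF split into a list of sentences."""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            # Extract once per page; pages without a text layer yield nothing.
            page_text = page.extract_text()
            if page_text:
                text += page_text + " "
    # Naive split on ". "; abbreviations and decimals may be split incorrectly.
    sentences = [sentence.strip() for sentence in text.split('. ') if sentence.strip()]
    return sentences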


def generate_text_embeddings(sentences) -> list:
    """Return one embedding vector (a list of floats) per input sentence."""
    # The SDK must be initialized with a project before the model is loaded.
    # Uncomment the next line once `project` is defined.
    # aiplatform.init(project=project,location=location)
    model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
    embeddings = model.get_embeddings(sentences)
    vectors = [embedding.values for embedding in embeddings]
    return vectors


def generate_and_save_embeddings(pdf_path, sentence_file_path, embed_file_path):
    """Embed each sentence of the PDF and write two JSON Lines files that share IDs."""
    def clean_text(text):
        cleaned_text = re.sub(r'\u2022', '', text)  # Remove bullet points
        cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()  # Collapse whitespace and strip
        return cleaned_text

    # Clean before embedding so that each stored sentence matches its vector.
    sentences = [clean_text(sentence) for sentence in extract_sentences_from_pdf(pdf_path)]
    sentences = [sentence for sentence in sentences if sentence]
    if sentences:
        embeddings = generate_text_embeddings(sentences)
        with open(embed_file_path, 'w') as embed_file, open(sentence_file_path, 'w') as sentence_file:
            for sentence, embedding in zip(sentences, embeddings):
                item_id = str(uuid.uuid4())  # Shared key that links the two files
                embed_item = {"id": item_id, "embedding": embedding}
                sentence_item = {"id": item_id, "sentence": sentence}
                json.dump(sentence_item, sentence_file)
                sentence_file.write('\n')
                json.dump(embed_item, embed_file)
                embed_file.write('\n')
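

def upload_file(bucket_name,file_path):
    """Create the bucket and upload a single local file to it."""
    storage_client = storage.Client()
    # create_bucket() raises a Conflict error if the bucket already exists,
    # so call this helper only once per bucket.
    bucket = storage_client.create_bucket(bucket_name,location=location)
    blob = bucket.blob(file_path)
    blob.upload_from_filename(file_path)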


def create_vector_index(bucket_name, index_name):
    """Build a Vector Search index from the bucket contents and deploy it."""
    lakeside_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
        display_name = index_name,
        # The URI must point at the location that holds the embeddings file.
        contents_delta_uri = "gs://"+bucket_name,
        dimensions = 768,  # Matches the vector size produced by the embedding model
        approximate_neighbors_count = 10,
    )
    lakeside_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
        display_name = index_name,
        public_endpoint_enabled = True
    )
    lakeside_index_endpoint.deploy_index(
        index = lakeside_index, deployed_index_id = index_name
    )
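
`generate_text_embeddings` sends every sentence in a single request, and embedding endpoints typically cap how many texts one request may carry. The sketch below shows one way to split the work into batches. The `batched` and `embed_in_batches` helpers are hypothetical, `fake_embed` is a stand-in so the example runs offline, and `batch_size=5` is an assumed conservative value rather than a documented limit for this model.

```python
from typing import Callable, Iterator, List, Sequence

def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(sentences: Sequence[str],
                     embed_fn: Callable[[Sequence[str]], List[list]],
                     batch_size: int = 5) -> List[list]:
    """Call `embed_fn` once per batch and concatenate the results in order."""
    vectors: List[list] = []
    for batch in batched(sentences, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

# Stand-in embedder for the demonstration: one number per text (its length).
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

texts = ["a", "bb", "ccc", "dddd", "eeeee", "ffffff", "ggggggg"]
vectors = embed_in_batches(texts, fake_embed, batch_size=3)
print(len(vectors))
```

In the notebook, `generate_text_embeddings` could be passed as `embed_fn` in place of `fake_embed`.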

---

## 3. Data Processing and Index Creation
This section executes the functions defined above to generate embeddings,
upload them to GCS, and build the Vector Search index.

**NOTE**: This process can take a significant amount of time, especially the index creation step.

The first call generates and saves embeddings from the PDF.
It creates the `stats_sentences.json` and `stats_embeddings.json` files, each with one JSON object per line.
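
Since index creation is slow, it can be worth checking the embeddings file before anything is uploaded. The sketch below validates records of the form written by `generate_and_save_embeddings` (`{"id": ..., "embedding": [...]}`); the `validate_embedding_lines` helper and the sample records are hypothetical.

```python
import json

def validate_embedding_lines(lines, expected_dim=768):
    """Return a list of problems found; an empty list means the records look consistent."""
    problems = []
    seen_ids = set()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # Ignore blank lines
        record = json.loads(line)
        item_id = record.get("id")
        vector = record.get("embedding")
        if not item_id:
            problems.append(f"line {line_no}: missing id")
        elif item_id in seen_ids:
            problems.append(f"line {line_no}: duplicate id {item_id!r}")
        else:
            seen_ids.add(item_id)
        if not isinstance(vector, list) or len(vector) != expected_dim:
            problems.append(f"line {line_no}: expected {expected_dim} dimensions")
    return problems

# Two deliberately broken toy records with 2-dimensional vectors.
sample_lines = [
    json.dumps({"id": "a", "embedding": [0.1, 0.2]}),
    json.dumps({"id": "a", "embedding": [0.3]}),
]
problems = validate_embedding_lines(sample_lines, expected_dim=2)
print(problems)
```

An open file handle can be passed directly, for example `validate_embedding_lines(open(embed_file_path))`, because the function only iterates over lines.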

In [None]:
generate_and_save_embeddings(pdf_path,sentence_file_path,embed_file_path)
# Upload the file containing the embeddings to GCS so that Vector Search can index it.
# The sentences file stays local; it is loaded later to map embedding IDs back to the original sentences.
upload_file(bucket_name,embed_file_path)
# Create the vector index and deploy it to a public endpoint.
# The embeddings must be in a GCS bucket to be indexed.
create_vector_index(bucket_name, index_name)
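
The `upload_file` helper creates the bucket on every call, so rerunning the cell fails once the bucket exists. A get-or-create variant is sketched below. The `upload_to_bucket` name is hypothetical, and the client is passed in as an argument (an assumption made so that the logic can be exercised without network access); `lookup_bucket`, `create_bucket`, `blob`, and `upload_from_filename` are the `google-cloud-storage` calls it relies on.

```python
def upload_to_bucket(client, bucket_name, file_path, location="us-central1"):
    """Upload `file_path`, creating the bucket only if it does not exist yet."""
    # lookup_bucket() returns None when the bucket is missing instead of raising.
    bucket = client.lookup_bucket(bucket_name)
    if bucket is None:
        bucket = client.create_bucket(bucket_name, location=location)
    # Store the object under the same name as the local file.
    blob = bucket.blob(file_path)
    blob.upload_from_filename(file_path)
    return blob
```

In the notebook this would be called with `storage.Client()` as the first argument.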

---

## 4. Retrieval and Generation
This section demonstrates how to use the created Vector Search index to retrieve relevant
information and a Gemini model to generate a final, grounded answer.

In [None]:
from vertexai.language_models import TextEmbeddingModel
from google.cloud import aiplatform
import vertexai
from vertexai.preview.generative_models import GenerativeModel, Part
import json
import os

# Project settings used to initialize the SDK.
# Uncomment the next line and fill in your project ID.
# project="YOUR_GCP_PROJECT"
location="us-central1"

# Define file path and index name.
# The `sentence_file_path` is used to load the original sentences.
# The `index_name` is the ID of the deployed Vector Search index.
sentence_file_path = "stats_sentences.json"
index_name="stats_index"  # Deployed index ID; get this from the console or the previous step

# Uncomment both lines for a clean run once `project` is set.
# aiplatform.init(project=project,location=location)
# vertexai.init(project=project,location=location)

# Initialize the generative model (Gemini Pro).
model = GenerativeModel("gemini-pro")

# Connect to the deployed Vector Search index endpoint.
# Replace the ID below with your own index endpoint ID from the console.
lakeside_index_ep = aiplatform.MatchingEngineIndexEndpoint(index_endpoint_name="1376179539650019328")

# Function to generate text embeddings for the user query.
# This uses the same `TextEmbeddingModel` as the indexing step to ensure the
# query embedding is in the same vector space.
def generate_text_embeddings(sentences) -> list:
    model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
    embeddings = model.get_embeddings(sentences)
    vectors = [embedding.values for embedding in embeddings]
    return vectors


# Function to retrieve the original sentences from the loaded data based on their IDs.
# The sentences are joined with newlines, in the order of `ids`.
def generate_context(ids,data):
    context = ''
    for item_id in ids:
        for entry in data:
            if entry['id'] == item_id:
                context += entry['sentence'] + "\n"
    return context.strip()


# Function to load the sentence data from a JSON Lines file (one JSON object per line).
def load_file(sentence_file_path):
    data = []
    with open(sentence_file_path,'r') as f:
        for line in f:
            entry = json.loads(line)
            data.append(entry)
    return data
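
`generate_context` scans the whole sentence list once per ID. For larger documents, a dictionary keyed by ID avoids the nested loop. The sketch below is one possible variant; `build_sentence_lookup`, `generate_context_fast`, and the sample records are hypothetical names and data, and unknown IDs are skipped just as in the original function.

```python
def build_sentence_lookup(data):
    """Map each record's id to its sentence for constant-time access."""
    return {entry["id"]: entry["sentence"] for entry in data}

def generate_context_fast(ids, lookup):
    """Join the sentences for `ids` with newlines, in the order given."""
    return "\n".join(lookup[item_id] for item_id in ids if item_id in lookup)

# Toy records in the same shape as the lines of the sentences file.
sample_data = [
    {"id": "1", "sentence": "Correlation measures linear association."},
    {"id": "2", "sentence": "Variance measures spread."},
]
lookup = build_sentence_lookup(sample_data)
print(generate_context_fast(["2", "missing", "1"], lookup))
```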

---

## 5. Query and Grounded Response
This section executes the retrieval and generation steps to answer a user query.
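
In [None]:
# Load the sentence data into memory.
data=load_file(sentence_file_path)
# A bare `data` expression only displays when it is the last line of a cell, so print a summary instead.
print(f"Loaded {len(data)} sentences.")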
# Define the user query.
query=["what is correlation?"]
# Generate embeddings for the user query.
qry_emb=generate_text_embeddings(query)
# qry_emb
# Perform a similarity search on the Vector Search index.
# It retrieves the 10 nearest neighbors (most relevant sentences) to the query.
response = lakeside_index_ep.find_neighbors(
    deployed_index_id = index_name,
    queries = [qry_emb[0]],
    num_neighbors = 10
)
# Extract the IDs of the top-k matching sentences.
matching_ids = [neighbor.id for sublist in response for neighbor in sublist]
# Use the matching IDs to retrieve the full, original sentences.
context = generate_context(matching_ids,data)
# Create the final prompt by combining the retrieved context and the user's query.
# This is a key step in RAG, as it grounds the LLM's response in the provided context.
# `query` is a list, so use its first element to keep list brackets out of the prompt.
prompt=f"Based on the context delimited in backticks, answer the query. ```{context}``` {query[0]}"
# Start a chat session with Gemini and send the grounded prompt.
# The model will use the provided context to generate an answer.
chat = model.start_chat(history=[])
response = chat.send_message(prompt)
# Print the final, grounded response from the model.
print(response.text)
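
The prompt assembly in the cell above can be factored into a small function so that the same template is reused for every query. The sketch below is one way to do that; `build_prompt` is a hypothetical helper, and it takes the question as a plain string rather than a list.

```python
def build_prompt(context: str, question: str) -> str:
    """Wrap the retrieved context in backticks and append the user's question."""
    fence = "`" * 3  # Three backticks, built here to keep the template readable
    return (
        "Based on the context delimited in backticks, answer the query. "
        f"{fence}{context}{fence} {question}"
    )

example_prompt = build_prompt(
    "Correlation measures linear association.",
    "what is correlation?",
)
print(example_prompt)
```

With the variables from this notebook, the call would be `build_prompt(context, query[0])`.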