<a href="https://colab.research.google.com/github/bidin485/RAG-TRY-OUTS/blob/main/Building_a_RAG_App_on_GCP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# A quick start guide for building a RAG application on Google Cloud,
# adapted for use in a Google Colab notebook.
# This code demonstrates the core steps from environment setup to a complete query.
# Before running this code, ensure you have a Google Cloud project with
# the Vertex AI API enabled.

# --- 1. SETUP FOR COLAB ---
# Install the necessary libraries in your Colab environment.
# The `!` prefix allows you to run shell commands directly from the notebook.
!pip install --upgrade google-cloud-aiplatform google-cloud-storage

# Authenticate with Google Cloud using your user account.
# Colab will prompt you to sign in and grant the notebook access.
from google.colab import auth
auth.authenticate_user()

# Import the vertexai library
import vertexai

# --- 2. CONFIGURATION ---
# Define your Google Cloud project details.
PROJECT_ID = "text-to-speech-test-469811"  # Your project ID
LOCATION = "us-west1"       # Your desired region, e.g., "us-east4"
BUCKET_NAME = "trying-out-things-bucket" # Your Cloud Storage bucket name
GCS_FOLDER_PATH = "Software Engineering/" # Path to your folder
CORPUS_DISPLAY_NAME = "my-rag-corpus"

# Set the project and initialize the Vertex AI SDK.
# The `gcloud` command sets the active project for the session.
!gcloud config set project {PROJECT_ID}

print("Initializing Vertex AI...")
vertexai.init(project=PROJECT_ID, location=LOCATION)

# --- 3.A DYNAMIC FILE LISTING (RECURSIVE) ---
# Use the Google Cloud Storage client to get a list of all files recursively
# within the specified folder and all its subfolders.
try:
    print(f"Listing files recursively in folder 'gs://{BUCKET_NAME}/{GCS_FOLDER_PATH}'...")
    from google.cloud import storage
    storage_client = storage.Client(project=PROJECT_ID)
    bucket = storage_client.bucket(BUCKET_NAME)

    # Without a `delimiter`, `list_blobs` returns every object whose name starts
    # with the prefix, so files in nested "subfolders" are included as well.
    blobs = bucket.list_blobs(prefix=GCS_FOLDER_PATH)

    GCS_FILE_PATHS = []
    for blob in blobs:
        # We only want to add files, not subdirectories or the folder itself.
        if not blob.name.endswith('/'):
            file_path = f"gs://{BUCKET_NAME}/{blob.name}"
            GCS_FILE_PATHS.append(file_path)
            print(f"  Found: {file_path}")  # Show the path as it's found

    if not GCS_FILE_PATHS:
        raise ValueError("No files found in the specified folder. Please check your folder path.")

    print(f"Found a total of {len(GCS_FILE_PATHS)} files to import.")

except Exception as e:
    print(f"An error occurred while listing files: {e}")
    # Re-raise so execution stops here; the later steps need GCS_FILE_PATHS.
    # Raising an error is a Colab-friendly way to halt the remaining code.
    raise RuntimeError("Script stopped due to file listing error.") from e
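
# --- Optional sketch: the listing filter as a standalone helper ---
# A "folder" created in the Cloud Console is stored as a zero-byte object whose
# name ends with '/', which is why the loop above skips such names. The helper
# below is a hypothetical refactoring of that loop (`to_gcs_uris` is not part
# of any library); it takes plain object names so it can be checked without a
# bucket.
def to_gcs_uris(bucket_name, blob_names):
    """Return gs:// URIs for real objects, skipping folder placeholders."""
    return [
        f"gs://{bucket_name}/{name}"
        for name in blob_names
        if not name.endswith("/")
    ]

# Possible usage with the client above (assumes `bucket` is still in scope):
#   GCS_FILE_PATHS = to_gcs_uris(
#       BUCKET_NAME, (b.name for b in bucket.list_blobs(prefix=GCS_FOLDER_PATH))
#   )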

# --- 3.B CORPUS CREATION & DATA INGESTION ---
# This step creates a new RAG corpus and imports your documents from the dynamically generated list.
# The RAG Engine automatically handles chunking and embedding generation.
# This process can take a few minutes depending on the size of your documents.
from vertexai import rag
try:
    print(f"Creating a new RAG Corpus named '{CORPUS_DISPLAY_NAME}'...")
    rag_corpus = rag.create_corpus(
        display_name=CORPUS_DISPLAY_NAME
    )
    print(f"Corpus created with resource name: {rag_corpus.name}")

    print("Importing files from Cloud Storage...")
    # `rag.import_files` blocks until the import finishes and returns the
    # import response itself, so there is no operation `.result()` to wait on.
    import_response = rag.import_files(
        rag_corpus.name,
        GCS_FILE_PATHS,
        # Transformation config is optional, but good for tuning chunking.
        transformation_config=rag.TransformationConfig(
            chunking_config=rag.ChunkingConfig(
                chunk_size=512,
                chunk_overlap=100,
            ),
        ),
    )
    print(f"Import finished. Files imported: {import_response.imported_rag_files_count}")

except Exception as e:
    print(f"An error occurred during corpus creation or file import: {e}")
    # To reuse an existing corpus instead of creating a new one, fetch it with
    # `rag.get_corpus(name="projects/...")`.
    raise RuntimeError("Script stopped due to corpus creation/import error.") from e
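
# --- Optional sketch: look up a corpus by display name ---
# `rag.get_corpus` needs the full resource name. One possible way to find a
# corpus by its display name is to scan the result of `rag.list_corpora()`.
# `find_corpus_by_display_name` is a hypothetical helper, not an SDK function;
# it only assumes each item has a `display_name` attribute.
def find_corpus_by_display_name(corpora, display_name):
    """Return the first corpus whose display_name matches, or None."""
    for corpus in corpora:
        if corpus.display_name == display_name:
            return corpus
    return None

# Possible usage before calling `rag.create_corpus` (not executed here):
#   existing = find_corpus_by_display_name(rag.list_corpora(), CORPUS_DISPLAY_NAME)
#   rag_corpus = existing or rag.create_corpus(display_name=CORPUS_DISPLAY_NAME)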


# --- 4. RETRIEVAL & GENERATION ---
# This is the real-time part of the RAG application.
# 4a. Create a RAG retrieval tool based on the corpus.
from vertexai.generative_models import GenerativeModel, Tool

rag_retrieval_tool = Tool.from_retrieval(
    retrieval=rag.Retrieval(
        source=rag.VertexRagStore(
            rag_resources=[
                rag.RagResource(
                    rag_corpus=rag_corpus.name
                )
            ],
        ),
    )
)

# 4b. Create a Gemini model instance and pass the tool to it.
# The model will use this tool to "look up" information from your corpus.
model = GenerativeModel(
    model_name="gemini-2.0-flash-001",
    tools=[rag_retrieval_tool]
)

# 4c. Formulate a user query and generate a response.
user_query = "Summarize the key takeaways from the documents."
print(f"\nSending query to Gemini: '{user_query}'...")

# The model will automatically use the tool to find relevant context
# before generating the final, grounded response.
response = model.generate_content(user_query)

# Print the final response from the model.
print("\n--- GENERATED RESPONSE ---")
print(response.text)
print("--------------------------")
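
# --- Optional sketch: inspect what the retriever returns ---
# To check which chunks ground an answer, the corpus can also be queried
# directly, without the model. This sketch assumes the `vertexai.rag` API
# (`rag.retrieval_query` with `rag.RagRetrievalConfig`) and that each returned
# context exposes `source_uri` and `text`; `format_contexts` and
# `preview_retrieval` are hypothetical helper names.
def format_contexts(contexts, max_chars=200):
    """Render retrieved contexts as numbered 'source: snippet' lines."""
    lines = []
    for i, ctx in enumerate(contexts, start=1):
        # Collapse whitespace so each snippet stays on one line.
        snippet = " ".join(ctx.text.split())[:max_chars]
        lines.append(f"[{i}] {ctx.source_uri}: {snippet}")
    return "\n".join(lines)


def preview_retrieval(corpus_name, query, top_k=3):
    """Fetch the top_k chunks for `query` and return them as text."""
    from vertexai import rag  # imported lazily so the formatter works offline

    response = rag.retrieval_query(
        rag_resources=[rag.RagResource(rag_corpus=corpus_name)],
        text=query,
        rag_retrieval_config=rag.RagRetrievalConfig(top_k=top_k),
    )
    return format_contexts(response.contexts.contexts)

# Possible usage once the import has succeeded (not executed here):
#   print(preview_retrieval(rag_corpus.name, user_query))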

Collecting google-cloud-storage
  Using cached google_cloud_storage-3.3.0-py3-none-any.whl.metadata (13 kB)
Updated property [core/project].
Initializing Vertex AI...
Listing files recursively in folder 'gs://trying-out-things-bucket/Software Engineering/'...
  Found: gs://trying-out-things-bucket/Software Engineering/Year 3/semester  2/exams/big data and cloud computing test.pdf
  Found: gs://trying-out-things-bucket/Software Engineering/Year 3/semester  2/exams/research methodology exam.pdf
  Found: gs://trying-out-things-bucket/Software Engineering/Year 3/semester  2/exams/research_methodology test.pdf
  Found: gs://trying-out-things-bucket/Software Engineering/Year 3/semester  2/exams/software architecture and patterns exam.pdf
  Found: gs://trying-out-things-bucket/Software Engineering/Year 3/semester  2/exams/software evolution exam.pdf
  Found: gs://trying-out-things-bucket/Software Engineering/Year 3/semester  2/exams/software testing and verification exam.pdf
  Found: gs://try

RuntimeError: Script stopped due to corpus creation/import error.