# Building a Production-Ready RAG Application on Google Cloud

**_NOTE_**: This notebook has been tested in the following environment:
* Python version = 3.12+
* Google Cloud SDK (gcloud)
* Docker

## Overview

This notebook walks through building and deploying a Retrieval-Augmented Generation (RAG) application on Google Cloud, step by step. The end result is a serverless API on Cloud Run that answers questions about a single document, plus a separate, interactive Streamlit UI that calls the API.

This guide is intended for developers and ML engineers looking to understand the end-to-end lifecycle of a production AI application, from data processing to a fully deployed, containerized web service.

This notebook uses a two-part architecture:
1.  An offline **Indexing Job** to process documents and create a vector store.
2.  A lightweight **Serving Application** deployed on Cloud Run.
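
The split between the two parts can be illustrated with a toy, standard-library sketch. It is not the tutorial's code: the three-dimensional vectors stand in for real embeddings, and `build_index` and `retrieve` are hypothetical helper names. What it shows is that similarity is computed against the summary embeddings, while the text handed back is the original page.

```python
import math

def _normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def build_index(summary_vectors, original_texts):
    """Offline step: pair one summary embedding with one original text."""
    if len(summary_vectors) != len(original_texts):
        raise ValueError("need exactly one embedding per text")
    return [_normalize(v) for v in summary_vectors], list(original_texts)

def retrieve(index, query_vector, k=1):
    """Serving step: return the k texts whose summaries best match the query."""
    vectors, texts = index
    query = _normalize(query_vector)
    scores = [sum(a * b for a, b in zip(v, query)) for v in vectors]
    ranked = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
    return [texts[i] for i in ranked[:k]]

# Toy data: made-up vectors, not real model output.
index = build_index(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    ["page about attention", "page about training", "page about results"],
)
print(retrieve(index, [0.1, 0.9, 0.0]))  # → ['page about training']
```

In the real application the offline step runs once and its output is stored, so the serving step only has to embed the incoming query.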

### Objective

In this tutorial, you will learn how to:
-   Set up a Google Cloud project with the necessary APIs and permissions.
-   Create a development environment using Vertex AI Workbench.
-   Build a RAG indexing pipeline using LangChain and Vertex AI models.
-   Persist the generated index to Google Cloud Storage.
-   Containerize a Python web application using a Dockerfile.
-   Deploy the application as a scalable, serverless API on Cloud Run.
-   Create and run a separate Streamlit UI to interact with the API.

This tutorial uses the following Google Cloud services:
-   **Vertex AI** (for Gemini and Embedding Models)
-   **Cloud Run** (for serverless deployment)
-   **Cloud Build** (for container builds)
-   **Artifact Registry** (for container storage)
-   **Cloud Storage** (for data persistence)
-   **Vertex AI Workbench** (for the development environment)


### Dataset

This project uses the seminal AI research paper **"Attention Is All You Need"** as its source document. The paper is publicly available and will be downloaded directly from arXiv during the indexing process. This serves as the knowledge base for our RAG agent.

### Costs 💵

This tutorial uses billable components of Google Cloud:
* [Vertex AI](https://cloud.google.com/vertex-ai/pricing)
* [Cloud Run](https://cloud.google.com/run/pricing)
* [Cloud Storage](https://cloud.google.com/storage/pricing)
* [Cloud Build](https://cloud.google.com/build/pricing)
* [Artifact Registry](https://cloud.google.com/artifact-registry/pricing)

To generate a cost estimate based on your projected usage, use the [Pricing Calculator](https://cloud.google.com/products/calculator/). Remember to clean up all resources after completing the tutorial to avoid incurring further charges.

---
## Before you begin

### 1. Set up your Google Cloud project
Follow these steps to set up your environment.
1.  [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager).
2.  [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).
3.  [Enable all necessary APIs](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,run.googleapis.com,cloudbuild.googleapis.com,artifactregistry.googleapis.com,storage.googleapis.com,compute.googleapis.com,notebooks.googleapis.com).
4.  Install and initialize the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID and Region
Update the following variables with your specific project details.

In [None]:
# --- YOUR CONFIGURATION ---
PROJECT_ID = "new-rag-project-prod"  # @param {type:"string"}
REGION = "europe-west1"  # @param {type:"string"}

# --- SET GCLOUD DEFAULTS ---
!gcloud config set project {PROJECT_ID}
!gcloud config set run/region {REGION}

### 2. Set up your Development Environment
For this project, it is highly recommended to use a **Vertex AI Workbench** instance to avoid resource limitations and dependency conflicts.

1.  [Create a new Vertex AI Workbench instance](https://console.cloud.google.com/vertex-ai/workbench/instances) in the same `REGION` as above.
2.  Once it's running, click **"OPEN JUPYTERLAB"** and open a new **Terminal**.
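3.  Run all subsequent commands from that Workbench instance: either execute the cells in a notebook there, or paste the contents of each `%%bash` cell (without the `%%bash` line) into the terminal.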

---
## Part 1: The Indexing Job ⚙️

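This one-time job builds the index the RAG application retrieves from: it downloads the paper, summarizes each page, embeds the summaries, and saves the embeddings and the original page texts as pickle files.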

### 1. Create the project files
First, we create the directory and the necessary Python script and dependency file.

In [None]:
%%bash
# Create the project directory
mkdir -p ~/rag-project
cd ~/rag-project

# Create requirements.txt for the build script
cat > requirements.txt << EOF
langchain-google-vertexai
langchain
langchain-community
scikit-learn
unstructured[pdf]
pypdf
requests
numpy
EOF

# Create the build_index.py script
cat > build_index.py << EOF
import pickle
import requests
import vertexai
import numpy as np
from langchain_community.document_loaders import PyPDFLoader
from langchain_google_vertexai import VertexAIEmbeddings, VertexAI
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

print("--- Starting Index Build Job ---")

PROJECT_ID = "$(gcloud config get-value project)"
REGION = "europe-west1"
vertexai.init(project=PROJECT_ID, location=REGION)

texts = [d.page_content for d in PyPDFLoader("https://arxiv.org/pdf/1706.03762.pdf").load() if d.page_content.strip()]
print(f"Loaded {len(texts)} pages.")

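# Ask the model explicitly for a summary; without a prompt it only receives the raw page text.
summary_prompt = PromptTemplate.from_template(
    "Summarize the following page of a research paper in a few sentences:\n\n{text}"
)
summarizer = summary_prompt | VertexAI(model_name="gemini-2.5-flash") | StrOutputParser()
summaries = summarizer.batch([{"text": t} for t in texts])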
print(f"Generated {len(summaries)} summaries.")

embeddings = VertexAIEmbeddings(model_name="gemini-embedding-001").embed_documents(summaries)
print("Created embeddings.")

with open("summary_embeddings.pkl", "wb") as f:
    pickle.dump(np.array(embeddings), f)
with open("original_texts.pkl", "wb") as f:
    pickle.dump(texts, f)

print("--- Index Build Job Complete ---")
EOF

echo "Indexing files created successfully."

### 2. Run the Indexing Job
Now, install the dependencies into a virtual environment and run the script.

In [None]:
%%bash
cd ~/rag-project

# Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies and run the build script
pip install -r requirements.txt
python build_index.py
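
Before uploading, it can be worth confirming that the two pickle files line up, since the serving application relies on there being one embedding per page text. The following is a minimal sketch of such a check; `check_index` is a hypothetical helper, not part of the tutorial's scripts, and the demo runs on toy files in a temporary directory rather than on the real index.

```python
import pickle
import tempfile
from pathlib import Path

def check_index(embeddings_path, texts_path):
    """Load both pickle files and confirm they describe the same number of pages."""
    with open(embeddings_path, "rb") as f:
        embeddings = pickle.load(f)
    with open(texts_path, "rb") as f:
        texts = pickle.load(f)
    if len(embeddings) != len(texts):
        raise ValueError(f"{len(embeddings)} embeddings but {len(texts)} texts")
    return len(texts)

# Demo on toy files; the values are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    emb_path = Path(tmp) / "summary_embeddings.pkl"
    txt_path = Path(tmp) / "original_texts.pkl"
    emb_path.write_bytes(pickle.dumps([[0.1, 0.2], [0.3, 0.4]]))
    txt_path.write_bytes(pickle.dumps(["page one", "page two"]))
    page_count = check_index(emb_path, txt_path)

print(page_count)  # → 2
```

Against the real output, the call would be `check_index("summary_embeddings.pkl", "original_texts.pkl")` from `~/rag-project`, inside the virtual environment so that NumPy is available when the embeddings array is unpickled.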

### 3. Create a Cloud Storage bucket and Upload the Index
The generated `.pkl` files are now uploaded to a GCS bucket to be used by our serving application.

In [None]:
%%bash
cd ~/rag-project

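# Create a GCS bucket named after the project ID (bucket names must be globally unique).
# {REGION} is not interpolated inside %%bash cells, so read the region from gcloud config.
REGION="$(gcloud config get-value run/region)"
GCS_BUCKET="gs://rag-data-bucket-$(gcloud config get-value project)"
gcloud storage buckets create "$GCS_BUCKET" --location="$REGION"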

# Upload the index files
gcloud storage cp *.pkl $GCS_BUCKET/

echo "Index files uploaded to $GCS_BUCKET"
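
Because the bucket name is derived from the project ID, an unusual project ID (for example a domain-scoped one containing a colon) can produce a name that Cloud Storage rejects. The sketch below checks a simplified subset of the naming rules (3 to 63 characters; lowercase letters, digits, dots, dashes, and underscores; starting and ending with a letter or digit). `bucket_name_for` is a hypothetical helper, and the regular expression is an assumption that does not cover every rule, such as the restrictions on names containing "google".

```python
import re

# Simplified subset of the Cloud Storage bucket naming rules (an assumption,
# not the full specification).
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")

def bucket_name_for(project_id, prefix="rag-data-bucket-"):
    """Derive the tutorial's bucket name and reject obviously invalid results."""
    name = f"{prefix}{project_id}"
    if not _BUCKET_RE.fullmatch(name):
        raise ValueError(f"{name!r} is not a usable bucket name")
    return name

print(bucket_name_for("new-rag-project-prod"))  # → rag-data-bucket-new-rag-project-prod
```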

---
## Part 2: The Serving API 🚀

Now we create and deploy the lightweight server that will handle user queries.

### 1. Create the Application Files
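In a separate directory, we create the server's code, dependencies, and `Dockerfile`. Note that `main.py` hard-codes `PROJECT_ID` and `REGION`; edit them in the cell below if your values differ from the ones set at the top of this notebook.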

In [None]:
%%bash
# Create a new directory for the server
mkdir -p ~/rag-server
cd ~/rag-server

# Create the server's requirements.txt
cat > requirements.txt << EOF
langchain-google-vertexai
langchain
scikit-learn
google-cloud-storage
langchain-community
functions-framework
numpy
EOF

# Create the Dockerfile
cat > Dockerfile << EOF
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py .
CMD exec functions-framework --target=rag_http_handler
EOF

# Create the main.py server code
cat > main.py << EOF
import os
import pickle
import functools
from google.cloud import storage
import numpy as np
import vertexai
from langchain_community.vectorstores import SKLearnVectorStore
from langchain_google_vertexai import ChatVertexAI, VertexAIEmbeddings
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import HumanMessage
from langchain.prompts import PromptTemplate
import functions_framework
from flask import jsonify

PROJECT_ID = "new-rag-project-prod"
REGION = "europe-west1"
GCS_BUCKET = f"rag-data-bucket-{PROJECT_ID}"

@functools.lru_cache(maxsize=1)
def _load_rag_chain():
    print("--- Cold Start: Loading pre-built index from GCS ---")
    vertexai.init(project=PROJECT_ID, location=REGION)
    storage_client = storage.Client()
    bucket = storage_client.bucket(GCS_BUCKET)

    blob_embeddings = bucket.blob("summary_embeddings.pkl")
    blob_embeddings.download_to_filename("/tmp/summary_embeddings.pkl")
    blob_texts = bucket.blob("original_texts.pkl")
    blob_texts.download_to_filename("/tmp/original_texts.pkl")
    print("Downloaded index files.")

    with open("/tmp/summary_embeddings.pkl", "rb") as f:
        embeddings = pickle.load(f)
    with open("/tmp/original_texts.pkl", "rb") as f:
        texts = pickle.load(f)
    print("Loaded index data into memory.")

    embedding_function = VertexAIEmbeddings(model_name="gemini-embedding-001", location=REGION)
    model = ChatVertexAI(model_name="gemini-2.5-pro", temperature=0.2, location=REGION)

    vectorstore = SKLearnVectorStore(embedding=embedding_function)
    vectorstore.add_texts(texts=texts, embeddings=embeddings.tolist())

    retriever = vectorstore.as_retriever()

    template = "Answer the question based only on the following context:\n{context}\n\nQuestion: {question}"
    prompt = PromptTemplate.from_template(template)
    chain = (
        RunnableParallel(context=retriever, question=RunnablePassthrough())
        | prompt
        | model
        | StrOutputParser()
    )
    print("--- RAG Chain Initialized Successfully ---")
    return chain

@functions_framework.http
def rag_http_handler(request):
    request_json = request.get_json(silent=True)
    if not request_json or "query" not in request_json:
        return jsonify({"error": "JSON body with a 'query' key is required."}), 400
    query = request_json["query"]
    try:
        rag_chain = _load_rag_chain()
        result = rag_chain.invoke(query)
        return jsonify({"response": result}), 200
    except Exception as e:
        print(f"Error during chain invocation: {e}")
        return jsonify({"error": f"Failed to process the request: {str(e)}"}), 500
EOF

echo "Serving files created successfully."
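
The `@functools.lru_cache(maxsize=1)` decorator on `_load_rag_chain` is what limits the index download to cold starts: the first request in a container instance pays the loading cost, and later requests in the same process reuse the cached chain. A minimal sketch of that pattern, with a hypothetical `load_expensive_resource` standing in for the real loader:

```python
import functools

load_count = 0

@functools.lru_cache(maxsize=1)
def load_expensive_resource():
    """Stand-in for _load_rag_chain: the body runs once per process."""
    global load_count
    load_count += 1
    return {"status": "ready"}

first = load_expensive_resource()
second = load_expensive_resource()

# The loader ran once, and both calls returned the same cached object.
print(load_count, first is second)  # → 1 True
```

`lru_cache` does not cache exceptions, so if loading fails on one request, the next request tries again rather than replaying the error.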

### 2. Deploy the API to Cloud Run
Run the deployment command from the `rag-server` directory. Cloud Build builds the container from your `Dockerfile`, and Cloud Run deploys it. Note that `--allow-unauthenticated` makes the endpoint publicly callable.

In [None]:
%%bash
cd ~/rag-server

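# {REGION} is not interpolated inside %%bash cells, so read the region from gcloud config.
REGION="$(gcloud config get-value run/region)"

gcloud run deploy rag-multimodal-api \
    --source . \
    --platform managed \
    --region="$REGION" \
    --allow-unauthenticated \
    --memory=2Gi \
    --clear-base-image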
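
Once the deployment finishes, the service can be smoke-tested with a single POST request. The sketch below uses only the standard library and mirrors what the handler expects: a JSON body with a `query` key, answered by a JSON object with a `response` key. `build_query_request` and `ask` are hypothetical helper names, and the URL is a placeholder to be replaced with the one printed by `gcloud run deploy`.

```python
import json
import urllib.request

def build_query_request(service_url, query):
    """Build the POST request the handler expects: a JSON body with a 'query' key."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        service_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(service_url, query, timeout=300):
    """Send the request and return the 'response' field of the JSON reply."""
    request = build_query_request(service_url, query)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]

# Placeholder URL: nothing is sent here, the request is only constructed.
request = build_query_request("https://example.invalid", "What is multi-head attention?")
print(request.get_method(), request.data.decode("utf-8"))
```

Calling `ask(<your service URL>, "What is multi-head attention?")` sends the request. The first call may be slow, because a cold start has to download and load the index before answering.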

---
## Part 3: The Interactive UI 💬

This section is optional but demonstrates how to build a simple frontend to interact with your new API.

### 1. Create the Streamlit App

In [None]:
%%bash
mkdir -p ~/rag-ui
cd ~/rag-ui

# Create requirements.txt
cat > requirements.txt << EOF
streamlit
requests
EOF

# Create app.py
# IMPORTANT: You must manually edit this file and replace the placeholder
# with the actual URL of your deployed Cloud Run service from the step above.
cat > app.py << EOF
import streamlit as st
import requests
import json

API_URL = "YOUR_CLOUD_RUN_SERVICE_URL_HERE" # <-- PASTE YOUR URL HERE

st.set_page_config(page_title="RAG Research Assistant", page_icon="🤖")
st.title("🤖 RAG Research Assistant")
st.caption("Ask questions about the 'Attention Is All You Need' paper.")

if "messages" not in st.session_state:
    st.session_state.messages = []

def get_rag_response(query):
    headers = {"Content-Type": "application/json"}
    data = {"query": query}
    try:
        response = requests.post(API_URL, headers=headers, data=json.dumps(data), timeout=300)
        response.raise_for_status()
        return response.json().get("response", "Sorry, I couldn't get a response.")
    except requests.exceptions.RequestException as e:
        return f"An error occurred: {e}"

for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if prompt := st.chat_input("What is your question?"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            response = get_rag_response(prompt)
            st.markdown(response)
    st.session_state.messages.append({"role": "assistant", "content": response})
EOF

### 2. Run the Streamlit UI
Run these commands in the Workbench terminal to start the UI.

In [None]:
%%bash
cd ~/rag-ui

# Create and activate a virtual environment
python3 -m venv ui-env
source ui-env/bin/activate

# Install dependencies and run the app
pip install -r requirements.txt
streamlit run app.py

---
## Cleaning up

To clean up all Google Cloud resources used in this project, delete the individual resources you created.

In [None]:
# Delete the Cloud Run service
!gcloud run services delete rag-multimodal-api --region={REGION} --quiet

# Delete the GCS bucket
# This is a Python cell, so build the bucket name from PROJECT_ID instead of shell substitution.
GCS_BUCKET = f"gs://rag-data-bucket-{PROJECT_ID}"
!gcloud storage rm -r {GCS_BUCKET} --quiet

# Delete the Vertex AI Workbench instance (from the Cloud Console)
print("Please delete the Vertex AI Workbench instance manually from the Google Cloud Console.")