# Vertex AI LLM Deployment Demonstration
This notebook walks through an end-to-end example of fine-tuning a GPT-like model, saving it, uploading to Google Cloud Storage (GCS), building a custom Docker container, and deploying to Vertex AI Endpoints.

## Prerequisites
1. A [Vertex AI Workbench](https://cloud.google.com/vertex-ai/docs/workbench) notebook (or another environment with GCP SDK configured).
2. A GCP project with Vertex AI and Artifact Registry enabled.
3. Sufficient GPU quota if you want to do GPU-based fine-tuning.
4. Python libraries: `transformers`, `torch`, `datasets`, and `google-cloud-storage`.

## Steps
1. **Install/Upgrade Libraries** (if needed)
2. **Authenticate** to GCP
3. **Set Project Variables**
4. **Fine-tune** a GPT-like model (using `distilgpt2` as an example)
5. **Export** and **upload** model artifacts to GCS
6. **Build** a custom container image with a FastAPI inference service
7. **Upload** the container to Artifact Registry
8. **Create** Vertex AI Model & Endpoint
9. **Deploy** model to the Endpoint
10. **Test** the Endpoint

In [None]:
!pip install --upgrade pip
!pip install transformers datasets torch fastapi uvicorn google-cloud-storage
!pip install google-cloud-aiplatform --upgrade
!pip install nest-asyncio  # Required if running FastAPI in the notebook for testing
!pip install --upgrade "accelerate>=0.26.0" transformers[torch]

import nest_asyncio
nest_asyncio.apply()

In [None]:
# Check if gcloud is working:
!gcloud --version

In [None]:
import os

# TODO: Update these variables to match your GCP configuration.
PROJECT_ID = "personal-439503"           # e.g., 'my-project-123'
REGION = "us-west1"                      # or choose your preferred region
BUCKET_NAME = "llm-demo-bucket2"          # e.g., 'my-vertex-bucket'
REPO_NAME = "llm-demo-repo"                # Artifact Registry repository name
IMAGE_NAME = "llm-demo-image"            # Docker image name
IMAGE_TAG = "v1"                         # Docker image tag
MODEL_DISPLAY_NAME = "llm-demo-model"    # Display name in Vertex AI
ENDPOINT_DISPLAY_NAME = "llm-demo-endpoint"    # Display name in Vertex AI

# Ensure your environment is set to the correct project:
!gcloud config set project {PROJECT_ID}

## 1. Fine-Tune the Model
This section runs a quick demonstration with `distilgpt2`. For real work, you would use your own dataset and a more robust training loop.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import load_dataset
import torch

# Example: using distilgpt2 for a quick demonstration
model_name = "distilbert/distilgpt2"
# model_name = "meta-llama/Llama-3.3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-2 tokenizers ship without a padding token, so add one and resize the embeddings to match
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))

# This data collator copies input_ids into labels for causal LM training
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # Important for GPT-style (causal) models
)

# Prepare a toy dataset
# We'll just use the 'wikitext' dataset for demonstration, but you'll likely have your own.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True)

def remove_empty(example):
    return len(example["input_ids"]) > 0

tokenized_datasets = dataset.map(tokenize_function, batched=True, num_proc=1, remove_columns=["text"])

# Filter out examples that tokenized to zero tokens (e.g. blank lines in the raw text)
tokenized_datasets["train"] = tokenized_datasets["train"].filter(remove_empty)
tokenized_datasets["validation"] = tokenized_datasets["validation"].filter(remove_empty)
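
To make the collator's behavior concrete, here is a minimal pure-Python sketch of what `DataCollatorForLanguageModeling(mlm=False)` produces for a batch: sequences are padded to a common length, and the labels are a copy of `input_ids` with the padding masked to `-100` so the loss ignores it. The function name `collate_causal_lm` and the token IDs are made up for illustration; the real collator pads through the tokenizer and returns tensors.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def collate_causal_lm(batch: List[List[int]], pad_token_id: int) -> Dict[str, List[List[int]]]:
    """Pad token-ID sequences to equal length and build causal-LM labels."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask, labels = [], [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Labels mirror the inputs; padded positions are excluded from the loss.
        labels.append(seq + [IGNORE_INDEX] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Hypothetical token IDs; 50257 stands in for the ID of the newly added [PAD] token.
example = collate_causal_lm([[11, 12, 13], [21]], pad_token_id=50257)
print(example["labels"])  # [[11, 12, 13], [21, -100, -100]]
```

The one-token shift between inputs and targets is applied inside the model's loss computation, not by the collator.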

In [None]:
# Training configuration
training_args = TrainingArguments(
    output_dir="./" + MODEL_DISPLAY_NAME,
    overwrite_output_dir=True,
    num_train_epochs=1,  # For demo, just 1 epoch
    per_device_train_batch_size=2,
    logging_steps=10,
    save_steps=50,
    save_total_limit=1,
    do_eval=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].select(range(1000)),  # Take a subset for quick demo
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

## 2. Save the Fine-Tuned Model
We’ll save the model and tokenizer locally in the notebook environment.

In [None]:
save_path = "./" + MODEL_DISPLAY_NAME
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print("Model and tokenizer saved to", save_path)

## 3. Upload Model Artifacts to GCS
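We’ll copy the model to a GCS bucket so the artifacts are kept outside the notebook environment. Note that the container built later in this notebook copies the model directory into the image rather than downloading it from GCS at startup.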

In [None]:
GCS_PATH = f"gs://{BUCKET_NAME}"
print(f"GCS path: {GCS_PATH}")
!gsutil cp -r ./{MODEL_DISPLAY_NAME} "{GCS_PATH}"
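
The same upload can be sketched with the `google-cloud-storage` client listed in the prerequisites. This is an illustrative alternative to `gsutil`, not part of the original workflow; the helper names `iter_uploads` and `upload_directory` are hypothetical. Like `gsutil cp -r`, it places the files under a prefix named after the local directory.

```python
import os
from typing import Iterator, Tuple


def iter_uploads(local_dir: str) -> Iterator[Tuple[str, str]]:
    """Yield (local_path, blob_name) pairs for every file under local_dir."""
    base = os.path.basename(os.path.normpath(local_dir))
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            local_path = os.path.join(root, name)
            rel = os.path.relpath(local_path, local_dir)
            # Blob names always use forward slashes, regardless of the host OS.
            yield local_path, "/".join([base] + rel.split(os.sep))


def upload_directory(local_dir: str, bucket_name: str) -> int:
    """Upload local_dir to gs://bucket_name/<dirname>/ and return the file count."""
    # Imported lazily so the path helper above works without the GCP library installed.
    from google.cloud import storage

    bucket = storage.Client().bucket(bucket_name)
    count = 0
    for local_path, blob_name in iter_uploads(local_dir):
        bucket.blob(blob_name).upload_from_filename(local_path)
        count += 1
    return count
```

With the variables defined earlier, the call would be `upload_directory("./" + MODEL_DISPLAY_NAME, BUCKET_NAME)`. It uploads everything in the directory, including any training checkpoints the `Trainer` left there.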

## 4. Create Inference Script
We need a Python script (e.g. `inference.py`) that loads the model and serves predictions via FastAPI. We'll write it to the notebook filesystem so we can build a container with it.

In [None]:
print(f"Bucket name: {BUCKET_NAME}")
print(f"Model name: {MODEL_DISPLAY_NAME}")

In [None]:
inference_script = """\
import os
import logging
from typing import List
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from google.cloud import storage

app = FastAPI()

# Local directory containing the model files (copied into the image by the Dockerfile)
LOCAL_MODEL_DIR="llm-demo-model"

# Set device to CPU explicitly
device = torch.device("cpu")

# Load the tokenizer and model once at startup
tokenizer = AutoTokenizer.from_pretrained(LOCAL_MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(LOCAL_MODEL_DIR)
model.eval()

class InferenceInstance(BaseModel):
    prompt: str
    max_tokens: int = 50  # Default value if not provided

class InferenceRequest(BaseModel):
    instances: List[InferenceInstance]

class PredictionOutput(BaseModel):
    generated_text: str

class PredictionResponse(BaseModel):
    predictions: List[PredictionOutput]

@app.post("/predict", response_model=PredictionResponse)
def predict(request: InferenceRequest):
    try:
        print(f"Received prediction request: {request.json()}")
        if not request.instances:
            raise HTTPException(status_code=400, detail="No instances provided.")

        predictions = []
        for instance in request.instances:
            prompt = instance.prompt
            max_tokens = instance.max_tokens

            print(f"Received prompt: {prompt} with max_tokens: {max_tokens}")

            inputs = tokenizer.encode(prompt, return_tensors="pt")
            with torch.no_grad():
                outputs = model.generate(
                    inputs,
                    max_new_tokens=max_tokens,
                    do_sample=True,
                    top_p=0.9,
                    temperature=0.8,
                )

            generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
            print(f"Generated text: {generated_text}")
            predictions.append(PredictionOutput(generated_text=generated_text))

        return PredictionResponse(predictions=predictions)
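    except HTTPException:
        # Re-raise client errors (e.g. the 400 above) instead of masking them as 500s
        raise
    except Exception as e:
        print(f"Error during prediction: {e}")
        raise HTTPException(status_code=500, detail=str(e))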

@app.get("/health")
def health():
    return {"status": "ok"}

if __name__ == "__main__":
    import uvicorn  # only needed when this file is run directly as a script

    uvicorn.run(app, host="0.0.0.0", port=8080)
"""

with open("inference.py", "w") as f:
    f.write(inference_script)

print("inference.py created.")
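
For reference, Vertex AI forwards prediction calls to the container's predict route as a JSON object with an `instances` list, which is the shape the `InferenceRequest` model above encodes. The sketch below builds and reads such a body with the standard library only; `build_request_body` and `parse_instances` are hypothetical helper names and are not part of `inference.py`.

```python
import json
from typing import List, Tuple

DEFAULT_MAX_TOKENS = 50  # mirrors the default on InferenceInstance.max_tokens


def build_request_body(prompts: List[str], max_tokens: int = DEFAULT_MAX_TOKENS) -> str:
    """Serialize prompts into the {"instances": [...]} shape the /predict route accepts."""
    instances = [{"prompt": p, "max_tokens": max_tokens} for p in prompts]
    return json.dumps({"instances": instances})


def parse_instances(raw_body: str) -> List[Tuple[str, int]]:
    """Read a request body back into (prompt, max_tokens) pairs, applying the default."""
    instances = json.loads(raw_body).get("instances")
    if not instances:
        raise ValueError("No instances provided.")
    return [
        (inst["prompt"], int(inst.get("max_tokens", DEFAULT_MAX_TOKENS)))
        for inst in instances
    ]


body = build_request_body(["Hello, world"], max_tokens=20)
print(body)
print(parse_instances('{"instances": [{"prompt": "Hi"}]}'))
```

The second call omits `max_tokens`, so the pair it returns carries the default of 50, just as the Pydantic model would fill it in.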

## 5. Create a Dockerfile
We’ll build a Docker image that:
1. Installs Python & dependencies
2. Copies `inference.py` and the saved model directory
3. Runs our FastAPI server on port 8080

In [None]:
dockerfile_contents = """\
# Use a standard Python runtime as a parent image (CPU-only)
FROM python:3.9-slim

# Set environment variables to prevent Python from writing pyc files and buffering stdout
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# Prevent interactive prompts during package installation
ENV DEBIAN_FRONTEND=noninteractive
# Set the timezone to UTC (you can change this as needed)
ENV TZ=UTC

# Install system dependencies
RUN apt-get update && apt-get install -y \
    curl \
    gnupg \
    apt-transport-https \
    ca-certificates \
    git \
    software-properties-common \
    tzdata \
    && rm -rf /var/lib/apt/lists/*

# Install Python 3.9's pip (already included in python:3.9-slim, but ensure it's up-to-date)
RUN python -m pip install --upgrade pip

# Install CPU-only PyTorch and other Python dependencies
RUN pip install torch==2.0.1+cpu torchvision==0.15.2+cpu torchaudio==2.0.2+cpu --index-url https://download.pytorch.org/whl/cpu
RUN pip install transformers fastapi uvicorn google-cloud-storage

# Copy the inference script and the saved model files
WORKDIR /app
COPY inference.py /app/inference.py
COPY llm-demo-model /app/llm-demo-model

# Expose port 8080
EXPOSE 8080

# Define the default command to run your application
CMD ["uvicorn", "inference:app", "--host", "0.0.0.0", "--port", "8080"]
"""

with open("Dockerfile", "w") as f:
    f.write(dockerfile_contents)

print("Dockerfile created.")

## 5.5. Test the inference script and Docker image

Before deploying, verify that the saved model loads and, optionally, that the container serves predictions locally.

In [None]:
# test model loading
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "./llm-demo-model"

try:
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path)
    print("Model loaded successfully.")
except Exception as e:
    print(f"Error loading model: {e}")

In [None]:
# Build the Docker image (replace image name/tag as needed)
#!docker build -t llm-demo-inference:latest .

# Run the container in the background, mapping container port 8080 to host port 8081
#!docker run -d -p 8081:8080 --name test-inference llm-demo-inference:latest

# Test the /predict endpoint using curl (adjust JSON payload as needed)
#!curl -X POST -H "Content-Type: application/json" -d '{"instances":[{"prompt":"What is the capital of Oregon?","max_tokens":250}]}' http://localhost:8081/predict

# (Optional) Check container logs if needed
#!docker logs test-inference

# (Optional) Clean up: stop and remove the container
#!docker stop test-inference
#!docker rm test-inference
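
If you would rather drive the local test from Python than from `curl`, a standard-library sketch could look like the following. It assumes the container from the commented-out commands above is running and mapped to host port 8081; `make_predict_request` and `smoke_test` are hypothetical helper names.

```python
import json
import urllib.request


def make_predict_request(base_url: str, prompt: str, max_tokens: int = 50) -> urllib.request.Request:
    """Build a POST request for the container's /predict route."""
    body = json.dumps({"instances": [{"prompt": prompt, "max_tokens": max_tokens}]})
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(base_url: str = "http://localhost:8081") -> dict:
    """Check /health, then send one prompt to /predict and return the parsed JSON."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=10) as resp:
        if json.load(resp) != {"status": "ok"}:
            raise RuntimeError("health check returned an unexpected body")
    request = make_predict_request(base_url, "What is the capital of Oregon?")
    # Generation on CPU can be slow, so allow a generous timeout.
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)
```

Calling `smoke_test()` while the container is up should return a dictionary with a `predictions` list; a connection error usually means the container is not running or the port mapping differs.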

## 6. Build & Push Docker Image to Artifact Registry
We'll use [Cloud Build](https://cloud.google.com/build) to build and push the image. Make sure you have an [Artifact Registry repository](https://cloud.google.com/artifact-registry/docs/create-repo) created.

In [None]:
AR_REPO = f"{REGION}-docker.pkg.dev/{PROJECT_ID}/{REPO_NAME}/{IMAGE_NAME}:{IMAGE_TAG}"

print("Building and pushing image to:", AR_REPO)
!gcloud builds submit --tag {AR_REPO} .

## 7. Create a Vertex AI Model Resource
We register our container image with Vertex AI as a custom model.

In [None]:
# enable permissions
# gcloud artifacts repositories add-iam-policy-binding llm-demo-repo \
#   --project="personal-439503" \
#   --location="us-west1" \
#   --member="serviceAccount:service-228086471049@gcp-sa-aiplatform.iam.gserviceaccount.com" \
#   --role="roles/artifactregistry.reader"

!gcloud ai models upload \
   --region={REGION} \
   --display-name={MODEL_DISPLAY_NAME} \
   --container-image-uri={AR_REPO} \
   --container-health-route=/health \
   --container-predict-route=/predict

From the output of the above command, note the **`modelId`** (MODEL_ID). We’ll use it in the next step. If you miss it, you can retrieve it via:
```bash
gcloud ai models list --region=us-west1
```
or from the **Vertex AI -> Models** page in the Console.
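
Vertex AI identifies resources by full names of the form `projects/PROJECT/locations/REGION/models/MODEL_ID` (and likewise `.../endpoints/ENDPOINT_ID`), so the ID is the last path segment. Below is a small sketch for pulling it out of a resource name copied from command output or the Console; `resource_id` is a hypothetical helper and the sample name is made up.

```python
import re

# Matches model and endpoint resource names; a trailing "@version" on models is ignored.
_RESOURCE_RE = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)/"
    r"(?P<kind>models|endpoints)/(?P<id>[^/@]+)"
)


def resource_id(resource_name: str) -> str:
    """Return the trailing ID from a Vertex AI model or endpoint resource name."""
    match = _RESOURCE_RE.match(resource_name.strip())
    if match is None:
        raise ValueError(f"not a model or endpoint resource name: {resource_name!r}")
    return match.group("id")


print(resource_id("projects/123456/locations/us-west1/models/987654321"))  # 987654321
```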

In [None]:
# TODO: Replace with your actual MODEL_ID from the previous step
MODEL_ID = "701791883730354176"  # e.g. 123456789
print("Using MODEL_ID:", MODEL_ID)

## 8. Create an Endpoint & Deploy the Model
We create a Vertex AI Endpoint, then deploy our model to that Endpoint.

In [None]:
!gcloud ai endpoints create --region={REGION} --display-name={ENDPOINT_DISPLAY_NAME}

Again, note the **`endpointId`** from the output, or retrieve it later via:
```bash
gcloud ai endpoints list --region=us-west1
```

In [None]:
# TODO: Replace with your actual ENDPOINT_ID from the previous command
ENDPOINT_ID = "8528678600193343488"  # e.g. 987654321
print("Using ENDPOINT_ID:", ENDPOINT_ID)

In [None]:
!gcloud ai endpoints deploy-model {ENDPOINT_ID} \
  --region={REGION} \
  --model={MODEL_ID} \
  --display-name="llm-demo-deployment"
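
The same registration and deployment can also be sketched with the Vertex AI Python SDK, which avoids copying IDs between cells. This is an alternative to the `gcloud` commands above, not what the notebook ran; the machine type and replica counts are assumptions to adjust, and `container_spec` and `register_and_deploy` are hypothetical helper names.

```python
from typing import Any, Dict


def container_spec(image_uri: str) -> Dict[str, Any]:
    """Keyword arguments describing the serving container for Model.upload."""
    return {
        "serving_container_image_uri": image_uri,
        "serving_container_predict_route": "/predict",
        "serving_container_health_route": "/health",
        "serving_container_ports": [8080],
    }


def register_and_deploy(
    project: str,
    location: str,
    image_uri: str,
    model_display_name: str,
    endpoint_display_name: str,
    machine_type: str = "n1-standard-4",  # assumption: pick a type your quota allows
):
    """Upload the container as a Model, create an Endpoint, and deploy to it."""
    # Imported lazily so container_spec can be used without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    model = aiplatform.Model.upload(
        display_name=model_display_name, **container_spec(image_uri)
    )
    endpoint = aiplatform.Endpoint.create(display_name=endpoint_display_name)
    model.deploy(
        endpoint=endpoint,
        machine_type=machine_type,
        min_replica_count=1,
        max_replica_count=1,
    )
    return endpoint
```

With the notebook's variables, the call would be `register_and_deploy(PROJECT_ID, REGION, AR_REPO, MODEL_DISPLAY_NAME, ENDPOINT_DISPLAY_NAME)`. The returned endpoint object can be used for predictions directly.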

## 9. Test the Endpoint
Once deployment is successful, we can call the Endpoint using the Python Vertex AI SDK.

In [None]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION)

endpoint = aiplatform.Endpoint(
    endpoint_name=f"projects/{PROJECT_ID}/locations/{REGION}/endpoints/{ENDPOINT_ID}"
)

instances = [{"prompt": "What's the capital of California?","max_tokens": 400}]

response = endpoint.predict(instances)
print(response)

If everything worked correctly, your custom container should respond with generated text from your fine-tuned model.
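
The object returned by `endpoint.predict` exposes the container's output through its `predictions` attribute, here one mapping per instance with a `generated_text` key. Below is a small sketch for pulling the text out; `generated_texts` is a hypothetical helper, and the sample list is placeholder data standing in for a real response.

```python
from typing import Any, List, Mapping, Sequence


def generated_texts(predictions: Sequence[Mapping[str, Any]]) -> List[str]:
    """Collect the generated_text field from each prediction, in order."""
    texts = []
    for index, item in enumerate(predictions):
        if "generated_text" not in item:
            raise KeyError(f"prediction {index} has no 'generated_text' field")
        texts.append(item["generated_text"])
    return texts


# Placeholder data in the shape the container returns, not real model output.
sample_predictions = [{"generated_text": "<text for prompt 1>"}, {"generated_text": "<text for prompt 2>"}]
for text in generated_texts(sample_predictions):
    print(text)
```

In the notebook, the call would be `generated_texts(response.predictions)`.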

## Clean Up (Optional)
You might want to:
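1. Undeploy the model and delete your endpoint to save costs (an endpoint cannot be deleted while a model is deployed to it): `gcloud ai endpoints undeploy-model ENDPOINT_ID --region=REGION --deployed-model-id=DEPLOYED_MODEL_ID`, then `gcloud ai endpoints delete ENDPOINT_ID --region=REGION`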
2. Remove the model: `gcloud ai models delete MODEL_ID --region=REGION`
3. Delete the GCS files if you no longer need them: `gsutil rm -r gs://YOUR_BUCKET/llm-demo-model`