<a href="https://colab.research.google.com/github/ahmadluay9/ADK-Training/blob/main/10_mcp_toolbox_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

In [None]:
# @markdown Please fill in the value below with your GCP project ID and then run the cell.

# Please fill in these values.
project_id = "your-project-id"  # @param {type:"string"}
location = 'us-central1' # @param {type:"string"}


# Quick input validations.
assert project_id, "⚠️ Please provide a Google Cloud project ID"

# Configure gcloud.
!gcloud config set project {project_id}

In [None]:
# Run this and allow access through the pop-up
from google.colab import auth

auth.authenticate_user(project_id=project_id)

In [None]:
%%shell

# 1. Install prerequisites and the common configuration package
sudo apt-get update -y
sudo apt-get install -y postgresql-common ca-certificates curl gnupg lsb-release

# 2. Run the official Postgres repository setup script
# This adds the repo that contains pgvector
yes | sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh

# 3. Install PostgreSQL 16 and the pgvector extension
sudo apt-get update -y
sudo apt-get install -y postgresql-16 postgresql-16-pgvector

# 4. Start the service
sudo service postgresql start

# 5. Setup Database, User, and Vector Table
sudo -u postgres psql << EOF
-- Create user and db
CREATE USER toolbox_user WITH PASSWORD 'my-password';
CREATE DATABASE toolbox_db;
GRANT ALL PRIVILEGES ON DATABASE toolbox_db TO toolbox_user;
ALTER DATABASE toolbox_db OWNER TO toolbox_user;

-- Switch to the new database
\c toolbox_db

-- CRITICAL: Enable the vector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create the table
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(768)
);

-- Grant permissions to the user so they can insert/select
GRANT ALL PRIVILEGES ON TABLE documents TO toolbox_user;
GRANT ALL PRIVILEGES ON SEQUENCE documents_id_seq TO toolbox_user;
EOF

In [None]:
# Check that postgres is running
!sudo lsof -i :5432

> **Tip:** For a real application, it's best to follow the principle of least privilege and grant only the privileges your application needs.



## Optional: Enable Vertex AI API for Google Cloud

If you're using a model hosted on **Vertex AI**, run the following command to enable the API:

```bash
!gcloud services enable aiplatform.googleapis.com
```

## Step 2: Install and configure MCP Toolbox

In this section, we will:
1. Download the latest version of the MCP toolbox binary.
2. Create an MCP toolbox config file.
3. Start an MCP toolbox server using the config file.



Download the [latest](https://github.com/googleapis/genai-toolbox/releases) version of MCP Toolbox as a binary.

In [None]:
version = "0.27.0" # x-release-please-version
! curl -O https://storage.googleapis.com/genai-toolbox/v{version}/linux/amd64/toolbox

# Make the binary executable
! chmod +x toolbox

In [None]:
TOOLBOX_BINARY_PATH = "/content/toolbox"
SERVER_PORT = 5000

> Note: To include a literal dollar sign (e.g., $1) as part of your SQL statement within the Python string for tools.yml, you must escape both the backslash and the dollar sign. Use \\\$1 in Python to output \$1 in the tools.yml file.

> Note: You can also set up Colab secrets to store any sensitive information like passwords. You can easily add secrets through the left panel:

<img src="https://services.google.com/fh/files/misc/colab_secret.png" alt="Colab Secrets" width="400"/>


Create a tools file with the following sections:

- `Database Connection (sources)`: Connection details for the local `toolbox_db` PostgreSQL database.
- `Embedding Model (embeddingModels)`: A Gemini embedder (`gemini-embedding-001`) set to 768 dimensions, matching the `vector(768)` column created earlier.
- `Tool Definitions (tools)`: Defines two tools for database interaction:
  - `add-document`
  - `search-documents`
- `Toolset (toolsets)`: Groups both tools as `my-vector-toolset`.

Our application will use these tools to store and search embedded documents.

For detailed configuration options, please refer to the [MCP Toolbox documentation](https://googleapis.github.io/genai-toolbox/getting-started/configure/).



In [None]:
# environment configuration
import os

try:
    # The SDK uses this ID for usage tracking and billing
    os.environ['GOOGLE_CLOUD_PROJECT'] = project_id

    # Defines the region where Vertex AI resources are hosted
    os.environ['GOOGLE_CLOUD_LOCATION'] = location

    # Directs the SDK to use Vertex AI infrastructure instead of the public Gemini API
    os.environ['GOOGLE_GENAI_USE_VERTEXAI'] = "1"

    print(f"✅ Environment configured for project: {project_id} in {location}")

except Exception as e:
    print(f"❌ Configuration Error: {e}")

In [None]:
# --- Create tools.yml ---
tools_file_name = "tools.yml"

file_content = r"""
kind: sources
name: my-pg-source
type: postgres
host: 127.0.0.1
port: 5432
database: toolbox_db
user: toolbox_user
password: my-password
---
kind: embeddingModels
name: gemini-embedder
type: gemini
model: gemini-embedding-001
dimension: 768
---
kind: tools
name: add-document
type: postgres-sql
source: my-pg-source
description: Save text and auto-generate its vector embedding.
statement: |
  INSERT INTO documents (content, embedding)
  VALUES (\$1, \$2);
parameters:
  - name: content
    type: string
    description: The text content.
  - name: vector_embedding
    type: string
    valueFromParam: content
    description: Auto-generated vector embedding from content.
    embeddedBy: gemini-embedder
---
kind: tools
name: search-documents
type: postgres-sql
source: my-pg-source
description: Semantic search for documents.
statement: |
  SELECT content, embedding <-> \$1 AS distance
  FROM documents
  ORDER BY distance ASC
  LIMIT 3;
parameters:
  - name: search_query
    type: string
    description: The search topic.
    embeddedBy: gemini-embedder
---
kind: toolsets
name: my-vector-toolset
tools:
  - add-document
  - search-documents
"""

In [None]:
# Write the file content into the tools file.
! echo "{file_content}" > "{tools_file_name}"
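The cell above routes the text through the shell's `echo`, and inside double quotes the shell turns `\$` into a literal `$`, which is presumably why the raw string escapes `$1`. As an alternative sketch, the file can be written from Python directly, in which case no shell sees the text and plain `$1` should suffice. The fragment and the `tools_demo.yml` name below are hypothetical, and a temporary directory is used so the real `tools.yml` is not touched.

```python
import tempfile
from pathlib import Path

# Hypothetical fragment: plain $1/$2 with no backslash, because no shell
# processes this text before it reaches the file.
demo_content = """kind: tools
name: add-document
statement: |
  INSERT INTO documents (content, embedding)
  VALUES ($1, $2);
"""

# A temporary directory keeps this sketch from overwriting the real tools.yml.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "tools_demo.yml"
    demo_path.write_text(demo_content, encoding="utf-8")
    written = demo_path.read_text(encoding="utf-8")

print(written)
```

If you adopt this approach in the notebook, remember to drop the backslashes from the SQL placeholders, since they would otherwise end up in the file verbatim.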

In [None]:
TOOLS_FILE_PATH = f"/content/{tools_file_name}"
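Before starting the server, it can help to confirm that the file contains the documents you expect. The following is a minimal sketch that summarises a `---`-separated config using only the standard library; it is not a YAML parser, and `SAMPLE_CONFIG` is a trimmed, hypothetical stand-in for the real file content.

```python
# Trimmed, hypothetical stand-in for the notebook's tools file.
SAMPLE_CONFIG = """
kind: sources
name: my-pg-source
type: postgres
---
kind: embeddingModels
name: gemini-embedder
type: gemini
---
kind: tools
name: add-document
type: postgres-sql
---
kind: toolsets
name: my-vector-toolset
"""


def summarise_config(text):
    """Return (kind, name) for each '---'-separated document in the text."""
    summary = []
    for document in text.split("\n---\n"):
        fields = {}
        for line in document.splitlines():
            # Only top-level "key: value" lines matter; skip indented and list lines.
            if line and not line.startswith((" ", "-")) and ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        if "kind" in fields:
            summary.append((fields["kind"], fields.get("name", "")))
    return summary


config_summary = summarise_config(SAMPLE_CONFIG)
for kind, name in config_summary:
    print(f"{kind:16} {name}")
```

In the notebook you could pass the text read back from `TOOLS_FILE_PATH` instead of the sample string.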

In [None]:
# Start an MCP toolbox server
! nohup {TOOLBOX_BINARY_PATH} --tools-file {TOOLS_FILE_PATH} -p {SERVER_PORT} > toolbox.log 2>&1 &

In [None]:
# Check if MCP toolbox is running
!sudo lsof -i :{SERVER_PORT}

In [None]:
# Print the server log so that any startup error is visible
! sleep 2 && cat toolbox.log
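A fixed `sleep 2` may be too short or needlessly long. As a hedged alternative, the sketch below polls a TCP port until it accepts connections. The helper `wait_for_port` is a hypothetical name, not part of MCP Toolbox, and the demo listens on a throwaway local port so it runs without the server.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.2):
    """Return True once a TCP connection succeeds, or False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; wait briefly and try again.
            time.sleep(interval)
    return False


# Self-contained demo: listen on a free local port, then probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
listener.listen(1)
demo_port = listener.getsockname()[1]

port_is_open = wait_for_port("127.0.0.1", demo_port, timeout=2.0)
listener.close()
port_after_close = wait_for_port("127.0.0.1", demo_port, timeout=0.5)
print(port_is_open, port_after_close)
```

In this notebook the equivalent call would be `wait_for_port("127.0.0.1", SERVER_PORT)`, followed by printing `toolbox.log` if it returns `False`.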

In [None]:
!pip install toolbox-core -q

In [None]:
import asyncio
from toolbox_core import ToolboxClient

async def ingest_data():
    # Connect to the local toolbox server on the port it was started with
    async with ToolboxClient(f"http://127.0.0.1:{SERVER_PORT}") as toolbox:

        # 1. Load the ingestion tool
        print("Loading tool...")
        add_doc_tool = await toolbox.load_tool("add-document")

        # 2. Define some sample data
        reviews = [
            "The hotel room was spacious and clean, with a great view of the ocean.",
            "Service was terrible. The front desk was rude and the check-in took forever.",
            "Breakfast was delicious, lots of options including fresh fruit and pastries.",
            "The location is perfect, right in the city center near all the landmarks.",
            "The bed was uncomfortable and the air conditioning was broken."
        ]

        # 3. Loop through and ingest
        print("Ingesting data...")
        for review in reviews:
            try:
                # We ONLY pass 'content'.
                # The Toolbox automatically embeds this and fills the hidden 'vector_embedding' param.
                await add_doc_tool(content=review)
                print(f"✅ Ingested: {review[:40]}...")
            except Exception as e:
                print(f"❌ Error ingesting: {e}")

# Run the async function in Colab
await ingest_data()
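Top-level `await` works here because Colab already runs an event loop. In a plain Python script that is not the case, and `asyncio.run` is the usual entry point. The sketch below shows the pattern with `ingest_demo`, a hypothetical stand-in for `ingest_data()` that needs no server.

```python
import asyncio


async def ingest_demo(items):
    """Hypothetical stand-in for ingest_data(): pretends to store each item."""
    stored = []
    for item in items:
        await asyncio.sleep(0)  # placeholder for `await add_doc_tool(content=item)`
        stored.append(item)
    return stored


# asyncio.run creates an event loop, runs the coroutine to completion, and closes
# the loop. It raises RuntimeError if called where a loop is already running,
# such as inside a notebook cell.
stored_items = asyncio.run(ingest_demo(["first review", "second review"]))
print(f"Stored {len(stored_items)} items")
```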

In [None]:
async def search_data():
    async with ToolboxClient(f"http://127.0.0.1:{SERVER_PORT}") as toolbox:
        # Load the search tool
        search_tool = await toolbox.load_tool("search-documents")

        # Search for a concept (not necessarily exact words)
        query = "bad experience with staff"
        print(f"🔍 Searching for: '{query}'\n")

        # The toolbox converts this query to a vector and compares it
        # against the vectors we just ingested.
        results = await search_tool(search_query=query)

        print("Results found:")
        print(results)

await search_data()
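Printing `results` as-is can be hard to read. The sketch below formats the rows one per line, assuming the tool returns either a JSON-encoded list of rows or an already-decoded list; that return shape is an assumption and may differ between `toolbox-core` versions. `sample_results` is a hypothetical value built from the contents and distances discussed in this notebook.

```python
import json

# Hypothetical value shaped like the rows analysed in this notebook.
sample_results = json.dumps([
    {"content": "Service was terrible. The front desk was rude and the check-in took forever.",
     "distance": 0.27},
    {"content": "The bed was uncomfortable and the air conditioning was broken.",
     "distance": 0.34},
    {"content": "The hotel room was spacious and clean, with a great view of the ocean.",
     "distance": 0.48},
])


def parse_rows(raw):
    """Accept a JSON string or a decoded list of row dicts; return rows sorted by distance."""
    rows = json.loads(raw) if isinstance(raw, str) else raw
    return sorted(rows, key=lambda row: row["distance"])


parsed_rows = parse_rows(sample_results)
for rank, row in enumerate(parsed_rows, start=1):
    print(f"#{rank}  distance={row['distance']:.2f}  {row['content'][:50]}")
```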

### 1. The "Distance" Metric
The most important number here is `distance`. The `<->` operator in the search statement is pgvector's Euclidean (L2) distance.
*   **0.0** means the two embeddings are **identical**, which in practice means the same text.
*   **Higher numbers** mean the meanings are further apart.
*   The results are sorted `ORDER BY distance ASC` (ascending), so the **best matches come first**.
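To make the ordering concrete, here is a minimal sketch that mimics `ORDER BY embedding <-> query ASC LIMIT 3` in plain Python, using Euclidean distance, which is what pgvector's `<->` operator computes. The 3-dimensional vectors are invented for illustration and are not real model output; actual embeddings here have 768 dimensions.

```python
import math

# Toy vectors standing in for 768-dimensional embeddings (invented numbers).
toy_documents = {
    "rude front desk": [0.9, 0.1, 0.0],
    "broken air conditioning": [0.6, 0.5, 0.1],
    "great ocean view": [0.0, 0.2, 0.9],
    "tasty breakfast": [0.1, 0.9, 0.3],
}
toy_query = [1.0, 0.0, 0.0]


def nearest(query, documents, limit=3):
    """Rank documents by Euclidean distance to the query, smallest first."""
    scored = [(math.dist(query, vector), text) for text, vector in documents.items()]
    return sorted(scored)[:limit]


ranking = nearest(toy_query, toy_documents)
for distance, text in ranking:
    print(f"{distance:.3f}  {text}")
```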

---

### 2. Result Analysis

#### **Rank #1: The Perfect Match**
> **Content:** *"Service was terrible. The front desk was rude and the check-in took forever."*
> **Distance:** `0.27` (Lowest / Best)

*   **Why it won:** Even though your search query ("bad experience with staff") did **not** contain the words "front desk", "rude", or "check-in", the embedding model (Gemini) understands that:
    *   "Front desk" is a type of **Staff**.
    *   "Rude" is a type of **Bad Experience**.
*   **Conclusion:** The vector for this review lies closer to your query vector than that of any other ingested review.

#### **Rank #2: The Thematic Match**
> **Content:** *"The bed was uncomfortable and the air conditioning was broken."*
> **Distance:** `0.34` (Medium)

*   **Why it's second:** This is also a **negative review** ("bad experience").
*   **Why it's not first:** It talks about *facilities* (bed, AC), not *people* (staff).
*   **Conclusion:** The model sees that the *sentiment* (unhappy) matches your query, but the *topic* (room vs. staff) is different, so the distance is slightly higher.

#### **Rank #3: The "Filler" Match**
> **Content:** *"The hotel room was spacious and clean, with a great view of the ocean."*
> **Distance:** `0.48` (Highest / Worst)

*   **Why it's here:** Your SQL query asked for `LIMIT 3`, so the database returns the three nearest rows no matter how far away the third one is. The statement applies no distance cut-off.
*   **Why it's last:** This is a **positive** review about facilities. It is semantically very far away from "bad experience with staff," so it has the highest distance score.
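One way to suppress such filler results is to drop rows beyond a distance cut-off after the query returns. The sketch below applies that idea to the three distances above; `MAX_DISTANCE = 0.40` is a hypothetical threshold chosen only for this example and would need tuning against real data.

```python
# Rows copied from the analysis above.
ranked_rows = [
    {"content": "Service was terrible. The front desk was rude and the check-in took forever.",
     "distance": 0.27},
    {"content": "The bed was uncomfortable and the air conditioning was broken.",
     "distance": 0.34},
    {"content": "The hotel room was spacious and clean, with a great view of the ocean.",
     "distance": 0.48},
]

MAX_DISTANCE = 0.40  # hypothetical cut-off, not a recommended value


def keep_relevant(rows, max_distance):
    """Keep only rows whose distance is at or below the cut-off."""
    return [row for row in rows if row["distance"] <= max_distance]


relevant_rows = keep_relevant(ranked_rows, MAX_DISTANCE)
for row in relevant_rows:
    print(f"{row['distance']:.2f}  {row['content']}")
```

The same filter could instead live in the SQL statement as a `WHERE` condition on the distance expression, at the cost of sometimes returning fewer than three rows.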

### Summary
You have successfully built a system that searches by **meaning**, not just keywords. A standard keyword search would likely have missed Result #1, because that review does not contain the word "staff".
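For contrast, here is a minimal sketch of a naive keyword search over the same five reviews. The two matching functions are simplified illustrations written for this example, not how a production full-text search engine works.

```python
import re

reviews = [
    "The hotel room was spacious and clean, with a great view of the ocean.",
    "Service was terrible. The front desk was rude and the check-in took forever.",
    "Breakfast was delicious, lots of options including fresh fruit and pastries.",
    "The location is perfect, right in the city center near all the landmarks.",
    "The bed was uncomfortable and the air conditioning was broken.",
]
query = "bad experience with staff"


def words(text):
    """Lower-case a text and split it into a set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def match_all(query, documents):
    """Documents that contain every word of the query."""
    return [doc for doc in documents if words(query) <= words(doc)]


def match_any(query, documents):
    """Documents that share at least one word with the query."""
    return [doc for doc in documents if words(query) & words(doc)]


all_matches = match_all(query, reviews)
any_matches = match_any(query, reviews)
print("All words:", all_matches)
print("Any word: ", any_matches)
```

Requiring every word finds nothing, and matching on any word returns only the ocean-view review (through the word "with"), which is the least relevant of the three semantic results.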