# BigQuery Remote LLM API Test

**Goal:** This notebook provides a comprehensive test for using Google BigQuery to call a remote Large Language Model (like Gemini) using the `ML.GENERATE_TEXT` function.

**What it does:**
1.  Checks dependencies and authenticates with Google Cloud.
2.  Initializes a BigQuery client.
3.  Creates a BigQuery dataset to house a remote model.
4.  Creates a remote model that connects to a Vertex AI LLM endpoint (`gemini-2.5-pro`).
5.  Queries a public Ethereum dataset and passes transaction data to the LLM for summarization.
6.  Displays the generated summary.
7.  Provides an optional cleanup step to delete the created resources.

## 1. Pre-computation Setup

Before running this notebook, you must perform a few one-time setup steps in your Google Cloud project.

### Required APIs
Ensure the following APIs are **enabled** in your project:
1.  **BigQuery API**: `gcloud services enable bigquery.googleapis.com`
2.  **BigQuery Connection API**: `gcloud services enable bigqueryconnection.googleapis.com`
3.  **Vertex AI API**: `gcloud services enable aiplatform.googleapis.com`

### BigQuery Connection
You need a `CLOUD_RESOURCE` connection in BigQuery that allows BigQuery to communicate with Vertex AI. 

1.  **Create the connection** (using the `bq` command-line tool):
    ```bash
    bq mk --connection \
        --location=US \
        --project_id="YOUR_PROJECT_ID" \
        --connection_type=CLOUD_RESOURCE \
        bq-llm-connection
    ```
2.  **Get the Service Account**: After creating the connection, retrieve the service account associated with it.
    ```bash
    bq show --connection "YOUR_PROJECT_ID.US.bq-llm-connection"
    ```
    Look for the `serviceAccountId` in the output (e.g., `bqcx-1234-...@gcp-sa-bigquery-condel.iam.gserviceaccount.com`).

3.  **Grant Permissions**: Grant the `Vertex AI User` role to this service account so it can invoke the LLM.
    ```bash
    gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
        --member='serviceAccount:YOUR_CONNECTION_SERVICE_ACCOUNT' \
        --role='roles/aiplatform.user'
    ```
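
Steps 2 and 3 can also be scripted. The sketch below assumes you have captured the connection description from step 2 as JSON, and that the service account appears under `cloudResource.serviceAccountId`. The helper names and the sample payload are hypothetical and serve only as an illustration.

```python
import json
import shlex


def connection_service_account(describe_json: str) -> str:
    """Return the service account from a CLOUD_RESOURCE connection description."""
    info = json.loads(describe_json)
    # Assumption: CLOUD_RESOURCE connections expose the account under this key.
    account = info.get("cloudResource", {}).get("serviceAccountId")
    if not account:
        raise ValueError("No cloudResource.serviceAccountId in connection description")
    return account


def grant_command(project_id: str, service_account: str) -> str:
    """Build the IAM binding command from step 3 as a shell-safe string."""
    args = [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        f"--member=serviceAccount:{service_account}",
        "--role=roles/aiplatform.user",
    ]
    return shlex.join(args)


# Hypothetical sample payload, shaped like a connection description.
sample = json.dumps({
    "name": "projects/123/locations/us/connections/bq-llm-connection",
    "cloudResource": {
        "serviceAccountId": "bqcx-123-abcd@gcp-sa-bigquery-condel.iam.gserviceaccount.com"
    },
})
account = connection_service_account(sample)
print(grant_command("my-project-id", account))
```

Printing the command instead of running it lets you review the binding before applying it.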

### Authentication
* **In Colab**: Authenticate with your Google account before creating the BigQuery client, for example by running `from google.colab import auth; auth.authenticate_user()` in a code cell.
* **In a local Jupyter environment**: Authenticate via the gcloud CLI before starting the notebook:
    ```bash
    gcloud auth application-default login
    ```

## 2. Configuration

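The configuration cell below sets the name of the BigQuery connection created in section 1. The project ID and location are read later from the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` environment variables. Use the `US` location, because the public Ethereum dataset we are querying resides in the US multi-region and BigQuery does not allow a query to combine a model and source data from different locations.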
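
One way to keep these settings in a single place is to read them from environment variables with explicit defaults. The sketch below is illustrative: the `NotebookConfig` class, the `load_config` helper, and the `BQ_CONNECTION_ID` environment variable are hypothetical names, and falling back to `US` is an assumption based on where the public dataset lives.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class NotebookConfig:
    project_id: str
    location: str
    connection_id: str


def load_config(env: Optional[Mapping[str, str]] = None) -> NotebookConfig:
    """Read project settings from environment variables, with defaults."""
    env = os.environ if env is None else env
    project_id = env.get("GOOGLE_CLOUD_PROJECT", "").strip()
    if not project_id:
        raise ValueError("GOOGLE_CLOUD_PROJECT is not set")
    return NotebookConfig(
        project_id=project_id,
        # Assumption: fall back to US, where the public Ethereum dataset resides.
        location=env.get("GOOGLE_CLOUD_LOCATION", "US").strip() or "US",
        # Hypothetical variable name; the default matches the connection in section 1.
        connection_id=env.get("BQ_CONNECTION_ID", "bq-llm-connection"),
    )


config = load_config({"GOOGLE_CLOUD_PROJECT": "my-project-id"})
print(config)
```

Passing a plain dictionary, as in the last lines, makes the helper easy to exercise without touching the real environment.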

In [None]:
import sys
from pathlib import Path

# Make modules in the parent directory importable from this notebook.
parent_dir = Path("..").resolve()
if str(parent_dir) not in sys.path:
    sys.path.insert(0, str(parent_dir))
    print(f"Added {parent_dir} to Python path.")
else:
    print(f"{parent_dir} already in Python path.")

In [None]:
import os
import sys

# Must match the name of the connection created in section 1.
BQ_CONNECTION_ID = "bq-llm-connection"


## 3. Dependency Checks
This cell ensures that the required Python libraries are installed and prints their versions.
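
As a complement, installed versions can be looked up without importing the packages at all, using the standard library's `importlib.metadata`. This is a minimal sketch; the `installed_versions` helper is a hypothetical name, and the list of distributions is taken from the `pip install` hint in this notebook.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(distributions: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


required = [
    "google-cloud-bigquery",
    "google-cloud-bigquery-storage",
    "pandas",
    "pyarrow",
    "db-dtypes",
]
for name, version in installed_versions(required).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Because nothing is imported, a broken installation of one package does not prevent the others from being reported.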

In [None]:
# Sanity check: kernel, packages, and optional .env
print(f"Python: {sys.version.split()[0]}")

try:
    import jupyter, ipykernel  # noqa: F401
    print("Jupyter + ipykernel: OK")
except Exception as e:
    print("Jupyter import failed:", e)

try:
    from dotenv import load_dotenv
    # In a local environment, you can use a .env file for credentials
    if load_dotenv():
        print("python-dotenv: loaded .env")
except ImportError:
    print("python-dotenv not installed, skipping .env load.")
    
# Verify BigQuery + DataFrame stack
try:
    import google.cloud.bigquery as bq
    import pandas as pd
    import pyarrow as pa
    import google.cloud.bigquery_storage as bqstorage

    print("-" * 20)
    print("Required packages:")
    print("google-cloud-bigquery:", bq.__version__)
    print("pandas:", pd.__version__)
    print("pyarrow:", pa.__version__)
    print("bigquery-storage:", bqstorage.__version__)
except ImportError as e:
    print(f"Missing a required package: {e.name}. Please install it.")
    print("pip install google-cloud-bigquery pandas pyarrow db-dtypes google-cloud-bigquery-storage")

## 4. Initialize BigQuery Client

Now we create the BigQuery client object. This will use the credentials configured in the previous steps to interact with your project.
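
The client cell checks the project ID format before calling the API. Below is a stand-alone sketch of such a check, assuming the standard Google Cloud rule that project IDs are 6 to 30 characters long, contain only lowercase letters, digits, and hyphens, start with a letter, and do not end with a hyphen. The `is_valid_project_id` helper is a hypothetical name, and legacy domain-scoped IDs are deliberately ignored.

```python
import re

# Assumption: 6-30 characters, lowercase letters/digits/hyphens,
# starting with a letter and not ending with a hyphen.
_PROJECT_ID_RE = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def is_valid_project_id(project_id):
    """Return True if project_id looks like a standard Google Cloud project ID."""
    return bool(project_id) and _PROJECT_ID_RE.fullmatch(project_id) is not None


for candidate in ["my-project-123", "Short", None, "ends-with-hyphen-"]:
    print(repr(candidate), is_valid_project_id(candidate))
```

Validating locally gives a clearer message than the API error you would otherwise get from a mistyped or missing ID.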

In [None]:
import os
import re

from google.auth.exceptions import DefaultCredentialsError
from google.cloud import bigquery

client = None
PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT")
# Default to US: the public Ethereum dataset lives in the US multi-region.
LOCATION = os.getenv("GOOGLE_CLOUD_LOCATION") or "US"

# Basic validation: project IDs are 6-30 characters (lowercase letters, digits, hyphens).
if not PROJECT_ID or not re.fullmatch(r"[a-z][a-z0-9-]{4,28}[a-z0-9]", PROJECT_ID):
    print(f"Error: Project ID is missing or looks invalid: '{PROJECT_ID}'")
    print("Please set the GOOGLE_CLOUD_PROJECT environment variable (for example, in your .env file).")
else:
    try:
        client = bigquery.Client(project=PROJECT_ID, location=LOCATION)
        print(f"BigQuery client created for project: {client.project} in location: {client.location}")
    except DefaultCredentialsError:
        print("Authentication failed: Google Application Default Credentials not found.")
        print("Please complete the authentication steps in section 1.")
    except Exception as e:
        print(f"Failed to create BigQuery client: {e}")

## 5. Create Remote LLM Model in BigQuery

This is the core of the setup. We execute a `CREATE MODEL` query in BigQuery.

* **`CREATE OR REPLACE MODEL`**: This DDL statement is safe to rerun. If the model already exists, it is replaced with the new definition.
* **`REMOTE WITH CONNECTION`**: This tells BigQuery that the model's logic is external and should be accessed via the specified connection.
* **`OPTIONS`**: We specify that the remote service is a Cloud AI LLM and provide the model name (`gemini-2.5-pro`).
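
The statement described above can also be assembled by a small helper, which makes the pieces explicit and guards the interpolated values. This is a sketch under stated assumptions: `build_create_model_sql` is a hypothetical name, the character whitelist is a conservative stand-in rather than BigQuery's exact naming rules, and the `OPTIONS` clause simply mirrors the one used in this notebook.

```python
import re

# Conservative whitelist (an assumption, not BigQuery's exact naming rules):
# rejects quotes, backticks, and whitespace before values reach the SQL text.
_SAFE = re.compile(r"[A-Za-z0-9_.-]+")


def build_create_model_sql(project_id, dataset, model, location, connection_id, endpoint):
    """Compose a CREATE OR REPLACE MODEL statement for a remote LLM."""
    parts = {
        "project_id": project_id, "dataset": dataset, "model": model,
        "location": location, "connection_id": connection_id, "endpoint": endpoint,
    }
    for name, value in parts.items():
        if not _SAFE.fullmatch(value):
            raise ValueError(f"Unsafe value for {name}: {value!r}")
    connection = (
        f"projects/{project_id}/locations/{location}/connections/{connection_id}"
    )
    return (
        f"CREATE OR REPLACE MODEL `{project_id}.{dataset}.{model}`\n"
        f"REMOTE WITH CONNECTION `{connection}`\n"
        "OPTIONS (\n"
        "  remote_service_type = 'CLOUD_AI_LARGE_LANGUAGE_MODEL_V1',\n"
        f"  endpoint = '{endpoint}'\n"
        ");"
    )


print(build_create_model_sql(
    "my-project-id", "bq_llm_testing", "transaction_summarizer",
    "US", "bq-llm-connection", "gemini-2.5-pro",
))
```

DDL statements cannot take query parameters for identifiers, so validating the values before string formatting is the main protection against a malformed statement.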

In [None]:
from google.api_core import exceptions as gax_exceptions

model_ready = False
if client:
    # A writable dataset in your project where the model will be stored.
    dataset_id = f"{PROJECT_ID}.bq_llm_testing"
    dataset = bigquery.Dataset(dataset_id)
    dataset.location = LOCATION
    try:
        client.create_dataset(dataset, exists_ok=True)
        print(f"Dataset ensured: {dataset_id}")
    except Exception as e:
        print(f"Failed to create dataset: {e}")

    # Build the full connection resource string
    connection_resource = f"projects/{PROJECT_ID}/locations/{LOCATION}/connections/{BQ_CONNECTION_ID}"

    # Create (or replace) the remote LLM model in your project
    summarizer_query = f"""
    CREATE OR REPLACE MODEL `{dataset_id}.transaction_summarizer`
    REMOTE WITH CONNECTION `{connection_resource}`
    OPTIONS (
      remote_service_type = 'CLOUD_AI_LARGE_LANGUAGE_MODEL_V1',
      endpoint = 'gemini-2.5-pro'
    );
    """

    try:
        print("Creating remote model... (this may take a moment)")
        client.query(summarizer_query).result()
        print(f"Remote LLM model ensured: {dataset_id}.transaction_summarizer")
        model_ready = True
    except gax_exceptions.NotFound as e:
        print(f"Resource not found while creating the model (expected connection: {connection_resource})")
        print("Tip: Ensure the dataset exists and the BigQuery connection was created in the correct location with the correct name.")
        print(f"Details: {e}")
    except gax_exceptions.Forbidden as e:
        print("Permission error when creating model.")
        print("Tip: Ensure the connection's service account has the 'Vertex AI User' role.")
        print(f"Details: {e}")
    except Exception as e:
        print(f"Failed to create remote model: {e}")
else:
    print("BigQuery client not available. Skipping model creation.")

## 6. Summarize a Transaction with `ML.GENERATE_TEXT`

With the model in place, we can now use the `ML.GENERATE_TEXT` function.

1.  **Source Data**: We select a single transaction from the public `crypto_ethereum.transactions` table. We filter for transactions whose input data starts with `0x7ff36ab5`, the function selector for `swapExactETHForTokens` on Uniswap, which makes for an interesting summary.
2.  **Prompt Engineering**: We create a `prompt` by concatenating our instructions with the raw transaction input data.
3.  **`ML.GENERATE_TEXT` call**: We pass our model and the prompt data to the function.
4.  **Parameters**: We set `temperature` to a low value for more deterministic output and set a `max_output_tokens` limit.
5.  **Result**: The function returns its output in the `ml_generate_text_result` JSON column, from which we extract the generated text with `JSON_EXTRACT_SCALAR`.
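
The extraction in the last step can also be done on the client side. The sketch below mirrors the JSON path `$.candidates[0].content.parts[0].text` used in the query. The `extract_text` helper is a hypothetical name, and the sample payload is a trimmed-down illustration, not real model output.

```python
import json
from typing import Any, Optional


def extract_text(result: Any) -> Optional[str]:
    """Mirror the JSON path $.candidates[0].content.parts[0].text in Python."""
    # The DataFrame may hold the JSON column as a string or an already-parsed dict.
    payload = json.loads(result) if isinstance(result, str) else result
    try:
        return payload["candidates"][0]["content"]["parts"][0]["text"]
    except (KeyError, IndexError, TypeError):
        return None  # Missing path: same outcome as JSON_EXTRACT_SCALAR returning NULL.


# Hypothetical, trimmed-down response for illustration only.
sample = {
    "candidates": [
        {"content": {"role": "model", "parts": [{"text": "Swaps ETH for tokens."}]}}
    ]
}
print(extract_text(sample))
print(extract_text(json.dumps(sample)))
print(extract_text({"candidates": []}))
```

Returning `None` for a missing path makes it easy to tell an empty response, for example one with no candidates, apart from a failed query.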

In [None]:
from IPython.display import display, Markdown

if model_ready:
    # Compose the text-generation query
    source_table = "bigquery-public-data.crypto_ethereum.transactions"
    dataset_id = f"{PROJECT_ID}.bq_llm_testing"
    
    query = f"""
    WITH transactions_to_summarize AS (
      SELECT
        `hash`,
        input AS transaction_input
      FROM
        `{source_table}`
      WHERE STARTS_WITH(input, '0x7ff36ab5') -- Function: swapExactETHForTokens
      AND receipt_status = 1 -- Succeeded
      LIMIT 1
    )
    SELECT
      t.`hash`,
      -- Keep the raw JSON response for debugging.
      t.ml_generate_text_result AS raw_result,
      -- Extract the generated text from the Gemini response.
      JSON_EXTRACT_SCALAR(
        t.ml_generate_text_result,
        '$.candidates[0].content.parts[0].text'
      ) AS summary
    FROM
      ML.GENERATE_TEXT(
        MODEL `{dataset_id}.transaction_summarizer`,
        (SELECT
          CONCAT(
            'Explain this Ethereum transaction input data in plain English. ',
            'What is the likely function call and what are its parameters? Be concise. Input: ',
            transaction_input
          ) AS prompt,
          `hash`
         FROM transactions_to_summarize
        ),
        STRUCT(
          0.2 AS temperature,
          5120 AS max_output_tokens
        )
      ) AS t;
    """

    try:
        print("Querying public data and calling remote LLM...")
        df = client.query(query).to_dataframe()

        if not df.empty:
            tx_hash = df.loc[0, 'hash']
            summary = df.loc[0, 'summary']
            display(Markdown(f"### Summary for Transaction: `{tx_hash}`"))
            display(Markdown(summary))
        else:
            print("Query returned no results. Could not find a matching transaction to summarize.")

    except gax_exceptions.Forbidden as e:
        print("Permission error when calling ML.GENERATE_TEXT.")
        print("Tip: Double-check the permissions for the connection's service account.")
        print(f"Details: {e}")
    except Exception as e:
        print(f"Query failed: {e}")
else:
    print("Skipping text generation because the remote model is not available.")

## 7. Optional Cleanup

Run the following cell to delete the BigQuery model and dataset created during this test. This helps keep your project clean.

In [None]:
if client:
    dataset_id = f"{PROJECT_ID}.bq_llm_testing"
    model_id = f"{dataset_id}.transaction_summarizer"

    print("Cleaning up resources...")
    
    # Drop the model
    try:
        client.query(f"DROP MODEL IF EXISTS `{model_id}`").result()
        print(f"Model dropped: {model_id}")
    except Exception as e:
        print(f"Could not drop model: {e}")

    # Drop the dataset (delete_contents=True deletes its contents as well)
    try:
        client.delete_dataset(dataset_id, delete_contents=True, not_found_ok=True)
        print(f"Dataset dropped: {dataset_id}")
    except Exception as e:
        print(f"Could not drop dataset: {e}")
else:
    print("BigQuery client not available. Skipping cleanup.")