# Gemini in BigQuery
This notebook is a step-by-step introduction to using Gemini and other models in BigQuery with BigQuery ML (BQML).

* **Step 1**: Create an external connection
* **Step 2**: Get connection details
* **Step 3**: Assign role Vertex AI User to service account
* **Step 4**: Create a dataset to store your models [optional]
* **Step 5**: Create the model
* **Step 6**: Use Gemini Flash with structured data
* **Step 7**: Use Gemini Flash with unstructured data
* **Step 8**: Work with embeddings

## How to use this notebook

Start with the **🥸 Initialization** cell below and update the entries to match your settings, e.g. change the region to `EU`.



*   ***Project ID***: the GCP project id. Initial value is derived from your current selection. (`project_`)
*   ***Region***: the location for your data and configuration in BigQuery. (`region_`)
*   ***Dataset***: name of the dataset. This is the container for your tables and created models. (`dataset_`)
*   ***Model name***: the model name in BigQuery. (`model_`)
*   ***Bucket***: the storage location for unstructured data, e.g. images or PDF documents. (`bucket_`, `bucket_name_`, `bucket_file_`)
*   ***User***: your current user, for information only
*   ***SA***: the service account for the remote connection to Vertex AI and Cloud Storage. It is created in the next steps, so no manual entry is required. (`saccount_`)

These fields are bound to internal variables that the command templates in the following cells use, e.g. `project_` wherever the project ID is needed. **Changes are applied automatically; no re-execution is necessary.**

Some of the command templates don't show the resulting command with expanded variables. If you want to see the expanded command without executing it, run the cell marked with 🐞 that follows it.

Some of the steps are optional, e.g. if you want to use an existing dataset, you don't have to create it. Just use the existing name in the dataset text-field.
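
The variable substitution described above can be sketched in plain Python. This is a minimal illustration, assuming the templates are expanded with `str.format` (as the `cell_magic_wrapper` function in the initialization cell does); the helper name `expand_template` and the sample values are hypothetical.

```python
# Minimal sketch of how a command template is expanded with the notebook variables.
# `expand_template` and the sample values are illustrative, not part of the notebook.

def expand_template(template: str, variables: dict) -> str:
    """Replace {name} placeholders in the template with values from `variables`."""
    return template.format(**variables)


sample_vars = {"project_": "my-project", "region_": "EU", "dataset_": "demo_ds"}
template = 'bq --project_id="{project_}" --location="{region_}" mk --dataset "{project_}:{dataset_}"'

print(expand_template(template, sample_vars))
```

Because `str.format` treats every `{` and `}` as a placeholder delimiter, literal braces in a template have to be doubled (`{{` and `}}`).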

In [1]:
# @title 🥸 Initialization (only execute once at the very beginning)
# python imports
import json
import os
import re
import ipywidgets as widgets
from IPython.display import display, Markdown


if not 'initialized_' in globals():
    global initialized_
    initialized_ = True

# these are the required services
required_services = [
    'aiplatform.googleapis.com',
    'cloudaicompanion.googleapis.com',
    'dataplex.googleapis.com',
    'compute.googleapis.com',
    'dataform.googleapis.com',
    'bigqueryconnection.googleapis.com'
]
filter_list = [f"(config.name:{service} AND state:ENABLED) OR " for service in required_services]
filter = "".join(filter_list)[:-4]

# PROJECT
if not 'project_' in globals():
    global project_
    project_ = os.environ['GOOGLE_CLOUD_PROJECT']

def on_project_change(change):
    if change['type'] == 'change' and change['name'] == 'value':
        global project_
        project_ = change['new']

project__ = widgets.Text(
    value= project_,
    placeholder='Project ID',
    description='Project ID:',
    disabled=False
)
project__.observe(on_project_change)
display(project__)

# REGION
if not 'region_' in globals():
    global region_
    region_ = os.environ['GOOGLE_CLOUD_REGION']

def on_region_change(change):
    if change['type'] == 'change' and change['name'] == 'value':
        global region_
        region_ = change['new']

region__ = widgets.Text(
    value= region_,
    placeholder='Region',
    description='Region :',
    disabled=False
)
region__.observe(on_region_change)

display(region__)

# DATASET
if not 'dataset_' in globals():
    global dataset_
    dataset_ = f"demo_ds"

def on_dataset_change(change):
    if change['type'] == 'change' and change['name'] == 'value':
        global dataset_
        dataset_ = change['new']

dataset__ = widgets.Text(
    value= dataset_,
    description='Dataset',
    disabled=False
)
dataset__.observe(on_dataset_change)
display(dataset__)

# CONNECTION
if not 'connection_' in globals():
    global connection_
    connection_ = f"my-connection"

def on_connection_change(change):
    if change['type'] == 'change' and change['name'] == 'value':
        global connection_
        connection_ = change['new']

connection__ = widgets.Text(
    value= connection_,
    description='Connection',
    disabled=False
)
connection__.observe(on_connection_change)
display(connection__)

# MODEL
if not 'model_' in globals():
    global model_
    model_ = "gemini-flash"

def on_model_change(change):
    if change['type'] == 'change' and change['name'] == 'value':
        global model_
        model_ = change['new']

model__ = widgets.Text(
    value= model_,
    description='Model name',
    disabled=False
)
model__.observe(on_model_change)
display(model__)

# BUCKET
def update_bucket_info(bucket):
  bre = re.search("gs://(.*?)/(.*)", bucket)
  if bre:
    global bucket_, bucket_name_, bucket_file_
    bucket_ = bucket
    bucket_name_ = bre.group(1)
    bucket_file_ = bre.group(2)

if not 'bucket_' in globals():
    update_bucket_info("gs://vertexit-golden/videos/*")

def on_bucket_change(change):
  if change['type'] == 'change' and change['name'] == 'value':
        update_bucket_info(change['new'])

bucket__ = widgets.Text(
    value= bucket_,
    description='Bucket',
    disabled=False
)
bucket__.observe(on_bucket_change)
display(bucket__)
update_bucket_info(bucket_)

# get the current user account
result = !gcloud auth list --filter="status:ACTIVE" --format="value(account)"
user_ = widgets.Text(
    value= result.nlstr,
    placeholder='User',
    description='User :',
    disabled=True
)
display(user_)

# service account
if not 'saccount_' in globals():
    global saccount_
    saccount_ = 'undefined'

def on_saccount_change(change):
    if change['type'] == 'change' and change['name'] == 'value':
        global saccount_
        saccount_ = change['new']

saccount__ = widgets.Text(
    value= saccount_,
    description='SA',
    disabled=False
)
saccount__.observe(on_saccount_change)
display(saccount__)

def extract_service_account(s):
    print(s.nlstr)
    g = re.search(r'{ *"serviceAccountId" *: *"([^"]+)"', s.nlstr)
    global saccount_
    if g:
      saccount_ = g.group(1)
      saccount__.value = saccount_
    else:
      saccount_ = 'unavailable'

def vars_dict():
  return {"project_": project_,
   "region_": region_,
   "connection_": connection_,
   "model_": model_,
   "saccount_": saccount_,
   "dataset_":dataset_,
   "bucket_":bucket_}

def cell_magic_wrapper(line, query):
    from google.cloud.bigquery.magics.magics import _cell_magic
    q = query.format(**vars_dict())
    print(q)
    return _cell_magic(line, q)

# this is a hack for variable substitution in queries
ip = get_ipython()
ip.register_magic_function(cell_magic_wrapper, magic_kind="cell", magic_name="bigquery")

class StopExecution(Exception):
    def _render_traceback_(self):
        return []

Text(value='vertexit', description='Project ID:', placeholder='Project ID')

Text(value='us-central1', description='Region :', placeholder='Region')

Text(value='demo_ds', description='Dataset')

Text(value='my-connection', description='Connection')

Text(value='gemini-flash', description='Model name')

Text(value='gs://vertexit-golden/videos/*', description='Bucket')

Text(value='pinky@bungenstock.altostrat.com', description='User :', disabled=True, placeholder='User')

Text(value='undefined', description='SA')

## Check services
Not all required services are activated by default, so we have to activate them. Click the following link and follow the process:
[Activate APIs](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,cloudaicompanion.googleapis.com,dataplex.googleapis.com,compute.googleapis.com,dataform.googleapis.com,bigqueryconnection.googleapis.com)


The following APIs should now be activated:
* Vertex AI API
* Gemini for Google Cloud API
* Cloud Dataplex API
* Compute Engine API
* Dataform API
* BigQuery Connection API

The next cell checks whether the required APIs are actually activated.

In [2]:
# @title 🐡 Run the check (optional)

# get the activated services
result = !gcloud services list --enabled --filter="$filter" --format="json(name)"
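try:
  json_result = json.loads(result.nlstr)
except json.JSONDecodeError:
  # gcloud returned an error message instead of JSON: show it and stop quietly
  print(result.nlstr)
  raise StopExecution()
# keep only the last path segment of each resource name (the service name)
activated_services = [re.search(r'([^/]+$)', service["name"]).group(0) for service in json_result]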
activated_services_map = dict.fromkeys(activated_services,True)
service_map = {name: name in activated_services_map for name in required_services}
for name,enabled in service_map.items():
    print(name.ljust(64, ' '), "🟢" if enabled else "🔴  << PLEASE ACTIVATE BEFORE PROCEEDING")

aiplatform.googleapis.com                                        🟢
cloudaicompanion.googleapis.com                                  🟢
dataplex.googleapis.com                                          🟢
compute.googleapis.com                                           🟢
dataform.googleapis.com                                          🟢
bigqueryconnection.googleapis.com                                🟢
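
The logic of the check can be sketched without calling `gcloud`. In this minimal example, `enabled_names` stands in for the parsed JSON output; the resource name format (`projects/<number>/services/<service>`) and the sample values are assumptions for illustration, and the helper names are hypothetical.

```python
# Sketch of the service check without calling gcloud.
# `enabled_names` stands in for the parsed output of `gcloud services list`;
# the sample values are made up for illustration.

def build_filter(services):
    """Join one clause per service with OR, as the initialization cell does."""
    return " OR ".join(f"(config.name:{s} AND state:ENABLED)" for s in services)


def service_status(required, enabled_names):
    """Map each required service to True if it appears among the enabled names."""
    # Only the last path segment of a resource name is the service name.
    enabled = {name.rsplit("/", 1)[-1] for name in enabled_names}
    return {service: service in enabled for service in required}


required = ["aiplatform.googleapis.com", "bigqueryconnection.googleapis.com"]
enabled_names = ["projects/123/services/aiplatform.googleapis.com"]

print(build_filter(required))
for name, ok in service_status(required, enabled_names).items():
    print(name.ljust(40), "enabled" if ok else "missing")
```

Joining the clauses with `" OR ".join(...)` avoids the trailing separator that would otherwise have to be sliced off.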


# **Step 1**: Create an external connection
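BigQuery needs an external connection of type `CLOUD_RESOURCE` to reach Vertex AI, where the Gemini models are served. Choose the connection location to match your data: BigQuery distinguishes between multi-regions (e.g. `us` and `eu`) and single regions (e.g. `us-central1` and `europe-west1`).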

In [None]:
!bq --project_id="{project_}" --location="{region_}" mk --connection --connection_type=CLOUD_RESOURCE "{connection_}"

In [None]:
# @title 🐞
print(f'!bq --project_id="{project_}" --location="{region_}" mk --connection --connection_type=CLOUD_RESOURCE "{connection_}"')
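
If you prefer to run the command from plain Python instead of the `!` shell escape, the same call can be assembled as an argument list. This is a minimal sketch: the function name `connection_command` and the sample values are hypothetical, and the flags are copied from the command above.

```python
import shlex

# Sketch: build the `bq mk --connection` call as an argument list.
# `connection_command` is an illustrative helper, not part of the notebook.

def connection_command(project, region, connection):
    """Return the bq command that creates a CLOUD_RESOURCE connection."""
    return [
        "bq",
        f"--project_id={project}",
        f"--location={region}",
        "mk",
        "--connection",
        "--connection_type=CLOUD_RESOURCE",
        connection,
    ]


cmd = connection_command("my-project", "us-central1", "my-connection")
print(shlex.join(cmd))  # shell-safe rendering for logging or copy and paste
```

An argument list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues when a value contains spaces or special characters.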

# **Step 2**: Get connection details (update service account variable)

Show the details of the created connection, including its service account. The cell also updates the `saccount_` variable.

In [9]:
# execute the command
result = !bq --project_id="{project_}" --location="{region_}" show --connection "{project_}.{region_}.{connection_}"
extract_service_account(result)

Connection vertexit.us-central1.my-connection

                   name                    friendlyName    description    Last modified         type        hasCredential                                            properties                                            
 ---------------------------------------- --------------- ------------- ----------------- ---------------- --------------- ----------------------------------------------------------------------------------------------- 
  623827730347.us-central1.my-connection   My Connection                 03 Apr 14:51:14   CLOUD_RESOURCE   False           {"serviceAccountId": "bqcx-623827730347-daeq@gcp-sa-bigquery-condel.iam.gserviceaccount.com"}  



In [None]:
# @title 🐞
print(f'!bq --project_id="{project_}" --location="{region_}" show --connection "{project_}.{region_}.{connection_}"')
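
The extraction step can be tried in isolation. The sketch below applies a regular expression similar to the one in `extract_service_account` to a shortened copy of the output shown above; the function name `find_service_account` is hypothetical.

```python
import re

# Sketch: pull the service account out of the `bq show --connection` output.
# `find_service_account` is an illustrative helper, not part of the notebook.

def find_service_account(bq_output: str):
    """Return the serviceAccountId value, or None if the output does not contain one."""
    match = re.search(r'"serviceAccountId"\s*:\s*"([^"]+)"', bq_output)
    return match.group(1) if match else None


sample = (
    'CLOUD_RESOURCE   False   '
    '{"serviceAccountId": "bqcx-623827730347-daeq@gcp-sa-bigquery-condel.iam.gserviceaccount.com"}'
)
print(find_service_account(sample))
```

Returning `None` when nothing matches lets the caller stop early instead of passing a placeholder string to the IAM command in the next step.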

# **Step 3**: Assign role Vertex AI User to service account

The created BigQuery connection uses a service account to access the Vertex AI APIs, so we have to grant it the **Vertex AI User** role (*roles/aiplatform.user*):


In [None]:
!gcloud projects add-iam-policy-binding "{project_}" --role=roles/aiplatform.user --condition="None" --member "serviceAccount:{saccount_}"

In [None]:
# @title 🐞
print(f'!gcloud projects add-iam-policy-binding "{project_}" --role=roles/aiplatform.user --condition="None" --member "serviceAccount:{saccount_}"')

# **Step 4**: Create a dataset to store your models [optional]
The dataset is the container in which AI models are stored. Either create a new dataset or use an existing one.

In [5]:
!bq --project_id="{project_}" --location="{region_}" mk --dataset "{project_}:{dataset_}"

Dataset 'vertexit:demo_ds' successfully created.


In [None]:
# @title 🐞
print(f'!bq --project_id="{project_}" --location="{region_}" mk --dataset "{project_}:{dataset_}"')

# **Step 5**: Create the model in BigQuery
It can take some time for the new service account permissions to propagate. If the statement fails with error code 400, retry every 30 seconds until it succeeds.

In [7]:
%%bigquery
CREATE OR REPLACE MODEL `{project_}.{dataset_}.{model_}`
REMOTE WITH CONNECTION `{project_}.{region_}.{connection_}`
OPTIONS(endpoint = 'gemini-1.5-flash');



In [None]:
# @title 🐞
print(f"""%%bigquery
CREATE OR REPLACE MODEL `{project_}.{dataset_}.{model_}`
REMOTE WITH CONNECTION `{project_}.{region_}.{connection_}`
OPTIONS(endpoint = 'gemini-1.5-flash');""")
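
The retry advice can be automated with a small helper. This is a minimal sketch under stated assumptions: `retry` and `flaky_create_model` are hypothetical names, the flaky function only simulates a failing `CREATE MODEL` call, and a real version would catch the specific BigQuery exception rather than every `Exception`.

```python
import time

# Sketch: retry an action at a fixed interval until it succeeds.
# `retry` and `flaky_create_model` are illustrative, not part of the notebook.

def retry(action, attempts=10, delay_seconds=30, sleep=time.sleep):
    """Call `action` until it succeeds or `attempts` is exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # a real version should catch a narrower type
            print(f"Attempt {attempt} failed: {error}")
            if attempt == attempts:
                raise
            sleep(delay_seconds)


calls = {"count": 0}

def flaky_create_model():
    """Simulate permissions that only become effective on the third call."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("400: permission not yet propagated")
    return "model created"


# The sleep function is replaced so the example finishes immediately.
result = retry(flaky_create_model, sleep=lambda seconds: None)
print(result)
```

Passing `sleep` as a parameter keeps the helper testable; in the notebook you would leave the default so that each retry waits 30 seconds.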

# **Step 6**: Use Gemini Flash with structured data
The next cell downloads some questions by ID from [Stack Overflow](https://stackoverflow.com/questions). It extracts information such as each question together with its answers and writes the data to a Parquet file named `stackoverflow.parquet`.

In [3]:
# @title Fetch content from stackoverflow
import bs4 as bs
import pandas as pd
import requests
from ipywidgets import IntProgress

# these are ids of stackoverflow articles
ids = ["868496", "439573", "5029840",
       "5618878", "12453580", "9461241",
       "80476", "36854940", "53271918", "51907035"]
rows = []
prog = IntProgress(min=0, max=len(ids)) # instantiate the bar
display(prog)
for id in ids:
  res = requests.get(f"https://stackoverflow.com/questions/{id}")
  soup = bs.BeautifulSoup(res.text, 'html.parser')
  row = {
      "id": id,
      "title": soup.find('title').text,
      "url": soup.find("meta", property="og:url")["content"],
      "description": soup.find("meta", property="og:description")["content"],
      "question": " ".join(soup.find(id='question').stripped_strings),
      "answers": " ".join(soup.find(id='answers').stripped_strings)
  }
  rows.append(row)
  prog.value += 1
df = pd.DataFrame(rows)
df.to_parquet("stackoverflow.parquet")
df.head()


| | id | title | url | description | question | answers |
|---|---|---|---|---|---|---|
| 0 | 868496 | How to convert char to integer in C? - Stack O... | https://stackoverflow.com/questions/868496/how... | Possible Duplicates:\n How to convert a singl... | 97 This question already has answers here : Cl... | 2 Answers 2 Sorted by: Reset to default Highes... |
| 1 | 439573 | c++ - How to convert a single char into an int... | https://stackoverflow.com/questions/439573/how... | I have a string of digits, e.g. "123456789", a... | 72 This question already has answers here : Co... | 11 Answers 11 Sorted by: Reset to default High... |
| 2 | 5029840 | Convert char to int in C and C++ - Stack Overflow | https://stackoverflow.com/questions/5029840/co... | How do I convert a char to an int in C and C++? | 595 How do I convert a char to an int in C and... | 14 Answers 14 Sorted by: Reset to default High... |
| 3 | 5618878 | python - How to convert list to string - Stack... | https://stackoverflow.com/questions/5618878/ho... | How can I convert a list to a string using Pyt... | 1209 This question already has answers here : ... | 3 Answers 3 Sorted by: Reset to default Highes... |
| 4 | 12453580 | python - How to concatenate (join) items in a ... | https://stackoverflow.com/questions/12453580/h... | How do I concatenate a list of strings into a ... | 1215 How do I concatenate a list of strings in... | 12 Answers 12 Sorted by: Reset to default High... |
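
The `url` and `description` columns come from the page's Open Graph `<meta>` tags. The sketch below shows the same idea with the standard library's `html.parser` as a stand-in for BeautifulSoup; the class name `OpenGraphParser` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

# Sketch: collect Open Graph <meta> properties with the standard library only.
# This is a stand-in for the BeautifulSoup calls in the cell above.

class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property")
        if prop and prop.startswith("og:") and "content" in attributes:
            self.properties[prop] = attributes["content"]


# Made-up page used only for this illustration.
sample_html = """
<html><head>
<title>Example question</title>
<meta property="og:url" content="https://example.com/questions/1">
<meta property="og:description" content="How do I convert a char to an int?">
</head><body></body></html>
"""

parser = OpenGraphParser()
parser.feed(sample_html)
print(parser.properties["og:url"])
print(parser.properties["og:description"])
```

Looking up a key in `parser.properties` with `.get(...)` would return `None` for pages that lack a tag, which is easier to handle than the `TypeError` raised when indexing a missing BeautifulSoup result.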


Load the created file `stackoverflow.parquet` to BigQuery:

In [6]:
!bq load --replace=true --source_format=PARQUET {project_}:{dataset_}.stackoverflow stackoverflow.parquet

Upload complete.
Waiting on bqjob_r3fac938da890b62f_00000191db9d108d_1 ... (1s) Current status: DONE   


Run the following query to generate the answers to the questions with Gemini:

In [8]:
%%bigquery
WITH selected AS (
  SELECT CONCAT('Answer the following question from stackoverflow: ', question) AS prompt
  FROM `{project_}.{dataset_}.stackoverflow` LIMIT 5
)
SELECT ml_generate_text_llm_result
FROM
  ML.GENERATE_TEXT(
    MODEL `{project_}.{dataset_}.{model_}`,
    TABLE selected,
    STRUCT(
      0.2 AS temperature,
      1024 AS max_output_tokens,
      TRUE AS FLATTEN_JSON_OUTPUT)
  );


| | ml_generate_text_llm_result |
|---|---|
| 0 | The code you provided is attempting to convert... |
| 1 | You're absolutely right, using `atoi` to conve... |
| 2 | The question on Stack Overflow is ambiguous, a... |
| 3 | The Stack Overflow question you provided asks ... |
| 4 | ```python\nmy_list = ['this', 'is', 'a', 'sent... |
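
To reuse the query with a different instruction or source table, the SQL text can be generated by a small function. This is only a sketch: `generate_text_query` and its parameters are hypothetical, and the statement it produces mirrors the query above rather than covering every `ML.GENERATE_TEXT` option.

```python
# Sketch: build the ML.GENERATE_TEXT statement used above from a few parameters.
# `generate_text_query` is an illustrative helper, not part of the notebook.

def generate_text_query(model, source_table, instruction, text_column,
                        limit=5, temperature=0.2, max_output_tokens=1024):
    """Return a query that prefixes `text_column` with `instruction` as the prompt."""
    # Escape single quotes so the instruction stays a valid SQL string literal.
    safe_instruction = instruction.replace("'", "\\'")
    return f"""WITH selected AS (
  SELECT CONCAT('{safe_instruction}', {text_column}) AS prompt
  FROM `{source_table}` LIMIT {limit}
)
SELECT ml_generate_text_llm_result
FROM
  ML.GENERATE_TEXT(
    MODEL `{model}`,
    TABLE selected,
    STRUCT(
      {temperature} AS temperature,
      {max_output_tokens} AS max_output_tokens,
      TRUE AS FLATTEN_JSON_OUTPUT)
  );"""


query = generate_text_query(
    model="my-project.demo_ds.gemini-flash",
    source_table="my-project.demo_ds.stackoverflow",
    instruction="Answer the following question from stackoverflow: ",
    text_column="question",
)
print(query)
```

The input table must expose a column named `prompt`, which is why the `CONCAT` expression is aliased in the `selected` subquery.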


In [9]:
# @title 🐞
print(f"""%%bigquery
WITH selected AS (
  SELECT CONCAT('Return a list of sentences in this article that cite a statistic: ', body) AS prompt
  FROM `bigquery-public-data.bbc_news.fulltext` LIMIT 5
)
SELECT ml_generate_text_llm_result
FROM
  ML.GENERATE_TEXT(
    MODEL `{project_}.{dataset_}.{model_}`,
    TABLE selected,
    STRUCT(
      0.2 AS temperature,
      1024 AS max_output_tokens,
      TRUE AS FLATTEN_JSON_OUTPUT)
  );""")

%%bigquery
WITH selected AS (
  SELECT CONCAT('Return a list of sentences in this article that cite a statistic: ', body) AS prompt
  FROM `bigquery-public-data.bbc_news.fulltext` LIMIT 5
)
SELECT ml_generate_text_llm_result
FROM
  ML.GENERATE_TEXT(
    MODEL `vertexit.demo_ds.gemini-flash`,
    TABLE selected,
    STRUCT(
      0.2 AS temperature,
      1024 AS max_output_tokens,
      TRUE AS FLATTEN_JSON_OUTPUT)
  );


# **Step 7**: Use Gemini Flash with unstructured data
This example demonstrates how to use unstructured data such as video, audio, or PDF files in BigQuery. First we have to create an object table in BigQuery. The object table contains metadata about the objects stored in Cloud Storage.
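
The object table is defined by a Cloud Storage URI pattern, which the initialization cell stores in `bucket_` and splits into `bucket_name_` and `bucket_file_`. The sketch below shows that split in isolation; `split_gcs_uri` is a hypothetical helper, and unlike the notebook's version it raises an error for malformed input instead of silently keeping the old values.

```python
import re

# Sketch: split a gs:// URI into bucket name and object pattern.
# `split_gcs_uri` is an illustrative helper, not part of the notebook.

def split_gcs_uri(uri):
    """Return (bucket_name, object_pattern) for a URI like gs://bucket/path/*."""
    match = re.fullmatch(r"gs://([^/]+)/(.*)", uri)
    if match is None:
        raise ValueError(f"Not a gs:// URI with a bucket and a path: {uri!r}")
    return match.group(1), match.group(2)


bucket_name, pattern = split_gcs_uri("gs://vertexit-golden/videos/*")
print(bucket_name, pattern)
```

The bucket name is what the `gsutil iam ch` command needs, while the full URI including the wildcard goes into the `uris` option of the object table.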

## Generate an object table

In [None]:
%%bigquery
CREATE OR REPLACE EXTERNAL TABLE `{project_}.{dataset_}.object_table`
WITH CONNECTION `{project_}.{region_}.{connection_}`
OPTIONS(
  object_metadata = 'SIMPLE',
  uris = ['{bucket_}']
);

In [None]:
# @title 🐞
print(f"""%%bigquery
CREATE OR REPLACE EXTERNAL TABLE `{project_}.{dataset_}.object_table`
WITH CONNECTION `{project_}.{region_}.{connection_}`
OPTIONS(
  object_metadata = 'SIMPLE',
  uris = ['{bucket_}']
);""")

## Assign role Object Viewer to service account

In [None]:
!gsutil iam ch serviceAccount:{saccount_}:objectViewer gs://{bucket_name_}

In [None]:
# @title 🐞
print(f'!gsutil iam ch serviceAccount:{saccount_}:objectViewer gs://{bucket_name_}')

## Check the content of the object table [optional]

In [None]:
%%bigquery
SELECT * FROM `{project_}.{dataset_}.object_table` LIMIT 5;

In [None]:
%%bigquery
SELECT * FROM EXTERNAL_OBJECT_TRANSFORM(TABLE `{project_}.{dataset_}.object_table`, ['SIGNED_URL']);

In [None]:
# @title 🐞
print(f"""%%bigquery
SELECT * FROM `{project_}.{dataset_}.object_table` LIMIT 5;""")

## Summarize the **videos**

In [None]:
%%bigquery
SELECT ml_generate_text_llm_result, ml_generate_text_status, signed_url  FROM
ML.GENERATE_TEXT(
  MODEL `{project_}.{dataset_}.{model_}`,
  TABLE `{project_}.{dataset_}.object_table`,
  STRUCT(0.2 AS temperature,
  'Generate a summary of the video' AS PROMPT,
  TRUE AS FLATTEN_JSON_OUTPUT)) result
JOIN EXTERNAL_OBJECT_TRANSFORM(
  TABLE `{project_}.{dataset_}.object_table`, ['SIGNED_URL']
) transformed ON result.uri = transformed.uri;

In [7]:
# @title 🐞
print(f"""%%bigquery
SELECT * FROM
ML.GENERATE_TEXT(
  MODEL `{project_}.{dataset_}.{model_}`,
  TABLE `{project_}.{dataset_}.object_table`,
  STRUCT(0.2 AS temperature,
  'Generate a summary of the video' AS PROMPT,
  TRUE AS FLATTEN_JSON_OUTPUT)) result
JOIN EXTERNAL_OBJECT_TRANSFORM(
  TABLE `{project_}.{dataset_}.object_table`, ['SIGNED_URL']
) transformed ON result.uri = transformed.uri;
""")

%%bigquery
SELECT * FROM
ML.GENERATE_TEXT(
  MODEL `vertexit.demos.gemini-flash`,
  TABLE `vertexit.demos.object_table`,
  STRUCT(0.2 AS temperature,
  'Generate a summary of the video' AS PROMPT,
  TRUE AS FLATTEN_JSON_OUTPUT)) result
JOIN EXTERNAL_OBJECT_TRANSFORM(
  TABLE `vertexit.demos.object_table`, ['SIGNED_URL']
) transformed ON result.uri = transformed.uri;



# **Step 8**: Work with embeddings

## Create a model for embeddings

In [17]:
%%bigquery
CREATE OR REPLACE MODEL `{project_}.{dataset_}.embedding_model`
REMOTE WITH CONNECTION `{project_}.{region_}.{connection_}`
OPTIONS(endpoint = 'text-embedding-004');


## Calculate the embeddings

In [63]:
%%bigquery
CREATE OR REPLACE TABLE `{project_}.{dataset_}.embeddings` AS (
SELECT id, title, ml_generate_embedding_result AS embedding
FROM
  ML.GENERATE_EMBEDDING(
    MODEL `{project_}.{dataset_}.embedding_model`,
    (SELECT id, title, question AS content
     FROM `{project_}.{dataset_}.stackoverflow` LIMIT 10)
  ))


## Compare articles to measure similarity

In [67]:
%%bigquery
SELECT query.id, query.title, base.id AS `base id`, base.title AS `base title`, distance
FROM
  VECTOR_SEARCH(
    TABLE `{project_}.{dataset_}.embeddings`,
    'embedding',
    (SELECT id, title, embedding FROM `{project_}.{dataset_}.embeddings` LIMIT 10),
    'embedding',
    top_k => 3)
WHERE
  distance > 0.0 AND distance < 0.7
ORDER BY distance;


| | id | title | base id | base title | distance |
|---|---|---|---|---|---|
| 0 | 439573 | c++ - How to convert a single char into an int... | 5029840 | Convert char to int in C and C++ - Stack Overflow | 0.563774 |
| 1 | 5029840 | Convert char to int in C and C++ - Stack Overflow | 439573 | c++ - How to convert a single char into an int... | 0.563774 |
| 2 | 5029840 | Convert char to int in C and C++ - Stack Overflow | 868496 | How to convert char to integer in C? - Stack O... | 0.573804 |
| 3 | 868496 | How to convert char to integer in C? - Stack O... | 5029840 | Convert char to int in C and C++ - Stack Overflow | 0.573804 |
| 4 | 439573 | c++ - How to convert a single char into an int... | 868496 | How to convert char to integer in C? - Stack O... | 0.632461 |
| 5 | 868496 | How to convert char to integer in C? - Stack O... | 439573 | c++ - How to convert a single char into an int... | 0.632461 |
| 6 | 5618878 | python - How to convert list to string - Stack... | 12453580 | python - How to concatenate (join) items in a ... | 0.637497 |
| 7 | 12453580 | python - How to concatenate (join) items in a ... | 5618878 | python - How to convert list to string - Stack... | 0.637497 |
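
What the query computes can be illustrated with a brute-force search in plain Python. This is a sketch under stated assumptions: it uses Euclidean distance (assumed here to be the distance type `VECTOR_SEARCH` applies when none is specified), tiny made-up 2-dimensional vectors instead of real embeddings, and a hypothetical helper `similar_items`.

```python
import math

# Sketch: brute-force similarity search mirroring the query above.
# Assumes Euclidean distance; the vectors are made up for illustration.

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similar_items(query_id, vectors, top_k=3, max_distance=0.7):
    """Return (id, distance) pairs for the nearest items, excluding exact matches."""
    query = vectors[query_id]
    ranked = sorted(
        ((other_id, euclidean_distance(query, vector)) for other_id, vector in vectors.items()),
        key=lambda pair: pair[1],
    )
    nearest = ranked[:top_k]  # still contains the query itself at distance 0
    # Same filter as the WHERE clause: drop the self-match and distant items.
    return [(other_id, d) for other_id, d in nearest if 0.0 < d < max_distance]


vectors = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
print(similar_items("a", vectors))
```

Because every row is also its own nearest neighbor, the `distance > 0.0` condition removes the trivial self-matches, and each remaining pair appears twice in the result table (once in each direction).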


## Find an article by text input

In [74]:
%%bigquery
SELECT query.question, base.id, base.title, distance
FROM
  VECTOR_SEARCH(
    TABLE `{project_}.{dataset_}.embeddings`,
    'embedding',
    (SELECT content AS question, ml_generate_embedding_result AS embedding
     FROM ML.GENERATE_EMBEDDING(
      MODEL `{project_}.{dataset_}.embedding_model`,
      (SELECT "I have a list of strings and want to concatenate them in python" AS content))
    ),
    'embedding',
    top_k => 3)
ORDER BY distance;


| | question | id | title | distance |
|---|---|---|---|---|
| 0 | I have a list of strings and want to concatena... | 12453580 | python - How to concatenate (join) items in a ... | 0.564545 |
| 1 | I have a list of strings and want to concatena... | 5618878 | python - How to convert list to string - Stack... | 0.661938 |
| 2 | I have a list of strings and want to concatena... | 80476 | How can I concatenate two arrays in Java? - St... | 0.886315 |


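## Vector index

A vector index can speed up `VECTOR_SEARCH` on large tables. With only 10 rows, the statement below fails: the `IVF` index type requires at least 5000 rows, so for small tables use `VECTOR_SEARCH` directly without an index.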

In [25]:
%%bigquery
CREATE OR REPLACE VECTOR INDEX embedding_index
ON `{project_}.{dataset_}.embeddings`(embedding)
STORING(id)
OPTIONS (index_type = 'IVF')

CREATE OR REPLACE VECTOR INDEX embedding_index
ON `vertexit.demos.embeddings`(embedding)
STORING(id)
OPTIONS (index_type = 'IVF')

Executing query with job ID: 13d153a7-4b79-44e1-9a71-d896027fe10e
Query executing: 0.33s


ERROR:
 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/vertexit/queries/13d153a7-4b79-44e1-9a71-d896027fe10e?maxResults=0&location=us-central1&prettyPrint=false: Total rows 10 is smaller than min allowed 5000 for CREATE VECTOR INDEX query with the IVF index type. Please use VECTOR_SEARCH table-valued function directly to perform the similarity search.

Location: us-central1
Job ID: 13d153a7-4b79-44e1-9a71-d896027fe10e
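
A simple guard can avoid this error by checking the row count before attempting to create the index. This is a minimal sketch: the threshold is taken from the error message above, while `search_strategy` and its return strings are hypothetical.

```python
# Sketch: decide whether an IVF vector index is worth attempting.
# The threshold comes from the error message; the helper is illustrative.

MIN_ROWS_FOR_IVF = 5000


def search_strategy(row_count, min_rows=MIN_ROWS_FOR_IVF):
    """Pick brute-force search for small tables and an IVF index otherwise."""
    if row_count < min_rows:
        return "brute-force VECTOR_SEARCH"
    return "create IVF vector index"


print(search_strategy(10))
print(search_strategy(250000))
```

In the notebook, the row count could come from a `SELECT COUNT(*)` on the embeddings table before the `CREATE VECTOR INDEX` statement is run.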

