# Prepare CEO Project Plots for Vector Search in BigQuery

This BigQuery (BQ) table prep workflow takes a CEO plots/samples table and converts it into one or more BQ tables that are ready for vector search.
Depending on the size of the CEO table and on unpredictable Earth Engine (EE) and BQ latency, we expect the data prep to take anywhere from several seconds to several minutes.

Because of that latency, the workflow should be configured as its own API endpoint (probably a Cloud Run service). When a CEO project admin enables the plot similarity tool, the CEO backend client kicks off the 'BQ table prep' job and checks back later to confirm that the BQ table is ready to use.

In [1]:
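import os

import ee

# Import only the helpers this notebook uses instead of a wildcard import.
from utils import (efm_plot_agg, export_to_bq, plot_to_fc, postprocess_bq,
                   table_exists, vector_index)

print(ee.__version__)

project = "collect-earth-online"
dataset = "sim_search_test"

# Initialize Earth Engine for the project, using the high-volume endpoint
ee.Initialize(project=project,
              opt_url="https://earthengine-highvolume.googleapis.com")
# Tag all EE requests from this notebook so they can be tracked
ee.data.setWorkloadTag("efm-bq-setup")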

1.5.14


### Demonstrate CEO plot -> BQ table workflow for one CEO file

The data engineering (DE) service needs to receive the CEO plot data somehow. Some ideas for how to do that:
* directly in a GET request (if the plot table can be reduced to simple plot IDs and geometry that fit in the request params)
    * likely suboptimal for larger data tables
* via a publisher/subscriber message (a GCS upload triggers a Pub/Sub message that the DE service subscribes to)
    * I have never used Google Cloud Pub/Sub, but it would be worth learning and could be a slick solution
* a hybrid of the two: the CEO backend uploads the plot table to GCS and, once the upload is done, passes the GCS object path to the DE service in the request params
    * probably more scalable than option 1, but clunkier than option 2
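
For the hybrid option, the DE service would have to validate the GCS object path it receives before doing any work. A minimal sketch, assuming the path arrives as a `gs://bucket/object` URI pointing at a `.csv` file; `parse_gcs_uri`, `GcsObject`, and the bucket name in the usage note are hypothetical:

```python
from typing import NamedTuple
from urllib.parse import urlparse


class GcsObject(NamedTuple):
    bucket: str
    name: str


def parse_gcs_uri(uri: str) -> GcsObject:
    """Split a gs:// URI into bucket and object name, rejecting bad input.

    Assumes object names contain no '?' or '#', which urlparse would
    treat as query or fragment separators.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URI: {uri!r}")
    name = parsed.path.lstrip("/")
    if not name.lower().endswith(".csv"):
        raise ValueError(f"expected a .csv object, got: {name!r}")
    return GcsObject(bucket=parsed.netloc, name=name)
```

For example, `parse_gcs_uri("gs://ceo-plots/uploads/plots.csv")` yields the bucket `ceo-plots` and the object name `uploads/plots.csv`, which the service could then hand to whatever reads the file.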

For demonstration, we just load a plot .csv locally.

In [2]:
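# Alternative input: "../plots/ceo-Estonia_SBP_carbonmonitoring_upto2021_v2-plot-data-2025-07-22.csv"
file = "../plots/ceo-Okayama-Tile-5-v2-plot-data-2025-07-22.csv"

# Only report a missing file as FileNotFoundError; other failures keep their cause.
if not os.path.isfile(file):
    raise FileNotFoundError(f"plot file not found: {file}")
try:
    plot_fc = plot_to_fc(file)  # convert the plot CSV to an EE FeatureCollection
except Exception as e:
    raise RuntimeError(f"read error for {file}: {e}") from e

row_count = plot_fc.size().getInfo()  # one round trip to EE
print(f"plot count: {row_count}")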

plot count: 100


#### Here we export a BQ table containing the embeddings for a given EFM data year. The user can then run plot similarity search for that specific year.

#### A more advanced implementation, out of scope for now: two-date change similarity (possibly using a dot product; this needs more research)

In [None]:
# Individual sim search tables can be created for any year covered by EFM [2017-2024].
years = [2022]

# Attach the EFM image data (n=64 bands) to each feature; one collection per year.
fc_embeddings = efm_plot_agg(plot_fc, years)

for year, yr_embed in zip(years, fc_embeddings):
    year_tag = str(year)
    new_table = f"{os.path.splitext(os.path.basename(file))[0]}_{year_tag}"
    # Export the FeatureCollection to a BQ table.
    table = export_to_bq(yr_embed,
                         project,
                         dataset,
                         new_table,
                         year_tag,
                         wait=True,
                         dry_run=False)

    try:
        # Fix the schema of the exported table so that it has a single
        # 'embedding' column containing a 1x64 array.
        pp_table = postprocess_bq(project, dataset, table, wait=True)
    except Exception as e:
        print(f"Couldn't post-process BQ table. reason: {e}")
        continue  # no "_pp" table exists to index, so move on to the next year

    # Creating a VECTOR index on a BQ table with < 5k rows is apparently not allowed
    # (presumably because it is unnecessary at that size), so check the row count and
    # only run vector_index as the last post-processing step for larger tables.
    if row_count > 5000:
        print("Creating Vector Index for large (n>5k) table")
        vector_index(project, dataset, table + "_pp", embedding_col='embedding', wait=True)

Exporting collect-earth-online.sim_search_test.ceo-Okayama-Tile-5-v2-plot-data-2025-07-22_2022_2022_437
polling for task: CU7S4Y52J324T5RWCQHROAT4
CU7S4Y52J324T5RWCQHROAT4:READY [sleeping 0.25 mins] 
CU7S4Y52J324T5RWCQHROAT4:RUNNING [sleeping 0.25 mins] 
CU7S4Y52J324T5RWCQHROAT4:COMPLETED
collect-earth-online.sim_search_test.ceo-Okayama-Tile-5-v2-plot-data-2025-07-22_2022_2022_437
collect-earth-online.sim_search_test.ceo-Okayama-Tile-5-v2-plot-data-2025-07-22_2022_2022_437_pp
Creating processed table: ceo-Okayama-Tile-5-v2-plot-data-2025-07-22_2022_2022_437_pp
Dropping original source table: ceo-Okayama-Tile-5-v2-plot-data-2025-07-22_2022_2022_437
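
The schema fix performed by `postprocess_bq` lives in `utils` and is not shown here. As a rough illustration of the idea only, the sketch below builds a SQL statement that collapses 64 per-band columns into one `embedding` array column and writes the result to a `_pp` table. The function name `build_postprocess_sql` and the band column names `A00` to `A63` are assumptions, not taken from the actual helper.

```python
def build_postprocess_sql(project: str, dataset: str, table: str,
                          n_bands: int = 64) -> str:
    """Build SQL that packs per-band columns into one 'embedding' array.

    ASSUMPTION: the exported table has one FLOAT64 column per band,
    named A00, A01, ..., A63. Adjust the names to match the real export.
    """
    bands = [f"A{i:02d}" for i in range(n_bands)]
    band_list = ", ".join(bands)
    src = f"`{project}.{dataset}.{table}`"
    dst = f"`{project}.{dataset}.{table}_pp`"
    return (
        f"CREATE OR REPLACE TABLE {dst} AS\n"
        f"SELECT * EXCEPT ({band_list}),\n"
        f"       [{band_list}] AS embedding\n"
        f"FROM {src}"
    )
```

The backticks around the table references matter here because the table names derived from the CEO file names contain hyphens.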


#### After the above completes, we can check later that the table exists (for example, when the admin publishes their CEO project)

In [5]:
table_exists(project,dataset,table+"_pp")

Table ceo-Okayama-Tile-5-v2-plot-data-2025-07-22_2022_2022_437_pp exists.


True
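
Once the table exists, a similarity lookup against it could use BigQuery's `VECTOR_SEARCH` table function. The sketch below only builds the query text. It assumes the processed table has a `plotid` column next to `embedding`, and the name `build_similarity_sql` is hypothetical; `@plotid` is a query parameter to be bound when the query is run.

```python
def build_similarity_sql(project: str, dataset: str, table: str,
                         top_k: int = 10) -> str:
    """Build a VECTOR_SEARCH query for the plots most similar to one plot.

    ASSUMPTION: the table has 'plotid' and 'embedding' columns.
    """
    table_ref = f"`{project}.{dataset}.{table}`"
    return (
        "SELECT base.plotid AS plotid, distance\n"
        "FROM VECTOR_SEARCH(\n"
        f"  TABLE {table_ref}, 'embedding',\n"
        f"  (SELECT embedding FROM {table_ref} WHERE plotid = @plotid),\n"
        f"  top_k => {int(top_k)}, distance_type => 'COSINE')\n"
        "ORDER BY distance"
    )
```

Because the reference plot is itself a row in the table, it should come back as its own closest match, so a caller would likely request one extra result and drop it.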