### We have EarthGenome's Patch Embeddings accessible as GeoParquet files on Source.Coop. 

### In the previous [notebook](/src/earthgenome_embeddings_bq_vectorsearch.ipynb) we loaded the individual .gpq files into a BigQuery table to enable vector search on them.

### Now we want to aggregate Google's EFM data (pixel-based, 10 m) into the same format (patch embeddings) so we can support the same kind of vector search use cases.

### This notebook is a heavy EE batch exporter, so use it with care.

In [1]:
import ee
print(ee.__version__)
import geemap
from pprint import pprint 
from google.cloud.bigquery import Client

project = "g4g-eaas"
# Set the credentials and project
ee.Initialize(project=project,
                  opt_url="https://earthengine-highvolume.googleapis.com"
              )


1.5.14


Load 2024 Google EFM image

In [2]:
# load Google EFM embedding image over EG data coverage
efm = ee.ImageCollection("GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL")
efm = efm.filterDate('2024-01-01', '2024-12-31') # EG embeddings are from 2024 so we match with EFM

We want our Google EFM "patch embeddings" to match EarthGenome's tiling as closely as possible, so that results are at least comparable in terms of what geographic area each model saw.

To do this, we aggregate the EFM imagery over the geometry of each EarthGenome embedding record.

The unit of analysis here is a single record in the EarthGenome embeddings table.
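
As a rough illustration of what this aggregation does numerically, here is a minimal NumPy sketch of mean-pooling per-pixel embeddings into one patch embedding. It is not the Earth Engine code used in this notebook. The shapes are assumptions: a 320 m patch sampled at 10 m gives roughly 32 x 32 pixels, and the EFM image is assumed to carry 64 embedding bands with unit-length pixel vectors. The helper name `mean_pool_patch` and the `renormalize` option are hypothetical.

```python
import numpy as np

def mean_pool_patch(pixels: np.ndarray, renormalize: bool = False) -> np.ndarray:
    """Average per-pixel embeddings of shape (H, W, D) into a single D-vector."""
    if pixels.ndim != 3:
        raise ValueError("expected an array shaped (height, width, dims)")
    flat = pixels.reshape(-1, pixels.shape[-1])
    # NaN stands in for masked pixels here; they are skipped in the average.
    patch = np.nanmean(flat, axis=0)
    if renormalize:
        norm = np.linalg.norm(patch)
        if norm > 0:
            patch = patch / norm
    return patch

# Synthetic stand-in for one patch: 32 x 32 pixels, 64 dims (assumed shapes).
rng = np.random.default_rng(0)
pixels = rng.normal(size=(32, 32, 64))
pixels /= np.linalg.norm(pixels, axis=-1, keepdims=True)  # unit-length pixels

patch = mean_pool_patch(pixels)
unit_patch = mean_pool_patch(pixels, renormalize=True)
print(patch.shape, float(np.linalg.norm(patch)), float(np.linalg.norm(unit_patch)))
```

One thing the sketch makes visible: the mean of unit-length vectors is generally shorter than unit length. If the downstream vector search uses dot-product distance rather than cosine distance, it may be worth renormalizing the aggregated vectors first.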

In [None]:
query = """
SELECT 
    tile, 
    COUNT(*) AS row_count
FROM 
    `g4g-eaas.embeddings_sea.earthgenome_cambodia_v1`
GROUP BY 
    tile
ORDER BY 
    row_count DESC
"""
client = Client()
query_job = client.query(query)
records_tile = {}
for row in query_job:
    # print(row)
    records_tile[row['tile']] = row['row_count']
print("record count by tile:")
pprint(records_tile)


record count by tile:
{'47PRP': 234973,
 '47PRQ': 234942,
 '47PRR': 234927,
 '48PTA': 235319,
 '48PTR': 235437,
 '48PTS': 235396,
 '48PTT': 235393,
 '48PTU': 235383,
 '48PTV': 235335,
 '48PUA': 235520,
 '48PUB': 235520,
 '48PUS': 235644,
 '48PUT': 235638,
 '48PUU': 235624,
 '48PUV': 235520,
 '48PVA': 235698,
 '48PVB': 235674,
 '48PVS': 235758,
 '48PVT': 235748,
 '48PVU': 235720,
 '48PVV': 235709,
 '48PWA': 235708,
 '48PWS': 235737,
 '48PWT': 235712,
 '48PWU': 235727,
 '48PWV': 235699,
 '48PXA': 235520,
 '48PXB': 235516,
 '48PXS': 235606,
 '48PXT': 235520,
 '48PXU': 235520,
 '48PXV': 235520,
 '48PYA': 235296,
 '48PYB': 235267,
 '48PYU': 235331,
 '48PYV': 235318}


In [None]:
import numpy as np
total_records = int(np.array(list(records_tile.values())).sum())
print(total_records)

8477875


In [None]:
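# GEE reducer aggregation helper: applies `reducer` to `img` over each feature's
# geometry in `fc` and stores the reduced band values as properties on that feature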
def reduce_nested(img, 
                   fc, 
                   reducer:ee.Reducer,
                   scale:int, 
                   crs:str, 
                   crs_transform:ee.List,
                   best_effort:bool, 
                   maxPixels:int, 
                   tileScale:int
                   ):
    def reduce(f):
        reduced = img.reduceRegion(reducer, 
                               f.geometry(), 
                               scale, 
                               crs, 
                               crs_transform, 
                               best_effort, 
                               maxPixels, 
                               tileScale
                               )
        return f.set(reduced)
    all_reduced = fc.map(reduce)
    return all_reduced


We will export the "Google EFM Patch Agg" records in chunks to a new BigQuery table.

Note that we chunk at BigQuery read-in time (via LIMIT/OFFSET), which limits the operations performed on the ee.FeatureCollection itself.
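
The chunk arithmetic can be checked without touching Earth Engine or BigQuery. The sketch below uses the record total printed earlier (8477875) and a chunk size of 50000; `plan_chunks` is a hypothetical helper introduced only for illustration.

```python
import math

def plan_chunks(total_records: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return (offset, row_count) pairs covering total_records in order."""
    n_chunks = math.ceil(total_records / chunk_size)
    plan = []
    for i in range(n_chunks):
        offset = i * chunk_size
        # The last chunk is usually shorter than chunk_size.
        rows = min(chunk_size, total_records - offset)
        plan.append((offset, rows))
    return plan

plan = plan_chunks(8_477_875, 50_000)
print(len(plan), plan[0], plan[-1])
```

With these numbers the plan has 170 chunks, and the last one starts at offset 8450000 and holds only 27875 rows. An export description of the form `{offset}_{offset + chunk}` therefore labels that final chunk with its nominal end (8500000), not the actual last row.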

In [None]:
import math
# Cambodia boundary (FAO GAUL level 0), buffered by 30 km
cambodia = (ee.FeatureCollection("FAO/GAUL/2015/level0").filter(ee.Filter.eq('ADM0_NAME', 'Cambodia'))
            .geometry().buffer(30e3))

dryrun=True # change if you're ready

PROJECT = "g4g-eaas"
DATASET = "embeddings_sea"
IN_TABLE = "earthgenome_cambodia_v1"  # table name only; PROJECT and DATASET are prepended in the query
OUT_TABLE = "google_efm_cambodia_v2_method_mapRR_tile_chunks"

chunk = 50000  # records per export task
chunks = math.ceil(total_records / chunk)
print(f"Total records: {total_records}, chunks: {chunks}")
for i in range(chunks):
        
        offset = int(i * chunk)
        
        query = f"""
        SELECT 
        eg.id as id, 
        eg.tile as tile,
        eg.geometry as geometry,
        FROM 
        `{PROJECT}.{DATASET}.{IN_TABLE}` as eg
        ORDER BY 
        tile DESC, id
        LIMIT {chunk}
        OFFSET {offset};
        """

        fc = (ee.FeatureCollection.runBigQuery(query,'geometry')
        .filterBounds(cambodia)
        .map(lambda f: f.setGeometry(f.geometry().buffer(160).bounds())) # EG embedding inputs were 320 X 320 m images (afaict)
        )

        # average EFM to eg patches
        efm_patch_embed_mapRR = reduce_nested(efm.mosaic(),
                           fc,
                           reducer=ee.Reducer.mean(),
                           scale=10,
                           crs='EPSG:4326',
                           crs_transform=None,
                           best_effort=True,
                           maxPixels=1e13,
                           tileScale=16
                          )
        taskBQ = ee.batch.Export.table.toBigQuery(
                collection=efm_patch_embed_mapRR,
                description=f"google_efm_patch_agg_bq_{offset}_{int(offset + chunk)}",
                table=f'{PROJECT}.{DATASET}.{OUT_TABLE}',
                append=True,
        )
        
        if dryrun:
                print(f"Would export chunk {offset}_{int(offset + chunk)} ({i+1}/{chunks})")
        else:
                taskBQ.start()  # submit the export task; without this nothing is exported
                print(f"Exporting chunk {offset}_{int(offset + chunk)} ({i+1}/{chunks})")


        

Total records: 8477875, chunks: 170
Exporting chunk 8000000_8050000 (161/170)
Exporting chunk 8050000_8100000 (162/170)
Exporting chunk 8100000_8150000 (163/170)
Exporting chunk 8150000_8200000 (164/170)
Exporting chunk 8200000_8250000 (165/170)
Exporting chunk 8250000_8300000 (166/170)
Exporting chunk 8300000_8350000 (167/170)
Exporting chunk 8350000_8400000 (168/170)
Exporting chunk 8400000_8450000 (169/170)
Exporting chunk 8450000_8500000 (170/170)
