# BOT Cache Builder

Pre-computes DICOM Basic Offset Tables (BOT) for every file in the pixels
table and persists them to Lakebase.

**Why run this?**

The DICOMweb server resolves frame byte offsets in a 3-tier cache:
1. In-memory BOT cache (microseconds) — fastest, lost on restart
2. Lakebase persistent cache (milliseconds) — survives restarts
3. On-demand BOT computation (seconds) — only on first access

Running this job after every ingest avoids tier-3 computation entirely:
the server finds each BOT in Lakebase on first access.
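
The lookup order above can be sketched as follows. This is a minimal illustration, not the server's actual code: `TieredBotLookup`, its dict-backed tier 2, and the `compute` callback are hypothetical stand-ins for the in-memory cache, Lakebase, and the on-demand BOT parser.

```python
from typing import Callable, Dict, List, Tuple

Bot = List[int]  # hypothetical representation: one byte offset per frame


class TieredBotLookup:
    """Hypothetical 3-tier lookup; the real server's classes may differ."""

    def __init__(self, persistent: Dict[str, Bot], compute: Callable[[str], Bot]):
        self.memory: Dict[str, Bot] = {}  # tier 1: lost on restart
        self.persistent = persistent      # tier 2: stands in for Lakebase
        self.compute = compute            # tier 3: parses the file on demand

    def get(self, path: str) -> Tuple[Bot, int]:
        """Return (bot, tier), where tier is the tier that answered."""
        if path in self.memory:
            return self.memory[path], 1
        bot = self.persistent.get(path)
        if bot is not None:
            self.memory[path] = bot       # promote to tier 1
            return bot, 2
        bot = self.compute(path)
        self.persistent[path] = bot       # write back so tier 3 runs once per file
        self.memory[path] = bot
        return bot, 3
```

Pre-populating the persistent tier, which is what this notebook does, means `get` never reaches the `compute` branch for files that were already ingested.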

**Startup preload**

The `CachePriorityScorer` at the end of this notebook shows how to
inspect which files will be front-loaded into the in-memory cache on
the next server restart, ranked by a weighted score of:
- Recency of last access (`last_used_at`, 1-day half-life)
- Recency of insertion (`inserted_at`, 1-week half-life)
- Access frequency (`access_count`, log-scaled)

In [0]:
%pip install psycopg2-binary pydicom databricks-sdk==0.88 --quiet
dbutils.library.restartPython()

In [0]:
dbutils.widgets.text("table", "main.pixels_solacc.object_catalog",
                     "UC pixels table (catalog.schema.table)")
dbutils.widgets.text("volume", "main.pixels_solacc.pixels_volume",
                     "UC pixels volume (catalog.schema.volume)")
dbutils.widgets.text("lakebase_instance", "pixels-lakebase",
                     "Lakebase instance name")
dbutils.widgets.text("priority_limit", "500",
                     "Top-N files to show in preload preview")

In [0]:
table = dbutils.widgets.get("table")
volume = dbutils.widgets.get("volume")
lakebase_instance = dbutils.widgets.get("lakebase_instance")
priority_limit = int(dbutils.widgets.get("priority_limit"))

# UC volumes are exposed under /Volumes/<catalog>/<schema>/<volume>.
volume_path = "/Volumes/" + volume.replace(".", "/")

print(f"Target table     : {table}")
print(f"Lakebase instance: {lakebase_instance}")

## Step 1 — Run BOTCacheBuilder

In [0]:
from dbx.pixels.dicom.cache import BOTCacheBuilder

# Fetch the notebook context once and read the workspace URL and API token from it.
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
host = ctx.apiUrl().get()
token = ctx.apiToken().get()

# Stream file paths from the pixels table; the checkpoint below means each
# run only processes files added since the previous run.
df = spark.readStream.table(table).select("local_path")

result_df = (
    BOTCacheBuilder(
        uc_table_name=table,
        lakebase_instance_name=lakebase_instance,
        host=host,
        token=token,
    )
    .transform(df)
    .select("local_path", "bot_cache_status")
)

# availableNow processes everything currently available, then stops.
(
    result_df.writeStream
    .option("checkpointLocation", f"{volume_path}/checkpoints/bot_lakebase_cache/{table}")
    .trigger(availableNow=True)
    .outputMode("append")
    .toTable(table + "_bot_cache_result")
    .awaitTermination()
)

## Step 2 — Summary

In [0]:
from pyspark.sql.functions import col, get_json_object

summary = (
    spark.read.table(table+"_bot_cache_result")
    .withColumn("status", get_json_object(col("bot_cache_status"), "$.status"))
    .groupBy("status")
    .count()
    .orderBy("status")
)

display(summary)

In [0]:
# Error details — inspect any failures.
errors = (
    spark.read.table(table+"_bot_cache_result")
    .withColumn("status", get_json_object(col("bot_cache_status"), "$.status"))
    .filter(col("status") == "error")
    .select(
        col("local_path"),
        get_json_object(col("bot_cache_status"), "$.error").alias("error"),
    )
)

# Count once: every call to count() triggers a separate Spark job.
error_count = errors.count()

if error_count > 0:
    print(f"⚠ {error_count} file(s) failed. See details below:")
    display(errors.limit(50))
else:
    print("✓ All files processed without errors.")

## Step 3 — Preload priority preview

Shows which files will be front-loaded into the in-memory BOT cache on
the next server restart, based on the priority scoring algorithm:

```
score = 0.5 × recency(last_used_at,  half-life=24 h)
      + 0.2 × recency(inserted_at,   half-life=168 h)
      + 0.3 × log10(1 + access_count)
```
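
A worked version of the formula, as a sketch. It assumes `recency` is a plain exponential decay (1.0 at age zero, 0.5 after one half-life) and that a missing timestamp contributes 0; the scorer's exact handling of these cases is not shown in this notebook, and both function names here are illustrative.

```python
import math
from datetime import datetime, timedelta, timezone
from typing import Optional


def recency(ts: Optional[datetime], half_life_hours: float, now: datetime) -> float:
    """Exponential decay with the given half-life (assumed form)."""
    if ts is None:
        return 0.0  # assumption: a never-set timestamp contributes nothing
    age_hours = max((now - ts).total_seconds() / 3600.0, 0.0)
    return 0.5 ** (age_hours / half_life_hours)


def priority_score(last_used_at: Optional[datetime],
                   inserted_at: Optional[datetime],
                   access_count: int,
                   now: datetime) -> float:
    """Weighted sum from the formula above."""
    return (
        0.5 * recency(last_used_at, 24.0, now)
        + 0.2 * recency(inserted_at, 168.0, now)
        + 0.3 * math.log10(1 + access_count)
    )


now = datetime(2024, 1, 8, tzinfo=timezone.utc)

# Last used one day ago, inserted one week ago, accessed 9 times:
# 0.5 * 0.5 + 0.2 * 0.5 + 0.3 * log10(10)
used_file = priority_score(now - timedelta(hours=24), now - timedelta(hours=168), 9, now)

# Just ingested, never accessed: only the insertion term is non-zero.
new_file = priority_score(None, now, 0, now)
```

Under these assumptions the first file scores about 0.65 and the second about 0.2.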

In [0]:
import json

from dbx.pixels.lakebase import LakebaseUtils

lb = LakebaseUtils(
    instance_name=lakebase_instance,
    uc_table_name=table,
    min_connections=1,
    max_connections=4,
)

priority_list = lb.get_preload_priority_list(
    uc_table_name=table,
    limit=priority_limit,
)

print(f"Top {len(priority_list)} files by preload priority:")

# The schema is inferred from the returned rows here.
priority_df = spark.createDataFrame(priority_list)
display(priority_df)

# Raw dump; default=str serializes timestamps and other non-JSON types.
print(json.dumps(priority_list, indent=2, default=str))

In [0]:
from pyspark.sql.types import (
    ArrayType,
    FloatType,
    IntegerType,
    LongType,
    StringType,
    StructField,
    StructType,
    TimestampType,
)

# Reuses `priority_list` from the previous cell, this time with an explicit
# schema instead of schema inference.

# Define explicit schema for the DataFrame.
# Byte offsets use LongType so that files larger than 2 GiB do not overflow.
frame_schema = StructType([
    StructField("frame_number", IntegerType(), True),
    StructField("start_pos", LongType(), True),
    StructField("end_pos", LongType(), True),
    StructField("pixel_data_pos", LongType(), True)
])

priority_schema = StructType([
    StructField("filename", StringType(), True),
    StructField("frame_count", IntegerType(), True),
    StructField("transfer_syntax_uid", StringType(), True),
    StructField("last_used_at", TimestampType(), True),
    StructField("inserted_at", TimestampType(), True),
    StructField("access_count", IntegerType(), True),
    StructField("priority_score", FloatType(), True),
    StructField("frames", ArrayType(frame_schema), True)
])

# Create DataFrame with explicit schema
priority_df = spark.createDataFrame(priority_list, schema=priority_schema)
display(priority_df)

### Understanding the priority score

| Score range | Meaning |
|-------------|---------|
| > 1.0       | Frequently accessed AND recently used |
| 0.3 – 1.0   | Moderately used, recent access |
| 0.1 – 0.3   | Newly ingested, not yet accessed |
| < 0.1       | Old, infrequent, not recently accessed |

Files scoring above **0.3** (the top two bands) should be preloaded first
to cover the majority of viewer requests after a server restart.
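
To see how a preview list splits across these bands, a small helper like the one below can be used. It is a sketch: `score_band` and `band_counts` are illustrative names, rows are assumed to be dicts carrying a `priority_score` key, and each band is assumed to include its lower bound, which the table leaves unspecified.

```python
from collections import Counter
from typing import Dict, Iterable


def score_band(score: float) -> str:
    """Map a priority score to a band label from the table above.

    Assumption: each band includes its lower bound.
    """
    if score > 1.0:
        return "> 1.0"
    if score >= 0.3:
        return "0.3 - 1.0"
    if score >= 0.1:
        return "0.1 - 0.3"
    return "< 0.1"


def band_counts(rows: Iterable[Dict]) -> Counter:
    """Count rows per band; each row needs a `priority_score` key."""
    return Counter(score_band(float(row["priority_score"])) for row in rows)
```

In this notebook, `band_counts(priority_list)` would give the per-band totals, provided the rows are dict-like.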

## Done

The Lakebase `dicom_frames` table is now populated.  On the next server
restart the DICOMweb gateway will:
1. Call `get_preload_priority_list()` to retrieve the ranked file list.
2. Load BOT entries into the in-memory `BOTCache` from Lakebase (tier 2)
   until the RAM budget is exhausted.
3. Serve subsequent frame requests from tier 1 or tier 2. Tier-3
   on-demand computation is needed only for brand-new files that this
   job has not yet processed.
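
The budget-bounded preload in step 2 can be sketched as a greedy loop over the ranked list. This is an assumption about the general shape, not the gateway's code: `preload_until_budget` and the per-frame memory estimate `BYTES_PER_FRAME` are hypothetical, and rows are assumed to be dicts with the `filename` and `frames` fields shown in the schema earlier.

```python
from typing import Dict, List, Tuple

BYTES_PER_FRAME = 64  # hypothetical per-frame memory cost, for illustration only


def preload_until_budget(
    ranked_rows: List[Dict], budget_bytes: int
) -> Tuple[Dict[str, list], int]:
    """Load rows in priority order until the next one would exceed the budget."""
    cache: Dict[str, list] = {}
    used = 0
    for row in ranked_rows:  # already sorted by descending priority
        size = len(row["frames"]) * BYTES_PER_FRAME
        if used + size > budget_bytes:
            break  # stop at the first entry that does not fit
        cache[row["filename"]] = row["frames"]
        used += size
    return cache, used
```

Stopping at the first entry that does not fit keeps the loaded set a strict prefix of the ranking. Skipping oversized entries and continuing would pack the budget more tightly, at the cost of loading lower-priority files ahead of a higher-priority one.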