diff --git a/docs/en/guides/54-query/00-sql-analytics.md b/docs/en/guides/54-query/00-sql-analytics.md index 33669d838f..d22f068ac7 100644 --- a/docs/en/guides/54-query/00-sql-analytics.md +++ b/docs/en/guides/54-query/00-sql-analytics.md @@ -2,276 +2,249 @@ title: SQL Analytics --- -> **Scenario:** EverDrive Smart Vision analysts curate a shared set of drive sessions and key frames so every downstream workload can query the same IDs without copying data between systems. +> **Scenario:** CityDrive stages every dash-cam run into shared relational tables so analysts can filter, join, and aggregate the same `video_id` / `frame_id` pairs for all downstream workloads. -This tutorial builds a miniature **EverDrive Smart Vision** dataset and shows how Databend’s single optimizer works across the rest of the guides. Every ID you create here (`SES-20240801-SEA01`, `FRAME-0001` …) reappears in the JSON, vector, geo, and ETL walkthroughs for a consistent autonomous-driving story. +This walkthrough models the relational side of that catalog and highlights practical SQL building blocks. The sample IDs here appear again in the JSON, vector, geo, and ETL guides. -## 1. Create Sample Tables -Two tables capture test sessions and the important frames extracted from dash-camera video. +## 1. Create the Base Tables +`citydrive_videos` stores clip metadata, while `frame_events` records the interesting frames pulled from each clip. 
```sql -CREATE OR REPLACE TABLE drive_sessions ( - session_id VARCHAR, - vehicle_id VARCHAR, - route_name VARCHAR, - start_time TIMESTAMP, - end_time TIMESTAMP, - weather VARCHAR, - camera_setup VARCHAR +CREATE OR REPLACE TABLE citydrive_videos ( + video_id STRING, + vehicle_id STRING, + capture_date DATE, + route_name STRING, + weather STRING, + camera_source STRING, + duration_sec INT ); CREATE OR REPLACE TABLE frame_events ( - frame_id VARCHAR, - session_id VARCHAR, - frame_index INT, - captured_at TIMESTAMP, - event_type VARCHAR, - risk_score DOUBLE + frame_id STRING, + video_id STRING, + frame_index INT, + collected_at TIMESTAMP, + event_tag STRING, + risk_score DOUBLE, + speed_kmh DOUBLE ); -INSERT INTO drive_sessions VALUES - ('SES-20240801-SEA01', 'VEH-01', 'Seattle → Bellevue → Seattle', '2024-08-01 09:00', '2024-08-01 10:10', 'Sunny', 'Dual 1080p'), - ('SES-20240802-SEA02', 'VEH-02', 'Downtown Night Loop', '2024-08-02 20:15', '2024-08-02 21:05', 'Light Rain','Night Vision'), - ('SES-20240803-SEA03', 'VEH-03', 'Harbor Industrial Route', '2024-08-03 14:05', '2024-08-03 15:30', 'Overcast', 'Thermal + RGB'); +INSERT INTO citydrive_videos VALUES + ('VID-20250101-001', 'VEH-21', '2025-01-01', 'Downtown Loop', 'Rain', 'roof_cam', 3580), + ('VID-20250101-002', 'VEH-05', '2025-01-01', 'Port Perimeter', 'Overcast', 'front_cam',4020), + ('VID-20250102-001', 'VEH-21', '2025-01-02', 'Airport Connector', 'Clear', 'front_cam',3655), + ('VID-20250103-001', 'VEH-11', '2025-01-03', 'CBD Night Sweep', 'LightFog', 'rear_cam', 3310); INSERT INTO frame_events VALUES - ('FRAME-0001', 'SES-20240801-SEA01', 120, '2024-08-01 09:32:15', 'SuddenBrake', 0.82), - ('FRAME-0002', 'SES-20240801-SEA01', 342, '2024-08-01 09:48:03', 'CrosswalkPedestrian', 0.67), - ('FRAME-0003', 'SES-20240802-SEA02', 88, '2024-08-02 20:29:41', 'NightLowVisibility', 0.59), - ('FRAME-0004', 'SES-20240802-SEA02', 214, '2024-08-02 20:48:12', 'EmergencyVehicle', 0.73), - ('FRAME-0005', 'SES-20240803-SEA03', 305, 
'2024-08-03 15:02:44', 'CyclistOvertake', 0.64); + ('FRAME-0101', 'VID-20250101-001', 125, '2025-01-01 08:15:21', 'hard_brake', 0.81, 32.4), + ('FRAME-0102', 'VID-20250101-001', 416, '2025-01-01 08:33:54', 'pedestrian', 0.67, 24.8), + ('FRAME-0201', 'VID-20250101-002', 298, '2025-01-01 11:12:02', 'lane_merge', 0.74, 48.1), + ('FRAME-0301', 'VID-20250102-001', 188, '2025-01-02 09:44:18', 'hard_brake', 0.59, 52.6), + ('FRAME-0401', 'VID-20250103-001', 522, '2025-01-03 21:18:07', 'night_lowlight', 0.63, 38.9); ``` -> Need a refresher on table DDL? See [CREATE TABLE](/sql/sql-commands/ddl/table/ddl-create-table). +Docs: [CREATE TABLE](/sql/sql-commands/ddl/table/ddl-create-table), [INSERT](/sql/sql-commands/dml/dml-insert). --- -## 2. Filter Recent Sessions -Keep analytics focused on the most recent drives. +## 2. Filter the Working Set +Keep investigations focused on fresh drives. ```sql -WITH recent_sessions AS ( - SELECT * - FROM drive_sessions - WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP) +WITH recent_videos AS ( + SELECT * + FROM citydrive_videos + WHERE capture_date >= DATEADD('day', -3, TODAY()) ) -SELECT * -FROM recent_sessions -ORDER BY start_time DESC; +SELECT v.video_id, + v.route_name, + v.weather, + COUNT(f.frame_id) AS flagged_frames +FROM recent_videos v +LEFT JOIN frame_events f USING (video_id) +GROUP BY v.video_id, v.route_name, v.weather +ORDER BY flagged_frames DESC; ``` -Filtering early keeps later joins and aggregations fast. Docs: [WHERE & CASE](/sql/sql-commands/query-syntax/query-select#where-clause). +Docs: [DATEADD](/sql/sql-functions/datetime-functions/date-add), [GROUP BY](/sql/sql-commands/query-syntax/query-select#group-by-clause). --- -## 3. JOIN -### INNER JOIN ... USING -Combine session metadata with frame-level events. - +## 3. 
JOIN Patterns +### INNER JOIN for frame context ```sql -WITH recent_events AS ( - SELECT * - FROM frame_events - WHERE captured_at >= DATEADD('day', -7, CURRENT_TIMESTAMP) -) -SELECT e.frame_id, - e.captured_at, - e.event_type, - e.risk_score, - s.vehicle_id, - s.route_name, - s.weather -FROM recent_events e -JOIN drive_sessions s USING (session_id) -ORDER BY e.captured_at; +SELECT f.frame_id, + f.event_tag, + f.risk_score, + v.route_name, + v.camera_source +FROM frame_events AS f +JOIN citydrive_videos AS v USING (video_id) +ORDER BY f.collected_at; ``` -### NOT EXISTS (Anti Join) -Find events whose session metadata is missing. - +### Anti join QA ```sql SELECT frame_id -FROM frame_events e +FROM frame_events f WHERE NOT EXISTS ( - SELECT 1 - FROM drive_sessions s - WHERE s.session_id = e.session_id + SELECT 1 + FROM citydrive_videos v + WHERE v.video_id = f.video_id ); ``` -### LATERAL FLATTEN (JSON Unnest) -Combine events with detection objects stored inside JSON payloads. - +### LATERAL FLATTEN for nested detections ```sql -SELECT e.frame_id, - obj.value['type']::STRING AS object_type -FROM frame_events e -JOIN frame_payloads p USING (frame_id), - LATERAL FLATTEN(p.payload['objects']) AS obj; +SELECT f.frame_id, + obj.value['type']::STRING AS detected_type, + obj.value['confidence']::DOUBLE AS confidence +FROM frame_events AS f +JOIN frame_payloads AS p ON f.frame_id = p.frame_id, + LATERAL FLATTEN(input => p.payload['objects']) AS obj +WHERE f.event_tag = 'pedestrian' +ORDER BY confidence DESC; ``` -More patterns: [JOIN reference](/sql/sql-commands/query-syntax/query-join). +Docs: [JOIN](/sql/sql-commands/query-syntax/query-join), [FLATTEN](/sql/sql-functions/table-functions/flatten). --- -## 4. GROUP BY -### GROUP BY route_name, event_type -Standard `GROUP BY` to compare routes and event types. - +## 4. 
Aggregations for Fleet KPIs +### Behaviour by route ```sql -WITH recent_events AS ( - SELECT * - FROM frame_events - WHERE captured_at >= DATEADD('week', -4, CURRENT_TIMESTAMP) -) -SELECT route_name, - event_type, - COUNT(*) AS event_count, - AVG(risk_score) AS avg_risk -FROM recent_events -JOIN drive_sessions USING (session_id) -GROUP BY route_name, event_type -ORDER BY avg_risk DESC, event_count DESC; +SELECT v.route_name, + f.event_tag, + COUNT(*) AS occurrences, + AVG(f.risk_score) AS avg_risk +FROM frame_events f +JOIN citydrive_videos v USING (video_id) +GROUP BY v.route_name, f.event_tag +ORDER BY avg_risk DESC, occurrences DESC; ``` -### GROUP BY ROLLUP -Adds route subtotals plus a grand total. - +### ROLLUP totals ```sql -SELECT route_name, - event_type, - COUNT(*) AS event_count, - AVG(risk_score) AS avg_risk -FROM frame_events -JOIN drive_sessions USING (session_id) -GROUP BY ROLLUP(route_name, event_type) -ORDER BY route_name NULLS LAST, event_type; +SELECT v.route_name, + f.event_tag, + COUNT(*) AS occurrences +FROM frame_events f +JOIN citydrive_videos v USING (video_id) +GROUP BY ROLLUP(v.route_name, f.event_tag) +ORDER BY v.route_name NULLS LAST, f.event_tag; ``` -### GROUP BY CUBE -Generates all combinations of route and event type. - +### CUBE for route × weather coverage ```sql -SELECT route_name, - event_type, - COUNT(*) AS event_count, - AVG(risk_score) AS avg_risk -FROM frame_events -JOIN drive_sessions USING (session_id) -GROUP BY CUBE(route_name, event_type) -ORDER BY route_name NULLS LAST, event_type; +SELECT v.route_name, + v.weather, + COUNT(DISTINCT v.video_id) AS videos +FROM citydrive_videos v +GROUP BY CUBE(v.route_name, v.weather) +ORDER BY v.route_name NULLS LAST, v.weather NULLS LAST; ``` --- -## 5. WINDOW FUNCTION -### SUM(...) OVER (running total) -Track cumulative risk across each drive with a running `SUM`. - +## 5. 
Window Functions +### Running risk per video ```sql -WITH session_event_scores AS ( - SELECT session_id, - captured_at, - risk_score - FROM frame_events +WITH ordered_events AS ( + SELECT video_id, collected_at, risk_score + FROM frame_events ) -SELECT session_id, - captured_at, +SELECT video_id, + collected_at, risk_score, SUM(risk_score) OVER ( - PARTITION BY session_id - ORDER BY captured_at + PARTITION BY video_id + ORDER BY collected_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS cumulative_risk -FROM session_event_scores -ORDER BY session_id, captured_at; +FROM ordered_events +ORDER BY video_id, collected_at; ``` -### AVG(...) OVER (moving average) -Show a moving average of risk over the last three events: - +### Rolling average over recent frames ```sql -WITH session_event_scores AS ( - SELECT session_id, - captured_at, - risk_score - FROM frame_events -) -SELECT session_id, - captured_at, +SELECT video_id, + frame_id, + frame_index, risk_score, AVG(risk_score) OVER ( - PARTITION BY session_id - ORDER BY captured_at + PARTITION BY video_id + ORDER BY frame_index ROWS BETWEEN 3 PRECEDING AND CURRENT ROW - ) AS moving_avg_risk -FROM session_event_scores -ORDER BY session_id, captured_at; + ) AS rolling_avg_risk +FROM frame_events +ORDER BY video_id, frame_index; ``` -Window functions let you express rolling totals or averages inline. Full list: [Window functions](/sql/sql-functions/window-functions). +Docs: [Window functions](/sql/sql-functions/window-functions). --- -## 6. Aggregating Index Acceleration -Cache heavy summaries with an [Aggregating Index](/guides/performance/aggregating-index) so dashboards stay snappy. +## 6. Aggregating Index Boost +Persist frequently used summaries for dashboards. 
```sql -CREATE OR REPLACE AGGREGATING INDEX idx_route_event_summary ON frame_events +CREATE OR REPLACE AGGREGATING INDEX idx_video_event_summary AS -SELECT session_id, - event_type, +SELECT video_id, + event_tag, COUNT(*) AS event_count, AVG(risk_score) AS avg_risk FROM frame_events -GROUP BY session_id, event_type; +GROUP BY video_id, event_tag; ``` -Now run the same summary query as before—the optimizer will pull results from the index automatically: +When analysts rerun a familiar KPI, the optimizer serves it from the index: ```sql -SELECT s.route_name, - e.event_type, +SELECT v.route_name, + e.event_tag, COUNT(*) AS event_count, AVG(e.risk_score) AS avg_risk FROM frame_events e -JOIN drive_sessions s USING (session_id) -WHERE s.start_time >= DATEADD('week', -8, CURRENT_TIMESTAMP) -GROUP BY s.route_name, e.event_type +JOIN citydrive_videos v USING (video_id) +WHERE v.capture_date >= DATEADD('day', -14, TODAY()) +GROUP BY v.route_name, e.event_tag ORDER BY avg_risk DESC; ``` -`EXPLAIN` the statement to see the `AggregatingIndex` node instead of a full scan. Databend keeps the index fresh as new frames arrive, delivering sub-second dashboards without extra ETL jobs. +Docs: [Aggregating Index](/guides/performance/aggregating-index) and [EXPLAIN](/sql/sql-commands/explain-cmds/explain). --- ## 7. Stored Procedure Automation -You can also wrap the reporting logic in a stored procedure so it runs exactly the way you expect during scheduled jobs. +Wrap the logic so scheduled jobs always produce the same report. 
```sql -CREATE OR REPLACE PROCEDURE generate_weekly_route_report(days_back INT) -RETURNS TABLE(route_name VARCHAR, event_count BIGINT, avg_risk DOUBLE) +CREATE OR REPLACE PROCEDURE citydrive_route_report(days_back UINT8) +RETURNS TABLE(route_name STRING, event_tag STRING, event_count BIGINT, avg_risk DOUBLE) LANGUAGE SQL AS $$ BEGIN RETURN TABLE ( - SELECT s.route_name, - COUNT(*) AS event_count, - AVG(e.risk_score) AS avg_risk + SELECT v.route_name, + e.event_tag, + COUNT(*) AS event_count, + AVG(e.risk_score) AS avg_risk FROM frame_events e - JOIN drive_sessions s USING (session_id) - WHERE e.captured_at >= DATEADD('day', -days_back, CURRENT_TIMESTAMP) - GROUP BY s.route_name + JOIN citydrive_videos v USING (video_id) + WHERE v.capture_date >= DATEADD('day', -:days_back, TODAY()) + GROUP BY v.route_name, e.event_tag ); END; $$; -CALL PROCEDURE generate_weekly_route_report(28); +CALL PROCEDURE citydrive_route_report(30); ``` -Use the returned result set directly in notebooks, ETL tasks, or automated alerts. Learn more: [Stored procedure scripting](/sql/stored-procedure-scripting). +Stored procedures can be triggered manually, via [TASKS](/guides/load-data/continuous-data-pipelines/task), or from orchestration tools. --- -You now have a full loop: ingest session data, filter, join, aggregate, accelerate heavy queries, trend over time, and publish. Swap filters or joins to adapt the same recipe to other smart-driving KPIs like driver scoring, sensor degradation, or algorithm comparisons. +With these tables and patterns in place, the rest of the CityDrive guides can reference the exact same `video_id` keys—`frame_metadata_catalog` for JSON search, frame embeddings for similarity, GPS locations for geo queries, and a single ETL path to keep them synchronized. 
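To check whether the aggregating index from section 6 is actually picked up, one option is to run `EXPLAIN` on a query that matches the index definition exactly. This is a minimal sketch and assumes `idx_video_event_summary` has been created and is up to date. The plan text depends on the Databend version, so no expected output is shown.

```sql
-- Same grouping and aggregates as idx_video_event_summary, which makes this
-- the simplest candidate for an index rewrite. Look in the plan for a
-- reference to the aggregating index rather than a plain scan of frame_events.
EXPLAIN
SELECT video_id,
       event_tag,
       COUNT(*)        AS event_count,
       AVG(risk_score) AS avg_risk
FROM frame_events
GROUP BY video_id, event_tag;
```

The joined KPI query in section 6 groups by `route_name`, which is not part of the index definition, so it is worth inspecting that plan separately before assuming it is served from the index.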
diff --git a/docs/en/guides/54-query/01-json-search.md b/docs/en/guides/54-query/01-json-search.md index 0072c101ed..5f766298f0 100644 --- a/docs/en/guides/54-query/01-json-search.md +++ b/docs/en/guides/54-query/01-json-search.md @@ -2,139 +2,76 @@ title: JSON & Search --- -> **Scenario:** EverDrive Smart Vision’s perception services emit JSON payloads for every observed frame, and safety analysts need to search detections without moving the data out of Databend. +> **Scenario:** CityDrive attaches a metadata JSON payload to every extracted frame and needs Elasticsearch-style filtering on that JSON without copying it out of Databend. -EverDrive’s perception pipeline emits JSON payloads that we query with Elasticsearch-style syntax. By storing payloads as VARIANT and declaring an inverted index during table creation, Databend lets you run Lucene `QUERY` filters directly on the data. +Databend keeps these heterogeneous signals in one warehouse. Inverted indexes power Elasticsearch-style search on VARIANT columns, bitmap tables summarize label coverage, vector indexes answer similarity lookups, and native GEOMETRY columns support spatial filters. -## 1. CREATE SAMPLE TABLE -Each frame carries structured metadata from perception models (bounding boxes, velocities, classifications). +## 1. Create the Metadata Table +Store one JSON payload per frame so every search runs against the same structure. 
```sql -CREATE OR REPLACE TABLE frame_payloads ( - frame_id VARCHAR, - run_stage VARCHAR, - payload VARIANT, - logged_at TIMESTAMP, - INVERTED INDEX idx_frame_payloads(payload) -); - -INSERT INTO frame_payloads VALUES - ('FRAME-0001', 'detection', PARSE_JSON('{ - "objects": [ - {"type":"vehicle","bbox":[545,220,630,380],"confidence":0.94}, - {"type":"pedestrian","bbox":[710,200,765,350],"confidence":0.88} - ], - "ego": {"speed_kmh": 32.5, "accel": -2.1} - }'), '2024-08-01 09:32:16'), - ('FRAME-0002', 'detection', PARSE_JSON('{ - "objects": [ - {"type":"pedestrian","bbox":[620,210,670,360],"confidence":0.91} - ], - "scene": {"lighting":"daytime","weather":"sunny"} - }'), '2024-08-01 09:48:04'), - ('FRAME-0003', 'tracking', PARSE_JSON('{ - "objects": [ - {"type":"vehicle","speed_kmh": 18.0,"distance_m": 6.2}, - {"type":"emergency_vehicle","sirens":true} - ], - "scene": {"lighting":"night","visibility":"low"} - }'), '2024-08-02 20:29:42'); -``` - -## 2. SELECT JSON Paths -Peek into the payload to confirm the structure. - -```sql -SELECT frame_id, - payload['objects'][0]['type']::STRING AS first_object, - payload['ego']['speed_kmh']::DOUBLE AS ego_speed, - payload['scene']['lighting']::STRING AS lighting -FROM frame_payloads -ORDER BY logged_at; +CREATE DATABASE IF NOT EXISTS video_unified_demo; +USE video_unified_demo; + +CREATE OR REPLACE TABLE frame_metadata_catalog ( + doc_id STRING, + meta_json VARIANT, + captured_at TIMESTAMP, + INVERTED INDEX idx_meta_json (meta_json) +) CLUSTER BY (captured_at); ``` -Casting with `::STRING` / `::DOUBLE` exposes JSON values to regular SQL filters. Databend also supports Elasticsearch-style search on top of this data via the `QUERY` function—reference variant fields by prefixing them with the column name (for example `payload.objects.type`). More tips: [Semi-structured data](/guides/load-data/load-semistructured/load-ndjson). - ---- - -## 3. 
Elasticsearch-style Search -`QUERY` uses Elasticsearch/Lucene syntax, so you can combine boolean logic, ranges, boosts, and lists. Below are a few patterns on the EverDrive payloads: +> Need multimodal data (vector embeddings, GPS trails, tag bitmaps)? Grab the schemas from the [Vector](./02-vector-db.md) and [Geo](./03-geo-analytics.md) guides so you can combine them with the search results shown here. +## 2. Search Patterns with `QUERY()` ### Array Match -Find frames that detected a pedestrian: - ```sql -SELECT frame_id -FROM frame_payloads -WHERE QUERY('payload.objects.type:pedestrian') -ORDER BY logged_at DESC -LIMIT 10; +SELECT doc_id, + captured_at, + meta_json['detections'] AS detections +FROM frame_metadata_catalog +WHERE QUERY('meta_json.detections.objects.type:pedestrian') +ORDER BY captured_at DESC +LIMIT 5; ``` ### Boolean AND -Vehicle travelling faster than 30 km/h **and** a pedestrian detected: - ```sql -SELECT frame_id, - payload['ego']['speed_kmh']::DOUBLE AS ego_speed -FROM frame_payloads -WHERE QUERY('payload.objects.type:pedestrian AND payload.ego.speed_kmh:[30 TO *]') -ORDER BY ego_speed DESC; +SELECT doc_id, captured_at +FROM frame_metadata_catalog +WHERE QUERY('meta_json.scene.weather_code:rain + AND meta_json.camera.sensor_view:roof') +ORDER BY captured_at; ``` ### Boolean OR / List -Night drives encountering either an emergency vehicle or a cyclist: - ```sql -SELECT frame_id -FROM frame_payloads -WHERE QUERY('payload.scene.lighting:night AND payload.objects.type:(emergency_vehicle OR cyclist)'); +SELECT doc_id, + meta_json['media_meta']['tagging']['labels'] AS labels +FROM frame_metadata_catalog +WHERE QUERY('meta_json.media_meta.tagging.labels:(hard_brake OR swerve OR lane_merge)') +ORDER BY captured_at DESC +LIMIT 10; ``` ### Numeric Ranges -Speed between 10–25 km/h (inclusive) or strictly between 25–40 km/h: - ```sql -SELECT frame_id, - payload['ego']['speed_kmh'] AS speed -FROM frame_payloads -WHERE QUERY('payload.ego.speed_kmh:[10 TO 
25] OR payload.ego.speed_kmh:{25 TO 40}') -ORDER BY speed; +SELECT doc_id, + meta_json['vehicle']['speed_kmh']::DOUBLE AS speed +FROM frame_metadata_catalog +WHERE QUERY('meta_json.vehicle.speed_kmh:{30 TO 80}') +ORDER BY speed DESC +LIMIT 10; ``` ### Boosting -Prioritise frames where both a pedestrian and a vehicle appear, but emphasise the pedestrian term: - ```sql -SELECT frame_id, +SELECT doc_id, SCORE() AS relevance -FROM frame_payloads -WHERE QUERY('payload.objects.type:pedestrian^2 AND payload.objects.type:vehicle') +FROM frame_metadata_catalog +WHERE QUERY('meta_json.scene.weather_code:rain AND (meta_json.media_meta.tagging.labels:hard_brake^2 OR meta_json.media_meta.tagging.labels:swerve)') ORDER BY relevance DESC -LIMIT 10; -``` - -See [Search functions](/sql/sql-functions/search-functions) for complete Elasticsearch syntax supported by `QUERY`, `SCORE()`, and related helpers. - ---- - -## 4. Cross-Reference Frame Events -Join query results back to the frame-level risk scores created in the analytics guide. - -```sql -WITH risky_frames AS ( - SELECT frame_id, - payload['ego']['speed_kmh']::DOUBLE AS ego_speed - FROM frame_payloads - WHERE QUERY('payload.objects.type:pedestrian AND payload.ego.speed_kmh:[30 TO *]') -) -SELECT r.frame_id, - e.event_type, - e.risk_score, - r.ego_speed -FROM risky_frames r -JOIN frame_events e USING (frame_id) -ORDER BY e.risk_score DESC; +LIMIT 8; ``` -Because `frame_id` is shared across tables, you jump from raw payloads to curated analytics instantly. +`QUERY()` follows Elasticsearch semantics (boolean logic, ranges, boosts, lists). `SCORE()` exposes the Elasticsearch relevance so you can re-rank results inside SQL. See [Search functions](/sql/sql-functions/search-functions) for the full operator list. 
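The search patterns above run against `frame_metadata_catalog`, but this guide never inserts a row into it. Below is a minimal sketch of one sample row. The JSON shape is inferred from the field paths used in the `QUERY()` strings, `doc_id` is assumed to carry the `frame_id`, and the values are illustrative rather than taken from a real CityDrive payload.

```sql
-- Illustrative row only: keys mirror the paths referenced by the QUERY() examples
-- (scene.weather_code, camera.sensor_view, vehicle.speed_kmh,
--  detections.objects.type, media_meta.tagging.labels). All values are assumptions.
INSERT INTO frame_metadata_catalog VALUES
  ('FRAME-0101',
   PARSE_JSON('{
     "scene":      {"weather_code": "rain"},
     "camera":     {"sensor_view": "roof"},
     "vehicle":    {"speed_kmh": 32.4},
     "detections": {"objects": [{"type": "pedestrian", "confidence": 0.88}]},
     "media_meta": {"tagging": {"labels": ["hard_brake"]}}
   }'),
   '2025-01-01 08:15:21');

-- Quick sanity check that the inverted index can see the new document.
SELECT doc_id
FROM frame_metadata_catalog
WHERE QUERY('meta_json.scene.weather_code:rain');
```

With a row of this shape loaded, each pattern in section 2 has at least one document it could match, which makes the examples easier to try in a scratch database.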
diff --git a/docs/en/guides/54-query/02-vector-db.md b/docs/en/guides/54-query/02-vector-db.md index 8632600fa2..e224a996f4 100644 --- a/docs/en/guides/54-query/02-vector-db.md +++ b/docs/en/guides/54-query/02-vector-db.md @@ -2,94 +2,98 @@ title: Vector Search --- -> **Scenario:** EverDrive Smart Vision attaches compact vision embeddings to risky frames so investigation teams can surface similar situations directly inside Databend. +> **Scenario:** CityDrive keeps per-frame embeddings in Databend so semantic similarity search (“find frames that look like this”) runs alongside traditional SQL analytics—no extra vector service required. -Every extracted frame also has a vision embedding so perception engineers can discover similar scenarios. This guide shows how to insert those vectors and perform semantic search on top of the same EverDrive IDs. +The `frame_embeddings` table shares the same `frame_id` keys as `frame_events`, `frame_payloads`, and `frame_geo_points`, which keeps semantic search and classic SQL glued together. -## 1. CREATE SAMPLE TABLE -We store a compact example using four-dimensional vectors for readability. In production you might keep 512- or 1536-dim embeddings from CLIP or a self-supervised model. +## 1. Prepare the Embedding Table +Production models tend to emit 512–1536 dimensions. The example below uses 512 so you can copy it straight into a demo cluster without changing the DDL. 
```sql CREATE OR REPLACE TABLE frame_embeddings ( - frame_id VARCHAR, - session_id VARCHAR, - embedding VECTOR(4), - model_version VARCHAR, - created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, - VECTOR INDEX idx_frame_embeddings(embedding) distance='cosine' + frame_id STRING, + video_id STRING, + sensor_view STRING, + embedding VECTOR(512), + encoder_build STRING, + created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, + VECTOR INDEX idx_frame_embeddings(embedding) distance='cosine' ); INSERT INTO frame_embeddings VALUES - ('FRAME-0001', 'SES-20240801-SEA01', [0.18, 0.42, 0.07, 0.12]::VECTOR(4), 'clip-mini-v1', DEFAULT), - ('FRAME-0002', 'SES-20240801-SEA01', [0.20, 0.38, 0.12, 0.18]::VECTOR(4), 'clip-mini-v1', DEFAULT), - ('FRAME-0003', 'SES-20240802-SEA02', [0.62, 0.55, 0.58, 0.61]::VECTOR(4), 'night-fusion-v2', DEFAULT), - ('FRAME-0004', 'SES-20240802-SEA02', [0.57, 0.49, 0.52, 0.55]::VECTOR(4), 'night-fusion-v2', DEFAULT); + ('FRAME-0101', 'VID-20250101-001', 'roof_cam', RANDOM_VECTOR(512), 'clip-lite-v1', DEFAULT), + ('FRAME-0102', 'VID-20250101-001', 'roof_cam', RANDOM_VECTOR(512), 'clip-lite-v1', DEFAULT), + ('FRAME-0201', 'VID-20250101-002', 'front_cam',RANDOM_VECTOR(512), 'night-fusion-v2', DEFAULT), + ('FRAME-0401', 'VID-20250103-001', 'rear_cam', RANDOM_VECTOR(512), 'night-fusion-v2', DEFAULT); ``` -Docs: [Vector data type](/sql/sql-reference/data-types/vector) and [Vector index](/sql/sql-reference/data-types/vector#vector-indexing). +Docs: [Vector type](/sql/sql-reference/data-types/vector) and [Vector index](/sql/sql-reference/data-types/vector#vector-indexing). --- -## 2. COSINE_DISTANCE Search -Search for the frames most similar to `FRAME-0001`. +## 2. Run Cosine Search +Pull the embedding from one frame and let the HNSW index return the closest neighbours. 
```sql WITH query_embedding AS ( - SELECT embedding - FROM frame_embeddings - WHERE frame_id = 'FRAME-0001' - LIMIT 1 + SELECT embedding + FROM frame_embeddings + WHERE frame_id = 'FRAME-0101' ) SELECT e.frame_id, - e.session_id, - cosine_distance(e.embedding, q.embedding) AS distance -FROM frame_embeddings e -CROSS JOIN query_embedding q + e.video_id, + COSINE_DISTANCE(e.embedding, q.embedding) AS distance +FROM frame_embeddings AS e +CROSS JOIN query_embedding AS q ORDER BY distance LIMIT 3; ``` -The cosine distance calculation uses the HNSW index we created earlier, returning the closest frames first. +Lower distance = more similar. The `VECTOR INDEX` keeps latency low even with millions of frames. ---- - -## 3. WHERE Filter + Similarity -Combine similarity search with traditional predicates to narrow the results. +Add traditional predicates (route, video, sensor view) before or after the vector comparison to narrow the candidate set. ```sql WITH query_embedding AS ( - SELECT embedding - FROM frame_embeddings - WHERE frame_id = 'FRAME-0003' - LIMIT 1 + SELECT embedding + FROM frame_embeddings + WHERE frame_id = 'FRAME-0201' ) SELECT e.frame_id, - cosine_distance(e.embedding, q.embedding) AS distance -FROM frame_embeddings e -CROSS JOIN query_embedding q -WHERE e.session_id = 'SES-20240802-SEA02' -ORDER BY distance; + e.sensor_view, + COSINE_DISTANCE(e.embedding, q.embedding) AS distance +FROM frame_embeddings AS e +CROSS JOIN query_embedding AS q +WHERE e.sensor_view = 'rear_cam' +ORDER BY distance +LIMIT 5; ``` +The optimizer still uses the vector index while honoring the `sensor_view` filter. + --- -## 4. JOIN Semantic + Risk Metadata -Join the semantic results back to risk scores or detection payloads for richer investigation. +## 3. Enrich Similar Frames +Materialize the top matches, then enrich them with `frame_events` for downstream analytics. 
```sql WITH query_embedding AS ( - SELECT embedding FROM frame_embeddings WHERE frame_id = 'FRAME-0001' LIMIT 1 + SELECT embedding + FROM frame_embeddings + WHERE frame_id = 'FRAME-0102' ), similar_frames AS ( - SELECT frame_id, - cosine_distance(e.embedding, q.embedding) AS distance + SELECT frame_id, + video_id, + COSINE_DISTANCE(e.embedding, q.embedding) AS distance FROM frame_embeddings e CROSS JOIN query_embedding q ORDER BY distance LIMIT 5 ) SELECT sf.frame_id, - fe.event_type, + sf.video_id, + fe.event_tag, fe.risk_score, sf.distance FROM similar_frames sf @@ -97,4 +101,4 @@ LEFT JOIN frame_events fe USING (frame_id) ORDER BY sf.distance; ``` -This hybrid view surfaces “frames that look like FRAME-0001 and also triggered high-risk events”. +Because the embeddings live next to relational tables, you can pivot from “frames that look alike” to “frames that also had `hard_brake` tags, specific weather, or JSON detections” without exporting data to another service. diff --git a/docs/en/guides/54-query/03-geo-analytics.md b/docs/en/guides/54-query/03-geo-analytics.md index 929caf3be5..334149c22c 100644 --- a/docs/en/guides/54-query/03-geo-analytics.md +++ b/docs/en/guides/54-query/03-geo-analytics.md @@ -2,92 +2,97 @@ title: Geo Analytics --- -> **Scenario:** EverDrive Smart Vision logs GPS coordinates for each key frame so operations teams can map risky driving hot spots across the city. +> **Scenario:** CityDrive records precise GPS fixes and traffic-signal distances for each flagged frame so operations teams can answer “where did this happen?” entirely in SQL. -Every frame is tagged with GPS coordinates so we can map risky situations across the city. This guide adds a geospatial table and demonstrates spatial filters, polygons, and H3 bucketing using the same EverDrive session IDs. +`frame_geo_points` and `signal_contact_points` share the same `video_id`/`frame_id` keys as the rest of the guide, so you can move from SQL metrics to maps without copying data. 
-## 1. CREATE SAMPLE TABLE -Each record represents the ego vehicle at the moment a key frame was captured. Store coordinates as `GEOMETRY` so you can reuse functions like `ST_X`, `ST_Y`, and `HAVERSINE` shown throughout this workload. +## 1. Create Location Tables +If you followed the JSON guide, these tables already exist. The snippet below shows their structure plus a few Shenzhen samples. ```sql -CREATE OR REPLACE TABLE drive_geo ( - frame_id VARCHAR, - session_id VARCHAR, - location GEOMETRY, - speed_kmh DOUBLE, - heading_deg DOUBLE +CREATE OR REPLACE TABLE frame_geo_points ( + video_id STRING, + frame_id STRING, + position_wgs84 GEOMETRY, + solution_grade INT, + source_system STRING, + created_at TIMESTAMP ); -INSERT INTO drive_geo VALUES - ('FRAME-0001', 'SES-20240801-SEA01', TO_GEOMETRY('SRID=4326;POINT(-122.3321 47.6062)'), 28.0, 90), - ('FRAME-0002', 'SES-20240801-SEA01', TO_GEOMETRY('SRID=4326;POINT(-122.3131 47.6105)'), 35.4, 120), - ('FRAME-0003', 'SES-20240802-SEA02', TO_GEOMETRY('SRID=4326;POINT(-122.3419 47.6205)'), 18.5, 45), - ('FRAME-0004', 'SES-20240802-SEA02', TO_GEOMETRY('SRID=4326;POINT(-122.3490 47.6138)'), 22.3, 60), - ('FRAME-0005', 'SES-20240803-SEA03', TO_GEOMETRY('SRID=4326;POINT(-122.3610 47.6010)'), 30.1, 210); +INSERT INTO frame_geo_points VALUES + ('VID-20250101-001','FRAME-0101',TO_GEOMETRY('SRID=4326;POINT(114.0579 22.5431)'),104,'fusion_gnss','2025-01-01 08:15:21'), + ('VID-20250101-001','FRAME-0102',TO_GEOMETRY('SRID=4326;POINT(114.0610 22.5460)'),104,'fusion_gnss','2025-01-01 08:33:54'), + ('VID-20250101-002','FRAME-0201',TO_GEOMETRY('SRID=4326;POINT(114.1040 22.5594)'),104,'fusion_gnss','2025-01-01 11:12:02'), + ('VID-20250102-001','FRAME-0301',TO_GEOMETRY('SRID=4326;POINT(114.0822 22.5368)'),104,'fusion_gnss','2025-01-02 09:44:18'), + ('VID-20250103-001','FRAME-0401',TO_GEOMETRY('SRID=4326;POINT(114.1195 22.5443)'),104,'fusion_gnss','2025-01-03 21:18:07'); + +CREATE OR REPLACE TABLE signal_contact_points ( + node_id STRING, + 
signal_position GEOMETRY, + video_id STRING, + frame_id STRING, + frame_position GEOMETRY, + distance_m DOUBLE, + created_at TIMESTAMP +); ``` -Docs: [Geospatial data types](/sql/sql-reference/data-types/geospatial). +Docs: [Geospatial types](/sql/sql-reference/data-types/geospatial). --- -## 2. ST_DISTANCE Radius Filter -The `ST_DISTANCE` function measures the distance between geometries. Transform both the frame location and the hotspot into Web Mercator (SRID 3857) so the result is expressed in meters, then filter to 500 m. +## 2. Spatial Filters +Measure how far each frame was from a key downtown coordinate or check whether it falls inside a polygon. Convert to SRID 3857 when you need meter-level distances. ```sql -SELECT g.frame_id, - g.session_id, - e.event_type, - e.risk_score, +SELECT l.frame_id, + l.video_id, + f.event_tag, ST_DISTANCE( - ST_TRANSFORM(g.location, 3857), - ST_TRANSFORM(TO_GEOMETRY('SRID=4326;POINT(-122.3350 47.6080)'), 3857) - ) AS meters_from_hotspot -FROM drive_geo g -JOIN frame_events e USING (frame_id) + ST_TRANSFORM(l.position_wgs84, 3857), + ST_TRANSFORM(TO_GEOMETRY('SRID=4326;POINT(114.0600 22.5450)'), 3857) + ) AS meters_from_hq +FROM frame_geo_points AS l +JOIN frame_events AS f USING (frame_id) WHERE ST_DISTANCE( - ST_TRANSFORM(g.location, 3857), - ST_TRANSFORM(TO_GEOMETRY('SRID=4326;POINT(-122.3350 47.6080)'), 3857) - ) <= 500 -ORDER BY meters_from_hotspot; + ST_TRANSFORM(l.position_wgs84, 3857), + ST_TRANSFORM(TO_GEOMETRY('SRID=4326;POINT(114.0600 22.5450)'), 3857) + ) <= 400 +ORDER BY meters_from_hq; ``` -Need the raw geometry for debugging? Add `ST_ASTEXT(g.location)` to the projection. Prefer direct great-circle math instead? Swap in the `HAVERSINE` function, which operates on `ST_X`/`ST_Y` coordinates. - ---- - -## 3. ST_CONTAINS Polygon Filter -Check whether an event occurred inside a defined safety zone (for example, a school area). 
+Tip: add `ST_ASTEXT(l.position_wgs84)` while debugging or switch to [`HAVERSINE`](/sql/sql-functions/geospatial-functions#trigonometric-distance-functions) for great-circle math. ```sql WITH school_zone AS ( - SELECT TO_GEOMETRY('SRID=4326;POLYGON(( - -122.3415 47.6150, - -122.3300 47.6150, - -122.3300 47.6070, - -122.3415 47.6070, - -122.3415 47.6150 - ))') AS poly + SELECT TO_GEOMETRY('SRID=4326;POLYGON(( + 114.0505 22.5500, + 114.0630 22.5500, + 114.0630 22.5420, + 114.0505 22.5420, + 114.0505 22.5500 + ))') AS poly ) -SELECT g.frame_id, - g.session_id, - e.event_type -FROM drive_geo g -JOIN frame_events e USING (frame_id) +SELECT l.frame_id, + l.video_id, + f.event_tag +FROM frame_geo_points AS l +JOIN frame_events AS f USING (frame_id) CROSS JOIN school_zone -WHERE ST_CONTAINS(poly, g.location); +WHERE ST_CONTAINS(poly, l.position_wgs84); ``` --- -## 4. GEO_TO_H3 Heatmap -Aggregate events by hexagonal cell to build route heatmaps. +## 3. Hex Aggregations +Aggregate risky frames into hexagonal buckets for dashboards. ```sql -SELECT GEO_TO_H3(ST_X(location), ST_Y(location), 8) AS h3_cell, +SELECT GEO_TO_H3(ST_X(position_wgs84), ST_Y(position_wgs84), 8) AS h3_cell, COUNT(*) AS frame_count, - AVG(e.risk_score) AS avg_risk -FROM drive_geo -JOIN frame_events e USING (frame_id) + AVG(f.risk_score) AS avg_risk +FROM frame_geo_points AS l +JOIN frame_events AS f USING (frame_id) GROUP BY h3_cell ORDER BY avg_risk DESC; ``` @@ -96,44 +101,56 @@ Docs: [H3 functions](/sql/sql-functions/geospatial-functions#h3-indexing--conver --- -## 5. ST_DISTANCE + JSON QUERY -Combine spatial distance checks with rich detection metadata (from the JSON guide) to build precise alerts. +## 4. Traffic Context +Join `signal_contact_points` and `frame_geo_points` to validate stored metrics, or blend spatial predicates with JSON search.
```sql -WITH near_intersection AS ( - SELECT frame_id - FROM drive_geo - WHERE ST_DISTANCE( - ST_TRANSFORM(location, 3857), - ST_TRANSFORM(TO_GEOMETRY('SRID=4326;POINT(-122.3410 47.6130)'), 3857) - ) <= 200 +SELECT t.node_id, + t.video_id, + t.frame_id, + ST_DISTANCE(t.signal_position, t.frame_position) AS recomputed_distance, + t.distance_m AS stored_distance, + l.source_system +FROM signal_contact_points AS t +JOIN frame_geo_points AS l USING (frame_id) +WHERE t.distance_m < 0.03 -- roughly < 30 meters depending on SRID +ORDER BY t.distance_m; +``` + +```sql +WITH near_junction AS ( + SELECT frame_id + FROM frame_geo_points + WHERE ST_DISTANCE( + ST_TRANSFORM(position_wgs84, 3857), + ST_TRANSFORM(TO_GEOMETRY('SRID=4326;POINT(114.0700 22.5400)'), 3857) + ) <= 150 ) -SELECT n.frame_id, - p.payload['objects'][0]['type']::STRING AS first_object, - e.event_type, - e.risk_score -FROM near_intersection n -JOIN frame_payloads p USING (frame_id) -JOIN frame_events e USING (frame_id) -WHERE QUERY('payload.objects.type:pedestrian'); +SELECT f.frame_id, + f.event_tag, + meta.meta_json['media_meta']['tagging']['labels'] AS labels +FROM near_junction nj +JOIN frame_events AS f USING (frame_id) +JOIN frame_metadata_catalog AS meta + ON meta.doc_id = nj.frame_id +WHERE QUERY('meta_json.media_meta.tagging.labels:hard_brake'); ``` -Spatial filters, JSON operators, and classic SQL all run in one statement. +This pattern lets you filter by geography first, then apply JSON search to the surviving frames. --- -## 6. CREATE VIEW Heatmap -Export hex-level summaries to visualization tools or map layers. +## 5. Publish a Heatmap View +Expose the geo heatmap to BI or GIS tools without re-running heavy SQL. 
```sql -CREATE OR REPLACE VIEW v_route_heatmap AS ( - SELECT GEO_TO_H3(ST_X(location), ST_Y(location), 7) AS h3_cell, - COUNT(*) AS frames, - AVG(e.risk_score) AS avg_risk - FROM drive_geo - JOIN frame_events e USING (frame_id) - GROUP BY h3_cell -); +CREATE OR REPLACE VIEW v_citydrive_geo_heatmap AS +SELECT GEO_TO_H3(ST_X(position_wgs84), ST_Y(position_wgs84), 7) AS h3_cell, + COUNT(*) AS frames, + AVG(f.risk_score) AS avg_risk +FROM frame_geo_points AS l +JOIN frame_events AS f USING (frame_id) +GROUP BY h3_cell; ``` -Downstream systems can query `v_route_heatmap` directly to render risk hot spots on maps without reprocessing raw telemetry. +Databend now serves vector, text, and spatial queries off the exact same `video_id`, so investigation teams never have to reconcile separate pipelines. diff --git a/docs/en/guides/54-query/04-lakehouse-etl.md b/docs/en/guides/54-query/04-lakehouse-etl.md index 6a2b912a17..f8f5892596 100644 --- a/docs/en/guides/54-query/04-lakehouse-etl.md +++ b/docs/en/guides/54-query/04-lakehouse-etl.md @@ -2,185 +2,223 @@ title: Lakehouse ETL --- -> **Scenario:** EverDrive Smart Vision’s data engineering team ships every road-test batch as Parquet files so the unified workloads can load, query, and enrich the same telemetry inside Databend. +> **Scenario:** CityDrive’s data engineering team exports each dash-cam batch as Parquet (videos, frame events, metadata JSON, embeddings, GPS tracks, traffic-signal distances) and wants one COPY pipeline to refresh the shared tables in Databend. -EverDrive’s ingest loop is straightforward: +The loading loop is straightforward: ``` -Object-store export (Parquet for example) → Stage → COPY INTO → (optional) Stream & Task +Object storage → STAGE → COPY INTO tables → (optional) STREAMS/TASKS ``` -Adjust bucket paths/credentials (and swap Parquet for your actual format if different), then paste the commands below. All syntax mirrors the official [Load Data guides](/guides/load-data/). 
+Adjust the bucket path or format to match your environment, then paste the commands below. Syntax mirrors the [Load Data guides](/guides/load-data/). --- -## 1. Stage -EverDrive’s data engineering team exports four files per batch—sessions, frame events, detection payloads (with nested JSON fields), and frame embeddings—to an S3 bucket. This guide uses Parquet as the example format, but you can plug in CSV, JSON, or other supported formats by adjusting the `FILE_FORMAT` clause. Create a named connection once, then reuse it across stages. +## 1. Create a Stage +Point a reusable stage at the bucket that holds the CityDrive exports. Swap the credentials/URL for your own account; Parquet is used here, but any supported format works with a different `FILE_FORMAT`. ```sql -CREATE OR REPLACE CONNECTION everdrive_s3 +CREATE OR REPLACE CONNECTION citydrive_s3 STORAGE_TYPE = 's3' ACCESS_KEY_ID = '' SECRET_ACCESS_KEY = ''; -CREATE OR REPLACE STAGE drive_stage - URL = 's3://everdrive-lakehouse/raw/' - CONNECTION = (CONNECTION_NAME = 'everdrive_s3') +CREATE OR REPLACE STAGE citydrive_stage + URL = 's3://citydrive-lakehouse/raw/' + CONNECTION = (CONNECTION_NAME = 'citydrive_s3') FILE_FORMAT = (TYPE = 'PARQUET'); ``` -See [Create Stage](/sql/sql-commands/ddl/stage/ddl-create-stage) for additional options. +> [!IMPORTANT] +> Replace the placeholder AWS keys and bucket URL with real values from your environment. Without valid credentials, `LIST`, `SELECT ... FROM @citydrive_stage`, and `COPY INTO` statements will fail with `InvalidAccessKeyId`/403 errors from S3. 
-List the export folders (Parquet in this walkthrough) to confirm they are visible: +Quick sanity check: ```sql -LIST @drive_stage/sessions/; -LIST @drive_stage/frame-events/; -LIST @drive_stage/payloads/; -LIST @drive_stage/embeddings/; +LIST @citydrive_stage/videos/; +LIST @citydrive_stage/frame-events/; +LIST @citydrive_stage/manifests/; +LIST @citydrive_stage/frame-embeddings/; +LIST @citydrive_stage/frame-locations/; +LIST @citydrive_stage/traffic-lights/; ``` --- -## 2. Preview -Before loading anything, peek inside the Parquet files to validate the schema and sample records. +## 2. Peek at the Files +Use a `SELECT` against the stage to confirm schema and sample rows before loading. ```sql SELECT * -FROM @drive_stage/sessions/session_2024_08_16.parquet +FROM @citydrive_stage/videos/capture_date=2025-01-01/videos.parquet LIMIT 5; SELECT * -FROM @drive_stage/frame-events/frame_events_2024_08_16.parquet +FROM @citydrive_stage/frame-events/batch_2025_01_01.parquet LIMIT 5; ``` -Repeat the preview for payloads and embeddings as needed. Databend automatically uses the file format specified on the stage. +Databend infers the format from the stage definition, so no extra options are required here. --- -## 3. COPY INTO -Load each file into the tables used throughout the guides. Use inline casts to map incoming columns to table columns; the projections below assume Parquet but the same shape applies to other formats. +## 3. COPY INTO the Unified Tables +Each export maps to one of the shared tables used across the guides. Inline casts keep schemas consistent even if upstream ordering changes. 
-### Sessions +### `citydrive_videos` ```sql -COPY INTO drive_sessions (session_id, vehicle_id, route_name, start_time, end_time, weather, camera_setup) +COPY INTO citydrive_videos (video_id, vehicle_id, capture_date, route_name, weather, camera_source, duration_sec) FROM ( - SELECT session_id::STRING, + SELECT video_id::STRING, vehicle_id::STRING, + capture_date::DATE, route_name::STRING, - start_time::TIMESTAMP, - end_time::TIMESTAMP, weather::STRING, - camera_setup::STRING - FROM @drive_stage/sessions/ + camera_source::STRING, + duration_sec::INT + FROM @citydrive_stage/videos/ ) FILE_FORMAT = (TYPE = 'PARQUET'); ``` -### Frame Events +### `frame_events` ```sql -COPY INTO frame_events (frame_id, session_id, frame_index, captured_at, event_type, risk_score) +COPY INTO frame_events (frame_id, video_id, frame_index, collected_at, event_tag, risk_score, speed_kmh) FROM ( SELECT frame_id::STRING, - session_id::STRING, + video_id::STRING, frame_index::INT, - captured_at::TIMESTAMP, - event_type::STRING, - risk_score::DOUBLE - FROM @drive_stage/frame-events/ + collected_at::TIMESTAMP, + event_tag::STRING, + risk_score::DOUBLE, + speed_kmh::DOUBLE + FROM @citydrive_stage/frame-events/ ) FILE_FORMAT = (TYPE = 'PARQUET'); ``` -### Detection Payloads -The payload files include nested columns (`payload` column is a JSON object). Use the same projection to copy them into the `frame_payloads` table. 
+### `frame_metadata_catalog` +```sql +COPY INTO frame_metadata_catalog (doc_id, meta_json, captured_at) +FROM ( + SELECT doc_id::STRING, + meta_json::VARIANT, + captured_at::TIMESTAMP + FROM @citydrive_stage/manifests/ +) +FILE_FORMAT = (TYPE = 'PARQUET'); +``` +### `frame_embeddings` ```sql -COPY INTO frame_payloads (frame_id, run_stage, payload, logged_at) +COPY INTO frame_embeddings (frame_id, video_id, sensor_view, embedding, encoder_build, created_at) FROM ( SELECT frame_id::STRING, - run_stage::STRING, - payload, - logged_at::TIMESTAMP - FROM @drive_stage/payloads/ + video_id::STRING, + sensor_view::STRING, + embedding::VECTOR(768), -- replace with your actual dimension + encoder_build::STRING, + created_at::TIMESTAMP + FROM @citydrive_stage/frame-embeddings/ ) FILE_FORMAT = (TYPE = 'PARQUET'); ``` -### Frame Embeddings +### `frame_geo_points` ```sql -COPY INTO frame_embeddings (frame_id, session_id, embedding, model_version, created_at) +COPY INTO frame_geo_points (video_id, frame_id, position_wgs84, solution_grade, source_system, created_at) FROM ( - SELECT frame_id::STRING, - session_id::STRING, - embedding::VECTOR(4), -- Replace 4 with your actual embedding dimension - model_version::STRING, + SELECT video_id::STRING, + frame_id::STRING, + position_wgs84::GEOMETRY, + solution_grade::INT, + source_system::STRING, created_at::TIMESTAMP - FROM @drive_stage/embeddings/ + FROM @citydrive_stage/frame-locations/ ) FILE_FORMAT = (TYPE = 'PARQUET'); ``` -All downstream guides (analytics/search/vector/geo) now see this batch. 
+### `signal_contact_points` +```sql +COPY INTO signal_contact_points (node_id, signal_position, video_id, frame_id, frame_position, distance_m, created_at) +FROM ( + SELECT node_id::STRING, + signal_position::GEOMETRY, + video_id::STRING, + frame_id::STRING, + frame_position::GEOMETRY, + distance_m::DOUBLE, + created_at::TIMESTAMP + FROM @citydrive_stage/traffic-lights/ +) +FILE_FORMAT = (TYPE = 'PARQUET'); +``` + +After this step, every downstream workload—SQL analytics, Elasticsearch `QUERY()`, vector similarity, geospatial filters—reads the exact same data. --- -## 4. Stream (Optional) -If you want downstream jobs to react to new rows after each `COPY INTO`, create a stream on the key tables (for example `frame_events`). Stream usage follows the [Continuous Pipeline → Streams](/guides/load-data/continuous-data-pipelines/stream) guide. +## 4. Streams for Incremental Reactions (Optional) +Use streams when you want downstream jobs to consume only the rows added since the last batch. ```sql CREATE OR REPLACE STREAM frame_events_stream ON TABLE frame_events; -SELECT * FROM frame_events_stream; -- Shows new rows since the last consumption +SELECT * FROM frame_events_stream; -- shows newly copied rows +-- …process rows… +SELECT * FROM frame_events_stream WITH CONSUME; -- advance the offset ``` -After processing the stream, call `CONSUME STREAM frame_events_stream;` (or insert the rows into another table) to advance the offset. +`WITH CONSUME` ensures the stream cursor moves forward after the rows are handled. Reference: [Streams](/guides/load-data/continuous-data-pipelines/stream). --- -## 5. Task (Optional) -Tasks execute **one SQL statement** on a schedule. Create a small task for each table (or call a stored procedure if you prefer a single entry point). +## 5. Tasks for Scheduled Loads (Optional) +Tasks run **one SQL statement** on a schedule. Create lightweight tasks per table or wrap the logic in a stored procedure if you prefer one entry point. 
```sql -CREATE OR REPLACE TASK task_load_sessions +CREATE OR REPLACE TASK task_load_citydrive_videos WAREHOUSE = 'default' - SCHEDULE = 5 MINUTE + SCHEDULE = 10 MINUTE AS - COPY INTO drive_sessions (session_id, vehicle_id, route_name, start_time, end_time, weather, camera_setup) + COPY INTO citydrive_videos (video_id, vehicle_id, capture_date, route_name, weather, camera_source, duration_sec) FROM ( - SELECT session_id::STRING, + SELECT video_id::STRING, vehicle_id::STRING, + capture_date::DATE, route_name::STRING, - start_time::TIMESTAMP, - end_time::TIMESTAMP, weather::STRING, - camera_setup::STRING - FROM @drive_stage/sessions/ + camera_source::STRING, + duration_sec::INT + FROM @citydrive_stage/videos/ ) FILE_FORMAT = (TYPE = 'PARQUET'); -ALTER TASK task_load_sessions RESUME; +ALTER TASK task_load_citydrive_videos RESUME; CREATE OR REPLACE TASK task_load_frame_events WAREHOUSE = 'default' - SCHEDULE = 5 MINUTE -AS - COPY INTO frame_events (frame_id, session_id, frame_index, captured_at, event_type, risk_score) + SCHEDULE = 10 MINUTE + AS + COPY INTO frame_events (frame_id, video_id, frame_index, collected_at, event_tag, risk_score, speed_kmh) FROM ( SELECT frame_id::STRING, - session_id::STRING, + video_id::STRING, frame_index::INT, - captured_at::TIMESTAMP, - event_type::STRING, - risk_score::DOUBLE - FROM @drive_stage/frame-events/ + collected_at::TIMESTAMP, + event_tag::STRING, + risk_score::DOUBLE, + speed_kmh::DOUBLE + FROM @citydrive_stage/frame-events/ ) FILE_FORMAT = (TYPE = 'PARQUET'); ALTER TASK task_load_frame_events RESUME; - --- Repeat for frame_payloads and frame_embeddings ``` -See [Continuous Pipeline → Tasks](/guides/load-data/continuous-data-pipelines/task) for cron syntax, dependencies, and error handling. +Add more tasks for `frame_metadata_catalog`, embeddings, or GPS data using the same pattern. Full options: [Tasks](/guides/load-data/continuous-data-pipelines/task). 
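The "same pattern" for the remaining tables can be sketched as follows for the GPS data. The task name `task_load_frame_geo_points` is hypothetical; the `COPY INTO` body repeats the `frame_geo_points` projection from the COPY step of this guide, and the warehouse and schedule are copied from the two tasks in the diff rather than tuned for this table.

```sql
-- Sketch only: scheduled load for frame_geo_points, mirroring the existing tasks.
-- The task name is an illustrative choice, not part of the original change.
CREATE OR REPLACE TASK task_load_frame_geo_points
    WAREHOUSE = 'default'
    SCHEDULE = 10 MINUTE
AS
    COPY INTO frame_geo_points (video_id, frame_id, position_wgs84, solution_grade, source_system, created_at)
    FROM (
        SELECT video_id::STRING,
               frame_id::STRING,
               position_wgs84::GEOMETRY,
               solution_grade::INT,
               source_system::STRING,
               created_at::TIMESTAMP
        FROM @citydrive_stage/frame-locations/
    )
    FILE_FORMAT = (TYPE = 'PARQUET');

-- Tasks do not run until they are resumed.
ALTER TASK task_load_frame_geo_points RESUME;
```

Keeping one task per table matches the one-statement-per-task constraint described in this section. If a load needs to be paused during a backfill, `ALTER TASK task_load_frame_geo_points SUSPEND;` should stop the schedule without dropping the definition.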
+ +--- + +Once these jobs run, every guide in the Unified Workloads series reads from the same CityDrive tables—no extra ETL layers, no duplicate storage. diff --git a/docs/en/guides/54-query/index.md b/docs/en/guides/54-query/index.md index b9b28844b2..c53511ce66 100644 --- a/docs/en/guides/54-query/index.md +++ b/docs/en/guides/54-query/index.md @@ -2,14 +2,14 @@ title: Unified Workloads --- -Databend now serves as a unified engine for SQL analytics, multimodal search, vector similarity, geospatial analysis, and continuous ETL. This mini-series uses the **EverDrive Smart Vision** scenario (session IDs such as `SES-20240801-SEA01`, frame IDs such as `FRAME-0001`) to show how one dataset flows through every workload without copying data between systems. +CityDrive Intelligence records every dash-cam drive, splits it into frames, and stores multiple signals per `video_id`: relational metadata, JSON manifests, behaviour tags, embeddings, and GPS traces. This guide set shows how Databend keeps all those workloads in one warehouse—no copy jobs, no extra search cluster. 
| Guide | What it covers | |-------|----------------| -| [SQL Analytics](./00-sql-analytics.md) | Build shared tables, slice sessions, add window/aggregate speedups | -| [JSON & Search](./01-json-search.md) | Store detection payloads and `QUERY` risky scenes | -| [Vector Search](./02-vector-db.md) | Keep frame embeddings and find semantic neighbors | -| [Geo Analytics](./03-geo-analytics.md) | Map incidents with `HAVERSINE`, polygons, H3 | -| [Lakehouse ETL](./04-lakehouse-etl.md) | Stage files, `COPY INTO` tables, optional stream/task | +| [SQL Analytics](./00-sql-analytics.md) | Base tables, filters, joins, windows, aggregating indexes | +| [JSON & Search](./01-json-search.md) | Load `frame_metadata_catalog`, run Elasticsearch `QUERY()`, link bitmap tags | +| [Vector Search](./02-vector-db.md) | Persist embeddings, run cosine search, join risk metrics | +| [Geo Analytics](./03-geo-analytics.md) | Use `GEOMETRY`, distance/polygon filters, traffic-light joins | +| [Lakehouse ETL](./04-lakehouse-etl.md) | Stage once, `COPY INTO` shared tables, add streams/tasks | -Work through them in sequence to see how Databend’s single optimizer powers analytics, search, vector, geo, and loading pipelines on the same fleet data. +Walk through them in order to see how the same identifiers flow from classic SQL to text search, vector, geo, and ETL—everything grounded in a single CityDrive scenario.
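As a quick check that the shared identifiers line up across guides, a query along these lines joins the relational, event, and location tables on `video_id` and `frame_id`. It is a sketch that assumes the table definitions shown in this change set are loaded as written; the column selection and the `LIMIT` are illustrative.

```sql
-- Sketch only: follow one set of IDs across three of the shared tables.
SELECT v.video_id,
       v.route_name,
       f.frame_id,
       f.event_tag,
       f.risk_score,
       ST_ASTEXT(g.position_wgs84) AS position_wkt   -- NULL when no GPS fix was stored
FROM citydrive_videos AS v
JOIN frame_events AS f
  ON f.video_id = v.video_id
LEFT JOIN frame_geo_points AS g
  ON g.frame_id = f.frame_id
ORDER BY f.risk_score DESC
LIMIT 10;
```

The `LEFT JOIN` keeps frames that have an event row but no location row, which makes gaps between the exports visible instead of silently dropping them.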