
BigQuery driver: CubeStore fails importing pre-aggregations with many export files (concurrent signed URL downloads) #10703

@AnveshJarabani

Description


Bug Description

When using BigQuery with a GCS export bucket (CUBEJS_DB_EXPORT_BUCKET), the BigQueryDriver.unload() method generates signed URLs for every exported CSV file and passes them all to CubeStore via CREATE TABLE ... LOCATION. CubeStore then attempts to download all files concurrently using FuturesUnordered in estimate_location_row_count, which overwhelms the HTTP connection pool and fails.

For large tables (e.g. 48M rows), BigQuery's EXPORT DATA produces ~640 gzipped CSV files. CubeStore fires 640 concurrent HTTPS requests to GCS, and the reqwest client starts failing with connection errors once roughly 20-100 connections are in flight.

Error

Internal: error sending request for url (https://storage.googleapis.com/bucket/file.csv.gz?GoogleAccessId=...&Expires=...&Signature=...)

Originating from:

cubestore::CubeError as core::convert::From<reqwest::error::Error>::from

Environment

  • Cube.js: v1.6.36
  • CubeStore: v1.6
  • Database: BigQuery
  • Export bucket: GCS
  • Docker Compose deployment (single-node CubeStore)
  • Table size: 48M rows → 640 CSV gzip files (~6MB each, ~4GB total)

Steps to Reproduce

  1. Configure BigQuery with GCS export bucket
  2. Define an originalSql pre-aggregation with external: true on a large table (>10M rows)
  3. Trigger pre-aggregation build
  4. CubeStore fails during estimate_location_row_count — 640 concurrent signed URL HEAD requests overwhelm the connection pool
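
For reference, step 2 might look like the following hypothetical cube definition (the cube and table names are placeholders, not from the report):

```javascript
// Hypothetical cube for step 2: an originalSql pre-aggregation built
// externally in CubeStore, which triggers the export-bucket path.
cube('LargeEvents', {
  sql: `SELECT * FROM analytics.events`, // placeholder source table, >10M rows

  preAggregations: {
    main: {
      type: 'originalSql',
      external: true,
    },
  },
});
```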

Root Cause Analysis

BigQuery driver (BigQueryDriver.js, unload() method):

  • Exports table to GCS via createExtractJob — produces N files (BigQuery controls the shard count, ~1GB uncompressed per file)
  • Lists all files and generates signed URLs for each
  • Returns all signed URLs in csvFile array

CubeStore (import/mod.rs, estimate_location_row_count):

  • Receives all URLs in CREATE TABLE ... LOCATION url1, url2, ..., urlN
  • Fires all HEAD requests concurrently via FuturesUnordered
  • The reqwest connection pool begins failing somewhere between 20 and 100 concurrent connections under Docker networking

Individual connections work fine (verified with curl and openssl from inside the container). The issue is purely concurrent connection exhaustion.

Suggested Fixes

Option A: CubeStore — limit concurrency in estimate_location_row_count

Use a semaphore or buffer_unordered(N) instead of unbounded FuturesUnordered when issuing HEAD requests. A concurrency limit of 20-50 would avoid connection exhaustion while still being fast.
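
The actual fix would be in CubeStore's Rust code (e.g. swapping unbounded FuturesUnordered for futures' buffer_unordered(N)), but the bounded-concurrency pattern can be sketched in JavaScript. mapWithConcurrency below is a hypothetical helper, not part of any Cube API, that caps how many promises run at once:

```javascript
// Run fn over items with at most `limit` tasks in flight at once,
// the same effect as futures::stream::buffer_unordered(limit) in Rust.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0; // shared cursor; safe because JS is single-threaded
  async function worker() {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, () => worker())
  );
  return results;
}

// Hypothetical use: HEAD 640 signed URLs, but only 20 at a time.
// const sizes = await mapWithConcurrency(signedUrls, 20, headContentLength);
```

With a limit of 20-50, all 640 HEAD requests still complete quickly, but the connection pool never sees more than `limit` sockets open at once.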

Option B: BigQuery driver — use temp:// upload path

Instead of returning signed URLs, download the GCS files in the Cube API container and upload them to CubeStore's /upload-temp-file HTTP endpoint with controlled concurrency (e.g. 10 at a time), then return temp://filename URIs. This bypasses the signed URL path entirely.
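
A minimal sketch of that driver-side flow, with hypothetical helper names: uploadInBatches pushes files to CubeStore in groups of 10, and toTempUris builds the temp:// URIs the driver would return in csvFile. The actual GCS download and the POST to /upload-temp-file are left as a placeholder callback:

```javascript
const UPLOAD_CONCURRENCY = 10; // at most 10 uploads in flight

// CubeStore resolves temp://<name> against files previously POSTed to
// its /upload-temp-file?name=<name> endpoint.
function toTempUris(fileNames) {
  return fileNames.map((n) => `temp://${n}`);
}

// uploadOne(name) is a placeholder that would download the object from
// GCS (using SA credentials, no signed URL) and POST its bytes to
// CubeStore's temp-file endpoint.
async function uploadInBatches(fileNames, uploadOne) {
  for (let i = 0; i < fileNames.length; i += UPLOAD_CONCURRENCY) {
    const batch = fileNames.slice(i, i + UPLOAD_CONCURRENCY);
    await Promise.all(batch.map((n) => uploadOne(n))); // one batch at a time
  }
  return toTempUris(fileNames);
}
```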

Current Workaround

We override unload() via driverFactory in cube.js to download GCS files locally and upload them to CubeStore's temp file API with a concurrency of 10:

driverFactory: ({ dataSource } = {}) => {
  const { BigQueryDriver } = require('@cubejs-backend/bigquery-driver');
  const driver = new BigQueryDriver({});
  
  const origUnload = driver.unload.bind(driver);
  driver.unload = async function (table, options) {
    // ... export to GCS same as original ...
    // Download each file from GCS using SA credentials (no signed URL)
    // Upload to CubeStore via POST /upload-temp-file?name=<name>
    // Return { csvFile: tempNames.map(n => `temp://${n}`) }
  };
  
  return driver;
},

This works but shouldn't be necessary — the driver or CubeStore should handle large file counts gracefully.
