Skip to content

fix: dataset QA follow-ups (nil tags, counts, provider match)#909

Merged
Atharva-Kanherkar merged 1 commit into
mainfrom
fix/dataset-qa-followups
Jun 1, 2026
Merged

fix: dataset QA follow-ups (nil tags, counts, provider match)#909
Atharva-Kanherkar merged 1 commit into
mainfrom
fix/dataset-qa-followups

Conversation

@Atharva-Kanherkar
Copy link
Copy Markdown
Collaborator

Summary

  • Default nil dataset example tags to [] on upsert so dataset example add without --tag no longer returns HTTP 500.
  • Include active_example_count and version_count on GET /datasets/{id} (same aggregation as list).
  • Reject synthetic generation jobs when the provider account provider/key does not match the model alias (returns 400 instead of silent rejections at runtime).

Test plan

  • go test -short -race -count=1 ./internal/api ./internal/repository
  • Merge and redeploy API/worker
  • Re-run production CLI QA: dataset example add without tags, dataset view counts, mismatched dataset generate → 400

Default nil example tags to an empty array so manual/import adds no longer 500,
include active example and version counts on dataset get, and reject generation
jobs when the provider account does not match the selected model alias.
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 1, 2026

Greptile Summary

This PR fixes three dataset QA regressions: nil tags on example upsert causing HTTP 500, missing active_example_count/version_count on the single-dataset GET endpoint, and silent acceptance of generation jobs where the provider account's key doesn't match the model alias's catalog provider key.

  • Nil tags fix: datasetExampleTags normalises a nil Tags slice to []string{} before calling InsertDatasetExample/UpdateDatasetExample, preventing a DB driver panic on upsert without --tag.
  • Count aggregation on GET: GetDatasetByID is rewritten to LEFT JOIN dataset_examples and dataset_versions with the same COUNT(DISTINCT …) aggregation already used by the list query; a shared mapDatasetDetailRow mapper eliminates duplication.
  • Provider mismatch validation: Two new guards in StartDatasetGeneration reject mismatched ProviderKey/CatalogProviderKey and pinned ProviderAccountID, but handleDatasetGenerationError was not updated to map these errors to HTTP 400 — they currently fall through to the default 500 handler.

Confidence Score: 3/5

The nil-tags fix and count-aggregation changes are safe to merge; the provider mismatch validation is incomplete because the HTTP handler was not updated.

The provider mismatch validation added in StartDatasetGeneration will return HTTP 500 on mismatch instead of the intended 400, because handleDatasetGenerationError only string-matches a narrow hardcoded list of validation messages and the two new ones are absent. The manager-layer test passes, but the actual API surface will behave incorrectly for callers hitting a mismatch.

backend/internal/api/datasets_generations.go — the handleDatasetGenerationError switch needs cases for the two new provider mismatch error messages (or, better, sentinel errors).

Important Files Changed

Filename Overview
backend/internal/api/datasets_generations.go Adds provider-key and provider-account-ID validation before creating a generation job, but the HTTP error handler is not updated to map the new errors to 400 — they will return 500 in production.
backend/internal/api/datasets_generations_test.go Adds a test for provider mismatch and fixes the happy-path test to include matching provider keys; tests the manager layer only, not the HTTP handler status code.
backend/internal/repository/datasets.go Adds nil→empty-slice normalization for tags to fix the HTTP 500 on tagless upserts, and refactors list/get-by-id mappers to share a common detail-row mapping function.
backend/db/queries/datasets.sql Rewrites GetDatasetByID to LEFT JOIN dataset_examples and dataset_versions and aggregate counts, matching the list-query pattern.
backend/internal/repository/sqlc/datasets.sql.go SQLC-generated file updated to return a new GetDatasetByIDRow struct with ActiveExampleCount and VersionCount fields.
backend/internal/repository/sqlc/querier.go Updates the Querier interface to reflect the new GetDatasetByID return type.

Sequence Diagram

sequenceDiagram
    participant CLI
    participant Handler as startDatasetGenerationHandler
    participant Manager as DatasetManager.StartDatasetGeneration
    participant DB

    CLI->>Handler: "POST /datasets/{id}/generate"
    Handler->>Manager: StartDatasetGeneration(input)
    Manager->>DB: GetProviderAccountByID
    DB-->>Manager: providerAccount
    Manager->>DB: GetModelAliasByID
    DB-->>Manager: modelAlias
    Manager->>Manager: check providerAccount.WorkspaceID
    Manager->>Manager: check EqualFold(providerAccount.ProviderKey, modelAlias.CatalogProviderKey)
    alt provider key mismatch
        Manager-->>Handler: errors.New("provider_account provider does not match...")
        Handler->>Handler: handleDatasetGenerationError (no case matches)
        Handler-->>CLI: HTTP 500 (intended: 400)
    else provider account ID mismatch
        Manager-->>Handler: errors.New("provider_account_id does not match...")
        Handler->>Handler: handleDatasetGenerationError (no case matches)
        Handler-->>CLI: HTTP 500 (intended: 400)
    else all checks pass
        Manager->>DB: CreateDatasetGenerationJob
        DB-->>Manager: job
        Manager-->>Handler: job
        Handler-->>CLI: HTTP 202 Accepted
    end
Loading

Comments Outside Diff (1)

  1. backend/internal/api/datasets_generations.go, line 257-258 (link)

    P1 Provider mismatch errors fall through to HTTP 500

    The two new errors.New(...) returns added in this PR — "provider_account provider does not match model alias provider" and "provider_account_id does not match model alias provider account" — are not enumerated in handleDatasetGenerationError. That switch's string-comparison case covers only "target_count must be between 1 and 100" and "dataset must have at least one active seed example", so both new messages hit the default branch, which calls handleDatasetError, which logs and returns HTTP 500. The PR description says these should return 400, and the test validates the manager layer but not the HTTP status code.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: backend/internal/api/datasets_generations.go
    Line: 257-258
    
    Comment:
    **Provider mismatch errors fall through to HTTP 500**
    
    The two new `errors.New(...)` returns added in this PR — `"provider_account provider does not match model alias provider"` and `"provider_account_id does not match model alias provider account"` — are not enumerated in `handleDatasetGenerationError`. That switch's string-comparison case covers only `"target_count must be between 1 and 100"` and `"dataset must have at least one active seed example"`, so both new messages hit the `default` branch, which calls `handleDatasetError`, which logs and returns HTTP 500. The PR description says these should return 400, and the test validates the manager layer but not the HTTP status code.
    
    How can I resolve this? If you propose a fix, please make it concise.

    Fix in Codex

Fix All in Codex

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
backend/internal/api/datasets_generations.go:257-258
**Provider mismatch errors fall through to HTTP 500**

The two new `errors.New(...)` returns added in this PR — `"provider_account provider does not match model alias provider"` and `"provider_account_id does not match model alias provider account"` — are not enumerated in `handleDatasetGenerationError`. That switch's string-comparison case covers only `"target_count must be between 1 and 100"` and `"dataset must have at least one active seed example"`, so both new messages hit the `default` branch, which calls `handleDatasetError`, which logs and returns HTTP 500. The PR description says these should return 400, and the test validates the manager layer but not the HTTP status code.

Reviews (1): Last reviewed commit: "fix: harden dataset example upserts and ..." | Re-trigger Greptile

@Atharva-Kanherkar Atharva-Kanherkar merged commit 86ec2ab into main Jun 1, 2026
6 checks passed
@Atharva-Kanherkar Atharva-Kanherkar deleted the fix/dataset-qa-followups branch June 1, 2026 16:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant