Explicitly fail or alert users when batch generating segments with non-compliant data (unsorted with sortkey, partitioning) #6126

@lgo

While using the Spark segment generation job, we had a table config that set options such as a sortedColumn. We initially assumed the segment generation job would sort the data itself, and missed that the input needs to be sorted upstream.

This is documented in the sorted index section (https://docs.pinot.apache.org/basics/indexing/forward-index), which states:

"For offline push, input data needs to be sorted before running Pinot segment conversion and push job."
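
To make the requirement concrete, here is a rough sketch of the kind of pre-flight check that could catch unsorted input before segment generation. It is plain Python and not Pinot code; the function names `first_unsorted_index` and `check_sorted` are hypothetical, and it assumes rows arrive as dictionaries and that the sorted column is expected in ascending order.

```python
from typing import Any, Iterable, Mapping, Optional


def first_unsorted_index(
    rows: Iterable[Mapping[str, Any]], sort_column: str
) -> Optional[int]:
    """Return the index of the first row that breaks ascending order, or None."""
    previous = None
    for index, row in enumerate(rows):
        value = row[sort_column]
        # Equal neighbouring values are allowed; only a decrease is a violation.
        if index > 0 and value < previous:
            return index
        previous = value
    return None


def check_sorted(rows: Iterable[Mapping[str, Any]], sort_column: str) -> None:
    """Raise an actionable error if rows are not sorted by sort_column."""
    bad = first_unsorted_index(rows, sort_column)
    if bad is not None:
        raise ValueError(
            f"Input is not sorted by '{sort_column}': row {bad} is out of order. "
            "Sort the input upstream before running segment generation."
        )
```

Calling `check_sorted(rows, "memberId")` on each input file before the job runs would turn a silent misconfiguration into an immediate failure (the column name here is only an illustration).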

It may be helpful to add this information to batch ingestion pages, such as https://docs.pinot.apache.org/basics/data-import/batch-ingestion.

Additionally, it's unclear whether the same thing happens if users specify a partition scheme on a table but do not correctly partition the input data. (Searching the docs for "partition" turned up no mention of this.)
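
A similar check could be sketched for partitioning. This is plain Python, not Pinot code, and it rests on assumptions that are not confirmed here: that each input file is meant to hold rows from exactly one partition, and that the partition is computed as the column value modulo the number of partitions. The names `partitions_in_rows` and `check_single_partition` are hypothetical.

```python
from typing import Any, Iterable, Mapping, Set


def partitions_in_rows(
    rows: Iterable[Mapping[str, Any]], partition_column: str, num_partitions: int
) -> Set[int]:
    """Return the set of partition ids seen in rows (assumed modulo scheme)."""
    return {int(row[partition_column]) % num_partitions for row in rows}


def check_single_partition(
    rows: Iterable[Mapping[str, Any]],
    partition_column: str,
    num_partitions: int,
    source: str = "<input>",
) -> int:
    """Return the single partition id in rows, or raise an actionable error."""
    found = partitions_in_rows(rows, partition_column, num_partitions)
    if len(found) != 1:
        raise ValueError(
            f"{source}: expected rows from exactly one partition of "
            f"'{partition_column}', found partitions {sorted(found)}. "
            "Partition the input upstream to match the table config."
        )
    return next(iter(found))
```

Whether a failure or only a warning is appropriate here depends on how Pinot actually treats segments that span several partitions, which is the open question above.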

This is an easy thing to miss. Pre-processing jobs help a lot here (#4353), but it would be good to prevent the mistake in the first place with invariant checks and actionable errors.
