While using Spark segment generation, we had a table configuration that specified a sortedColumn. We initially assumed the data would be sorted by the segment generation job, and missed that it needed to be sorted upstream.
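For context, the relevant setting is the sortedColumn entry in the table's index config; a minimal sketch is below (the column name memberId is a placeholder, not from the original report):

```json
{
  "tableIndexConfig": {
    "sortedColumn": ["memberId"]
  }
}
```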
This is documented in the sort index section of the forward index docs (https://docs.pinot.apache.org/basics/indexing/forward-index):
For offline push, the input data needs to be sorted before running the Pinot segment conversion and push job.
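To illustrate what "sorted upstream" means in practice, the input rows would need to be ordered on the configured column before the segment job ever sees them. A minimal plain-Python sketch of that pre-sort step (the rows and the column name memberId are hypothetical; in a real Spark job this would be done on the DataFrame instead):

```python
def presort_input(records, sorted_column):
    """Sort input records by the table's configured sortedColumn
    before handing them to the segment conversion and push job."""
    return sorted(records, key=lambda r: r[sorted_column])

# Hypothetical input rows; "memberId" stands in for the sortedColumn.
rows = [{"memberId": 3}, {"memberId": 1}, {"memberId": 2}]
sorted_rows = presort_input(rows, "memberId")
# sorted_rows is ordered by memberId: 1, 2, 3
```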
It may be helpful to add this information to the batch ingestion pages as well, such as https://docs.pinot.apache.org/basics/data-import/batch-ingestion.
Additionally, it's unclear whether the same problem occurs if users specify a partition scheme on a table but do not correctly partition the input data. (Searching for "partition" in the docs yielded no mention of this.)
This is an easy thing to miss, and while pre-processing jobs especially help (#4353), it would be better to prevent the mistake in the first place with invariants and actionable errors.
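One possible shape for such an invariant, sketched in Python (the function and exception names are hypothetical, not existing Pinot APIs): before generating a segment, verify the input is actually sorted on the configured column, and fail with an error that tells the user what to fix rather than silently producing a mis-sorted segment.

```python
class UnsortedInputError(ValueError):
    """Raised when input violates the sortedColumn invariant."""


def check_sorted(values, column_name):
    """Raise an actionable error if `values` is not non-decreasing."""
    for i in range(1, len(values)):
        if values[i] < values[i - 1]:
            raise UnsortedInputError(
                f"Input is not sorted on configured sortedColumn "
                f"'{column_name}' (row {i}: {values[i]!r} < {values[i - 1]!r}). "
                f"Sort the input data upstream before running the segment "
                f"conversion and push job."
            )


check_sorted([1, 2, 2, 5], "memberId")   # sorted input passes
# check_sorted([1, 3, 2], "memberId")    # raises UnsortedInputError
```

A similar check could cover the partitioning case: verify each input split only contains rows that hash to its expected partition, and report the offending rows otherwise.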