Getting numGroupsLimitReached for medium cardinality column (10000 distinct values) #6437

@amitchopraait

Description

I have a column with 10000 distinct values.

select count(distinct(device)) from metrics
{
	"resultTable": {
		"dataSchema": {
			"columnNames": [
				"distinctcount(device)"
			],
			"columnDataTypes": [
				"INT"
			]
		},
		"rows": [
			[
				10000
			]
		]
	},
	"exceptions": [],
	"numServersQueried": 1,
	"numServersResponded": 1,
	"numSegmentsQueried": 1,
	"numSegmentsProcessed": 1,
	"numSegmentsMatched": 1,
	"numConsumingSegmentsQueried": 0,
	"numDocsScanned": 3999999,
	"numEntriesScannedInFilter": 0,
	"numEntriesScannedPostFilter": 0,
	"numGroupsLimitReached": false,
	"totalDocs": 3999999,
	"timeUsedMs": 21,
	"segmentStatistics": [],
	"traceInfo": {},
	"minConsumingFreshnessTimeMs": 0
}

When I do a group by on this column, I get numGroupsLimitReached = true in the response, even though the documentation states the default limit is 100k.

select device, count(device) as aggreg from metrics group by device order by aggreg desc limit 10

{
	"resultTable": {
		"dataSchema": {
			"columnNames": [
				"device",
				"aggreg"
			],
			"columnDataTypes": [
				"STRING",
				"LONG"
			]
		},
		"rows": [
			[
				"device-6230",
				475
			],
			[
				"device-3277",
				470
			],
			[
				"device-2311",
				469
			],
			[
				"device-3933",
				469
			],
			[
				"device-4059",
				468
			],
			[
				"device-6002",
				468
			],
			[
				"device-621",
				466
			],
			[
				"device-2903",
				465
			],
			[
				"device-3900",
				463
			],
			[
				"device-9324",
				463
			]
		]
	},
	"exceptions": [],
	"numServersQueried": 1,
	"numServersResponded": 1,
	"numSegmentsQueried": 1,
	"numSegmentsProcessed": 1,
	"numSegmentsMatched": 1,
	"numConsumingSegmentsQueried": 0,
	"numDocsScanned": 3999999,
	"numEntriesScannedInFilter": 0,
	"numEntriesScannedPostFilter": 3999999,
	"numGroupsLimitReached": true,
	"totalDocs": 3999999,
	"timeUsedMs": 87,
	"segmentStatistics": [],
	"traceInfo": {},
	"minConsumingFreshnessTimeMs": 0
}
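For context, numGroupsLimitReached signals that the server stopped creating new groups once a configured cap on the number of group-by keys was hit. A minimal sketch of that trimming behavior (a hypothetical illustration, not Pinot's actual implementation) shows how the flag can end up true even though the final result only needs a handful of rows:

```python
# Hypothetical sketch of a capped group-by aggregator: once the number
# of distinct groups reaches the cap, rows that would create a *new*
# group are dropped and a "limit reached" flag is set.
from collections import Counter

NUM_GROUPS_LIMIT = 5  # Pinot's documented default is 100_000


def group_by_count(keys, limit=NUM_GROUPS_LIMIT):
    groups = Counter()
    limit_reached = False
    for key in keys:
        if key in groups:
            groups[key] += 1          # existing group: always counted
        elif len(groups) < limit:
            groups[key] = 1           # new group under the cap
        else:
            limit_reached = True      # new group past the cap: dropped
    return groups, limit_reached


rows = [f"device-{i % 8}" for i in range(100)]  # 8 distinct keys, cap of 5
counts, reached = group_by_count(rows)
print(len(counts), reached)  # 5 True
```

If the effective cap on a server were lower than the documented 100k default (or unset, as the Slack thread below suggests for SQL mode at the time), a column with only 10,000 distinct values could still trip the flag.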

As per the conversation in Slack:

Do you have it configured explicitly? The config key is pinot.server.query.executor.num.groups.limit

Amit Chopra  23 minutes ago
@Jackie - I believe you mean in pinot-server.conf? No, I haven't set it.

Jackie  21 minutes ago
Hmm.. That is unexpected

Jackie  20 minutes ago
Do you run the query in PQL mode or SQL mode?

Amit Chopra  20 minutes ago
SQL mode

Jackie  16 minutes ago
I just checked the code and we don't set it in SQL mode..

Jackie  15 minutes ago
Could you please file a github issue and put the details?

Amit Chopra  14 minutes ago
Sure. Let me file an issue. Just so that I understand, you mean the 100k limit is not set? But what is the default limit in SQL mode today, then?

Jackie  13 minutes ago
We don't put the numGroupsLimitReached in SQL mode. I don't know how it shows up in the response

Amit Chopra  12 minutes ago
got it

Amit Chopra  11 minutes ago
As per the second part of the question, given there was a limit of 10 on the query, shouldn't this be handled by the engine (even if it was a column with more than 100k distinct values)?
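On that second question: because the query orders by the aggregate (ORDER BY aggreg DESC), the top 10 can only be selected after every group's count is known, so the LIMIT cannot by itself reduce the number of groups the server has to track. A small sketch of that ordering of operations:

```python
# Sketch: ORDER BY <aggregate> ... LIMIT N is applied *after* aggregation.
# Every distinct group must be fully counted before the top N is known,
# so a small LIMIT does not shrink the group-by state on the server.
from collections import Counter

rows = ["a", "b", "a", "c", "b", "a"]
counts = Counter(rows)        # full aggregation over all groups first
top = counts.most_common(2)   # the LIMIT is applied last
print(top)  # [('a', 3), ('b', 2)]
```

Engines can still trim per-segment intermediate results heuristically, but an exact top-N on an aggregate fundamentally needs all group counts before the final cut.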
