Skip to content

In "Kafka-Indexing- Service "real-time ingestion task string dimension field aggregate statistics problem #10168

@DockerLive

Description

@DockerLive

Affected Version

Druid version : 0.18.0

Description

  • Use the Kafka-Indexing service to ingest real-time data
  • The data Schema has a dimension field of type string;The real value, of course, is a number, and is typed as a string to make it easier to use the Bitmap index
  • I had a problem converting this string type dimension field to a number type in an SQL query for aggregate statistics.The statistics occasionally show a value of 0
  • There is no problem with corresponding statistics when the data is solidified
  • When I tried to make this field into a numerical metric, I had no problem with statistics
  • This issue did not occur in the previous version of the upgrade (0.15.0)

Kafka-Indexing json:

"dataSchema": {
"dataSource": "cell_monitor_v3_test",
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "HOUR",
"queryGranularity": "MINUTE",
"rollup": true
},
"metricsSpec": [
{
"type": "longSum",
"name": "cell_count",
"fieldName": "cell_count",
"expression": null
},
{
"type": "longSum",
"name": "sum_status_continue_time",
"fieldName": "sum_status_continue_time",
"expression": null
}
],
"parser": {
"type": "string",
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "snapshot_time",
"format": "millis"
},
"dimensionsSpec": {
"dimensions": [
"space_status",
"reserve_status",
"dis_code",
"business_type",
"status_continue_time",
"has_exception"
]
}
}
},
"transformSpec": {
"transforms": [
{ "type": "expression", "name": "sum_status_continue_time", "expression": "nvl(status_continue_time,0)" }
]
}
}

As a string dimension, the query SQL is shown below, with a data statistic of 0 in the ingestion Task interval :

SELECT
__time,
SUM(cell_count),
round((SUM(CAST(status_continue_time AS NUMERIC))/2)/sum(cell_count),2),
SUM(CAST(status_continue_time AS NUMERIC))/2
FROM "cell_monitor_v3_test"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '10' MINUTE
AND space_status = 2
AND business_type > 0
AND has_exception = 0
GROUP BY __time

As a numerical indicator, the query SQL is shown below, and the data statistics are correct in the ingestion Task interval :
SELECT
__time,
SUM(cell_count),
round((SUM(sum_status_continue_time)/2)/sum(cell_count),2),
SUM(sum_status_continue_time)/2
FROM "cell_monitor_v3_test"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '10' MINUTE
AND space_status = 2
AND business_type > 0
AND has_exception = 0
GROUP BY __time

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions