DruidSegmentReader should work if timestamp is specified as a dimension #9530
ccaominh merged 4 commits into apache:master
Conversation
Tests for compaction and re-indexing a datasource with the timestamp column
```java
/*
 * Timestamp is added last because we expect that the time column will always be a date time object.
 * If it is added earlier, it can be overwritten by metrics or dimensions with the same name.
 *
 * If a user names a metric or dimension `__time` it will be overwritten. This case should be rare since
```

Could you change this to single-line comments? We don't usually use multi-line comments.
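The ordering concern in the quoted comment can be illustrated with a minimal sketch (hypothetical names, not the actual DruidSegmentReader code): because `Map.put` overwrites earlier entries, putting the `__time` entry last guarantees it wins over any dimension or metric that happens to share its name.

```java
import java.util.HashMap;
import java.util.Map;

public class RowOrderSketch {
    // Hypothetical sketch: build the row map from the dimensions first,
    // then put the __time entry last so a dimension named "__time"
    // cannot overwrite the real timestamp.
    static Map<String, Object> buildRow(Map<String, Object> dims, long timestampMillis) {
        Map<String, Object> row = new HashMap<>(dims);
        row.put("__time", timestampMillis); // added last: wins over any same-named column
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> dims = new HashMap<>();
        dims.put("__time", "not-a-time"); // user mistake: dimension named __time
        dims.put("page", "Main_Page");
        Map<String, Object> row = buildRow(dims, 1585000000000L);
        System.out.println(row.get("__time")); // prints 1585000000000
    }
}
```

If the timestamp were put first instead, the `"not-a-time"` string would survive into the row, which is exactly the overwriting case the comment warns about.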
Hmm, I think this overwriting should never happen, but it could happen for some reason in practice, e.g., a user mistake. How about logging a warning if there are duplicate column names? The docs could say that warning messages may be printed if there are duplicates.

I'm worried about log explosion, since this is done per row. I'd have to add explicit checking outside of this next block. Maybe in the constructor? Would that be visible enough in the logs?

Ah, good point. Now I think we need some schema validation for ingestion, which could probably be done in DataSchema. But this would be a larger issue than the bug this PR fixes, and I'm ok with adding it later.
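The validate-once idea discussed above (check in a constructor rather than per row, to avoid log explosion) could look roughly like this sketch. The method name and the reserved-name set are hypothetical, not Druid API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SchemaCheckSketch {
    // Hypothetical sketch: scan the declared column names once and report
    // any that collide with reserved names such as __time, so a single
    // warning can be logged at construction time instead of per row.
    static List<String> findReservedCollisions(List<String> columnNames) {
        Set<String> reserved = new HashSet<>(Arrays.asList("__time"));
        List<String> collisions = new ArrayList<>();
        for (String name : columnNames) {
            if (reserved.contains(name)) {
                collisions.add(name);
            }
        }
        return collisions;
    }

    public static void main(String[] args) {
        List<String> cols = Arrays.asList("page", "__time", "country");
        System.out.println(findReservedCollisions(cols)); // prints [__time]
    }
}
```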
```java
private static final String INDEX_DATASOURCE = "wikipedia_index_test";
private static final String INDEX_WITH_TIMESTAMP_TASK = "/indexer/wikipedia_with_timestamp_index_task.json";
// TODO: add queries that validate timestamp is different from the __time column since it is a dimension
```
Would you open an issue for this instead of TODO?
…on (apache#9530)
* DruidSegmentReader should work if timestamp is specified as a dimension
* Add integration tests: tests for compaction and re-indexing a datasource with the timestamp column
* Instructions to run integration tests against quickstart
* address pr
DruidSegmentReader should work if timestamp is specified as a dimension (apache#9530)
DruidInputSource does not support a dimension or metric having the name "timestamp"; however, this was supported by the ingestSegment firehose, which was deprecated in favor of the DruidInputSource. If you try, you will see an exception like:

org.apache.druid.indexing.common.task.IndexTask - Encountered exception in BUILD_SEGMENTS. java.lang.ClassCastException: java.lang.String cannot be cast to org.joda.time.DateTime

This change makes it so that you can re-index and compact datasources when a column is explicitly called timestamp, and adds integration tests for them.
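The failure mode behind that exception can be reproduced in miniature. This is a hypothetical sketch, using `Long` in place of `org.joda.time.DateTime` (which requires an external jar) so it runs with the JDK alone: the reader stores the raw String value of a column named "timestamp", and downstream code that casts the time-column value to its expected type fails at runtime.

```java
public class CastFailureSketch {
    public static void main(String[] args) {
        // A dimension named "timestamp" was read as a plain String...
        Object value = "2013-08-31T01:02:33Z";
        try {
            // ...but code expecting the time column's type casts it unchecked,
            // mirroring the String -> org.joda.time.DateTime cast in the bug.
            Long millis = (Long) value;
            System.out.println(millis);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException caught");
        }
    }
}
```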
This PR has: