DruidSegmentReader should work if timestamp is specified as a dimension#9530

Merged
ccaominh merged 4 commits into apache:master from suneet-s:timestamp-dim
Mar 25, 2020
Conversation

@suneet-s (Contributor) commented Mar 17, 2020

DruidInputSource does not support a dimension or metric named "timestamp", but this was supported by the ingestSegment firehose, which was deprecated in favor of the DruidInputSource. If you try, you will see an exception like:

org.apache.druid.indexing.common.task.IndexTask - Encountered exception in BUILD_SEGMENTS. java.lang.ClassCastException: java.lang.String cannot be cast to org.joda.time.DateTime

This change makes it possible to re-index and compact datasources when a column is explicitly named "timestamp", and adds integration tests for both cases.
This PR has:

  • been self-reviewed.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.
  • added integration tests.
  • been tested in a test Druid cluster.
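To illustrate the failure mode described above, here is a standalone sketch (not Druid's actual code; `java.util.Date` stands in for Joda-Time's `DateTime`): when a segment contains a dimension literally named "timestamp", its values are plain Strings, so blindly casting the looked-up value to a date-time type throws the ClassCastException shown in the task log.

```java
import java.util.HashMap;
import java.util.Map;

public class TimestampCastDemo {
    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        // A segment being re-indexed has a *dimension* literally named
        // "timestamp", whose values are plain Strings.
        row.put("timestamp", "2013-08-31T01:02:33Z");

        try {
            // Treating that dimension's value as the row's time column fails,
            // analogous to the ClassCastException in the task log above.
            java.util.Date ts = (java.util.Date) row.get("timestamp");
            System.out.println(ts);
        } catch (ClassCastException e) {
            System.out.println("Caught: " + e);
        }
    }
}
```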

Tests for compaction and re-indexing a datasource with the timestamp column

/*
 * Timestamp is added last because we expect that the time column will always be a date time object.
 * If it is added earlier, it can be overwritten by metrics or dimensions with the same name.
 *
 * If a user names a metric or dimension `__time` it will be overwritten. This case should be rare since

jihoonson (Contributor) commented:

Could you change this to single-line comments? We don't usually use multi-line comments.

suneet-s (Contributor, Author) replied:

Done

jihoonson (Contributor) commented:

Hmm, I think this overwriting should never happen, but it could happen for some reason in practice, e.g., a user mistake. How about logging a warning if there are duplicate column names? The docs could say that warning messages may be printed if there are duplicates.

suneet-s (Contributor, Author) replied:

I'm worried about log explosion, since this is done per row. I'd have to add explicit checking outside of this next block. Maybe in the constructor? Would that be visible enough in the logs?

jihoonson (Contributor) replied:

Ah, good point. Now I think we need some schema validation for ingestion, which could probably be done in DataSchema. But that would be a larger issue than the bug this PR fixes, and I'm OK with adding it later.
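The ordering the comment describes can be sketched as follows, using hypothetical names (an illustration of the technique, not the actual Druid code): dimensions and metrics are copied into the row first, and the time column value is put last, so a duplicate column name cannot clobber the real timestamp.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RowMergeOrder {
    // Hypothetical helper: builds the flattened row used for re-indexing.
    // The time column is put last so it always wins over any dimension or
    // metric that happens to share its name.
    static Map<String, Object> buildRow(Map<String, Object> dimensionsAndMetrics,
                                        String timeColumn,
                                        long timestampMillis) {
        Map<String, Object> row = new LinkedHashMap<>(dimensionsAndMetrics);
        row.put(timeColumn, timestampMillis); // added last: overwrites any duplicate
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> dims = new LinkedHashMap<>();
        dims.put("timestamp", "2013-08-31T01:02:33Z"); // a String dimension named "timestamp"
        dims.put("__time", "oops");                    // pathological duplicate of the time column

        Map<String, Object> row = buildRow(dims, "__time", 1377910953000L);

        System.out.println(row.get("__time"));    // 1377910953000
        System.out.println(row.get("timestamp")); // 2013-08-31T01:02:33Z
    }
}
```

Had the time column been put first, the `"oops"` String would have replaced it, which is exactly the overwriting case the comment warns about.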

private static final String INDEX_DATASOURCE = "wikipedia_index_test";

private static final String INDEX_WITH_TIMESTAMP_TASK = "/indexer/wikipedia_with_timestamp_index_task.json";
// TODO: add queries that validate timestamp is different from the __time column since it is a dimension
jihoonson (Contributor) commented:

Would you open an issue for this instead of a TODO?

suneet-s (Contributor, Author) replied:

Done

jihoonson (Contributor) left a review:

+1 after CI.

@ccaominh ccaominh merged commit 55c08e0 into apache:master Mar 25, 2020
suneet-s added a commit to suneet-s/druid that referenced this pull request Mar 25, 2020
DruidSegmentReader should work if timestamp is specified as a dimension (apache#9530)

* DruidSegmentReader should work if timestamp is specified as a dimension

* Add integration tests

Tests for compaction and re-indexing a datasource with the timestamp column

* Instructions to run integration tests against quickstart

* address pr
suneet-s added a commit to suneet-s/druid that referenced this pull request Mar 25, 2020
DruidSegmentReader should work if timestamp is specified as a dimension (apache#9530)

* DruidSegmentReader should work if timestamp is specified as a dimension

* Add integration tests

Tests for compaction and re-indexing a datasource with the timestamp column

* Instructions to run integration tests against quickstart

* address pr
@jihoonson jihoonson added this to the 0.18.0 milestone Mar 25, 2020
@jihoonson jihoonson added the Bug label Mar 25, 2020
jihoonson pushed a commit that referenced this pull request Mar 26, 2020
DruidSegmentReader should work if timestamp is specified as a dimension (#9530) (#9566)

* DruidSegmentReader should work if timestamp is specified as a dimension

* Add integration tests

Tests for compaction and re-indexing a datasource with the timestamp column

* Instructions to run integration tests against quickstart

* address pr
@suneet-s suneet-s deleted the timestamp-dim branch March 26, 2020 00:17
jihoonson added a commit to implydata/druid-public that referenced this pull request Mar 26, 2020
DruidSegmentReader should work if timestamp is specified as a dimension (apache#9530)