Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove "granularity" from IngestSegmentFirehose. #4110

Merged
merged 1 commit into from
Mar 24, 2017
Merged

Conversation

gianm
Copy link
Contributor

@gianm gianm commented Mar 24, 2017

It wasn't doing anything useful (the sequences were being concatted, and
cursor.getTime() wasn't being called) and it defaulted to Granularities.NONE.
Changing it to Granularities.ALL gave me a 700x+ performance boost on a
small dataset I was reindexing (2m27s to 365ms). Most of that was from avoiding
making a lot of unnecessary column selectors.

It wasn't doing anything useful (the sequences were being concatted, and
cursor.getTime() wasn't being called) and it defaulted to Granularities.NONE.
Changing it to Granularities.ALL gave me a 700x+ performance boost on a
small dataset I was reindexing (2m27s to 365ms). Most of that was from avoiding
making a lot of unnecessary column selectors.
@gianm gianm added this to the 0.10.1 milestone Mar 24, 2017
@@ -77,7 +76,7 @@ public IngestSegmentFirehose(
Filters.toFilter(dimFilter),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe could be further simplified by not calling concat() a few lines above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How else would this be turned into a Sequence<InputRow> rather than Sequence<Sequence<InputRow>>?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, seems there is no way.

@fjy fjy merged commit b4289c0 into apache:master Mar 24, 2017
gianm added a commit to implydata/druid-public that referenced this pull request May 6, 2017
It wasn't doing anything useful (the sequences were being concatted, and
cursor.getTime() wasn't being called) and it defaulted to Granularities.NONE.
Changing it to Granularities.ALL gave me a 700x+ performance boost on a
small dataset I was reindexing (2m27s to 365ms). Most of that was from avoiding
making a lot of unnecessary column selectors.
gianm added a commit to implydata/druid-public that referenced this pull request May 16, 2017
It wasn't doing anything useful (the sequences were being concatted, and
cursor.getTime() wasn't being called) and it defaulted to Granularities.NONE.
Changing it to Granularities.ALL gave me a 700x+ performance boost on a
small dataset I was reindexing (2m27s to 365ms). Most of that was from avoiding
making a lot of unnecessary column selectors.
@himanshug
Copy link
Contributor

@gianm it appears that this breaks re-indexing which expects IngestSegmentFireHose to give individual rows from the segment without any truncation.
how about making ALL default granularity but still providing the option for caller to change the granularity so that re-indexing path can stay same ?
I wonder if this impacts re-indexing done via local index task too.

@gianm gianm deleted the isff branch October 31, 2017 18:04
@gianm
Copy link
Contributor Author

gianm commented Oct 31, 2017

@himanshug What specifically has broken? IIRC the rows still do have their original, unchanged timestamps -- the only difference is that the cursor timestamps are truncated. But reindexing shouldn't be using the cursor timestamps anyway.

@himanshug
Copy link
Contributor

ok, I assumed CursorFactory.makeCursor(..) could truncate timestamps based on provided granularity. its fine if it returns rows without truncation. thanks.

@gianm
Copy link
Contributor Author

gianm commented Oct 31, 2017

IIRC what happens is the cursor.getTime() is truncated, but the timestamp on the actual rows (row.getTimestampFromEpoch()) is not truncated, since it comes from the time column, not from the cursor. So I think it should be ok. If you notice anything different please raise it…

@@ -171,7 +158,6 @@ public DatasourceIngestionSpec withQueryGranularity(Granularity granularity)
intervals,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now this method is effectively just "clone", the method argument is unused.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants