Hadoop reindexing task doesn't remove dimensions #5095

Closed
l15k4 opened this issue Nov 16, 2017 · 4 comments

Comments

@l15k4

l15k4 commented Nov 16, 2017

Hey,

On Druid 0.10.1 I'm having trouble removing unwanted dimensions from segments using the Hadoop indexing task. This was discussed here: https://groups.google.com/forum/#!topic/druid-user/aBbuMYNRID8

I tried both the schemaless approach with dimensionExclusions ( https://pastebin.com/raw/DwXw8dmE ) and an explicit dimensions list ( https://pastebin.com/raw/cxFBYn7q ). The segments get reindexed, but they still contain all the unwanted dimensions.

At first I thought that the dimensions were not being picked up from the ParseSpec, so I also tried specifying them in IoConfig.InputSpec.IngestionSpec.dimension, but that didn't help.
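
For readers without access to the pastebin links, the two places where a dimension whitelist can be given might look roughly like the sketch below. This is not the author's actual spec: the dimension names are a small sample borrowed from the published-segment log later in this thread, the interval is only an example, and the fragment omits everything else a real task spec needs (timestampSpec, metricsSpec, granularitySpec, tuningConfig). It assumes the `dataSource` input spec layout described in the Druid batch ingestion docs.

```python
import json

# Hypothetical whitelist for illustration; a real spec would list every dimension to keep.
wanted_dimensions = ["d_ln", "d_pid", "d_city"]

spec_fragment = {
    "dataSchema": {
        "dataSource": "gwiq-daily-s",
        "parser": {
            "parseSpec": {
                # Whitelist as seen by the parser.
                "dimensionsSpec": {"dimensions": wanted_dimensions},
            }
        },
    },
    "ioConfig": {
        "type": "hadoop",
        "inputSpec": {
            # Reads existing segments of the datasource instead of raw files.
            "type": "dataSource",
            "ingestionSpec": {
                "dataSource": "gwiq-daily-s",
                "intervals": ["2017-04-05/2017-04-18"],  # example interval only
                # Optional second copy of the whitelist, as tried in this issue.
                "dimensions": wanted_dimensions,
            },
        },
    },
}

print(json.dumps(spec_fragment, indent=2))
```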

@himanshug
Contributor

On a quick look, the task spec ( https://pastebin.com/raw/cxFBYn7q ) looks fine to me. You don't need to include the same list in ingestionSpec.dimensions; however, it should still have worked, since the list is the same.

How did you verify that the new segments still had unwanted dimensions that you did not mention in your task spec?
Did the reindexing job finish successfully? From the task logs, can you see that it published new segments, and did you verify that those specific segments had the extra dimensions?

@l15k4
Author

l15k4 commented Nov 16, 2017

@himanshug The indexing job succeeded. I verified it as follows:

  • on the historical, cat var/druid/segment-cache/gwiq-daily-s/.../meta.smoosh lists all the dimensions
  • the segments also got a fresh timestamp when they were replaced
  • some dimensions are duplicated (d_d & d_D), and Pivot fails on duplicate dimensions when introspecting Druid's metadata, which is one of the reasons I'm doing this.
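
A quick way to spot such duplicates in a dimension list is sketched below. It assumes the duplicates differ only by letter case, as the d_d / d_D pair suggests; the function name and the sample input are made up for illustration.

```python
from collections import defaultdict

def case_insensitive_duplicates(dimensions):
    """Group dimension names that collide once letter case is ignored."""
    groups = defaultdict(list)
    for name in dimensions:
        groups[name.lower()].append(name)
    # Keep only the groups where more than one spelling exists.
    return [names for names in groups.values() if len(names) > 1]

# Sample input only; in practice the list would come from the segment metadata.
dups = case_insensitive_duplicates(["d_d", "d_D", "d_city", "cid"])
print(dups)  # [['d_d', 'd_D']]
```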

In the indexing logs I can see:

2017-11-16T07:04:06,300 INFO [task-runner-0-priority-0] io.druid.indexer.IndexGeneratorJob - Adding segment gwiq-daily-s_2017-04-05T00:00:00.000Z_2017-04-18T00:00:00.000Z_2017-11-16T05:32:04.493Z to the list of published segments

and the published segment lists the whitelisted dimensions only:

2017-11-16T07:04:06,925 INFO [task-runner-0-priority-0] io.druid.indexing.common.actions.RemoteTaskActionClient - Performing action for task[index_hadoop_gwiq-daily-s_2017-11-16T05:32:04.340Z]: SegmentInsertAction{segments=[DataSegment{size=410865824, shardSpec=NoneShardSpec, metrics=[count, gwid, mid, diid], dimensions=[d_ln, c-geo:c3, d_convert, d_pid, d_g, d_language, d_c, d_channel, d_site, d_qsdjqwhgd, dvc, d_testkey, d_city, cid, d_creative, d_referer, d_s, d_p, d_click, d_a, d_section], version='2017-11-16T05:32:04.493Z', loadSpec={type=s3_zip, bucket=gwiq-views-s, key=gwiq/druid/segments/gwiq-daily-s/2017-04-05T00:00:00.000Z_2017-04-18T00:00:00.000Z/2017-11-16T05:32:04.493Z/0/index.zip, S3Schema=s3n}, interval=2017-04-05T00:00:00.000Z/2017-04-18T00:00:00.000Z, dataSource='gwiq-daily-s', binaryVersion='9'}]}
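
To compare what a task published against the intended whitelist, the dimensions can be pulled out of such a log line with a small script. The sketch below is only a text match on the `dimensions=[...]` part of the line, not a Druid API; the sample line is an abbreviated copy of the log above, and the whitelist is a hypothetical subset.

```python
import re

def published_dimensions(log_line):
    """Extract the names inside the first 'dimensions=[...]' of a log line."""
    match = re.search(r"dimensions=\[([^\]]*)\]", log_line)
    if match is None:
        return []
    return [d.strip() for d in match.group(1).split(",") if d.strip()]

# Abbreviated copy of the SegmentInsertAction line, shortened for readability.
line = (
    "SegmentInsertAction{segments=[DataSegment{size=410865824, "
    "metrics=[count, gwid, mid, diid], "
    "dimensions=[d_ln, c-geo:c3, d_convert], "
    "version='2017-11-16T05:32:04.493Z'}]}"
)

whitelist = {"d_ln", "c-geo:c3", "d_convert", "d_pid"}  # hypothetical whitelist
dims = published_dimensions(line)
unexpected = set(dims) - whitelist
print(dims, unexpected)
```

An empty `unexpected` set means the published segment carries no dimension outside the whitelist.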

@himanshug
Contributor

In cat var/druid/segment-cache/gwiq-daily-s/.../meta.smoosh, what is the value of the "..."? I just want to make sure that you're looking at the same segment that was published, i.e. gwiq-daily-s_2017-04-05T00:00:00.000Z_2017-04-18T00:00:00.000Z_2017-11-16T05:32:04.493Z
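
The identifier quoted above follows the pattern datasource, interval start, interval end, version, joined by underscores. A minimal sketch for rebuilding it from the fields in the log line, so it can be compared with what is on disk, could look like this. The helper name is made up, and it covers only the form shown in this thread (any partition suffix is not handled).

```python
def segment_id(data_source, interval_start, interval_end, version):
    """Join the four parts the same way as the identifier in the task log."""
    return "_".join([data_source, interval_start, interval_end, version])

published = segment_id(
    "gwiq-daily-s",
    "2017-04-05T00:00:00.000Z",
    "2017-04-18T00:00:00.000Z",
    "2017-11-16T05:32:04.493Z",
)
print(published)
```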

@l15k4
Author

l15k4 commented Nov 16, 2017

Ah, I revisited those segments and it looks like only the last 3 segments out of 26 did not get updated,
and unfortunately those were the ones I always checked 🤦

OK, thank you @himanshug, now it's going to be easy to investigate the rest. Look at the timestamps of those last 3 segments:

Nov 16 07:09 2015-01-01T00:00:00.000Z_2016-04-05T00:00:00.000Z
Nov 16 07:09 2016-04-05T00:00:00.000Z_2016-08-03T00:00:00.000Z
Nov 16 07:09 2016-08-03T00:00:00.000Z_2016-10-26T00:00:00.000Z
Nov 16 07:09 2016-10-26T00:00:00.000Z_2016-12-26T00:00:00.000Z
Nov 16 07:09 2016-12-26T00:00:00.000Z_2017-02-06T00:00:00.000Z
Nov 16 07:08 2017-02-06T00:00:00.000Z_2017-02-28T00:00:00.000Z
Nov 16 07:08 2017-02-28T00:00:00.000Z_2017-03-12T00:00:00.000Z
Nov 16 07:08 2017-03-12T00:00:00.000Z_2017-03-24T00:00:00.000Z
Nov 16 07:08 2017-03-24T00:00:00.000Z_2017-04-05T00:00:00.000Z
Nov 16 07:08 2017-04-05T00:00:00.000Z_2017-04-18T00:00:00.000Z
Nov 16 07:08 2017-04-18T00:00:00.000Z_2017-04-30T00:00:00.000Z
Nov 16 07:08 2017-04-30T00:00:00.000Z_2017-05-12T00:00:00.000Z
Nov 16 07:08 2017-05-12T00:00:00.000Z_2017-05-25T00:00:00.000Z
Nov 16 07:08 2017-05-25T00:00:00.000Z_2017-06-07T00:00:00.000Z
Nov 16 07:08 2017-06-07T00:00:00.000Z_2017-06-21T00:00:00.000Z
Nov 16 07:07 2017-06-21T00:00:00.000Z_2017-07-08T00:00:00.000Z
Nov 16 07:07 2017-07-08T00:00:00.000Z_2017-07-22T00:00:00.000Z
Nov 16 07:07 2017-07-22T00:00:00.000Z_2017-08-05T00:00:00.000Z
Nov 16 07:07 2017-08-05T00:00:00.000Z_2017-08-22T00:00:00.000Z
Nov 16 07:07 2017-08-22T00:00:00.000Z_2017-09-04T00:00:00.000Z
Nov 16 07:07 2017-09-04T00:00:00.000Z_2017-09-15T00:00:00.000Z
Nov 16 07:07 2017-09-15T00:00:00.000Z_2017-09-30T00:00:00.000Z
Nov 16 07:07 2017-09-30T00:00:00.000Z_2017-10-22T00:00:00.000Z
Nov 14 10:06 2017-10-22T00:00:00.000Z_2017-11-09T00:00:00.000Z
Nov 14 10:06 2017-11-09T00:00:00.000Z_2017-11-10T00:00:00.000Z
Nov 14 10:06 2017-11-11T00:00:00.000Z_2017-11-12T00:00:00.000Z
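
The same check can be scripted instead of eyeballed. The sketch below assumes the listing format shown above (month, day, time, interval directory) and flags every entry whose modification date differs from the day the reindexing task ran; the function name is made up and only the tail of the listing is embedded.

```python
# Tail of the listing above: one refreshed entry and the three stale ones.
LISTING = """\
Nov 16 07:07 2017-09-30T00:00:00.000Z_2017-10-22T00:00:00.000Z
Nov 14 10:06 2017-10-22T00:00:00.000Z_2017-11-09T00:00:00.000Z
Nov 14 10:06 2017-11-09T00:00:00.000Z_2017-11-10T00:00:00.000Z
Nov 14 10:06 2017-11-11T00:00:00.000Z_2017-11-12T00:00:00.000Z
"""

def stale_entries(listing, reindex_day):
    """Return interval directories not modified on the given 'Mon DD' day."""
    stale = []
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or unexpected lines
        month, day, _time, interval = parts
        if f"{month} {day}" != reindex_day:
            stale.append(interval)
    return stale

for interval in stale_entries(LISTING, "Nov 16"):
    print("not refreshed:", interval)
```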

@l15k4 l15k4 closed this as completed Nov 16, 2017