Feature for hadoop batch re-ingestion and delta ingestion #1374
Conversation
public static final String CONF_DRUID_SCHEMA = "druid.datasource.schema";

@Override
public List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException
How would you feel about having a DataSegmentInputSplit as per https://github.com/druid-io/druid/pull/1351/files#diff-e3e429f5bc3e9ee67f25b69b7e6c3969R981 ?
Alternatively, we could have a "HadoopDataSegment" which has a hadoopy loadSpec that is only a URI parsable by hadoop (in addition to the other things a DataSegment usually has). That could allow the peon/overlord to set up stuff correctly for hadoop.
We have the data to get all the splits done completely; it is just a matter of ensuring the data gets propagated to Hadoop correctly.
I think your DataSegmentSplit and my DatasourceInputSplit are pretty much the same (and reusable), except I'm just keeping "path" in it while you are keeping the whole DataSegment inside it. Do you really need all of DataSegment? Can we just keep "loadSpec" in the input split? That information should be enough to load the segment.
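For illustration only, a minimal sketch of an input split that carries just the segments' loadSpecs, serialized as JSON strings via the Writable contract; the class and field names here are assumptions for discussion, not the code in this PR or in #1351:

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical split that only knows how to find its segments, not their full metadata.
public class DatasourceInputSplitSketch extends InputSplit implements Writable
{
  private List<String> segmentLoadSpecs = new ArrayList<>();

  public DatasourceInputSplitSketch() {} // no-arg constructor required by Hadoop

  public DatasourceInputSplitSketch(List<String> segmentLoadSpecs)
  {
    this.segmentLoadSpecs = segmentLoadSpecs;
  }

  @Override
  public long getLength() throws IOException, InterruptedException
  {
    return segmentLoadSpecs.size(); // could report total segment bytes instead
  }

  @Override
  public String[] getLocations() throws IOException, InterruptedException
  {
    return new String[]{}; // no locality hints for deep-storage segments
  }

  @Override
  public void write(DataOutput out) throws IOException
  {
    out.writeInt(segmentLoadSpecs.size());
    for (String loadSpec : segmentLoadSpecs) {
      out.writeUTF(loadSpec); // loadSpec JSON is small enough for writeUTF
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException
  {
    int count = in.readInt();
    segmentLoadSpecs = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      segmentLoadSpecs.add(in.readUTF());
    }
  }

  public List<String> getSegmentLoadSpecs()
  {
    return segmentLoadSpecs;
  }
}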
@@ -104,6 +94,8 @@ public final static InputRow parseInputRow(Writable value, InputRowParser parser
  if(parser instanceof StringInputRowParser && value instanceof Text) {
    //Note: This is to ensure backward compatibility with 0.7.0 and before
    return ((StringInputRowParser)parser).parse(value.toString());
  } else if(value instanceof HadoopWritableInputRow) {
It looks like we are adjusting things to make the Mapper understand more and more types of objects. I think it might be cleaner to define the Mapper interface in terms of InputRow objects and move all of the parser stuff to the InputFormat.
This would have the downside of making it a bit more difficult to take other InputFormats, but would have the plus of getting rid of the parser when it is not needed (it becomes a part of the InputFormat instead of the mapper).
If the Mapper takes InputRow, then the row might contain a complex metric object such as HyperLogLogCollector, and then the mapper needs to know how to serialize it.
On the other hand, if we introduce HadoopWritableInputRow and require the mapper to only take those (not done now, to keep this feature PR simple; I decided to put in the if/else instead), then complete ownership of reading/parsing data goes to the InputFormat (which should understand the input data best) and the mapper is guaranteed not to run into serde issues.
The high-level direction seems good. The biggest question in my mind is whether we keep the Writable InputRow or adjust the jobs to pull out the key/value and all that now.
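To make the trade-off concrete, here is a rough sketch of what a Writable wrapper around an already-parsed row could look like, so the InputFormat owns all parsing and the mapper never touches an InputRowParser; the class name, field layout, and JSON-based serialization are assumptions for illustration, not the PR's actual HadoopWritableInputRow:

import com.fasterxml.jackson.databind.ObjectMapper;
import io.druid.data.input.InputRow;
import io.druid.data.input.MapBasedInputRow;
import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.List;
import java.util.Map;

// Hypothetical Writable carrying an already-parsed row between InputFormat and mapper.
public class WritableParsedRowSketch implements Writable
{
  private static final ObjectMapper JSON = new ObjectMapper();

  private long timestampMillis;
  private List<String> dimensions;
  private Map<String, Object> event;

  public WritableParsedRowSketch() {} // no-arg constructor required by Hadoop

  public WritableParsedRowSketch(long timestampMillis, List<String> dimensions, Map<String, Object> event)
  {
    this.timestampMillis = timestampMillis;
    this.dimensions = dimensions;
    this.event = event;
  }

  public InputRow toInputRow()
  {
    return new MapBasedInputRow(timestampMillis, dimensions, event);
  }

  @Override
  public void write(DataOutput out) throws IOException
  {
    out.writeLong(timestampMillis);
    out.writeUTF(JSON.writeValueAsString(dimensions));
    // This only works for JSON-friendly values; complex metric objects such as
    // HyperLogLogCollector would need their own serde, which is exactly the concern above.
    out.writeUTF(JSON.writeValueAsString(event));
  }

  @Override
  @SuppressWarnings("unchecked")
  public void readFields(DataInput in) throws IOException
  {
    timestampMillis = in.readLong();
    dimensions = JSON.readValue(in.readUTF(), List.class);
    event = JSON.readValue(in.readUTF(), Map.class);
  }
}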
Force-pushed from 914b965 to bbe682f
@himanshug @cheddar hey guys, just wondering what the status of this is.
@himanshug #1472 is merged
@fjy thanks, I will update this one soon, just doing some testing of the code to make sure that it works.
Force-pushed from bbe682f to 12222cf
👍 after latest changes
private int rowNum;
private MapBasedRow currRow;

private List<QueryableIndex> indexes = Lists.newArrayList();
Very minor nitpick, but this could be a Closer instead of a list of QueryableIndex.
I'm fine with it as is, but for future reference a Closer would make it more obvious what's happening here.
hmmm, for now, let us keep it as is.
sure
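For future reference, a minimal sketch of the Closer pattern suggested above, written against plain Closeable so it does not depend on the exact QueryableIndex API; openIndex(..) is a hypothetical stand-in for however the reader actually loads a segment:

import com.google.common.io.Closer;

import java.io.Closeable;
import java.io.File;
import java.io.IOException;

// Track every opened resource (e.g. each QueryableIndex) with a Closer instead of a List,
// so the cleanup intent is explicit and a single close() releases everything.
public class CloserSketch implements Closeable
{
  private final Closer closer = Closer.create();

  public Closeable openAndTrack(File segmentDir) throws IOException
  {
    // register(..) returns its argument, so callers can use the resource directly
    return closer.register(openIndex(segmentDir));
  }

  @Override
  public void close() throws IOException
  {
    // closes every registered resource, in reverse registration order
    closer.close();
  }

  // hypothetical stand-in for loading a segment as a closeable index
  private Closeable openIndex(File segmentDir) throws IOException
  {
    return () -> { /* release mmapped segment resources here */ };
  }
}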
@himanshug I do have a concern about concurrency here: if I use delta ingestion but do not have a lock on the interval, then whatever I ingest may or may not actually be representative of the latest data. A simple example is if I run delta ingestion twice at the same time for different input paths to add (or start a second one before the first has finished). The indexing service hadoop indexer should be handling locks correctly, and I think such a race condition should not be able to proceed when the indexing service is used. But the standalone hadoop indexer could very easily encounter a race. Since we in general do not support accounting for race conditions in the standalone stuff, that is just how it's going to be. Does that sound correct?
@drcrallen that sounds right to me. The stance on standalone stuff (hadoop & realtime) has always been that you need to think about consistency and concurrency yourself, since Druid only does meaningful locking when you use the indexing service.
@drcrallen wrt concurrency, that is true. But this PR is not introducing or changing that behavior. The issue you described with the standalone hadoop indexer exists even today and is not really related to this PR.
@himanshug / @gianm Cool, I'm good with maintaining existing behavior, and just wanted to make sure our expectations are documented in case anyone ever comes back to this PR's comment thread.
…ecomes reusable Conflicts: indexing-service/src/main/java/io/druid/indexing/firehose/IngestSegmentFirehoseFactory.java
Conflicts: indexing-hadoop/src/main/java/io/druid/indexer/path/PathSpec.java indexing-service/pom.xml
… we can grab data from multiple places in same ingestion Conflicts: indexing-hadoop/src/main/java/io/druid/indexer/HadoopDruidIndexerConfig.java indexing-hadoop/src/main/java/io/druid/indexer/JobHelper.java Conflicts: indexing-hadoop/src/main/java/io/druid/indexer/path/PathSpec.java
…xer uses overlord action to get list of segments and passes when running as an overlord task. and, uses metadata store directly when running as standalone hadoop indexer also, serialized list of segments is passed to DatasourcePathSpec so that hadoop classloader issues do not creep up
Force-pushed from f3e2c61 to cfd81bf
@gianm @drcrallen @nishantmonu51
)
{
  this.mapper = Preconditions.checkNotNull(mapper, "null mapper");
  this.segments = segments;
Convention in some other places is to check for null and set as ImmutableList.of() if the argument is null. Would that be appropriate to use here? It seems most of the logic will throw errors if this is not set to a non-empty list.
No, it is valid for segments to not be specified; the check is done as the first thing inside addInputPaths(..). It is expected to be set by then.
Ok sure
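For reference, the convention mentioned above looks roughly like this; the class and field names are made up for illustration and are not the constructor in this PR:

import com.google.common.collect.ImmutableList;

import java.util.List;

public class ExampleSpec
{
  private final List<String> segments;

  public ExampleSpec(List<String> segments)
  {
    // default a missing list to an empty immutable list instead of rejecting null
    this.segments = segments == null ? ImmutableList.<String>of() : segments;
  }

  public List<String> getSegments()
  {
    return segments;
  }
}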
I'm 👍 if we can get one other committer to agree with the comment in #1374 (comment)
👍 at a3bab5b
Feature for hadoop batch re-ingestion and delta ingestion
This PR implements the following features.
We implement a "DatasourceInputFormat" and "DatasourcePathSpec" that can be used to batch re-ingest data back into Druid from an existing datasource's segments (something like IngestSegmentFirehose).
Also, we provide changes to allow specifying multiple PathSpecs (via "MultiplePathSpec") in batch ingestion, so that you can combine a DatasourcePathSpec and a StaticPathSpec (or any other) in the same ingestion to add late-arriving data to an already-ingested interval (aka "Delta" Ingestion). A rough sketch of this composite idea follows the proposal link below.
original proposal discussion: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/druid-development/EXYAPcV6pk4/Nu1uRGKEctAJ
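As a rough illustration of the composite path-spec idea described above (not the exact classes in this PR), a MultiplePathSpec-style spec could simply delegate to an ordered list of child specs, so a datasource-backed spec (re-ingestion of existing segments) and a static spec (late-arriving files) feed the same job; the simplified PathSpec interface below is an assumption, since the real one in indexing-hadoop also takes the indexer config:

import org.apache.hadoop.mapreduce.Job;

import java.io.IOException;
import java.util.List;

public class MultiplePathSpecSketch
{
  // simplified stand-in for io.druid.indexer.path.PathSpec
  public interface PathSpec
  {
    Job addInputPaths(Job job) throws IOException;
  }

  // composite spec that lets several child specs contribute input to one job
  public static class MultiplePathSpec implements PathSpec
  {
    private final List<PathSpec> children;

    public MultiplePathSpec(List<PathSpec> children)
    {
      this.children = children;
    }

    @Override
    public Job addInputPaths(Job job) throws IOException
    {
      // each child adds its own inputs: static files, datasource segments, etc.
      for (PathSpec child : children) {
        job = child.addInputPaths(job);
      }
      return job;
    }
  }
}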