
Honor ignoreInvalidRows in reducer of Hadoop indexer #1226

Merged
merged 2 commits into apache:master on Mar 20, 2015

Conversation

dkharrat
Contributor

In the reducer of the Hadoop indexer, exceptions are sometimes thrown for invalid column values. This change skips those lines when the 'ignoreInvalidRows' flag is enabled, similar to what the mapper already does.

Without this change, my Hadoop indexer job failed because of a single corrupt line.
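Roughly, the change has this shape (a sketch only, using the variable names from the reducer snippets quoted later in this thread, not the exact patch):

try {
  // index.add() can throw a ParseException when a column value cannot be
  // converted to its expected type, even though the line itself parsed fine
  numRows = index.add(inputRow);
}
catch (ParseException e) {
  if (config.isIgnoreInvalidRows()) {
    // skip just this record instead of failing the whole job
    log.info("Ignoring invalid row [%s] due to parsing error: %s", value.toString(), e.getMessage());
  } else {
    throw e;
  }
}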

@drcrallen
Contributor

Other data solutions have a setting for how many input rows can be bad before a job fails, and often have a way to store the bad rows in their own output so they can be cleaned up later.

Is the row expected to be bad because of a read error, or because the data itself is corrupt?

@dkharrat
Contributor Author

How to deal with bad rows properly depends on the data set and individual requirements, so it should be configurable. At minimum, though, if they're set to be ignored, I think it's a good idea to log them somewhere (perhaps a file?) for debugging purposes. For my particular use case, there's nothing I can do with the bad rows other than ignore them, since some of the fields in the log lines are externally generated.

My change in this PR is to simply honor the ignoreInvalidRows flag that is specified in the indexer config file, which already works for the mappers.

numRows = index.add(inputRow);
} catch (ParseException e) {
if (config.isIgnoreInvalidRows()) {
log.info("Ignoring invalid row [%s] due to parsing error: %s", value.toString(), e.getMessage());
Contributor

This should probably be a warn; also, put the error as the first argument, like this:
log.warn(e, "Ignoring invalid row [%s] due to parsing error", value.toString());

Contributor Author

Updated in the new commit.

@drcrallen
Contributor

@dkharrat: can you make the title of the PR something more like "Honor ignoreInvalidRows in Hadoop indexer"?

This helps flag it as a bug fix rather than a feature enhancement.

@himanshug
Contributor

@dkharrat this code is already there in HadoopDruidIndexerMapper, which honors the "ignoreInvalidRows" configuration. The reducer will not even get the record if there was a problem parsing it.
Can you describe the case where the reducer can get a malformed record?

dkharrat changed the title from "handle parsing exceptions in hadoop indexer" to "Honor ignoreInvalidRows in reducer of Hadoop indexer" on Mar 19, 2015
@dkharrat
Contributor Author

@drcrallen done, I updated the title.

@himanshug yes, the mapper also ignores malformed records, but it only parses the line based on the ParseSpec and doesn't parse the individual column values. In my case, I had a perfectly valid CSV line (so the mapper succeeded in parsing it), but a column value was invalid (an integer was expected for a particular column, but a string was provided instead), so when the index tries to parse the column value, an exception is thrown. My change handles that case.
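For illustration (a made-up line, not from my real data): the line splits cleanly into three fields, so the mapper's line-level parse succeeds, but converting the third column to the expected numeric type fails later, when the row is added to the index:

// hypothetical "timestamp,page,count" row where count is expected to be a long
String line = "2015-03-19T00:00:00Z,home,oops-not-a-number";
String[] columns = line.split(",");       // structurally valid CSV: 3 fields
long count = Long.parseLong(columns[2]);  // throws NumberFormatException; the
                                          // indexer reports this kind of column
                                          // failure as a ParseException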

@@ -382,8 +383,18 @@ protected void reduce(
context.progress();
final InputRow inputRow = index.formatRow(parser.parse(value.toString()));
Contributor

Can you move the start of the try block up to here? parser.parse can also throw a ParseException.

@himanshug
Contributor

Thanks @dkharrat, that sounds reasonable. I agree with @drcrallen on including parser.parse inside the try-catch block.
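Put together, the suggested shape is roughly this (a sketch with the same variable names as the snippets above, not the exact final diff):

try {
  // parser.parse() can also throw a ParseException, so it now lives inside the try
  final InputRow inputRow = index.formatRow(parser.parse(value.toString()));
  numRows = index.add(inputRow);
}
catch (ParseException e) {
  if (config.isIgnoreInvalidRows()) {
    log.warn(e, "Ignoring invalid row [%s] due to parsing error", value.toString());
  } else {
    throw e;
  }
}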

dkharrat force-pushed the master branch 2 times, most recently from f0a54ce to e908d9d on March 20, 2015 00:34
@fjy
Contributor

fjy commented Mar 20, 2015

@dkharrat Thanks for the contrib! Do you mind signing our CLA: http://druid.io/community/cla.html

@drcrallen
Contributor

👍

@@ -77,6 +80,7 @@ protected void map(
}
catch (Exception e) {
if (config.isIgnoreInvalidRows()) {
log.warn(e, "Ignoring invalid row [%s] due to parsing error", value.toString());
Contributor

This can potentially log a tremendous number of errors. Can we cap the logging at something reasonable? Default of 10 or something.

Contributor

I have mixed feelings about that limit.

In our architecture, realtime ingestion is treated as "probably right" and the batch fixup is treated as "definitely right".

As such, if the data itself is bad, it should probably be cleaned up externally unless the user is really OK with completely losing data in Druid.

Thinking a bit more about it, I can see two major ways of looking at it:

  1. ignoreInvalidRows is an undocumented option for people who are intentionally trying to lose data, in which case either printing out no logs or printing out only a few lines would be OK.
  2. ignoreInvalidRows is a "feature" whereby bad rows can go to an ephemeral something^(tm) where they can be fixed and recovered.

But officially supporting ignoreInvalidRows feels like it breaks the Lambda assumption that the batch fixup gets all the data. As such, I'm suggesting option 1, which is the direction this PR is heading anyway.

Contributor

Either way, it is good to have ignoreInvalidRows. In some of our pipelines we are getting data from other systems, which might be bad, and we don't care about losing that portion.
It would be too bad if this feature were not there and we had to do a full pass over the data just to filter out the invalid rows before ingesting into Druid; that effectively increases our end-to-end ingestion time.

Maybe the simple solution here is to just change the log.warn to log.debug, so that it gets into the logs only if the user really wants it, in addition to the INVALID_ROW_COUNTER.

Contributor

@himanshug I like that solution.

The reducer of the hadoop indexer now ignores lines with parsing
exceptions (if enabled by the indexer config).
@dkharrat
Contributor Author

@himanshug I like your idea of using debug level to log invalid lines. I just updated the commit with this change.
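The updated block now looks roughly like this (a sketch, not the exact commit; the counter call uses Hadoop's generic context.getCounter(group, name) as a stand-in for the existing INVALID_ROW_COUNTER, whose exact identifier isn't quoted here):

try {
  final InputRow inputRow = index.formatRow(parser.parse(value.toString()));
  numRows = index.add(inputRow);
}
catch (ParseException e) {
  if (config.isIgnoreInvalidRows()) {
    // debug level: only shows up if the user explicitly turns it on
    log.debug(e, "Ignoring invalid row [%s] due to parsing error", value.toString());
    // still count the skipped row so the job stats reflect it (placeholder group/name)
    context.getCounter("druid.indexer", "INVALID_ROW_COUNTER").increment(1);
  } else {
    throw e;
  }
}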

@dkharrat
Contributor Author

@fjy sure, I just submitted the CLA form.

fjy added a commit that referenced this pull request Mar 20, 2015
Honor ignoreInvalidRows in reducer of Hadoop indexer
fjy merged commit bb91183 into apache:master on Mar 20, 2015