forked from keith-turner/goraci
Commit
Split Generator into two MR jobs, so that MR tasks are idempotent, and task failures do not cause Verify.verify() to fail
Showing 2 changed files with 153 additions and 104 deletions.
c320c50
I was just looking at this change and the comment. The original Generator MapReduce job should be able to fail at any point without causing undefined nodes, because only flushed nodes are referenced. Undefined nodes indicate data loss. It's possible that in some cases this two-mapper approach could cover up data loss by rewriting the data that was lost.
An alternative change would be to modify the Loop verify step so that it does not fail if unreferenced > 0. Unreferenced nodes are OK; they are nodes that nothing points to, and they do not indicate data loss.
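That relaxed check could look something like the following minimal sketch. The standalone class, method name, and the assumption that verify exposes its counter totals as plain longs are all illustrative; the real check lives inside Verify.verify() and the actual counter names may differ.

```java
// Sketch of a relaxed Loop verify check: tolerate unreferenced nodes,
// fail only when undefined nodes (evidence of data loss) are present.
// Class and method names are hypothetical, for illustration only.
public class RelaxedVerifyCheck {

    /** Returns true when the verify counters show no data loss. */
    static boolean verifyOk(long undefined, long unreferenced) {
        if (unreferenced > 0) {
            // Unreferenced nodes are nodes nothing points to; log them
            // but do not fail the run on their account.
            System.err.println("WARN: " + unreferenced + " unreferenced nodes (tolerated)");
        }
        return undefined == 0;
    }

    public static void main(String[] args) {
        System.out.println(verifyOk(0, 1_000)); // unreferenced only -> true
        System.out.println(verifyOk(5, 0));     // undefined nodes   -> false
    }
}
```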
c320c50
You are right that repeated execution of the same map input might rewrite lost data and mask data loss, but task failure is supposed to be relatively uncommon, although it does happen and causes our nightlies to fail.
The original Generate / Verify flow is fine, but for Loop, we check the total number of referenced and unreferenced nodes in Verify.verify(). We can relax that and, as you suggested, just check for undefined nodes, but I would somehow like to keep those tighter checks. Wdyt?
c320c50
If a machine went down and a mapper, tabletserver/regionserver, and datanode all died and data loss occurred, wouldn't you like to know about that? It's possible that the event that causes a mapper to die may also cause data loss.
If you could obtain the number of failed map tasks, then you could bound the number of unreferenced nodes. I think the following should hold:
#unreferenced <= num_failed_map_tasks * 1000000
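As a sketch, that bound could be checked directly once both counts are in hand. The constant 1000000 is lifted from the inequality above (the nodes a map task writes per run in this test), and the class and method names are assumptions, not part of goraci.

```java
// Sketch of bounding unreferenced nodes by the number of failed map
// tasks, per the inequality in the comment. NODES_PER_FAILED_TASK is
// an assumed constant taken from that inequality, not read from config.
public class UnreferencedBound {

    static final long NODES_PER_FAILED_TASK = 1_000_000L;

    /** True when the unreferenced count is explainable by task failures alone. */
    static boolean withinBound(long unreferenced, long numFailedMapTasks) {
        return unreferenced <= numFailedMapTasks * NODES_PER_FAILED_TASK;
    }

    public static void main(String[] args) {
        System.out.println(withinBound(1_500_000, 2)); // true: two failures could explain it
        System.out.println(withinBound(1_500_000, 1)); // false: exceeds one failure's worth
    }
}
```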
I can understand the desire to get nice clean counts. Any count being off indicates a problem. For example, if the referenced count were too high, that would indicate a problem: where the heck did the extra data come from? For me the main focus of this test has always been to detect data loss, so I am very happy when undefined == 0. That said, the idea of detecting extra unexpected data is interesting.
c320c50
We should not care if the DN/TT dies. In our nightly test environment, running on 5-10 nodes, we see failed tasks for various reasons from time to time, but these should not affect the ingestion test. We should not care about RS failures either; they should not cause data loss. In fact, we deliberately kill RSs during the test.
There is another solution: add a column to each row holding the map task id, and revert to the original implementation. In the verify step, we would only count rows from map tasks that finished successfully in the previous jobs. I believe that at the start of the verify job, we can obtain the ids of failed / successfully finished tasks and either do the filtering, or just delete the rows from failed tasks before verifying.
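A minimal sketch of that filtering idea, assuming the failed task ids have already been fetched from the previous job's history (not shown here); the class name, method names, and the example task-id strings are all hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of filtering rows by the map-task id stored in an extra
// column, so verify only counts rows written by tasks that finished
// successfully. Obtaining the failed-task ids from the previous job
// is assumed to have happened before this filter is constructed.
public class TaskIdRowFilter {

    private final Set<String> failedTaskIds;

    TaskIdRowFilter(Set<String> failedTaskIds) {
        this.failedTaskIds = new HashSet<>(failedTaskIds);
    }

    /** Returns true when the row should be counted by the verify step. */
    boolean accept(String rowTaskId) {
        return !failedTaskIds.contains(rowTaskId);
    }

    public static void main(String[] args) {
        TaskIdRowFilter filter = new TaskIdRowFilter(Set.of("attempt_0003"));
        System.out.println(filter.accept("attempt_0001")); // true: task succeeded
        System.out.println(filter.accept("attempt_0003")); // false: task failed
    }
}
```

Deleting the rows from failed tasks up front, instead of filtering at read time, would be the other variant of the same idea.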
c320c50
Personally, I would still want to examine the data generated by failed map tasks for lost data, because there should not be any lost data.