Seeing more data loss when running continuous ingest with agitation. #449

Closed
keith-turner opened this Issue Apr 27, 2018 · 1 comment


keith-turner commented Apr 27, 2018

With the fixes for #432 and #441 applied to 1.9.0-SNAPSHOT, I ran continuous ingest for 24 hours with agitation. I ran the agitator as follows.

nohup ./tserver-agitator.pl 1:10 1:10 1 3 &> logs/tserver-agitator.log &

The above command sleeps a random 1 to 10 minutes between killing tserver processes and kills a random 1 to 3 tablet servers each time. I think in the past I have run with a non-random period, but I am not sure. I suspect the random period may be uncovering some new bugs.

After running verify I saw the following UNDEFINED count, which indicates data was lost. This count is much lower than what I saw before #441 (~600K). I am going to attempt to track down the cause.

        org.apache.accumulo.test.continuous.ContinuousVerify$Counts
                REFERENCED=22559054297
                UNDEFINED=6176
                UNREFERENCED=8007075

keith-turner commented Apr 27, 2018

I determined the cause of this bug. It happens when a tablet is unloaded from a tserver and later comes back to the same tserver. When it comes back, if data is written and no minor compaction occurs before the tserver dies, then that data may not be seen. Below is a detailed example. The seq# values in it are WAL sequence numbers.

  • TabletA is loaded on tserver5
  • Define tablet is written to WAL1 with seq#1
  • TabletA is loaded on tserver5 for a bit and many compactions occur
  • Data is written to WAL1 for tabletA with seq# 50
  • TabletA is compacted
    • compaction start is written to WAL with seq# 51
    • compaction finish is written to WAL with seq# 52
  • TabletA is unloaded from tserver5
  • tserver5 starts using WAL2, however WAL1 is still referenced by other tablets
  • TabletA is loaded on tserver5 again
  • Define tablet is written to WAL2 with seq#1
  • Data is written to WAL2 for TabletA with seq# 1 (the seq# is lower because it resets when tablet is reassigned)
  • tserver5 dies

During recovery of TabletA, WAL1 and WAL2 are examined to find the last compaction to finish and its seq#. The last compaction to finish has a seq# of 52, so recovery looks for data with seq# >= 52. However, the data in WAL2 has a lower seq# and is therefore ignored.
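
The failure can be reproduced with a small model of the scenario above. This is only a sketch: the class, enum, and method names are hypothetical and are not Accumulo's actual recovery code. It assumes recovery takes the highest compaction-finish seq# across all WALs and replays only mutations at or above it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the example above; not Accumulo's real recovery classes.
final class WalSeqFilterSketch {
    enum Type { DEFINE_TABLET, MUTATION, COMPACTION_START, COMPACTION_FINISH }

    static final class Entry {
        final String wal;
        final Type type;
        final long seq;

        Entry(String wal, Type type, long seq) {
            this.wal = wal;
            this.type = type;
            this.seq = seq;
        }
    }

    // WAL entries for TabletA, in the order listed in the example.
    static List<Entry> scenario() {
        return Arrays.asList(
            new Entry("WAL1", Type.DEFINE_TABLET, 1),
            new Entry("WAL1", Type.MUTATION, 50),
            new Entry("WAL1", Type.COMPACTION_START, 51),
            new Entry("WAL1", Type.COMPACTION_FINISH, 52),
            new Entry("WAL2", Type.DEFINE_TABLET, 1),
            new Entry("WAL2", Type.MUTATION, 1)); // written after the reload
    }

    // Highest seq# of any finished compaction across all WALs, or -1 if none.
    static long lastCompactionFinish(List<Entry> entries) {
        long last = -1;
        for (Entry e : entries) {
            if (e.type == Type.COMPACTION_FINISH) {
                last = Math.max(last, e.seq);
            }
        }
        return last;
    }

    // Flawed rule: replay only mutations whose seq# is >= the last finish seq#.
    static List<Entry> replayed(List<Entry> entries) {
        long threshold = lastCompactionFinish(entries);
        List<Entry> out = new ArrayList<>();
        for (Entry e : entries) {
            if (e.type == Type.MUTATION && e.seq >= threshold) {
                out.add(e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry> entries = scenario();
        System.out.println("last compaction finish seq# = " + lastCompactionFinish(entries));
        System.out.println("mutations replayed = " + replayed(entries).size());
    }
}
```

In this model the WAL1 mutation at seq# 50 is correctly skipped because it was already compacted, but the WAL2 mutation at seq# 1 is skipped as well, even though it was never compacted. That skipped mutation is the lost data.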

Before 1.8.0, WAL1 would never have been included in the recovery and this problem would not have occurred. However, the changes made in 1.8.0 (moving from tracking WAL refs per tablet to per tablet server) bring WAL1 into the recovery process.

Notes from tracking down bug : accumulo-449.txt

keith-turner added a commit to keith-turner/accumulo that referenced this issue May 2, 2018

fixes apache#449 fix two bugs with WAL recovery
 * Fix bug where tablet is unloaded, reloaded on tserver, and then tserver dies
 * Fix bug with out of order logs.  Recovery code assumed logs were passed in
   time order.  However, since 1.8.0 they have been passed in random order. Rewrote
   recovery code to handle out of order logs.  The fix was to read all logs in
   a sorted merged way.

keith-turner added a commit to keith-turner/accumulo that referenced this issue May 8, 2018

fixes apache#449 fix two bugs with WAL recovery (apache#458)
 * Fix bug where tablet is unloaded, reloaded on tserver, and then tserver dies
 * Fix bug with out of order logs.  Recovery code assumed logs were passed in
   time order.  However, since 1.8.0 they have been passed in random order. Rewrote
   recovery code to handle out of order logs.  The fix was to read all logs in
   a sorted merged way.
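
As a rough illustration of reading all logs "in a sorted merged way", the sketch below performs a k-way merge with a priority queue. It is not the code from the commit: the class and method names are hypothetical, each log is reduced to a list of long keys, and it assumes every individual log is already sorted by that key. Under those assumptions the merged output does not depend on the order in which the logs are passed in.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of a sorted merge over several logs; not Accumulo's code.
final class SortedLogMergeSketch {
    // Pairs the current head key of one log with an iterator over the rest of it.
    private static final class Cursor implements Comparable<Cursor> {
        long head;
        final Iterator<Long> rest;

        Cursor(Iterator<Long> it) {
            this.rest = it;
            this.head = it.next();
        }

        @Override
        public int compareTo(Cursor other) {
            return Long.compare(head, other.head);
        }
    }

    // Merges logs that are each sorted by key into one sorted list.
    // The order of the outer list (the order the logs are passed in) is irrelevant.
    static List<Long> mergeSorted(List<List<Long>> logs) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<Long> log : logs) {
            Iterator<Long> it = log.iterator();
            if (it.hasNext()) {
                heap.add(new Cursor(it));
            }
        }
        List<Long> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();      // log whose head key is smallest
            merged.add(c.head);
            if (c.rest.hasNext()) {
                c.head = c.rest.next();  // advance that log and re-queue it
                heap.add(c);
            }
        }
        return merged;
    }
}
```

Calling `mergeSorted` with the same logs in any order yields the same merged sequence, which is the property the old recovery code lacked when it assumed logs arrived in time order.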

keith-turner added a commit that referenced this issue May 8, 2018
