Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sorted recovery write to RFiles #2117

Merged
merged 29 commits into from Jun 21, 2021

Conversation

milleruntime
Copy link
Contributor

@milleruntime milleruntime commented May 25, 2021

  • Change LogSorter and SortedLogRecovery to write/read RFile instead of MapFile
  • Rewrite LogSorter writeBuffer method and make static to test
  • Create toKey() and fromKey() methods in LogFileKey for converting WAL keys
  • Create toValue() and FromValue() methods in LogFileValue for converting WAL values
  • Byte encode the rows of the Key to sort WAL events properly
  • Make RecoveryLogsIterator convert Key Value pairs on next()

@milleruntime
Copy link
Contributor Author

Ran full ITs and saw timeout failures in: CreateInitialSplitsIT, SuspendedTabletsIT

@milleruntime
Copy link
Contributor Author

Ran full ITs and saw timeout failures in: CreateInitialSplitsIT, SuspendedTabletsIT

Both ITs passed when run locally.

@milleruntime
Copy link
Contributor Author

I found a bug with the writeBuffer method while testing in Uno.

2021-05-26T11:00:44,630 [tserver.AssignmentHandler] WARN : exception trying to assign tablet +r<< null
java.lang.RuntimeException: Error recovering tablet +r<< from log files
        at org.apache.accumulo.tserver.tablet.Tablet.(Tablet.java:398) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at org.apache.accumulo.tserver.AssignmentHandler.run(AssignmentHandler.java:160) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.io.IOException: java.lang.IllegalStateException: First log entry value is not OPEN
        at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:539) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at org.apache.accumulo.tserver.TabletServer.recover(TabletServer.java:1148) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at org.apache.accumulo.tserver.tablet.Tablet.(Tablet.java:357) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        ... 2 more
Caused by: java.lang.IllegalStateException: First log entry value is not OPEN
        at org.apache.accumulo.tserver.log.RecoveryLogsIterator.validateFirstKey(RecoveryLogsIterator.java:151) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at org.apache.accumulo.tserver.log.RecoveryLogsIterator.getFiles(RecoveryLogsIterator.java:133) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at org.apache.accumulo.tserver.log.RecoveryLogsIterator.(RecoveryLogsIterator.java:71) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at org.apache.accumulo.tserver.log.SortedLogRecovery.findMaxTabletId(SortedLogRecovery.java:113) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at org.apache.accumulo.tserver.log.SortedLogRecovery.findLogsThatDefineTablet(SortedLogRecovery.java:153) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at org.apache.accumulo.tserver.log.SortedLogRecovery.recover(SortedLogRecovery.java:299) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:537) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at org.apache.accumulo.tserver.TabletServer.recover(TabletServer.java:1148) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at org.apache.accumulo.tserver.tablet.Tablet.(Tablet.java:357) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        ... 2 more

This is because the entries get flushed in the LogSorter based on tserver.sort.buffer.size but the validation is still being run for each sorted part. I will have to find a better place to call validateFirstKey.

@milleruntime
Copy link
Contributor Author

Need to rework my rework in 007d581

@milleruntime
Copy link
Contributor Author

I successfully tested the recovery locally using Uno with TestIngest of 50000 rows and using two tservers. Ran all the ITs again and WriteAheadLogEncryptedIT was the only one failing. I refactored to pass server context in order to get the configuration and it now passes.

There may be room for future performance improvements but I think the equivalent functionality of sorting recovery with RFile is ready to merge. This change should take care of #2076 as well but I want to do further testing with encryption since I don't think it works exactly how we want it to yet. I can do that as a follow on task.

Copy link
Contributor

@keith-turner keith-turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started reviewing this, but was not able to finish. I will come back to it later. Posting the comments I made so far.

Copy link
Contributor

@keith-turner keith-turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looking at this, but I got farther.

@milleruntime
Copy link
Contributor Author

@keith-turner Did you want to take another look at these changes? I think this is ready to merge

@keith-turner
Copy link
Contributor

@keith-turner Did you want to take another look at these changes? I think this is ready to merge

currently looking at it

Copy link
Contributor

@keith-turner keith-turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@milleruntime it tool a while but I finally finished reviewing the entire change. This is a really nice change. Even though the diff is small, its a complex change so it took time to review.

@ctubbsii ctubbsii added this to In progress in 2.1.0 via automation Jun 16, 2021
milleruntime and others added 8 commits June 17, 2021 10:27
…r/LogFileKey.java

Co-authored-by: Keith Turner <kturner@apache.org>
…r/LogFileKey.java

Co-authored-by: Keith Turner <kturner@apache.org>
…r/LogFileKey.java

Co-authored-by: Keith Turner <kturner@apache.org>
…r/LogFileKey.java

Co-authored-by: Keith Turner <kturner@apache.org>
…r/LogFileKey.java

Co-authored-by: Keith Turner <kturner@apache.org>
Copy link
Member

@ctubbsii ctubbsii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few minor notes from a quick pass

milleruntime and others added 3 commits June 17, 2021 12:13
…r/LogFileValue.java

Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>
…r/LogFileKey.java

Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>
@milleruntime milleruntime changed the title Change sorted recovery to write to RFiles Make sorted recovery write to RFiles Jun 21, 2021
@milleruntime milleruntime merged commit 3e765b1 into apache:main Jun 21, 2021
2.1.0 automation moved this from In progress to Done Jun 21, 2021
@milleruntime milleruntime deleted the sorted-rfile branch June 21, 2021 13:53
milleruntime added a commit to milleruntime/accumulo that referenced this pull request Jun 30, 2021
* Add upgradeFiles method for upgrading files to Upgrader
* Refactor status handling in UpgradeCoordinator to allow upgradeFiles
to run concurrently and wait for metadata upgrade to complete
* Implement upgradeFiles method in Upgrader9to10
* Add dropSortedMapWALFiles for resolving sorted map files that may
still be around during upgrade. Follow up for apache#2117
* Closes apache#2179
milleruntime added a commit to milleruntime/accumulo that referenced this pull request Jun 30, 2021
* Add upgradeFiles method for upgrading files to Upgrader
* Refactor status handling in UpgradeCoordinator to allow upgradeFiles
to run concurrently and wait for metadata upgrade to complete
* Implement upgradeFiles method in Upgrader9to10
* Add dropSortedMapWALFiles for resolving sorted map files that may
still be around during upgrade. Follow up for apache#2117
* Closes apache#2179
milleruntime added a commit to milleruntime/accumulo that referenced this pull request Jul 30, 2021
* Add upgradeFiles method for upgrading files to Upgrader
* Refactor status handling in UpgradeCoordinator to allow upgradeFiles
to run concurrently and wait for metadata upgrade to complete
* Implement upgradeFiles method in Upgrader9to10
* Add dropSortedMapWALFiles for resolving sorted map files that may
still be around during upgrade. Follow up for apache#2117
* Closes apache#2179
milleruntime added a commit that referenced this pull request Jul 30, 2021
* Add dropSortedMapWALFiles to Upgrader9to10 for resolving sorted 
map files that may still be around during upgrade. Follow up for #2117
* Closes #2179
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
2.1.0
  
Done
Development

Successfully merging this pull request may close these issues.

None yet

5 participants