Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixes #558 use copy to avoid deadlock in tserver #559

Merged
merged 4 commits into from
Jul 13, 2018

Conversation

keith-turner
Copy link
Contributor

No description provided.

@keith-turner keith-turner changed the base branch from master to 1.9 July 13, 2018 00:16
}

Set<String> beginClearingUnusedLogs() {
Set<String> doomed = new HashSet<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I raised this before, but hopefully I can make my objection more clear this time around:

"Doomed" is an extremely ambiguous concept (it conveys no information about the end state, only that it will be "unfortunate", whatever that means in any given context). Using it here, especially in already error prone WAL code, introduces unnecessary confusion to subsequent contributors/debuggers. It should either be clearly documented when it is declared, with a direct, complete, and unambiguous answer to the question "What does it mean to be 'doomed' here?" or avoided entirely. I prefer to avoid: If we mean "logs to clear" then let's just say what we mean and call the variable "logsToClear". If we mean something else, then let's use similarly simple language to convey what we mean.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted changes to this code since 1.9.1 in order to minimize what I had to consider. In this process I lost the doomed comment. I can make the variable name more clear.

@keith-turner
Copy link
Contributor Author

It would be useful if someone else could double check my anaylsis of otherLogs and currentLogs. In Eclipse I looked for any place one of the sets was mutated. When mutated I rebuilt the immutable copy. If someone else could pull this patch into an IDE and make sure I didn't miss any mutations of the sets I would appreciate it.

@keith-turner
Copy link
Contributor Author

All ITs passed for merge of ec3b7ed into 42a3534

Copy link
Member

@mikewalch mikewalch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@keith-turner
Copy link
Contributor Author

keith-turner commented Jul 13, 2018

I ran a ~1hr CI test w/ agitation against this branch and it went well. I can run longer test w/ and w/o agitation against RC2 starting Mon.

		REFERENCED=648712264
		UNREFERENCED=8000024

@keith-turner keith-turner merged commit 8519514 into apache:1.9 Jul 13, 2018
@keith-turner
Copy link
Contributor Author

keith-turner commented Jul 18, 2018

This is the proper fix to the issue described in #539. Also this change may fix a deadlock introduced in #441 / #443.

@keith-turner
Copy link
Contributor Author

keith-turner commented Jul 19, 2018

While looking into creating the release notes, I was getting confused trying to unravel and understand the #441 and #539 bugs fixes. Looking back, I have it all sorted out in my mind now. I am writing up a summary of my understanding in this comment.

For each tablet there are three different sets of active WALogs. While a WALog exist in anyone of these sets it must not be garbage collected. The following are the three sets.

  1. WALogs related to the tablets active in memory map
  2. WAlogs related to the tablets inactive in memory map that is in the process of being minor compacted
  3. WAlogs related to a newly minor compacted file that is in the process of being added to the tablets row in the metadata table.

From Accumulo [1.8.0,1.9.0] only set 1 was consulted to determine what WALogs were active for a tablet. In #443 the code was changed to also consider set 2, however this may have introduced a possible deadlock in the code. So 1.9.1 consults sets 1 and 2, but may have a deadlock. This PR also considers set 3 and fixes the deadlock. So 1.9.2 consults sets 1,2,and 3 w/o having a deadlock.

keith-turner added a commit to keith-turner/accumulo that referenced this pull request Jul 19, 2018
keith-turner added a commit that referenced this pull request Jul 19, 2018
@keith-turner keith-turner deleted the accumulo-558 branch December 6, 2018 15:16
@ctubbsii ctubbsii added this to the 1.9.2 milestone Jul 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants