New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TS-4915: Crash from hostdb in PriorityQueueLess #1088

Merged
merged 1 commit into from Oct 13, 2016

Conversation

@shinrich
Member

shinrich commented Oct 11, 2016

These changes have been running on my production box since leaving work Monday night. Will keep an eye on it. Lower traffic overnight might not be stressing it sufficiently.

The main change was in PriorityQueueLess<>::erase. The assignment of the end item to the erase point was not preserving the entry index. So the assumption that entry->index is less than _v.length() was made invalid the next time around. I think breaking this entry->index == _v index assignment can also harm the bubble_sorting logic. I think PriorityQueueLess<>::pop also has a problem, but my work load was not triggering that function, so I didn't dive in there.

The other change was in RefCountCachePartition::make_space_for. There was an extra pop which I believe was doubly removing an entry already removed in PriorityQueueLess::erase (called from RefCountCachePartition::erase).

@shinrich shinrich added this to the 7.1.0 milestone Oct 11, 2016

@atsci

This comment has been minimized.

Show comment
Hide comment
@atsci

atsci commented Oct 11, 2016

FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/957/ for details.

@atsci

This comment has been minimized.

Show comment
Hide comment
@atsci

atsci commented Oct 11, 2016

Linux build successful! See https://ci.trafficserver.apache.org/job/Github-Linux/849/ for details.

@jacksontj

This comment has been minimized.

Show comment
Hide comment
@jacksontj

jacksontj Oct 11, 2016

Contributor

Re-pasting comments from IRC-- for easier retrieval.

The pop you remove does seem extraneous-- and there are test cases for some of the refcountcache stuff (https://github.com/apache/trafficserver/blob/master/iocore/hostdb/test_RefCountCache.cc) should be easy enough to make a test case for evictions).

As for the priority queue, I ran into a couple issues of it not doing what I wanted either. Thankfully that has some decent tests-- so assuming you can figure out the sequence of events that causes the problem-- adding a test case is quite simple (https://github.com/apache/trafficserver/blob/master/lib/ts/test_PriorityQueue.cc)

Contributor

jacksontj commented Oct 11, 2016

Re-pasting comments from IRC-- for easier retrieval.

The pop you remove does seem extraneous-- and there are test cases for some of the refcountcache stuff (https://github.com/apache/trafficserver/blob/master/iocore/hostdb/test_RefCountCache.cc) should be easy enough to make a test case for evictions).

As for the priority queue, I ran into a couple issues of it not doing what I wanted either. Thankfully that has some decent tests-- so assuming you can figure out the sequence of events that causes the problem-- adding a test case is quite simple (https://github.com/apache/trafficserver/blob/master/lib/ts/test_PriorityQueue.cc)

@bryancall

Great find on both of these issues.

Show outdated Hide outdated lib/ts/PriorityQueue.h
Show outdated Hide outdated lib/ts/PriorityQueue.h
@shinrich

This comment has been minimized.

Show comment
Hide comment
@shinrich

shinrich Oct 11, 2016

Member

While writing the test, I see that the bubble_up/down isn't work (or at least isn't working as I expect). Was able to make a failing test without the erase fix.

Member

shinrich commented Oct 11, 2016

While writing the test, I see that the bubble_up/down isn't work (or at least isn't working as I expect). Was able to make a failing test without the erase fix.

@atsci

This comment has been minimized.

Show comment
Hide comment
@atsci

atsci commented Oct 11, 2016

FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/966/ for details.

@shinrich

This comment has been minimized.

Show comment
Hide comment
@shinrich

shinrich Oct 11, 2016

Member

The problem appears to be in _bubble_down. In the test, I had 4 items, w, x, y, z with weights 10, 20, 30, 40 respectively. I pushed in reverse order, but the overall order afterwards was correct.

I deleted item X (entry index 1). In the original code, this swaps Z (with index 3) into slot 1. Then we call bubble_up, which doesn't do anything because Z's weight is greater than entry 0. Then it calls bubble_down with index 1. It computes left as 2*index + 1 or 3 in this case. This is greater than or equal to the current length of 3, so it gives up. It leaves the vector with with W(10), Z(40), and Y(30). If the vector were longer, the bubble_down logic would work, but with the short vector it gives up too soon. @masaori335 any ideas?

Member

shinrich commented Oct 11, 2016

The problem appears to be in _bubble_down. In the test, I had 4 items, w, x, y, z with weights 10, 20, 30, 40 respectively. I pushed in reverse order, but the overall order afterwards was correct.

I deleted item X (entry index 1). In the original code, this swaps Z (with index 3) into slot 1. Then we call bubble_up, which doesn't do anything because Z's weight is greater than entry 0. Then it calls bubble_down with index 1. It computes left as 2*index + 1 or 3 in this case. This is greater than or equal to the current length of 3, so it gives up. It leaves the vector with with W(10), Z(40), and Y(30). If the vector were longer, the bubble_down logic would work, but with the short vector it gives up too soon. @masaori335 any ideas?

@atsci

This comment has been minimized.

Show comment
Hide comment
@atsci

atsci commented Oct 11, 2016

Linux build successful! See https://ci.trafficserver.apache.org/job/Github-Linux/858/ for details.

@atsci

This comment has been minimized.

Show comment
Hide comment
@atsci

atsci commented Oct 11, 2016

FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/967/ for details.

@atsci

This comment has been minimized.

Show comment
Hide comment
@atsci

atsci commented Oct 11, 2016

Linux build successful! See https://ci.trafficserver.apache.org/job/Github-Linux/859/ for details.

@shinrich

This comment has been minimized.

Show comment
Hide comment
@shinrich

shinrich Oct 11, 2016

Member

Added a potential fix to _bubble_down to just do a simple bubble sort from the last index to the end of the array to catch any overshooting left index calculations.

Member

shinrich commented Oct 11, 2016

Added a potential fix to _bubble_down to just do a simple bubble sort from the last index to the end of the array to catch any overshooting left index calculations.

@atsci

This comment has been minimized.

Show comment
Hide comment
@atsci

atsci commented Oct 11, 2016

Linux build failed! See https://ci.trafficserver.apache.org/job/Github-Linux/861/ for details.

@atsci

This comment has been minimized.

Show comment
Hide comment
@atsci

atsci commented Oct 11, 2016

FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/969/ for details.

@zwoop

This comment has been minimized.

Show comment
Hide comment
@zwoop

zwoop Oct 11, 2016

Contributor

Clang-format ...

Contributor

zwoop commented Oct 11, 2016

Clang-format ...

@shinrich

This comment has been minimized.

Show comment
Hide comment
@shinrich

shinrich Oct 11, 2016

Member

Bleah. Clang-formatted. The fix has been running without crash for nearly 24 hours on my prod box.

Member

shinrich commented Oct 11, 2016

Bleah. Clang-formatted. The fix has been running without crash for nearly 24 hours on my prod box.

@atsci

This comment has been minimized.

Show comment
Hide comment
@atsci

atsci commented Oct 11, 2016

Linux build failed! See https://ci.trafficserver.apache.org/job/Github-Linux/865/ for details.

@atsci

This comment has been minimized.

Show comment
Hide comment
@atsci

atsci commented Oct 11, 2016

FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/973/ for details.

@masaori335

This comment has been minimized.

Show comment
Hide comment
@masaori335

masaori335 Oct 12, 2016

Contributor

@shinrich It looks working correctly. What is problem with the vector W(10), Z(40), and Y(30) in that case?
This PriorityQueue is implemented as BinaryHeap in an array. So left child node(Z(40)) could be larger than right child node(Y(30)). And we don't need to sort the array.

Contributor

masaori335 commented Oct 12, 2016

@shinrich It looks working correctly. What is problem with the vector W(10), Z(40), and Y(30) in that case?
This PriorityQueue is implemented as BinaryHeap in an array. So left child node(Z(40)) could be larger than right child node(Y(30)). And we don't need to sort the array.

@shinrich

This comment has been minimized.

Show comment
Hide comment
@shinrich

shinrich Oct 12, 2016

Member

@masaori335 ok, that makes sense. I reverted my hack in the _bubble_down. Hopefully that fixes that transient TSConfigSyntax build failure that occurred on the Linux build.

If all are agreed, I think we should commit this fix. My build was running without crash in production for over 24 hours until it re-installed last night.

Member

shinrich commented Oct 12, 2016

@masaori335 ok, that makes sense. I reverted my hack in the _bubble_down. Hopefully that fixes that transient TSConfigSyntax build failure that occurred on the Linux build.

If all are agreed, I think we should commit this fix. My build was running without crash in production for over 24 hours until it re-installed last night.

@atsci

This comment has been minimized.

Show comment
Hide comment
@atsci

atsci commented Oct 12, 2016

FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/979/ for details.

@atsci

This comment has been minimized.

Show comment
Hide comment
@atsci

atsci commented Oct 12, 2016

Linux build successful! See https://ci.trafficserver.apache.org/job/Github-Linux/871/ for details.

@atsci

This comment has been minimized.

Show comment
Hide comment
@atsci

atsci commented Oct 12, 2016

FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/982/ for details.

@atsci

This comment has been minimized.

Show comment
Hide comment
@atsci

atsci commented Oct 12, 2016

Linux build successful! See https://ci.trafficserver.apache.org/job/Github-Linux/874/ for details.

@shinrich

This comment has been minimized.

Show comment
Hide comment
@shinrich

shinrich Oct 13, 2016

Member

Squashed and rebased. Assuming all looks good with merge in the morning.

Member

shinrich commented Oct 13, 2016

Squashed and rebased. Assuming all looks good with merge in the morning.

@atsci

This comment has been minimized.

Show comment
Hide comment
@atsci

atsci commented Oct 13, 2016

FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/991/ for details.

@atsci

This comment has been minimized.

Show comment
Hide comment
@atsci

atsci commented Oct 13, 2016

Linux build successful! See https://ci.trafficserver.apache.org/job/Github-Linux/883/ for details.

@bryancall

This comment has been minimized.

Show comment
Hide comment
@bryancall

bryancall Oct 13, 2016

Contributor

I have been running this in production without an issue. However, I still have a comment in the code about the !empt() conditional and the bubble_up function call.

Contributor

bryancall commented Oct 13, 2016

I have been running this in production without an issue. However, I still have a comment in the code about the !empt() conditional and the bubble_up function call.

@shinrich

This comment has been minimized.

Show comment
Hide comment
@shinrich

shinrich Oct 13, 2016

Member

Removed the extra !empty() check in ::erase. The tests still pass. I'm running it in production again without incident. Will let it bake some more and then will merge it if all looks good.

Member

shinrich commented Oct 13, 2016

Removed the extra !empty() check in ::erase. The tests still pass. I'm running it in production again without incident. Will let it bake some more and then will merge it if all looks good.

@atsci

This comment has been minimized.

Show comment
Hide comment
@atsci

atsci commented Oct 13, 2016

Linux build successful! See https://ci.trafficserver.apache.org/job/Github-Linux/888/ for details.

@atsci

This comment has been minimized.

Show comment
Hide comment
@atsci

atsci commented Oct 13, 2016

FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/996/ for details.

@shinrich shinrich merged commit b19348c into apache:master Oct 13, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment