Skip to content

Conversation

@ywangd
Copy link
Member

@ywangd ywangd commented Aug 29, 2024

A cacheFileRegion can be concurrently evicted while its being incref'd.

See this comment

// we own that one refcount (since we CAS'ed evicted to 1)
// grab io, rely on incref'ers also checking evicted field.

Resolves: #112314

@ywangd ywangd added >test Issues or PRs that are addressing/adding tests v8.16.0 :Search Foundations/Search Catch all for Search Foundations labels Aug 29, 2024
@ywangd ywangd requested a review from henningandersen August 29, 2024 01:59
@elasticsearchmachine elasticsearchmachine added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Aug 29, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

Copy link
Contributor

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm @henningandersen could you also take a look here? I wonder if this doesn't show a concurrency issue now?
In a sense this result to me means that the tryRead path could now randomly fail run because it does a non-volatile read on the IO that was just set.
It will eventually work, but for stuff that just (for some measure of "just") written to the cache there is a chance the fast-path is broken isn't there? Maybe I'm missing a detail?

@henningandersen
Copy link
Contributor

Thanks for the ping @original-brownbear . I think it is ok.

In the path where we set io to null, we evict first, then make assumptions on the refcount. As stated in the comment, we rely on incref'ers also checking evicted field.. However, my addition to the test here neglected to do so, so checking evicted after incref makes sense to me.

As for fast path read / tryRead, I think it is fine too. We may not see that io was turned into a null, but we only do this after having set evicted and tryRead always checks evicted (using volatile read) after the read. We never reassign the io field to another object, it only goes null -> some IO -> null (and the latter only after evicting it).

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for jumping on this. I want to give Armin a chance to respond before approving though.

Comment on lines 497 to 500
if (yield[i] == 0) {
Thread.yield();
}
assertNotNull(cacheFileRegion.testOnlyNonVolatileIO());
assertTrue(cacheFileRegion.testOnlyNonVolatileIO() != null || cacheFileRegion.isEvicted());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can now also do this outside the incref block, I think it could be good to move it there. I originally intended this to be a safe spot for it, but given the extra check for evicted we may as well just do it prior to the incref block.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. Pushed 7f50458

@henningandersen
Copy link
Contributor

One more note on the safety here: we may also in the past (prior to the non-volatile io PR) not have seen the transition to null of io and it may also have been reused for other data. Checking evicted after ensures it is safe (as before, i.e., since we introduced this fast path read).

@ywangd
Copy link
Member Author

ywangd commented Aug 30, 2024

I do find the combine usage of evicted and refcount a bit trickly to reason about. It surprised me that incref does not always prevent eviction. That said, I believe it is safe on the tryRead path since it checks eviction right before returning. In rare cases, it may read from the wrong IO (if the IO gets reassigned to another region) and waste some cpu cycles. But it should not return invalid data.

@ywangd
Copy link
Member Author

ywangd commented Aug 30, 2024

The nonVolatileIO() usage in tryRead may slightly increase the chance of it "reading from a stale IO and waste some cpu cycles". A region's IO lifecyle goes from null --> IO --> null. Currently, the tryRead method can still see IO after it becomes null due to the usage of nonVolatileIO. This should be an extreme case and hence not much of a concern also because it ends up on the slow path anyway. The bottom line is that safety is still guaranteed because of the eviction check as discussed above.

@original-brownbear
Copy link
Contributor

Checking evicted after ensures it is safe (as before, i.e., since we introduced this fast path read).

Yea that was the idea, it's fine to read some garbage bytes optimistically as long as we can check that they're garbage after. But I get it now, sorry was tired last night :D this test definitely does not show a problem in the becoming visible of IO != null, just the other direction that doesn't matter much since we have the flag :)

=> LGTM :) thanks both!

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@henningandersen
Copy link
Contributor

I'll merge this to reenable the test coverage of this.

@henningandersen henningandersen merged commit 212fe03 into elastic:main Aug 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch >test Issues or PRs that are addressing/adding tests v8.16.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] SharedBlobCacheServiceTests testGetMultiThreaded failing

4 participants