Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-15635: don't repeat close hooks if SRI cleared twice #376

Merged
merged 17 commits into from Nov 15, 2021

Conversation

mkhludnev
Copy link
Member

https://issues.apache.org/jira/browse/SOLR-15635

  • SOLR-15635: close hooks are invoked twice if SolrRequestInfo cleared twice.

Description

User impact: /export?q={!join NOscore_param fromIndex=is_closed_by_this_req ...}...

Solution

Clear SRI.closeHooks when they re invoked.

Tests

Added test for /export?q={!join}. And unit test for SRI particularly.

Checklist

Please review the following and check all that apply:

  • [v] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • [v] I have created a Jira issue and added the issue ID to my pull request title.
  • [v] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • [v] I have developed this patch against the main branch.
  • [v] I have run ./gradlew check.
  • [v] I have added tests for my changes.
  • [-] I have added documentation for the Reference Guide

@mkhludnev
Copy link
Member Author

Hi, @dsmiley @NazerkeBS!
You review is quite appreciated!

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@mkhludnev
Copy link
Member Author

I don't like SRI.addCloseHookStrict() too, @dsmiley. As another band-aid I can put that atomic boolean wrapper into JoinQuery only. WDYT, @madrob ?
The question closer to the root cause is why SRI.getInheritableThreadLocalProvider().clean() not only cleans threadlocal (which make sense for thread returning into the pool) but also closes SRI which it hasn't created.

@madrob
Copy link
Contributor

madrob commented Nov 1, 2021

When I was working on SOLR-15555, the current behavior of closing the SRI helped me find a bug, so I'm reasonably confident that we don't want to change things to strict. If we do, then we should treat it like SolrCore close, where if there are too many calls to it then we log it and ask the person to create an issue - it more than likely means a bug in our code already.

SRI.getInheritableThreadLocalProvider().clean() is supposed to remove the context done by set - looks like this came from SOLR-14384 and maybe we didn't get all of the usages there?

I'm still speaking in generalities here, but I'm getting more and more convinced that the problem isn't with SRI.

@mkhludnev
Copy link
Member Author

mkhludnev commented Nov 3, 2021

@madrob, honestly, I hardly follow.

we don't want to change things to strict. .. then we log it and ask the person to create an issue

we do logging when refcount==-1, but when refcount==0 it's regular closing, and core should be closed even with no WARN log. And that's what happen here when /export?q={!join}, and btw and user can close almost any core of reachable instance.
I'm not sure what you mean in usages of SRI.getInheritableThreadLocalProvider().clean(). I still think this clean() shouldn't close SRI, but just clean pool's threadlocal. I have an idea how to test it. But I don't know how to distinguish between old and potentially new test failures if I commit it. Does it sounds like an idea for the third attempt?

@madrob
Copy link
Contributor

madrob commented Nov 3, 2021

I thought the problem here was that we end up calling close too many times, right? So like on SolrCore, if we call close on the hook multiple times we could log about it instead of silently ignoring the other times.

clean closing SRI is fine because it's supposed to be mirrored by creating a new entry on the SRI stack during set. So if something is calling clean to call close, that means it didn't call set before.

@mkhludnev
Copy link
Member Author

mkhludnev commented Nov 5, 2021

the problem here was that we end up calling close too many times, right?

Exactly SRI.closeHook() is invoked twice, one from the request thread, the second one due to cleaning pool's thread.

if we call close on the hook multiple times we could log about it instead of silently ignoring the other times.

That's almost what's done in wrapper for closeable. PR: mkhludnev@36fd4aa#diff-7b4f31855280c776cee1dde2117f04a2fa9fcaf971d2a1b15cfd3f9978896394R210 ok. I can move AtomicBoolean closed into SRI. And then, you propose to just log it, without preventing closing core? How it helps users who have core closed after /export?q={!join}?

clean closing SRI is fine because it's supposed to be mirrored by creating a new entry on the SRI stack during set.

Not really. We have single SRI instance, it get into request's thread stack and into pool's thread stack. That's how single SRI have set twice. Both stacks should be clean for sure, but closeables needs to be invoked once IMO.

@mkhludnev
Copy link
Member Author

rolled back SRI.addCloseHookStrict() pushed more accurate version:

  1. SRI.InheritableThreadlocal clears SRI but doesn't close SRI
  2. introduced SRI.closed. it fails only test but just logs error if closeHooks() are repeatedly called.

Opinions, @madrob, @dsmiley , @NazerkeBS ?

@dsmiley
Copy link
Contributor

dsmiley commented Nov 7, 2021

Thanks for pushing forward on this Mikhail!

I ran tests and found that org.apache.solr.client.solrj.io.stream.CloudAuthStreamTest#testDaemonUpdateStreamInsufficientCredentials failed -- somehow setRequestInfo was being called with an already-closed SRI. The call occurred in org.apache.solr.common.util.ExecutorUtil.InheritableThreadLocalProvider#set for SRI. It appears that daemonStream has a lifespan that surpasses the parent request. That makes sense -- it'a a daemon. Maybe that executor shouldn't inherit the SRI because it's in the background and will exceed the parent request? I lean this way. Any opinion @joel-bernstein ? Or we could switch to a refCount approach so that this case and maybe others will close only once it's removed from the last SRI ThreadLocal stack. I just started playing with that but I fear there is a race between the original request closing down and the new daemon thread in the pool grabbing a reference to the SRI that will be closed in any millisecond... either could win to increase/decrease the refCount, meaning that sometimes the SRI would be inherited and sometimes not -- clearly not good.

@mkhludnev
Copy link
Member Author

Sigh..
It seems SRI.closed can't be introduced. If a request thread spins off MDCAwareThreadPool it's not obvious who should close SRI. In master branch, it's closed twice. In this branch it's closed in the request thread. The problem that request thread might close it before it will be used in thread pool. At least it's what happen in code right now. It may make harm some cases, but it's out of scope of this issue.

@mkhludnev
Copy link
Member Author

we have (at least) two tread pool usages https://github.com/mkhludnev/solr/blob/SOLR-15635-clear_SRI_twice/solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/DaemonStream.java#L208

public void run(Callable<Boolean> writer) throws IOException {

it both occurrences there are just shutdown/Now() without awaitTermination() it introduces races in closing SRI. @joel-bernstein in your opinion, can we call awaitTermination() there both? In this case request thread could close SRI synchronously. As an alternative idea, we can do SRI nesting before launching those pools, but it should be a special kind of SRI, which will be closed by the pool? Is it a good band-aid?

@dsmiley
Copy link
Contributor

dsmiley commented Nov 10, 2021

can we call awaitTermination()

In at least daemon() -- no; it misses the point.

Thinking about this more... I see a way to fix the race I described in the scenario I described last. In org.apache.solr.common.util.ExecutorUtil.InheritableThreadLocalProvider#store (which is called by the thread submitting to the threadPool, increase a refCount there. Then, it doesn't matter wether the Solr request thread finishes first or if the pool's Runnable completes first. Whichever happens last will do the actual close'ing. Make sense?

@mkhludnev
Copy link
Member Author

In at least daemon() -- no; it misses the point.

Yep. I'm missing the point. 1) Why that daemon is shut down right there? Why it's done asynchronously w/o awaitTerm?

org.apache.solr.common.util.ExecutorUtil.InheritableThreadLocalProvider#store (which is called by the thread submitting to the threadPool, increase a refCount there.

well... not really. In ITHP.store() there's no incremented refcount yet. Pool just can initiate core closing via hook.

I can try to stack new SRI right before spinning the pools, delegating closing to those pools.

@dsmiley
Copy link
Contributor

dsmiley commented Nov 10, 2021

Yep. I'm missing the point. 1) Why that daemon is shut down right there? Why it's done asynchronously w/o awaitTerm?

By definition, a daemon lives beyond the request thread, and that's what this particular streaming expression is expressly for (by-design). Reminder: ExecutorService.shutdown() doesn't cancel the already running tasks, but it ensures it closes itself once the task(s) are done.

well... not really. In ITHP.store() there's no incremented refcount yet.

I'm making a proposal; there are no refCounts yet -- indeed. I have some WIP that I could commit here. Let me know if you'd like to see.

I can try to stack new SRI right before spinning the pools, delegating closing to those pools.

I don't really follow but maybe you can show or explain further? If you're saying this pool will create a new SolrRequestInfo (special in some way)... I'm not sure I like It but we'll see I guess.

@mkhludnev
Copy link
Member Author

I don't really follow but maybe you can show or explain further?

Anyway. I tried. it didn't worked out. I'd be happy to have a look on your approach.

@mkhludnev
Copy link
Member Author

Well cool! You've got it, @dsmiley. I slightly improved the test. When it's a time to commit it?

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any concerns @madrob ? Mikhail and I are good with it.

solr/CHANGES.txt Outdated Show resolved Hide resolved
private static void closeHooks(SolrRequestInfo info) {
if (info.closeHooks != null) {
for (Closeable hook : info.closeHooks) {
private synchronized void close() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the whole method need to be sync?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reducing synchronized scope to condition seems working. Is it worth to push into PRs' branch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be the whole method -- for the refCount & closeHook manipulation. Besides; it shouldn't be contented.

@mkhludnev mkhludnev merged commit dbdd2d4 into apache:main Nov 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants