Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ZEPPELIN-2502] RemoteInterpreterServer hang forever during shutdown #2322

Closed
wants to merge 1 commit into from

Conversation

andreaTP
Copy link
Contributor

@andreaTP andreaTP commented May 4, 2017

What is this PR for?

There is the chance to have a RemoteServerInterpreter hang forever during shutdown

What type of PR is it?

[Bug Fix]

What is the Jira issue?

[ZEPPELIN-2502]

How should this be tested?

Unit test provided for the fix.

Questions:

  • Is there breaking changes for older versions?
  • Does this needs documentation?

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like pollEvent() is also under synchronized (eventQueue) - doesn't that mean while it waits, nothing can remove items from the eventQueue?

@andreaTP
Copy link
Contributor Author

andreaTP commented May 5, 2017

@felixcheung good catch, after a second look I can see that while it is waiting on pollEvent it will also delay the sendEvent since it is synchronized too.

I guess all these problems are already solved by Java standard lib, if I have positive feedback I will go substituting eventQueue with a LinkedBlockingQueue and removing all synchronized statements since they become unuseful.

@felixcheung
Copy link
Member

I think you are getting a 👍

@@ -477,15 +477,18 @@ public void onParaInfosReceived(Map<String, String> infos) {
/**
* Wait for eventQueue becomes empty
*/
public void waitForEventQueueBecomesEmpty() {
public void waitForEventQueueBecomesEmpty(long atMost) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice if atMost indicated what time unit it wants. Rename it to atMostMilliseconds?

Copy link
Contributor Author

@andreaTP andreaTP May 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, and I harmonize this while addressing your comments below
( deadline etc.)

@@ -111,7 +113,8 @@ public void shutdown() throws TException {
// this case, need to force kill the process

long startTime = System.currentTimeMillis();
while (System.currentTimeMillis() - startTime < 2000 && server.isServing()) {
while (System.currentTimeMillis() - startTime < DEFAULT_SHUTDOWN_TIMEOUT &&
Copy link
Contributor

@FireArrow FireArrow May 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While not really part of the scope of this PR, it is a chance to rework the logic to remove some unnecessary arithmetic and improve readability a bit.
Change long startTime = System.currentTimeMillis(); to long deadline = System.currentTimeMillis() + DEFAULT_SHUTDOWN_TIMEOUT;
Change while (System.currentTimeMillis() - startTime < DEFAULT_SHUTDOWN_TIMEOUT && to while (System.currentTimeMillis() < deadline) &&

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@andreaTP
Copy link
Contributor Author

andreaTP commented May 8, 2017

Looking at the code it looks quite harmful also the pattern used here, here and here

if anything goes wrong the Thread can stay in wait status potentially forever and operations performed there are all but not transactional.

Could I ask if somebody can at least provide some kind of deadline also for those?
Better will be a little refactoring using something like ConcurrentHashMap for operations.

I can even do the refactoring if I have positive feedbacks, anyhow I believe this PR is ok as is and I open a new issue and new PR in case.

@FireArrow
Copy link
Contributor

I'm all for refactoring the use of synchronized into using more proper data types instead. It's to easy to get stuck in a [live|dead]lock when doing the synchronized yourself

@andreaTP
Copy link
Contributor Author

andreaTP commented May 9, 2017

I have had to fix some escaping problems (I could reproduce the problem locally manually) to make travis happy, and it also required some retries ....
anyhow travis is now happy: https://travis-ci.org/nokia/zeppelin/builds/230331180

I hope this PR is addressed and we can move on.
I still wait hint from @felixcheung and @Leemoonsoo on how to proceed on the rest.

thanks in advance

@Leemoonsoo
Copy link
Member

LGTM

Looking at the code it looks quite harmful also the pattern used here, here and here

Further refactoring on codes pointed out sounds good as well.

@andreaTP
Copy link
Contributor Author

am I supposed to do anything else?

@Leemoonsoo
Copy link
Member

@andreaTP Changes on file ZeppelinIT.java are required? And It'll be great if this branch is rebased (or merge) to master and see CI becomes green.

@andreaTP
Copy link
Contributor Author

@Leemoonsoo I can reproduce the error locally running on Chrome (i.e. the current encoding of the test doesn't work for me out of the box with chrome), should I separate these modifications in a separate PR?

@Leemoonsoo
Copy link
Member

@andreaTP Yes. separate PR would be more helpful.

@andreaTP
Copy link
Contributor Author

@Leemoonsoo sounds good :-) I'll do that tomorrow!
and I will follow up with this PR. Thanks!

@Leemoonsoo
Copy link
Member

Thanks @andreaTP !

@andreaTP andreaTP force-pushed the processHang branch 2 times, most recently from c9e2a30 to e58483e Compare May 16, 2017 09:07
@andreaTP
Copy link
Contributor Author

I reverted back modifications to concurrency management since it looks like they result in Ui response instability, I'm not sure why this happens, I believe that we can conclude with a clean PR like this.
In case I manage to refactor and use proper java.util.concurrent into remote I will send another PR.
WDYT? @Leemoonsoo @felixcheung

@andreaTP
Copy link
Contributor Author

ping

@Leemoonsoo
Copy link
Member

Merge to master and branch-0.7 if no further comment

asfgit pushed a commit that referenced this pull request May 20, 2017
### What is this PR for?
There is the chance to have a RemoteServerInterpreter hang forever during shutdown

### What type of PR is it?
[Bug Fix]

### What is the Jira issue?
[ZEPPELIN-2502]

### How should this be tested?
Unit test provided for the fix.

### Questions:
* Is there breaking changes for older versions?
* Does this needs documentation?

Author: andrea <andrea.peruffo1982@gmail.com>

Closes #2322 from andreaTP/processHang and squashes the following commits:

e58483e [andrea] [ZEPPELIN-2502] RemoteInterpreterServer hang forever during shutdown

(cherry picked from commit c7c9aa1)
Signed-off-by: Lee moon soo <moon@apache.org>
@asfgit asfgit closed this in c7c9aa1 May 20, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants