ZEPPELIN-3401. Deadlock while restarting interpreter#2937

Closed
zjffdu wants to merge 1 commit into apache:master from zjffdu:ZEPPELIN-3401
Conversation

@zjffdu
Contributor

@zjffdu zjffdu commented Apr 23, 2018

What is this PR for?

I suspect this is caused by a deadlock between LifecycleThread & CronJobThread:

Thread | Locked | Waiting
-- | -- | --
LifecycleThread | InterpreterGroup | Note
CronJobThread | Note | Wait for Paragraph to finish (Paragraph cannot finish because it needs the lock of InterpreterGroup)

This PR would eliminate the need for LifecycleThread to lock on Note.
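The lock cycle in the table can be illustrated with a small standalone sketch. This is not Zeppelin code: the class `LockOrderSketch`, the method `run`, and the two plain lock objects are hypothetical stand-ins for the monitors of InterpreterGroup and Note, and the timeout is only there so the demo reports the hang instead of blocking forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class LockOrderSketch {

    /**
     * Plays the three actors once and returns true if all of them finish in time.
     * lifecycleLocksNote = true mimics the old behaviour (lifecycle thread also locks the note).
     */
    public static boolean run(boolean lifecycleLocksNote) {
        // Hypothetical stand-ins for the InterpreterGroup and Note monitors.
        final Object interpreterGroupLock = new Object();
        final Object noteLock = new Object();

        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // stuck threads must not keep the JVM alive
            return t;
        });
        CountDownLatch groupHeld = new CountDownLatch(1);
        CountDownLatch noteHeld = new CountDownLatch(1);

        // "LifecycleThread": holds InterpreterGroup, then (before the fix) wants Note.
        Future<?> lifecycle = pool.submit(() -> {
            synchronized (interpreterGroupLock) {
                groupHeld.countDown();
                noteHeld.await();
                if (lifecycleLocksNote) {
                    synchronized (noteLock) { /* touch the note */ }
                }
            }
            return null;
        });

        // "CronJobThread": holds Note and waits for the paragraph to finish.
        Future<?> cron = pool.submit(() -> {
            synchronized (noteLock) {
                noteHeld.countDown();
                groupHeld.await();
                Future<?> paragraph = pool.submit(() -> {
                    // The paragraph needs the InterpreterGroup lock to run.
                    synchronized (interpreterGroupLock) { /* run paragraph */ }
                });
                paragraph.get();
            }
            return null;
        });

        try {
            lifecycle.get(2, TimeUnit.SECONDS);
            cron.get(2, TimeUnit.SECONDS);
            return true;
        } catch (TimeoutException | ExecutionException | InterruptedException e) {
            return false; // a timeout here means the threads are stuck in the cycle
        } finally {
            pool.shutdownNow(); // interrupts the cron thread's wait so the cycle unwinds
        }
    }

    public static void main(String[] args) {
        System.out.println("lifecycle locks note:   finished = " + run(true));
        System.out.println("lifecycle skips note:   finished = " + run(false));
    }
}
```

With `run(true)` the three threads wait on each other exactly as in the table and the call times out; with `run(false)` the lifecycle thread never asks for the note lock, so the cycle cannot form.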

What type of PR is it?

[Bug Fix ]

Todos

  • - Task

What is the Jira issue?

  • https://issues.apache.org/jira/browse/ZEPPELIN-3401

How should this be tested?

Screenshots (if appropriate)

Questions:

  • Do the license files need an update? No
  • Are there breaking changes for older versions? No
  • Does this need documentation? No

@zjffdu
Contributor Author

zjffdu commented Apr 23, 2018

cc @weand

@jongyoul
Member

IMHO, it would be better to use ReentrantReadWriteLock. What do you think of it?

@zjffdu
Contributor Author

zjffdu commented Apr 23, 2018

It is used in just 2 places, so synchronized should be fine. ReadWriteLock is better suited to scenarios with many reads & few writes.
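For readers weighing the two options, here is a minimal sketch of the same guarded state written both ways. The class names `GuardSketch`, `SynchronizedCounter`, and `ReadWriteCounter` are hypothetical and unrelated to the actual patch; only `ReentrantReadWriteLock` is a real JDK class.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class GuardSketch {

    /** Plain monitor: every access, read or write, is mutually exclusive. */
    public static class SynchronizedCounter {
        private int value;

        public synchronized void increment() {
            value++;
        }

        public synchronized int get() {
            return value;
        }
    }

    /** Read-write lock: many readers may hold the read lock at once; writers are exclusive. */
    public static class ReadWriteCounter {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private int value;

        public void increment() {
            lock.writeLock().lock();
            try {
                value++;
            } finally {
                lock.writeLock().unlock();
            }
        }

        public int get() {
            lock.readLock().lock();
            try {
                return value;
            } finally {
                lock.readLock().unlock();
            }
        }
    }

    public static void main(String[] args) {
        SynchronizedCounter a = new SynchronizedCounter();
        ReadWriteCounter b = new ReadWriteCounter();
        a.increment();
        b.increment();
        System.out.println(a.get() + " " + b.get());
    }
}
```

The read-write variant only pays off when reads are frequent and concurrent; with two call sites, the extra lock/unlock boilerplate buys little, which is the trade-off the comment above describes.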

@weand
Contributor

weand commented Apr 23, 2018

Today we had that bug again (seems to occur daily) without the patch added to PROD.
Now just added the patch to PROD. Will wait another 48h from now before I tell you that it's working.

@weand
Contributor

weand commented Apr 25, 2018

so far no more issues. LGTM 👍

@zjffdu
Contributor Author

zjffdu commented Apr 25, 2018

Thanks for verifying it, @weand

@asfgit asfgit closed this in 3fb878b Apr 26, 2018
asfgit pushed a commit that referenced this pull request Apr 26, 2018
### What is this PR for?

I suspect this is caused by a deadlock between LifecycleThread & CronJobThread:

Thread | Locked | Waiting
-- | -- | --
LifecycleThread | InterpreterGroup | Note
CronJobThread | Note | Wait for Paragraph to finish (Paragraph cannot finish because it needs the lock of InterpreterGroup)

This PR would eliminate the need for LifecycleThread to lock on Note.

### What type of PR is it?
[Bug Fix ]

### Todos
* [ ] - Task

### What is the Jira issue?
* https://issues.apache.org/jira/browse/ZEPPELIN-3401

### How should this be tested?
* First time? Setup Travis CI as described on https://zeppelin.apache.org/contribution/contributions.html#continuous-integration
* Strongly recommended: add automated unit tests for any new or changed behavior
* Outline any manual steps to test the PR here.

### Screenshots (if appropriate)

### Questions:
* Do the license files need an update? No
* Are there breaking changes for older versions? No
* Does this need documentation? No

Author: Jeff Zhang <zjffdu@apache.org>

Closes #2937 from zjffdu/ZEPPELIN-3401 and squashes the following commits:

5ffcc11 [Jeff Zhang] ZEPPELIN-3401. Deadlock while restarting interpreter

(cherry picked from commit 3fb878b)
Signed-off-by: Jeff Zhang <zjffdu@apache.org>
prabhjyotsingh pushed a commit to prabhjyotsingh/zeppelin that referenced this pull request Jul 4, 2018
(cherry picked from commit 6ce81b5)

Change-Id: I311c44cc5912d6c9d8ecaa09a8c3b614c66034e7
mckartha pushed a commit to syntechdev/zeppelin that referenced this pull request Aug 9, 2018
mckartha pushed a commit to syntechdev/zeppelin that referenced this pull request Aug 9, 2018
(cherry picked from commit 3fb878b)
Signed-off-by: Jeff Zhang <zjffdu@apache.org>