Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-28709][DSTREAMS] Fix StreamingContext leak through Streaming #25439

Conversation

choojoyq
Copy link

@choojoyq choojoyq commented Aug 13, 2019

What changes were proposed in this pull request?

In my application spark streaming is restarted programmatically by stopping StreamingContext without stopping of SparkContext and creating/starting a new one. I use it for automatic detection of Kafka topic/partition changes and automatic failover in case of non fatal exceptions.

However i notice that after multiple restarts driver fails with OOM. During investigation of heap dump i figured out that StreamingContext object isn't cleared by GC after stopping.

Screen Shot 2019-08-14 at 12 23 33

There are several places which holds reference to it :

  1. StreamingTab registers StreamingJobProgressListener which holds reference to Streaming Context directly to LiveListenerBus shared queue via ssc.sc.addSparkListener(listener) method invocation. However this listener isn't unregistered at stop method.
  2. json handlers (/streaming/json and /streaming/batch/json) aren't unregistered in SparkUI, while they hold reference to StreamingJobProgressListener. Basically the same issue affects all the pages, i assume that renderJsonHandler should be added to pageToHandlers cache on attachPage method invocation in order to unregistered it as well on detachPage.
  3. SparkUi holds reference to StreamingJobProgressListener in the corresponding local variable which isn't cleared after stopping of StreamingContext.

How was this patch tested?

Added tests to existing test suites.
After i applied these changes via reflection in my app OOM on driver side gone.

@dongjoon-hyun
Copy link
Member

ok to test

@@ -575,6 +577,8 @@ class StreamingContext private[streaming] (
try {
validate()

registerProgressListener()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand you are suggesting that the unregister is important, but do we need this? This seems to be a behavior change.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. I believeStreamingTab shouldn't be responsible for registering/unregistering the listener as it could be and even already used in other place (metrics). Moreover seems there is also a bug that if ui is disabled, listener isn't registered and metrics aren't reported.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I tend to agree. The behavior change is registering listeners when when the UI is disabled, but that seems like possibly a fix (we'd have to look deeper to figure out whether something else registers it in this case, like StreamingSource). This seems like a better place to manage it, certainly unregistering, and if it's unregistered here it should be registered here.

@@ -52,8 +52,6 @@ class InputStreamsSuite extends TestSuiteBase with BeforeAndAfter {

// Set up the streaming context and input streams
withStreamingContext(new StreamingContext(conf, batchDuration)) { ssc =>
ssc.addStreamingListener(ssc.progressListener)

Copy link
Member

@dongjoon-hyun dongjoon-hyun Aug 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you remove the behavior change here( https://github.com/apache/spark/pull/25439/files#r313624949 ), it seems that we don't need to change this, right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.

@dongjoon-hyun
Copy link
Member

Thank you for your first contribution, @choojoyq . I left a few comments.

Hi, @zsxwing . Could you review this PR?

@SparkQA
Copy link

SparkQA commented Aug 13, 2019

Test build #109063 has finished for PR 25439 at commit 2c55f89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@choojoyq choojoyq force-pushed the SPARK-28709-fix-streaming-context-leak-on-stop branch 2 times, most recently from 75935cd to fbfabb0 Compare August 14, 2019 06:31
@choojoyq choojoyq force-pushed the SPARK-28709-fix-streaming-context-leak-on-stop branch from fbfabb0 to 4d5965e Compare August 14, 2019 06:33
@SparkQA
Copy link

SparkQA commented Aug 14, 2019

Test build #109088 has finished for PR 25439 at commit 4d5965e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@choojoyq
Copy link
Author

@dongjoon-hyun thanks for the so fast review.

@dongjoon-hyun
Copy link
Member

Retest this please.

@SparkQA
Copy link

SparkQA commented Aug 16, 2019

Test build #109232 has finished for PR 25439 at commit 4d5965e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@choojoyq
Copy link
Author

Hi @zsxwing, could you please take a look ?

@@ -575,6 +577,8 @@ class StreamingContext private[streaming] (
try {
validate()

registerProgressListener()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I tend to agree. The behavior change is registering listeners when when the UI is disabled, but that seems like possibly a fix (we'd have to look deeper to figure out whether something else registers it in this case, like StreamingSource). This seems like a better place to manage it, certainly unregistering, and if it's unregistered here it should be registered here.

@@ -93,6 +93,7 @@ private[spark] abstract class WebUI(
attachHandler(renderJsonHandler)
val handlers = pageToHandlers.getOrElseUpdate(page, ArrayBuffer[ServletContextHandler]())
handlers += renderHandler
handlers += renderJsonHandler
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this new change related?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, handlers store reference to streaming and batch pages which store reference to StreamingJobProgressListener which stores reference to StreamingContext. So all the handlers should be unregistered.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see so that's kind of an additional bug fix, that this never recorded the handler in the list?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, seems that it wasn't an issue in past cause other tabs can't be detached in runtime.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28709][DSTREAMS] - Fix StreamingContext leak through Streaming… [SPARK-28709][DSTREAMS] Fix StreamingContext leak through Streaming Aug 24, 2019
@dongjoon-hyun
Copy link
Member

Retest this please.

@SparkQA
Copy link

SparkQA commented Aug 24, 2019

Test build #109670 has finished for PR 25439 at commit 4d5965e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Aug 26, 2019

Merged to master

@srowen srowen closed this in 9f8c7a2 Aug 26, 2019
@zzcclp
Copy link
Contributor

zzcclp commented Aug 27, 2019

@choojoyq @srowen will this pr be merged into branch-2.4 too?

@srowen
Copy link
Member

srowen commented Aug 27, 2019

Yes I think that's OK. We're in a 'code freeze' for the 2.4.4 release at the moment, so I hesitate to merge anything but critical fixes until it's finalized. But it could go in for 2.4.5.

@zzcclp
Copy link
Contributor

zzcclp commented Aug 27, 2019

OK, thanks.

@dongjoon-hyun
Copy link
Member

Thank you all. Yes. This is a good backport candidate for 2.4.5.

@zzcclp
Copy link
Contributor

zzcclp commented Sep 3, 2019

@choojoyq @srowen @dongjoon-hyun 2.4.4 was released, do you plan to merge this pr into branch-2.4?

srowen pushed a commit that referenced this pull request Sep 3, 2019
In my application spark streaming is restarted programmatically by stopping StreamingContext without stopping of SparkContext and creating/starting a new one. I use it for automatic detection of Kafka topic/partition changes and automatic failover in case of non fatal exceptions.

However i notice that after multiple restarts driver fails with OOM. During investigation of heap dump i figured out that StreamingContext object isn't cleared by GC after stopping.

<img width="1901" alt="Screen Shot 2019-08-14 at 12 23 33" src="https://user-images.githubusercontent.com/13151161/63010149-83f4c200-be8e-11e9-9f48-12b6e97839f4.png">

There are several places which holds reference to it :

1. StreamingTab registers StreamingJobProgressListener which holds reference to Streaming Context directly to LiveListenerBus shared queue via ssc.sc.addSparkListener(listener) method invocation. However this listener isn't unregistered at stop method.
2. json handlers (/streaming/json and /streaming/batch/json) aren't unregistered in SparkUI, while they hold reference to StreamingJobProgressListener. Basically the same issue affects all the pages, i assume that renderJsonHandler should be added to pageToHandlers cache on attachPage method invocation in order to unregistered it as well on detachPage.
3. SparkUi holds reference to StreamingJobProgressListener in the corresponding local variable which isn't cleared after stopping of StreamingContext.

Added tests to existing test suites.
After i applied these changes via reflection in my app OOM on driver side gone.

Closes #25439 from choojoyq/SPARK-28709-fix-streaming-context-leak-on-stop.

Authored-by: Nikita Gorbachevsky <nikitag@playtika.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
@srowen
Copy link
Member

srowen commented Sep 3, 2019

Also backported to 2.4. I resolved a minor merge conflict carefully and think I got it right.

@zzcclp
Copy link
Contributor

zzcclp commented Sep 4, 2019

Thank @srowen

dongjoon-hyun added a commit that referenced this pull request Sep 15, 2019
… avoid CCE

### What changes were proposed in this pull request?

[SPARK-27122](#24088) fixes `ClassCastException` at `yarn` module by introducing `DelegatingServletContextHandler`. Initially, this was discovered with JDK9+, but the class path issues affected JDK8 environment, too. After [SPARK-28709](#25439), I also hit the similar issue at `streaming` module.

This PR aims to fix `streaming` module by adding `getContextPath` to `DelegatingServletContextHandler` and using it.

### Why are the changes needed?

Currently, when we test `streaming` module independently, it fails like the following.
```
$ build/mvn test -pl streaming
...
UISeleniumSuite:
- attaching and detaching a Streaming tab *** FAILED ***
  java.lang.ClassCastException: org.sparkproject.jetty.servlet.ServletContextHandler cannot be cast to org.eclipse.jetty.servlet.ServletContextHandler
...
Tests: succeeded 337, failed 1, canceled 0, ignored 1, pending 0
*** 1 TEST FAILED ***
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the modified tests. And do the following manually.
Since you can observe this when you run `streaming` module test only (instead of running all), you need to install the changed `core` module and use it.

```
$ java -version
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_222-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.222-b10, mixed mode)
$ build/mvn install -DskipTests
$ build/mvn test -pl streaming
```

Closes #25791 from dongjoon-hyun/SPARK-29087.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun added a commit that referenced this pull request Sep 15, 2019
… avoid CCE

### What changes were proposed in this pull request?

[SPARK-27122](#24088) fixes `ClassCastException` at `yarn` module by introducing `DelegatingServletContextHandler`. Initially, this was discovered with JDK9+, but the class path issues affected JDK8 environment, too. After [SPARK-28709](#25439), I also hit the similar issue at `streaming` module.

This PR aims to fix `streaming` module by adding `getContextPath` to `DelegatingServletContextHandler` and using it.

### Why are the changes needed?

Currently, when we test `streaming` module independently, it fails like the following.
```
$ build/mvn test -pl streaming
...
UISeleniumSuite:
- attaching and detaching a Streaming tab *** FAILED ***
  java.lang.ClassCastException: org.sparkproject.jetty.servlet.ServletContextHandler cannot be cast to org.eclipse.jetty.servlet.ServletContextHandler
...
Tests: succeeded 337, failed 1, canceled 0, ignored 1, pending 0
*** 1 TEST FAILED ***
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the modified tests. And do the following manually.
Since you can observe this when you run `streaming` module test only (instead of running all), you need to install the changed `core` module and use it.

```
$ java -version
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_222-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.222-b10, mixed mode)
$ build/mvn install -DskipTests
$ build/mvn test -pl streaming
```

Closes #25791 from dongjoon-hyun/SPARK-29087.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 729b318)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
rluta pushed a commit to rluta/spark that referenced this pull request Sep 17, 2019
In my application spark streaming is restarted programmatically by stopping StreamingContext without stopping of SparkContext and creating/starting a new one. I use it for automatic detection of Kafka topic/partition changes and automatic failover in case of non fatal exceptions.

However i notice that after multiple restarts driver fails with OOM. During investigation of heap dump i figured out that StreamingContext object isn't cleared by GC after stopping.

<img width="1901" alt="Screen Shot 2019-08-14 at 12 23 33" src="https://user-images.githubusercontent.com/13151161/63010149-83f4c200-be8e-11e9-9f48-12b6e97839f4.png">

There are several places which holds reference to it :

1. StreamingTab registers StreamingJobProgressListener which holds reference to Streaming Context directly to LiveListenerBus shared queue via ssc.sc.addSparkListener(listener) method invocation. However this listener isn't unregistered at stop method.
2. json handlers (/streaming/json and /streaming/batch/json) aren't unregistered in SparkUI, while they hold reference to StreamingJobProgressListener. Basically the same issue affects all the pages, i assume that renderJsonHandler should be added to pageToHandlers cache on attachPage method invocation in order to unregistered it as well on detachPage.
3. SparkUi holds reference to StreamingJobProgressListener in the corresponding local variable which isn't cleared after stopping of StreamingContext.

Added tests to existing test suites.
After i applied these changes via reflection in my app OOM on driver side gone.

Closes apache#25439 from choojoyq/SPARK-28709-fix-streaming-context-leak-on-stop.

Authored-by: Nikita Gorbachevsky <nikitag@playtika.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants