
[SPARK-24525][SS] Provide an option to limit number of rows in a MemorySink #21559

Closed
wants to merge 16 commits

Conversation

@mukulmurthy (Contributor) commented Jun 13, 2018

What changes were proposed in this pull request?

Provide an option to limit the number of rows in a MemorySink. Currently, MemorySink and MemorySinkV2 are unbounded in size, meaning that if they're used on big data, they can OOM the stream. This change adds a maxMemorySinkRows option to limit how many rows MemorySink and MemorySinkV2 can hold. By default, they are still unbounded.
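For context, a hedged sketch of how such an option might be set on a streaming query. The option key "maxRows" follows the reviewer's naming suggestion later in this thread; the key, query name, and table name here are all assumptions, not the confirmed merged API.

```scala
// Hypothetical usage sketch; requires a SparkSession and a streaming
// DataFrame. The "maxRows" option key is an assumption based on this
// PR's review discussion and may differ from the merged API.
val query = streamingDF.writeStream
  .format("memory")
  .queryName("debug_table")   // results become queryable as this in-memory table
  .option("maxRows", 1000)    // cap how many rows the sink retains
  .outputMode("append")
  .start()
```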

How was this patch tested?

Added new unit tests.

@mukulmurthy (Contributor, Author)

@jose-torres @brkyvz for review

@brkyvz (Contributor) commented Jun 13, 2018

ok to test

@brkyvz (Contributor) commented Jun 13, 2018

Jenkins add to whitelist

@JoshRosen (Contributor)

jenkins add to whitelist

}

private def truncateRowsIfNeeded(rows: Array[Row], maxRows: Int, batchId: Long): Array[Row] = {
if (rows.length > maxRows) {
@mukulmurthy (Contributor, Author):

Also adding a check here to make sure maxRows >= 0. It shouldn't ever be negative, but doesn't hurt to safeguard.

@SparkQA commented Jun 14, 2018

Test build #91796 has finished for PR 21559 at commit f981cb8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 14, 2018

Test build #91799 has finished for PR 21559 at commit 4ab9bda.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

numRows = 0
}

private def truncateRowsIfNeeded(rows: Array[Row], maxRows: Int, batchId: Long): Array[Row] = {
@jose-torres (Contributor) commented Jun 14, 2018:

nit: I'd document that maxRows is the remaining row capacity, not the maximum row limit defined in the options. I got confused for a minute here.

numRows = 0
}

private def truncateRowsIfNeeded(rows: Array[Row], maxRows: Int, batchId: Long): Array[Row] = {
Reviewer comment:

Can this go in MemorySinkBase?

@jose-torres (Contributor):

lgtm

* Companion object to MemorySinkBase.
*/
object MemorySinkBase {
val MAX_MEMORY_SINK_ROWS = "maxMemorySinkRows"
Reviewer comment:

maxRows is sufficient

}
}


Reviewer comment:

nit: remove extra line

*/
def getMemorySinkCapacity(options: DataSourceOptions): Option[Int] = {
val maxRows = options.getInt(MAX_MEMORY_SINK_ROWS, MAX_MEMORY_SINK_ROWS_DEFAULT)
if (maxRows >= 0) Some(maxRows) else None
Reviewer comment:

Do you want to do `if (maxRows >= 0) maxRows else Int.MaxValue - 10`?
We can't exceed the runtime's maximum array size anyway.
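A stand-alone sketch of the reviewer's suggestion. This is a simplified stand-in: Spark's DataSourceOptions is replaced by a plain Map, and the method/constant names are assumptions for illustration.

```scala
// Simplified sketch of a capacity getter applying the reviewer's
// suggestion: clamp the "no limit" case to just under the runtime's
// maximum array size instead of returning an Option.
object MemorySinkCapacitySketch {
  // Sentinel default meaning "no limit was configured".
  val MaxMemorySinkRowsDefault: Int = -1

  def getMemorySinkCapacity(options: Map[String, String]): Int = {
    val maxRows = options.get("maxRows").map(_.toInt)
      .getOrElse(MaxMemorySinkRowsDefault)
    // Negative means unset: fall back near Int.MaxValue, since an
    // Array can't exceed the runtime array size limit anyway.
    if (maxRows >= 0) maxRows else Int.MaxValue - 10
  }
}
```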

@@ -81,22 +84,35 @@ class MemorySinkV2 extends DataSourceV2 with StreamWriteSupport with MemorySinkB
}.mkString("\n")
}

def write(batchId: Long, outputMode: OutputMode, newRows: Array[Row]): Unit = {
def write(batchId: Long, outputMode: OutputMode, newRows: Array[Row], sinkCapacity: Option[Int])
Reviewer comment:

nit: our style is more like

  def write(
    batchId: Long,
    outputMode: OutputMode,
    newRows: Array[Row],
    sinkCapacity: Option[Int]): Unit = {


private def truncateRowsIfNeeded(rows: Array[Row], maxRows: Int, batchId: Long): Array[Row] = {
if (rows.length > maxRows) {
logWarning(s"Truncating batch $batchId to $maxRows rows")
Reviewer comment:

How does take behave with a negative count? Printing a warning message with negative values would be odd. I would also include the sink limit in the warning.
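For reference, Scala's Array.take returns an empty array for a negative count, so a negative limit would silently drop every row (while the warning reported a negative number) rather than throw:

```scala
val rows = Array("a", "b", "c")
// take with a negative count yields an empty array rather than throwing
val truncated = rows.take(-1)
println(truncated.length)  // 0
// the normal case keeps the first n elements
val normal = rows.take(2)
println(normal.mkString(","))  // a,b
```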

@SparkQA commented Jun 14, 2018

Test build #91861 has finished for PR 21559 at commit b2ef59c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 14, 2018

Test build #91864 has finished for PR 21559 at commit 25d6de1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait MemorySinkBase extends BaseStreamingSink with Logging

@SparkQA commented Jun 15, 2018

Test build #91868 has finished for PR 21559 at commit e5b6175.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor) left a comment:

LGTM. Just two minor nits.

* Gets the max number of rows a MemorySink should store. This number is based on the memory
* sink row limit if it is set. If not, there is no limit.
* @param options Options for writing from which we get the max rows option
* @return The maximum number of rows a memorySink should store, or None for no limit.
Reviewer comment:

need to update docs

sinkLimit: Int,
batchId: Long): Array[Row] = {
if (rows.length > batchLimit && batchLimit >= 0) {
logWarning(s"Truncating batch $batchId to $batchLimit rows because of sink limit $sinkLimit")
Reviewer comment:

nit: I'm not sure whether these sinks get used by continuous processing too. If so, I would rename "batch" to "trigger version".

@mukulmurthy (Contributor, Author):

This piece is shared by MemorySink and MemorySinkV2, and the MemorySinkV2 (continuous processing) sink still calls them batches.
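Putting the review feedback together, a stand-alone sketch of the shared truncation helper. The generic element type (in place of Spark's Row) and the println (in place of logWarning) are simplifications to keep the example self-contained; parameter names follow the diff excerpts above.

```scala
// Simplified sketch of the shared truncateRowsIfNeeded helper.
// Note: batchLimit is the *remaining* capacity of the sink for this
// batch, not the configured sink limit (per the review comment above).
def truncateRowsIfNeeded[T](
    rows: Array[T],
    batchLimit: Int,
    sinkLimit: Int,
    batchId: Long): Array[T] = {
  // Guard against a negative limit so we never log a negative count
  // or silently drop everything via take(negative).
  if (rows.length > batchLimit && batchLimit >= 0) {
    println(s"Truncating batch $batchId to $batchLimit rows because of sink limit $sinkLimit")
    rows.take(batchLimit)
  } else {
    rows
  }
}
```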

@SparkQA commented Jun 15, 2018

Test build #91926 has finished for PR 21559 at commit 0402b60.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor) commented Jun 15, 2018

Thanks! Merging to master!

@asfgit asfgit closed this in e4fee39 Jun 15, 2018
@mukulmurthy mukulmurthy deleted the SPARK-24525 branch July 11, 2018 22:33
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
[SPARK-24525][SS] Provide an option to limit number of rows in a MemorySink

Author: Mukul Murthy <mukul.murthy@databricks.com>

Closes apache#21559 from mukulmurthy/SPARK-24525.

Ref: LIHADOOP-48531

RB=1852593
G=superfriends-reviewers
R=mshen,fli,latang,yezhou,zolin
A=