[SPARK-48061][SQL][TESTS] Parameterize max limits of `spark.sql.test.randomDataGenerator` #46305

dongjoon-hyun · 2024-04-30T16:48:39Z

What changes were proposed in this pull request?

This PR aims to parameterize MAX_ARR_SIZE, MAX_MAP_SIZE, and MAX_STR_LEN of spark.sql.test.randomDataGenerator by supporting the following properties:

spark.sql.test.randomDataGenerator.maxArraySize
spark.sql.test.randomDataGenerator.maxMapSize
spark.sql.test.randomDataGenerator.maxStrLen
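
As a rough illustration of this kind of parameterization, the sketch below reads the three properties as JVM system properties and falls back to a default when one is unset. It is not the actual RandomDataGenerator code: the `RandomDataLimits` name, the `intProp` helper, and the default values are assumptions made for the example.

```scala
// Hypothetical sketch; names and defaults are illustrative, not Spark's source.
final case class RandomDataLimits(maxStrLen: Int, maxArraySize: Int, maxMapSize: Int)

object RandomDataLimits {
  private val Prefix = "spark.sql.test.randomDataGenerator"

  // Reads an integer JVM system property, using `default` when it is unset.
  def intProp(key: String, default: Int): Int =
    sys.props.get(key).map(_.trim.toInt).getOrElse(default)

  // Resolves all three limits at call time; the defaults are assumed placeholders.
  def load(): RandomDataLimits = RandomDataLimits(
    maxStrLen = intProp(s"$Prefix.maxStrLen", 1024),
    maxArraySize = intProp(s"$Prefix.maxArraySize", 128),
    maxMapSize = intProp(s"$Prefix.maxMapSize", 128)
  )
}
```

With a layout like this, passing `-Dspark.sql.test.randomDataGenerator.maxStrLen=100` to the JVM would change the limit without editing or recompiling the source, which is the behavior this PR is after.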

Why are the changes needed?

Apache Spark already has code that needs these parameters. Supporting them as properties lets developers set the limits without changing and recompiling the source code.

spark/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryHashPartitionVerifySuite.scala

Lines 155 to 156 in 0329479

    // To limit the golden file size under 10Mb, please set the final val MAX_STR_LEN: Int = 100
    // and final val MAX_ARR_SIZE: Int = 4 in org.apache.spark.sql.RandomDataGenerator

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual.

BEFORE (golden file size: 269M)

$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *StreamingQueryHashPartitionVerifySuite"

$ ls -alh ./sql/core/target/scala-2.13/test-classes/structured-streaming/partition-tests/rowsAndPartIds
-rw-r--r--  1 dongjoon  staff   269M Apr 30 09:55 ./sql/core/target/scala-2.13/test-classes/structured-streaming/partition-tests/rowsAndPartIds

AFTER (golden file size: 5.8M)

$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *StreamingQueryHashPartitionVerifySuite" \
-Dspark.sql.test.randomDataGenerator.maxStrLen=100 \
-Dspark.sql.test.randomDataGenerator.maxArraySize=4

$ ls -alh ./sql/core/target/scala-2.13/test-classes/structured-streaming/partition-tests/rowsAndPartIds
-rw-r--r--  1 dongjoon  staff   5.8M Apr 30 09:56 ./sql/core/target/scala-2.13/test-classes/structured-streaming/partition-tests/rowsAndPartIds
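
The manual `ls` check could also be scripted. The following is a minimal sketch, assuming the 10 MB ceiling mentioned in the suite's comment; the `GoldenFileSizeCheck` object and its `withinLimit` helper are hypothetical and not part of Spark.

```scala
import java.nio.file.{Files, Path, Paths}

// Hypothetical helper; not part of Spark.
object GoldenFileSizeCheck {
  // Assumed ceiling, based on the "under 10Mb" note in the suite's comment.
  val MaxBytes: Long = 10L * 1024 * 1024

  // True when `path` is a regular file strictly smaller than `maxBytes`.
  def withinLimit(path: Path, maxBytes: Long = MaxBytes): Boolean =
    Files.isRegularFile(path) && Files.size(path) < maxBytes

  def main(args: Array[String]): Unit = {
    // Default path is the golden file location shown in the transcript above.
    val default =
      "sql/core/target/scala-2.13/test-classes/structured-streaming/partition-tests/rowsAndPartIds"
    val path = Paths.get(args.headOption.getOrElse(default))
    val verdict = if (withinLimit(path)) "within" else "over (or missing)"
    println(s"$path: $verdict the $MaxBytes-byte limit")
  }
}
```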

Was this patch authored or co-authored using generative AI tooling?

No.


dongjoon-hyun · 2024-04-30T16:59:12Z

Hi, @viirya . Could you review this PR, please?

viirya

Looks good to me. Thanks @dongjoon-hyun

dongjoon-hyun · 2024-04-30T17:51:46Z

Thank you, @viirya !

Since this is irrelevant to CI and I verified it manually, I'll merge this.

Merged to master for Apache Spark 4.0.0.

Closes apache#46305 from dongjoon-hyun/SPARK-48061.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

[SPARK-48061][SQL][TESTS] Parameterize max limits of `spark.sql.test.randomDataGenerator` (87fb5f8)

github-actions bot added the SQL label Apr 30, 2024

viirya approved these changes Apr 30, 2024


dongjoon-hyun closed this in 9caa6f7 Apr 30, 2024

dongjoon-hyun deleted the SPARK-48061 branch April 30, 2024 17:52
