[SPARK-27106][SQL] merge CaseInsensitiveStringMap and DataSourceOptions #24025

cloud-fan · 2019-03-08T15:46:23Z

What changes were proposed in this pull request?

It's a little awkward to have 2 different classes (CaseInsensitiveStringMap and DataSourceOptions) to represent the options in the data source and catalog APIs.

This PR merges these 2 classes, while keeping the name CaseInsensitiveStringMap, which is more precise.

How was this patch tested?

existing tests

cloud-fan · 2019-03-08T15:47:35Z

sql/catalyst/src/main/java/org/apache/spark/sql/util/CaseInsensitiveStringMap.java

+   * Returns the boolean value to which the specified key is mapped,
+   * or defaultValue if there is no mapping for the key. The key match is case-insensitive
+   */
+  public boolean getBoolean(String key, boolean defaultValue) {


These 4 methods come from DataSourceOptions. They are pretty general and useful.
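To make the shape of these typed getters concrete, here is a minimal sketch of a case-insensitive option map with a boolean and an integer getter. The class name `OptionMapSketch`, the `normalize` helper, and the choice of lenient `Boolean.parseBoolean` parsing are assumptions made for illustration; this is not the code in the PR, and only `getBoolean(String key, boolean defaultValue)` and an integer getter are visible in the diff excerpts of this thread.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for illustration only; not Spark's CaseInsensitiveStringMap.
final class OptionMapSketch {
    // Keys are stored lower-cased so lookups can ignore case.
    private final Map<String, String> delegate = new HashMap<>();

    OptionMapSketch(Map<String, String> original) {
        for (Map.Entry<String, String> entry : original.entrySet()) {
            delegate.put(normalize(entry.getKey()), entry.getValue());
        }
    }

    private static String normalize(String key) {
        return key.toLowerCase(Locale.ROOT);
    }

    String get(String key) {
        return delegate.get(normalize(key));
    }

    // Returns the boolean value for the key, or defaultValue if the key is absent.
    // Assumption: lenient parsing via Boolean.parseBoolean, which never throws.
    boolean getBoolean(String key, boolean defaultValue) {
        String value = get(key);
        return value == null ? defaultValue : Boolean.parseBoolean(value);
    }

    // Returns the integer value for the key, or defaultValue if the key is absent.
    // Integer.parseInt throws NumberFormatException for malformed input.
    int getInt(String key, int defaultValue) {
        String value = get(key);
        return value == null ? defaultValue : Integer.parseInt(value);
    }

    public static void main(String[] args) {
        Map<String, String> raw = new HashMap<>();
        raw.put("IncludeTimestamp", "true");
        raw.put("numPartitions", "4");
        OptionMapSketch options = new OptionMapSketch(raw);
        System.out.println(options.getBoolean("includetimestamp", false));
        System.out.println(options.getInt("NUMPARTITIONS", 1));
        System.out.println(options.getInt("missing", 1));
    }
}
```

Lower-casing keys once at construction keeps each lookup a plain hash lookup, at the cost of losing the original key casing.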

cloud-fan · 2019-03-08T15:49:02Z

sql/catalyst/src/test/scala/org/apache/spark/sql/util/CaseInsensitiveStringMapSuite.scala

- * A simple test suite to verify `DataSourceOptions`.
- */
-class DataSourceOptionsSuite extends SparkFunSuite {
+class CaseInsensitiveStringMapSuite extends SparkFunSuite {


It's awkward to write tests in Java. I rewrote this suite in Scala and merged it with the original DataSourceOptionsSuite.

cloud-fan · 2019-03-08T15:53:37Z

cc @rdblue @gengliangwang @gatorsmile

gengliangwang · 2019-03-08T17:00:45Z

I think this PR changes too many files...
How about preserving DataSourceOptions by making it a subclass of CaseInsensitiveStringMap? That way the keys PATH_KEY/CHECK_FILES_EXIST_KEY/etc. and their related methods can also be preserved.

dongjoon-hyun · 2019-03-08T17:06:47Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala


 abstract class FileTable(
    sparkSession: SparkSession,
-    options: DataSourceOptions,
+    options: CaseInsensitiveStringMap,
+    paths: Seq[String],


Hi, @cloud-fan .
Should we change the FileTable signature to additionally accept paths as part of merging DataSourceOptions and CaseInsensitiveStringMap?

It's not a big deal. I did this because we need the paths in OrcDataSourceV2 as well, so we can calculate them only once, in OrcDataSourceV2.

Got it. Thanks!

dongjoon-hyun · 2019-03-08T17:10:48Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala

      }
-      val checkFilesExistsOption = DataSourceOptions.CHECK_FILES_EXIST_KEY -> "true"
+      // TODO: remove this option.
+      val checkFilesExistsOption = "check_files_exist" -> "true"


Could you file a JIRA and make this a TODO with a JIRA ID, please?

dongjoon-hyun · 2019-03-08T17:28:50Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/sources/StreamingDataSourceV2Suite.scala

@@ -306,8 +307,8 @@ class StreamingDataSourceV2Suite extends StreamTest {
        testPositiveCaseWithQuery(readSource, writeSource, trigger) { _ =>
          eventually(timeout(streamingTimeout)) {
            // Write options should not be set.
-            assert(LastWriteOptions.options.getBoolean(readOptionName, false) == false)
-            assert(LastReadOptions.options.getBoolean(readOptionName, false))


Since this PR adds CaseInsensitiveStringMap.getBoolean, we don't need to change line 310.

dongjoon-hyun · 2019-03-08T17:29:16Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/sources/StreamingDataSourceV2Suite.scala

@@ -317,8 +318,8 @@ class StreamingDataSourceV2Suite extends StreamTest {
        testPositiveCaseWithQuery(readSource, writeSource, trigger) { _ =>
          eventually(timeout(streamingTimeout)) {
            // Read options should not be set.
-            assert(LastReadOptions.options.getBoolean(writeOptionName, false) == false)
-            assert(LastWriteOptions.options.getBoolean(writeOptionName, false))


ditto for line 321.

SparkQA · 2019-03-08T17:42:05Z

Test build #103216 has finished for PR 24025 at commit b8b3a3c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-03-09T03:55:18Z

> So that the keys PATH_KEY/CHECK_FILES_EXIST_KEY/etc. and their related methods can also be preserved.

One goal is to remove these pre-defined option keys, as the options should just be a general string-to-string map.

I don't think it's a good idea to keep both CaseInsensitiveStringMap and DataSourceOptions just to keep the code diff small. It will hurt long-term maintainability.

SparkQA · 2019-03-09T08:05:01Z

Test build #103254 has finished for PR 24025 at commit c60e2bf.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

dilipbiswal · 2019-03-09T08:10:35Z

retest this please

SparkQA · 2019-03-09T14:53:35Z

Test build #103257 has finished for PR 24025 at commit c60e2bf.

This patch fails from timeout after a configured wait of 400m.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-03-09T15:16:46Z

Test build #4601 has started for PR 24025 at commit c60e2bf.

srowen

Generally looks good to me as a cleanup

dongjoon-hyun · 2019-03-10T00:56:40Z

Retest this please.

SparkQA · 2019-03-10T07:37:56Z

Test build #103272 has finished for PR 24025 at commit c60e2bf.

This patch fails from timeout after a configured wait of 400m.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-03-10T21:32:43Z

Retest this please.

dongjoon-hyun · 2019-03-10T21:54:23Z

...c/main/scala/org/apache/spark/sql/execution/streaming/sources/TextSocketSourceProvider.scala

      throw new AnalysisException("Set a port to read from with option(\"port\", ...).")
    }
    Try {
-      params.get("includeTimestamp").orElse("false").toBoolean
+      params.getBoolean("includeTimestamp", false)
    } match {
      case Success(_) =>
      case Failure(_) =>
        throw new AnalysisException("includeTimestamp must be set to either \"true\" or \"false\"")


Hi, @cloud-fan .
It seems that we need to change this Try logic. For invalid values like fasle:

Previously, an IllegalArgumentException was thrown by Scala's StringLike.parseBoolean.

Now, Java's Boolean.parseBoolean returns false without throwing, so the Failure branch is never reached for such values.

Good catch!
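The behavioral difference raised here can be reproduced in isolation. The sketch below contrasts lenient parsing with `Boolean.parseBoolean` against a strict parser that rejects anything other than "true" or "false". The class `BooleanParsingSketch` and its `strict` method are hypothetical names used for illustration; the thread does not show how the PR actually fixed this, so strict parsing is only one plausible approach.

```java
// Hypothetical illustration of lenient vs. strict boolean parsing.
final class BooleanParsingSketch {
    // Lenient: anything other than "true" (ignoring case) becomes false, and nothing throws.
    static boolean lenient(String value) {
        return Boolean.parseBoolean(value);
    }

    // Strict: only "true" or "false" (ignoring case) are accepted.
    static boolean strict(String value) {
        if ("true".equalsIgnoreCase(value)) {
            return true;
        }
        if ("false".equalsIgnoreCase(value)) {
            return false;
        }
        throw new IllegalArgumentException(value + " is not a boolean string.");
    }

    public static void main(String[] args) {
        // The typo is silently accepted by the lenient parser.
        System.out.println("lenient(\"fasle\") = " + lenient("fasle"));
        try {
            strict("fasle");
        } catch (IllegalArgumentException e) {
            // The strict parser surfaces the typo to the caller.
            System.out.println("strict(\"fasle\") threw: " + e.getMessage());
        }
    }
}
```

With the strict variant, a surrounding Try (or try/catch) can again turn an invalid value into a user-facing error instead of silently treating it as false.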

SparkQA · 2019-03-11T04:16:47Z

Test build #103284 has finished for PR 24025 at commit c60e2bf.

This patch fails from timeout after a configured wait of 400m.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-03-11T06:28:44Z

Test build #103293 has finished for PR 24025 at commit a53748d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-03-12T07:05:01Z

Test build #103359 has finished for PR 24025 at commit 32fdb64.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-03-12T08:16:36Z

retest this please

gengliangwang · 2019-03-12T10:56:26Z

sql/catalyst/src/main/java/org/apache/spark/sql/util/CaseInsensitiveStringMap.java

+
+  /**
+   * Returns the integer value to which the specified key is mapped,
+   * or defaultValue if there is no mapping for the key. The key match is case-insensitive


Nit: add a period at the end of the line.

It's too minor to trigger another QA round. I'll fix it in another PR if the current QA round passes.

gengliangwang

LGTM. I searched all the Java/Scala/Markdown files and there are no remaining references to DataSourceOptions.

SparkQA · 2019-03-12T15:03:58Z

Test build #103366 has finished for PR 24025 at commit 32fdb64.

This patch fails from timeout after a configured wait of 400m.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-03-12T15:51:56Z

Oh, it's weird. So far, there has been no successful Jenkins run in this PR.

dongjoon-hyun · 2019-03-12T15:52:58Z

Retest this please.

SparkQA · 2019-03-13T00:44:56Z

Test build #103376 has finished for PR 24025 at commit 32fdb64.

This patch fails from timeout after a configured wait of 400m.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2019-03-13T02:10:37Z

retest this please

dongjoon-hyun · 2019-03-13T02:50:55Z

Hi, @cloud-fan . Could you check the test failure at RateStreamProviderSuite?

[info] RateStreamProviderSuite:
[info] - RateStreamProvider in registry (14 milliseconds)
[info] - compatible with old path in registry (1 millisecond)
[info] - microbatch - basic *** FAILED *** (10 seconds, 113 milliseconds)
[info]   Timed out waiting for stream: The code passed to failAfter did not complete within 10 seconds.

SparkQA · 2019-03-13T06:40:40Z

Test build #103408 has finished for PR 24025 at commit 71e6ae1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-03-13T07:05:01Z

Test build #103404 has finished for PR 24025 at commit 32fdb64.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

dilipbiswal · 2019-03-13T07:08:25Z

retest this please

gengliangwang · 2019-03-13T08:55:28Z

...main/scala/org/apache/spark/sql/execution/streaming/sources/RateStreamMicroBatchStream.scala

@@ -155,7 +155,7 @@ class RateStreamMicroBatchStream(

  override def toString: String = s"RateStreamV2[rowsPerSecond=$rowsPerSecond, " +
    s"rampUpTimeSeconds=$rampUpTimeSeconds, " +
-    s"numPartitions=${options.get(NUM_PARTITIONS).orElse("default")}"
+    s"numPartitions=${Option(options.get(NUM_PARTITIONS)).getOrElse("default")}"


Nit: options.getOrDefault(NUM_PARTITIONS, "default")

Too minor to update... Hopefully all tests pass this time.
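For readers unfamiliar with the two forms in this nit, the sketch below shows them side by side on a plain `java.util.Map`. The class `DefaultLookupSketch` and the key string "numPartitions" are assumptions for illustration; the diff only shows the constant NUM_PARTITIONS, not its value.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration of two ways to fall back to a default value.
final class DefaultLookupSketch {
    // Mirrors the Scala form in the diff: wrap the possibly-null result, then fall back.
    static String viaOptional(Map<String, String> options, String key) {
        return Optional.ofNullable(options.get(key)).orElse("default");
    }

    // The form suggested in the nit: let the map supply the fallback.
    static String viaGetOrDefault(Map<String, String> options, String key) {
        return options.getOrDefault(key, "default");
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        System.out.println(viaOptional(options, "numPartitions"));
        System.out.println(viaGetOrDefault(options, "numPartitions"));
        options.put("numPartitions", "8");
        System.out.println(viaOptional(options, "numPartitions"));
        System.out.println(viaGetOrDefault(options, "numPartitions"));
    }
}
```

The two forms agree whenever the map holds no null values. They differ only for a key explicitly mapped to null, where `HashMap.getOrDefault` returns null and the `Optional` form returns the default.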

SparkQA · 2019-03-13T10:28:43Z

Test build #103423 has finished for PR 24025 at commit 71e6ae1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2019-03-13T11:20:37Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala

+    Option(map.get("paths")).map { pathStr =>
+      objectMapper.readValue(pathStr, classOf[Array[String]]).toSeq
+    }.orElse(Option(map.get("path")).map(Seq(_))).getOrElse {
+      throw new IllegalArgumentException("'path' must be given when reading files.")


nit: 'path' or 'paths' must be ...

gengliangwang · 2019-03-13T11:47:01Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala

@@ -44,7 +44,7 @@ trait FileDataSourceV2 extends TableProvider with DataSourceRegister {
    Option(map.get("paths")).map { pathStr =>
      objectMapper.readValue(pathStr, classOf[Array[String]]).toSeq
    }.orElse(Option(map.get("path")).map(Seq(_))).getOrElse {
-      throw new IllegalArgumentException("'path' must be given when reading files.")
+      Nil


  protected def getPaths(map: CaseInsensitiveStringMap): Seq[String] = {
    Option(map.get("paths")).map { pathStr =>
      val objectMapper = new ObjectMapper()
      objectMapper.readValue(pathStr, classOf[Array[String]]).toSeq
    }.getOrElse {
      Option(map.get("path")).toSeq
    }
  }

SparkQA · 2019-03-13T15:49:17Z

Test build #103433 has finished for PR 24025 at commit 7b922f9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-03-13T16:34:26Z

Test build #103435 has finished for PR 24025 at commit 4599659.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-03-13T17:18:51Z

Thanks, merging to master!

rdblue · 2019-03-13T20:15:44Z

Thanks for working on this, @cloud-fan!

It's a little awkward to have 2 different classes(`CaseInsensitiveStringMap` and `DataSourceOptions`) to present the options in data source and catalog API.

This PR merges these 2 classes, while keeping the name `CaseInsensitiveStringMap`, which is more precise.

existing tests

Closes apache#24025 from cloud-fan/option.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

b8b3a3c  merge CaseInsensitiveStringMap and DataSourceOptions

dongjoon-hyun reviewed Mar 8, 2019

dongjoon-hyun mentioned this pull request Mar 8, 2019: [SPARK-27085][SQL] Migrate CSV to File Data Source V2 #24005 (Closed)

c60e2bf  address comments

srowen reviewed Mar 9, 2019

dongjoon-hyun reviewed Mar 10, 2019

a53748d  fix boolean option

32fdb64  fix test

cloud-fan mentioned this pull request Mar 12, 2019: [SPARK-26594][SQL] DataSourceOptions.asMap should return CaseInsensitiveMap #24062 (Closed)

gengliangwang reviewed and approved these changes Mar 12, 2019

71e6ae1  fix test

gengliangwang reviewed Mar 13, 2019

viirya reviewed Mar 13, 2019

7b922f9  fix test

gengliangwang reviewed Mar 13, 2019

cloud-fan added 2 commits March 13, 2019 20:14

5ff7202  Merge remote-tracking branch 'origin/master' into option

4599659  address comment

cloud-fan closed this in 2a80a4c Mar 13, 2019

gengliangwang mentioned this pull request Aug 16, 2019: [SPARK-28757][SQL] File table location should include both values of option path and paths #25473 (Closed)
