Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-27106][SQL] merge CaseInsensitiveStringMap and DataSourceOptions #24025

Closed
wants to merge 8 commits into from

Conversation

cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

It's a little awkward to have 2 different classes(CaseInsensitiveStringMap and DataSourceOptions) to present the options in data source and catalog API.

This PR merges these 2 classes, while keeping the name CaseInsensitiveStringMap, which is more precise.

How was this patch tested?

existing tests

* Returns the boolean value to which the specified key is mapped,
* or defaultValue if there is no mapping for the key. The key match is case-insensitive
*/
public boolean getBoolean(String key, boolean defaultValue) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 4 methods are from DataSourceOptions, which are pretty general and useful.

* A simple test suite to verify `DataSourceOptions`.
*/
class DataSourceOptionsSuite extends SparkFunSuite {
class CaseInsensitiveStringMapSuite extends SparkFunSuite {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's awkward to write test in Java. I rewrite it in Scala and merge it with the original DataSourceOptionsSuite

@cloud-fan
Copy link
Contributor Author

@gengliangwang
Copy link
Member

gengliangwang commented Mar 8, 2019

I think this PR changes too many files...
How about reserve the DataSourceOptions by making it a derived class of CaseInsensitiveStringMap? So that the keys PATH_KEY/CHECK_FILES_EXIST_KEY/etc and their related methods can be also reserved.


abstract class FileTable(
sparkSession: SparkSession,
options: DataSourceOptions,
options: CaseInsensitiveStringMap,
paths: Seq[String],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @cloud-fan .
Should we change FileTable signature to accept paths additionally for merging DataSourceOptions and CaseInsensitiveStringMap?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not a big deal. I did this because we need paths in the OrcDataSourceV2 as well, so we can calculate the paths only once in the OrcDataSourceV2.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks!

}
val checkFilesExistsOption = DataSourceOptions.CHECK_FILES_EXIST_KEY -> "true"
// TODO: remove this option.
val checkFilesExistsOption = "check_files_exist" -> "true"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you file a JIRA and make this as an IDed TODO please?

@@ -306,8 +307,8 @@ class StreamingDataSourceV2Suite extends StreamTest {
testPositiveCaseWithQuery(readSource, writeSource, trigger) { _ =>
eventually(timeout(streamingTimeout)) {
// Write options should not be set.
assert(LastWriteOptions.options.getBoolean(readOptionName, false) == false)
assert(LastReadOptions.options.getBoolean(readOptionName, false))
Copy link
Member

@dongjoon-hyun dongjoon-hyun Mar 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this PR adds CaseInsensitiveStringMap.getBoolean, we don't need to change line 310.

@@ -317,8 +318,8 @@ class StreamingDataSourceV2Suite extends StreamTest {
testPositiveCaseWithQuery(readSource, writeSource, trigger) { _ =>
eventually(timeout(streamingTimeout)) {
// Read options should not be set.
assert(LastReadOptions.options.getBoolean(writeOptionName, false) == false)
assert(LastWriteOptions.options.getBoolean(writeOptionName, false))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto for line 321.

@SparkQA
Copy link

SparkQA commented Mar 8, 2019

Test build #103216 has finished for PR 24025 at commit b8b3a3c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

So that the keys PATH_KEY/CHECK_FILES_EXIST_KEY/etc and their related methods can be also reserved.

One goal is to remove these pre-defined option keys, as the options should just be a general string-to-string map.

I don't think it's a good idea to keep both CaseInsensitiveStringMap and DataSourceOptions just for keeping code diff small. It will hurt long term maintainability.

@SparkQA
Copy link

SparkQA commented Mar 9, 2019

Test build #103254 has finished for PR 24025 at commit c60e2bf.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Mar 9, 2019

Test build #103257 has finished for PR 24025 at commit c60e2bf.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 9, 2019

Test build #4601 has started for PR 24025 at commit c60e2bf.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good to me as a cleanup

@dongjoon-hyun
Copy link
Member

Retest this please.

@SparkQA
Copy link

SparkQA commented Mar 10, 2019

Test build #103272 has finished for PR 24025 at commit c60e2bf.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Retest this please.

throw new AnalysisException("Set a port to read from with option(\"port\", ...).")
}
Try {
params.get("includeTimestamp").orElse("false").toBoolean
params.getBoolean("includeTimestamp", false)
} match {
case Success(_) =>
case Failure(_) =>
throw new AnalysisException("includeTimestamp must be set to either \"true\" or \"false\"")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @cloud-fan .
It seems that we need to change this Try logic. For invalid values like fasle,

  • Previously, IllegalArgumentException is thrown by Scala StringLike.parseBoolean
  • Now, Java Boolean.parseBoolean returns false without exceptions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

@SparkQA
Copy link

SparkQA commented Mar 11, 2019

Test build #103284 has finished for PR 24025 at commit c60e2bf.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 11, 2019

Test build #103293 has finished for PR 24025 at commit a53748d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 12, 2019

Test build #103359 has finished for PR 24025 at commit 32fdb64.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

retest this please


/**
* Returns the integer value to which the specified key is mapped,
* or defaultValue if there is no mapping for the key. The key match is case-insensitive
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: add . at the end of line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's too minor to trigger another QA round. I'll fix it in another PR if the current QA round passes.

Copy link
Member

@gengliangwang gengliangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I search all the java/scala/markdown files and there is no DataSourceOptions now.

@SparkQA
Copy link

SparkQA commented Mar 12, 2019

Test build #103366 has finished for PR 24025 at commit 32fdb64.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Oh, it's weird. So far, there is no successful Jenkins run in this PR.

@dongjoon-hyun
Copy link
Member

Retest this please.

@SparkQA
Copy link

SparkQA commented Mar 13, 2019

Test build #103376 has finished for PR 24025 at commit 32fdb64.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Copy link
Member

retest this please

@dongjoon-hyun
Copy link
Member

Hi, @cloud-fan . Could you check the test failure at RateStreamProviderSuite?

[info] RateStreamProviderSuite:
[info] - RateStreamProvider in registry (14 milliseconds)
[info] - compatible with old path in registry (1 millisecond)
[info] - microbatch - basic *** FAILED *** (10 seconds, 113 milliseconds)
[info]   Timed out waiting for stream: The code passed to failAfter did not complete within 10 seconds.

@SparkQA
Copy link

SparkQA commented Mar 13, 2019

Test build #103408 has finished for PR 24025 at commit 71e6ae1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 13, 2019

Test build #103404 has finished for PR 24025 at commit 32fdb64.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Copy link
Contributor

retest this please

@@ -155,7 +155,7 @@ class RateStreamMicroBatchStream(

override def toString: String = s"RateStreamV2[rowsPerSecond=$rowsPerSecond, " +
s"rampUpTimeSeconds=$rampUpTimeSeconds, " +
s"numPartitions=${options.get(NUM_PARTITIONS).orElse("default")}"
s"numPartitions=${Option(options.get(NUM_PARTITIONS)).getOrElse("default")}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: options.getOrDefault(NUM_PARTITIONS, "default")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too minor to update...Hopefully this time all tests are passed.

@SparkQA
Copy link

SparkQA commented Mar 13, 2019

Test build #103423 has finished for PR 24025 at commit 71e6ae1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Option(map.get("paths")).map { pathStr =>
objectMapper.readValue(pathStr, classOf[Array[String]]).toSeq
}.orElse(Option(map.get("path")).map(Seq(_))).getOrElse {
throw new IllegalArgumentException("'path' must be given when reading files.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 'path' or 'paths' must be ...

@@ -44,7 +44,7 @@ trait FileDataSourceV2 extends TableProvider with DataSourceRegister {
Option(map.get("paths")).map { pathStr =>
objectMapper.readValue(pathStr, classOf[Array[String]]).toSeq
}.orElse(Option(map.get("path")).map(Seq(_))).getOrElse {
throw new IllegalArgumentException("'path' must be given when reading files.")
Nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  protected def getPaths(map: CaseInsensitiveStringMap): Seq[String] = {
    Option(map.get("paths")).map { pathStr =>
      val objectMapper = new ObjectMapper()
      objectMapper.readValue(pathStr, classOf[Array[String]]).toSeq
    }.getOrElse {
      Option(map.get("path")).toSeq
    }
  }

@SparkQA
Copy link

SparkQA commented Mar 13, 2019

Test build #103433 has finished for PR 24025 at commit 7b922f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 13, 2019

Test build #103435 has finished for PR 24025 at commit 4599659.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

thanks, merging to master!

@cloud-fan cloud-fan closed this in 2a80a4c Mar 13, 2019
@rdblue
Copy link
Contributor

rdblue commented Mar 13, 2019

Thanks for working on this, @cloud-fan!

mccheah pushed a commit to palantir/spark that referenced this pull request May 15, 2019
It's a little awkward to have 2 different classes(`CaseInsensitiveStringMap` and `DataSourceOptions`) to present the options in data source and catalog API.

This PR merges these 2 classes, while keeping the name `CaseInsensitiveStringMap`, which is more precise.

existing tests

Closes apache#24025 from cloud-fan/option.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
9 participants