[SPARK-18419][SQL] JDBCRelation.insert should not remove Spark options #15863

Closed
wants to merge 3 commits into from

Conversation

dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Nov 12, 2016

What changes were proposed in this pull request?

Currently, JDBCRelation.insert removes Spark options too early by mistakenly using asConnectionProperties. Spark options like numPartitions should be passed into DataFrameWriter.jdbc correctly. This bug has been hidden because JDBCOptions.asConnectionProperties fails to filter out mixed-case options. This PR aims to fix both issues.

JDBCRelation.insert

override def insert(data: DataFrame, overwrite: Boolean): Unit = {
  val url = jdbcOptions.url
  val table = jdbcOptions.table
- val properties = jdbcOptions.asConnectionProperties
+ val properties = jdbcOptions.asProperties
  data.write
    .mode(if (overwrite) SaveMode.Overwrite else SaveMode.Append)
    .jdbc(url, table, properties)

JDBCOptions.asConnectionProperties

scala> import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions
scala> import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
scala> new JDBCOptions(Map("url" -> "jdbc:mysql://localhost:3306/temp", "dbtable" -> "t1", "numPartitions" -> "10")).asConnectionProperties
res0: java.util.Properties = {numpartitions=10}
scala> new JDBCOptions(new CaseInsensitiveMap(Map("url" -> "jdbc:mysql://localhost:3306/temp", "dbtable" -> "t1", "numPartitions" -> "10"))).asConnectionProperties
res1: java.util.Properties = {numpartitions=10}
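
For reference, a minimal sketch of the case-insensitive filtering this fix relies on (the option-name set below is an illustrative subset, not the full list registered in JDBCOptions.scala):

```scala
import java.util.Properties

// Spark-specific option names, stored lowercased so membership checks
// can ignore the case of user-supplied keys.
val sparkOptionNames = Set("url", "dbtable", "numpartitions",
  "partitioncolumn", "lowerbound", "upperbound")

def asConnectionProperties(parameters: Map[String, String]): Properties = {
  val properties = new Properties()
  // Lowercase each key before the check; a case-sensitive lookup would
  // let mixed-case keys such as "numPartitions" leak through, as shown above.
  parameters
    .filter { case (k, _) => !sparkOptionNames.contains(k.toLowerCase) }
    .foreach { case (k, v) => properties.setProperty(k, v) }
  properties
}
```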

How was this patch tested?

Passes Jenkins with a new test case.

@@ -54,7 +54,6 @@ object JDBCRDD extends Logging {
  def resolveTable(options: JDBCOptions): StructType = {
    val url = options.url
    val table = options.table
    val properties = options.asConnectionProperties
Member Author

This is unused.

@dongjoon-hyun
Member Author

Thank you for the review, @HyukjinKwon. Let me check that.

@HyukjinKwon
Member

Oh, it seems not. I just removed my suggestion. I will take another look to see if it is possible in a similar way. Thanks!

@dongjoon-hyun
Member Author

dongjoon-hyun commented Nov 12, 2016

Yes, that does not seem to solve this problem. The problem is that the following is case-sensitive.

private val jdbcOptionNames = ArrayBuffer.empty[String]
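
To illustrate: a contains check against this buffer is case-sensitive, so the lowercased keys coming from CaseInsensitiveMap never match a mixed-case registered name. A minimal demonstration:

```scala
import scala.collection.mutable.ArrayBuffer

val jdbcOptionNames = ArrayBuffer("numPartitions")
jdbcOptionNames.contains("numPartitions") // true
jdbcOptionNames.contains("numpartitions") // false: case-sensitive lookup,
                                          // so the CaseInsensitiveMap key misses
```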

@SparkQA

SparkQA commented Nov 12, 2016

Test build #68563 has finished for PR 15863 at commit 2e249ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Our user-specified JDBC options/parameters are case-sensitive, right?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Nov 12, 2016

Thank you for review, @gatorsmile .
In the Spark code, JDBCOptions always receives a CaseInsensitiveMap.
For example, here.

The purpose of asConnectionProperties is to remove the Spark-specific JDBC options defined in JDBCOptions.scala, but it cannot filter out numPartitions or other options written with mixed case.
This PR aims to filter JDBC options case-insensitively: only the user-specified options whose names match a Spark option name, regardless of case, will be filtered out.

@gatorsmile
Member

Not always. Try this code path.

    val df = spark.createDataFrame(sparkContext.parallelize(arr2x2), schema2)
    df.write.format("jdbc")
      .option("URL", url1)
      .option("dbtable", "TEST.SAVETEST")
      .options(properties.asScala)
      .save()

I think we should make them consistent. Can you fix it in this PR?

@dongjoon-hyun
Member Author

Oh, sure. I'll investigate it.

@dongjoon-hyun
Member Author

I updated DataSource to use CaseInsensitiveMap consistently and am running tests.
After testing, this PR will include the following:

  • Fix DataSource to use CaseInsensitiveMap.
  • Fix JDBCOptions.asConnectionProperties to be case-insensitive.

@@ -314,8 +311,7 @@ case class DataSource(
        catalogTable.get,
        catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(0L))
    } else {
-     new InMemoryFileIndex(
-       sparkSession, globbedPaths, options, partitionSchema)
+     new InMemoryFileIndex(sparkSession, globbedPaths, options, partitionSchema)
Member Author

InMemoryFileIndex should use the original map because it extends PartitioningAwareFileIndex, which uses the following line.

protected val hadoopConf = sparkSession.sessionState.newHadoopConfWithOptions(parameters)

Eventually, it creates another case-sensitive map from the CaseInsensitiveMap. In that case, lookups against that map will fail because the keys have already been lowercased by CaseInsensitiveMap.

def newHadoopConfWithOptions(options: Map[String, String]): Configuration = {
  val hadoopConf = newHadoopConf()
  options.foreach { case (k, v) =>
    if ((v ne null) && k != "path" && k != "paths") {
      hadoopConf.set(k, v)
    }
  }
  hadoopConf
}
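
A minimal sketch of the failure mode described above, using a hypothetical mixed-case key and a plain map lowercasing to stand in for CaseInsensitiveMap:

```scala
// Hypothetical mixed-case option a user might expect to reach Hadoop as-is.
val options = Map("my.Custom.Key" -> "v")

// CaseInsensitiveMap stores keys lowercased; this is what a
// newHadoopConfWithOptions-style copy would then iterate over.
val lowered = options.map { case (k, v) => (k.toLowerCase, v) }

// Copying into a case-sensitive store keeps only the lowercased spelling,
// so a later lookup with the original key misses.
lowered.get("my.Custom.Key") // None: only "my.custom.key" was stored
```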

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-18419][SQL] Fix JDBCOptions.asConnectionProperties to be case-insensitive [SPARK-18419][SQL] Fix JDBCOptions and DataSource to be caes-insensitive for JDBCOptions keys Nov 12, 2016
@SparkQA

SparkQA commented Nov 12, 2016

Test build #68572 has finished for PR 15863 at commit c6cf19c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-18419][SQL] Fix JDBCOptions and DataSource to be caes-insensitive for JDBCOptions keys [SPARK-18419][SQL] Fix JDBCOptions and DataSource to be case-insensitive for JDBCOptions keys Nov 12, 2016
test("SPARK-18419 JDBCOption keys should be case-insensitive") {
val df = spark.createDataFrame(sparkContext.parallelize(arr2x2), schema2)
df.write.format("jdbc")
.option("URL", url1)
Member

nit: Use "Url" or "uRL" to show the case-insensitivity more explicitly.

Member Author

Thank you for the review, @viirya! I updated it, too.

@SparkQA

SparkQA commented Nov 13, 2016

Test build #68574 has finished for PR 15863 at commit ea47d74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

- s.createSource(
-   sparkSession.sqlContext, metadataPath, userSpecifiedSchema, className, options)
+ s.createSource(sparkSession.sqlContext, metadataPath, userSpecifiedSchema, className,
+   caseInsensitiveOptions)
Member

There is a code style issue here; it should be formatted as:

s.createSource(
  sparkSession.sqlContext,
  metadataPath, 
  userSpecifiedSchema, 
  className, 
  caseInsensitiveOptions)

Member Author

Sure!

@gatorsmile
Member

@rxin @cloud-fan Could you take a look at this PR, especially the external behavior impact? Thanks!

@SparkQA

SparkQA commented Nov 13, 2016

Test build #68575 has finished for PR 15863 at commit 493f31c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -303,4 +303,13 @@ class JDBCWriteSuite extends SharedSQLContext with BeforeAndAfter {
    assert(e.contains("If 'partitionColumn' is specified then 'lowerBound', 'upperBound'," +
      " and 'numPartitions' are required."))
  }

test("SPARK-18419 JDBCOption keys should be case-insensitive") {
Contributor

Is it consistent with other options like JSONOptions?

@dongjoon-hyun
Member Author

Thank you for the review, @cloud-fan.

Actually, this PR includes two different changes: the first commit is a bug fix, and the others (from the second commit on) are a potential improvement.

First, I'll separate them to make each PR clearer.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-18419][SQL] Fix JDBCOptions and DataSource to be case-insensitive for JDBCOptions keys [SPARK-18419][SQL] Fix JDBCOptions.asConnectionProperties to be case-insensitive Nov 14, 2016
@dongjoon-hyun
Member Author

@cloud-fan and @gatorsmile.
For the DataSource options issue, I'm working on SPARK-18433, which covers the following:

  • CSVOptions
  • JDBCOptions
  • JSONOptions
  • OrcOptions

@SparkQA

SparkQA commented Nov 14, 2016

Test build #68606 has finished for PR 15863 at commit 2e249ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

I'm closing this PR since this is already fixed.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Dec 1, 2016

Oh, it's still there.

scala> import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions
import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions

scala> import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

scala> new JDBCOptions(Map("url" -> "jdbc:mysql://localhost:3306/temp", "dbtable" -> "t1", "numPartitions" -> "10")).asConnectionProperties
res0: java.util.Properties = {numpartitions=10}

scala> new JDBCOptions(new CaseInsensitiveMap(Map("url" -> "jdbc:mysql://localhost:3306/temp", "dbtable" -> "t1", "numPartitions" -> "10"))).asConnectionProperties
res1: java.util.Properties = {numpartitions=10}

@dongjoon-hyun dongjoon-hyun reopened this Dec 1, 2016
@@ -129,7 +129,7 @@ object JDBCOptions {
  private val jdbcOptionNames = ArrayBuffer.empty[String]
Member Author

jdbcOptionNames is the root cause.

Contributor

Shall we make it a Set? It seems we only use it for a contains check.

Member Author

Right. I'll update it like that.
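
A minimal sketch of that change, assuming a small registration helper (the newOption name here is illustrative):

```scala
object JDBCOptionNames {
  import scala.collection.mutable

  // A Set supports the contains check directly and avoids duplicate entries.
  private val jdbcOptionNames = mutable.Set[String]()

  // Record each option name once and return it, so option constants can be
  // declared and registered in a single step.
  private def newOption(name: String): String = {
    jdbcOptionNames += name
    name
  }

  val JDBC_NUM_PARTITIONS: String = newOption("numPartitions")

  def isSparkOption(key: String): Boolean = jdbcOptionNames.contains(key)
}
```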

@dongjoon-hyun
Member Author

@cloud-fan and @gatorsmile, what do you think about this? This is an issue specific to the asConnectionProperties function.

@dongjoon-hyun
Member Author

Also, cc @srowen.
Sorry for the confusion.

@@ -130,7 +130,7 @@ private[sql] case class JDBCRelation(
override def insert(data: DataFrame, overwrite: Boolean): Unit = {
val url = jdbcOptions.url
val table = jdbcOptions.table
val properties = jdbcOptions.asConnectionProperties
Member Author

Also, this should not use asConnectionProperties; the writer needs options like numPartitions.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-18419][SQL] Fix JDBCOptions.asConnectionProperties to be case-insensitive [SPARK-18419][SQL] JDBCRelation.insert should not remove Spark options Dec 1, 2016
@dongjoon-hyun
Member Author

I updated the PR description and focus.

@SparkQA

SparkQA commented Dec 1, 2016

Test build #69494 has finished for PR 15863 at commit 3323b42.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 1, 2016

Test build #69493 has finished for PR 15863 at commit 2e249ea.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Dec 1, 2016

The only failure seems to be unrelated. Also, I'm waiting for the last running test.

Had test failures in pyspark.streaming.tests with pypy; see logs.

@SparkQA

SparkQA commented Dec 1, 2016

Test build #69498 has finished for PR 15863 at commit 204594f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  parameters.foreach { case (k, v) => properties.setProperty(k, v) }
  properties
}

val asConnectionProperties: Properties = {
Contributor

Let's add some documentation to explain when we should use asConnectionProperties and when to use asProperties.

Member Author

Sure.
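
For illustration, one way the requested documentation could read, attached to minimal stand-in implementations (a sketch under assumed names, not the merged Spark code):

```scala
import java.util.Properties

class JDBCOptionsSketch(parameters: Map[String, String]) {
  // Illustrative subset of Spark-specific option names, lowercased.
  private val sparkOptionNames = Set("numpartitions", "partitioncolumn")

  /**
   * All options, Spark-specific ones included. Use this for code paths such
   * as DataFrameWriter.jdbc that parse Spark options themselves.
   */
  val asProperties: Properties = {
    val p = new Properties()
    parameters.foreach { case (k, v) => p.setProperty(k, v) }
    p
  }

  /**
   * Only the options destined for the JDBC driver: Spark-specific options
   * are filtered out case-insensitively so the driver never sees them.
   */
  val asConnectionProperties: Properties = {
    val p = new Properties()
    parameters
      .filter { case (k, _) => !sparkOptionNames.contains(k.toLowerCase) }
      .foreach { case (k, v) => p.setProperty(k, v) }
    p
  }
}
```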

@cloud-fan
Contributor

LGTM except for 2 comments, including https://github.com/apache/spark/pull/15863/files#r90584811

@SparkQA

SparkQA commented Dec 2, 2016

Test build #69544 has finished for PR 15863 at commit 2375f7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/2.1!

@asfgit asfgit closed this in 55d528f Dec 2, 2016
asfgit pushed a commit that referenced this pull request Dec 2, 2016
Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15863 from dongjoon-hyun/SPARK-18419.

(cherry picked from commit 55d528f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@dongjoon-hyun
Member Author

Thank you for merging, @cloud-fan. Thank you for the reviews, @gatorsmile, @viirya, @HyukjinKwon!

robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
Author: Dongjoon Hyun <dongjoon@apache.org>

Closes apache#15863 from dongjoon-hyun/SPARK-18419.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
Author: Dongjoon Hyun <dongjoon@apache.org>

Closes apache#15863 from dongjoon-hyun/SPARK-18419.
@dongjoon-hyun dongjoon-hyun deleted the SPARK-18419 branch January 7, 2019 07:03