[SPARK-18419][SQL] JDBCRelation.insert should not remove Spark options #15863

Closed
wants to merge 3 commits into from

Conversation

dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Nov 12, 2016

What changes were proposed in this pull request?

Currently, JDBCRelation.insert removes Spark options too early by mistakenly using asConnectionProperties. Spark options like numPartitions should be passed into DataFrameWriter.jdbc correctly. This bug has been hidden because JDBCOptions.asConnectionProperties fails to filter out mixed-case options. This PR aims to fix both issues.

JDBCRelation.insert

override def insert(data: DataFrame, overwrite: Boolean): Unit = {
  val url = jdbcOptions.url
  val table = jdbcOptions.table
- val properties = jdbcOptions.asConnectionProperties
+ val properties = jdbcOptions.asProperties
  data.write
    .mode(if (overwrite) SaveMode.Overwrite else SaveMode.Append)
    .jdbc(url, table, properties)

JDBCOptions.asConnectionProperties

scala> import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions
scala> import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
scala> new JDBCOptions(Map("url" -> "jdbc:mysql://localhost:3306/temp", "dbtable" -> "t1", "numPartitions" -> "10")).asConnectionProperties
res0: java.util.Properties = {numpartitions=10}
scala> new JDBCOptions(new CaseInsensitiveMap(Map("url" -> "jdbc:mysql://localhost:3306/temp", "dbtable" -> "t1", "numPartitions" -> "10"))).asConnectionProperties
res1: java.util.Properties = {numpartitions=10}
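
For reference, a minimal sketch of the case-insensitive filtering this fix relies on (the option-name set below is an illustrative subset, not the full list registered in JDBCOptions.scala):

```scala
import java.util.Properties

// Spark-specific option names, stored lowercased so membership checks
// can ignore the case of user-supplied keys.
val sparkOptionNames = Set("url", "dbtable", "numpartitions",
  "partitioncolumn", "lowerbound", "upperbound")

def asConnectionProperties(parameters: Map[String, String]): Properties = {
  val properties = new Properties()
  // Lowercase each key before the check; a case-sensitive lookup would
  // let mixed-case keys such as "numPartitions" leak through, as shown above.
  parameters
    .filter { case (k, _) => !sparkOptionNames.contains(k.toLowerCase) }
    .foreach { case (k, v) => properties.setProperty(k, v) }
  properties
}
```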

How was this patch tested?

Passes Jenkins with a new test case.

@@ -54,7 +54,6 @@ object JDBCRDD extends Logging {
  def resolveTable(options: JDBCOptions): StructType = {
    val url = options.url
    val table = options.table
    val properties = options.asConnectionProperties
Member Author

This is unused.

@dongjoon-hyun
Member Author

Thank you for the review, @HyukjinKwon. Let me check that.

@HyukjinKwon
Member

Oh, it seems not. I just removed my suggestion. I will take another look to see if it is possible in a similar way. Thanks!

@dongjoon-hyun
Member Author

dongjoon-hyun commented Nov 12, 2016

Yes, that does not seem to solve this problem. The problem is that the following is case-sensitive.

private val jdbcOptionNames = ArrayBuffer.empty[String]
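
To illustrate: a contains check against this buffer is case-sensitive, so the lowercased keys coming from CaseInsensitiveMap never match a mixed-case registered name. A minimal demonstration:

```scala
import scala.collection.mutable.ArrayBuffer

val jdbcOptionNames = ArrayBuffer("numPartitions")
jdbcOptionNames.contains("numPartitions") // true
jdbcOptionNames.contains("numpartitions") // false: case-sensitive lookup,
                                          // so the CaseInsensitiveMap key misses
```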

@SparkQA

SparkQA commented Nov 12, 2016

Test build #68563 has finished for PR 15863 at commit 2e249ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Our user-specified JDBC options/parameters are case-sensitive, right?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Nov 12, 2016

Thank you for review, @gatorsmile .
In the Spark code, JDBCOptions always receives a CaseInsensitiveMap.
For example, here.

The purpose of asConnectionProperties is to remove the Spark-specific JDBC options defined in JDBCOptions.scala, but it cannot filter out numPartitions or other options written with mixed case.
This PR aims to filter JDBC options case-insensitively: only the user-specified options whose names match a Spark option name, regardless of case, will be filtered out.

@gatorsmile
Member

Not always. Try this code path.

    val df = spark.createDataFrame(sparkContext.parallelize(arr2x2), schema2)
    df.write.format("jdbc")
      .option("URL", url1)
      .option("dbtable", "TEST.SAVETEST")
      .options(properties.asScala)
      .save()

I think we should make them consistent. Can you fix it in this PR?

@dongjoon-hyun
Member Author

Oh, sure. I'll investigate it.

@dongjoon-hyun
Member Author

I updated DataSource to use CaseInsensitiveMap consistently and am running tests.
After testing, this PR will include the following:

  • Fix DataSource to use CaseInsensitiveMap.
  • Fix JDBCOptions.asConnectionProperties to be case-insensitive.

@@ -314,8 +311,7 @@ case class DataSource(
        catalogTable.get,
        catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(0L))
    } else {
-     new InMemoryFileIndex(
-       sparkSession, globbedPaths, options, partitionSchema)
+     new InMemoryFileIndex(sparkSession, globbedPaths, options, partitionSchema)
Member Author

InMemoryFileIndex should use the original map because it extends PartitioningAwareFileIndex, which uses the following line.

protected val hadoopConf = sparkSession.sessionState.newHadoopConfWithOptions(parameters)

Eventually, it creates another case-sensitive map from the CaseInsensitiveMap. In that case, lookups against that map will fail because the keys have already been lowercased by CaseInsensitiveMap.

def newHadoopConfWithOptions(options: Map[String, String]): Configuration = {
  val hadoopConf = newHadoopConf()
  options.foreach { case (k, v) =>
    if ((v ne null) && k != "path" && k != "paths") {
      hadoopConf.set(k, v)
    }
  }
  hadoopConf
}
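
A minimal sketch of the failure mode described above, using a hypothetical mixed-case key and a plain map lowercasing to stand in for CaseInsensitiveMap:

```scala
// Hypothetical mixed-case option a user might expect to reach Hadoop as-is.
val options = Map("my.Custom.Key" -> "v")

// CaseInsensitiveMap stores keys lowercased; this is what a
// newHadoopConfWithOptions-style copy would then iterate over.
val lowered = options.map { case (k, v) => (k.toLowerCase, v) }

// Copying into a case-sensitive store keeps only the lowercased spelling,
// so a later lookup with the original key misses.
lowered.get("my.Custom.Key") // None: only "my.custom.key" was stored
```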

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-18419][SQL] Fix JDBCOptions.asConnectionProperties to be case-insensitive [SPARK-18419][SQL] Fix JDBCOptions and DataSource to be caes-insensitive for JDBCOptions keys Nov 12, 2016
@SparkQA

SparkQA commented Nov 12, 2016

Test build #68572 has finished for PR 15863 at commit c6cf19c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-18419][SQL] Fix JDBCOptions and DataSource to be caes-insensitive for JDBCOptions keys [SPARK-18419][SQL] Fix JDBCOptions and DataSource to be case-insensitive for JDBCOptions keys Nov 12, 2016
test("SPARK-18419 JDBCOption keys should be case-insensitive") {
val df = spark.createDataFrame(sparkContext.parallelize(arr2x2), schema2)
df.write.format("jdbc")
.option("URL", url1)
Member

nit: Use "Url" or "uRL" to show the case-insensitivity more explicitly.

Member Author

Thank you for the review, @viirya! I updated it, too.

@SparkQA

SparkQA commented Nov 13, 2016

Test build #68574 has finished for PR 15863 at commit ea47d74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

- s.createSource(
-   sparkSession.sqlContext, metadataPath, userSpecifiedSchema, className, options)
+ s.createSource(sparkSession.sqlContext, metadataPath, userSpecifiedSchema, className,
+   caseInsensitiveOptions)
Member

There is a code style issue here; it should be formatted as:

s.createSource(
  sparkSession.sqlContext,
  metadataPath, 
  userSpecifiedSchema, 
  className, 
  caseInsensitiveOptions)

Member Author

Sure!

@gatorsmile
Member

@rxin @cloud-fan Could you take a look at this PR, especially the external behavior impact? Thanks!

@SparkQA

SparkQA commented Nov 13, 2016

Test build #68575 has finished for PR 15863 at commit 493f31c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -303,4 +303,13 @@ class JDBCWriteSuite extends SharedSQLContext with BeforeAndAfter {
    assert(e.contains("If 'partitionColumn' is specified then 'lowerBound', 'upperBound'," +
      " and 'numPartitions' are required."))
  }

test("SPARK-18419 JDBCOption keys should be case-insensitive") {
Contributor

Is it consistent with other options like JSONOptions?

@dongjoon-hyun
Member Author

Thank you for the review, @cloud-fan.

Actually, this PR includes two different changes: the first commit is a bug fix, and the others (from the second commit on) are a potential improvement.

First, I'll separate them to make each PR clearer.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-18419][SQL] Fix JDBCOptions and DataSource to be case-insensitive for JDBCOptions keys [SPARK-18419][SQL] Fix JDBCOptions.asConnectionProperties to be case-insensitive Nov 14, 2016
@dongjoon-hyun
Member Author

@cloud-fan and @gatorsmile.
For the DataSource options issue, I'm working on SPARK-18433, which covers the following:

  • CSVOptions
  • JDBCOptions
  • JSONOptions
  • OrcOptions

@SparkQA

SparkQA commented Nov 14, 2016

Test build #68606 has finished for PR 15863 at commit 2e249ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

I'm closing this PR since this is already fixed.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Dec 1, 2016

Oh, it's still there.

scala> import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions
import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions

scala> import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

scala> new JDBCOptions(Map("url" -> "jdbc:mysql://localhost:3306/temp", "dbtable" -> "t1", "numPartitions" -> "10")).asConnectionProperties
res0: java.util.Properties = {numpartitions=10}

scala> new JDBCOptions(new CaseInsensitiveMap(Map("url" -> "jdbc:mysql://localhost:3306/temp", "dbtable" -> "t1", "numPartitions" -> "10"))).asConnectionProperties
res1: java.util.Properties = {numpartitions=10}

@dongjoon-hyun dongjoon-hyun reopened this Dec 1, 2016
@@ -129,7 +129,7 @@ object JDBCOptions {
  private val jdbcOptionNames = ArrayBuffer.empty[String]
Member Author

jdbcOptionNames is the root cause.

Contributor

Shall we make it a Set? It seems we only use it for a contains check.

Member Author

Right. I'll update it like that.
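
A minimal sketch of that change, assuming a small registration helper (the newOption name here is illustrative):

```scala
object JDBCOptionNames {
  import scala.collection.mutable

  // A Set supports the contains check directly and avoids duplicate entries.
  private val jdbcOptionNames = mutable.Set[String]()

  // Record each option name once and return it, so option constants can be
  // declared and registered in a single step.
  private def newOption(name: String): String = {
    jdbcOptionNames += name
    name
  }

  val JDBC_NUM_PARTITIONS: String = newOption("numPartitions")

  def isSparkOption(key: String): Boolean = jdbcOptionNames.contains(key)
}
```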

@dongjoon-hyun
Member Author

@cloud-fan and @gatorsmile, what do you think about this? This is an issue specific to the asConnectionProperties function.

@dongjoon-hyun
Member Author

Also, cc @srowen.
Sorry for the confusion.

@@ -130,7 +130,7 @@ private[sql] case class JDBCRelation(
override def insert(data: DataFrame, overwrite: Boolean): Unit = {
val url = jdbcOptions.url
val table = jdbcOptions.table
val properties = jdbcOptions.asConnectionProperties
Member Author

Also, this should not use asConnectionProperties; the writer needs options like numPartitions.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-18419][SQL] Fix JDBCOptions.asConnectionProperties to be case-insensitive [SPARK-18419][SQL] JDBCRelation.insert should not remove Spark options Dec 1, 2016
@dongjoon-hyun
Member Author

I updated the PR description and focus.

@SparkQA

SparkQA commented Dec 1, 2016

Test build #69494 has finished for PR 15863 at commit 3323b42.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 1, 2016

Test build #69493 has finished for PR 15863 at commit 2e249ea.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Dec 1, 2016

The only failure seems to be unrelated. Also, I'm waiting for the last running test.

Had test failures in pyspark.streaming.tests with pypy; see logs.

@SparkQA

SparkQA commented Dec 1, 2016

Test build #69498 has finished for PR 15863 at commit 204594f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  parameters.foreach { case (k, v) => properties.setProperty(k, v) }
  properties
}

val asConnectionProperties: Properties = {
Contributor

Let's add some documentation to explain when we should use asConnectionProperties and when to use asProperties.

Member Author

Sure.
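
For illustration, one way the requested documentation could read, attached to minimal stand-in implementations (a sketch under assumed names, not the merged Spark code):

```scala
import java.util.Properties

class JDBCOptionsSketch(parameters: Map[String, String]) {
  // Illustrative subset of Spark-specific option names, lowercased.
  private val sparkOptionNames = Set("numpartitions", "partitioncolumn")

  /**
   * All options, Spark-specific ones included. Use this for code paths such
   * as DataFrameWriter.jdbc that parse Spark options themselves.
   */
  val asProperties: Properties = {
    val p = new Properties()
    parameters.foreach { case (k, v) => p.setProperty(k, v) }
    p
  }

  /**
   * Only the options destined for the JDBC driver: Spark-specific options
   * are filtered out case-insensitively so the driver never sees them.
   */
  val asConnectionProperties: Properties = {
    val p = new Properties()
    parameters
      .filter { case (k, _) => !sparkOptionNames.contains(k.toLowerCase) }
      .foreach { case (k, v) => p.setProperty(k, v) }
    p
  }
}
```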

@cloud-fan
Contributor

LGTM except for 2 comments, including https://github.com/apache/spark/pull/15863/files#r90584811

@SparkQA

SparkQA commented Dec 2, 2016

Test build #69544 has finished for PR 15863 at commit 2375f7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/2.1!

@asfgit asfgit closed this in 55d528f Dec 2, 2016
asfgit pushed a commit that referenced this pull request Dec 2, 2016
Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15863 from dongjoon-hyun/SPARK-18419.

(cherry picked from commit 55d528f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@dongjoon-hyun
Member Author

Thank you for merging, @cloud-fan. Thank you for the reviews, @gatorsmile, @viirya, @HyukjinKwon!

robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
Author: Dongjoon Hyun <dongjoon@apache.org>

Closes apache#15863 from dongjoon-hyun/SPARK-18419.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
Author: Dongjoon Hyun <dongjoon@apache.org>

Closes apache#15863 from dongjoon-hyun/SPARK-18419.
@dongjoon-hyun dongjoon-hyun deleted the SPARK-18419 branch January 7, 2019 07:03