
[SPARK-28152][SQL][2.4] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect #25248

Closed
wants to merge 1 commit

Conversation

shivsood
Contributor

@shivsood shivsood commented Jul 24, 2019

What changes were proposed in this pull request?

This is a backport of SPARK-28152 to Spark 2.4.

The SPARK-28152 PR corrects mappings in MsSqlServerDialect: ShortType is mapped to SMALLINT and FloatType to REAL, per the JDBC mapping.

ShortType and FloatType are not mapped to the correct JDBC types when using the JDBC connector. This results in tables and Spark DataFrames being created with unintended types. The issue was observed when validating against SQL Server.

Refer to the JDBC mapping for guidance on mappings between SQL Server, JDBC, and Java. Note that the Java Short type should be mapped to JDBC SMALLINT and Java Float to JDBC REAL.

Some example issues that can happen because of the wrong mappings:
- Writing from a DataFrame with a ShortType column results in a SQL table with the column typed INTEGER instead of SMALLINT, and thus a larger table than expected.
- Reading results in a DataFrame column of type IntegerType instead of ShortType.

  • ShortType has a problem in both the write and read paths.
  • FloatType only has an issue on the read path. On the write path, the Spark type FloatType is correctly mapped to the JDBC equivalent REAL, but on the read path, when JDBC data types are converted to Catalyst data types (getCatalystType), REAL is incorrectly mapped to DoubleType rather than FloatType. A sketch of the corrected mappings follows.
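A minimal sketch of the corrected mappings, assuming Spark's `JdbcDialect` API; the real change lives in `MsSqlServerDialect`, and the object name below is illustrative:

```scala
import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types._

// Sketch only: the actual fix edits MsSqlServerDialect in org.apache.spark.sql.jdbc.
private object MsSqlServerDialectSketch extends JdbcDialect {

  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:sqlserver")

  // Read path: map SQL Server JDBC types to Catalyst types.
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    sqlType match {
      case Types.SMALLINT => Some(ShortType) // previously fell through to IntegerType
      case Types.REAL     => Some(FloatType) // previously fell through to DoubleType
      case _              => None            // defer to the default mappings
    }

  // Write path: map Catalyst types to SQL Server column types.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case ShortType => Some(JdbcType("SMALLINT", Types.SMALLINT)) // previously INTEGER
    case _         => None // FloatType already maps to REAL by default on write
  }
}
```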

Refer to #28151, which contained this fix as one part of a larger PR. Following the discussion on #28151, it was decided to file separate PRs for each of the fixes.

How was this patch tested?

Unit tests were added to JDBCSuite.scala and run; a sketch of their shape follows below.
Integration test updated in MsSqlServerDialect.scala and passed.
E2E test done against SQL Server.
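For reference, a hedged sketch of the kind of assertion such a unit test makes; it assumes placement inside a suite like JDBCSuite.scala, and the test name is illustrative:

```scala
// Sketch of a JDBCSuite-style check (suite imports for JdbcDialects, MetadataBuilder,
// ShortType, and FloatType assumed).
test("SPARK-28152: MsSqlServerDialect maps ShortType to SMALLINT and REAL to FloatType") {
  val dialect = JdbcDialects.get("jdbc:sqlserver://localhost;databaseName=test")

  // Write path: ShortType should become SMALLINT, not INTEGER.
  assert(dialect.getJDBCType(ShortType).map(_.databaseTypeDefinition) === Some("SMALLINT"))

  // Read path: REAL should come back as FloatType, not DoubleType.
  assert(dialect.getCatalystType(java.sql.Types.REAL, "REAL", 1, new MetadataBuilder) ===
    Some(FloatType))
}
```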

… for MsSqlServerDialect

## What changes were proposed in this pull request?
This PR aims to correct mappings in `MsSqlServerDialect`: `ShortType` is mapped to `SMALLINT` and `FloatType` to `REAL`, per the [JDBC mapping](https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017).

ShortType and FloatType are not mapped to the correct JDBC types when using the JDBC connector. This results in tables and Spark DataFrames being created with unintended types. The issue was observed when validating against SQL Server.

Refer to the [JDBC mapping](https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017) for guidance on mappings between SQL Server, JDBC, and Java. Note that the Java Short type should be mapped to JDBC SMALLINT and Java Float to JDBC REAL.

Some example issues that can happen because of the wrong mappings:
    - Writing from a DataFrame with a ShortType column results in a SQL table with the column typed INTEGER instead of SMALLINT, and thus a larger table than expected.
    - Reading results in a DataFrame column of type IntegerType instead of ShortType.

- ShortType has a problem in both the write and read paths.
- FloatType only has an issue on the read path. On the write path, the Spark type FloatType is correctly mapped to the JDBC equivalent REAL, but on the read path, when JDBC data types are converted to Catalyst data types (getCatalystType), REAL is incorrectly mapped to DoubleType rather than FloatType.

Refer to apache#28151, which contained this fix as one part of a larger PR. Following the discussion on apache#28151, it was decided to file separate PRs for each of the fixes.

## How was this patch tested?
Unit tests were added to JDBCSuite.scala and run.
Integration test updated in MsSqlServerDialect.scala and passed.
E2E test done against SQL Server.

Closes apache#25146 from shivsood/float_short_type_fix.

Authored-by: shivsood <shivsood@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Member

ok to test

@SparkQA

SparkQA commented Jul 25, 2019

Test build #108132 has finished for PR 25248 at commit a3020b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Thank you, @shivsood .

Merged to branch-2.4. (The integration test part was also tested manually.)

dongjoon-hyun pushed a commit that referenced this pull request Jul 25, 2019
… REAL for MsSqlServerDialect

Closes #25248 from shivsood/PR_28152_2.4.

Authored-by: shivsood <shivsood@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@shivsood
Contributor Author

+1, LGTM. Thank you, @shivsood .

Merged to branch-2.4. (The integration test part was also tested manually.)

Thanks @dongjoon-hyun

rluta pushed a commit to rluta/spark that referenced this pull request Sep 17, 2019
… REAL for MsSqlServerDialect

kai-chi pushed a commit to kai-chi/spark that referenced this pull request Sep 26, 2019
… REAL for MsSqlServerDialect

@gatorsmile
Member

@shivsood @dongjoon-hyun We should avoid backporting this PR to the maintenance releases. Users will hit a weird error like the following.

```
Caused by: org.apache.spark.sql.AnalysisException: org.apache.spark.sql.jdbc does not allow user-specified schemas.;
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
	at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anon$1.call(DataSourceStrategy.scala:268)
	at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anon$1.call(DataSourceStrategy.scala:253)
	at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4724)
```

They have to drop the JDBC table stored in the metastore and recreate it to bypass the issue. A sketch of how the mismatch surfaces is below.
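A hedged sketch of how the error arises (the table, column, and connection URL are hypothetical): the metastore records the schema inferred when the table was created, and after the mapping change the freshly inferred relation schema no longer matches it, so resolution fails.

```scala
// On the old version: register a JDBC table; the inferred schema (e.g. IntegerType
// for a SMALLINT column) is stored in the metastore.
spark.sql(
  """CREATE TABLE people_jdbc
    |USING org.apache.spark.sql.jdbc
    |OPTIONS (url 'jdbc:sqlserver://host;databaseName=db', dbtable 'people')
    |""".stripMargin)

// On the upgraded version: the relation now infers ShortType, the stored schema is
// treated as user-specified and mismatched, and resolveRelation throws:
//   AnalysisException: org.apache.spark.sql.jdbc does not allow user-specified schemas.
spark.table("people_jdbc").show()
```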

When you review similar PRs, please help stop such PRs from being merged. @cloud-fan @maropu @HyukjinKwon @srowen @dongjoon-hyun

@maropu
Member

maropu commented Nov 28, 2019

sure.

@srowen
Member

srowen commented Nov 28, 2019

Agree, this behavior change isn't what we want to back-port. It wasn't obvious this would happen, but knowing this, definitely looks like it should be reverted.

@HyukjinKwon
Member

Yeah, let's revert.

@HyukjinKwon
Member

@dongjoon-hyun is on vacation IIRC. Let me revert it first.

@srowen
Member

srowen commented Nov 29, 2019

Hang on a tick. I think there is some question about whether this should be behind a flag? Let me get the conversation online here

@HyukjinKwon
Member

Oops, sorry, I already did. I realised that this is already part of Spark 2.4.4.

@HyukjinKwon
Member

HyukjinKwon commented Nov 29, 2019

Yeah, on the flip side, reverting this is also a breaking change when it's already released ...

@HyukjinKwon
Member

Okay, I reverted my revert just now. We can discuss more here.

@shivsood
Contributor Author

shivsood commented Nov 29, 2019 via email

@HyukjinKwon
Member

Yeah, it should be reverted ... but I realised that reverting this is also a breaking change in 2.4.5 (if released).

If somebody creates a table in Spark 2.4.4, they might get a different type in Spark 2.4.5. It might matter in a roundtrip (e.g., spark.schema(...).jdbc(...)) ... we might need a flag? WDYT, guys?

@HyukjinKwon
Member

cc @zsxwing since I discussed with him as well offline.

@zsxwing
Member

zsxwing commented Dec 2, 2019

Yep. As reverting this can cause another behavior change, we should avoid doing this in 2.4.5. It's better to just add a flag in 2.4.5 to allow the users to choose the old behavior.

@maropu
Member

maropu commented Dec 2, 2019

Yea, I also think we need to add a flag to cover the issue.
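For illustration, a hedged sketch of what such a flag could look like inside the dialect; the config key below is an assumption, not necessarily the name the follow-up finally used:

```scala
import java.sql.Types

import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types._

private object MsSqlServerDialectWithLegacyFlag extends JdbcDialect {

  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  // Hypothetical config key; gates the SPARK-28152 mappings behind an opt-out.
  private def legacyNumericMapping: Boolean =
    SQLConf.get.getConfString(
      "spark.sql.legacy.mssqlserver.numericMapping.enabled", "false").toBoolean

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    if (legacyNumericMapping) {
      None // legacy: defer to the old defaults (IntegerType / DoubleType)
    } else {
      sqlType match {
        case Types.SMALLINT => Some(ShortType)
        case Types.REAL     => Some(FloatType)
        case _              => None
      }
    }
}
```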

cloud-fan pushed a commit that referenced this pull request Dec 10, 2019
…ema mismatched

### What changes were proposed in this pull request?

Issue a better error message when the user-specified schema does not match the relation schema.

### Why are the changes needed?

Inspired by #25248 (comment): a user could get a weird error message when type-mapping behavior changes between the Spark schema and the datasource schema (e.g., JDBC). Instead of saying "SomeProvider does not allow user-specified schemas.", we'd better tell the user what is really happening here, so that the error is clearer.

### Does this PR introduce any user-facing change?

Yes, users will see error message changes.

### How was this patch tested?

Updated existing tests.

Closes #26781 from Ngone51/dev-mismatch-schema.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@dongjoon-hyun
Member

Hi, @shivsood . Could you make a PR according to the above advice?

@dongjoon-hyun
Member

Gentle ping, @shivsood .

@shivsood
Contributor Author

Gentle ping, @shivsood .
Can you point me to any guidelines on how to add a flag? Any PR that you can point me to that's doing this? Thanks

@dongjoon-hyun
Member

The latest example is the following.

@dongjoon-hyun
Member

Since there is no ETA from @shivsood , I'll take over the follow-up.

@shivsood
Contributor Author

shivsood commented Jan 13, 2020 via email

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 13, 2020

@shivsood . If you read the dev mailing list, 2.4.5 RC1 is scheduled for tomorrow and I'm the release manager of Apache Spark 2.4.5. We need the patch tonight. I'm working on this for you. Never mind.

@shivsood
Contributor Author

shivsood commented Jan 13, 2020 via email

@dongjoon-hyun
Member

Oh. It's okay. No problem. I mean I'm preparing the PR for this follow-up request. So, you don't need to worry about the schedule. :) Usually, the author and the committer should share the responsibility.

@shivsood
Contributor Author

shivsood commented Jan 13, 2020 via email


dongjoon-hyun added a commit that referenced this pull request Jan 13, 2020
…lect numeric mapping

### What changes were proposed in this pull request?

This is a follow-up for #25248 .

### Why are the changes needed?

With the new behavior, Spark cannot access existing tables that were created under the old behavior.
This PR provides a way for existing users to avoid the new behavior.

### Does this PR introduce any user-facing change?

Yes. This fixes the broken behavior on existing tables.

### How was this patch tested?

Pass the Jenkins and manually run JDBC integration test.
```
build/mvn install -DskipTests
build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 test
```

Closes #27184 from dongjoon-hyun/SPARK-28152-CONF.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
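For existing 2.4 users, a usage sketch of opting back into the old mappings; the config key is assumed from this follow-up and should be verified against the actual release:

```scala
// Opt back into the pre-SPARK-28152 numeric mappings (config key assumed; verify it
// against your Spark 2.4.5 build).
spark.conf.set("spark.sql.legacy.mssqlserver.numericMapping.enabled", "true")

val people = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://host;databaseName=db") // hypothetical connection
  .option("dbtable", "people")
  .load()

people.printSchema() // SMALLINT columns appear as IntegerType again under the legacy flag
```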
dongjoon-hyun added a commit that referenced this pull request Jan 13, 2020
…lect numeric mapping


Closes #27184 from dongjoon-hyun/SPARK-28152-CONF.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 28fc043)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
rdblue pushed a commit to Netflix/spark that referenced this pull request Jan 21, 2020
…ema mismatched
