
[SPARK-28152][SQL][2.4] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect #25248

Closed
wants to merge 1 commit

Conversation

shivsood
Contributor

@shivsood shivsood commented Jul 24, 2019

What changes were proposed in this pull request?

This is a backport of SPARK-28152 to Spark 2.4.

The SPARK-28152 PR corrects mappings in MsSqlServerDialect: ShortType is mapped to SMALLINT and FloatType to REAL, per the JDBC mapping.

ShortType and FloatType are not mapped to the correct JDBC types when using the JDBC connector. This results in tables and Spark DataFrames being created with unintended types. The issue was observed when validating against SQL Server.

Refer to the JDBC mapping for guidance on mappings between SQL Server, JDBC, and Java. Note that the Java Short type should be mapped to JDBC SMALLINT and Java Float to JDBC REAL.

Some example issues that can happen because of the wrong mappings:
- Writing from a DataFrame with a ShortType column results in a SQL table with the column typed INTEGER instead of SMALLINT, and thus a larger table than expected.
- Reading results in a DataFrame column of type IntegerType instead of ShortType.

  • ShortType has a problem in both the write and read paths.
  • FloatType only has an issue on the read path. On the write path, the Spark type FloatType is correctly mapped to the JDBC equivalent REAL, but on the read path, when JDBC data types are converted to Catalyst data types (getCatalystType), REAL is incorrectly mapped to DoubleType rather than FloatType. A sketch of the corrected mappings follows.
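A minimal sketch of the corrected mappings, assuming Spark's `JdbcDialect` API; the real change lives in `MsSqlServerDialect`, and the object name below is illustrative:

```scala
import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types._

// Sketch only: the actual fix edits MsSqlServerDialect in org.apache.spark.sql.jdbc.
private object MsSqlServerDialectSketch extends JdbcDialect {

  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:sqlserver")

  // Read path: map SQL Server JDBC types to Catalyst types.
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    sqlType match {
      case Types.SMALLINT => Some(ShortType) // previously fell through to IntegerType
      case Types.REAL     => Some(FloatType) // previously fell through to DoubleType
      case _              => None            // defer to the default mappings
    }

  // Write path: map Catalyst types to SQL Server column types.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case ShortType => Some(JdbcType("SMALLINT", Types.SMALLINT)) // previously INTEGER
    case _         => None // FloatType already maps to REAL by default on write
  }
}
```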

Refer to #28151, which contained this fix as one part of a larger PR. Following the discussion on #28151, it was decided to file separate PRs for each of the fixes.

How was this patch tested?

Unit tests were added to JDBCSuite.scala and run; a sketch of their shape follows below.
Integration test updated in MsSqlServerDialect.scala and passed.
E2E test done against SQL Server.
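For reference, a hedged sketch of the kind of assertion such a unit test makes; it assumes placement inside a suite like JDBCSuite.scala, and the test name is illustrative:

```scala
// Sketch of a JDBCSuite-style check (suite imports for JdbcDialects, MetadataBuilder,
// ShortType, and FloatType assumed).
test("SPARK-28152: MsSqlServerDialect maps ShortType to SMALLINT and REAL to FloatType") {
  val dialect = JdbcDialects.get("jdbc:sqlserver://localhost;databaseName=test")

  // Write path: ShortType should become SMALLINT, not INTEGER.
  assert(dialect.getJDBCType(ShortType).map(_.databaseTypeDefinition) === Some("SMALLINT"))

  // Read path: REAL should come back as FloatType, not DoubleType.
  assert(dialect.getCatalystType(java.sql.Types.REAL, "REAL", 1, new MetadataBuilder) ===
    Some(FloatType))
}
```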

… for MsSqlServerDialect

## What changes were proposed in this pull request?
This PR aims to correct mappings in `MsSqlServerDialect`: `ShortType` is mapped to `SMALLINT` and `FloatType` to `REAL`, per the [JDBC mapping](https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017).

ShortType and FloatType are not mapped to the correct JDBC types when using the JDBC connector. This results in tables and Spark DataFrames being created with unintended types. The issue was observed when validating against SQL Server.

Refer to the [JDBC mapping](https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017) for guidance on mappings between SQL Server, JDBC, and Java. Note that the Java Short type should be mapped to JDBC SMALLINT and Java Float to JDBC REAL.

Some example issues that can happen because of the wrong mappings:
    - Writing from a DataFrame with a ShortType column results in a SQL table with the column typed INTEGER instead of SMALLINT, and thus a larger table than expected.
    - Reading results in a DataFrame column of type IntegerType instead of ShortType.

- ShortType has a problem in both the write and read paths.
- FloatType only has an issue on the read path. On the write path, the Spark type FloatType is correctly mapped to the JDBC equivalent REAL, but on the read path, when JDBC data types are converted to Catalyst data types (getCatalystType), REAL is incorrectly mapped to DoubleType rather than FloatType.

Refer to apache#28151, which contained this fix as one part of a larger PR. Following the discussion on apache#28151, it was decided to file separate PRs for each of the fixes.

## How was this patch tested?
Unit tests were added to JDBCSuite.scala and run.
Integration test updated in MsSqlServerDialect.scala and passed.
E2E test done against SQL Server.

Closes apache#25146 from shivsood/float_short_type_fix.

Authored-by: shivsood <shivsood@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Member

ok to test

@SparkQA

SparkQA commented Jul 25, 2019

Test build #108132 has finished for PR 25248 at commit a3020b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Thank you, @shivsood .

Merged to branch-2.4. (The integration test part was also tested manually.)

dongjoon-hyun pushed a commit that referenced this pull request Jul 25, 2019
… REAL for MsSqlServerDialect

Closes #25248 from shivsood/PR_28152_2.4.

Authored-by: shivsood <shivsood@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@shivsood
Contributor Author

+1, LGTM. Thank you, @shivsood .

Merged to branch-2.4. (The integration test part was also tested manually.)

Thanks @dongjoon-hyun

rluta pushed a commit to rluta/spark that referenced this pull request Sep 17, 2019
… REAL for MsSqlServerDialect

kai-chi pushed a commit to kai-chi/spark that referenced this pull request Sep 26, 2019
… REAL for MsSqlServerDialect

@gatorsmile
Member

@shivsood @dongjoon-hyun We should avoid backporting this PR to the maintenance releases. Users will hit a weird error like the following.

```
Caused by: org.apache.spark.sql.AnalysisException: org.apache.spark.sql.jdbc does not allow user-specified schemas.;
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
	at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anon$1.call(DataSourceStrategy.scala:268)
	at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anon$1.call(DataSourceStrategy.scala:253)
	at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4724)
```

They have to drop the JDBC table stored in the metastore and recreate it to bypass the issue. A sketch of how the mismatch surfaces is below.
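A hedged sketch of how the error arises (the table, column, and connection URL are hypothetical): the metastore records the schema inferred when the table was created, and after the mapping change the freshly inferred relation schema no longer matches it, so resolution fails.

```scala
// On the old version: register a JDBC table; the inferred schema (e.g. IntegerType
// for a SMALLINT column) is stored in the metastore.
spark.sql(
  """CREATE TABLE people_jdbc
    |USING org.apache.spark.sql.jdbc
    |OPTIONS (url 'jdbc:sqlserver://host;databaseName=db', dbtable 'people')
    |""".stripMargin)

// On the upgraded version: the relation now infers ShortType, the stored schema is
// treated as user-specified and mismatched, and resolveRelation throws:
//   AnalysisException: org.apache.spark.sql.jdbc does not allow user-specified schemas.
spark.table("people_jdbc").show()
```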

When you review similar PRs, please help stop such PRs from being merged. @cloud-fan @maropu @HyukjinKwon @srowen @dongjoon-hyun

@maropu
Member

maropu commented Nov 28, 2019

sure.

@srowen
Member

srowen commented Nov 28, 2019

Agree, this behavior change isn't what we want to back-port. It wasn't obvious this would happen, but knowing this, definitely looks like it should be reverted.

@HyukjinKwon
Member

Yeah, let's revert.

@HyukjinKwon
Member

@dongjoon-hyun is on vacation IIRC. Let me revert it first.

@srowen
Member

srowen commented Nov 29, 2019

Hang on a tick. I think there is some question about whether this should be behind a flag? Let me get the conversation online here

@HyukjinKwon
Member

Oops, sorry, I already did. I realised that this is already part of Spark 2.4.4.

@HyukjinKwon
Member

HyukjinKwon commented Nov 29, 2019

Yeah, on the flip side, reverting this is also a breaking change when it's already released ...

@HyukjinKwon
Member

Okay, I reverted my revert just now. We can discuss more here.

@shivsood
Contributor Author

shivsood commented Nov 29, 2019 via email

@HyukjinKwon
Member

Yeah, it should be reverted ... but I realised that reverting this is also a breaking change in 2.4.5 (if released).

If somebody creates a table in Spark 2.4.4, they might get a different type in Spark 2.4.5. It might matter in a roundtrip (e.g., spark.schema(...).jdbc(...)) ... we might need a flag? WDYT, guys?

@HyukjinKwon
Member

cc @zsxwing since I discussed with him as well offline.

@zsxwing
Member

zsxwing commented Dec 2, 2019

Yep. As reverting this can cause another behavior change, we should avoid doing this in 2.4.5. It's better to just add a flag in 2.4.5 to allow the users to choose the old behavior.

@maropu
Member

maropu commented Dec 2, 2019

Yea, I also think we need to add a flag to cover the issue.
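For illustration, a hedged sketch of what such a flag could look like inside the dialect; the config key below is an assumption, not necessarily the name the follow-up finally used:

```scala
import java.sql.Types

import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types._

private object MsSqlServerDialectWithLegacyFlag extends JdbcDialect {

  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  // Hypothetical config key; gates the SPARK-28152 mappings behind an opt-out.
  private def legacyNumericMapping: Boolean =
    SQLConf.get.getConfString(
      "spark.sql.legacy.mssqlserver.numericMapping.enabled", "false").toBoolean

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    if (legacyNumericMapping) {
      None // legacy: defer to the old defaults (IntegerType / DoubleType)
    } else {
      sqlType match {
        case Types.SMALLINT => Some(ShortType)
        case Types.REAL     => Some(FloatType)
        case _              => None
      }
    }
}
```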

cloud-fan pushed a commit that referenced this pull request Dec 10, 2019
…ema mismatched

### What changes were proposed in this pull request?

Issue a better error message when the user-specified schema does not match the relation schema.

### Why are the changes needed?

Inspired by #25248 (comment): a user could get a weird error message when type-mapping behavior changes between the Spark schema and the datasource schema (e.g., JDBC). Instead of saying "SomeProvider does not allow user-specified schemas.", we'd better tell the user what is really happening here, so that the error is clearer.

### Does this PR introduce any user-facing change?

Yes, users will see error message changes.

### How was this patch tested?

Updated existing tests.

Closes #26781 from Ngone51/dev-mismatch-schema.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@dongjoon-hyun
Member

Hi, @shivsood . Could you make a PR according to the above advice?

@dongjoon-hyun
Member

Gentle ping, @shivsood .

@shivsood
Contributor Author

Gentle ping, @shivsood .
Can you point me to any guidelines on how to add a flag? Any PR that you can point me to that's doing this? Thanks

@dongjoon-hyun
Member

The latest example is the following.

@dongjoon-hyun
Member

Since there is no ETA from @shivsood , I'll take over the follow-up.

@shivsood
Contributor Author

shivsood commented Jan 13, 2020 via email

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 13, 2020

@shivsood . If you read the dev mailing list, 2.4.5 RC1 is scheduled for tomorrow and I'm the release manager of Apache Spark 2.4.5. We need the patch tonight. I'm working on this for you. Never mind.

@shivsood
Contributor Author

shivsood commented Jan 13, 2020 via email

@dongjoon-hyun
Member

Oh. It's okay. No problem. I mean I'm preparing the PR for this follow-up request. So, you don't need to worry about the schedule. :) Usually, the author and the committer should share the responsibility.

@shivsood
Contributor Author

shivsood commented Jan 13, 2020 via email


dongjoon-hyun added a commit that referenced this pull request Jan 13, 2020
…lect numeric mapping

### What changes were proposed in this pull request?

This is a follow-up for #25248 .

### Why are the changes needed?

With the new behavior, Spark cannot access existing tables that were created under the old behavior.
This PR provides a way for existing users to avoid the new behavior.

### Does this PR introduce any user-facing change?

Yes. This fixes the broken behavior on existing tables.

### How was this patch tested?

Pass the Jenkins and manually run JDBC integration test.
```
build/mvn install -DskipTests
build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 test
```

Closes #27184 from dongjoon-hyun/SPARK-28152-CONF.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
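For existing 2.4 users, a usage sketch of opting back into the old mappings; the config key is assumed from this follow-up and should be verified against the actual release:

```scala
// Opt back into the pre-SPARK-28152 numeric mappings (config key assumed; verify it
// against your Spark 2.4.5 build).
spark.conf.set("spark.sql.legacy.mssqlserver.numericMapping.enabled", "true")

val people = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://host;databaseName=db") // hypothetical connection
  .option("dbtable", "people")
  .load()

people.printSchema() // SMALLINT columns appear as IntegerType again under the legacy flag
```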
dongjoon-hyun added a commit that referenced this pull request Jan 13, 2020
…lect numeric mapping


Closes #27184 from dongjoon-hyun/SPARK-28152-CONF.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 28fc043)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
rdblue pushed a commit to Netflix/spark that referenced this pull request Jan 21, 2020
…ema mismatched
