[SPARK-28152][SQL] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect #25146

shivsood · 2019-07-13T19:27:31Z

What changes were proposed in this pull request?

This PR corrects the type mappings in MsSqlServerDialect: ShortType is mapped to SMALLINT and FloatType is mapped to REAL, per the JDBC mapping.

Previously, ShortType and FloatType were not mapped to the right JDBC types when using the JDBC connector, so tables and Spark DataFrames were created with unintended types. The issue was observed when validating against SQL Server.

Refer to the JDBC mapping for guidance on mappings between SQL Server, JDBC and Java. Note that the Java "Short" type should be mapped to JDBC "SMALLINT" and the Java "Float" type should be mapped to JDBC "REAL".

Some example issues that can arise from the wrong mappings:
- Writing a DataFrame with a ShortType column creates a SQL table whose column type is INTEGER rather than SMALLINT, so the table is larger than expected.
- Reading a SMALLINT column yields a DataFrame column of type IntegerType rather than ShortType.
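
The "larger table" effect in the first bullet can be checked with a little arithmetic. The sketch below assumes SQL Server's documented fixed storage sizes (2 bytes for SMALLINT, 4 bytes for INT, which is what INTEGER resolves to) and a hypothetical table of ten million rows. It counts only the raw column payload and ignores row overhead and compression.

```python
# Bytes per value for the two SQL Server integer types involved.
# Assumption: fixed-width storage with no row or page compression.
STORAGE_BYTES = {"SMALLINT": 2, "INT": 4}


def column_payload_bytes(sql_type: str, row_count: int) -> int:
    """Raw bytes used by one fixed-width column, ignoring row overhead."""
    return STORAGE_BYTES[sql_type] * row_count


rows = 10_000_000  # hypothetical table size, for illustration only
as_int = column_payload_bytes("INT", rows)
as_smallint = column_payload_bytes("SMALLINT", rows)

print(f"INT column:      {as_int:,} bytes")
print(f"SMALLINT column: {as_smallint:,} bytes")
print(f"Extra bytes from the wrong mapping: {as_int - as_smallint:,}")
```

Under these assumptions the mis-mapped column takes twice the payload of the intended one.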

ShortType has a problem in both the write and read paths.
FloatType has an issue only in the read path. In the write path, the Spark data type 'FloatType' is correctly mapped to the equivalent JDBC data type 'REAL'. In the read path, however, when JDBC data types are converted to Catalyst data types (getCatalystType), 'REAL' is incorrectly mapped to 'DoubleType' rather than 'FloatType'.
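
The read and write behaviour described above can be modelled as a two-step lookup: the dialect is asked first, and a generic default applies when the dialect has no answer. The sketch below is a simplified Python model, not Spark's Scala code. All function and table names are hypothetical, the default tables contain only the behaviour reported in this description, and the numeric codes are the constants defined in `java.sql.Types`.

```python
# JDBC type codes as defined in java.sql.Types.
SMALLINT, INTEGER, REAL = 5, 4, 7

# Generic defaults, limited to the behaviour reported in this PR
# (assumption: every other type is omitted for brevity).
DEFAULT_WRITE = {"ShortType": "INTEGER", "FloatType": "REAL"}
DEFAULT_READ = {SMALLINT: "IntegerType", INTEGER: "IntegerType", REAL: "DoubleType"}

# Hypothetical dialect override tables before and after the fix.
DIALECT_WRITE_BEFORE, DIALECT_READ_BEFORE = {}, {}
DIALECT_WRITE_AFTER = {"ShortType": "SMALLINT"}
DIALECT_READ_AFTER = {SMALLINT: "ShortType", REAL: "FloatType"}

# JDBC code the driver reports for each SQL column type used here.
DRIVER_CODE = {"SMALLINT": SMALLINT, "INTEGER": INTEGER, "REAL": REAL}


def lookup(key, dialect, default):
    """Ask the dialect first; fall back to the generic default."""
    return dialect.get(key, default.get(key))


def round_trip(catalyst_type, write_overrides, read_overrides):
    """Write a column of catalyst_type, then read it back.

    Returns (SQL column type created, Catalyst type seen on read).
    """
    column_type = lookup(catalyst_type, write_overrides, DEFAULT_WRITE)
    jdbc_code = DRIVER_CODE[column_type]
    return column_type, lookup(jdbc_code, read_overrides, DEFAULT_READ)


for label, w, r in [
    ("before", DIALECT_WRITE_BEFORE, DIALECT_READ_BEFORE),
    ("after", DIALECT_WRITE_AFTER, DIALECT_READ_AFTER),
]:
    for t in ("ShortType", "FloatType"):
        print(label, t, "->", round_trip(t, w, r))
```

In this model only the "after" tables return the same Catalyst type that was written, and FloatType needs no write-side override because the default already produces REAL.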

Refer to SPARK-28151 (#24969), which contained this fix as one part of a larger PR. Following the discussion on that PR, it was decided to file separate PRs for each of the fixes.

How was this patch tested?

Unit tests were added to JDBCSuite.scala and pass.
The integration test in MsSqlServerIntegrationSuite.scala was updated and passes.
An end-to-end test was run against SQL Server.

dongjoon-hyun · 2019-07-13T19:52:14Z

ok to test

dongjoon-hyun · 2019-07-13T19:56:47Z

BTW, @shivsood, since you are going to contribute more to the Apache Spark community, here are some comments on the current PR title:

> Fix for SPARK-28152: ShortType and FloatTypes are not mapped correctly for read/write of SQLServer Tables

* We use the `[SPARK-28152]` prefix followed by a component tag like `[SQL]`.
* We recommend using the PR title to describe your approach, not the problem. The problem description and the before/after comparison belong in the PR description.

dongjoon-hyun

Please update the PR title. Also, please revise the PR description. It's malformed. You can reference the commit logs.

shivsood · 2019-07-13T20:02:37Z

> BTW, @shivsood, since you are going to contribute more to the Apache Spark community, here are some comments on the current PR title:
>
> Fix for SPARK-28152: ShortType and FloatTypes are not mapped correctly for read/write of SQLServer Tables
>
> * We use the `[SPARK-28152]` prefix followed by a component tag like `[SQL]`.
> * We recommend using the PR title to describe your approach, not the problem. The problem description and the before/after comparison belong in the PR description.

Thanks dongjoon-hyun. I have fixed this.

dongjoon-hyun · 2019-07-13T20:05:21Z

Thank you, @shivsood !

SparkQA · 2019-07-13T22:12:22Z

Test build #107633 has finished for PR 25146 at commit 49606c8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala

dongjoon-hyun · 2019-07-15T04:07:53Z

Could you elaborate on your PR description a little bit more, @shivsood? Please refer to the other commit logs.

shivsood · 2019-07-15T17:08:33Z

> Could you elaborate on your PR description a little bit more, @shivsood? Please refer to the other commit logs.

Done.

dongjoon-hyun · 2019-07-15T17:20:50Z

Thank you, @shivsood . I updated a little bit more. You can see the difference~

shivsood · 2019-07-15T17:38:35Z

@dongjoon-hyun Thanks. Looks great now!

dongjoon-hyun

+1, LGTM. (Pending Jenkins).
Thank you so much, @shivsood ! I also tested this manually with the integration test.

shivsood · 2019-07-15T18:43:17Z

> +1, LGTM. (Pending Jenkins).
> Thank you so much, @shivsood ! I also tested this manually with the integration test.

@dongjoon-hyun Thanks. I also manually tested this with the integration test in MsSqlServerIntegrationSuite.scala.

SparkQA · 2019-07-15T19:10:11Z

Test build #107692 has finished for PR 25146 at commit b25007e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-07-15T19:12:10Z

Merged to master. There is a conflict for the older branches. Could you make a backporting PR against branch-2.4 (and branch-2.3 if you want)?

dongjoon-hyun · 2019-07-15T19:19:10Z

And, congratulations! I added you to the Apache Spark Contributor group and assigned SPARK-28152 to you.

shivsood · 2019-07-15T19:46:22Z

> Could you make a backporting PR a

Will do. It is important to have this fixed on 2.4.
I have to look into the process for creating a backporting PR. Generally I would cherry-pick this PR onto the 2.4 branch, change it as required, test, and create a new pull request. Is this the right process?

shivsood · 2019-07-15T19:50:49Z

> And, congratulations! I added you to the Apache Spark Contributor group and assigned SPARK-28152 to you.

@dongjoon-hyun Thanks for all the guidance and reviews. After this PR I have a much better understanding of the process and criteria. Thanks for your patience and guidance.

@wangyum Thanks for suggesting updates to MsSqlServerIntegrationSuite. That's a great test suite for end-to-end testing. It would be great if it were also part of CI/CD.

dongjoon-hyun · 2019-07-15T20:33:07Z

Yep. If a PR meets the criteria from the beginning, it is merged quickly. I'm looking forward to seeing more contributions from you. 😄 Thanks.

… for MsSqlServerDialect

## What changes were proposed in this pull request?

This PR aims to correct mappings in `MsSqlServerDialect`. `ShortType` is mapped to `SMALLINT` and `FloatType` is mapped to `REAL` per the [JDBC mapping](https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017).

ShortType and FloatType are not correctly mapped to the right JDBC types when using the JDBC connector. This results in tables and Spark DataFrames being created with unintended types. The issue was observed when validating against SQL Server.

Refer to the [JDBC mapping](https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017) for guidance on mappings between SQL Server, JDBC and Java. Note that the Java "Short" type should be mapped to JDBC "SMALLINT" and the Java "Float" type should be mapped to JDBC "REAL".

Some example issues that can arise from the wrong mappings:
- Writing a DataFrame with a ShortType column creates a SQL table whose column type is INTEGER rather than SMALLINT, so the table is larger than expected.
- Reading a SMALLINT column yields a DataFrame column of type IntegerType rather than ShortType.
- ShortType has a problem in both the write and read paths.
- FloatType has an issue only in the read path. In the write path, the Spark data type 'FloatType' is correctly mapped to the equivalent JDBC data type 'REAL'. In the read path, when JDBC data types are converted to Catalyst data types (getCatalystType), 'REAL' is incorrectly mapped to 'DoubleType' rather than 'FloatType'.

Refer to SPARK-28151 (apache#24969), which contained this fix as one part of a larger PR. Following the discussion on that PR, it was decided to file separate PRs for each of the fixes.

## How was this patch tested?

Unit tests were added to JDBCSuite.scala and pass.
The integration test in MsSqlServerIntegrationSuite.scala was updated and passes.
An end-to-end test was run against SQL Server.

Closes apache#25146 from shivsood/float_short_type_fix.

Authored-by: shivsood <shivsood@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

gatorsmile · 2019-07-21T05:55:09Z

@dongjoon-hyun @shivsood This change requires an update in our migration guide section. Could you submit a follow-up PR for this?

shivsood · 2019-07-22T00:27:00Z

> Could you submit a follow-up PR for this?

@gatorsmile Good point. I'll submit the doc PR. Also, @dongjoon-hyun had asked for a PR against Spark 2.4, which is still pending on me. I will do that and follow it up with a documentation PR. Thanks.

dongjoon-hyun requested changes Jul 13, 2019

dongjoon-hyun added the SQL label Jul 13, 2019

shivsood changed the title ~~Fix for SPARK-28152: ShortType and FloatTypes are not mapped correctly for read/write of SQLServer Tables~~ [SPARK-28152] [SQL] Mapped ShortType to SMALLINT and FloatTypes to REAL for correct read/write of SQLServer Tables Jul 13, 2019

shivsood changed the title ~~[SPARK-28152] [SQL] Mapped ShortType to SMALLINT and FloatTypes to REAL for correct read/write of SQLServer Tables~~ [SPARK-28152] [SQL] Mapped ShortType to SMALLINT and FloatType to REAL for correct read/write of SQLServer Tables Jul 13, 2019

shivsood changed the title ~~[SPARK-28152] [SQL] Mapped ShortType to SMALLINT and FloatType to REAL for correct read/write of SQLServer Tables~~ [SPARK-28152] [SQL] Mapped ShortType to SMALLINT and FloatType to REAL for correct read/write of SQLServer Tables using JDBC connector Jul 13, 2019

dongjoon-hyun changed the title ~~[SPARK-28152] [SQL] Mapped ShortType to SMALLINT and FloatType to REAL for correct read/write of SQLServer Tables using JDBC connector~~ [SPARK-28152][SQL] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect Jul 15, 2019

dongjoon-hyun reviewed Jul 15, 2019

sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala (outdated)

dongjoon-hyun mentioned this pull request Jul 15, 2019

[SPARK-28151][SQL] Fix MsSqlServerDialect Byte/Short/Float type mappings ( DRAFT) #24969

Closed

Hygiene fix : removing the extra line

b25007e

dongjoon-hyun approved these changes Jul 15, 2019

dongjoon-hyun closed this in d8996fd Jul 15, 2019

shivsood mentioned this pull request Jul 24, 2019

[SPARK-28152][SQL][2.4] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect #25248

Closed
