[WIP][SPARK-28152][SQL][2.4] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect #25238

shivsood · 2019-07-24T01:17:18Z

What changes were proposed in this pull request?

This is a backport of SPARK-28152 to Spark 2.4. Because the fix in 3.0 was built on different base files, the following related fixes have also been cherry-picked as part of this PR.

[SPARK-27159][SQL] update mssql server dialect to support binary type
[SPARK-27168][SQL][TEST] Add docker integration test for MsSql server

SPARK-28152 corrects the type mappings in MsSqlServerDialect. With the fix, ShortType is mapped to SMALLINT and FloatType is mapped to REAL, per the JDBC mapping.

Without the fix, ShortType and FloatType are not mapped to the right JDBC types when using the JDBC connector. As a result, tables and Spark data frames are created with unintended types. The issue was observed when validating against SQLServer.

Refer to the JDBC mapping for guidance on mappings between SQLServer, JDBC and Java. Note that the Java "Short" type should be mapped to JDBC "SMALLINT" and the Java "Float" type should be mapped to JDBC "REAL".

Some example issues that can happen because of the wrong mappings:
- A write from a df with a ShortType column results in a SQL table with column type INTEGER as opposed to SMALLINT, and thus a larger table than expected.
- A read results in a dataframe with type INTEGER as opposed to ShortType.

ShortType has a problem in both the write and read paths.
FloatType only has an issue in the read path. In the write path, the Spark data type 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real'. But in the read path, when JDBC data types need to be converted to Catalyst data types (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 'FloatType'.
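
The read/write asymmetry described above can be illustrated with a small, dependency-free model. This is only a sketch under stated assumptions, not Spark's code: `CatalystType`, `ShortT`, `IntegerT`, `FloatT`, `DoubleT`, `MappingSketch` and all of its methods are hypothetical names, and the "default" tables encode only the pre-fix behaviour reported in this description. In Spark, the dialect-level hooks are `getJDBCType` (write path) and `getCatalystType` (read path).

```scala
import java.sql.Types

// Hypothetical stand-ins for Catalyst types; these are NOT Spark's classes.
sealed trait CatalystType
case object ShortT extends CatalystType
case object IntegerT extends CatalystType
case object FloatT extends CatalystType
case object DoubleT extends CatalystType

object MappingSketch {
  // Generic fallbacks, limited to the behaviour reported in this PR description.
  def defaultToCatalyst(sqlType: Int): Option[CatalystType] = sqlType match {
    case Types.SMALLINT => Some(IntegerT) // reported problem: read yields INTEGER
    case Types.REAL     => Some(DoubleT)  // reported problem: read yields DoubleType
    case _              => None
  }

  def defaultToSql(t: CatalystType): Option[String] = t match {
    case ShortT => Some("INTEGER") // reported problem: write creates INTEGER column
    case FloatT => Some("REAL")    // already correct, so no override is needed
    case _      => None
  }

  // Dialect overrides: answer only for the corrected types, otherwise defer.
  def dialectToCatalyst(sqlType: Int): Option[CatalystType] = sqlType match {
    case Types.SMALLINT => Some(ShortT)
    case Types.REAL     => Some(FloatT)
    case _              => None
  }

  def dialectToSql(t: CatalystType): Option[String] = t match {
    case ShortT => Some("SMALLINT")
    case _      => None
  }

  // Read path: the dialect is consulted first, then the generic fallback.
  def readType(sqlType: Int): Option[CatalystType] =
    dialectToCatalyst(sqlType).orElse(defaultToCatalyst(sqlType))

  // Write path: same precedence, dialect first.
  def writeType(t: CatalystType): Option[String] =
    dialectToSql(t).orElse(defaultToSql(t))

  def main(args: Array[String]): Unit = {
    println(readType(Types.SMALLINT))
    println(readType(Types.REAL))
    println(writeType(ShortT))
    println(writeType(FloatT))
  }
}
```

In this model FloatType needs an override only on the read side, while ShortType needs one in both directions, which matches the description above.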

How was this patch tested?

Integration test using: ./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.MsSqlServerIntegrationSuite


## What changes were proposed in this pull request?

Change the binary type mapping from default blob to varbinary(max) for mssql server.
https://docs.microsoft.com/en-us/sql/t-sql/data-types/binary-and-varbinary-transact-sql?view=sql-server-2017

![image](https://user-images.githubusercontent.com/698621/54351715-0e8c8780-468b-11e9-8931-7ecb85c5ad6b.png)

## How was this patch tested?

Unit test.

Closes apache#24091 from lipzhu/SPARK-27159.

Authored-by: Zhu, Lipeng <lipzhu@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

## What changes were proposed in this pull request?

This PR aims to add a JDBC integration test for MsSql server.

## How was this patch tested?

```
./build/mvn clean install -DskipTests
./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 \
  -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.MsSqlServerIntegrationSuite
```

Closes apache#24099 from lipzhu/SPARK-27168.

Lead-authored-by: Zhu, Lipeng <lipzhu@ebay.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Lipeng Zhu <lipzhu@icloud.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

… for MsSqlServerDialect

## What changes were proposed in this pull request?

This PR aims to correct mappings in `MsSqlServerDialect`. `ShortType` is mapped to `SMALLINT` and `FloatType` is mapped to `REAL` per the [JDBC mapping](https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017).

ShortType and FloatType are not correctly mapped to the right JDBC types when using the JDBC connector. As a result, tables and Spark data frames are created with unintended types. The issue was observed when validating against SQLServer.

Refer to the [JDBC mapping](https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017) for guidance on mappings between SQLServer, JDBC and Java. Note that the Java "Short" type should be mapped to JDBC "SMALLINT" and the Java "Float" type should be mapped to JDBC "REAL".

Some example issues that can happen because of the wrong mappings:
- A write from a df with a ShortType column results in a SQL table with column type INTEGER as opposed to SMALLINT, and thus a larger table than expected.
- A read results in a dataframe with type INTEGER as opposed to ShortType.
- ShortType has a problem in both the write and read paths.
- FloatType only has an issue in the read path. In the write path, the Spark data type 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real'. But in the read path, when JDBC data types need to be converted to Catalyst data types (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 'FloatType'.

Refer to apache#28151, which contained this fix as one part of a larger PR. Following the discussion on PR apache#28151, it was decided to file separate PRs for each of the fixes.

## How was this patch tested?

UnitTest added in JDBCSuite.scala and these were tested.
Integration test updated and passed in MsSqlServerDialect.scala
E2E test done with SQLServer

Closes apache#25146 from shivsood/float_short_type_fix.

Authored-by: shivsood <shivsood@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

dongjoon-hyun · 2019-07-24T01:38:59Z

ok to test

dongjoon-hyun · 2019-07-24T01:41:41Z

sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala

@@ -860,6 +860,29 @@ class JDBCSuite extends QueryTest
      Some(TimestampType))
  }

+  test("MsSqlServerDialect jdbc type mapping") {


Okay. Since this is the same with master, we can ignore adding JIRA ID.

Are you referring to `test("MsSqlServerDialect jdbc type mapping")`? The fix updated this function; it did not create a new one. For the new function that I added, I have mentioned the JIRA ID.

That's exactly what I meant, @shivsood . The above comment is not about requesting changes. It was supporting your code. Usually, reviewers leave comments like this for other reviewers.

dongjoon-hyun · 2019-07-24T01:43:29Z

Welcome back, @shivsood . But, I guess SPARK-27168 should not be here.


dongjoon-hyun · 2019-07-24T02:09:14Z

Also, SPARK-27159 should not be here. Let me handle them for you.


dongjoon-hyun · 2019-07-24T02:21:01Z

Please rebase this PR against branch-2.4, @shivsood .

SparkQA · 2019-07-24T06:23:28Z

Test build #108069 has finished for PR 25238 at commit 277fda0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

shivsood · 2019-07-24T06:39:47Z

> should not be here. Let me handle them

I can submit a separate PR for SPARK-27159. Can you please help me understand the reason? Is it because we will lose attribution as a result of the squash when the PR is merged? If so, that would happen for SPARK-27168 also, and I should submit all these cherry-picks as separate PRs. Please let me know and I can fix this.

shivsood · 2019-07-24T06:40:25Z

Making this WIP till I fix these issues.


shivsood · 2019-07-24T17:10:35Z

@dongjoon-hyun I will close this PR and raise a new one. I messed up this branch. Do you want me to submit 3 PRs for the following commits, or 1 PR? My understanding is 3 PRs, so they would have to be in the following sequence.

[SPARK-27159][SQL] update mssql server dialect to support binary type
[SPARK-27168][SQL][TEST] Add docker integration test for MsSql server
[SPARK-28152][SQL] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect

AmplabJenkins · 2019-07-24T17:10:56Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/108119/
Test FAILed.

dongjoon-hyun · 2019-07-24T17:24:11Z

@shivsood . Please check the branch-2.4 first. :)

https://github.com/apache/spark/commits/branch-2.4

shivsood · 2019-07-24T17:49:55Z

> @shivsood . Please check the branch-2.4 first. :)
> * https://github.com/apache/spark/commits/branch-2.4

@dongjoon-hyun Awesome. Thanks for pulling the dependency commit in. I'll submit my PR on top of these. Thanks.

dongjoon-hyun · 2019-07-24T18:04:56Z

Yep. Thanks!

shivsood · 2019-07-24T19:10:16Z

Created a new PR: #25248

Zhu, Lipeng and others added 3 commits July 23, 2019 16:40

dongjoon-hyun added the SQL label Jul 24, 2019

dongjoon-hyun reviewed Jul 24, 2019


shivsood changed the title ~~[WIP][SPARK-28152][SQL] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect~~ [SPARK-28152][SQL] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect Jul 24, 2019

dongjoon-hyun changed the title ~~[SPARK-28152][SQL] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect~~ [SPARK-28152][SQL][2.4] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect Jul 24, 2019

shivsood changed the title ~~[SPARK-28152][SQL][2.4] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect~~ [WIP][SPARK-28152][SQL][2.4] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect Jul 24, 2019

shivsood added 2 commits July 24, 2019 09:59

Merge branch 'float_byte_type_fix_24' of https://github.com/shivsood/…

7f7adc1

…spark into float_byte_type_fix_24

shivsood closed this Jul 24, 2019
