Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP][SPARK-28152][SQL][2.4] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect #25238

Closed
wants to merge 7 commits into from

Conversation

shivsood
Copy link
Contributor

What changes were proposed in this pull request?

This is a backport of SPARK-28152 to Spark 2.4. Because the fix in 3.0 was based on different base files, following relevant fixes have also been cherry-picked as part of this fix.

  • [SPARK-27159][SQL] update mssql server dialect to support binary type
  • SPARK-27168 [SQL][TEST] Add docker integration test for MsSql server

SPARK-28152 corrects mappings in MsSqlServerDialect. Post fix ShortType is mapped to SMALLINT and FloatType is mapped to REAL per JBDC mapping respectively.

ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector. This results in tables and spark data frame being created with unintended types. The issue was observed when validating against SQLServer.

Refer JBDC mapping for guidance on mappings between SQLServer, JDBC and Java. Note that java "Short" type should be mapped to JDBC "SMALLINT" and java Float should be mapped to JDBC "REAL".

Some example issue that can happen because of wrong mappings
- Write from df with column type results in a SQL table of with column type as INTEGER as opposed to SMALLINT.Thus a larger table that expected.
- Read results in a dataframe with type INTEGER as opposed to ShortType

  • ShortType has a problem in both the the write and read path
  • FloatTypes only have an issue with read path. In the write path Spark data type 'FloatType' is correctly mapped to JDBC equivalent data type 'Real'. But in the read path when JDBC data types need to be converted to Catalyst data types ( getCatalystType) 'Real' gets incorrectly gets mapped to 'DoubleType' rather than 'FloatType'.

How was this patch tested?

Integration test using : ./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.MsSqlServerIntegrationSuite

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review https://spark.apache.org/contributing.html before opening a pull request.

Zhu, Lipeng and others added 3 commits July 23, 2019 16:40
## What changes were proposed in this pull request?

Change the binary type mapping from default blob to varbinary(max) for mssql server.
https://docs.microsoft.com/en-us/sql/t-sql/data-types/binary-and-varbinary-transact-sql?view=sql-server-2017
![image](https://user-images.githubusercontent.com/698621/54351715-0e8c8780-468b-11e9-8931-7ecb85c5ad6b.png)

## How was this patch tested?

Unit test.

Closes apache#24091 from lipzhu/SPARK-27159.

Authored-by: Zhu, Lipeng <lipzhu@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?

This PR aims to add a JDBC integration test for MsSql server.

## How was this patch tested?

```
./build/mvn clean install -DskipTests
./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 \
-Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.MsSqlServerIntegrationSuite
```

Closes apache#24099 from lipzhu/SPARK-27168.

Lead-authored-by: Zhu, Lipeng <lipzhu@ebay.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Lipeng Zhu <lipzhu@icloud.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
… for MsSqlServerDialect

## What changes were proposed in this pull request?
This PR aims to correct mappings in `MsSqlServerDialect`. `ShortType` is mapped to `SMALLINT` and `FloatType` is mapped to `REAL` per [JBDC mapping]( https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017) respectively.

ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector. This results in tables and spark data frame being created with unintended types. The issue was observed when validating against SQLServer.

Refer [JBDC mapping]( https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017  ) for guidance on mappings between SQLServer, JDBC and Java. Note that java "Short" type should be mapped to JDBC "SMALLINT" and java Float should be mapped to JDBC "REAL".

Some example issue that can happen because of wrong mappings
    - Write from df with column type results in a SQL table of with column type as INTEGER as opposed to SMALLINT.Thus a larger table that expected.
    - Read results in a dataframe with type INTEGER as opposed to ShortType

- ShortType has a problem in both the the write and read path
- FloatTypes only have an issue with read path. In the write path Spark data type 'FloatType' is correctly mapped to JDBC equivalent data type 'Real'. But in the read path when JDBC data types need to be converted to Catalyst data types ( getCatalystType) 'Real' gets incorrectly gets mapped to 'DoubleType' rather than 'FloatType'.

Refer apache#28151 which contained this fix as one part of a larger PR.  Following PR apache#28151 discussion it was decided to file seperate PRs for each of the fixes.

## How was this patch tested?
UnitTest added in JDBCSuite.scala and these were tested.
Integration test updated and passed in MsSqlServerDialect.scala
E2E test done with SQLServer

Closes apache#25146 from shivsood/float_short_type_fix.

Authored-by: shivsood <shivsood@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Copy link
Member

ok to test

@@ -860,6 +860,29 @@ class JDBCSuite extends QueryTest
Some(TimestampType))
}

test("MsSqlServerDialect jdbc type mapping") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Since this is the same with master, we can ignore adding JIRA ID.

Copy link
Contributor Author

@shivsood shivsood Jul 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you referring to "test("MsSqlServerDialect jdbc type mapping")"? The fix updated this function, did not create a new function. For the new function that i added, i have mentioned the JIRA ID.

Copy link
Member

@dongjoon-hyun dongjoon-hyun Jul 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's exactly what I meant, @shivsood . The above comment is not about requesting changes. It was supporting your code. Usually, reviewers leave their comments for this other reviewers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@dongjoon-hyun
Copy link
Member

Welcome back, @shivsood . But, I guess SPARK-27168 should not be here.

@shivsood shivsood changed the title [WIP][SPARK-28152][SQL] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect [SPARK-28152][SQL] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect Jul 24, 2019
## What changes were proposed in this pull request?

Change the binary type mapping from default blob to varbinary(max) for mssql server.
https://docs.microsoft.com/en-us/sql/t-sql/data-types/binary-and-varbinary-transact-sql?view=sql-server-2017
![image](https://user-images.githubusercontent.com/698621/54351715-0e8c8780-468b-11e9-8931-7ecb85c5ad6b.png)

## How was this patch tested?

Unit test.

Closes apache#24091 from lipzhu/SPARK-27159.

Authored-by: Zhu, Lipeng <lipzhu@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
@dongjoon-hyun
Copy link
Member

Also, SPARK-27159 should not be here. Let me handle them for you.

## What changes were proposed in this pull request?

This PR aims to add a JDBC integration test for MsSql server.

## How was this patch tested?

```
./build/mvn clean install -DskipTests
./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 \
-Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.MsSqlServerIntegrationSuite
```

Closes apache#24099 from lipzhu/SPARK-27168.

Lead-authored-by: Zhu, Lipeng <lipzhu@ebay.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Lipeng Zhu <lipzhu@icloud.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Copy link
Member

Please rebase this PR against branch-2.4, @shivsood .

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28152][SQL] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect [SPARK-28152][SQL][2.4] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect Jul 24, 2019
@SparkQA
Copy link

SparkQA commented Jul 24, 2019

Test build #108069 has finished for PR 25238 at commit 277fda0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivsood
Copy link
Contributor Author

should not be here. Let me handle them
I can submit a separate PR for SPARK-27159. Can u please help me understand the reason. Is it because we will loose attribution as a result of squash when PR is merged. If so that would happen for SPARK-27168 also and i should submit all these cherry-picks as separate PRs. Please let me know and i can fix this.

@shivsood shivsood changed the title [SPARK-28152][SQL][2.4] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect [WIP][SPARK-28152][SQL][2.4] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect Jul 24, 2019
@shivsood
Copy link
Contributor Author

Making this WIP till i fix these issues.

… for MsSqlServerDialect

## What changes were proposed in this pull request?
This PR aims to correct mappings in `MsSqlServerDialect`. `ShortType` is mapped to `SMALLINT` and `FloatType` is mapped to `REAL` per [JBDC mapping]( https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017) respectively.

ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector. This results in tables and spark data frame being created with unintended types. The issue was observed when validating against SQLServer.

Refer [JBDC mapping]( https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017  ) for guidance on mappings between SQLServer, JDBC and Java. Note that java "Short" type should be mapped to JDBC "SMALLINT" and java Float should be mapped to JDBC "REAL".

Some example issue that can happen because of wrong mappings
    - Write from df with column type results in a SQL table of with column type as INTEGER as opposed to SMALLINT.Thus a larger table that expected.
    - Read results in a dataframe with type INTEGER as opposed to ShortType

- ShortType has a problem in both the the write and read path
- FloatTypes only have an issue with read path. In the write path Spark data type 'FloatType' is correctly mapped to JDBC equivalent data type 'Real'. But in the read path when JDBC data types need to be converted to Catalyst data types ( getCatalystType) 'Real' gets incorrectly gets mapped to 'DoubleType' rather than 'FloatType'.

Refer apache#28151 which contained this fix as one part of a larger PR.  Following PR apache#28151 discussion it was decided to file seperate PRs for each of the fixes.

## How was this patch tested?
UnitTest added in JDBCSuite.scala and these were tested.
Integration test updated and passed in MsSqlServerDialect.scala
E2E test done with SQLServer

Closes apache#25146 from shivsood/float_short_type_fix.

Authored-by: shivsood <shivsood@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@shivsood
Copy link
Contributor Author

shivsood commented Jul 24, 2019

@dongjoon-hyun i will close this PR and raise a new one. I messed up this branch. Do you want me to submit as 3PRs for the following commit or 1 PR? My understanding is 3 PRs. So they would have to be in the following sequence.

[SPARK-27159][SQL] update mssql server dialect to support binary type
[SPARK-27168] [SQL][TEST] Add docker integration test for MsSql server
[SPARK-28152][SQL] Mapped ShortType to SMALLINT and FloatType to REAL for   MsSqlServerDialect

@shivsood shivsood closed this Jul 24, 2019
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/108119/
Test FAILed.

@dongjoon-hyun
Copy link
Member

@shivsood . Please check the branch-2.4 first. :)

@shivsood
Copy link
Contributor Author

@shivsood . Please check the branch-2.4 first. :)

* https://github.com/apache/spark/commits/branch-2.4

@dongjoon-hyun Awesome. thanks for pulling the dependency commit in. 'll submit my PR on top of these. Thanks.

@dongjoon-hyun
Copy link
Member

Yep. Thanks!

@shivsood
Copy link
Contributor Author

Create a new PR #25248

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants