Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-28152][SQL] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect #25146

Closed
wants to merge 2 commits into from

Conversation

shivsood
Copy link
Contributor

@shivsood shivsood commented Jul 13, 2019

What changes were proposed in this pull request?

This PR aims to correct mappings in MsSqlServerDialect. ShortType is mapped to SMALLINT and FloatType is mapped to REAL per JBDC mapping respectively.

ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector. This results in tables and spark data frame being created with unintended types. The issue was observed when validating against SQLServer.

Refer JBDC mapping for guidance on mappings between SQLServer, JDBC and Java. Note that java "Short" type should be mapped to JDBC "SMALLINT" and java Float should be mapped to JDBC "REAL".

Some example issue that can happen because of wrong mappings
- Write from df with column type results in a SQL table of with column type as INTEGER as opposed to SMALLINT.Thus a larger table that expected.
- Read results in a dataframe with type INTEGER as opposed to ShortType

  • ShortType has a problem in both the the write and read path
  • FloatTypes only have an issue with read path. In the write path Spark data type 'FloatType' is correctly mapped to JDBC equivalent data type 'Real'. But in the read path when JDBC data types need to be converted to Catalyst data types ( getCatalystType) 'Real' gets incorrectly gets mapped to 'DoubleType' rather than 'FloatType'.

Refer #28151 which contained this fix as one part of a larger PR. Following PR #28151 discussion it was decided to file seperate PRs for each of the fixes.

How was this patch tested?

UnitTest added in JDBCSuite.scala and these were tested.
Integration test updated and passed in MsSqlServerDialect.scala
E2E test done with SQLServer

… mapped correctly for read/write of SQLServer Tables

ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector. This results in tables and
spark data frame being created with unintended types. The issue was observed when validating against SQLServer.

Some example issue
    - Write from df with column type results in a SQL table of with column type as INTEGER as opposed to SMALLINT.
      Thus a larger table that expected.
    - Read results in a dataframe with type INTEGER as opposed to ShortType

FloatTypes have a issue with read path. In the write path Spark data type 'FloatType' is correctly mapped to JDBC equivalent data type 'Real'. But in the read path when JDBC data types need to be converted to Catalyst data types ( getCatalystType) 'Real' gets incorrectly gets mapped to 'DoubleType' rather than 'FloatType'.

Post fix ShortType is correctly mapped to SMALLINT and FloatType is mapped to REAL
@dongjoon-hyun
Copy link
Member

ok to test

@dongjoon-hyun
Copy link
Member

BTW, @shivsood .
Since you are going to contribute more to Apache Spark community, I'll give you some comments on the current PR title.

Fix for SPARK-28152: ShortType and FloatTypes are not mapped correctly for read/write of SQLServer Tables

  • We use [SPARK-28152] prefix followed by the component tag like [SQL].
  • We recommend to use the PR title to describe your approach, not a problem. The problem description and the comparison between before and after are done in the PR description.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the PR title. Also, please revise the PR description. It's malformed. You can reference the commit logs.

@shivsood shivsood changed the title Fix for SPARK-28152: ShortType and FloatTypes are not mapped correctly for read/write of SQLServer Tables [SPARK-28152] [SQL] Mapped ShortType to SMALLINT and FloatTypes to REAL for correct read/write of SQLServer Tables Jul 13, 2019
@shivsood shivsood changed the title [SPARK-28152] [SQL] Mapped ShortType to SMALLINT and FloatTypes to REAL for correct read/write of SQLServer Tables [SPARK-28152] [SQL] Mapped ShortType to SMALLINT and FloatType to REAL for correct read/write of SQLServer Tables Jul 13, 2019
@shivsood shivsood changed the title [SPARK-28152] [SQL] Mapped ShortType to SMALLINT and FloatType to REAL for correct read/write of SQLServer Tables [SPARK-28152] [SQL] Mapped ShortType to SMALLINT and FloatType to REAL for correct read/write of SQLServer Tables using JDBC connector Jul 13, 2019
@shivsood
Copy link
Contributor Author

BTW, @shivsood .
Since you are going to contribute more to Apache Spark community, I'll give you some comments on the current PR title.

Fix for SPARK-28152: ShortType and FloatTypes are not mapped correctly for read/write of SQLServer Tables

* We use `[SPARK-28152]` prefix followed by the component tag like `[SQL]`.

* We recommend to use the PR title to describe your approach, not a problem. The problem description and the comparison between before and after are done in the PR description.

Thanks dongjoon-hyun. I have fixed this.

@dongjoon-hyun
Copy link
Member

Thank you, @shivsood !

@SparkQA
Copy link

SparkQA commented Jul 13, 2019

Test build #107633 has finished for PR 25146 at commit 49606c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28152] [SQL] Mapped ShortType to SMALLINT and FloatType to REAL for correct read/write of SQLServer Tables using JDBC connector [SPARK-28152][SQL] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect Jul 15, 2019
@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jul 15, 2019

Could you elaborate your PR description a little bit more, @shivsood ? Please refer the other commit logs.

@shivsood
Copy link
Contributor Author

Could you elaborate your PR description a little bit more, @shivsood ? Please refer the other commit logs.

Done.

@dongjoon-hyun
Copy link
Member

Thank you, @shivsood . I updated a little bit more. You can see the difference~

@shivsood
Copy link
Contributor Author

@dongjoon-hyun Thanks. Looks great now!

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. (Pending Jenkins).
Thank you so much, @shivsood ! I also tested this manually with the integration test.

@shivsood
Copy link
Contributor Author

+1, LGTM. (Pending Jenkins).
Thank you so much, @shivsood ! I also tested this manually with the integration test.
@dongjoon-hyun Thanks. I also manually tested this with integration test in MsSQLServerIntegrationSuite.scala

@SparkQA
Copy link

SparkQA commented Jul 15, 2019

Test build #107692 has finished for PR 25146 at commit b25007e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jul 15, 2019

Merged to master. There is a conflict for the older branches. Could you make a backporting PR against branch-2.4 (and branch-2.3 if you want)?

@dongjoon-hyun
Copy link
Member

And, congratulation! I added you to Apache Spark Contributor group and assigned SPARK-28152 to you.

@shivsood
Copy link
Contributor Author

Could you make a backporting PR a
Will do. Important to have this fixed on 2.4.
Have to look into the process to create a back-porting PR. Generally i would cherry-pick this PR to 2.4 branch, change as required, test and create a new pull request. Is this right process?

@shivsood
Copy link
Contributor Author

And, congratulation! I added you to Apache Spark Contributor group and assigned SPARK-28152 to you.
@dongjoon-hyun Thanks for all the guidance and reviews. Post this PR i have much better understanding of the process and criteria. Thanks for your patience and guidance.

@wangyum Thanks for suggesting updates to MsSQLServerIntegrationSuite. That's a great test suite for end to end test. Great if it is also part of CI/CD.

@dongjoon-hyun
Copy link
Member

Yep. If a PR meets the criteria from the beginning, it is merged quickly. I'm looking forward seeing more contributions from you. 😄 Thanks.

vinodkc pushed a commit to vinodkc/spark that referenced this pull request Jul 18, 2019
… for MsSqlServerDialect

## What changes were proposed in this pull request?
This PR aims to correct mappings in `MsSqlServerDialect`. `ShortType` is mapped to `SMALLINT` and `FloatType` is mapped to `REAL` per [JBDC mapping]( https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017) respectively.

ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector. This results in tables and spark data frame being created with unintended types. The issue was observed when validating against SQLServer.

Refer [JBDC mapping]( https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017  ) for guidance on mappings between SQLServer, JDBC and Java. Note that java "Short" type should be mapped to JDBC "SMALLINT" and java Float should be mapped to JDBC "REAL".

Some example issue that can happen because of wrong mappings
    - Write from df with column type results in a SQL table of with column type as INTEGER as opposed to SMALLINT.Thus a larger table that expected.
    - Read results in a dataframe with type INTEGER as opposed to ShortType

- ShortType has a problem in both the the write and read path
- FloatTypes only have an issue with read path. In the write path Spark data type 'FloatType' is correctly mapped to JDBC equivalent data type 'Real'. But in the read path when JDBC data types need to be converted to Catalyst data types ( getCatalystType) 'Real' gets incorrectly gets mapped to 'DoubleType' rather than 'FloatType'.

Refer apache#28151 which contained this fix as one part of a larger PR.  Following PR apache#28151 discussion it was decided to file seperate PRs for each of the fixes.

## How was this patch tested?
UnitTest added in JDBCSuite.scala and these were tested.
Integration test updated and passed in MsSqlServerDialect.scala
E2E test done with SQLServer

Closes apache#25146 from shivsood/float_short_type_fix.

Authored-by: shivsood <shivsood@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@gatorsmile
Copy link
Member

@dongjoon-hyun @shivsood This change requires an update in our migration guide section. Could you submit a follow-up PR for this?

@shivsood
Copy link
Contributor Author

Could you submit a follow-up PR for this?
@gatorsmile Good point. 'll submit the doc PR. Also @dongjoon-hyun had asked a PR for spark 2.4 which is still pending on me. Will do that and follow that up with a document PR. Thanks

shivsood added a commit to shivsood/spark that referenced this pull request Jul 24, 2019
… for MsSqlServerDialect

## What changes were proposed in this pull request?
This PR aims to correct mappings in `MsSqlServerDialect`. `ShortType` is mapped to `SMALLINT` and `FloatType` is mapped to `REAL` per [JBDC mapping]( https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017) respectively.

ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector. This results in tables and spark data frame being created with unintended types. The issue was observed when validating against SQLServer.

Refer [JBDC mapping]( https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017  ) for guidance on mappings between SQLServer, JDBC and Java. Note that java "Short" type should be mapped to JDBC "SMALLINT" and java Float should be mapped to JDBC "REAL".

Some example issue that can happen because of wrong mappings
    - Write from df with column type results in a SQL table of with column type as INTEGER as opposed to SMALLINT.Thus a larger table that expected.
    - Read results in a dataframe with type INTEGER as opposed to ShortType

- ShortType has a problem in both the the write and read path
- FloatTypes only have an issue with read path. In the write path Spark data type 'FloatType' is correctly mapped to JDBC equivalent data type 'Real'. But in the read path when JDBC data types need to be converted to Catalyst data types ( getCatalystType) 'Real' gets incorrectly gets mapped to 'DoubleType' rather than 'FloatType'.

Refer apache#28151 which contained this fix as one part of a larger PR.  Following PR apache#28151 discussion it was decided to file seperate PRs for each of the fixes.

## How was this patch tested?
UnitTest added in JDBCSuite.scala and these were tested.
Integration test updated and passed in MsSqlServerDialect.scala
E2E test done with SQLServer

Closes apache#25146 from shivsood/float_short_type_fix.

Authored-by: shivsood <shivsood@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
shivsood added a commit to shivsood/spark that referenced this pull request Jul 24, 2019
… for MsSqlServerDialect

## What changes were proposed in this pull request?
This PR aims to correct mappings in `MsSqlServerDialect`. `ShortType` is mapped to `SMALLINT` and `FloatType` is mapped to `REAL` per [JBDC mapping]( https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017) respectively.

ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector. This results in tables and spark data frame being created with unintended types. The issue was observed when validating against SQLServer.

Refer [JBDC mapping]( https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017  ) for guidance on mappings between SQLServer, JDBC and Java. Note that java "Short" type should be mapped to JDBC "SMALLINT" and java Float should be mapped to JDBC "REAL".

Some example issue that can happen because of wrong mappings
    - Write from df with column type results in a SQL table of with column type as INTEGER as opposed to SMALLINT.Thus a larger table that expected.
    - Read results in a dataframe with type INTEGER as opposed to ShortType

- ShortType has a problem in both the the write and read path
- FloatTypes only have an issue with read path. In the write path Spark data type 'FloatType' is correctly mapped to JDBC equivalent data type 'Real'. But in the read path when JDBC data types need to be converted to Catalyst data types ( getCatalystType) 'Real' gets incorrectly gets mapped to 'DoubleType' rather than 'FloatType'.

Refer apache#28151 which contained this fix as one part of a larger PR.  Following PR apache#28151 discussion it was decided to file seperate PRs for each of the fixes.

## How was this patch tested?
UnitTest added in JDBCSuite.scala and these were tested.
Integration test updated and passed in MsSqlServerDialect.scala
E2E test done with SQLServer

Closes apache#25146 from shivsood/float_short_type_fix.

Authored-by: shivsood <shivsood@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
shivsood added a commit to shivsood/spark that referenced this pull request Jul 24, 2019
… for MsSqlServerDialect

## What changes were proposed in this pull request?
This PR aims to correct mappings in `MsSqlServerDialect`. `ShortType` is mapped to `SMALLINT` and `FloatType` is mapped to `REAL` per [JBDC mapping]( https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017) respectively.

ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector. This results in tables and spark data frame being created with unintended types. The issue was observed when validating against SQLServer.

Refer [JBDC mapping]( https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017  ) for guidance on mappings between SQLServer, JDBC and Java. Note that java "Short" type should be mapped to JDBC "SMALLINT" and java Float should be mapped to JDBC "REAL".

Some example issue that can happen because of wrong mappings
    - Write from df with column type results in a SQL table of with column type as INTEGER as opposed to SMALLINT.Thus a larger table that expected.
    - Read results in a dataframe with type INTEGER as opposed to ShortType

- ShortType has a problem in both the the write and read path
- FloatTypes only have an issue with read path. In the write path Spark data type 'FloatType' is correctly mapped to JDBC equivalent data type 'Real'. But in the read path when JDBC data types need to be converted to Catalyst data types ( getCatalystType) 'Real' gets incorrectly gets mapped to 'DoubleType' rather than 'FloatType'.

Refer apache#28151 which contained this fix as one part of a larger PR.  Following PR apache#28151 discussion it was decided to file seperate PRs for each of the fixes.

## How was this patch tested?
UnitTest added in JDBCSuite.scala and these were tested.
Integration test updated and passed in MsSqlServerDialect.scala
E2E test done with SQLServer

Closes apache#25146 from shivsood/float_short_type_fix.

Authored-by: shivsood <shivsood@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants