[SPARK-28152][SQL][2.4] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect #25248
Conversation
… for MsSqlServerDialect

## What changes were proposed in this pull request?

This PR corrects mappings in `MsSqlServerDialect`: `ShortType` is mapped to `SMALLINT` and `FloatType` is mapped to `REAL`, per the [JDBC mapping](https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017).

ShortType and FloatType were not correctly mapped to the right JDBC types when using the JDBC connector, which resulted in tables and Spark data frames being created with unintended types. The issue was observed when validating against SQL Server. Refer to the [JDBC mapping](https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017) for guidance on mappings between SQL Server, JDBC, and Java. Note that the Java `Short` type should be mapped to JDBC `SMALLINT` and the Java `Float` type to JDBC `REAL`.

Some example issues that can happen because of the wrong mappings:
- A write from a data frame with a ShortType column results in a SQL table with the column typed as INTEGER instead of SMALLINT, thus a larger table than expected.
- A read results in a data frame with IntegerType instead of ShortType.
- ShortType has a problem on both the write and read paths.
- FloatType only has an issue on the read path. On the write path, the Spark data type `FloatType` is correctly mapped to the JDBC equivalent `REAL`, but on the read path, when JDBC data types are converted to Catalyst data types (`getCatalystType`), `REAL` gets incorrectly mapped to `DoubleType` rather than `FloatType`.

Refer to apache#28151, which contained this fix as one part of a larger PR. Following the apache#28151 discussion, it was decided to file separate PRs for each of the fixes.

## How was this patch tested?

Unit tests were added in JDBCSuite.scala. The integration test in MsSqlServerDialect.scala was updated and passed. E2E testing was done with SQL Server.

Closes apache#25146 from shivsood/float_short_type_fix.

Authored-by: shivsood <shivsood@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
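For context, a minimal sketch of what the corrected mappings look like against Spark's `JdbcDialect` API (illustrative only; the merged `MsSqlServerDialect` also handles other types such as `datetimeoffset`):

```scala
import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types._

// Illustrative sketch of the corrected mappings, not the merged code.
object MsSqlServerDialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  // Read path (getCatalystType): map JDBC types reported by the driver to Catalyst types.
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    sqlType match {
      case Types.SMALLINT => Some(ShortType) // previously fell through to IntegerType
      case Types.REAL     => Some(FloatType) // previously resolved to DoubleType
      case _              => None            // defer to the generic mapping
    }

  // Write path (getJDBCType): map Catalyst types to SQL Server column definitions.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case ShortType => Some(JdbcType("SMALLINT", Types.SMALLINT)) // previously INTEGER
    case _         => None
  }
}
```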
ok to test
Test build #108132 has finished for PR 25248 at commit
+1, LGTM. Thank you, @shivsood.
Merged to branch-2.4. (The integration test part was also tested manually.)
… REAL for MsSqlServerDialect

## What changes were proposed in this pull request?

This is a backport of SPARK-28152 to Spark 2.4.

The SPARK-28152 PR corrects mappings in `MsSqlServerDialect`: `ShortType` is mapped to `SMALLINT` and `FloatType` is mapped to `REAL`, per the [JDBC mapping](https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017).

ShortType and FloatType were not correctly mapped to the right JDBC types when using the JDBC connector, which resulted in tables and Spark data frames being created with unintended types. The issue was observed when validating against SQL Server. Refer to the [JDBC mapping](https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017) for guidance on mappings between SQL Server, JDBC, and Java. Note that the Java `Short` type should be mapped to JDBC `SMALLINT` and the Java `Float` type to JDBC `REAL`.

Some example issues that can happen because of the wrong mappings:
- A write from a data frame with a ShortType column results in a SQL table with the column typed as INTEGER instead of SMALLINT, thus a larger table than expected.
- A read results in a data frame with IntegerType instead of ShortType.
- ShortType has a problem on both the write and read paths.
- FloatType only has an issue on the read path. On the write path, the Spark data type `FloatType` is correctly mapped to the JDBC equivalent `REAL`, but on the read path, when JDBC data types are converted to Catalyst data types (`getCatalystType`), `REAL` gets incorrectly mapped to `DoubleType` rather than `FloatType`.

Refer to #28151, which contained this fix as one part of a larger PR. Following the #28151 discussion, it was decided to file separate PRs for each of the fixes.

## How was this patch tested?

Unit tests were added in JDBCSuite.scala. The integration test in MsSqlServerDialect.scala was updated and passed. E2E testing was done with SQL Server.

Closes #25248 from shivsood/PR_28152_2.4.

Authored-by: shivsood <shivsood@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Thanks @dongjoon-hyun
@shivsood @dongjoon-hyun We should avoid backporting this PR to the maintenance releases. Users will hit a weird error.
They have to drop the JDBC table stored in the metastore and recreate it to bypass the issue. When you review similar PRs, please help stop such PRs from being merged. @cloud-fan @maropu @HyukjinKwon @srowen @dongjoon-hyun
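To make the concern concrete, here is a hypothetical roundtrip sketch (the connection URL and table name are invented for illustration):

```scala
import java.util.Properties

import org.apache.spark.sql.SparkSession

// Hypothetical repro of the compatibility concern; URL and table are invented.
object ShortTypeRoundtrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("roundtrip-sketch").getOrCreate()
    import spark.implicits._

    val url = "jdbc:sqlserver://host;databaseName=db"
    val props = new Properties()

    // Under the old mapping (<= 2.4.3), this ShortType column is created as INTEGER.
    Seq(1.toShort, 2.toShort).toDF("v").write.jdbc(url, "dbo.metrics", props)

    // Under the new mapping (2.4.4), Spark expects SMALLINT for ShortType, so a
    // table created by an older release no longer matches the expected schema and
    // may have to be dropped and recreated.
    spark.read.jdbc(url, "dbo.metrics", props).printSchema()
  }
}
```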
Sure.
Agree, this behavior change isn't what we want to back-port. It wasn't obvious this would happen, but knowing this, it definitely looks like it should be reverted.
Yeah, let's revert.
@dongjoon-hyun is on vacation IIRC. Let me revert it first.
Hang on a tick. I think there is some question about whether this should be behind a flag? Let me get the conversation online here.
Oops, sorry, I already did. I realised that this is already part of Spark 2.4.4.
Yeah, on the flip side, reverting this is also a breaking change when it's already released ...
Okay, I reverted my revert just now. We can discuss more here.
Thanks for pointing this out. I wonder how I missed this point on backward compatibility of existing tables. I agree this should be reverted.
Yeah, it should be reverted ... but I realised that reverting this is also a breaking change compared to 2.4.5 (if released). If somebody creates a table in Spark 2.4.4, they might get a different type in Spark 2.4.5. It might matter in a roundtrip (e.g., …)
cc @zsxwing since I also discussed this with him offline.
Yep. As reverting this can cause another behavior change, we should avoid doing this in 2.4.5. It's better to just add a flag in 2.4.5 to allow users to choose the old behavior.
Yea, I also think we need to add a flag to cover the issue.
…ema mismatched

### What changes were proposed in this pull request?

Issue a better error message when the user-specified schema does not match the relation schema.

### Why are the changes needed?

Inspired by #25248 (comment), a user could get a weird error message when the type mapping behavior changes between the Spark schema and the datasource schema (e.g. JDBC). Instead of saying "SomeProvider does not allow user-specified schemas.", we'd better tell the user what is really happening here to make the error clearer.

### Does this PR introduce any user-facing change?

Yes, users will see error message changes.

### How was this patch tested?

Updated existing tests.

Closes #26781 from Ngone51/dev-mismatch-schema.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Hi, @shivsood. Could you make a PR according to the above advice?
Gentle ping, @shivsood.
The latest example is the following.
Since there is no ETA from @shivsood, I'll take over the follow-up.
I will be taking this up this week. ETA: end of the coming week.
Regards,
Shiv
@shivsood. If you read the dev mailing list, it's scheduled tomorrow and I'm the release manager of Apache Spark 2.4.5. We need the patch tonight. I'm working on this for you. Never mind.
Thanks.
I am sorry about my oversight on the date.
Regards,
Shiv
Oh. It's okay. No problem. I mean I'm preparing the PR for this follow-up request. So, you don't need to worry about the schedule. :) Usually, the author and the committer should share the responsibility.
Thanks a lot!
Regards,
Shiv
The follow-up PR is ready.
…lect numeric mapping

### What changes were proposed in this pull request?

This is a follow-up for #25248.

### Why are the changes needed?

The new behavior cannot access an existing table which was created by the old behavior. This PR provides a way for existing users to avoid the new behavior.

### Does this PR introduce any user-facing change?

Yes. This will fix the broken behavior on existing tables.

### How was this patch tested?

Pass the Jenkins and manually run the JDBC integration tests.

```
build/mvn install -DskipTests
build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 test
```

Closes #27184 from dongjoon-hyun/SPARK-28152-CONF.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
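For existing users, a sketch of opting back into the old behavior via the legacy flag this follow-up introduces; the flag name below is inferred from the follow-up and should be verified against the merged code:

```scala
import org.apache.spark.sql.SparkSession

// Usage sketch; verify the exact flag name against the merged SPARK-28152 follow-up.
val spark = SparkSession.builder()
  .appName("legacy-mssql-numeric-mapping")
  // Restores the pre-fix behavior (ShortType written as INTEGER, REAL read as
  // DoubleType) so tables created by older releases keep working.
  .config("spark.sql.legacy.mssqlserver.numericMapping.enabled", "true")
  .getOrCreate()
```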
What changes were proposed in this pull request?
This is a backport of SPARK-28152 to Spark 2.4.
SPARK-28152 aims to correct mappings in `MsSqlServerDialect`: `ShortType` is mapped to `SMALLINT` and `FloatType` is mapped to `REAL`, per the JDBC mapping.

ShortType and FloatType are not correctly mapped to the right JDBC types when using the JDBC connector. This results in tables and Spark data frames being created with unintended types. The issue was observed when validating against SQL Server.
Refer to the JDBC mapping for guidance on mappings between SQL Server, JDBC, and Java. Note that the Java "Short" type should be mapped to JDBC "SMALLINT" and the Java "Float" type to JDBC "REAL".
Some example issues that can happen because of the wrong mappings:
- A write from a data frame with a ShortType column results in a SQL table with the column typed as INTEGER instead of SMALLINT, thus a larger table than expected.
- A read results in a data frame with IntegerType instead of ShortType.
Refer to #28151, which contained this fix as one part of a larger PR. Following the #28151 discussion, it was decided to file separate PRs for each of the fixes.
How was this patch tested?
Unit tests were added in JDBCSuite.scala.
The integration test in MsSqlServerDialect.scala was updated and passed.
E2E testing was done with SQL Server.
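For illustration, a minimal standalone check mirroring the kind of assertion such a unit test makes (hypothetical; the merged JDBCSuite test may be structured differently):

```scala
import java.sql.Types

import org.apache.spark.sql.jdbc.JdbcDialects
import org.apache.spark.sql.types.{FloatType, MetadataBuilder, ShortType}

// Hypothetical check of the corrected dialect mappings; not the merged test.
object MsSqlServerMappingCheck {
  def main(args: Array[String]): Unit = {
    val dialect = JdbcDialects.get("jdbc:sqlserver://host;databaseName=db")
    val md = new MetadataBuilder()

    // Read path: SMALLINT -> ShortType, REAL -> FloatType.
    assert(dialect.getCatalystType(Types.SMALLINT, "SMALLINT", 1, md).contains(ShortType))
    assert(dialect.getCatalystType(Types.REAL, "REAL", 1, md).contains(FloatType))

    // Write path: ShortType -> SMALLINT.
    assert(dialect.getJDBCType(ShortType).map(_.databaseTypeDefinition).contains("SMALLINT"))
  }
}
```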