
[SPARK-28151][SQL] Fix MsSqlServerDialect Byte/Short/Float type mappings (DRAFT) #24969

Closed
wants to merge 6 commits into from

Conversation

shivsood
Contributor

@shivsood shivsood commented Jun 25, 2019

Abandoning this PR. Will create new ones with specific fixes:

  1. PR for Float Type and ShortType Fix
  2. PR for ByteType Fix.

Fix: ByteType, ShortType and FloatType are not correctly mapped for read/write of SQL Server tables.
### ByteType issue
Writing a dataframe with a column of type ByteType fails when using the JDBC connector for SQL Server. Append and read of tables also fail. The problem is due to:

  1. (Write path) Incorrect mapping of ByteType in getCommonJDBCType() in JdbcUtils.scala, where ByteType gets mapped to the text "BYTE" when it should be mapped to TINYINT:
    case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT))

In getCatalystType() (the JDBC-to-Catalyst type mapping), TINYINT is mapped to INTEGER, while it should be mapped to ByteType. Mapping to INTEGER is OK from the point of view of upcasting, but leads to a 4-byte allocation rather than 1 byte for ByteType.

  2. (Read path) The read path ends up calling makeGetter(dt: DataType, metadata: Metadata). The function sets the value in the RDD row per the data type. There is no mapping for ByteType here, so reads will result in an error once getCatalystType() is fixed.

Note: These issues were found when reading/writing with SQL Server. Will be submitting a PR soon to fix these mappings in MsSqlServerDialect.

Error seen when writing a table:

(JDBC Write failed,com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or variable #2: Cannot find data type BYTE.)
com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or variable #2: Cannot find data type BYTE.
com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:859)
..
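The corrected ByteType mappings described above can be sketched as follows. This is a minimal, hedged illustration: `DataType`, `JdbcType` and `TypeMapping` are simplified stand-ins for this example, not Spark's actual internals.

```scala
// Minimal sketch of the intended ByteType mappings. DataType, JdbcType and
// TypeMapping are simplified stand-ins for illustration, not Spark's classes.
sealed trait DataType
case object ByteType extends DataType
case object IntegerType extends DataType

// databaseTypeDefinition is the SQL text emitted in DDL; jdbcNullType is the
// java.sql.Types constant used when binding nulls.
case class JdbcType(databaseTypeDefinition: String, jdbcNullType: Int)

object TypeMapping {
  // Write path: Catalyst type -> SQL Server column type.
  // The bug was emitting "BYTE", which SQL Server does not recognize.
  def getJdbcType(dt: DataType): Option[JdbcType] = dt match {
    case ByteType => Some(JdbcType("TINYINT", java.sql.Types.TINYINT))
    case _        => None
  }

  // Read path: JDBC type -> Catalyst type.
  // Mapping TINYINT to IntegerType "works" via upcasting, but allocates
  // 4 bytes per value instead of 1.
  def getCatalystType(sqlType: Int): Option[DataType] = sqlType match {
    case java.sql.Types.TINYINT => Some(ByteType)
    case _                      => None
  }
}
```

With this shape, `getJdbcType(ByteType)` yields DDL text `TINYINT` on write, and a TINYINT column reads back as `ByteType` rather than a 4-byte integer.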

### ShortType and FloatType issue
ShortType and FloatType are not correctly mapped to the right JDBC types when using the JDBC connector. This results in tables and Spark dataframes being created with unintended types.

Some example issues:

Writing from a dataframe with a ShortType column results in a SQL table with column type INTEGER as opposed to SMALLINT, and thus a larger table than expected.
Reading results in a dataframe with type IntegerType as opposed to ShortType.

FloatType has an issue on the read path. On the write path, the Spark data type FloatType is correctly mapped to the JDBC-equivalent data type REAL. But on the read path, when JDBC data types are converted to Catalyst data types (getCatalystType), REAL gets incorrectly mapped to DoubleType rather than FloatType.
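The ShortType and FloatType behavior described above can be sketched the same way (a hedged, self-contained illustration with simplified stand-in types, not Spark's actual code):

```scala
// Sketch of the Short/Float fixes: SMALLINT round-trips as a 2-byte short,
// and REAL reads back as a 4-byte FloatType, not an 8-byte DoubleType.
// CatalystType and the mapping object are simplified stand-ins.
object ShortFloatMapping {
  sealed trait CatalystType
  case object ShortType  extends CatalystType
  case object FloatType  extends CatalystType
  case object DoubleType extends CatalystType

  // Write path: ShortType should produce SMALLINT, not INTEGER.
  def toDatabaseType(dt: CatalystType): Option[String] = dt match {
    case ShortType => Some("SMALLINT")
    case FloatType => Some("REAL")
    case _         => None
  }

  // Read path: REAL -> FloatType (the bug mapped it to DoubleType).
  def getCatalystType(sqlType: Int): Option[CatalystType] = sqlType match {
    case java.sql.Types.SMALLINT => Some(ShortType)
    case java.sql.Types.REAL     => Some(FloatType)
    case java.sql.Types.DOUBLE   => Some(DoubleType)
    case _                       => None
  }
}
```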

What changes were proposed in this pull request?

(MsSqlServerDialect.scala) ByteType is mapped to TINYINT; ShortType is mapped to SMALLINT.
(JdbcUtils.scala) A ByteType case is added to makeGetter to enable reading after ByteType is translated to TINYINT.
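The makeGetter change can be sketched roughly as below. This is an illustrative stand-in: the getter type alias and object name are simplified for the example and do not reproduce Spark's actual makeGetter signature.

```scala
import java.sql.ResultSet

// Illustrative sketch of the read-path getter dispatch: once TINYINT columns
// map to ByteType, makeGetter needs a ByteType case or reads fail.
object GetterSketch {
  sealed trait DataType
  case object ByteType  extends DataType
  case object ShortType extends DataType

  type JDBCValueGetter = (ResultSet, Int) => Any

  def makeGetter(dt: DataType): JDBCValueGetter = dt match {
    // The case this PR's description adds; JDBC columns are 1-indexed,
    // hence pos + 1 when reading from the ResultSet.
    case ByteType  => (rs: ResultSet, pos: Int) => rs.getByte(pos + 1)
    case ShortType => (rs: ResultSet, pos: Int) => rs.getShort(pos + 1)
  }
}
```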

How was this patch tested?

Unit test - Added and passed.
Integration Test - Tested end to end with SQLServer using JDBC connector.


…ta sources. Relevant unit tests are added.

@shivsood shivsood changed the title [SPARK-28151] ByteType is not correctly mapped for read/write of SQLServer tables [SPARK-28151] ByteType, ShortType and FloatTypes are not correctly mapped for read/write of SQLServer tables Jun 25, 2019
Member

@srowen srowen left a comment

I don't know SQL Server, so can't really evaluate it, but seems like a plausible change.

@@ -39,6 +45,8 @@ private object MsSqlServerDialect extends JdbcDialect {
case StringType => Some(JdbcType("NVARCHAR(MAX)", java.sql.Types.NVARCHAR))
case BooleanType => Some(JdbcType("BIT", java.sql.Types.BIT))
case BinaryType => Some(JdbcType("VARBINARY(MAX)", java.sql.Types.VARBINARY))
case ByteType => Option(JdbcType("TINYINT", java.sql.Types.TINYINT))
case ShortType => Option(JdbcType("SMALLINT", java.sql.Types.SMALLINT))
Member

Does FloatType need to map back to REAL or does that already happen?

Contributor Author

Yes, FloatType is correctly mapped to REAL in def getCommonJDBCType(dt: DataType)

…NT and FLOAT in MsSqlServerIntegrationSuite.scala. 2. code cleanup in MsSqlServerDialect.scala 3. Fixed issue in JdbcUtils where tinyInt was being written as Int and failing the 'basic write' test cases in IntegrationSuite.
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28151] ByteType, ShortType and FloatTypes are not correctly mapped for read/write of SQLServer tables [SPARK-28151][SQL] ByteType, ShortType and FloatTypes are not correctly mapped for read/write of SQLServer tables Jul 5, 2019
@dongjoon-hyun
Member

ok to test

@SparkQA

SparkQA commented Jul 5, 2019

Test build #107268 has finished for PR 24969 at commit b874527.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Hi, @shivsood . Thank you for your contribution.
Could you run dev/scalastyle in your environment and fix all errors?

Currently, this PR has the following.

========================================================================
Running Scala style checks
========================================================================
[info] Checking Scala style using SBT with these profiles:  -Phadoop-2.7 -Pkubernetes -Phive-thriftserver -Phadoop-cloud -Pkinesis-asl -Pyarn -Pspark-ganglia-lgpl -Phive -Pmesos
Scalastyle checks failed at following occurrences:
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:909:68: No space after token ,
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:909:78: No space after token ,
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:909:80: No space after token ,
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:911:69: No space after token ,
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:911:80: No space after token ,
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:911:82: No space after token ,
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:913:65: No space after token ,
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:913:72: No space after token ,
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala:913:74: No space after token ,
[error] Total time: 27 s, completed Jul 5, 2019 1:08:48 AM

@SparkQA

SparkQA commented Jul 8, 2019

Test build #107364 has finished for PR 24969 at commit 8a7d960.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivsood
Contributor Author

shivsood commented Jul 8, 2019

Test build #107364 has finished for PR 24969 at commit 8a7d960.

* This patch **fails to build**.

* This patch merges cleanly.

* This patch adds no public classes.

Resolved this. I was not using this exact command below to build locally and thus did not see this. I have to find the right/strict way to build that shows all errors locally.

[info] Done packaging.
java.lang.RuntimeException: 1 fatal warnings
at scala.sys.package$.error(package.scala:27)
at SparkBuild$$anonfun$sharedSettings$21.apply(SparkBuild.scala:315)
at SparkBuild$$anonfun$sharedSettings$21.apply(SparkBuild.scala:284)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
at sbt.std.Transform$$anon$4.work(System.scala:63)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
at sbt.Execute.work(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[error] (sql/compile:compile) 1 fatal warnings
[error] Total time: 161 s, completed Jul 8, 2019 10:57:09 AM
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/build/sbt -Phadoop-2.7 -Pkubernetes -Phive-thriftserver -Phadoop-cloud -Pkinesis-asl -Pyarn -Pspark-ganglia-lgpl -Phive -Pmesos test:package streaming-kinesis-asl-assembly/assembly ; received return code 1
Attempting to post to Github...

Post successful.
Build step 'Execute shell' marked build as failure
Archiving artifacts
Recording test results
ERROR: Step 'Publish JUnit test result report' failed: No test report files were found. Configuration error?
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/107364/
Test FAILed.
Finished: FAILURE

@SparkQA

SparkQA commented Jul 8, 2019

Test build #4816 has finished for PR 24969 at commit 8a7d960.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

…hould have seen this as a build failure earlier.
@SparkQA

SparkQA commented Jul 8, 2019

Test build #107370 has finished for PR 24969 at commit 6bf6e3d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28151][SQL] ByteType, ShortType and FloatTypes are not correctly mapped for read/write of SQLServer tables [SPARK-28151][SQL] Fix MsSqlServerDialect Byte/Short/Float type mappings Jul 9, 2019
@@ -550,7 +550,7 @@ object JdbcUtils extends Logging {

case ByteType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
-    stmt.setInt(pos + 1, row.getByte(pos))
+    stmt.setByte(pos + 1, row.getByte(pos))
Member

@dongjoon-hyun dongjoon-hyun Jul 9, 2019

This is orthogonal to MsSqlServerDialect. Can we proceed with this PR without this line?

Member

Or, we can proceed with this first as another PR with a proper regression test.

Contributor Author

@dongjoon-hyun The JdbcUtils.scala fix is required for completeness of the ByteType fix. Without this fix, the integration tests in "external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala" will fail.

I don't think the MsSqlServerIntegrationSuite.scala integration tests are running in CI/CD. I was running these locally following a review suggestion. If that's true, I can still remove this fix and submit it as part of another PR.

Member

@dongjoon-hyun dongjoon-hyun Jul 9, 2019

Then, you need to follow the second option I gave. Please make another PR. We need to review that first before this PR.

Or, we can proceed with this first as another PR with a proper regression test.

Contributor Author

@dongjoon-hyun Can you please elaborate on the proper regression test? I have test cases added/fixed for ByteType in JDBCSuite.scala and MsSqlServerIntegrationSuite.scala. Are you expecting some other test? Can you please point me to the type of test?

Another way to cleanly split this PR would be as follows.

  1. Remove the ByteType fix from this PR; the jdbcutils.scala fix also gets removed. This PR will then only have the fix for FloatType and ShortType.
  2. A 2nd PR with the complete ByteType fix and a more elaborate integration test.

I prefer this for better separation of concerns in each of the PRs, while still retaining completeness. Let me know what you think.

Thanks for your patience and very helpful guidance.

Member

What I meant is like your 2nd option.

2nd PR with complete ByteType fix with more elaborate integration test.

Here, the proper regression test means a test causing failure for the other dialects (e.g. PostgreSQL/MySQL/...), because JdbcUtils is shared across dialects.

Contributor Author

@dongjoon-hyun Will create 2 PRs then. In PR1 (this PR) I will remove the ByteType and JdbcUtils fix. Will create PR2 for the ByteType and JdbcUtils fix with a full integration test. PR2 will take time as I have to investigate how to test for other dialects.

Member

Yep. Please try as you want. Then, ping me later when the PR is ready.

@dongjoon-hyun
Member

@shivsood. The JdbcUtils.scala change should be handled in a separate PR.

@SparkQA

SparkQA commented Jul 9, 2019

Test build #107412 has finished for PR 24969 at commit f2700eb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jul 12, 2019

@shivsood do you want to update this PR accordingly?

@shivsood shivsood closed this Jul 12, 2019
@shivsood shivsood reopened this Jul 12, 2019
@shivsood shivsood changed the title [SPARK-28151][SQL] Fix MsSqlServerDialect Byte/Short/Float type mappings [SPARK-28151][SQL] Fix MsSqlServerDialect Byte/Short/Float type mappings ( DRAFT) Jul 12, 2019
@shivsood
Contributor Author

@shivsood do you want to update this PR accordingly?
I'll abandon this PR and start with a new/clean one. I have updated the summary.

@shivsood shivsood closed this Jul 12, 2019
@srowen
Member

srowen commented Jul 15, 2019

@shivsood did you open a new PR?
This one looked pretty fine overall, minus moving out some JdbcUtils changes. I don't want to lose your improvement.

@dongjoon-hyun
Member

Yes. We are working on #25146 .

@shivsood
Contributor Author

@srowen Working on #25146 for the FloatType and ShortType change. Will raise a PR for the ByteType change this week (it needs more testing and I am finding my way through adding an E2E regression test).

@nkronenfeld
Contributor

Was a PR ever submitted on the Byte/TinyInt problem? If not, does anyone know why? We are still getting burned by this.

@frgomes

frgomes commented Jul 7, 2020

@dongjoon-hyun @shivsood: I'm getting the exception below, which makes me wonder whether the fix was applied:

com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or variable #7: Cannot find data type BYTE.

Was the fix for ByteType ever submitted/applied? To which branch, please?

Thanks

@srowen
Member

srowen commented Jul 7, 2020

#25146? See above.

@frgomes

frgomes commented Jul 7, 2020

@srowen: #25146 only covers ShortType (SMALLINT) and FloatType (REAL). ByteType (TINYINT) seems to be covered here and only here. However, I'm facing an issue which makes me believe that the fix is somehow not implemented in the runtime I'm using.

@frgomes

frgomes commented Jul 7, 2020

@srowen: Feedback: as suspected, the issue was between the keyboard and the chair. Maybe I was not casting properly, or maybe a typo, or maybe some equality test somewhere... whatever. When I cast like this: .cast(pyspark.sql.types.ByteType()), it works.
