[SPARK-54205][CONNECT] Supports Decimal type data in SparkConnectResultSet #52947
Conversation
@pan3793 I see you have been working on this story, perhaps it's better that you review it.
```scala
case DoubleType => 24
case StringType =>
  getPrecision(field)
case DecimalType.Fixed(p, s) => p + (if (s == 0) 0 else 1)
```
I think we should consider more cases, for example,
```scala
// scale = precision => only fractional digits. +3 for decimal point, sign and leading zero
case DecimalType.Fixed(p, s) if s == p => p + 3
// scale = 0 => only integral part. +1 for sign
case DecimalType.Fixed(p, s) if s == 0 => p + 1
// scale > 0 => both integral and fractional part. +2 for sign and decimal point
case DecimalType.Fixed(p, s) => p + 2
```
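For intuition, a tiny sanity check of the worst-case rendered widths these three cases encode (the sample values below are made up for illustration):

```scala
// Worst-case rendered widths for each DecimalType shape (illustrative values only).
assert("-0.12345".length == 5 + 3) // DECIMAL(5,5): sign + leading zero + decimal point
assert("-12345".length == 5 + 1)   // DECIMAL(5,0): sign only
assert("-123.45".length == 5 + 2)  // DECIMAL(5,2): sign + decimal point
```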
Oh, that's a good point, I totally forgot the sign and the leading-zero scenarios. I will also add some test cases to cover this.
Updated the logic for the edge cases, and added a few test cases to cover them.
@cty123 thank you for working on this new feature.
…nnect/client/jdbc/util/JdbcTypeUtils.scala Co-authored-by: Cheng Pan <pan3793@gmail.com>
```scala
case StringType =>
  getPrecision(field)
case DecimalType.Fixed(p, s) => p + (if (s == 0) 0 else 1)
// precision + sign(+/-) + leading zero + decimal point, like DECIMAL(5,5) = -0.12345
```
The `-` sign displays, but `+` does not.
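For example (illustrative values):

```scala
// java.math.BigDecimal renders a leading '-' but never a leading '+'.
assert(new java.math.BigDecimal("-1.5").toPlainString == "-1.5")
assert(new java.math.BigDecimal("+1.5").toPlainString == "1.5")
```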
I updated the comment to make it clearer that the extra character is for the negative sign.
pan3793 left a comment:
LGTM, cc @LuciferYang, could you take a look?
```diff
-override def getBigDecimal(columnIndex: Int): java.math.BigDecimal =
-  throw new SQLFeatureNotSupportedException
+override def getBigDecimal(columnIndex: Int): java.math.BigDecimal = {
+  if (currentRow.isNullAt(columnIndex - 1)) {
```
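For context, a minimal sketch of how the complete method might read, assuming `currentRow` is the underlying Spark `Row` and a `_wasNull` flag backs `wasNull()` (the actual field names in `SparkConnectResultSet` may differ):

```scala
override def getBigDecimal(columnIndex: Int): java.math.BigDecimal = {
  if (currentRow.isNullAt(columnIndex - 1)) {
    _wasNull = true // assumed flag backing ResultSet.wasNull()
    null
  } else {
    _wasNull = false
    // Spark's Row.getDecimal already returns java.math.BigDecimal, so the value can be passed through.
    currentRow.getDecimal(columnIndex - 1)
  }
}
```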
- Should we first check `checkOpen()`?
- Should we ensure that `columnIndex` is not out of bounds?
also cc @pan3793
Makes sense, but I think we can defer this to an independent patch, since other get* methods have the same issue.
1. I am experimenting with this against other JDBC driver implementations.
2. Currently, if the index is out of bounds, it throws an out-of-bounds exception:
```
[info] - get decimal type *** FAILED *** (78 milliseconds)
[info]   java.lang.ArrayIndexOutOfBoundsException: Index 998 out of bounds for length 1
[info]   at org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:37)
[info]   at org.apache.spark.sql.Row.isNullAt(Row.scala:216)
[info]   at org.apache.spark.sql.Row.isNullAt$(Row.scala:216)
[info]   at org.apache.spark.sql.catalyst.expressions.GenericRow.isNullAt(rows.scala:28)
```
Do we still need to check it? Or should we catch it and wrap it in another exception?

And yes, I think we can have a separate PR to address the two issues. For the second one, if we check the bounds ourselves, I would do it inside the `isNullAt` function instead of calling another checker function inside each getter function.
Handling this issue uniformly in a separate PR is fine for me.
I just tested, you are right.
1. For every getter function, if the statement is closed, the ResultSet should be unusable. I have verified this with the MySQL driver and the PostgreSQL driver.
2. Right now, when the index goes out of bounds, it throws `java.lang.ArrayIndexOutOfBoundsException`, but based on the specification of `java.sql.ResultSet`, which is implemented by the `SparkConnectResultSet` class, it should throw `java.sql.SQLException`:

```
 * @throws SQLException if the columnIndex is not valid;
```

Maybe we can create a separate JIRA ticket to fix all the getter functions.
@cty123 thanks for investigating it. I created SPARK-53484 for this; please go ahead and create a PR to fix it if you'd like to.
…ltSet

### What changes were proposed in this pull request?

Spark Connect has supported the JDBC protocol with a few commonly used SQL data types, but it is currently missing support for Decimal data, which is also very commonly used to store monetary values. I would like to have it support the Decimal data type.

### Why are the changes needed?

Right now, a user is able to read Decimal data from SQL by converting the data to a string and then parsing the string into a Java BigDecimal object. But since the JDBC driver is already able to fetch the data as a Java BigDecimal, we can save the effort of converting it back and forth and instead pass through the data we obtain from the raw JDBC result set.

### Does this PR introduce _any_ user-facing change?

It's part of a new feature under Spark Connect JDBC support.

### How was this patch tested?

I have created a new unit test named **'get decimal type'** and it covers my changes. The test case also aligns with the tests for fetching other data types.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #52947 from cty123/cty123/support-spark-connect-decimaltype.

Lead-authored-by: cty123 <ctychen2216@gmail.com>
Co-authored-by: cty <ctychen2216@gmail.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
(cherry picked from commit 73b50e9)
Signed-off-by: yangjie01 <yangjie01@baidu.com>
@cty123 Could you please provide your Jira account name so that I can assign this ticket to you?
Hey, I don't have a JIRA account, maybe you can assign it to @pan3793?
@cty123, the Spark community uses this to credit your contribution. You can apply for a JIRA account; @LuciferYang should have permission for approval.
Thank you, @cty123 and all!
… checkOpen and check index boundary

### What changes were proposed in this pull request?

This PR aims to do a minor correction on the current get* functions of the `SparkConnectResultSet` class. As previously discussed in PR #52947:

> 1. For every getter function, if the statement is closed, the ResultSet should be unusable. I have verified this with the MySQL driver and the PostgreSQL driver.
>
> 2. Right now, when the index goes out of bounds, it throws `java.lang.ArrayIndexOutOfBoundsException`, but based on the specification of `java.sql.ResultSet`, which is implemented by the `SparkConnectResultSet` class, it should throw `java.sql.SQLException`:
>
> ```
>  * @throws SQLException if the columnIndex is not valid;
> ```

This PR proposes a unified wrapper function called `getColumnValue(columnIndex: Int)` that wraps the `checkOpen` call as well as the index boundary check.

### Why are the changes needed?

Currently the get* functions don't follow the expected behaviors of `java.sql.ResultSet`. It's technically not a big problem, but since `SparkConnectResultSet` aims to implement the `java.sql.ResultSet` interface, it should strictly follow the specification documented on the interface definition.

### Does this PR introduce _any_ user-facing change?

This PR is a small fix related to a new feature introduced recently.

### How was this patch tested?

I added 2 tests, each covering one of the bullet points above. These tests call all the get* functions inside the `SparkConnectResultSet` class to make sure the correct exception (java.sql.SQLException) is thrown.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #52988 from cty123/cty123/address-spark-connect-getters.

Lead-authored-by: cty123 <ctychen2216@gmail.com>
Co-authored-by: cty <ctychen2216@gmail.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
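A rough sketch of the idea behind that wrapper, not the exact code merged in #52988 (`currentRow`, the generic signature, and the error message are assumptions):

```scala
import java.sql.SQLException

// Unify the open-check and bounds-check that every getter needs.
private def getColumnValue[T](columnIndex: Int)(getter: Int => T): T = {
  checkOpen() // the ResultSet must not be used once the statement is closed
  if (columnIndex < 1 || columnIndex > currentRow.length) {
    // JDBC mandates SQLException rather than ArrayIndexOutOfBoundsException here
    throw new SQLException(s"The column index is out of range: $columnIndex")
  }
  getter(columnIndex - 1) // Spark Row is 0-based, JDBC columns are 1-based
}
```

A getter such as `getBigDecimal` could then delegate to it, for example `getColumnValue(columnIndex)(currentRow.getDecimal)`, keeping its null handling on top.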
### What changes were proposed in this pull request?

Spark Connect has supported the JDBC protocol with a few commonly used SQL data types, but it is currently missing support for Decimal data, which is also very commonly used to store monetary values. I would like to have it support the Decimal data type.
### Why are the changes needed?

Right now, a user is able to read Decimal data from SQL by converting the data to a string and then parsing the string into a Java BigDecimal object. But since the JDBC driver is already able to fetch the data as a Java BigDecimal, we can save the effort of converting it back and forth and instead pass through the data we obtain from the raw JDBC result set.
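For example, from a client's point of view (illustrative code; the connection URL and query are hypothetical and not taken from the test suite):

```scala
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:sc://localhost:15002") // hypothetical URL
val stmt = conn.createStatement()

// Old workaround: cast to string server-side, parse back into BigDecimal client-side.
val rs1 = stmt.executeQuery("SELECT CAST(CAST(1234.56 AS DECIMAL(10, 2)) AS STRING)")
rs1.next()
val parsed = new java.math.BigDecimal(rs1.getString(1))

// With this change: fetch the DECIMAL column directly as java.math.BigDecimal.
val rs2 = stmt.executeQuery("SELECT CAST(1234.56 AS DECIMAL(10, 2))")
rs2.next()
val direct = rs2.getBigDecimal(1)

assert(parsed.compareTo(direct) == 0)
```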
### Does this PR introduce _any_ user-facing change?

It's part of a new feature under Spark Connect JDBC support.
### How was this patch tested?

I have created a new unit test named **'get decimal type'** and it covers my changes. The test case also aligns with the tests for fetching other data types.
### Was this patch authored or co-authored using generative AI tooling?

No