-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-23291][SQL][R][BRANCH-2.3] R's substr should not reduce starting position by 1 when calling Scala API #21250
Conversation
…by 1 when calling Scala API ## What changes were proposed in this pull request? Seems R's substr API treats Scala substr API as zero based and so subtracts the given starting position by 1. Because Scala's substr API also accepts zero-based starting position (treated as the first element), so the current R's substr test results are correct as they all use 1 as starting positions. ## How was this patch tested? Modified tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes apache#20464 from viirya/SPARK-23291.
cc @cloud-fan, @yanboliang, @felixcheung and @viirya. |
Test build #90272 has finished for PR 21250 at commit
|
docs/sparkr.md
Outdated
@@ -663,3 +663,7 @@ You can inspect the search path in R with [`search()`](https://stat.ethz.ch/R-ma | |||
- The `stringsAsFactors` parameter was previously ignored with `collect`, for example, in `collect(createDataFrame(iris), stringsAsFactors = TRUE))`. It has been corrected. | |||
- For `summary`, option for statistics to compute has been added. Its output is changed from that from `describe`. | |||
- A warning can be raised if versions of SparkR package and the Spark JVM do not match. | |||
|
|||
## Upgrading to Spark 2.3.1 and above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark
-> SparkR
docs/sparkr.md
Outdated
|
||
## Upgrading to Spark 2.3.1 and above | ||
|
||
- The `start` parameter of `substr` method was wrongly subtracted by one, previously. In other words, the index specified by `start` parameter was considered as 0-base. This can lead to inconsistent substring results and also does not match with the behaviour with `substr` in R. It has been fixed so the `start` parameter of `substr` method is now 1-base, e.g., therefore to get the same result as `substr(df$a, 2, 5)`, it should be changed to `substr(df$a, 1, 4)`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should mention the version more explicitly, e.g.
In SparkR 2.3.0 and earlier, the `start` parameter ... In version 2.3.1 and later, ... As an example, `substr(lit('abcdef'), 2, 5)` would result to `abc` in SparkR 2.3.0, and in SparkR 2.3.1, the result would be ...
Test build #90310 has finished for PR 21250 at commit
|
docs/sparkr.md
Outdated
|
||
## Upgrading to SparkR 2.3.1 and above | ||
|
||
- In SparkR 2.3.0 and earlier, the `start` parameter of `substr` method was wrongly subtracted by one, previously. In other words, the index specified by `start` parameter was considered as 0-base. This can lead to inconsistent substring results and also does not match with the behaviour with `substr` in R. In version 2.3.1 and later, it has been fixed so the `start` parameter of `substr` method is now 1-base. As an example, `substr(lit('abcdef'), 2, 4))` would result to `abc` in SparkR 2.3.0, and the result would be `bcd` in SparkR 2.3.1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please make sure substr(lit('abcdef'), 2, 4))
is valid in Spark R, I didn't check it with Spark R document when writing it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like .. collect(select(createDataFrame(iris), substr(lit('abcdef'), 2, 4)))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just double checked:
master:
> collect(select(createDataFrame(iris), substr(lit('abcdef'), 2, 4)))
...
1 bcd
...
2.3.0:
> collect(select(createDataFrame(iris), substr(lit('abcdef'), 2, 4)))
...
1 abc
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
docs/sparkr.md
Outdated
|
||
## Upgrading to SparkR 2.3.1 and above | ||
|
||
- In SparkR 2.3.0 and earlier, the `start` parameter of `substr` method was wrongly subtracted by one, previously. In other words, the index specified by `start` parameter was considered as 0-base. This can lead to inconsistent substring results and also does not match with the behaviour with `substr` in R. In version 2.3.1 and later, it has been fixed so the `start` parameter of `substr` method is now 1-base. As an example, `substr(lit('abcdef'), 2, 4))` would result to `abc` in SparkR 2.3.0, and the result would be `bcd` in SparkR 2.3.1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a little simplification:
the `start` parameter of `substr` method was wrongly subtracted by one and considered as 0-based. This leads to ...
LGTM |
Test build #90325 has finished for PR 21250 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…ng position by 1 when calling Scala API ## What changes were proposed in this pull request? This PR backports 24b5c69 and #21249 There's no conflict but I opened this just to run the test and for sure. See the discussion in https://issues.apache.org/jira/browse/SPARK-23291 ## How was this patch tested? Jenkins tests. Author: hyukjinkwon <gurwls223@apache.org> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #21250 from HyukjinKwon/SPARK-23291-backport.
Thanks, @cloud-fan, @yanboliang, @felixcheung, @dongjoon-hyun and @viirya. |
Thank you for writing the migration guide |
What changes were proposed in this pull request?
This PR backports 24b5c69 and #21249
There's no conflict but I opened this just to run the test and for sure.
See the discussion in https://issues.apache.org/jira/browse/SPARK-23291
How was this patch tested?
Jenkins tests.