[SPARK-23291][SQL][R][BRANCH-2.3] R's substr should not reduce starting position by 1 when calling Scala API #21250

HyukjinKwon · 2018-05-06T09:03:02Z

What changes were proposed in this pull request?

This PR backports 24b5c69 and #21249.

There's no conflict, but I opened this just to run the tests and make sure.

See the discussion in https://issues.apache.org/jira/browse/SPARK-23291

How was this patch tested?

Jenkins tests.

…by 1 when calling Scala API

## What changes were proposed in this pull request?

It seems R's substr API treats the Scala substr API as zero-based and so subtracts 1 from the given starting position.

Because Scala's substr API also accepts a starting position of zero (treated as the first element), the current R substr test results are correct, as they all use 1 as the starting position.

## How was this patch tested?

Modified tests.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes apache#20464 from viirya/SPARK-23291.
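As a rough illustration of the off-by-one described in the commit message above, the following Python sketch models why tests that start at position 1 kept passing while other start positions returned shifted results. It is not SparkR or Spark code. The function names `jvm_substr`, `sparkr_substr_old` and `sparkr_substr_fixed` are hypothetical, and the length computation `stop - start + 1` is an assumption about how R's `(start, stop)` pair is mapped onto a `(position, length)` pair. Negative positions are deliberately not modeled.

```python
def jvm_substr(s: str, pos: int, length: int) -> str:
    """Toy model of a 1-based substring that treats position 0 like position 1.

    Assumption: this mirrors the behaviour described in the commit message;
    negative positions are out of scope for this sketch.
    """
    if pos < 0:
        raise ValueError("negative positions are not modeled in this sketch")
    begin = max(pos, 1) - 1          # convert the 1-based position to a 0-based index
    return s[begin:begin + max(length, 0)]


def sparkr_substr_old(s: str, start: int, stop: int) -> str:
    """Model of the 2.3.0 behaviour: start is reduced by 1 before the call."""
    return jvm_substr(s, start - 1, stop - start + 1)


def sparkr_substr_fixed(s: str, start: int, stop: int) -> str:
    """Model of the 2.3.1 behaviour: start is passed through unchanged."""
    return jvm_substr(s, start, stop - start + 1)


if __name__ == "__main__":
    # With start = 1 both variants agree, which is why the old tests passed.
    print(sparkr_substr_old("abcdef", 1, 3), sparkr_substr_fixed("abcdef", 1, 3))
    # With start = 2 the old variant is shifted left by one character.
    print(sparkr_substr_old("abcdef", 2, 4), sparkr_substr_fixed("abcdef", 2, 4))
```

Under these assumptions, the `(2, 4)` case reproduces the `abc` versus `bcd` results reported later in this thread, while the `(1, 3)` case gives the same string from both variants.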

HyukjinKwon · 2018-05-06T09:03:18Z

cc @cloud-fan, @yanboliang, @felixcheung and @viirya.

SparkQA · 2018-05-06T09:48:42Z

Test build #90272 has finished for PR 21250 at commit ffd4c7b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-05-07T07:33:52Z

docs/sparkr.md

@@ -663,3 +663,7 @@ You can inspect the search path in R with [`search()`](https://stat.ethz.ch/R-ma
 - The `stringsAsFactors` parameter was previously ignored with `collect`, for example, in `collect(createDataFrame(iris), stringsAsFactors = TRUE))`. It has been corrected.
 - For `summary`, option for statistics to compute has been added. Its output is changed from that from `describe`.
 - A warning can be raised if versions of SparkR package and the Spark JVM do not match.
+
+## Upgrading to Spark 2.3.1 and above


Spark -> SparkR

cloud-fan · 2018-05-07T07:38:27Z

docs/sparkr.md

+
+## Upgrading to Spark 2.3.1 and above
+
+ - The `start` parameter of `substr` method was wrongly subtracted by one, previously. In other words, the index specified by `start` parameter was considered as 0-base. This can lead to inconsistent substring results and also does not match with the behaviour with `substr` in R. It has been fixed so the `start` parameter of `substr` method is now 1-base, e.g., therefore to get the same result as `substr(df$a, 2, 5)`, it should be changed to `substr(df$a, 1, 4)`.


We should mention the version more explicitly, e.g.

In SparkR 2.3.0 and earlier, the `start` parameter ... In version 2.3.1 and later, ... As an example, `substr(lit('abcdef'), 2, 5)` would result in `abc` in SparkR 2.3.0, and in SparkR 2.3.1, the result would be ...

SparkQA · 2018-05-07T09:19:30Z

Test build #90310 has finished for PR 21250 at commit dd6c329.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-05-07T12:28:02Z

docs/sparkr.md

+
+## Upgrading to SparkR 2.3.1 and above
+
+ - In SparkR 2.3.0 and earlier, the `start` parameter of `substr` method was wrongly subtracted by one, previously. In other words, the index specified by `start` parameter was considered as 0-base. This can lead to inconsistent substring results and also does not match with the behaviour with `substr` in R. In version 2.3.1 and later, it has been fixed so the `start` parameter of `substr` method is now 1-base. As an example, `substr(lit('abcdef'), 2, 4))` would result to `abc` in SparkR 2.3.0, and the result would be `bcd` in SparkR 2.3.1.


Please make sure `substr(lit('abcdef'), 2, 4)` is valid in SparkR; I didn't check it against the SparkR documentation when writing it...

I checked it :)

For example: `collect(select(createDataFrame(iris), substr(lit('abcdef'), 2, 4)))`

Just double checked:

master:

> collect(select(createDataFrame(iris), substr(lit('abcdef'), 2, 4))) ... 1 bcd ...

2.3.0:

> collect(select(createDataFrame(iris), substr(lit('abcdef'), 2, 4))) ... 1 abc ...

viirya

LGTM

cloud-fan · 2018-05-07T14:21:35Z

docs/sparkr.md

+
+## Upgrading to SparkR 2.3.1 and above
+
+ - In SparkR 2.3.0 and earlier, the `start` parameter of `substr` method was wrongly subtracted by one, previously. In other words, the index specified by `start` parameter was considered as 0-base. This can lead to inconsistent substring results and also does not match with the behaviour with `substr` in R. In version 2.3.1 and later, it has been fixed so the `start` parameter of `substr` method is now 1-base. As an example, `substr(lit('abcdef'), 2, 4))` would result to `abc` in SparkR 2.3.0, and the result would be `bcd` in SparkR 2.3.1.


a little simplification:

the `start` parameter of `substr` method was wrongly subtracted by one and considered as 0-based. This leads to ...

cloud-fan · 2018-05-07T14:22:01Z

LGTM

SparkQA · 2018-05-07T16:15:25Z

Test build #90325 has finished for PR 21250 at commit a7c8037.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

+1, LGTM. :)

yanboliang

LGTM

…ng position by 1 when calling Scala API

## What changes were proposed in this pull request?

This PR backports 24b5c69 and #21249

There's no conflict but I opened this just to run the test and for sure.

See the discussion in https://issues.apache.org/jira/browse/SPARK-23291

## How was this patch tested?

Jenkins tests.

Author: hyukjinkwon <gurwls223@apache.org>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #21250 from HyukjinKwon/SPARK-23291-backport.

HyukjinKwon · 2018-05-08T00:02:10Z

Thanks, @cloud-fan, @yanboliang, @felixcheung, @dongjoon-hyun and @viirya.

gatorsmile · 2018-05-08T22:11:10Z

Thank you for writing the migration guide

viirya and others added 2 commits May 6, 2018 16:59

Update SparkR migration note for SPARK-23291

ffd4c7b

cloud-fan reviewed May 7, 2018

Address comments

dd6c329

cloud-fan reviewed May 7, 2018

viirya approved these changes May 7, 2018

cloud-fan reviewed May 7, 2018

Address a comment

a7c8037

dongjoon-hyun approved these changes May 7, 2018

yanboliang approved these changes May 7, 2018

HyukjinKwon closed this May 7, 2018

HyukjinKwon deleted the SPARK-23291-backport branch October 16, 2018 12:43
