Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23291][SQL][R][BRANCH-2.3] R's substr should not reduce starting position by 1 when calling Scala API #21250

Closed

Conversation

HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented May 6, 2018

What changes were proposed in this pull request?

This PR backports 24b5c69 and #21249

There's no conflict but I opened this just to run the test and for sure.

See the discussion in https://issues.apache.org/jira/browse/SPARK-23291

How was this patch tested?

Jenkins tests.

viirya and others added 2 commits May 6, 2018 16:59
…by 1 when calling Scala API

## What changes were proposed in this pull request?

Seems R's substr API treats Scala substr API as zero based and so subtracts the given starting position by 1.

Because Scala's substr API also accepts zero-based starting position (treated as the first element), so the current R's substr test results are correct as they all use 1 as starting positions.

## How was this patch tested?

Modified tests.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes apache#20464 from viirya/SPARK-23291.
@HyukjinKwon
Copy link
Member Author

cc @cloud-fan, @yanboliang, @felixcheung and @viirya.

@SparkQA
Copy link

SparkQA commented May 6, 2018

Test build #90272 has finished for PR 21250 at commit ffd4c7b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

docs/sparkr.md Outdated
@@ -663,3 +663,7 @@ You can inspect the search path in R with [`search()`](https://stat.ethz.ch/R-ma
- The `stringsAsFactors` parameter was previously ignored with `collect`, for example, in `collect(createDataFrame(iris), stringsAsFactors = TRUE))`. It has been corrected.
- For `summary`, option for statistics to compute has been added. Its output is changed from that from `describe`.
- A warning can be raised if versions of SparkR package and the Spark JVM do not match.

## Upgrading to Spark 2.3.1 and above
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark -> SparkR

docs/sparkr.md Outdated

## Upgrading to Spark 2.3.1 and above

- The `start` parameter of `substr` method was wrongly subtracted by one, previously. In other words, the index specified by `start` parameter was considered as 0-base. This can lead to inconsistent substring results and also does not match with the behaviour with `substr` in R. It has been fixed so the `start` parameter of `substr` method is now 1-base, e.g., therefore to get the same result as `substr(df$a, 2, 5)`, it should be changed to `substr(df$a, 1, 4)`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should mention the version more explicitly, e.g.

In SparkR 2.3.0 and earlier, the `start` parameter ... In version 2.3.1 and later, ... As an example, `substr(lit('abcdef'), 2, 5)` would result to `abc` in SparkR 2.3.0, and in SparkR 2.3.1, the result would be ...

@SparkQA
Copy link

SparkQA commented May 7, 2018

Test build #90310 has finished for PR 21250 at commit dd6c329.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

docs/sparkr.md Outdated

## Upgrading to SparkR 2.3.1 and above

- In SparkR 2.3.0 and earlier, the `start` parameter of `substr` method was wrongly subtracted by one, previously. In other words, the index specified by `start` parameter was considered as 0-base. This can lead to inconsistent substring results and also does not match with the behaviour with `substr` in R. In version 2.3.1 and later, it has been fixed so the `start` parameter of `substr` method is now 1-base. As an example, `substr(lit('abcdef'), 2, 4))` would result to `abc` in SparkR 2.3.0, and the result would be `bcd` in SparkR 2.3.1.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please make sure substr(lit('abcdef'), 2, 4)) is valid in Spark R, I didn't check it with Spark R document when writing it...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked it :)

Copy link
Member Author

@HyukjinKwon HyukjinKwon May 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like .. collect(select(createDataFrame(iris), substr(lit('abcdef'), 2, 4)))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just double checked:

master:

> collect(select(createDataFrame(iris), substr(lit('abcdef'), 2, 4)))
...
1                       bcd
...

2.3.0:

> collect(select(createDataFrame(iris), substr(lit('abcdef'), 2, 4)))
...
1                       abc
...

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

docs/sparkr.md Outdated

## Upgrading to SparkR 2.3.1 and above

- In SparkR 2.3.0 and earlier, the `start` parameter of `substr` method was wrongly subtracted by one, previously. In other words, the index specified by `start` parameter was considered as 0-base. This can lead to inconsistent substring results and also does not match with the behaviour with `substr` in R. In version 2.3.1 and later, it has been fixed so the `start` parameter of `substr` method is now 1-base. As an example, `substr(lit('abcdef'), 2, 4))` would result to `abc` in SparkR 2.3.0, and the result would be `bcd` in SparkR 2.3.1.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a little simplification:

the `start` parameter of `substr` method was wrongly subtracted by one and considered as 0-based. This leads to ...

@cloud-fan
Copy link
Contributor

LGTM

@SparkQA
Copy link

SparkQA commented May 7, 2018

Test build #90325 has finished for PR 21250 at commit a7c8037.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. :)

Copy link
Contributor

@yanboliang yanboliang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

asfgit pushed a commit that referenced this pull request May 7, 2018
…ng position by 1 when calling Scala API

## What changes were proposed in this pull request?

This PR backports 24b5c69 and #21249

There's no conflict but I opened this just to run the test and for sure.

See the discussion in https://issues.apache.org/jira/browse/SPARK-23291

## How was this patch tested?

Jenkins tests.

Author: hyukjinkwon <gurwls223@apache.org>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #21250 from HyukjinKwon/SPARK-23291-backport.
@HyukjinKwon HyukjinKwon closed this May 7, 2018
@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented May 8, 2018

Thanks, @cloud-fan, @yanboliang, @felixcheung, @dongjoon-hyun and @viirya.

@gatorsmile
Copy link
Member

Thank you for writing the migration guide

@HyukjinKwon HyukjinKwon deleted the SPARK-23291-backport branch October 16, 2018 12:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants