
[SPARK-6812][SparkR] filter() on DataFrame does not work as expected. #5938

Closed
wants to merge 1 commit

Conversation

@sun-rui
Contributor

sun-rui commented May 6, 2015

According to the R manual (https://stat.ethz.ch/R-manual/R-devel/library/base/html/Startup.html):
"if a function .First is found on the search path, it is executed as .First(). Finally, function .First.sys() in the base package is run. This calls require to attach the default packages specified by options("defaultPackages")."
In .First() in profile/shell.R, we load the SparkR package. This means SparkR is attached before the default packages, so any function in a default package that shares a name with a SparkR function masks the SparkR version. This is why filter() in SparkR is masked by filter() in stats, which is usually on the default package list.
We need to make sure SparkR is loaded after the default packages. The solution is to append SparkR to the default packages instead of loading SparkR in .First(), as sketched below.
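A minimal sketch of the idea, assuming a startup profile like profile/shell.R (illustrative only, not the exact patch):

    # Before (simplified): .First() attaches SparkR directly, so the default
    # packages -- including stats -- are attached afterwards and mask it.
    # .First <- function() {
    #   library(SparkR)
    # }

    # After (sketch): append SparkR to defaultPackages so that .First.sys()
    # attaches it last, after stats and the other defaults.
    .First <- function() {
      old <- getOption("defaultPackages")
      options(defaultPackages = c(old, "SparkR"))
    }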

BTW, I'd like to discuss our policy on how to resolve name conflicts. Previously, we renamed APIs from the Scala API when there was a name conflict with base or other commonly used packages. However, from a long-term perspective this is not good for API stability, because we can't predict name conflicts: what if a name added to a base package in the future conflicts with an API in SparkR? So the better policy is to keep API names the same as Scala's without worrying about name conflicts. Users should load SparkR as the last package, so that all of its API names take effect, and they can explicitly use :: to refer to names masked by other packages. If we agree on this, I can file a JIRA issue to change back some renamed API methods, for example DataFrame.sortDF().
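For illustration, a hypothetical R session showing how :: disambiguates the two filter() functions (the SparkR calls follow the 1.4-era API; the data and arguments are made up):

    library(SparkR)                       # attached last, so its names win
    df <- createDataFrame(sqlContext, faithful)

    filter(df, df$waiting > 70)           # resolves to SparkR::filter
    SparkR::filter(df, df$waiting > 70)   # always unambiguous
    stats::filter(1:10, rep(1, 3))        # base R's linear filter, explicit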

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 6, 2015

Test build #31967 has started for PR 5938 at commit b569145.

@shivaram
Contributor

shivaram commented May 6, 2015

For some reason Jenkins didn't post the test result. Let's try again.
Jenkins, retest this please

@shivaram
Contributor

shivaram commented May 6, 2015

Thanks @sun-rui -- let's move the naming policy discussion to the JIRA (SPARK-6812)

@shivaram
Contributor

shivaram commented May 6, 2015

@shaneknapp could you check why Jenkins doesn't retest this? Am I not on the whitelist anymore?

@pwendell
Contributor

pwendell commented May 6, 2015

I think the GitHub API is just failing:

 > api_response: {"message": "No server is currently available to service your request. Sorry about that. Please try resubmitting your request and contact us if the problem persists."}

@shaneknapp
Contributor

@shivaram -- you're an admin. looks like what patrick is saying is the cause.

jenkins, test this please

@shivaram
Contributor

shivaram commented May 6, 2015

jenkins, retest this please

(third time lucky?)

@shaneknapp
Contributor

jenkins, test this please

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 6, 2015

Test build #32024 has started for PR 5938 at commit b569145.

@SparkQA

SparkQA commented May 6, 2015

Test build #32024 has finished for PR 5938 at commit b569145.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32024/
Test FAILed.

@shivaram
Contributor

shivaram commented May 6, 2015

Tests are failing because of a Hive test issue. I'll wait till that gets resolved before re-testing. I tried the change locally and it works fine. LGTM

@shivaram
Contributor

shivaram commented May 6, 2015

Jenkins, retest this please

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 6, 2015

Test build #32035 has started for PR 5938 at commit b569145.

@SparkQA

SparkQA commented May 6, 2015

Test build #32035 has finished for PR 5938 at commit b569145.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32035/
Test FAILed.

@shivaram
Contributor

shivaram commented May 6, 2015

Sigh - random test failures.

Jenkins, retest this please

@shivaram
Contributor

shivaram commented May 6, 2015

Jenkins, retest this please

cc @shaneknapp

@shaneknapp
Contributor

sigh.

@shaneknapp
Contributor

jenkins, test this please

@shaneknapp
Contributor

(i think we're due a jenkins restart tomorrow morning)

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@shaneknapp
Contributor

@shivaram -- i seriously have no clue as to what's going on... you're an overall GHPRB admin for amplab jenkins, as well as for the spark PRB (as am i). anyways, after the spate of errors over the last week (post-power-outage), i'll schedule an emergency restart for tomorrow morning.

@shaneknapp
Contributor

@shivaram -- can you try and trigger another build? i've increased the log chattiness and hopefully i can see what's happening now.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32054/
Test FAILed.

@shivaram
Contributor

shivaram commented May 7, 2015

Jenkins, retest this please

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 7, 2015

Test build #32061 has started for PR 5938 at commit b569145.

@shaneknapp
Contributor

well, it works now... i didn't actually touch a thing! :\

@SparkQA

SparkQA commented May 7, 2015

Test build #32061 has finished for PR 5938 at commit b569145.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32061/
Test PASSed.

@shivaram
Contributor

shivaram commented May 7, 2015

Thanks @sun-rui - Merging this.

asfgit pushed a commit that referenced this pull request May 7, 2015
Author: Sun Rui <rui.sun@intel.com>

Closes #5938 from sun-rui/SPARK-6812 and squashes the following commits:

b569145 [Sun Rui] [SPARK-6812][SparkR] filter() on DataFrame does not work as expected.

(cherry picked from commit 9cfa9a5)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
@asfgit asfgit closed this in 9cfa9a5 May 7, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015