
Commit

Merge branch 'master' of https://github.com/apache/spark into SPARK-18408-yunn-api-improvements
Yunni committed Nov 28, 2016
2 parents f0ebcb7 + 9c03c56 commit e198080
Showing 351 changed files with 5,690 additions and 3,080 deletions.
2 changes: 1 addition & 1 deletion .github/PULL_REQUEST_TEMPLATE
@@ -7,4 +7,4 @@
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.
Please review http://spark.apache.org/contributing.html before opening a pull request.
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -1,12 +1,12 @@
## Contributing to Spark

*Before opening a pull request*, review the
[Contributing to Spark wiki](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark).
[Contributing to Spark guide](http://spark.apache.org/contributing.html).
It lists steps that are required before creating a PR. In particular, consider:

- Is the change important and ready enough to ask the community to spend time reviewing?
- Have you searched for existing, related JIRAs and pull requests?
- Is this a new feature that can stand alone as a [third party project](https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects) ?
- Is this a new feature that can stand alone as a [third party project](http://spark.apache.org/third-party-projects.html) ?
- Is the change being proposed clearly explained and motivated?

When you contribute code, you affirm that the contribution is your original work and that you
2 changes: 1 addition & 1 deletion R/README.md
@@ -51,7 +51,7 @@ sparkR.session()

#### Making changes to SparkR

The [instructions](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) for making contributions to Spark also apply to SparkR.
The [instructions](http://spark.apache.org/contributing.html) for making contributions to Spark also apply to SparkR.
If you only make R file changes (i.e. no Scala changes) then you can just re-install the R package using `R/install-dev.sh` and test your changes.
Once you have made your changes, please include unit tests for them and run existing unit tests using the `R/run-tests.sh` script as described below.

2 changes: 1 addition & 1 deletion R/pkg/DESCRIPTION
@@ -11,7 +11,7 @@ Authors@R: c(person("Shivaram", "Venkataraman", role = c("aut", "cre"),
email = "felixcheung@apache.org"),
person(family = "The Apache Software Foundation", role = c("aut", "cph")))
URL: http://www.apache.org/ http://spark.apache.org/
BugReports: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingBugReports
BugReports: http://spark.apache.org/contributing.html
Depends:
R (>= 3.0),
methods
6 changes: 4 additions & 2 deletions R/pkg/R/DataFrame.R
@@ -2541,7 +2541,8 @@ generateAliasesForIntersectedCols <- function (x, intersectedColNames, suffix) {
#'
#' Return a new SparkDataFrame containing the union of rows in this SparkDataFrame
#' and another SparkDataFrame. This is equivalent to \code{UNION ALL} in SQL.
#' Note that this does not remove duplicate rows across the two SparkDataFrames.
#'
#' Note: This does not remove duplicate rows across the two SparkDataFrames.
#'
#' @param x A SparkDataFrame
#' @param y A SparkDataFrame
@@ -2584,7 +2585,8 @@ setMethod("unionAll",
#' Union two or more SparkDataFrames
#'
#' Union two or more SparkDataFrames. This is equivalent to \code{UNION ALL} in SQL.
#' Note that this does not remove duplicate rows across the two SparkDataFrames.
#'
#' Note: This does not remove duplicate rows across the two SparkDataFrames.
#'
#' @param x a SparkDataFrame.
#' @param ... additional SparkDataFrame(s).
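The union docs above describe UNION ALL semantics. A minimal SparkR sketch of that behavior (the data frames and column names are made up for illustration): duplicate rows across the inputs are kept, and an explicit dropDuplicates call is needed to get SQL UNION semantics.

```r
library(SparkR)
sparkR.session()

df1 <- createDataFrame(data.frame(id = c(1, 2), name = c("a", "b")))
df2 <- createDataFrame(data.frame(id = c(2, 3), name = c("b", "c")))

# UNION ALL semantics: the duplicate row (2, "b") is kept
count(union(df1, df2))                  # 4

# Emulating SQL UNION requires dropping duplicates explicitly
count(dropDuplicates(union(df1, df2)))  # 3
```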
7 changes: 4 additions & 3 deletions R/pkg/R/functions.R
@@ -2296,7 +2296,7 @@ setMethod("n", signature(x = "Column"),
#' A pattern could be for instance \preformatted{dd.MM.yyyy} and could return a string like '18.03.1993'. All
#' pattern letters of \code{java.text.SimpleDateFormat} can be used.
#'
#' NOTE: Use when ever possible specialized functions like \code{year}. These benefit from a
#' Note: Use when ever possible specialized functions like \code{year}. These benefit from a
#' specialized implementation.
#'
#' @param y Column to compute on.
@@ -2341,7 +2341,7 @@ setMethod("from_utc_timestamp", signature(y = "Column", x = "character"),
#' Locate the position of the first occurrence of substr column in the given string.
#' Returns null if either of the arguments are null.
#'
#' NOTE: The position is not zero based, but 1 based index. Returns 0 if substr
#' Note: The position is not zero based, but 1 based index. Returns 0 if substr
#' could not be found in str.
#'
#' @param y column to check
@@ -2779,7 +2779,8 @@ setMethod("window", signature(x = "Column"),
#' locate
#'
#' Locate the position of the first occurrence of substr.
#' NOTE: The position is not zero based, but 1 based index. Returns 0 if substr
#'
#' Note: The position is not zero based, but 1 based index. Returns 0 if substr
#' could not be found in str.
#'
#' @param substr a character string to be matched.
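A small sketch of the behaviors these notes document, assuming a toy data frame (none of this is part of the diff): instr and locate return 1-based positions and 0 when the substring is missing, and the specialized year function is preferred over date_format for extracting a year.

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(s = "spark", d = "2016-11-28"))
df <- withColumn(df, "d", cast(df$d, "date"))

head(select(df,
  instr(df$s, "ar"),               # 3  -- positions are 1-based
  instr(df$s, "zzz"),              # 0  -- substring not found
  locate("ar", df$s),              # 3
  date_format(df$d, "dd.MM.yyyy"), # "28.11.2016"
  year(df$d)))                     # 2016 -- specialized, preferred
```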
21 changes: 16 additions & 5 deletions R/pkg/R/mllib.R
@@ -278,8 +278,10 @@ setMethod("glm", signature(formula = "formula", family = "ANY", data = "SparkDat

#' @param object a fitted generalized linear model.
#' @return \code{summary} returns a summary object of the fitted model, a list of components
#' including at least the coefficients, null/residual deviance, null/residual degrees
#' of freedom, AIC and number of iterations IRLS takes.
#' including at least the coefficients matrix (which includes coefficients, standard error
#' of coefficients, t value and p value), null/residual deviance, null/residual degrees of
#' freedom, AIC and number of iterations IRLS takes. If there are collinear columns
#' in your data, the coefficients matrix only provides coefficients.
#'
#' @rdname spark.glm
#' @export
@@ -303,9 +305,18 @@ setMethod("summary", signature(object = "GeneralizedLinearRegressionModel"),
} else {
dataFrame(callJMethod(jobj, "rDevianceResiduals"))
}
coefficients <- matrix(coefficients, ncol = 4)
colnames(coefficients) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
rownames(coefficients) <- unlist(features)
# If the underlying WeightedLeastSquares using "normal" solver, we can provide
# coefficients, standard error of coefficients, t value and p value. Otherwise,
# it will be fitted by local "l-bfgs", we can only provide coefficients.
if (length(features) == length(coefficients)) {
coefficients <- matrix(coefficients, ncol = 1)
colnames(coefficients) <- c("Estimate")
rownames(coefficients) <- unlist(features)
} else {
coefficients <- matrix(coefficients, ncol = 4)
colnames(coefficients) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
rownames(coefficients) <- unlist(features)
}
ans <- list(deviance.resid = deviance.resid, coefficients = coefficients,
dispersion = dispersion, null.deviance = null.deviance,
deviance = deviance, df.null = df.null, df.residual = df.residual,
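As a hedged illustration of the new branch in summary(): with collinear predictors the model is fitted by the local "l-bfgs" solver, so the coefficients matrix carries only the Estimate column; with the "normal" solver it also carries standard errors, t values and p values. The toy data mirrors the test added further down.

```r
library(SparkR)
sparkR.session()

# Collinear predictors: the second column is exactly twice the first
A <- matrix(c(1, 2, 3, 4, 2, 4, 6, 8), 4, 2)
b <- c(1, 2, 3, 4)
df <- createDataFrame(as.data.frame(cbind(A, b)))

stats <- summary(spark.glm(df, b ~ . - 1))
colnames(stats$coefficients)  # "Estimate" only in the collinear case
```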
20 changes: 14 additions & 6 deletions R/pkg/R/sparkR.R
@@ -373,8 +373,13 @@ sparkR.session <- function(
overrideEnvs(sparkConfigMap, paramMap)
}

deployMode <- ""
if (exists("spark.submit.deployMode", envir = sparkConfigMap)) {
deployMode <- sparkConfigMap[["spark.submit.deployMode"]]
}

if (!exists(".sparkRjsc", envir = .sparkREnv)) {
retHome <- sparkCheckInstall(sparkHome, master)
retHome <- sparkCheckInstall(sparkHome, master, deployMode)
if (!is.null(retHome)) sparkHome <- retHome
sparkExecutorEnvMap <- new.env()
sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, sparkExecutorEnvMap,
@@ -550,24 +555,27 @@ processSparkPackages <- function(packages) {
#
# @param sparkHome directory to find Spark package.
# @param master the Spark master URL, used to check local or remote mode.
# @param deployMode whether to deploy your driver on the worker nodes (cluster)
# or locally as an external client (client).
# @return NULL if no need to update sparkHome, and new sparkHome otherwise.
sparkCheckInstall <- function(sparkHome, master) {
sparkCheckInstall <- function(sparkHome, master, deployMode) {
if (!isSparkRShell()) {
if (!is.na(file.info(sparkHome)$isdir)) {
msg <- paste0("Spark package found in SPARK_HOME: ", sparkHome)
message(msg)
NULL
} else {
if (!nzchar(master) || isMasterLocal(master)) {
msg <- paste0("Spark not found in SPARK_HOME: ",
sparkHome)
if (isMasterLocal(master)) {
msg <- paste0("Spark not found in SPARK_HOME: ", sparkHome)
message(msg)
packageLocalDir <- install.spark()
packageLocalDir
} else {
} else if (isClientMode(master) || deployMode == "client") {
msg <- paste0("Spark not found in SPARK_HOME: ",
sparkHome, "\n", installInstruction("remote"))
stop(msg)
} else {
NULL
}
}
} else {
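From the user's side, the deployMode plumbing above matters when SPARK_HOME does not contain a Spark installation. A sketch under that assumption (spark.submit.deployMode is the standard Spark config key; the master URL is illustrative): in client deploy mode against a remote master, sparkCheckInstall now stops with install instructions instead of attempting a local install, while cluster deploy mode is left alone because the driver runs on the cluster.

```r
library(SparkR)

# Remote master, client deploy mode: fails fast with install instructions
# if SPARK_HOME is not a valid Spark installation.
sparkR.session(
  master = "yarn",
  sparkConfig = list(spark.submit.deployMode = "client"))
```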
4 changes: 4 additions & 0 deletions R/pkg/R/utils.R
@@ -777,6 +777,10 @@ isMasterLocal <- function(master) {
grepl("^local(\\[([0-9]+|\\*)\\])?$", master, perl = TRUE)
}

isClientMode <- function(master) {
grepl("([a-z]+)-client$", master, perl = TRUE)
}

isSparkRShell <- function() {
grepl(".*shell\\.R$", Sys.getenv("R_PROFILE_USER"), perl = TRUE)
}
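The new isClientMode helper just pattern-matches the master URL; roughly how it classifies a few illustrative values:

```r
isClientMode("yarn-client")        # TRUE
isClientMode("mesos-client")       # TRUE
isClientMode("yarn-cluster")       # FALSE
isClientMode("local[2]")           # FALSE
isClientMode("spark://host:7077")  # FALSE
```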
9 changes: 9 additions & 0 deletions R/pkg/inst/tests/testthat/test_mllib.R
@@ -169,6 +169,15 @@ test_that("spark.glm summary", {
df <- suppressWarnings(createDataFrame(data))
regStats <- summary(spark.glm(df, b ~ a1 + a2, regParam = 1.0))
expect_equal(regStats$aic, 14.00976, tolerance = 1e-4) # 14.00976 is from summary() result

# Test spark.glm works on collinear data
A <- matrix(c(1, 2, 3, 4, 2, 4, 6, 8), 4, 2)
b <- c(1, 2, 3, 4)
data <- as.data.frame(cbind(A, b))
df <- createDataFrame(data)
stats <- summary(spark.glm(df, b ~ . - 1))
coefs <- unlist(stats$coefficients)
expect_true(all(abs(c(0.5, 0.25) - coefs) < 1e-4))
})

test_that("spark.glm save/load", {
46 changes: 46 additions & 0 deletions R/pkg/inst/tests/testthat/test_sparkR.R
@@ -0,0 +1,46 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

context("functions in sparkR.R")

test_that("sparkCheckInstall", {
# "local, yarn-client, mesos-client" mode, SPARK_HOME was set correctly,
# and the SparkR job was submitted by "spark-submit"
sparkHome <- paste0(tempdir(), "/", "sparkHome")
dir.create(sparkHome)
master <- ""
deployMode <- ""
expect_true(is.null(sparkCheckInstall(sparkHome, master, deployMode)))
unlink(sparkHome, recursive = TRUE)

# "yarn-cluster, mesos-cluster" mode, SPARK_HOME was not set,
# and the SparkR job was submitted by "spark-submit"
sparkHome <- ""
master <- ""
deployMode <- ""
expect_true(is.null(sparkCheckInstall(sparkHome, master, deployMode)))

# "yarn-client, mesos-client" mode, SPARK_HOME was not set
sparkHome <- ""
master <- "yarn-client"
deployMode <- ""
expect_error(sparkCheckInstall(sparkHome, master, deployMode))
sparkHome <- ""
master <- ""
deployMode <- "client"
expect_error(sparkCheckInstall(sparkHome, master, deployMode))
})
2 changes: 1 addition & 1 deletion R/pkg/inst/tests/testthat/test_sparkSQL.R
@@ -2684,7 +2684,7 @@ test_that("Call DataFrameWriter.load() API in Java without path and check argume
# It makes sure that we can omit path argument in read.df API and then it calls
# DataFrameWriter.load() without path.
expect_error(read.df(source = "json"),
paste("Error in loadDF : analysis error - Unable to infer schema for JSON at .",
paste("Error in loadDF : analysis error - Unable to infer schema for JSON.",
"It must be specified manually"))
expect_error(read.df("arbitrary_path"), "Error in loadDF : analysis error - Path does not exist")
expect_error(read.json("arbitrary_path"), "Error in json : analysis error - Path does not exist")
11 changes: 6 additions & 5 deletions README.md
@@ -29,8 +29,9 @@ To build Spark and its example programs, run:
You can build Spark using more than one thread by using the -T option with Maven, see ["Parallel builds in Maven 3"](https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3).
More detailed documentation is available from the project site, at
["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html).
For developing Spark using an IDE, see [Eclipse](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse)
and [IntelliJ](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ).

For general development tips, including info on developing Spark using an IDE, see
[the Useful Developer Tools page](http://spark.apache.org/developer-tools.html).

## Interactive Scala Shell

@@ -80,7 +81,7 @@ can be run using:
./dev/run-tests

Please see the guidance on how to
[run tests for a module, or individual tests](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools).
[run tests for a module, or individual tests](http://spark.apache.org/developer-tools.html#individual-tests).

## A Note About Hadoop Versions

@@ -100,5 +101,5 @@ in the online documentation for an overview on how to configure Spark.

## Contributing

Please review the [Contribution to Spark](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark)
wiki for information on how to get started contributing to the project.
Please review the [Contribution to Spark guide](http://spark.apache.org/contributing.html)
for information on how to get started contributing to the project.
core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java
@@ -378,14 +378,14 @@ public long cleanUpAllAllocatedMemory() {
for (MemoryConsumer c: consumers) {
if (c != null && c.getUsed() > 0) {
// In case of failed task, it's normal to see leaked memory
logger.warn("leak " + Utils.bytesToString(c.getUsed()) + " memory from " + c);
logger.debug("unreleased " + Utils.bytesToString(c.getUsed()) + " memory from " + c);
}
}
consumers.clear();

for (MemoryBlock page : pageTable) {
if (page != null) {
logger.warn("leak a page: " + page + " in task " + taskAttemptId);
logger.debug("unreleased page: " + page + " in task " + taskAttemptId);
memoryManager.tungstenMemoryAllocator().free(page);
}
}
