Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-13010] [ML] [SparkR] Implement a simple wrapper of AFTSurvivalRegression in SparkR #11447

Closed
wants to merge 6 commits into from

Conversation

yanboliang
Copy link
Contributor

What changes were proposed in this pull request?

Implement a simple wrapper of AFTSurvivalRegression named survreg in SparkR.
cc @mengxr

How was this patch tested?

unit test.
Output of R:

> library(survival)
> model <- survreg(Surv(futime, fustat) ~ ecog.ps + rx, ovarian, dist="weibull")
> summary(model)
Call:
survreg(formula = Surv(futime, fustat) ~ ecog.ps + rx, data = ovarian, 
    dist = "weibull")
             Value Std. Error      z        p
(Intercept)  6.897      1.178  5.857 4.72e-09
ecog.ps     -0.385      0.527 -0.731 4.65e-01
rx           0.529      0.529  0.999 3.18e-01
Log(scale)  -0.123      0.252 -0.489 6.25e-01

Scale= 0.884 

Output of SparkR:

> df <- createDataFrame(sqlContext, ovarian)
> model <- survreg(Surv(futime, fustat) ~ ecog_ps + rx, df)
> summary(model)
$coefficients
                 Value
(Intercept)  6.8966932
ecog_ps     -0.3850426
rx           0.5286455
Log(scale)  -0.1234418

#' model <- survreg(Surv(futime, fustat) ~ ecog_ps + rx, df)
#' summary(model)
#'}
setMethod("survreg", signature(formula = "formula", data = "DataFrame"),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only support "weibull" distribution in AFTSurvivalRegression currently, so we don't need arguments dist like R's survreg until we supporting more distributions.

@SparkQA
Copy link

SparkQA commented Mar 1, 2016

Test build #52246 has finished for PR 11447 at commit 7849fd0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Mar 1, 2016

@hhbyyh Could you help review this PR?

@SparkQA
Copy link

SparkQA commented Mar 2, 2016

Test build #52314 has finished for PR 11447 at commit e149bd8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh
Copy link
Contributor

hhbyyh commented Mar 4, 2016

@yanboliang I tried your implementation with

library(survival)
 data(ovarian)
 df <- createDataFrame(sqlContext, ovarian)
 model <- survreg(Surv(futime, fustat) ~ ecog_ps + rx, df)

and met the error

Error in as.data.frame.default(data, optional = TRUE) : 
  cannot coerce class "structure("DataFrame", package = "SparkR")" to a data.frame

I can run the code in ut though

  data <- list(list(4,1,0,0), list(3,1,2,0), list(1,1,1,0),
          list(1,0,1,0), list(2,1,1,1), list(2,1,0,1), list(3,0,0,1))
  df <- createDataFrame(sqlContext, data, c("time", "status", "x", "sex"))
  model <- survreg(Surv(time, status) ~ x + sex, df)

Is there something I'm missing. Thanks.

@yanboliang
Copy link
Contributor Author

@hhbyyh Thanks for your reviewing.
I have also encountered this issue, because our survreg was masked by survreg in survival package under that condition. There are two solutions can make it work well:

library(survival)
data(ovarian)
df <- createDataFrame(sqlContext, ovarian)
library(SparkR)
model <- survreg(Surv(futime, fustat) ~ ecog_ps + rx, df)

or

library(survival)
data(ovarian)
df <- createDataFrame(sqlContext, ovarian)
model <- SparkR::survreg(Surv(futime, fustat) ~ ecog_ps + rx, df)

We will recommend the former one.
You will not load library survival if you use your own dataset, so it will not be masked in general case. But I think we should clarify this in API doc.

@hhbyyh
Copy link
Contributor

hhbyyh commented Mar 4, 2016

Thanks. I'll make a pass more closely tomorrow.

@mengxr
Copy link
Contributor

mengxr commented Mar 4, 2016

@yanboliang The glm interface won't shadow stat::glm. Could you check the implementation there? This might be relevant: #9582.

@felixcheung
Copy link
Member

Please check out this test - you could library(survival) and then execute what is in this test to see if it will conflict/mask it

var censorCol: String = null

val regex = "^Surv\\(([^,]+),([^,]+)\\)\\s*\\~\\s*(.+)".r
try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

met MatchError during test with " Surv ( futime, fustat ) ~ ecog.ps + tx - rx". Maybe we should write the regex to the comment.

@hhbyyh
Copy link
Contributor

hhbyyh commented Mar 5, 2016

only some minor comments.

@Since("2.0.0")
class AFTSurvivalRegressionSummary private[regression] (
@Since("2.0.0") @transient val predictions: DataFrame,
@Since("2.0.0") val predictionCol: String,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please help me understand where is the usage of predictionCol, labelCol featuresCol?

@SparkQA
Copy link

SparkQA commented Mar 17, 2016

Test build #53431 has finished for PR 11447 at commit 7aabd2e.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 17, 2016

Test build #53433 has finished for PR 11447 at commit ba01668.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 17, 2016

Test build #53434 has finished for PR 11447 at commit 267e75d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor Author

@hhbyyh @mengxr The PR is ready for another pass.
survival::survreg and SparkR::survreg share the same first argument type (formula). So I don't think we can avoid shadowing the method. This is similar with naiveBayes in #11486 . It make sense that some functions from the package loaded first are masked by those in the package loaded after, if they are not functions of base or stats package. I have test case to ensure that we can use the prefix to call both functions.

@SparkQA
Copy link

SparkQA commented Mar 18, 2016

Test build #53528 has finished for PR 11447 at commit 2a9893b.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 18, 2016

Test build #53530 has finished for PR 11447 at commit 900c85f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

rCoefs <- as.vector(coef(rModel))
rScale <- rModel$scale

expect_true(all(abs(rCoefs - coefs) < 1e-4))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expect_equal(coefs, rCoefs, tolerance = 1e-4), which generates better error messages

@mengxr
Copy link
Contributor

mengxr commented Mar 22, 2016

@yanboliang I did some factoring of naiveBayes implementation in #11890. I feel it makes the code less coupled if we have separate wrappers for each algorithm we port to SparkR instead of put everything under SparkRWrappers. If we store the training DataFrame in the survival analysis wrapper, I think we don't need to create AFTSurvivalRegressionSummary class.

@yanboliang
Copy link
Contributor Author

@mengxr OK, I will update this PR following #11890 . Thanks!

@yanboliang
Copy link
Contributor Author

Let's move to #11932

@yanboliang yanboliang closed this Mar 24, 2016
@yanboliang yanboliang deleted the spark-13010 branch March 24, 2016 10:41
ghost pushed a commit to dbtsai/spark that referenced this pull request Mar 25, 2016
…gression in SparkR

## What changes were proposed in this pull request?
This PR continues the work in apache#11447, we implemented the wrapper of ```AFTSurvivalRegression``` named ```survreg``` in SparkR.

## How was this patch tested?
Test against output from R package survival's survreg.

cc mengxr felixcheung

Close apache#11447

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#11932 from yanboliang/spark-13010-new.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants