Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-8518] [ML] Log-linear models for survival analysis #8611

Closed
wants to merge 19 commits into from

Conversation

yanboliang
Copy link
Contributor

Accelerated Failure Time (AFT) model is the most commonly used and easy to parallel method of survival analysis for censored survival data. It is the log-linear model based on the Weibull distribution of the survival time.
Users can refer to the R function survreg to compare the model and predict to compare the prediction. There are different kinds of model prediction, I have just select the type response which is default used for R.

@yanboliang yanboliang changed the title [SPARK-8518] [ML] Log-linear models for survival analysis [WIP] [SPARK-8518] [ML] Log-linear models for survival analysis Sep 5, 2015
@SparkQA
Copy link

SparkQA commented Sep 5, 2015

Test build #42039 has finished for PR 8611 at commit 7f2a31d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class AFTRegression(override val uid: String)

@SparkQA
Copy link

SparkQA commented Sep 6, 2015

Test build #42073 has finished for PR 8611 at commit d0a29bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class AFTRegression(override val uid: String)

@SparkQA
Copy link

SparkQA commented Sep 7, 2015

Test build #42091 has finished for PR 8611 at commit c30a1c2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class AFTRegression(override val uid: String)

@SparkQA
Copy link

SparkQA commented Sep 7, 2015

Test build #42092 has finished for PR 8611 at commit 4223793.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class AFTRegression(override val uid: String)

@rotationsymmetry
Copy link
Contributor

@yanboliang

As sigma is restricted to be positive, in the L-BLGS algorithm, shall we optimize over log(sigma)? This way, there is no more restriction on the possible value of sigma and usual help with convergence.

@yanboliang
Copy link
Contributor Author

@rotationsymmetry I agree with you, updated PR. Thanks for your comments.

@SparkQA
Copy link

SparkQA commented Sep 8, 2015

Test build #42128 has finished for PR 8611 at commit 28ca4c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class AFTRegression(override val uid: String)

asfgit pushed a commit that referenced this pull request Sep 9, 2015
Add WeibullGenerator for RandomDataGenerator.
#8611 need use WeibullGenerator to generate random data based on Weibull distribution.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8622 from yanboliang/spark-10464.
@SparkQA
Copy link

SparkQA commented Sep 9, 2015

Test build #42200 has finished for PR 8611 at commit b7dd012.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang yanboliang changed the title [WIP] [SPARK-8518] [ML] Log-linear models for survival analysis [SPARK-8518] [ML] Log-linear models for survival analysis Sep 9, 2015
@mengxr
Copy link
Contributor

mengxr commented Sep 9, 2015

@rotationsymmetry Do you have time to make a pass?

@rotationsymmetry
Copy link
Contributor

@mengxr Yes, I have done a quick run and added some comments the past weekend. I will do a more thorough pass today/tomorrow.


def this() = this(Identifiable.randomUID("aftReg"))

/** @group setParam */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please provide documentation for public API. Similar for other getters and setters if not inherited.

@mengxr
Copy link
Contributor

mengxr commented Sep 15, 2015

I made a brief scan over the code. @rotationsymmetry Could you make another pass?

* L(\beta,\sigma)=\prod_{i=1}^n[\frac{1}{\sigma}f_{0}
* (\frac{\log{t_{i}}-x^{'}\beta}{\sigma})]^{\delta_{i}}S_{0}
* (\frac{\log{t_{i}}-x^{'}\beta}{\sigma})^{1-\delta_{i}}
* }}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delta in the formula is the event indicator. Since we decide to use the censored indicator, I recommend we clarify this in the formula.

@rotationsymmetry
Copy link
Contributor

@yanboliang @mengxr I just made another round. Thanks!

@yanboliang
Copy link
Contributor Author

@mengxr @rotationsymmetry Thanks for your comments. I have updated the PR, Would you mind to make another pass?

@SparkQA
Copy link

SparkQA commented Sep 17, 2015

Test build #42603 has finished for PR 8611 at commit dc1d59f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class AFTSurvivalRegression @Since("1.6.0") (@Since("1.6.0") override val uid: String)

/**
* Param for censored column name.
* The value of this column could be 0 or 1.
* If the value is 1, it means the event has occurred i.e. uncensored; otherwise it censored.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is confusing. The column name is called censored but value 1 means uncensored. I searched for some survival analysis datasets, where "censor" is commonly used as the column name. See https://www.umass.edu/statdata/statdata/stat-survival.html. However, in those datasets, 1 actually means occurred, i.e., uncensored. So my recommendation would be renaming the param name back to censorCol and keep the current doc. That is 1 for uncensored and 0 for censored. And note that this is to be consistent with existing datasets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it censored -> censored

@mengxr
Copy link
Contributor

mengxr commented Sep 17, 2015

@yanboliang Let's rename censoredCol back to cencorCol and fix small issues I mentioned inline. Then this PR should be good to go. We can make another PR to add quantileCol.

* @param features List of features for this data point.
* @param label Label for this data point.
* @param censored Indicator of the event has occurred or not. If the value is 1, it means
* the event has occurred i.e. uncensored; otherwise it censored.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove it

@SparkQA
Copy link

SparkQA commented Sep 18, 2015

Test build #42637 has finished for PR 8611 at commit aa37878.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class AFTSurvivalRegression @Since("1.6.0") (@Since("1.6.0") override val uid: String)
    • require(censor == 1.0 || censor == 0.0, "censor of class AFTPoint must be 1.0 or 0.0")

@mengxr
Copy link
Contributor

mengxr commented Sep 18, 2015

LGTM. Merged into master. Thanks! I created https://issues.apache.org/jira/browse/SPARK-10686, SPARK-10687, SPARK-10688, SPARK-10689 for follow-up work.

@asfgit asfgit closed this in 98f1ea6 Sep 18, 2015
@yanboliang yanboliang deleted the spark-8518 branch September 18, 2015 07:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants