[SPARK-8518] [ML] Log-linear models for survival analysis #8611
Test build #42039 has finished for PR 8611 at commit
Test build #42073 has finished for PR 8611 at commit
Test build #42091 has finished for PR 8611 at commit
Test build #42092 has finished for PR 8611 at commit
Since sigma is restricted to be positive, shall we optimize over log(sigma) in the L-BFGS algorithm? That way the optimized variable is unconstrained, which usually helps with convergence.
@rotationsymmetry I agree with you and have updated the PR. Thanks for your comments.
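The reparameterization discussed above can be sketched as follows. This is a minimal standalone illustration, not the PR's code: the object and function names are hypothetical, the objective is a toy function with its minimum at sigma = 2, and plain gradient descent stands in for L-BFGS. The optimizer only ever sees the unconstrained variable `logSigma`, and the chain rule converts a gradient with respect to sigma into one with respect to log(sigma).

```scala
object LogSigmaSketch {
  // Toy objective, defined only for sigma > 0, minimized at sigma = 2.
  def loss(sigma: Double): Double = math.log(sigma) + 2.0 / sigma

  // Derivative of the toy objective with respect to sigma.
  def gradSigma(sigma: Double): Double = 1.0 / sigma - 2.0 / (sigma * sigma)

  // Chain rule: d loss / d log(sigma) = sigma * d loss / d sigma.
  def gradLogSigma(logSigma: Double): Double = {
    val sigma = math.exp(logSigma)
    sigma * gradSigma(sigma)
  }

  // Unconstrained gradient descent over log(sigma); exp() maps the result
  // back to a strictly positive sigma, so no bound handling is needed.
  def minimize(startLogSigma: Double, step: Double, iters: Int): Double = {
    var logSigma = startLogSigma
    var i = 0
    while (i < iters) {
      logSigma -= step * gradLogSigma(logSigma)
      i += 1
    }
    math.exp(logSigma)
  }
}
```

Whatever value the optimizer proposes for `logSigma`, the recovered sigma is positive, so the positivity constraint never has to be enforced explicitly.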
Test build #42128 has finished for PR 8611 at commit
Add WeibullGenerator for RandomDataGenerator. #8611 needs WeibullGenerator to generate random data from the Weibull distribution. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8622 from yanboliang/spark-10464.
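For readers unfamiliar with Weibull sampling, here is a small standalone sketch using inverse-CDF sampling. It is an assumption-laden illustration and not the generator added in #8622: the object name, method names, and the use of `java.util.Random` are all hypothetical choices made for this example.

```scala
import java.util.Random

object WeibullSketch {
  // Inverse-CDF sampling: if U ~ Uniform(0, 1), then
  // scale * (-ln(1 - U))^(1 / shape) follows Weibull(shape, scale).
  def sample(shape: Double, scale: Double, rng: Random): Double = {
    require(shape > 0 && scale > 0, "shape and scale must be positive")
    val u = rng.nextDouble() // in [0, 1), so 1 - u lies in (0, 1]
    scale * math.pow(-math.log(1.0 - u), 1.0 / shape)
  }

  // Draw n values with a fixed seed so that runs are reproducible.
  def samples(n: Int, shape: Double, scale: Double, seed: Long): Array[Double] = {
    val rng = new Random(seed)
    Array.fill(n)(sample(shape, scale, rng))
  }
}
```

With shape = 1 the Weibull distribution reduces to an exponential distribution with mean equal to the scale, which gives a quick sanity check on the sampler.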
Test build #42200 has finished for PR 8611 at commit
@rotationsymmetry Do you have time to make a pass?
@mengxr Yes, I did a quick pass and added some comments over the past weekend. I will do a more thorough pass today or tomorrow.
def this() = this(Identifiable.randomUID("aftReg"))

/** @group setParam */
Please provide documentation for public API. Similar for other getters and setters if not inherited.
I made a brief scan over the code. @rotationsymmetry Could you make another pass?
 * L(\beta,\sigma)=\prod_{i=1}^n[\frac{1}{\sigma}f_{0}
 * (\frac{\log{t_{i}}-x^{'}\beta}{\sigma})]^{\delta_{i}}S_{0}
 * (\frac{\log{t_{i}}-x^{'}\beta}{\sigma})^{1-\delta_{i}}
 * }}}
The delta in the formula is the event indicator. Since we decided to use the censored indicator, I recommend clarifying this in the formula.
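To make the role of the indicator explicit, the likelihood can be written with $\delta_i$ defined alongside it. This is a standard restatement for the Weibull AFT case, not text from the PR; $\epsilon_i$ is shorthand introduced here, and the subscript on $x_i$ is added for clarity.

```latex
\epsilon_i = \frac{\log t_i - x_i^{'}\beta}{\sigma}, \qquad
\delta_i =
\begin{cases}
  1 & \text{event occurred (uncensored)} \\
  0 & \text{censored}
\end{cases}

\log L(\beta,\sigma) = \sum_{i=1}^{n}
  \left[ \delta_i \left( \log f_0(\epsilon_i) - \log\sigma \right)
       + (1-\delta_i) \log S_0(\epsilon_i) \right]

% Weibull lifetimes: S_0(\epsilon) = \exp(-e^{\epsilon}),
%                    f_0(\epsilon) = e^{\epsilon} \exp(-e^{\epsilon})
-\log L(\beta,\sigma) = \sum_{i=1}^{n}
  \left[ \delta_i \log\sigma - \delta_i \epsilon_i + e^{\epsilon_i} \right]
```

An uncensored point contributes the density term and a censored point contributes only the survival term, which is why the meaning of the 0/1 value has to be stated unambiguously.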
@yanboliang @mengxr I just made another round. Thanks!
@mengxr @rotationsymmetry Thanks for your comments. I have updated the PR. Would you mind making another pass?
Test build #42603 has finished for PR 8611 at commit
/**
 * Param for censored column name.
 * The value of this column could be 0 or 1.
 * If the value is 1, it means the event has occurred i.e. uncensored; otherwise it censored.
This is confusing. The column name is called `censored` but value `1` means uncensored. I searched for some survival analysis datasets, where "censor" is commonly used as the column name. See https://www.umass.edu/statdata/statdata/stat-survival.html. However, in those datasets, `1` actually means occurred, i.e., uncensored. So my recommendation would be to rename the param back to `censorCol` and keep the current doc, that is, `1` for uncensored and `0` for censored, and to note that this is to be consistent with existing datasets.
`it censored` -> `censored`
@yanboliang Let's rename
 * @param features List of features for this data point.
 * @param label Label for this data point.
 * @param censored Indicator of the event has occurred or not. If the value is 1, it means
 *                 the event has occurred i.e. uncensored; otherwise it censored.
remove it
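The 0/1 convention under review can be illustrated with a small standalone sketch. The `Point` class below is a hypothetical stand-in for the PR's data point type, not its actual definition, and the loss is the standard per-point negative log-likelihood of a Weibull AFT model without an intercept, assumed here for illustration. A `censor` value of 1.0 means the event occurred (uncensored) and 0.0 means the observation is censored.

```scala
object AftLossSketch {
  // Hypothetical data point: censor = 1.0 means the event occurred
  // (uncensored); censor = 0.0 means the observation is censored.
  final case class Point(features: Array[Double], label: Double, censor: Double)

  // Per-point negative log-likelihood of a Weibull AFT model:
  //   censor * log(sigma) - censor * eps + exp(eps),
  // where eps = (log(label) - x'beta) / sigma.
  def negLogLik(p: Point, beta: Array[Double], sigma: Double): Double = {
    require(p.label > 0 && sigma > 0, "label and sigma must be positive")
    require(p.features.length == beta.length, "dimension mismatch")
    val margin = p.features.zip(beta).map { case (x, b) => x * b }.sum
    val eps = (math.log(p.label) - margin) / sigma
    p.censor * math.log(sigma) - p.censor * eps + math.exp(eps)
  }
}
```

Flipping the indicator changes which term each observation contributes, so a dataset coded with the opposite convention would silently fit a different model.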
Test build #42637 has finished for PR 8611 at commit
LGTM. Merged into master. Thanks! I created https://issues.apache.org/jira/browse/SPARK-10686, SPARK-10687, SPARK-10688, SPARK-10689 for follow-up work.
The Accelerated Failure Time (AFT) model is one of the most commonly used methods of survival analysis for censored survival data, and it is easy to parallelize. It is a log-linear model based on the Weibull distribution of the survival time.
Users can refer to the R function `survreg` to compare the model and `predict` to compare the predictions. There are different kinds of model prediction; I have selected the type `response`, which is the default in R.
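As a rough illustration of what a `response`-type prediction means for a log-linear survival model, the sketch below exponentiates the linear predictor to return a value on the original time scale. It is a standalone example under stated assumptions, not the PR's implementation: the object name, the method names, and the explicit `intercept` argument are hypothetical.

```scala
object AftPredictSketch {
  // Linear predictor on the log-time scale: x'beta + intercept.
  def linearPredictor(features: Array[Double], beta: Array[Double], intercept: Double): Double = {
    require(features.length == beta.length, "dimension mismatch")
    var lp = intercept
    var i = 0
    while (i < beta.length) {
      lp += features(i) * beta(i)
      i += 1
    }
    lp
  }

  // "response"-style prediction on the original time scale: exp(x'beta + intercept).
  def predictResponse(features: Array[Double], beta: Array[Double], intercept: Double): Double =
    math.exp(linearPredictor(features, beta, intercept))
}
```

When comparing against R, the fitted coefficients correspond to the linear predictor, while the predicted values to compare are the exponentiated ones.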