# dmlc / xgboost

Open
wants to merge 49 commits into
from
+724 −7

## Conversation

### avinashbarnwal commented Aug 12, 2019 • edited

 Hi, please find the Accelerated Failure Time (AFT) loss for survival modeling. Survival analysis is a "censored regression" where the goal is to learn a time-to-event function. It is similar to ordinary regression analysis, except that in ordinary regression all data points are uncensored. Time-to-event modeling is critical for understanding user and company behavior, including but not limited to credit, cancer, and attrition risks.

The loss supports four kinds of data: left-censored, right-censored, interval-censored, and uncensored. The underlying error distribution can be Normal, Logistic, or Extreme.

This project is part of Google Summer of Code 2019.

AFT-Xgboost Relevant Documents

Example in Python:

```python
import xgboost

# X, X_val and the y_lower* / y_higher* arrays are assumed to be defined by the caller.
params = {'learning_rate': 0.1,
          'aft_noise_distribution': 'normal',
          'aft_sigma': 1.0,
          'eval_metric': 'aft-nloglik@normal,1.0',
          'objective': "aft:survival"}
res = {}

dtrain = xgboost.DMatrix(X)
dtrain.set_float_info("label_lower_bound", y_lower)
dtrain.set_float_info("label_upper_bound", y_higher)

dtest = xgboost.DMatrix(X_val)
dtest.set_float_info("label_lower_bound", y_lower_val)
dtest.set_float_info("label_upper_bound", y_higher_val)

# params must be defined before training; evals_result collects the metric history.
bst = xgboost.train(params, dtrain, num_boost_round=100,
                    evals=[(dtrain, "train"), (dtest, "test")],
                    evals_result=res)
```

For more details, see avinashbarnwal#1.
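As an illustration of the four kinds of data mentioned in the description, here is a minimal sketch of how each case could be encoded as a pair of label bounds. The helper `censoring_bounds`, its `kind` tags, and the convention of using `0` and `+inf` for the open ends are assumptions made for this example; the thread does not state which sentinel values the PR expects.

```python
import numpy as np


def censoring_bounds(kind, low=None, high=None):
    """Return (lower, upper) label bounds for one observation.

    `kind` is a hypothetical tag used only in this sketch.
    """
    if kind == "uncensored":   # event time known exactly: bounds coincide
        return low, low
    if kind == "right":        # event happened some time after `low`
        return low, np.inf
    if kind == "left":         # event happened some time before `high`
        return 0.0, high
    if kind == "interval":     # event happened between `low` and `high`
        return low, high
    raise ValueError(f"unknown censoring kind: {kind}")


observations = [
    ("uncensored", 5.0, None),
    ("right", 3.0, None),
    ("left", None, 7.0),
    ("interval", 2.0, 4.0),
]
bounds = [censoring_bounds(kind, low, high) for kind, low, high in observations]

# These arrays play the role of y_lower / y_higher in the example above.
y_lower = np.array([b[0] for b in bounds], dtype=np.float32)
y_higher = np.array([b[1] for b in bounds], dtype=np.float32)
```

With bounds in this form, a single pair of arrays covers all four cases, so no separate censoring indicator column is needed.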
mentioned this pull request Aug 12, 2019
reviewed
src/common/survival_util.cc Outdated
reviewed
changed the title Survival analysis1 Add Accelerated Failure Time loss for survival analysis task Aug 12, 2019
self-assigned this Aug 12, 2019
requested review from trivialfis and hcho3 and removed request for trivialfis Aug 12, 2019
reviewed
src/common/survival_util.cc Outdated
Collaborator

### hcho3 commented Aug 12, 2019 • edited

 @avinashbarnwal In the PR description, can you add a short one-paragraph description of what survival analysis is? Something like: survival analysis is a new kind of learning task where we would like to predict a time to certain event. The time-to-event labels are often censored, i.e. we only know which intervals the label falls in and do not know its exact value. See https://eng.uber.com/modeling-censored-time-to-event-data-using-pyro/ for a real-world example.
reviewed
src/common/survival_util.cc Outdated

### tdhock commented Aug 12, 2019 • edited

 > survival analysis is a new kind of learning task where we would like to predict a time to certain event

I would describe it as "censored regression" or more specifically "regression with censored outputs" because the goal is still to learn a (real-valued) regression function; this emphasizes the similarity with usual regression, where all outputs are un-censored.
Collaborator

### hcho3 commented Aug 12, 2019 • edited

 Also add a short Python example to the description:

```python
dtrain = xgboost.DMatrix(X)
dtrain.set_float_info("label_lower_bound", y_lower)
dtrain.set_float_info("label_upper_bound", y_higher)

dtest = xgboost.DMatrix(X_test)
dtest.set_float_info("label_lower_bound", y_lower_test)
dtest.set_float_info("label_upper_bound", y_higher_test)

bst = xgboost.train(params, dtrain, num_boost_round=100,
                    evals=[(dtrain, "train"), (dtest, "test")])
```
added the label Aug 12, 2019
Collaborator

### hcho3 commented Aug 12, 2019

 @tdhock Thanks for your suggestion. Yes, "censored regression" sounds reasonable.
reviewed
src/common/survival_util.cc Outdated
Member

### trivialfis commented Aug 14, 2019 • edited

 I'm not familiar with survival models, just skimmed through the survey. Are there other recommended materials concentrating on theoretical part? ;-)
Author

### avinashbarnwal commented Aug 14, 2019

 Hi @trivialfis, here are some good lecture notes for learning survival modeling: https://www4.stat.ncsu.edu/~dzhang2/st745/index.html. One of the motivating books: https://www.amazon.com/Applied-Survival-Analysis-Time-Event/dp/0471754994. Prof. @tdhock and @hcho3 might give a better reference for understanding theoretical survival modeling.

### tdhock commented Aug 14, 2019

 would be good if @avinashbarnwal could write a latex/PDF vignette in the xgboost R pkg describing the loss functions that he implemented

### tdhock commented Aug 14, 2019

 they are the same as in R's survival::survreg, there are some docs on that man page, but the math formulas come from http://members.cbio.mines-paristech.fr/~thocking/survival.pdf
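For readers without the references at hand, the following is a minimal sketch of the standard AFT negative log-likelihood with normal noise, written under the usual model log(y) = prediction + sigma * z with z standard normal. It is an illustration of the textbook formula, not the code in this PR; the function names are made up for this example, labels are assumed to be positive, and `0` and `+inf` are assumed to mark the open ends of censored intervals.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def aft_nloglik_normal(y_lower, y_higher, y_pred, sigma=1.0):
    """Negative log-likelihood of one observation under a log-normal AFT model."""
    if y_lower == y_higher:
        # Uncensored: density of y, including the 1 / (sigma * y) change-of-variable term.
        z = (math.log(y_lower) - y_pred) / sigma
        return -math.log(normal_pdf(z) / (sigma * y_lower))
    # Censored: probability mass that the model puts on [y_lower, y_higher].
    if math.isinf(y_higher):
        cdf_upper = 1.0
    else:
        cdf_upper = normal_cdf((math.log(y_higher) - y_pred) / sigma)
    if y_lower <= 0.0:
        cdf_lower = 0.0
    else:
        cdf_lower = normal_cdf((math.log(y_lower) - y_pred) / sigma)
    # No clipping is applied here, so math.log raises if the mass underflows to zero.
    return -math.log(cdf_upper - cdf_lower)


uncensored = aft_nloglik_normal(1.0, 1.0, y_pred=0.0)
right_censored = aft_nloglik_normal(1.0, math.inf, y_pred=0.0)
left_censored = aft_nloglik_normal(0.0, 1.0, y_pred=0.0)
```

Swapping `normal_pdf` and `normal_cdf` for the logistic or extreme-value counterparts gives the other two distributions in the same way.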
Member

### trivialfis commented Aug 14, 2019

 @avinashbarnwal @tdhock Thanks for the good references. Will try to catch up.
Author

### avinashbarnwal commented Aug 14, 2019

 Hi Prof. @tdhock and @hcho3, I will start writing up the loss functions in a LaTeX/PDF vignette for the xgboost R pkg.
Author

### avinashbarnwal commented Aug 15, 2019

 Hi Prof. @tdhock, please let me know if it is fine to make the vignette like this: https://cran.r-project.org/web/packages/xgboost/vignettes/xgboostfromJSON.html.

### tdhock commented Aug 15, 2019

 typically for vignettes with lots of math I prefer writing Rnw source which is rendered to tex / pdf. It is possible to include simple math in Rmd which is rendered on a web page using mathjax, but in my experience complex equations (e.g. optimization problems) do not render well on web pages. Examples of both are here: https://github.com/tdhock/PeakSegDisk/tree/master/vignettes https://github.com/tdhock/PeakSegDisk/blob/master/vignettes/Examples.Rnw has some equations including an optimization problem and it is compiled to pdf. https://github.com/tdhock/PeakSegDisk/blob/master/vignettes/Worst_case.Rmd#L48 has some simple equations and is compiled to html
reviewed
include/xgboost/data.h Outdated
Author

### avinashbarnwal commented Aug 24, 2019

 Hi Prof. @tdhock and @hcho3, Please find R-vignette below and let me know your thoughts. http://rpubs.com/avinashbarnwal123/aft
Collaborator

### hcho3 commented Aug 26, 2019 • edited

 @tdhock Do the datasets follow log-normal AFT distribution? The errors are not decreasing when we choose log-logistic and log-weibull. See http://rpubs.com/avinashbarnwal123/aft

### tdhock commented Aug 26, 2019

 they are real data sets so we don't know their "true" distribution. However in previous experience with linear models, I have observed that a loss function with quadratic tails (like the normal distribution) works better than linear tails (like the logistic)
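To make the remark about tails concrete, here is a small sketch comparing the negative log density of the standard normal and standard logistic distributions. It assumes that "tails" refers to how the per-example loss (the negative log density) grows with the standardized residual z; the function names are chosen for this example only.

```python
import math


def normal_neg_log_pdf(z):
    # -log of the standard normal density: grows like z^2 / 2.
    return 0.5 * z * z + 0.5 * math.log(2.0 * math.pi)


def logistic_neg_log_pdf(z):
    # -log of the standard logistic density, written in a numerically stable
    # symmetric form: grows like |z| for large |z|.
    a = abs(z)
    return a + 2.0 * math.log1p(math.exp(-a))


for z in (1.0, 5.0, 10.0, 20.0):
    print(z, normal_neg_log_pdf(z), logistic_neg_log_pdf(z))
```

Doubling z roughly quadruples the normal loss but only doubles the logistic loss, so the normal loss penalizes large residuals far more heavily.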
Author

### avinashbarnwal commented Aug 26, 2019 • edited

 Hi Prof. @tdhock and @hcho3, I have updated the vignette: http://rpubs.com/avinashbarnwal123/aft. It works for the last dataset, H3K36me3_AM_immune. Please check the last fold. This might not be clear because of the scale. It works for both Logistic and Extreme. I think we need more datasets like this one, where it works.
 [WIP] Add lower and upper bounds on the label for survival analysis 
 1a0d73d 
and others added 9 commits Aug 13, 2019
 changes to variable names and pow 
 9d689f2 
 changes to Logistic Pdf 
 7ae3c98 
 changes to parameter 
 69c95c5 
 Expose lower and upper bound labels to R package 
 96d8360 
 Use example weights; normalize log likelihood metric 
 29550db 
 changes to CHECK 
 24635bc 
 changes to logistic hessian to standard formula 
 01cb150 
 changes to logistic formula 
 4002399 
 Comply with coding style guideline 
 cf272f5 
force-pushed the avinashbarnwal:survival_analysis1 branch from 5b6d707 to cf272f5 Sep 19, 2019
Collaborator

### hcho3 commented Sep 19, 2019

 Rebased against the latest master. I'll address @RAMitchell's remark soon, by putting y_lower and y_upper in a std::map
added 2 commits Sep 19, 2019
 Revert back Rabit submodule 
 ab1271d 
 Revert dmlc-core submodule 
 23890f4 
changed the title Add Accelerated Failure Time loss for survival analysis task [WIP] Add Accelerated Failure Time loss for survival analysis task Sep 19, 2019
added 3 commits Sep 19, 2019
 Comply with coding style guideline (clang-tidy) 
 2b5654a 
 Fix an error in AFTLoss::Gradient() 
 d59bd92 
 Add missing files to amalgamation 
 f1ee3a7 
reviewed
Member

### trivialfis left a comment

 Looking forward to it. Still haven't got the time to learn the materials. I may need to work harder ...
 } double AFTLoss::Hessian(double y_lower, double y_higher, double y_pred, double sigma) { double z;

Member

#### avinashbarnwal Sep 25, 2019

Author

 struct AFTParam : public dmlc::Parameter { AFTDistributionType aft_noise_distribution; float aft_sigma;

#### trivialfis Sep 25, 2019

Member

This is a personal preference, gamma looks good on paper, but in code it doesn't ... I would go with something more specific like distribution_scale_factor (still not a good name) and document it with the gamma parameter used in xxx paper.

#### avinashbarnwal Sep 25, 2019

Author

Did you mean aft_sigma?

added 2 commits Oct 3, 2019
 Address @RAMitchell's comment: minimize future change in MetaInfo interface 
 9ede4b4 
 Fix lint 
 56a0530 
Collaborator

### hcho3 commented Oct 4, 2019

 @RAMitchell Can you take a look at the change I made to MetaInfo? Now we have a generic key-value store for 1D vectors.
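For readers who want a picture of the idea, the following is a rough Python sketch of a generic key-value store for 1D vectors. It is not the C++ `MetaInfo` code from this PR; the class `ExtraInfoStore` and its methods are hypothetical and exist only to illustrate the design, in which new per-row fields such as the label bounds can be added without changing the interface.

```python
import numpy as np


class ExtraInfoStore:
    """Hypothetical stand-in for a key-value store of 1D float vectors."""

    def __init__(self, num_row):
        self.num_row = num_row
        self._fields = {}

    def set_float_info(self, key, values):
        arr = np.asarray(values, dtype=np.float32)
        # Every field must be a 1D vector with one entry per data row.
        if arr.ndim != 1 or arr.shape[0] != self.num_row:
            raise ValueError(f"{key}: expected a 1D vector of length {self.num_row}")
        self._fields[key] = arr

    def get_float_info(self, key):
        return self._fields[key]


store = ExtraInfoStore(num_row=3)
store.set_float_info("label_lower_bound", [1.0, 2.0, 3.0])
store.set_float_info("label_upper_bound", [1.0, np.inf, 5.0])
```

Adding another per-row field later only requires a new key, not a new member or setter.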
 Fix compilation error on 32-bit target, when size_t == bst_uint 
 bfa838c 
changed the title [WIP] Add Accelerated Failure Time loss for survival analysis task Add Accelerated Failure Time loss for survival analysis task Oct 6, 2019
mentioned this pull request Oct 15, 2019
mentioned this pull request Oct 17, 2019
 Allocate sufficient memory to hold extra label info 
 a668d7a 
Member

### trivialfis commented Dec 12, 2019

 Feel free to reach out to me for any needed help here. I spent some time with ngboost, which might overlap with this one. Will dig into these interesting things once the next release is sorted out. ;-)
Collaborator

### hcho3 commented Dec 12, 2019

 @trivialfis This is blocked by the binary MetaInfo format. I need to get to it.
 Use OpenMP to speed up 
 7460dca 
mentioned this pull request Jan 8, 2020