### avinashbarnwal commented Aug 12, 2019 • edited

 Hi, please find the Accelerated Failure Time (AFT) loss for survival modeling. Survival analysis is a form of "censored regression" where the goal is to learn a time-to-event function. It is similar to ordinary regression analysis, where all data points are uncensored. Time-to-event modeling is critical for understanding user and company behavior, including but not limited to credit, cancer, and attrition risks.

 It supports four kinds of data: left-censored, right-censored, interval-censored, and uncensored. The underlying error distribution can be Normal, Logistic, or Extreme. This project is part of Google Summer of Code 2019.

 AFT-Xgboost
 Relevant Documents

 Example in Python:

    res = {}
    # The label of each row is an interval [y_lower, y_higher]
    dtrain = xgboost.DMatrix(X)
    dtrain.set_float_info("label_lower_bound", y_lower)
    dtrain.set_float_info("label_upper_bound", y_higher)
    dtest = xgboost.DMatrix(X_val)
    dtest.set_float_info("label_lower_bound", y_lower_val)
    dtest.set_float_info("label_upper_bound", y_higher_val)
    # params must be defined before calling train()
    params = {'learning_rate': 0.1,
              'aft_noise_distribution': 'normal',
              'aft_sigma': 1.0,
              'eval_metric': 'aft-nloglik@normal,1.0',
              'objective': "aft:survival"}
    bst = xgboost.train(params, dtrain, num_boost_round=100,
                        evals=[(dtrain, "train"), (dtest, "test")],
                        evals_result=res)

 For more details, see avinashbarnwal#1.
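 To make the four censoring types concrete, here is a minimal sketch of how the `y_lower` / `y_higher` arrays in the example could be built. The helper `make_bounds` and the encoding (lower bound `0` for left-censored rows, upper bound `+inf` for right-censored rows) are assumptions for illustration; the PR description does not spell out the convention.

```python
import numpy as np

def make_bounds(kind, t_low=None, t_high=None):
    """Return (lower, upper) label bounds for one observation.

    `kind` is a hypothetical tag, not part of the PR:
    'uncensored', 'right', 'left', or 'interval'.
    """
    if kind == "uncensored":
        return t_low, t_low        # exact event time: bounds coincide
    if kind == "right":
        return t_low, np.inf       # event happened some time after t_low
    if kind == "left":
        return 0.0, t_high         # event happened some time before t_high
    if kind == "interval":
        return t_low, t_high       # event happened between the two times
    raise ValueError(f"unknown censoring kind: {kind}")

observations = [
    ("uncensored", 5.0, None),
    ("right", 3.0, None),
    ("left", None, 4.0),
    ("interval", 2.0, 6.0),
]
bounds = [make_bounds(*obs) for obs in observations]
y_lower = np.array([b[0] for b in bounds], dtype=float)
y_higher = np.array([b[1] for b in bounds], dtype=float)
```

 Under this assumed encoding, an ordinary regression label is just the special case where both bounds are equal.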
Collaborator

### hcho3 commented Aug 12, 2019 • edited

 @avinashbarnwal In the PR description, can you add a short one-paragraph description of what survival analysis is? Something like: survival analysis is a new kind of learning task where we would like to predict a time to certain event. The time-to-event labels are often censored, i.e. we only know which intervals the label falls in and do not know its exact value. See https://eng.uber.com/modeling-censored-time-to-event-data-using-pyro/ for a real-world example.
### tdhock commented Aug 12, 2019 • edited

 > survival analysis is a new kind of learning task where we would like to predict a time to certain event

 I would describe it as "censored regression" or more specifically "regression with censored outputs" because the goal is still to learn a (real-valued) regression function; this emphasizes the similarity with usual regression, where all outputs are un-censored.
Collaborator

### hcho3 commented Aug 12, 2019 • edited

 Also add a short Python example to the description:

    dtrain = xgboost.DMatrix(X)
    dtrain.set_float_info("label_lower_bound", y_lower)
    dtrain.set_float_info("label_upper_bound", y_higher)
    dtest = xgboost.DMatrix(X_test)
    dtest.set_float_info("label_lower_bound", y_lower_test)
    dtest.set_float_info("label_upper_bound", y_higher_test)
    bst = xgboost.train(params, dtrain, num_boost_round=100,
                        evals=[(dtrain, "train"), (dtest, "test")])
Collaborator

### hcho3 commented Aug 12, 2019

 @tdhock Thanks for your suggestion. Yes, "censored regression" sounds reasonable.
Member

### trivialfis commented Aug 14, 2019 • edited

 I'm not familiar with survival models, I just skimmed through the survey. Are there other recommended materials concentrating on the theoretical part? ;-)
Author

### avinashbarnwal commented Aug 14, 2019

 Hi @trivialfis, here are some good lecture notes for learning survival modeling: https://www4.stat.ncsu.edu/~dzhang2/st745/index.html. One motivating book is https://www.amazon.com/Applied-Survival-Analysis-Time-Event/dp/0471754994. Prof. @tdhock and @hcho3 might be able to suggest better references on the theory of survival modeling.

### tdhock commented Aug 14, 2019

 would be good if @avinashbarnwal could write a latex/PDF vignette in the xgboost R pkg describing the loss functions that he implemented

### tdhock commented Aug 14, 2019

 they are the same as in R's survival::survreg, there are some docs on that man page, but the math formulas come from http://members.cbio.mines-paristech.fr/~thocking/survival.pdf
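 For readers who want to see the shape of such a loss before opening the references, here is a rough sketch of an AFT negative log-likelihood with normally distributed noise. It is an illustration under stated assumptions, not the code in this PR: the function names are hypothetical, `y_pred` is assumed to be the predicted log time, and the small clamp `eps` is an added guard against `log(0)`.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def aft_nloglik_normal(y_lower, y_higher, y_pred, sigma=1.0, eps=1e-12):
    """Negative log-likelihood of one label interval [y_lower, y_higher].

    Assumes log(T) = y_pred + sigma * Z with Z standard normal.
    """
    if y_lower == y_higher:
        # Uncensored: use the density of T at the observed time.
        z = (math.log(y_lower) - y_pred) / sigma
        density = norm_pdf(z) / (sigma * y_lower)
        return -math.log(max(density, eps))
    # Censored: use the probability mass that falls inside the interval.
    if math.isinf(y_higher):
        cdf_upper = 1.0          # right-censored: no upper limit
    else:
        cdf_upper = norm_cdf((math.log(y_higher) - y_pred) / sigma)
    if y_lower <= 0.0:
        cdf_lower = 0.0          # left-censored: no lower limit
    else:
        cdf_lower = norm_cdf((math.log(y_lower) - y_pred) / sigma)
    return -math.log(max(cdf_upper - cdf_lower, eps))
```

 Swapping `norm_pdf` / `norm_cdf` for the logistic or extreme-value density and CDF would give the other two variants in the same form.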
Member

### trivialfis commented Aug 14, 2019

 @avinashbarnwal @tdhock Thanks for the good references. Will try to catch up.
Author

### avinashbarnwal commented Aug 14, 2019

 Hi Prof. @tdhock and @hcho3, I will start writing up the loss functions in a LaTeX/PDF vignette for the xgboost R pkg.
Author

### avinashbarnwal commented Aug 15, 2019

 Hi Prof. @tdhock, please let me know if it is fine to make the vignette like this one: https://cran.r-project.org/web/packages/xgboost/vignettes/xgboostfromJSON.html.

### tdhock commented Aug 15, 2019

 typically for vignettes with lots of math I prefer writing Rnw source which is rendered to tex / pdf. It is possible to include simple math in Rmd which is rendered on a web page using mathjax, but in my experience complex equations (e.g. optimization problems) do not render well on web pages. Examples of both are here: https://github.com/tdhock/PeakSegDisk/tree/master/vignettes

 https://github.com/tdhock/PeakSegDisk/blob/master/vignettes/Examples.Rnw has some equations including an optimization problem and it is compiled to pdf.

 https://github.com/tdhock/PeakSegDisk/blob/master/vignettes/Worst_case.Rmd#L48 has some simple equations and is compiled to html
Author

### avinashbarnwal commented Aug 24, 2019

 Hi Prof. @tdhock and @hcho3, Please find R-vignette below and let me know your thoughts. http://rpubs.com/avinashbarnwal123/aft
Collaborator

### hcho3 commented Aug 26, 2019 • edited

 @tdhock Do the datasets follow a log-normal AFT distribution? The errors are not decreasing when we choose log-logistic and log-weibull. See http://rpubs.com/avinashbarnwal123/aft

### tdhock commented Aug 26, 2019

 they are real data sets so we don't know their "true" distribution. However in previous experience with linear models, I have observed that a loss function with quadratic tails (like the normal distribution) works better than linear tails (like the logistic).
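 The "quadratic tails" versus "linear tails" distinction can be checked numerically. The sketch below compares the negative log-density of the standard normal and standard logistic distributions at a residual `z`; the function names are illustrative and not taken from the PR.

```python
import math

def normal_nll(z):
    # -log of the standard normal density: grows like z^2 / 2
    return 0.5 * z * z + 0.5 * math.log(2.0 * math.pi)

def logistic_nll(z):
    # -log of the standard logistic density e^{-z} / (1 + e^{-z})^2.
    # The density is symmetric, so |z| is used to keep exp() from overflowing.
    a = abs(z)
    return a + 2.0 * math.log1p(math.exp(-a))

for z in (1.0, 2.0, 4.0, 8.0):
    print(f"z={z:4.1f}  normal={normal_nll(z):8.3f}  logistic={logistic_nll(z):8.3f}")
```

 Going from `z = 4` to `z = 8` adds 24 to the normal loss but only about 4 to the logistic loss, so a label far from the prediction is penalized much more heavily under the normal distribution.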
Author

### avinashbarnwal commented Aug 26, 2019 • edited

 Hi Prof. @tdhock and @hcho3, I have updated the vignette: http://rpubs.com/avinashbarnwal123/aft. It works for the last dataset, H3K36me3_AM_immune. Please check the last fold. This might not be clear because of the scale. It works for both Logistic and Extreme. I think we need more datasets like that one, where it works.
Collaborator

### hcho3 commented Sep 19, 2019

 Rebased against the latest master. I'll address @RAMitchell's remark soon, by putting y_lower and y_upper in a std::map
Member

### trivialfis left a comment

 Looking forward to it. Still haven't got the time to learn the materials. I may need to work harder ...

    }

    double AFTLoss::Hessian(double y_lower, double y_higher, double y_pred, double sigma) {
      double z;

Member

#### avinashbarnwal Sep 25, 2019

Author

    struct AFTParam : public dmlc::Parameter<AFTParam> {
      AFTDistributionType aft_noise_distribution;
      float aft_sigma;

#### trivialfis Sep 25, 2019

Member

This is a personal preference, gamma looks good on paper, but in code it doesn't ... I would go with something more specific like distribution_scale_factor (still not a good name) and document it with the gamma parameter used in xxx paper.

#### avinashbarnwal Sep 25, 2019

Author

Did you mean aft_sigma?

Collaborator

### hcho3 commented Oct 4, 2019

 @RAMitchell Can you take a look at the change I made to MetaInfo? Now we have a generic key-value store for 1D vectors.
### trivialfis commented Dec 12, 2019

 Feel free to reach out to me for any needed help here. I spent some time with ngboost, which might overlap with this one. Will dig into these interesting things once the next release is sorted out. ;-)
Collaborator

### hcho3 commented Dec 12, 2019

 @trivialfis This is blocked by the binary MetaInfo format. I need to get to it.
