Summary of GSOC2019 calls #1

Open · avinashbarnwal opened this issue May 21, 2019 · 121 comments

@avinashbarnwal (Owner) commented May 21, 2019

21MAY2019
Relevant information:

  1. What should the xgb DMatrix be for interval regression?
     Ans: Create two columns for all the times, i.e. a 2-column matrix representation of outputs = labels (see the sketch after this list). The four censoring types map onto intervals as follows:

  • Un-censored output: the event is observed, e.g. y_i = 5, encoded as (5, 5).

survival::Surv(10, 10, type="interval2")
[1] 10

  • Left-censored output: the event is not observed, but we know it happened some time before t_i = 5, so y_i = (\underline{y}_i, \overline{y}_i) = (-Inf, 5).

survival::Surv(-Inf, 10, type="interval2")
[1] 10-

  • Right-censored output: the event is not observed, but we know it happened some time after t_i = 5, so y_i = (5, Inf).

survival::Surv(5, Inf, type="interval2")
[1] 5+

  • Interval-censored output: the event is not observed, but we know it happened some time in y_i = (5, 10).

survival::Surv(5, 10, type="interval2")
[1] [5, 10]

There is no need for the survival object; it is more of a legacy interface.

  2. How to handle sigma in AFT?
     Ans: Treat it as a hyperparameter.

  3. What is the dimension of the predicted value for interval regression?
     Ans: It is always one real value, not an interval.
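For concreteness, here is a minimal R sketch of the 2-column label representation from answer 1 (variable names are illustrative, not an actual API):

y.lower <- c(5, -Inf, 5, 5)    # uncensored, left, right, interval
y.upper <- c(5, 5, Inf, 10)
labels <- cbind(y.lower, y.upper)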

Relevant documents:
https://github.com/tdhock/aft-poster/blob/master/HOCKING-AFT.pdf
http://members.cbio.mines-paristech.fr/~thocking/survival.pdf
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-269

Please check.
@tdhock, @hcho3

@hcho3 (Collaborator) commented May 28, 2019

May 28, 2019 Update

  1. Done: re-created the loss charts from Toby’s paper to show that loss functions are reasonable
  2. Done: submitted RFC to public XGBoost repo
  3. In-Progress: Python Proof-of-Concept for end-to-end gradient boosting with survival loss
  4. In-Progress: Derive formulas for AFT

@avinashbarnwal @tdhock

@tdhock commented May 29, 2019

Hi @avinashbarnwal, can you please post the source code and rendered images for the loss charts you created?

@hcho3 (Collaborator) commented May 29, 2019

@tdhock commented May 29, 2019

Looks like a good start, but the plot is incorrect for uncensored data: it should look like the square loss around the label for the normal distribution. Also, you should try creating a facetted ggplot, with one panel per censoring type. For inspiration, here is the code I used for the 1-page AFT poster: https://github.com/tdhock/aft-poster/blob/master/figure-loss.R
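For illustration, a minimal sketch of such a facetted plot (a log-normal style loss with label 5, interval (5, 10), and sigma = 1 are assumptions for the example; the actual poster code is in figure-loss.R):

library(ggplot2)
pred <- seq(0.1, 20, length.out = 200)   # grid of predicted survival times
sigma <- 1
panel <- function(type, loss) data.frame(type, pred, loss)
loss.df <- rbind(
  panel("uncensored", -dnorm(log(5), log(pred), sigma, log = TRUE)),
  panel("left",       -log(pnorm(log(5), log(pred), sigma))),
  panel("right",      -log(1 - pnorm(log(5), log(pred), sigma))),
  panel("interval",   -log(pnorm(log(10), log(pred), sigma) -
                           pnorm(log(5),  log(pred), sigma))))
ggplot(loss.df, aes(pred, loss)) +
  geom_line() +
  facet_wrap(~ type, scales = "free_y")   # one panel per censoring type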

@avinashbarnwal (Owner Author) commented:

Thanks, Prof. @tdhock I am looking into it.

@tdhock commented May 31, 2019

Looks better, @avinashbarnwal! Glad to see that you got the facetted ggplots working.

However, it looks like there is a problem with your computation of the logistic loss for the uncensored output -- it should be minimal at the label.

  • The x-axis title says log scale, but the ticks indicate a linear scale (i.e. the distance between 2.5 and 5.0 is the same as between 5.0 and 7.5).
  • The x-axis title also says that you are plotting the predicted survival time in days, but I don't think that is right -- aren't you actually plotting the predicted center of the gaussian/logistic, i.e. log(predicted time)?
  • The legend says logLogistic and logNormal, but you may want to change that to normal/logistic for simplicity -- it is usually easier to deal with real-valued labels than positive-valued labels. Technically, if t_i > 0 is the survival time and y_i = log(t_i) \in R is the real-valued log survival time, then the model assumes t_i ~ lognormal(f(x_i), \sigma^2), which is the same thing as y_i ~ normal(f(x_i), \sigma^2) (see the check after this list).
  • Please increase the number of grid points at which you evaluate the loss functions (the curves are pretty choppy).
  • Please increase the size of the interval-censored output, so we can see the loss function approach zero and flatten out at the center of the limits.
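As a quick sanity check of the lognormal/normal equivalence in the third point (a one-liner in R; the constants are arbitrary):

t <- 5; mu <- 1.2; sigma <- 0.7
dlnorm(t, mu, sigma)           # lognormal density of t
dnorm(log(t), mu, sigma) / t   # normal density of log(t) times the Jacobian 1/t: same value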

@tdhock commented May 31, 2019

Also, for next week's homework, please check your formulas for the gradient in the Overleaf. Add rows of plots to the figure that shows the loss functions; use facet_grid(fun ~ type) -- rows for different functions (loss, gradient, hessian):

  1. first row should be the loss function
  2. second row should be the gradient
  3. third row should be the hessian

@avinashbarnwal (Owner Author) commented Jun 5, 2019

Prof. @tdhock, @hcho3

I have remade the loss function plots with all the required changes.
Here is the link to the plot: https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/loss_aft.png

I have changed the code as well.
Code link: https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/assigment1.R

Prof. @tdhock, please check the formula for the Normal, uncensored part. Instead of the formula given in https://github.com/avinashbarnwal/GSOC-2019/blob/master/paper/HOCKING-AFT.pdf, I have used the formula from this document: http://home.iitk.ac.in/~kundu/paper146.pdf

@tdhock commented Jun 5, 2019 via email

@avinashbarnwal (Owner Author) commented Jun 5, 2019

Prof. @tdhock, the uncensored Normal loss function:
-log(1/(t.lower * sigma * sqrt(2 * pi)) * exp((log(t.lower/y.hat))**2/(-2 * sigma * sigma)))

We have a constant term -log(1/(t.lower * sigma * sqrt(2 * pi))) irrespective of y.hat, which makes the minimum non-zero; that is my understanding.

Similarly for the Logistic.

Sorry, I meant the Normal uncensored formula in the above.

Normal uncensored, old formula:
-log(1/(y.hat * sigma * sqrt(2 * pi)) * exp((log(t.lower/y.hat))**2/(-2 * sigma * sigma)))

Normal uncensored, new formula:
-log(1/(t.lower * sigma * sqrt(2 * pi)) * exp((log(t.lower/y.hat))**2/(-2 * sigma * sigma)))
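A quick numeric check (a sketch, using the constants from the formulas above) that the new formula is minimized at y.hat = t.lower:

sigma <- 1; t.lower <- 5
nll <- function(y.hat) -log(1/(t.lower * sigma * sqrt(2*pi)) * exp(log(t.lower/y.hat)**2/(-2 * sigma**2)))
optimize(nll, c(0.1, 50))$minimum   # approximately 5 = t.lower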

@tdhock commented Jun 5, 2019 via email

@avinashbarnwal (Owner Author) commented:

Hi Prof. @tdhock, @hcho3

Please find the updated document for AFT (https://github.com/avinashbarnwal/GSOC-2019/blob/master/doc/Accelerated_Failure_Time.pdf). I think I have made a mistake here, as I am taking the gradient and hessian with respect to y.hat, not X\beta.

For the least-squares loss function, the gradient and hessian with respect to y.hat and X\beta are the same, but it starts to matter when link functions are involved. Similarly, for classification, we take the gradient and hessian with respect to X\beta, not y.hat.

For reference, please check the Bernoulli part of this document: https://cran.r-project.org/web/packages/gbm/vignettes/gbm.pdf

This also requires a decision on notation, since in the survival document we use f for the pdf, while in the above document f(x_i) denotes X\beta.

Please let me know your thoughts.

@tdhock commented Jun 6, 2019

I don't think there is an issue, because the link function is always the identity for the normal and logistic models. Again, we should be able to use the formulas in the survival manual.

@avinashbarnwal (Owner Author) commented:

Prof. @tdhock, I think we need to calculate the gradient with respect to log(y.hat), not y.hat, since log(y.hat) = X\beta. Please let me know.

@tdhock commented Jun 6, 2019

Section 6.8 of the survival manual gives derivatives with respect to eta, which is the real-valued prediction.

@tdhock commented Jun 6, 2019

In your notation you use log(y.hat) for the real-valued prediction, so that is the same as eta from the survival manual.

In your PDF, please do not use X\beta, as that is only valid for linear models.

@avinashbarnwal (Owner Author) commented:

Prof. @tdhock, thanks. I will use eta for log(y.hat) and take the gradient and hessian based on that. In my document I was taking the gradient and hessian based on y.hat, not eta. I will make the correction.
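For reference, the change of variables is just the chain rule: with \eta = \log(\hat y),

\frac{\partial \ell}{\partial \eta} = \frac{\partial \ell}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial \eta} = \hat y \, \frac{\partial \ell}{\partial \hat y}, \qquad \frac{\partial^2 \ell}{\partial \eta^2} = \hat y^2 \, \frac{\partial^2 \ell}{\partial \hat y^2} + \hat y \, \frac{\partial \ell}{\partial \hat y}.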

@avinashbarnwal (Owner Author) commented Jun 17, 2019

@tdhock commented Jun 17, 2019

The plot looks good, except for the hessian of the interval-censored outputs for small predictions -- it seems to be too large. Can you please double-check? It should look like the right-censored outputs (i.e. with finite lower limit) on the left side of the plot.

@avinashbarnwal (Owner Author) commented:

Thanks, Prof. @tdhock. I have changed both the code and the plot; I had one sign in the formula wrong. Please recheck the plot (https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/loss_grad_hess_aft.png). Do you think the interval hessian is correct for the Normal distribution?

@avinashbarnwal (Owner Author) commented Jun 23, 2019

Hi @hcho3 and Prof. @tdhock,

Please find the Python implementation of gradient boosting for AFT:

Code - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/gb_aft.ipynb
Plot - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/Nesterov_False.png

Please take a look and let me know your thoughts.

@hcho3 (Collaborator) commented Jun 24, 2019

We'll need to check the POC implementation, since the training loss should trend down, not up.
Also, let's plot mean(abs(log(Y) - eta)), since the predicted score eta should try to match log(Y).

@avinashbarnwal (Owner Author) commented:

Hi @hcho3 and Prof. @tdhock,

As discussed with @hcho3, I am working on the documentation for the binomial loss and a POC for the same.

@avinashbarnwal (Owner Author) commented:

@tdhock commented Jul 2, 2019

There is still something wrong with the interval hessian for small predicted values in https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/loss_grad_hess_aft.png -- it should become constant as the prediction goes to zero, i.e. as log(prediction) goes to -Inf.

@tdhock commented Jul 2, 2019

The binomial loss looks reasonable, except for the x-axis label.

Please (1) use facet_grid, (2) use more grid points, and (3) maybe use different columns or colors for different labels.

@avinashbarnwal (Owner Author) commented:

Hi Prof. @tdhock and @hcho3,

I have pushed the extreme distribution to xgboost, and also updated the R files and plots to use the naming convention "Extreme".

@tdhock commented Aug 8, 2019

How will that be specified in the user interface?

Right now the "survival:cox" objective is documented on https://xgboost.readthedocs.io/en/latest/parameter.html as "negative values are considered right censored", which implies to me that the "survival:" prefix means all outputs are positive-valued (un-logged survival times).

To avoid confusion with that objective/output, I would suggest a new prefix for the three loss functions: "aft:gaussian", "aft:extreme", and "aft:logistic" -- does that make sense to you @hcho3? In each of these cases the output should be specified by an interval (a, b) where both a and b are potentially negative (log survival times). In other words, if you have an output vector c(5.5, -0.1) for survival:cox, that means the first output is un-censored and the second is right-censored, (0.1, INFINITY), on the positive scale (survival times). The corresponding outputs that we would use are (1.7, 1.7) and (-2.3, INFINITY) on the real-valued scale (log survival times).
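A small R sketch of that conversion (illustrative names; it follows the example above):

cox.y <- c(5.5, -0.1)               # negative value = right censored (survival:cox convention)
lower <- log(abs(cox.y))            # log(5.5) ~ 1.7, log(0.1) ~ -2.3
upper <- ifelse(cox.y > 0, log(cox.y), Inf)
cbind(lower, upper)                 # rows: (1.7, 1.7) and (-2.3, Inf)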

@hcho3 (Collaborator) commented Aug 8, 2019

@tdhock I think "survival:cox" uses survival time, not log survival time, so the output would be positive in that case. The current AFT implementation uses survival time as well.

@tdhock commented Aug 8, 2019

OK, in that case the -0.1 output for survival:cox becomes (0.1, INFINITY) for us,
and in that case "survival:lognormal", "survival:loglogistic", and "survival:weibull" may make more sense.

However, I would suggest specifying outputs as real-valued (potentially negative) log survival times, which has two advantages in my opinion:

  1. It can use the entire double-precision number range for storage (negative numbers cannot be used if the output is a positive survival time).
  2. It is consistent with the underlying real-valued distributions (gaussian, logistic, extreme).

Maybe both aft:gaussian (using real-valued log survival times) and survival:loggaussian (using positive-valued un-logged survival times) etc. can be supported?

@hcho3 (Collaborator) commented Aug 8, 2019

@tdhock That's a reasonable suggestion. We will need to revise the survival:cox implementation to use log survival times.

@avinashbarnwal (Owner Author) commented:

Hi Prof. @tdhock, @hcho3,

I am going through the document https://github.com/avinashbarnwal/GSOC-2019/blob/master/paper/AFT/survival.pdf that you shared. On page 82, the Extreme distribution section says:

"If y is Weibull then log(y) is distributed according to the (least) extreme
value distribution."

In the previous cases, normal and logistic, y itself follows the normal or logistic distribution.

Please let me know if I am missing anything.

I am trying to test the extreme distribution using a notebook on the first few datasets, and I am finding Inf in the errors.
Link to the notebook: https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/test/data/neuroblastoma-data-master/src/notebook/002_xgboost_test.ipynb

[screenshot of gradient and hessian values]

I have used all the formulas mentioned in the document.

@tdhock commented Aug 8, 2019

The loss function for the extreme value distribution goes to Inf very fast as the predicted value gets smaller.

If it helps, there is an alternative way to compute the loss (my.cost in the code below) which is more numerically stable (it results in Inf less often):

## extreme value CDF.
F <- function(z) 1 - exp(-exp(z))
max.pred <- 5
min.pred <- -15
curve(F, min.pred, max.pred)
s <- 1     # scale parameter
y.lo <- -5 # finite lower limit of the censoring interval
library(data.table)
pred.dt <- data.table(pred=seq(min.pred, max.pred, l=101))
pred.dt[, z.lo := (y.lo-pred)/s]
pred.dt[, p.lo := F(z.lo)]
## Both costs equal exp(z.lo) analytically, but naive.cost computes
## 1-p.lo = 1-(1-exp(-exp(z.lo))), which rounds to zero (giving Inf cost)
## much earlier than the reformulation via e.lo below.
pred.dt[, e.lo := 1/(exp(exp(z.lo))-1)]
pred.dt[, my.cost := log(1+e.lo)-log(e.lo)]
pred.dt[, naive.cost := -log(1-p.lo)]
plot(my.cost ~ pred, pred.dt)
pred.dt[, lines(pred, naive.cost, col="red")]

@avinashbarnwal (Owner Author) commented:

Hi Prof. @tdhock,

Could you let me know what kind of censoring the above formula handles? It looks like right censoring.

Do we have different stabilizing techniques for different kinds of censoring?

Also, what are we finally using for y in the 3rd distribution: extreme or Weibull?

@avinashbarnwal (Owner Author) commented:

Hi Prof. @tdhock and @hcho3,

I have added a comment to f52c967#r34598175.

@tdhock commented Aug 9, 2019

In my opinion the y values (at least internally) should be on the real-valued (potentially negative) scale for all distributions, so the 3rd distribution should be "extreme".

That being said, if you run into issues with extreme, I think it would be more useful to just have a working prototype for the other two distributions.

@avinashbarnwal (Owner Author) commented:

Hi Prof. @tdhock and @hcho3,

I have kept "extreme" as the 3rd distribution. I have changed the conditions that detect the censoring type as follows:

  1. Uncensored: before -> y_lower == y_higher; after -> y_lower == y_higher
  2. Left: before -> std::isinf(y_lower); after -> std::abs(y_lower) <= 1e-12f
  3. Right: before -> std::isinf(y_higher); after -> std::isinf(y_higher)
  4. Interval: before -> !std::isinf(y_lower) && !std::isinf(y_higher); after -> std::abs(y_lower) > 1e-12f && !std::isinf(y_higher)

Please let me know if you think this is correct.

The rest of the code is the same, and I haven't made any changes to the "Extreme" distribution.

@tdhock commented Aug 9, 2019

I'm not sure that is correct.

y_lower and y_higher should be log survival times (possibly negative).

If y_lower == y_higher then the output is un-censored (use the pdf); otherwise it is censored:

  • if y_lower == -INFINITY then p_lower = 0, else p_lower = cdf(z_lower)
  • if y_higher == INFINITY then p_higher = 1, else p_higher = cdf(z_higher)
  • loss = -log(p_higher - p_lower)

In particular I would suggest using INFINITY, which is defined in math.h.

I don't think you should use the 1e-12 constant, which is totally arbitrary.
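In R pseudocode, the whole loss could look like this (a sketch with the sigma scaling included; cdf/pdf would be e.g. pnorm/dnorm for the normal model):

aft_nll <- function(y_lower, y_higher, pred, sigma, cdf, pdf) {
  if (y_lower == y_higher) {   # un-censored: use the pdf
    return(-log(pdf((y_lower - pred) / sigma) / sigma))
  }
  p_lower <- if (y_lower == -Inf) 0 else cdf((y_lower - pred) / sigma)
  p_higher <- if (y_higher == Inf) 1 else cdf((y_higher - pred) / sigma)
  -log(p_higher - p_lower)     # censored: use the cdf difference
}
aft_nll(log(5), log(10), pred=1, sigma=1, cdf=pnorm, pdf=dnorm)   # interval-censored example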

@avinashbarnwal (Owner Author) commented Aug 9, 2019

Hi Prof. @tdhock,

We have been using y as un-logged values, which is why the comparison with 0 was done. I will change y_lower and y_higher to logged format internally.

I saw that your datasets have log-transformed responses, so for testing I took the anti-log of the responses.

Should we expect users to enter survival times or log-transformed responses? I am not sure about this.

@tdhock commented Aug 9, 2019

In survival::survreg the user can specify either kind of output. For me the logged outputs make more sense, because that is what we use internally anyway.

@avinashbarnwal (Owner Author) commented:

Thanks. I will change y_lower and y_higher to log-transformed values, and update the conditions that detect the censoring types.

@avinashbarnwal (Owner Author) commented Aug 11, 2019

Hi Prof. @tdhock and @hcho3,

I have pushed the following changes:

  • Code refactoring for the left/right/interval loss, gradient, and hessian.
  • Removed x, mu, and sd, and added z.
  • The conditions that detect censoring are now the same as before.
  • We now pass log(y_lower) and log(y_higher) internally for all the calculations.

I have tested here: https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/test/data/neuroblastoma-data-master/src/notebook/002_xgboost_test.ipynb. We have similar results as before.

@tdhock commented Aug 12, 2019

Good. Is there a PR where we can view a summary of all the changes you want to merge (files changed)?

@avinashbarnwal (Owner Author) commented:

Hi Prof. @tdhock and @hcho3,

Please find the pull request - dmlc#4763.

@avinashbarnwal (Owner Author) commented:

Hi Prof. @tdhock and @hcho3,

I am trying to write the R vignette and I am getting this error:

Error in setinfo.xgb.DMatrix(dtrain, "label_lower_bound", y_lower) :
setinfo: unknown info name label_lower_bound

Please find the R code: https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/doc.R

Any help would be great.

@tdhock commented Aug 19, 2019

Sounds like an xgboost error; @hcho3 should know.

@tdhock commented Aug 19, 2019

Is it normal in xgboost to specify the outputs/labels as attributes of the input/feature matrix? At least in R that is pretty confusing. The usual ways in R are:

  1. via a formula, e.g. lm(y ~ x1 + x2, data=some.data.frame)
  2. via separate input/output matrix objects, e.g. glmnet(X, y), where X is the feature/input matrix and y is the output/label vector.

@hcho3 (Collaborator) commented Aug 19, 2019

@tdhock The mistake was due to a badly chosen sigma parameter. When I set sigma to 10 or 20, the metrics are no longer infinity. In general, we should choose a reasonable value for sigma depending on the range of the log survival times. This particular dataset has log survival times ranging from 8 to 10, so sigma=1 is not reasonable.

@hcho3 (Collaborator) commented Aug 19, 2019

@tdhock I do not think the XGBoost package supports the use of a formula. Currently, users are asked to create an object of type xgb.DMatrix with data and labels supplied.
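For example (a sketch; X, y_lower, and y_upper are placeholders, and the label_lower_bound/label_upper_bound fields assume the new AFT interface from the PR rather than released xgboost):

library(xgboost)
dtrain <- xgb.DMatrix(data = X)                 # X: numeric feature matrix
setinfo(dtrain, "label_lower_bound", y_lower)   # log survival time bounds
setinfo(dtrain, "label_upper_bound", y_upper)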

@avinashbarnwal (Owner Author) commented:

Hi Prof. @tdhock,

@hcho3 helped resolve the issue. We have a working vignette here: https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/doc.R. I am structuring it further.

@tdhock commented Aug 19, 2019

Red flag!

Why doesn't sigma=1 work? I think it should be fine for the data sets I have provided.

sigma=10 seems too big to me -- there is almost no flat region for the interval-censored outputs:

s <- 1
curve(-log(pnorm(10, x, s) - pnorm(5, x, s)), 0, 20)

@avinashbarnwal (Owner Author) commented Aug 19, 2019

Hi Prof. @tdhock,

I think we should look at the standard deviation of log(survival times) and use a similar sigma as a starting point. We tested with a very small sigma, and it gave very small predicted values despite very high log(survival times); for example, log(survival times) ~ 10 while y.hat was 0.03. I don't think that is correct.

Please let me know your thoughts.

@hcho3 (Collaborator) commented Aug 19, 2019

@tdhock @avinashbarnwal Maybe we need to set a bigger value for base_score, the initial guess for y.hat. By default it is set to 0.5, which may be too low. Let's try running XGBoost with sigma=1 and base_score=8.

@tdhock commented Aug 20, 2019

More generally, you could set it to the mean of all the finite limits?
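A sketch of that initialization (y.lower/y.upper here are the log survival time limits; names are illustrative):

finite.limits <- c(y.lower[is.finite(y.lower)], y.upper[is.finite(y.upper)])
base_score <- mean(finite.limits)   # initial guess for the prediction eta
sigma <- sd(finite.limits)          # starting scale for the error distribution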

@avinashbarnwal (Owner Author) commented:

Hi Prof. @tdhock and @hcho3,

I have set base_score to the average and sigma to the standard deviation of the log(survival times). We now have far fewer infinite observations.

Notebook: https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/test/data/neuroblastoma-data-master/src/notebook/002_xgboost_test.ipynb
