
Performance for logistic regression #38

Closed
DominiqueMakowski opened this issue Apr 16, 2019 · 28 comments

@DominiqueMakowski
Member

Related to #1

they say (on wikipedia 😀):

"The Hosmer-Lemeshow test is for overall calibration error, not for any particular lack of fit such as quadratic effects. It does not properly take overfitting into account, is arbitrary to choice of bins and method of computing quantiles, and often has power that is too low."

"For these reasons the Hosmer-Lemeshow test is no longer recommended. Hosmer et al have a better one d.f. omnibus test of fit, implemented in the R rms package residuals.lrm function."

Other alternatives have been developed to address the limitations of the Hosmer-Lemeshow test. These include the Osius-Rojek test and the Stukel test, available in the R script AllGOFTests.R: www.chrisbilder.com/categorical/Chapter5/AllGOFTests.R

I don't really have any opinion, never used any of those...
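(For reference, the one-d.f. omnibus test mentioned above appears to be available via rms; an untested sketch, assuming a simple binomial model on mtcars:)

library(rms)

m <- lrm(vs ~ mpg, data = mtcars, x = TRUE, y = TRUE)
residuals(m, type = "gof")  # le Cessie-van Houwelingen global goodness-of-fit test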

@strengejacke
Member

strengejacke commented Apr 16, 2019

Ok, maybe we can implement some of those tests, though I hesitate to include too many tests specifically for logistic regression, because the package could become "biased" towards this regression type.

But maybe that's not a real concern. Since these tests are rarely available elsewhere in R, I think it's good to have them in performance.

@DominiqueMakowski
Member Author

I agree, this could add real value to the package.

And I think it fits quite nicely with the scope of the package as a general toolbox, offering users an easy way to discover, try, select and use the tools they want ☺️

@strengejacke
Member

Should we find a common prefix for the goodness of fit functions? Like check_*() etc.?

@DominiqueMakowski
Member Author

Do you consider r2 and all of the other indices as goodness-of-fit metrics? Initially, we thought about prefixing all "performance" functions (in the very broad sense, including goodness of fit) with performance_*, but it was a bit verbose.

However, if the number of indices we provide increases, we could indeed re-prefix the functions and provide "shortcuts" for the most popular ones (like r2).

@strengejacke
Member

No, I was not thinking of R2, but rather of Hosmer-Lemeshow, chisq-gof, maybe error rate - those GOF tests that do some significance testing. These might get a prefix like gof_ or so.

@DominiqueMakowski
Member Author

oh right (although r2 for lm also has some significance testing and is considered a goodness-of-fit index 😁)

tbh I don't find gof_ very intuitive (looks like "game of frones" but I am too biased haha).

I am not sure of the alternatives tho, fitquality_, goodfit_...

@strengejacke
Member

strengejacke commented Apr 17, 2019

"game of frones"

:-D

but I am too biased

Indeed, I haven't even seen one episode yet, though I'm at least a bit addicted to fantasy.

@DominiqueMakowski
Member Author

😮

@DominiqueMakowski
Member Author

^^ But to come back to the main matter: how would you separate indices like r2 from the other goodness-of-fit indices that would get a prefix? Because they have some level of "testing" (with p-values and all)?

@strengejacke
Member

strengejacke commented Apr 17, 2019

Because they have some level of "testing" (with p-values and all)?

Yes, it's not an "index" (like rmse or r2), but a significance test (except error rate, actually) - at least that could be a rough way to draw the line.

@DominiqueMakowski
Member Author

DominiqueMakowski commented Apr 17, 2019

Right, I see your point!

Then maybe something like test_goodfit_ (I know you don't like double prefixes, but we could imagine different kinds of test_* in the future). Or test_performance() (which would fit with the model_performance() and compare_performance() that we have).

@pdwaggoner
Member

Hi - sorry to be late to the party here, but I have used logistic regressions quite a bit, and I have a standard battery of goodness of fit tests I run for these models. For example:

  1. heatmapFit: we can plot a heat map fit plot to assess goodness of fit. When reading a heat map plot, we are comparing model predictions with smoothed empirical predictions. The 45-degree line marks a perfect fit; any deviation from that line suggests a loss in goodness of fit. The "p-value" legend shows whether any deviation is statistically significant (dark color means statistical significance).

  2. plot.roc from pROC: ROC curves, which plot the true positive rate (sensitivity) against the false positive rate (1 - specificity). A model with excellent fit produces a curve that hugs the top-left corner, with an area under the curve that is large (near 1). A model with poor fit produces a curve close to the 45-degree diagonal line (correct and false positive calls are about equally likely), with an area under the curve near 0.5. (A quick sketch follows at the end of this comment.)

And then of course to compare between models (i.e., model selection):

  1. Likelihood ratio tests

  2. Extracting/comparing AIC (e.g., using extractAIC)

  3. ePCP from OOmisc: comparing expected proportion correctly predicted between models. I have some great code for clean histograms that we could standardize for this one.

Let me know your thoughts here or if I am way off base.
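For point 2, a quick sketch of what I mean with pROC (the model and data are just for illustration):

library(pROC)

m <- glm(vs ~ mpg, data = mtcars, family = binomial)
roc_obj <- roc(mtcars$vs, predict(m, type = "response"))

plot(roc_obj)  # a curve hugging the top-left corner indicates good fit
auc(roc_obj)   # area under the curve: near 1 is good, near 0.5 is poor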

@DominiqueMakowski
Member Author

Or test_performance() (which would fit with the model_performance() and compare_performance() that we have)

Just to clarify what I meant: we could have the individual tests available directly by their name, and all of them "wrapped" together within a test_performance() function that would perform all available tests for the given class (rough sketch below).

@pdwaggoner What does ePCP do? Although I think the current aim is to see how we can implement (i.e., name and organise) a few tests that are hardly available elsewhere - it is not yet about building the whole performance-check pipeline :)
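Very rough sketch of the wrapper idea (everything here is hypothetical, not an existing API): the individual tests keep their own names, and test_performance() just collects whichever apply to the model class.

test_performance <- function(model) {
  tests <- list()
  if (inherits(model, "glm") && family(model)$family == "binomial") {
    # e.g. tests$hosmer_lemeshow <- test_hosmer_lemeshow(model)  # hypothetical individual test
    # e.g. tests$stukel <- test_stukel(model)                    # hypothetical individual test
  }
  tests
}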

@pdwaggoner
Member

That makes sense. ePCP calculates the expected proportion of correct prediction and the related (1-α)% (approximate) confidence intervals for a given set of predicted success probabilities and the observed (binary) values (from the documentation). E.g.:

library(OOmisc)

fit <- runif(100)         # simulated predicted probabilities
y <- rbinom(100, 1, 0.5)  # simulated binary outcomes

ePCP(fit, y, alpha = 0.05)

@strengejacke
Member

ePCP() looks similar to sjstats::pred_accuracy(), though the difference is that sjstats::pred_accuracy() actually uses cross-validation to test how useful the model is for prediction (rough sketch of the idea below).
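For illustration, a minimal hand-rolled sketch of that cross-validation idea (not the actual sjstats implementation; names are placeholders): out-of-fold predictions are correlated with the observed response.

cv_accuracy_sketch <- function(model, k = 5) {
  data <- stats::model.frame(model)
  folds <- sample(rep_len(seq_len(k), nrow(data)))
  preds <- rep(NA_real_, nrow(data))

  for (i in seq_len(k)) {
    # refit on the training folds, predict on the held-out fold
    fit <- stats::update(model, data = data[folds != i, , drop = FALSE])
    preds[folds == i] <- stats::predict(fit, newdata = data[folds == i, , drop = FALSE], type = "response")
  }

  stats::cor(preds, stats::model.response(data), use = "pairwise.complete.obs")
}

# e.g. cv_accuracy_sketch(glm(vs ~ mpg, data = mtcars, family = binomial))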

@pdwaggoner
Member

Oh brilliant - cross-validation is much more useful, so sjstats::pred_accuracy() would likely be a better route here. Another general fit/validation method that would be a core, useful addition is ROC curves. They're used frequently and are super intuitive. Could be valuable to at least consider for future iterations.

@DominiqueMakowski
Member Author

@pdwaggoner you've just been sjstats'ed 😅😅 (the phenomenon of coming up with something and realising sjstats already does it)

@pdwaggoner
Member

Ha! It feels... great (?) ;)

@strengejacke
Member

I have drafted a function predictive_accuracy(), which basically works like sjstats::pred_accuracy(), except that it does not have any other package dependencies (https://github.com/easystats/performance/blob/master/R/predictive_accuracy.R).

@pdwaggoner @DominiqueMakowski have either of you worked with the boot package before? I haven't, and I'm not sure how to use it here.

In sjstats, I used my own bootstrapping method, the sjstats::bootstrap() function. How could this code be rewritten using the boot package and without other packages (see link above)?

      cv <- data %>%
        bootstrap(n) %>%
        dplyr::mutate(
          models = purrr::map(.data$strap, ~ stats::lm(formula, data = .x)),
          predictions = purrr::map(.data$models, ~ stats::predict(.x, type = "response")),
          response = purrr::map(.data$models, ~ resp_val(.x)),
          accuracy = purrr::map2_dbl(.data$predictions, .data$response, ~ stats::cor(.x, .y, use = "pairwise.complete.obs"))
        )

Feel free to open a PR or commit to master...
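For what it's worth, maybe something along these lines would work with boot::boot() and no other dependencies (completely untested sketch; boot_accuracy and its arguments are just placeholders):

library(boot)

boot_accuracy <- function(model, n = 1000) {
  data <- stats::model.frame(model)

  # statistic for boot(): refit the model on the resampled rows and correlate
  # the predicted values with the observed response
  accuracy_stat <- function(d, indices) {
    refit <- stats::update(model, data = d[indices, , drop = FALSE])
    stats::cor(
      stats::predict(refit, type = "response"),
      stats::model.response(stats::model.frame(refit)),
      use = "pairwise.complete.obs"
    )
  }

  boot::boot(data = data, statistic = accuracy_stat, R = n)$t[, 1]
}

# e.g. mean(boot_accuracy(lm(mpg ~ wt + hp, data = mtcars), n = 200))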

@DominiqueMakowski
Member Author

I also find the native boot package quite unintuitive. Maybe it would be worth reimplementing a bootstrap method, as it would also be used in parameters (see current implementation here).

@strengejacke
Member

heatmapFit: we can plot a heat map fit plot to assess goodness of fit.

ePCP from OOmisc: comparing expected proportion correctly predicted between models. I have some great code for clean histograms that we could standardize for this one.

@pdwaggoner Do you have examples for these two? Although you mentioned that cross-validation might be the better approach compared to ePCP, we could still think about visualization, so I'm interested in the plots you mentioned here.

@pdwaggoner
Member

Hi @strengejacke - apologies for my delay here. Yeah, I have an MWE for each. First, for the ePCP (expected proportion correctly predicted), we are looking for a higher value, bounded between 0 and 1, where 1 means the model perfectly predicted all observed outcomes. Here is some sample code using mtcars:

library(OOmisc) # for ePCP
library(ggplot2)

logitmod <- glm(vs ~ mpg, data = mtcars, family = binomial); summary(logitmod)
logitmod2 <- glm(vs ~ mpg + cyl, data = mtcars, family = binomial); summary(logitmod2)

y <- mtcars$vs
pred1 <- predict(logitmod, type="response")
pred2 <- predict(logitmod2, type="response")

# ePCP, expected proportion of correct prediction
epcp1 <- ePCP(pred1, y, alpha = 0.05) # define observed values and obtain predicted values
epcp2 <- ePCP(pred2, y, alpha = 0.05)

epcpdata <- data.frame(rbind(epcp1, epcp2))
epcpdata$model <- c(1,2)
epcpdata$count <- factor(c(1, 2), labels = c("Model1", "Model2"))

# Now the plot
ggplot(epcpdata, aes(x = model, y = ePCP, colour = count)) +
  geom_bar(position = position_dodge(), stat = "identity", fill = "darkgray") +
  geom_errorbar(aes(ymin = lower, ymax = upper),
                width = .2,
                position = position_dodge(.9)) +
  labs(x = "Model Specification",
       y = "Expected Proportion of Correct Prediction",
       title = "Comparing ePCP between Model 1 and Model 2",
       colour = "Model Specification") +
  theme_bw()

And next, for the heatmapFit, which is a little less flexible, the goal is to compare model predictions with smoothed empirical predictions. The 45-degree line marks a perfect fit; any deviation from that line suggests a loss in goodness of fit. The "p-value" legend shows whether any deviation is statistically significant (dark color means statistical significance). Using the same mtcars models as an example, it would look something like:

## heatmapfit
library(heatmapFit) # for heat maps

# quick rescale to bound predictions between 0 and 1
rescale01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

pred1 <- rescale01(pred1)
pred2 <- rescale01(pred2)

heatmap.fit(y, pred1, reps = 1000, legend = TRUE)
heatmap.fit(y, pred2, reps = 1000, legend = TRUE)

** You can copy and paste this code to see the figures. Pretty clean, efficient renderings of model fit.

** And of course, ROC curves are always great, simple model fit checks. Let me know thoughts if you have them. Hope this helps a bit!

@strengejacke
Member

strengejacke commented Apr 28, 2019

The ePCP function looks nice and has the beauty of a clear interpretation. I implemented a function that calculates the ePCP (05243ee), currently named correct_predictions(). Would epcp() be a better name, or something else? What would you say?
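For reference, the core of the calculation is just the average predicted probability assigned to the observed outcome; a minimal sketch (names purely illustrative, not the actual implementation):

epcp_sketch <- function(model) {
  y <- model$y                                  # observed 0/1 response
  p <- stats::predict(model, type = "response") # predicted probabilities
  mean(ifelse(y == 1, p, 1 - p))
}

# epcp_sketch(glm(vs ~ mpg, data = mtcars, family = binomial))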

@strengejacke
Member

library(performance)

m1 <- glm(vs ~ mpg, data = mtcars, family = binomial)
m2 <- glm(vs ~ mpg + cyl, data = mtcars, family = binomial)
m3 <- glm(vs ~ mpg + cyl * hp, data = mtcars, family = binomial)

compare_performance(m1, m2, m3)
#>   name class      AIC      BIC   R2_Tjur      RMSE   LOGLOSS      EPCP
#> 1   m1   glm 23.49109 27.88830 0.6666308 0.7393217 0.2732983 0.8359198
#> 2   m2   glm 24.84308 32.17176 0.7063532 0.6810625 0.2319231 0.8554707
#> 3   m3   glm 29.53334 32.46481 0.4738108 0.8932618 0.3989584 0.7410163

Created on 2019-04-28 by the reprex package (v0.2.1)

@strengejacke
Member

btw, the referenced paper (https://www.cambridge.org/core/services/aop-cambridge-core/content/view/92B052AADD9756C8BCC00527749E029D/S1047198700008524a.pdf/postestimation_uncertainty_in_limited_dependent_variable_models.pdf) also shows a way to implement this for multinomial models; we should keep this in mind... I don't fully understand yet how to do it, but it doesn't look that complicated.

@DominiqueMakowski
Member Author

As you might expect, I prefer correct_predictions over epcp ☺️

I wonder if this could fit into a larger check_predictions() (or check_predicted()) function that would check predicted responses against the observed response. I reckon it could also work for a continuous variable, by returning a correlation or some other very simple index (see the sketch below).
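Quick sketch of that idea (function name and details are hypothetical): binary outcomes could return the ePCP, continuous outcomes a simple predicted-vs-observed correlation.

check_predicted_sketch <- function(model) {
  obs <- stats::model.response(stats::model.frame(model))
  pred <- stats::predict(model, type = "response")

  if (all(obs %in% c(0, 1))) {
    mean(ifelse(obs == 1, pred, 1 - pred))  # ePCP for binary responses
  } else {
    stats::cor(pred, obs, use = "pairwise.complete.obs")
  }
}

# check_predicted_sketch(lm(mpg ~ wt, data = mtcars))
# check_predicted_sketch(glm(vs ~ mpg, data = mtcars, family = binomial))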

@pdwaggoner
Member

I only prefer epcp because that's how I learned it, and that's how it's derived (see the original measurement paper, specifically starting at the bottom of p. 88: https://www.jstor.org/stable/pdf/25791597.pdf). However, in line with the easystats mission, I think correct_predictions is more consistent and would be more intuitive for the average user. So I am perfectly happy to go with correct_predictions. And it looks really good! I will take a closer look after sending this. But glad you like the metric - as you noted, it's clean and crisp and has the benefit of being quite straightforward to interpret. Great work!

@strengejacke
Member

I think this can be closed. I have opened separate issues for the two remaining features.
