Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A better model summary #214

Open
noamross opened this issue Mar 7, 2023 · 4 comments
Open

A better model summary #214

noamross opened this issue Mar 7, 2023 · 4 comments

Comments

@noamross
Copy link

noamross commented Mar 7, 2023

I think gratia would be a nice home for an alternative GAM model summary table that would be a reasonable first pass for including in a paper or at least a supplement for common models, and be amenable to modification. Things that I think would be good to have:

  • A unified table for parametric and smooth terms, with NA where applicable
  • A nice print function
  • Columns for:
    • Term name
    • Knots
    • Basis type (with possible expanded full names)
    • P-value w/stars
    • EDF
  • Optional columns of min and max effect size in data range, with CIs and associated x (covariate) values associated with each

I'm doing this for myself, and would be happy to start a PR based on it.

@gavinsimpson
Copy link
Owner

There's currently an overview() that does some of this:

library("gratia")
library("mgcv")

eg1 <- data_sim("eg1", n = 1000,  dist = "normal", scale = 2, seed = 1)
m <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), 
         data = eg1, 
         method = "REML")

overview(m)
> overview(m)

Generalized Additive Model with 4 terms

  term  type    edf statistic p.value
  <chr> <chr> <dbl>     <dbl> <chr>  
1 s(x0) TPRS   4.23    16.1   < 0.001
2 s(x1) TPRS   3.50   169.    < 0.001
3 s(x2) TPRS   8.61   201.    < 0.001
4 s(x3) TPRS   1.00     0.599 0.43875

It currently doesn't return estimates for parametric terms:

eg4 <- data_sim("eg4", n = 400,  dist = "normal", scale = 2, seed = 1)
m_by <- gam(y ~ fac + s(x2, by = fac) + s(x0), 
            data = su_eg4, method = "REML")
overview(m_by)
> overview(m_by)

Generalized Additive Model with 5 terms

  term       type         edf statistic p.value  
  <chr>      <chr>      <dbl>     <dbl> <chr>    
1 fac        parametric  2     150.     < 0.001  
2 s(x2):fac1 TPRS        2.85    4.62   0.0021453
3 s(x2):fac2 TPRS        2.95   30.9    < 0.001  
4 s(x2):fac3 TPRS        6.49   30.9    < 0.001  
5 s(x0)      TPRS        1.00    0.0168 0.8996412

It wouldn't be hard to have "parametric" be "factor" or "continuous" say, but I hadn't thought about returning the actual estimates (and the constant term is missing here).

What do you mean by Knots? The number of basis functions used? That would be easy to add.

I'm not a big fan of stars on p values, so I'd resist including them.

Expanding on the basis names (type) could be done - I was trying to avoid having the output be too wide, but my short-names are not very consistent (some aren't very short for some bases).

The above output is a plain tibble with some custom formatting rather than a specific print method: https://github.com/gavinsimpson/gratia/blob/HEAD/R/overview.R

Is this (sort of) what you had in mind?

@noamross
Copy link
Author

noamross commented Mar 9, 2023

I did not know about overview()! Yes, this is a way towards what I'm looking for. I do think anything that makes it easier for people to see and report effect sizes is important - they're under-reported even in linear models and really rare/hard for people to extract with GAMs. Yes, by Knots I mean k - I think that it's useful to have those alongside edf for interpretation. I'm not big on stars either. In theory I hoped that printing them would encourage people to use the output, but if they want them I suppose they can add it themselves.

@gavinsimpson
Copy link
Owner

I've added a few bits of this to overview(). As of 0.8.1.13 the output is:

> overview(m_by, stars = TRUE)                                                           

Generalized Additive Model with 5 terms

  term       type           k   edf statistic p.value   stars    
  <chr>      <chr>      <dbl> <dbl>     <dbl> <chr>     <noquote>
1 fac        parametric    NA  2     150.     < 0.001   ***      
2 s(x2):fac1 TPRS           9  2.85    4.62   0.0021453 **       
3 s(x2):fac2 TPRS           9  2.95   30.9    < 0.001   ***      
4 s(x2):fac3 TPRS           9  6.49   30.9    < 0.001   ***      
5 s(x0)      TPRS           9  1.00    0.0168 0.8996412

but stars is FALSE by default.

I was thinking about effect sizes, and while it's not too onerous to do this for univariate smooths (evaluating the smooth at 100 evenly spaced values should be fine for most cases), it gets slower when we're doing this for multidimensional smooths unless we keep the number of evaluation points low-ish. I haven't looked into where things are slowing down and an obvious place to speed things up would be to run each smooth in a separate thread/process with furrr and future, but I haven't implemented any of the effect-size bits yet because I want to think a little more about them.

@noamross
Copy link
Author

noamross commented Mar 11, 2023

For the 2 papers I'm doing this for right now, we decided to have min/max in the summary calculated only from the fitted values on the data the model was trained on, so there's no additional predictions to make and these min/max values don't interpolate or extrapolate at all. This is at least predictable. For 2D+ one needs to think about things like too.far anyway and I think that's too case-base-case to get into. I think cases where the 1D min/max is well outside the fitted values are edge cases that would need to be diagnosed by looking at partial effects anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants