
create proposal for a add_transformation function #271

Merged · 15 commits into master from add_scale_fct · Apr 11, 2023

Conversation

nikosbosse
Contributor

I just implemented a first proposal for #270.

I think one question is whether we want to call the extra column 'scale' or something else. 'transformation'? I like the word scale, but also "log scale" is potentially ambiguous.

I haven't done any checking here (which one could do), but I thought it would be a bit of an overkill since it's an extremely simple function and running e.g. check_forecasts() can take a while.

@seabbs (Contributor) left a comment

This looks nice. I was a little confused by the workflow you were proposing, so I think more docs for that (though if it's as I imagine, it seems nice enough). We also just need more docs in general, and adding some default options with verbose messaging of what they mean seems like it would be very useful.

How do you imagine this function being used in serial (i.e. when I want to apply a series of transformations)? At the moment I think it will just break, but we could easily code around this.

nikosbosse commented Feb 28, 2023

Ah yes, I think your suggestion probably makes more sense than mine if you have more than one transformation. The workflow I had in mind was something like

forecasts |>
  add_transformation(option = "log") |>
  score() |>
  summarise_scores(by = c("model", "scale"))

with results something like

model   wis   scale
A       2.0   natural
A       0.1   log
B       3.0   natural
B       0.3   log

So it would automatically just add the log scaled data to your data.frame and add a separate column. With your alternative suggestion the equivalent workflow would be something like

forecasts |>
  mutate(scale = "natural") |>
  rbind(
    scale_forecasts(forecasts, option = "log") |>
      mutate(scale = "log")
  ) |>
  score() |>
  summarise_scores(by = c("model", "scale"))

(the additional suggestion in your workflow to enable data |> check_forecasts() |> score() would require an additional change to check_forecasts() that we probably should deal with separately)

seabbs commented Mar 1, 2023

I like your workflow as it's certainly less clunky, but yes, I worry about what happens when you want multiple transformations. Perhaps the solution is to add some logic to add_transformation so that it looks for a scale variable: if it finds one, it filters for natural, applies everything there, and appends the result to the end; otherwise it applies the current logic?
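For illustration, the scale-column logic described here could look roughly like the following sketch — a hypothetical add_transformation() operating on a plain data frame with toy columns, not the package's actual implementation:

```r
# Hypothetical sketch of the scale-column logic discussed above (base R).
# If a "scale" column exists, transform only the "natural" rows and append;
# otherwise label the input "natural" first, then append the transformed copy.
add_transformation <- function(data, fun = log, label = "log") {
  if (!"scale" %in% names(data)) {
    data$scale <- "natural"
  }
  natural <- data[data$scale == "natural", , drop = FALSE]
  transformed <- natural
  transformed$true_value <- fun(transformed$true_value)
  transformed$prediction <- fun(transformed$prediction)
  transformed$scale <- label
  rbind(data, transformed)
}

forecasts <- data.frame(
  model = c("A", "B"),
  true_value = c(2, 3),
  prediction = c(2.5, 2.8)
)

out <- forecasts |>
  add_transformation(fun = function(x) log(x + 1), label = "log") |>
  add_transformation(fun = sqrt, label = "sqrt")
```

Calling it a second time only transforms the rows still labelled "natural", so transformations stack without double-transforming anything.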

The alternative would be making it callable only once and enabling passing a list of transformations, but I am not a big fan of that as it hides quite a lot of code internally for small gain.

I hadn't realised check_forecasts wasn't pipeable - I guess it doesn't totally need to be, but it would be nice if it were. The solution is likely some kind of S3 class, I would guess.
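One way an S3 class could make check_forecasts() pipeable, as speculated here, is for it to return the (validated) data with a class attached rather than only a check report. A rough base-R sketch under that assumption — toy column names and a toy absolute-error "score", not how scoringutils actually implements it:

```r
# Speculative sketch: check_forecasts() validates, tags the data with a
# class, and returns it so downstream functions can receive it in a pipe.
check_forecasts <- function(data) {
  stopifnot(is.data.frame(data),
            all(c("true_value", "prediction") %in% names(data)))
  class(data) <- c("checked_forecasts", class(data))
  data  # returned (not just a report), so the pipe can continue
}

score <- function(data) {
  # re-check automatically if the user skipped check_forecasts()
  if (!inherits(data, "checked_forecasts")) data <- check_forecasts(data)
  data$ae <- abs(data$true_value - data$prediction)  # toy "score"
  data
}

forecasts <- data.frame(true_value = c(2, 3), prediction = c(2.5, 2.8))
scores <- forecasts |> check_forecasts() |> score()
```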

@nikosbosse

Trying to summarise the discussion, we want to:

  • implement the version with add_transformation(option = "log")
    • have a way to do that multiple times, checking whether there is already a "scale" column
  • have your proposed function that transforms without adding as well?
  • have default options?
    • a problem with that one, e.g. for the log, would be that one would need to pass in the offset as an extra argument
    • and for population that would require a very specific column with a name that's appropriate in epi, but maybe not in other instances
  • have the possibility to pass in a custom function? I think that would be nice
  • have both options, i.e. some prespecified choices + custom functions?

seabbs commented Mar 1, 2023

I think we:

implement the version with add_transformation(option = "log")
have a way to do that multiple times, checking whether there is already a "scale" column

Want this as its nice and clear for one but feel we need to support multiple.

have your proposed function that transforms without adding as well?

No I don't think we need this as well.

have default options?
problem with that one e.g. for the log would be that one would need to pass in the offset as an extra argument
and for population that would require a very specific column with a name that's appropriate in epi, but maybe not in other instances

I think we want defaults as lots of users won't know what to use. For the log we could just set the offset and for more advanced use expect users to specify manually (we can clearly document this).

For population adjustments, it's really just a normalisation using an offset, which is common everywhere. If we label it as that, and say in the docs that using it for population would mean X, it might work?

have the possibility to pass in a custom function? I think that would be nice

Agree, it definitely needs this, and we can just have this in the first instance (though as default options aren't that much work, I would suggest we do both).

sbfnk commented Mar 2, 2023

implement the version with add_transformation(option = "log")
have a way to do that multiple times, checking whether there is already a "scale" column

Are the transformations always targeting the natural values? Or might we want to apply a scaling to already-scaled values? The former makes more sense to me overall (as anything could be passed in as a custom function), but it might be a bit complicated having to communicate to the user that what the operation does depends on whether a scale column is present and contains the value "natural". Alternatively, the log+1 could be implemented as e.g.

forecasts |>
  transform(operation = sum,  offset = 1) |>
  transform(operation = log) |>
  score() |>
  summarise_scores(by = c("model"))

which would remove the need to name the transform and then filter out column values (I think this is your option (a), @nikosbosse). If implemented in an S3 class (as mentioned by @seabbs in #270 (comment)), then we could track what has been applied.
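This chained idea could be sketched as a tiny helper — named transform_values() here to avoid masking base::transform(), and using toy column names; both are assumptions, not the package's API:

```r
# Hypothetical sketch of the chained-transformation idea above: each call
# applies one operation to both true values and predictions, so steps
# (e.g. offset, then log) can be piped one after the other.
transform_values <- function(data, operation, ...) {
  data$true_value <- operation(data$true_value, ...)
  data$prediction <- operation(data$prediction, ...)
  data
}

forecasts <- data.frame(true_value = c(0, 3), prediction = c(1, 2))

transformed <- forecasts |>
  transform_values(operation = `+`, 1) |>  # the "offset = 1" step
  transform_values(operation = log)        # then the log step
```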

I think your suggestion might work better than this but would require careful signalling to the user as it could be confusing.

seabbs commented Mar 2, 2023

@sbfnk with your suggestion you would lose the ability to score natural and transformed forecasts at once, as in @nikosbosse's approach, and you wouldn't be able to stack transformations without additional code. On the plus side, the code is much simpler, the workflow for a single transform is cleaner, and you don't need to bother labelling things, which is a pain.

It is similar in spirit to my suggestion here: #271 (comment)

Having seen @nikosbosse's suggested use case, I do like the idea of scoring multiple transformations at once and having functionality to make this easier. Not entirely convinced it's worth the complexity trade-offs you highlight, though.

Default options: I really like the idea of default options (like those supplied in the scale_ ggplot2 functions) as they're a nice hand-holder for less experienced users and can easily be documented (i.e. in the examples). That said, it really isn't much code to add these, so perhaps just having examples and links to further reading is all that is needed.

I do like the use case for S3 here and in other parts of this package, but we have discussed this previously and decided not to go this way. I think it might be better to have a general discussion about whether scoringutils should have an S3 rewrite rather than introduce it piecemeal.

codecov bot commented Mar 5, 2023

Codecov Report

Merging #271 (7bf1bab) into master (66487aa) will decrease coverage by 0.40%.
The diff coverage is 78.04%.

❗ Current head 7bf1bab differs from pull request most recent head 4a72b1d. Consider uploading reports for the commit 4a72b1d to get more accurate results

@@            Coverage Diff             @@
##           master     #271      +/-   ##
==========================================
- Coverage   90.82%   90.42%   -0.40%     
==========================================
  Files          21       22       +1     
  Lines        1286     1327      +41     
==========================================
+ Hits         1168     1200      +32     
- Misses        118      127       +9     
Impacted Files Coverage Δ
R/convenience-functions.R 78.04% <78.04%> (ø)


@nikosbosse

So I updated the suggested code a bit:

  • For now I decided against dedicated default options. Mainly because option = "log" doesn't seem that much easier to me than fun = log. Open to being convinced otherwise.
  • For fun = log I added some "clever" code to check whether there are any zero values. If there are, then it automatically applies an offset and complains to the user with a warning. You can pass an argument offset = 1 or whatever offset you like to the function. I think the workflow is well documented and intuitive, but I imagine this to be somewhat controversial :D
  • It has an argument add = TRUE to toggle whether you want to transform the existing data, or whether you want to add the transformed data to your existing data. This also works multiple times, and there is an example for it.

What do you think?

Some workflow examples:

#' library(magrittr) #(pipe operator)
#' # replace negative values with zero
#' transformed <- example_quantile %>%
#'   transform_forecasts(fun = function(x) {pmax(0, x)}, add = FALSE)
#'
#' # add log transformed forecasts (produces a warning as some values are zero)
#' transform_forecasts(transformed)
#'
#' # specifying an offset manually for the log transformation removes the warning
#' transform_forecasts(transformed, offset = 1)
#'
#' # adding multiple transformations
#' transformed %>%
#'   transform_forecasts(offset = 1) %>%
#'   transform_forecasts(fun = sqrt, label = "sqrt") %>%
#'   score() %>%
#'   summarise_scores(by = c("model", "scale"))

sbfnk commented Mar 6, 2023

An alternative would be to create a small function log_shift or similar that removes the need for filtering through dot arguments, hiding default options, and checking whether a function is a specific primitive?

log_shift <- function(x, offset = 1, ...) {
  return(log(x + offset, ...))
}

This could then be the default argument for transform_forecasts if you really wanted a default argument.

it has an argument add = TRUE to toggle whether you want to transform the existing data, or whether you want to add the transformed data to your existing data. This also works multiple times and there is an example for it

I'd call it append or similar - add to me implies numerical addition as in e.g. magrittr::add

@nikosbosse

Maybe the best idea would be to just have an option offset = 0 (or shift = 0?) whose value is simply added to everything. It could then automatically apply that for the log (or not do it and just complain / alert the user, which I guess is the better option).
With this we wouldn't have to change the log function, which I think is good, but there would also be a simple argument that allows the user to apply an offset.

seabbs commented Mar 8, 2023

I still prefer @sbfnk's suggestion of a custom transformation function over giving special design consideration to the log transform in this function, but it's your call.

seabbs commented Mar 24, 2023

@nikosbosse any help needed to unstick this?

@nikosbosse

No real blockers, I was waiting on more feedback from Le Big Boss. I'll implement the log_shift() version :)

seabbs commented Mar 24, 2023

I was waiting on more feedback from Le Big Boss.

Really makes me feel empowered 😆

@nikosbosse

You had cast your vote :)
Implemented a new suggestion. Current draft exports a function log_shift() that allows the user to add an offset and also truncate negative values in one function call. Default offset is 0.

@seabbs (Contributor) left a comment

Looks good to me. Seems like a nice workflow. Just flagged that the docs are currently out of date.

Not for this PR, but a vignette on forecast transformation seems like a nice and useful idea.

@nikosbosse

@sbfnk @seabbs happy to merge?

@sbfnk (Contributor) left a comment

Nearly there. Just one more question.

log_shift <- function(
  x,
  offset = 0,
  truncate = FALSE,
Contributor

Do we really want this option here? I'd think it's better for the log function to throw an error and the user to pre-process the forecasts/data so that there are no negatives.

If sticking with it I would suggest to make this more self-explanatory, i.e. calling the option negative_to_zero or something like that (although given that we might be adding an offset anyway I'm not clear why we'd choose zero over anything else).

Contributor Author

I think the reason to have that truncation is that the alternative (at least staying in the same workflow) is a bit more clunky. You would have to use something like transform_forecasts(fun = function(x) pmax(0, x), append = FALSE, label = "natural")

See e.g. these examples:

#' example_quantile %>%
#'   transform_forecasts(fun = function(x) pmax(0, x), append = FALSE) %>%
#'   transform_forecasts(fun = sqrt, label = "sqrt") 
#' # adding multiple transformations
#' library(magrittr) # pipe operator
#' example_quantile %>%
#'   transform_forecasts(offset = 1, truncate = TRUE) %>%
#'   # manually truncate all negative values before applying sqrt
#'   transform_forecasts(fun = function(x) pmax(0, x), append = FALSE, label = "natural") %>%
#'   transform_forecasts(fun = sqrt, label = "sqrt") %>%
#'   score() %>%
#'   summarise_scores(by = c("model", "scale"))

(the part with the label = natural here is only because it's transforming the original while something else has already been computed, which maybe is a stupid thing to do?)

Contributor Author

Looking at it again I should definitely simplify the second example...

Contributor

I think the reason to have that truncation is that the alternative (at least staying in the same workflow) is a bit more clunky. You would have to use something like transform_forecasts(fun = function(x) pmax(0, x), append = FALSE, label = "natural")

Wouldn't it be more natural to do something like:

example_quantile %>%
  dplyr::filter(true_value >= 0) %>%
  transform_forecasts(fun = sqrt, label = "sqrt") 

or

example_quantile %>%
  dplyr::mutate(true_value = ifelse(true_value < 0, 0, true_value)) %>%
  transform_forecasts(fun = sqrt, label = "sqrt") 

i.e. fix the problem in the data rather than letting the transformation function take care of it?

@nikosbosse Mar 30, 2023

I don't have very strong feelings either way. The truncate option (now changed to negative_to_zero) is easier if you have negative values in both predictions and true values. So if you know what you're doing it's convenient (and also turned off by default). But agree that it feels maybe a bit odd?

@nikosbosse

@seabbs do you have feelings about the negative_to_zero argument and whether or not it should be kept?
Pro: it's a bit more convenient for users who know what they're doing (as they can replace forecasts and observations in one go)
Con: it's a bit cleaner to force users to make the changes manually

Current version has the examples replaced to make everything a bit easier to understand:

#' library(magrittr) # pipe operator
#'
#' # add log transformed forecasts (produces a warning as some values are zero)
#' # negative values need to be handled (here by replacing them with 0)
#' example_quantile %>%
#'   .[, true_value := ifelse(true_value < 0, 0, true_value)] %>%
#'   transform_forecasts(append = FALSE)
#'
#' # alternatively:
#' transform_forecasts(example_quantile, negative_to_zero = TRUE, append = FALSE)
#'
#' # specifying an offset manually for the log transformation removes the warning
#' example_quantile %>%
#'   .[, true_value := ifelse(true_value < 0, 0, true_value)] %>%
#'   transform_forecasts(offset = 1, append = FALSE)
#'
#' # truncating forecasts manually before sqrt
#' example_quantile %>%
#'   .[, true_value := ifelse(true_value < 0, 0, true_value)] %>%
#'   transform_forecasts(fun = sqrt, label = "sqrt")
#'
#' # alternatively, this achieves the same
#' example_quantile %>%
#'   transform_forecasts(fun = function(x) pmax(0, x), append = FALSE) %>%
#'   transform_forecasts(fun = sqrt, label = "sqrt")
#'
#' # adding multiple transformations
#' example_quantile %>%
#'   .[, true_value := ifelse(true_value < 0, 0, true_value)] %>%
#'   transform_forecasts(offset = 1) %>%
#'   transform_forecasts(fun = sqrt, label = "sqrt") %>%
#'   score() %>%
#'   summarise_scores(by = c("model", "scale"))

@nikosbosse

@Bisaloo do you have thoughts? :) democratic development

seabbs commented Apr 7, 2023

I think if we are keeping it, then it should be its own helper function. I also think this could be true for offsetting.

@nikosbosse

I think if we are keeping it, then it should be its own helper function. I also think this could be true for offsetting.

Yes definitely - it's currently an argument of the log_shift() helper function (maybe I should adapt the examples again to make that clear).

the transform_forecasts() function currently only has the following arguments:

transform_forecasts <- function(data,
                                fun = log_shift,
                                append = TRUE,
                                label = "log",
                                ...)

and log_shift() has

log_shift <- function(x,
                      offset = 0,
                      negative_to_zero = FALSE,
                      base = exp(1))
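
Given that signature, a minimal log_shift() consistent with the behaviour discussed in this thread (warn when values are zero after the offset, optionally truncate negatives) might look like the following sketch; the merged version ultimately dropped negative_to_zero, so this is illustrative only, not the package's actual code:

```r
# Illustrative sketch of log_shift() matching the signature above.
# Behaviour as discussed in this thread, not the merged implementation.
log_shift <- function(x, offset = 0, negative_to_zero = FALSE, base = exp(1)) {
  if (negative_to_zero) {
    x <- pmax(x, 0)  # truncate negative values before taking the log
  }
  if (any(x + offset == 0, na.rm = TRUE)) {
    warning("Some values are 0 after adding the offset; log() returns -Inf ",
            "for these. Consider using a positive offset.")
  }
  log(x + offset, base = base)
}

log_shift(c(1, 2, 4), base = 2)                          # -> 0 1 2
log_shift(c(-1, 0, 9), offset = 1, negative_to_zero = TRUE)
```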

seabbs commented Apr 7, 2023

  • it's currently an argument of the

Yes, so not an argument but a helper function on its own.

@nikosbosse

I'm not sure I understand. So you would have a separate helper function that's a wrapper for \(x) pmax(0,x)? And then one would call transform_forecasts() twice?

seabbs commented Apr 7, 2023

A helper independent of transform_forecasts() but 🤷🏻

@nikosbosse nikosbosse merged commit 4a72b1d into master Apr 11, 2023
@nikosbosse

I merged a slightly different version without a negative_to_zero argument in log_shift() now. Could still decide to introduce that argument later if we like.

@seabbs seabbs deleted the add_scale_fct branch April 11, 2023 09:42
3 participants