create proposal for an add_transformation function #271
Conversation
This looks nice. I was a little confused by the workflow you were proposing, so I think it needs more docs for that (though if it's as I imagine, it seems nice enough). It also just needs more docs in general, and adding some default options with verbose messaging explaining what they mean seems like it would be very useful.
How do you imagine this function being used in serial (i.e. when I want to apply a series of transformations)? At the moment I think it will just break, but that could easily be coded around.
Ah yes, I think your suggestion probably makes more sense than mine if you have more than one transformation. The workflow I had in mind was something like
with results something like
So it would automatically add the log-scaled data to your data.frame and add a separate column. With your alternative suggestion, the equivalent workflow would be something like
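The original code snippets did not survive in this transcript, but the mechanic being described can be sketched in plain base R (column names, the `scale` label values, and the offset of 1 are assumptions for illustration, not the package's actual implementation):

```r
# Sketch of "append the log-scaled data and label it": duplicate the rows,
# transform the copy, and mark each block with a new 'scale' column.
natural <- data.frame(true_value = c(1, 10, 100))
natural$scale <- "natural"

log_scaled <- natural
log_scaled$true_value <- log(log_scaled$true_value + 1)  # offset of 1 assumed
log_scaled$scale <- "log"

combined <- rbind(natural, log_scaled)
nrow(combined)  # 6 rows: both scales side by side, ready to score by 'scale'
```

Scoring the combined data and summarising by the `scale` column then gives natural and transformed scores in one pass.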
(the additional suggestion in your workflow to enable …
I like your workflow as it's certainly less clunky, but I do worry about what happens when you want multiple transformations. Perhaps the solution is to add some code logic to … The alternative would be making it callable once and enabling passing a list of transformations, but I am not a big fan of that as it hides quite a lot of code internally for small gain. I hadn't realised …
Trying to summarise the discussion:
I think we:
Want this, as it's nice and clear for one transformation, but feel we need to support multiple.
No, I don't think we need this as well.
I think we want defaults, as lots of users won't know what to use. For the log we could just set the offset and, for more advanced use, expect users to specify manually (we can clearly document this). Population adjustment is really just a normalisation using an offset, which is common everywhere; if we label it as that and say in the docs that it can be used for population adjustment, it might work?
Agree it definitely needs this, and we can have just this in the first instance (though as default options aren't that much work, I would suggest we do both).
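To make the "population adjustment is just a normalisation" point concrete, here is a minimal sketch (the per-100,000 convention and all values are made-up assumptions for illustration):

```r
# Population adjustment as a plain normalisation: scale raw counts to a
# per-100,000 rate by dividing by population and multiplying by 100,000.
cases <- c(50, 200, 1200)
population <- c(1e5, 5e5, 2e6)
per_100k <- cases / population * 1e5
per_100k  # 50 40 60
```

Labelled generically as a normalisation with a documented population use case, the same transformation stays useful outside epidemiology.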
Are the transformations always targeting the natural values? Or might we want to apply a scaling to the scaled values? The former makes more sense to me overall (as anything could be passed in a custom function), but it might be a bit complicated having to communicate to the user that what the operation does depends on whether a …
which would remove the need to name the transform and then filter out column values (I think this is your option (a), @nikosbosse). If implemented in an S3 class (as mentioned by @seabbs in #270 (comment)) then we could track what has been applied. I think your suggestion might work better than this, but it would require careful signalling to the user as it could be confusing.
@sbfnk for your suggestion you would lose the ability to score natural and transformed forecasts at once, as in @nikosbosse's approach, and wouldn't be able to stack transformations without additional code. On the plus side, the code is much simpler, the workflow for a single transform is cleaner, and you don't need to bother labelling things, which is a pain. It is similar in spirit to my suggestion here: #271 (comment)

Having seen @nikosbosse's suggested use case, I do like the idea of scoring multiple transformations at once and having functionality to make this easier. Not entirely convinced it's worth the complexity trade-offs you highlight, though.

Default options: I really like the idea of default options (like those supplied in the …

I do like the use case for S3 here and in other parts of this package, but we have discussed this previously and decided not to go this way. I think it might be better to have a general … should …
Codecov Report
@@ Coverage Diff @@
## master #271 +/- ##
==========================================
- Coverage 90.82% 90.42% -0.40%
==========================================
Files 21 22 +1
Lines 1286 1327 +41
==========================================
+ Hits 1168 1200 +32
- Misses 118 127 +9
So I updated the suggested code a bit:
What do you think? Some workflow examples:
Alternative would be to create a small function:

log_shift <- function(x, offset = 1, ...) {
  return(log(x + offset, ...))
}

This could then be the default argument for …
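As a quick sanity check of that proposed helper (the pass-through of `...` works because base R's `log()` accepts a `base` argument):

```r
# log_shift(): log with an additive offset; extra arguments such as 'base'
# are forwarded to log().
log_shift <- function(x, offset = 1, ...) {
  log(x + offset, ...)
}

log_shift(0)                        # log(0 + 1) = 0
log_shift(7, offset = 1, base = 2)  # log2(8) = 3
```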
I'd call it …
Maybe the best idea would be to just have an option …
I still prefer @sbfnk's suggestion of a custom transformation function vs giving special design consideration in this function for the log transform, but your call.
@nikosbosse any help needed to unstick this?
No real blockers, I was waiting on more feedback from Le Big Boss. I'll implement the …
Really makes me feel empowered 😆
You had cast your vote :)
Looks good to me. Seems like a nice workflow. Just flagged that the docs are currently out of date.
Not for this PR, but a vignette on forecast transformation seems like a nice and useful idea.
Nearly there. Just one more question.
R/convenience-functions.R
Outdated
log_shift <- function(
  x,
  offset = 0,
  truncate = FALSE,
Do we really want this option here? I'd think it's better for the log function to throw an error and the user to pre-process the forecasts/data so that there are no negatives.
If sticking with it, I would suggest making this more self-explanatory, i.e. calling the option negative_to_zero
or something like that (although given that we might be adding an offset anyway, I'm not clear why we'd choose zero over anything else).
I think the reason to have that truncation is that the alternative (at least staying in the same workflow) is a bit more clunky. You would have to use something like

transform_forecasts(fun = function(x) pmax(0, x), append = FALSE, label = "natural")
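(For clarity, pmax(0, x) is element-wise truncation at zero, which is what the negative_to_zero option does internally:)

```r
# pmax(0, x) replaces every negative element with 0, leaving the rest as-is.
x <- c(-2, -0.5, 0, 3)
pmax(0, x)  # 0.0 0.0 0.0 3.0
```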
See e.g. these examples:
#' example_quantile %>%
#' transform_forecasts(fun = function(x) pmax(0, x), append = FALSE) %>%
#' transform_forecasts(fun = sqrt, label = "sqrt")
#' # adding multiple transformations
#' library(magrittr) # pipe operator
#' example_quantile %>%
#' transform_forecasts(offset = 1, truncate = TRUE) %>%
#' # manually truncate all negative values before applying sqrt
#' transform_forecasts(fun = function(x) pmax(0, x), append = FALSE, label = "natural") %>%
#' transform_forecasts(fun = sqrt, label = "sqrt") %>%
#' score() %>%
#' summarise_scores(by = c("model", "scale"))
(the part with the label = natural here is only because it's transforming the original while something else has already been computed, which maybe is a stupid thing to do?)
Looking at it again I should definitely simplify the second example...
I think the reason to have that truncation is that the alternative (at least staying in the same workflow) is a bit more clunky. You would have to use something like
transform_forecasts(fun = function(x) pmax(0, x), append = FALSE, label = "natural")
Wouldn't it be more natural to do something like:
example_quantile %>%
dplyr::filter(true_value >= 0) %>%
transform_forecasts(fun = sqrt, label = "sqrt")
or
example_quantile %>%
dplyr::mutate(true_value = ifelse(true_value < 0, 0, true_value)) %>%
transform_forecasts(fun = sqrt, label = "sqrt")
i.e. fix the problem in the data rather than letting the transformation function take care of it?
I don't have very strong feelings either way. The truncate option (now changed to negative_to_zero) is easier if you have negative values in both predictions and true values. So if you know what you're doing it's convenient (and it's also turned off by default). But I agree that it feels maybe a bit odd?
@seabbs do you have feelings about the … The current version has the examples replaced to make everything a bit easier to understand:
@Bisaloo do you have thoughts? :) democratic development
I think if we are keeping it, then it should be its own helper function. I also think this could be true for offsetting.
Yes definitely - it's currently an argument of the … and …
Yes, so not an argument but a helper function on its own.
I'm not sure I understand. So you would have a separate helper function that's a wrapper for …?
A helper independent of transform_forecasts, but 🤷🏻
I merged a slightly different version without a …
I just implemented a first proposal for #270.
I think one question is whether we want to call the extra column 'scale' or something else. 'transformation'? I like the word scale, but "log scale" is also potentially ambiguous.
I haven't done any checking here (which one could do), but I thought it would be a bit of overkill since it's an extremely simple function, and running e.g.
check_forecasts()
can take a while.