Taking loss functions seriously #450
Comments
Thanks for taking the time to put this together. Anthony is the one who has dedicated the most thought to measures, so he's best placed to answer.
Thank you @tlienart for the feedback. A separate package would be great 💯 If you feel like adding me to the organization, I could already start working on the proposal there; otherwise I can submit PRs to the repository.
Oh, it's not there yet, so I think here is a good place to discuss what it could look like. Thanks for the support!
cc: @ablaom |
Thanks for your helpful review of the measure API. I appreciate this […] My first impression is that your requirements are more specialized […]

I agree these are worthwhile goals. It would be helpful if you could […] In another thread you mentioned type-instabilities. It would be […]

Probabilistic predictors

…

Yes.

The output of probabilistic predictors is varied. The predicted […] An important class of performance measures for probabilistic […] Here "distribution" is a little vague; if it's finite, it should be […] I understand that limitations in the main ML platforms (scikit-learn, […] At present we do not implement a large number of proper scoring rules […] So, for our purposes, I don't agree that […] I like the distinction, represented by the trait […]
This is ok, but it is not an argument in favor of the current API for losses.
I disagree with this view. The fact that scoring rules can be used to track performance doesn't mean they fit the concept of a loss as traditionally used.
In the referenced article the authors introduce a new concept called probabilistic loss functionals, which is different from traditional loss functions, and they make that clear. These should be two separate concepts; this attempt to make everything fit in the same bag is the issue I am raising. I am discussing the API of traditional supervised loss functions, and in that case it doesn't make sense to allow `yhat` to be a distribution.
I disagree. The current interface is confusing for the end user who is not interested in all the kinds of performance metrics one can possibly conceive as a "measure". I only wish to evaluate my models with traditional supervised losses for a paper, and now I have to learn a complex trait system to filter out which are the losses, which are the scores, which are the probabilistic functionals, which outputs the model produces, and so on. This is unnecessarily complex.
Exactly. And that is why we shouldn't be talking about `rms` as if it were a supervised loss as defined above (and in LossFunctions.jl). Something that doesn't fit the definition above deserves a separate API and set of traits.
This general view is useless in practice, because I need to know the nature of the function that I am applying to a sample. If I know that the function satisfies, for example, the definition I gave above, I can expect certain properties to hold. Now we have a generic thing called "measure" that lumps a bunch of different concepts into the same bag. The user is terrified because he doesn't know which combination of traits to use to filter things out.
For example, as I defined above, all losses for me are "weightable", because this is a property of the expectation operator and not of the loss. As you mentioned, there are scoring rules which are not computed on a per-sample basis and are not aggregated with an expectation operator, so I cannot use those.
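That split can be made concrete in a small sketch (all names here are illustrative, not from any existing package): the weights live entirely in the estimate of the expectation, so any scalar loss becomes "weightable" for free.

```julia
# Any per-sample scalar loss (a plain function; the name is illustrative)...
l1loss(ŷ, y) = abs(ŷ - y)

# ...is weightable without knowing anything about weights itself:
# the weights belong to the expectation estimate,
# E[W*L] ~ (1/n) * sum(w_i * L(yhat_i, y_i)).
weighted_risk(L, ŷs, ys, ws) = sum(ws .* L.(ŷs, ys)) / length(ys)

# e.g. transfer-learning weights w_i = p_test(x_i) / p_train(x_i):
weighted_risk(l1loss, [1.0, 2.0], [1.5, 2.0], [0.5, 1.5])  # 0.125
```

Note the loss enters only through broadcasting; a scoring rule that is not computed per-sample has no place in `weighted_risk`, which is the distinction being drawn above.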
Again, I am not proposing a redefinition of measure; I am proposing a specific definition of loss. As I understand it, you have loss + scoring rules + whatever = performance measure, but I don't care about the rest of the list at this moment, just the loss functions.

Summary

Unfortunately we have views of the world that are too different when it comes to software design. I am always willing to contribute to the MLJ stack, but I realize that it is very difficult to do so given that my research needs are not being addressed by the current design. I could try to adapt my viewpoint to contribute, but that is not efficient given the proposal you have […] If for some reason we change our minds in the future about this design, we can try to reconcile the codebases.
I've actually just discovered that LossFunctions.jl does the weighting correctly: https://juliaml.github.io/LossFunctions.jl/stable/user/aggregate/

Sharing in case someone stumbles on the same bug here.
@juliohm could you summarize the main issues you have with this interface? None of the issues here seem irreconcilable, and I really don't want to fragment the Julia ML ecosystem the way other interfaces and ecosystems (like named arrays or automatic differentiation) have been. There may be some places where we have to create different packages, but as much as possible I think we should try to make sure everything is interoperable. To try and give a summary of the main issues I've found: first, it looks like you want to focus on the narrower category of proper loss functions, rather than generic loss functionals. How about we create a new type called something like "separable loss functions" that contains only losses that can be expressed as a sum of per-observation terms? This way we can allow generalized loss functionals without […] Or, if you'd like, we could split this package into two packages, one for separable + proper loss functions and one for more "unusual" losses.
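One possible shape for that split, sketched with hypothetical names (`SupervisedLoss`, `SeparableLoss`, and `SquaredLoss` are illustrative, not an existing API): a subtype marking losses that decompose into per-observation terms, with vector evaluation derived once.

```julia
using Statistics: mean

abstract type SupervisedLoss end

# Hypothetical marker for losses that decompose into per-observation
# terms aggregated by a sample mean.
abstract type SeparableLoss <: SupervisedLoss end

struct SquaredLoss <: SeparableLoss end

# A separable loss only needs to define its scalar kernel...
(loss::SquaredLoss)(ŷ::Real, y::Real) = abs2(ŷ - y)

# ...vector evaluation is then derived generically for the whole family.
(loss::SeparableLoss)(ŷs::AbstractVector, ys::AbstractVector) =
    mean(loss.(ŷs, ys))

SquaredLoss()([1.0, 2.0], [1.0, 4.0])  # 2.0
```

Generalized loss functionals would subtype `SupervisedLoss` directly and provide their own sample-level method, leaving the separable fast path untouched.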
I believe this is just a convenience for computational efficiency. It's always possible to find a function […]
Can you clarify what you'd propose as an alternative interface here?
I sympathize with this feeling, but please understand that I had done my homework before moving forward with the development of alternative packages. Thank you for trying to revive this issue though.
That is JuliaML/LossFunctions.jl (I am the main maintainer nowadays).
I disagree with many design decisions that have been made in the project, but I respect them. I don't have any intention to brainstorm MLJ interfaces at this point in time. As I mentioned in another issue, we are not using the project in our industrial applications anymore.
In case it is useful, MLJBase measures were recently moved out to StatisticalMeasures.jl. These are based on a modified system of traits that are part of StatisticalMeasuresBase.jl.
Oh, this is great, it looks like the two interfaces are compatible now, so I can just use StatisticalMeasures.jl with LossFunctions.jl measures. Thank you for the hard work on this, Anthony!
In the tradition of Julia, this issue follows the "Taking X seriously" convention where "X" here represents loss functions in statistical learning.
The current state of affairs of loss functions (or more generally "measures" in MLJ) is not ideal. There is a lot of code repetition that could be avoided, and a lot of machinery that could be reused across different measures. In particular, the weighting machinery varies between measures, and as discussed in #445 it does not serve cost-sensitive learning or, more generally, transfer learning. Additionally, measure implementations are not necessarily ready for automatic differentiation, nor are they ready for computation on GPUs.
I would like to redesign the measures in MLJ to include all important use cases, and to facilitate future additions. For that, I need your help. Before we dive into specific questions about the current traits implemented for measures, I would like to share what I think should be the high-level abstraction for measures. The definitions below are heavily inspired by the LossFunctions.jl documentation, and by a more theoretical view on empirical risk minimization.
Let's concentrate our attention on supervised loss functions, i.e. functions

```
L(yhat, y)
```

that operate on scalar objects `yhat` and `y`. By scalar object I only mean an object with 0 dimensions (e.g. numbers in the real line). For now I will assume that these scalar objects are `<:Real`, but if you feel that, for example, `yhat` should include other objects like distributions, please motivate your claim that loss functions should be the mechanism to compare numbers `y` with distributions `yhat`. It is not necessarily clear that a loss function should support this comparison.

For a supervised loss function `L`, we should be able to perform at least two operations:

1. Evaluate the loss at a pair `(yhat, y)`.
2. Estimate the expected loss `E[L]` using a sample of `n` pairs:

```
E[L] ~ (1/n) * sum(L(yhat_i, y_i))
```
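The two operations can be sketched in a few lines of Julia (`l2loss` and `empirical_risk` are illustrative names, not part of any package):

```julia
using Statistics: mean

# Operation 1: evaluate the loss at a single pair.
l2loss(ŷ, y) = abs2(ŷ - y)   # an illustrative scalar loss
l2loss(1.5, 1.0)             # 0.25

# Operation 2: estimate E[L] from a sample of n pairs by broadcasting
# the scalar loss over the pairs and averaging.
empirical_risk(L, ŷs, ys) = mean(L.(ŷs, ys))
empirical_risk(l2loss, [1.0, 2.0, 3.0], [1.0, 2.5, 2.0])  # ≈ 0.4167
```

Everything beyond the scalar kernel is plain broadcasting and aggregation, which is the point of keeping the loss itself scalar.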
In the second operation, we can also introduce a weighting function:

```
E[W*L] ~ (1/n) * sum(w_i * L(yhat_i, y_i))
```

where each pair has a different weight in the final estimate. This mechanism is quite important in transfer learning, where the weights are given by the ratio of the test and train distributions:

```
w(x) = p_test(x) / p_train(x)
```

We've formalised the process of estimating these weights in DensityRatioEstimation.jl, and we need to make sure that the loss function API consumes them correctly.

To start this discussion, I would like to go over the existing traits for measures. First, I would like to understand how each trait is currently used in other parts of MLJ.jl. Below is the full list of traits I could find:
- I understand that `is_measure_type` checks if a type is a measure type. In my opinion, the more useful trait operates on instances: `is_measure`. How is the trait on the type being used? Can't we just rename it to `is_measure` and cover both cases (type + instance)?
- I understand that `name` stores the name of the measure, and that this trait is a global trait in MLJ. I like it that we can always recover the name of objects in the stack.
- Am I correct to say that the existence of `target_scitype` and `prediction_type` is due to the fact that loss functions currently compare objects of different types? Should it be that way? Is it covering the comparison between `yhat = distribution` and `y = number`? My opinion at this moment favours a simple interface `L(yhat, y)` where `yhat` and `y` are scalars of the same scientific type. I understand that `yhat = f(x)` is the output of a learning model with `target_scitype` and `prediction_type`, but propagating this type information seems unnecessarily complex.
- From the definition I shared above, every loss function should support weights. The weights are not a property of the loss function itself, but a property of the expectation operator. I would just deprecate `supports_weights` and implement the weighting mechanism outside the losses.
- I personally find the `orientation` trait suboptimal. I understand the desire to include multiple concepts (loss, score, etc.) under the same umbrella, but we lose expressivity doing so. There will be traits in the future that only make sense when `orientation = :loss` or `orientation = :score`. You already know that my vote goes for deprecating this trait and working on separate concepts for losses, scores, etc. It doesn't mean that we need different trait names for these concepts; it just means that we won't be thinking about them as a single generic concept called measures. I would like to be able, for example, to replace `is_measure` by more specific traits in my user code like `is_loss` or `is_score`. Code that consumes losses does not necessarily consume scores, and vice versa. So, in summary, my suggestion would be to deprecate `orientation`, introduce `is_loss`, `is_score`, etc., and finally define a new `is_measure(x) = is_loss(x) || is_score(x)` for the generic check.
- I understand that the trait `reports_each_observation` tells whether a loss is returned for the whole sample or per pair in the sample. This doesn't make much sense to me in the context of loss functions, given the definitions above about expected losses in samples. Can you please elaborate on how this trait is being used elsewhere in MLJ? I see that L1 and L2 losses, for example, report the values for each observation, but wouldn't it be simpler to just broadcast the equivalent scalar losses? To me this `reports_each_observation` trait could be deprecated as well.
- I understand that the trait `aggregation` tells which aggregation method is used to combine the losses for each pair in the sample. Unless we have a use case for an aggregation method other than the sample mean, this trait is also unnecessary. Can you please elaborate on how it is being used elsewhere in the stack?
- I don't understand the `is_feature_dependent` trait. Could you please elaborate? My guess is that it tells whether or not the loss `L(yhat, y)` depends on the features `x` used to estimate `yhat = f(x)`, but I am probably wrong because this is always the case? Also, I noticed that all current losses have this set to `false`.
- I like the general `docstring` trait available for all objects in the MLJ stack.
- The `distribution_type` trait seems to be another trait that results from allowing losses between objects `yhat` and `y` of different kinds. Could you please elaborate on the meaning of this trait and how it relates to `target_scitype` and `prediction_type`?

I appreciate your time replying to all these questions, and apologise in advance if my words appear harsh. I am not a native English speaker, so I write with a reduced vocabulary that sometimes may sound aggressive to some.
If you can take a careful look at all these points, that would be extremely helpful. My current research depends on this, and the sooner I get your feedback, the faster I will be able to contribute.
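The trait changes suggested above could be sketched as follows (illustrative only: `is_loss`, `is_score`, and `MyAbsoluteLoss` are proposed names, not existing MLJ API):

```julia
# Proposed: one specific trait per concept, each defaulting to false...
is_loss(::Any) = false
is_score(::Any) = false

# ...and the generic check derived from them, rather than the reverse.
is_measure(x) = is_loss(x) || is_score(x)

# A package exposing a loss then opts in with a single method:
struct MyAbsoluteLoss end
is_loss(::MyAbsoluteLoss) = true

is_measure(MyAbsoluteLoss())  # true
```

Code that consumes only losses dispatches on `is_loss` and never sees scores, while generic tooling keeps working through the derived `is_measure`.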