Taking loss functions seriously #450

Closed
juliohm opened this issue Feb 25, 2020 · 11 comments

@juliohm
Contributor

juliohm commented Feb 25, 2020

In the tradition of Julia, this issue follows the "Taking X seriously" convention where "X" here represents loss functions in statistical learning.

The current state of affairs of loss functions (or, more generally, "measures" in MLJ) is not ideal. There is a lot of code repetition that could be avoided, and a lot of machinery that could be reused across different measures. In particular, the weighting machinery varies from measure to measure, and as discussed in #445 it does not serve cost-sensitive learning or, more generally, transfer learning. Additionally, measure implementations are not necessarily ready for automatic differentiation, nor are they ready for computation on GPUs.

I would like to redesign the measures in MLJ to include all important use cases, and to facilitate future additions. For that, I need your help. Before we dive into specific questions about the current traits implemented for measures, I would like to share what I think should be the high-level abstraction for measures. The definitions below are heavily inspired by the LossFunctions.jl documentation, and by a more theoretical view on empirical risk minimization.

Let's concentrate our attention on supervised loss functions, i.e. functions L(yhat, y) that operate on scalar objects yhat and y. By scalar object I simply mean an object with 0 dimensions (e.g. numbers on the real line). For now I will assume that these scalar objects are <:Real, but if you feel that, for example, yhat should include other objects like distributions, please motivate your claim that loss functions should be the mechanism to compare numbers y with distributions yhat. It is not necessarily clear that a loss function should support this comparison.

For a supervised loss function L, we should be able to perform at least two operations:

  1. Evaluate the loss at a pair (yhat, y)
  2. Estimate the expected loss E[L] using a sample of n pairs: E[L] ~ (1/n) * sum(L(yhat_i, y_i))

In the second operation, we can also introduce a weighting function:

  1. Weighted expected loss is given by E[W*L] ~ (1/n) * sum(w_i * L(yhat_i, y_i))

where each pair has a different weight in the final estimate. This mechanism is quite important in transfer learning, where the weights are given by the ratio of the test and train distributions w(x) = p_test(x) / p_train(x). We've formalised the process of estimating these weights in DensityRatioEstimation.jl, and we need to make sure that the loss functions API consumes them correctly.
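
To make this concrete, here is a minimal sketch of these operations in plain Julia. The names are illustrative only (not an existing or proposed API), and the L2 loss stands in for any scalar loss:

    # 1. evaluate the loss at a single pair (yhat, y)
    loss(yhat::Real, y::Real) = (yhat - y)^2

    # 2. estimate the expected loss E[L] from a sample of n pairs
    expected_loss(yhat, y) = sum(loss.(yhat, y)) / length(y)

    # weighted estimate of E[W*L], e.g. with w_i = p_test(x_i) / p_train(x_i)
    expected_loss(yhat, y, w) = sum(w .* loss.(yhat, y)) / length(y)

    expected_loss([1.0, 2.0, 3.0], [1.5, 2.0, 2.5])              # ≈ 0.167
    expected_loss([1.0, 2.0, 3.0], [1.5, 2.0, 2.5], [2, 1, 1])   # 0.25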

To start this discussion, I would like to go over the existing traits for measures. First, I would like to understand how each trait is currently used in other parts of MLJ.jl. Below is the full list of traits I could find:

is_measure_type

const MEASURE_TRAITS =
    [:name, :target_scitype, :supports_weights, :prediction_type, :orientation,
     :reports_each_observation, :aggregation, :is_feature_dependent, :docstring,
     :distribution_type]
  • I understand that is_measure_type checks if a type is a measure type. In my opinion, the more useful trait is the one that operates on instances, is_measure. How is the trait on the type being used? Can't we just rename it to is_measure and cover both cases (type + instance)?

  • I understand that name stores the name of the measure, and that this trait is a global trait in MLJ. I like it that we can always recover the name of objects in the stack.

  • Am I correct to say that the existence of target_scitype and prediction_type is due to the fact that loss functions currently compare objects of different type? Should it be that way? Is it covering that comparison between yhat=distribution and y=number? My opinion at this moment favours a simple interface L(yhat,y) where yhat and y are scalars of the same scientific type. I understand that yhat=f(x) is the output of a learning model with target_scitype and prediction_type, but propagating this type information seems unnecessarily complex.

  • From the definition I shared above, every loss function should support weights. The weights are not a property of the loss function itself, but a property of the expectation operator. I would just deprecate supports_weights and implement the weighting mechanism outside the losses.

  • I personally find the orientation trait suboptimal. I understand the desire to include multiple concepts (loss, score, etc) under the same umbrella, but we lose expressivity doing so. There will be traits in the future that only make sense when orientation=:loss or orientation=:score. You already know that my vote goes for deprecating this trait, and working on separate concepts for losses, scores, etc. It doesn't mean that we need to have different trait names for these concepts, it just means that we won't be thinking about them as a single generic concept called measures. I would like to be able, for example, to replace is_measure by more specific traits in my user code like is_loss or is_score. Code that consumes losses does not necessarily consume scores, and vice-versa. So in summary, my suggestion would be to deprecate orientation, introduce is_loss, is_score, etc., and finally define a new is_measure(x) = is_loss(x) || is_score(x) for the generic check.

  • I understand that the trait reports_each_observation tells whether or not a loss is returned for the whole sample or per pair in the sample. This doesn't make much sense to me in the context of loss functions based on the definitions above about expected losses in samples. Can you please elaborate on how this trait is being used elsewhere in MLJ? I see that L1 and L2 losses for example report the values for each observation, but wouldn't it be simpler to just broadcast the equivalent scalar losses? To me this reports_each_observation trait could be deprecated as well.

  • I understand that the trait aggregation tells which aggregation method is used to combine the losses for each pair in the sample. Unless we have a use case for a different aggregation method that is not the sample mean, this trait is also unnecessary. Can you please elaborate on how it is being used elsewhere in the stack?

  • I don't understand the is_feature_dependent trait. Could you please elaborate? My guess is that it tells whether or not the loss L(yhat,y) depends on the feature x used to estimate yhat=f(x), but I am probably wrong, because isn't that always the case? Also, I noticed that all current losses have this set to false.

  • I like the general docstring trait available for all objects in the MLJ stack.

  • The distribution_type trait seems to be another trait that is the result of allowing losses between objects yhat and y of different kind. Could you please elaborate on what is the meaning of this trait and how it relates to target_scitype and prediction_type?

I appreciate your time replying to all these questions, and apologise in advance if my words appear harsh. I am not a native English speaker, so I write with a reduced vocabulary that sometimes may sound aggressive to some.

If you can take a careful look at all these points, that would be extremely helpful. My current research depends on this, and the sooner I get your feedback, the faster I will be able to contribute.

@tlienart
Collaborator

Thanks for taking the time to put this together. Anthony is the one who has dedicated the most thought to measures, so he's best placed to answer.
Note that there is an eventual plan to have a separate MLJMeasures package, which would be a good occasion to generally improve the interface; so it's a good time to discuss this and your feedback is welcome!

@juliohm
Contributor Author

juliohm commented Feb 25, 2020

Thank you @tlienart for the feedback. A separate package would be great 💯 If you feel like adding me to the organization, I could work on the proposal therein already, otherwise I can submit PRs to the repository.

@tlienart
Collaborator

oh it's not there yet so I think here is a good place to discuss what it could look like, thanks for the support!

@juliohm
Contributor Author

juliohm commented Feb 27, 2020

cc: @ablaom

@ablaom
Member

ablaom commented Mar 3, 2020

@juliohm

Thanks for your helpful review of the measure API. I appreciate this
takes some time and effort. Thanks also for the offer to help out in
an area where I agree there is room for improvement.

My first impression is that your requirements are more specialized
than the needs of the general MLJ user. I hope that despite this you
will appreciate that, in a broader context, the original goals of the
API are generally worthwhile, and you remain willing to
contribute. Let me do my best to respond to your post. I'm sorry for
not responding to your comments in the order they were made.

Additionally, measure implementations are not necessarily ready for
automatic differentiation, nor are they ready for computation on
GPUs.

I agree these are worthwhile goals. It would be helpful if you could
provide examples of the shortcomings, thanks.

In another thread you mentioned type-instabilities. It would
likewise be helpful if you could flag examples. (I'm more concerned
with evaluation of the measures here than with instantiation thereof.)

Probabilistic predictors

for example yhat should include other objects like distributions,
please motivate your claim that loss functions should be the
mechanism to compare numbers y with distributions yhat. It is
not necessarily clear that a loss function should support this
comparison.

...

Am I correct to say that the existence of target_scitype and
prediction_type is due to the fact that loss functions currently
compare objects of different type?

Yes.

Should it be that way? Is it covering that comparison between
yhat=distribution and y=number? My opinion at this moment
favors a simple interface L(yhat,y) where yhat and y are
scalars of the same scientific type. I understand that yhat=f(x)
is the output of a learning model with target_scitype and
prediction_type, but propagating this type information seems
unnecessarily complex.

The output of probabilistic predictors is varied. The predicted
distributions need not be parametric or even have analytic
representations (e.g., generated by MCMC). For uniformity of
interface, it was decided that probabilistic models in MLJ should
always predict a distribution, rather than model/domain specific,
ambiguously ordered, probabilities or parameters.

An important class of performance measures for probabilistic
predictions are the proper scoring rules. See, e.g., this article.
Some of these rules are very general in the sense that one formula,
defined in terms of the pdf, defines a loss that can be applied to
large families of distribution types simultaneously. An example is the
Brier score which applies not just to finite distributions but to any
distribution whose pdf is suitably well-behaved. So it is very natural
to implement loss functions that operate on a distribution, rather
than some representation the provider and consumer must agree upon on
a case-by-case basis.

Here "distribution" is a little vague; if it's finite, it should be
UnivariatFinite, any other parametric distribution should generally
be a Distribution.Distribution object; it should at least implement
rand and if possible pdf.
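
To illustrate the point, here is a sketch (not MLJ's actual implementation) of a Brier-type loss written once in terms of the pdf and applicable to any finite distribution supporting pdf over a known class list:

    using Distributions

    # negatively oriented Brier loss, defined purely in terms of the pdf
    brier_loss(d, y, classes) = sum(pdf(d, c)^2 for c in classes) - 2*pdf(d, y) + 1

    d = Categorical([0.2, 0.7, 0.1])   # predicted distribution over classes 1:3
    brier_loss(d, 2, 1:3)              # 0.14, small because the prediction is good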

I understand that limitations in the main ML platforms (scikit-learn,
MLR, etc) around the performance evaluation of probabilistic
predictors are a source of some frustration in the Bayesian /
probabilistic programming community, and consequently a source of
fragmentation between the various paradigms. The package
skpro (in which yhat is allowed to be a distribution) is one response
to this issue and has informed MLJ's design. See also this related
article.

At present we do not implement a large number of proper scoring rules
but we should like to do so at some point.

So, for our purposes, I don't agree that yhat should be restricted
to a number.

I like the distinction, represented by the trait prediction_type,
which ensures that deterministic measures are always applied to
deterministic predictions, while probabilistic measures (e.g.,
cross_entropy) are always applied to probabilistic predictions. It
eliminates confusion and provides extra interface points for the
user. If you really want to apply a deterministic measure to a
probabilistic prediction, you must specify precisely how you want this
to be done. Are you computing the median? The mode? Or perhaps you are
going to have a weighted mode whose weighting is learned, etc. There
are convenience methods like predict_mode to deal with common use
cases when evaluating a model.
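
For example (a hedged sketch only, assuming mach is a fitted machine for a probabilistic classifier and Xtest, ytest are held-out data):

    yhat = predict(mach, Xtest)            # vector of distributions
    cross_entropy(yhat, ytest)             # probabilistic measure consumes distributions

    yhat_mode = predict_mode(mach, Xtest)  # collapse each distribution to its mode
    accuracy(yhat_mode, ytest)             # deterministic measure consumes point predictions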

distribution_type

The distribution_type trait seems to be another trait that is the
result of allowing losses between objects yhat and y of different
kind. Could you please elaborate on what is the meaning of this
trait and how it relates to target_scitype and prediction_type?

The distribution_type was a late addition and is not currently used
anywhere in the stack. It declares the type of probability
distribution that can be plugged in as yhat when evaluating the
measure (e.g., UnivariateFinite for cross_entropy). It is
missing if prediction_type is not :probabilistic, when it has no
meaning. The trait target_scitype says nothing about the nature of
the probability distributions predicted by a model. It concerns the
target observations, rather than predictions.

Given the fact that the type of the distribution (e.g., an
MCMC-generated object) might not be accessible, this trait may not be
universally useful. On the other hand, I don't see that it does any harm.

What is a loss function?

For a supervised loss function L, we should be able to perform at least two operations:

  1. Evaluate the loss at a pair (yhat, y)
  2. Estimate the expected loss E[L] using a sample of n pairs: E[L] ~ (1/n) * sum(L(yhat_i, y_i))

In the second operation, we can also introduce a weighting function:

Weighted expected loss is given by E[W*L] ~ (1/n) * sum(w_i * L(yhat_i, y_i))

Yes, I agree that it would be nice if all performance measures in
common use were defined as the mean of a per-observation measure, both
from the theoretical and practical points-of-view. But many entrenched
performance measures (absent from LossFunctions.jl) don't satisfy this
criterion. Examples include rms and its many cousins, area under the
ROC curve, and F_β scores. (Of course one could use sums of squares
instead of rms but general users won't want to do this). More benign
examples are things like true_positive which count instead of
average the per-observation measurements (as they are conventionally
defined). You may criticise the use of these measures on theoeretical
grounds but you surely know they are ubiquitous.

We consequently take a more general point of view than you propose: A
measure is a function applied to a sample, and we do not require
that it be the aggregate of any function applied to individual
observations.

In those cases where a measure applied to the sample can be
recovered by aggregating its applications to the observations in
isolation, one is allowed to (and we generally should, but don't)
implement reports_each_observation as true, which indicates the
corresponding measure method returns a vector of the per-observation
measurements, instead of a single value. If the
reports_each_observation trait is false, a single value is
expected.

aggregation

Measures that report_each_observation are aggregated outside of the
measures API and so we require the aggregation trait to declare how
the per-observation measurements are to be aggregated to obtain the
correct value. Aggregation is not always by mean; rms and
true_positive are two of many counterexamples. Furthermore, for
any measure, further aggregation occurs in resampling (e.g., CV) when
aggregates from multiple samples are themselves aggregated.
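
A rough sketch of the idea, with simplified stand-ins rather than MLJBase's actual aggregation types:

    abstract type AggregationMode end
    struct Mean <: AggregationMode end
    struct Sum  <: AggregationMode end
    struct RootMeanSquare <: AggregationMode end

    aggregate(v, ::Mean)           = sum(v) / length(v)
    aggregate(v, ::Sum)            = sum(v)
    aggregate(v, ::RootMeanSquare) = sqrt(sum(abs2, v) / length(v))

    yhat, y = [1.0, 2.0, 4.0], [1.0, 3.0, 2.0]
    per_obs = abs2.(yhat .- y)               # per-observation squared errors
    aggregate(per_obs, RootMeanSquare())     # ≈ 1.29, i.e. rms, not the plain mean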

Can you please elaborate on how [aggregation] is being used
elsewhere in the stack?

When a model's performance is evaluated (using evaluate! or
evaluate) one or more performance measures are applied to each
observation in resampling (where you have a collection of train/test
pairs of row indices, as in CV, for example). These per_observation
measurements are aggregated to form a per_fold measurement (across
the test set) and the per_fold measurements are in turn aggregated
to obtain an overall measurement. For measures like auc, which do
not report_each_observation, the first step is skipped (and
missing reported). It is worth noting here that the
per_observation scores are not discarded after aggregation, as
some tuning strategies (Bayesian) make use of them. The
evaluate!/evaluate methods return a named tuple with keys
per_observation, per_fold, and measurement.
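
For illustration, a hedged usage sketch (model, X and y are assumed to be defined already; l2 and rms are the measure instances exported by MLJ):

    using MLJ

    e = evaluate(model, X, y,
                 resampling=CV(nfolds=5),
                 measure=[l2, rms])

    e.measurement       # overall value, one entry per measure
    e.per_fold          # values aggregated over each test fold
    e.per_observation   # raw per-observation values (or missing, e.g. for auc)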

If you think it's worthwhile, I would be happy to allow the user to
specify an alternative aggregation method at time of instantiation of
a measure, with the trait specifying a default value.

orientation

I personally find the orientation trait suboptimal. I understand
the desire to include multiple concepts (loss, score, etc) under the
same umbrella, but we lose expressivity doing so. There will be
traits in the future that only make sense when orientation=:loss
or orientation=:score. You already know that my vote goes for
deprecating this trait, and working on separate concepts for
losses, scores, etc. It doesn't mean that we need to have different
trait names for these concepts, it just means that we won't be
thinking about them as a single generic concept called measures. I
would like to be able, for example, to replace is_measure by more
specific traits in my user code like is_loss or is_score. Code
that consumes losses does not necessarily consume scores, and
vice-versa. So in summary, my suggestion would be to deprecate
orientation, introduce is_loss, is_score, etc., and finally
define a new is_measure(x) = is_loss(x) || is_score(x) for the
generic check.

Sorry, I guess I'm missing some use cases here. For me any loss
function becomes a scoring function if I multiply by minus one, and
vice-versa. I suppose it's common to suppose a loss returns a value
between 0 and 1, with 1 optimal, but I was not aware this was a
universal convention, or used in any essential way. Can you provide me
with an example of an algorithm that consumes loss functions but
cannot also consume scores by simply multiplying the evaluations by
minus one (after testing the orientation trait)?

We also want to include functions as "measures" that are neither
losses nor scores. One user already requested that confusion_matrix be
admissible in performance evaluation, and this has been
implemented. Its orientation is :other, which means, for example,
that it cannot be used in hyperparameter optimization.

reports_each_observation

  • I understand that the trait reports_each_observation tells
    whether or not a loss is returned for the whole sample or per pair
    in the sample. This doesn't make much sense to me in the context
    of loss functions based on the definitions above about expected
    losses in samples. Can you please elaborate on how this trait is
    being used elsewhere in MLJ? I see that L1 and L2 losses for
    example report the values for each observation, but wouldn't it be
    simpler to just broadcast the equivalent scalar losses? To me this
    reports_each_observation trait could be deprecated as well.

The definition of this trait is given in "What is a loss function?"
above.

Several MLJ measures that don't currently report each observation
could do so (especially in MLJBase/src/continuous.jl) and I am happy
for them to be re-factored.

If a loss function reports_each_observation, then currently it
implements both a scalar and a vector version which I agree is
sub-optimal. In those cases, I agree it makes sense to require only an
implementation of the scalar case, and to use trait-dispatch to reduce
the vector methods to the scalar case. Of course, when
reports_each_observation is false, a vector method (only) needs to
be implemented.
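
A minimal sketch of what that refactoring could look like (hypothetical names, not MLJ's actual internals):

    struct MyL1Loss end
    (m::MyL1Loss)(yhat::Real, y::Real) = abs(yhat - y)   # only the scalar case is implemented

    reports_each_observation(::MyL1Loss) = true

    # the vector method is derived by trait dispatch: broadcast when the measure
    # reports each observation, otherwise call the sample-level method directly
    value(m, yhat::AbstractVector, y::AbstractVector) =
        reports_each_observation(m) ? m.(yhat, y) : m(yhat, y)

    value(MyL1Loss(), [1.0, 2.0, 3.0], [1.5, 2.0, 2.0])   # [0.5, 0.0, 1.0]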

From the definition I shared above, every loss function should
support weights. The weights are not a property of the loss function
itself, but a property of the expectation operator. I would just
deprecate supports_weights and implement the weighting mechanism
outside the losses.

Yes, but your definition, as noted earlier, is too restrictive for our
purposes.

Here is a proposal: We define supports_weights(m) == reports_each_observation(m) && aggregation(m) <: Union{Sum, Mean}.
Pros: No need for measures to implement supports_weights; less code,
more easily maintained. Cons: Considerable refactoring. No way to
specify weights for general measures, such as auc and F_β-scores.

This proposal presupposes that all measures that can implement
reports_each_observation indeed do so.

is_feature_dependent

Some problem-specific performance measures depend on the features X
as well as on y and yhat. For example, in this data science
competition, losses for perishable grocery items are weighted more
heavily than non-perishables (and the weighting is non-linear). We
provide the is_feature_dependent trait as a mechanism for
communicating that a custom performance measure depends on X (so
that MLJBase.value(m, yhat, X, y, w) gets dispatched properly). See
the MLJ docs for an example of user interaction.

Yes, this trait would be false for all built-in measures.
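
To make the mechanism concrete, here is a toy feature-dependent measure in that spirit (the perishable field and the 1.25 weight are invented for illustration; the real competition metric differs):

    # X given as a vector of named-tuple rows; errors on perishable items count more
    function perishable_weighted_l1(yhat, X, y)
        w = [row.perishable ? 1.25 : 1.0 for row in X]
        return sum(w .* abs.(yhat .- y)) / sum(w)
    end

    X = [(item="milk", perishable=true), (item="soap", perishable=false)]
    perishable_weighted_l1([1.0, 2.0], X, [1.5, 2.0])   # ≈ 0.278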

is_measure trait

  • I understand that is_measure_type checks if a type is a measure
    type. In my opinion, the more useful trait is the one that operates on
    instances, is_measure. How is the trait on the type being used? Can't we just
    rename it to is_measure and cover both cases (type + instance)?

Yes, this is a bit untidy. The is_measure_type trait is needed for
measure inspection. There are two facilities for this:

  • info(M) - for returning a named-tuple of all trait values of M, where
    M is an instance or type.

  • measures() (measures(some_boolean_function)) - for listing all
    such named tuples (on which some_boolean_function is true), in
    the same way that models() lists all the model metadata
    entries. (See MLJBase/src/measures/registry.jl to see details.) A
    subtle point is that these methods methods must filter finite
    lists of types, because there are generally infinitely many
    measure instances (some measures have parameters). So a pure-instance
    is_measure trait seem insufficient here, no?

The is_measure trait (which can be deduced from the other, of
course) is not used elsewhere in the stack in any essential way.

Our options would appear to be:

  1. keep is_measure_type and simplify the current code to require
    implementation of is_measure_type only

  2. re-factor to have only the is_measure trait (acting on instances)
    and lose the inspection functionality.

I can't think of a reason to prefer 2 over 1. Why do you say is_measure is more useful?

Summary

In summary:

  • A performance measure (such as cross_entropy) for probabilistic
    predictions should, in my opinion, expect yhat to be
    a distribution

  • All the current traits serve a well-justified purpose, with the
    possible exceptions of supports_weights and distribution_type

  • The above design points notwithstanding, there are opportunities to
    reduce code and improve the implementation of the design we agree
    upon

@juliohm
Contributor Author

juliohm commented Mar 3, 2020

The output of probabilistic predictors is varied. The predicted
distributions need not be parametric or even have analytic
representations (e.g., generated by MCMC). For uniformity of
interface, it was decided that probabilistic models in MLJ should
always predict a distribution, rather than model/domain specific,
ambiguously ordered, probabilities or parameters.

This is ok, but it is not an argument in favor of the current API for losses.

An important class of performance measures for probabilistic
predictions are the proper scoring rules. See, e.g., this article.
Some of these rules are very general in the sense that one formula,
defined in terms of the pdf, defines a loss that can be applied to
large families of distribution types simultaneously. An example is the
Brier score which applies not just to finite distributions but to any
distribution whose pdf is suitably well-behaved. So it is very natural
to implement loss functions that operate on a distribution, rather
than some representation the provider and consumer must agree upon on
a case-by-case basis.

I disagree with this view. Just because scoring rules can be used to track performance doesn't mean they fit the concept of loss as it is traditionally used.

I understand that limitations in the main ML platforms (scikit-learn,
MLR, etc) around the performance evaluation of probabilistic
predictors are a source of some frustration in the Bayesian /
probabilistic programming community, and consequently a source of
fragmentation between the various paradigms. The package
skpro (in which yhat is allowed to be a distribution) is one response
to this issue and has informed MLJ's design. See also this related
article.

At present we do not implement a large number of proper scoring rules
but we should like to do so at some point.

So, for our purposes, I don't agree that yhat should be restricted
to a number.

In the referred article the authors introduce a new concept called probabilistic loss functionals, which is something different from traditional loss functions, and they make that clear. These should be two separate concepts, and this attempt to make everything fit into the same bag is the issue that I am raising. I am discussing the API of traditional supervised loss functions, and in this case it doesn't make sense to allow yhat to be a distribution.

I like the distinction, represented by the trait prediction_type
which ensures that deterministic measures are always applied to
deterministic predictions, while probabilistic measures (e.g.,
cross_entropy) are always applied to probabilistic predictions. It
eliminates confusion and provides extra interface points for the
user.

I disagree. The current interface is confusing for the end user who is not interested in all the kinds of performance metrics one can possibly conceive of as a "measure". I only wish to evaluate my models with traditional supervised losses for a paper, and now I have to learn a complex trait system to filter out which are the losses, which are the scores, which are the probabilistic functionals, what outputs the models produce, and so on. This is unnecessarily complex.

Yes, I agree that it would be nice if all performance measures in
common use were defined as the mean of a per-observation measure, both
from the theoretical and practical points-of-view. But many entrenched
performance measures (absent from LossFunctions.jl) don't satisfy this
criterion. Examples include rms and its many cousins, area under the
ROC curve, and F_β scores.

Exactly. And that is why we shouldn't be talking about rms as if it were a supervised loss as defined above (and in LossFunctions.jl). Something that doesn't fit the definition above deserves a separate API and set of traits.

We consequently take a more general point of view than you propose: A
measure is a function applied to a sample, and we do not require
that it be the aggregate of any function applied to individual
observations.

This general view is useless in practice, because I need to know the nature of the function that I am applying to a sample. If I know that the function satisfies the definition I gave above, for example, I can expect certain properties to hold. Now we have a generic thing called a "measure" that puts a bunch of different concepts in the same bag. The user is left terrified because they don't know which combination of traits they should use to filter things out.

Sorry, I guess I'm missing some use cases here. For me any loss
function becomes a scoring function if I multiply by minus one, and
vice-versa. I suppose it's common to suppose a loss returns a value
between 0 and 1, with 1 optimal, but I was not aware this was a
universal convention, or used in any essential way. Can you provide me
with an example of an algorithm that consumes loss functions but
cannot also consume scores by simply multiplying the evaluations by
minus one (after testing the orientation trait)?

For example, as I defined above, all losses for me are "weightable", because this is a property of the expectation operator and not of the loss. As you mentioned, there are scoring rules which are not computed on a per-observation basis and not aggregated with an expectation operator, so I cannot use those.

Yes, but your definition, as noted earlier, is too restrictive for our
purposes.

Again, I am not proposing a redefinition of measure; I am proposing a specific definition of loss. As I understand it, you have loss + scoring rules + whatever = performance measure, but I don't care about the rest of the list at this moment. Just the loss functions.

Summary

Unfortunately we have views of the world that are too different when it comes to software design. I am always willing to contribute to the MLJ stack, but I realize that it is very difficult to do so given that my research needs are not being addressed by the current design. I could try to adapt my viewpoint to contribute, but that is not efficient, because the proposal you have, where yhat and y have different types, does not seem right conceptually and only makes things more complex than strictly necessary. In that scenario, where I have already tried to clarify my concerns with a GitHub issue as usual, I think the most productive path forward is to just fork the concepts that I am not satisfied with, as I've been doing in GeoStats.jl.

If for some reason we change our minds in the future about this design, we can try to reconcile the codebases.

@juliohm juliohm closed this as completed Mar 3, 2020
@juliohm
Contributor Author

juliohm commented Mar 6, 2020

I've actually just discovered that LossFunctions.jl does the weighting correctly: https://juliaml.github.io/LossFunctions.jl/stable/user/aggregate/ Sharing in case someone stumbles on the same bug here.

@ParadaCarleton

@juliohm could you summarize the main issues you have with this interface? None of the issues here seem irreconcilable, and I really don't want to fragment the Julia ML ecosystem the way other interfaces and ecosystems (like named arrays or automatic differentiation) have been. There may be some places where we have to create different packages, but as much as possible I think we should try to make sure everything is interoperable.

To try and give a summary of the main issues I've found:

First, it looks like you want to focus on the narrower category of proper loss functions, rather than generic loss functionals. How about we create a new type called something like "Separable loss functions" that contains only losses that can be expressed as f(mean(loss(yhat, y))), where f is monotonic and equal to the identity by default? (f is there because sometimes, adding one final function call can make the resulting loss function easier to interpret, as in RMS; however, this doesn't make a difference as long as f is monotonic.)

This way we could still allow generalized loss functionals without giving up the narrower concept. Or, if you'd like, we could split this package into two packages, one for separable + proper loss functions and one for more "unusual" losses.
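
A rough sketch of what such a "separable loss" type could look like (all names hypothetical):

    struct SeparableLoss{L,F}
        perobs::L   # scalar loss, e.g. (yhat, y) -> (yhat - y)^2
        f::F        # monotone finishing transform, identity by default
    end
    SeparableLoss(perobs) = SeparableLoss(perobs, identity)

    (m::SeparableLoss)(yhat, y) = m.f(sum(map(m.perobs, yhat, y)) / length(y))

    rms_like = SeparableLoss((yhat, y) -> (yhat - y)^2, sqrt)
    rms_like([1.0, 2.0], [1.5, 2.5])   # 0.5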

Unless we have a use case for a different aggregation method that is not the sample mean, this trait is also unnecessary.

I believe this is just a convenience for computational efficiency. It's generally possible to find a function f such that invf(sum(f, x)) reproduces the aggregation; the logarithm converting products to sums is the standard example. I think this could be deprecated in theory, or just pushed into some hidden corner of the documentation with a default of mean (to avoid bothering new users implementing this interface).
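
Just to illustrate the log example, the same trick in two lines:

    x = [1.2, 0.8, 1.5]
    exp(sum(log, x)) ≈ prod(x)   # true: a product-type aggregation recovered from a sum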

I could try to adapt my viewpoint to contribute, but that is not efficient because the proposal you have where yhat and y have different type does not seem right conceptually, and only makes things more complex than strictly necessary.

Can you clarify what you'd propose as an alternative interface here?

@juliohm
Contributor Author

juliohm commented Oct 26, 2023

I really don't want to fragment the Julia ML ecosystem the way other interfaces and ecosystems

I sympathize with this feeling, but please understand that I had done my homework before moving forward with the development of alternative packages. Thank you for trying to revive this issue though.

How about we create a new type called something like "Separable loss functions" that contains only losses that can be expressed as f(mean(loss(yhat, y))), where f is monotonic and equal to the identity by default?

That is JuliaML/LossFunctions.jl (I am the main maintainer nowadays).

Can you clarify what you'd propose as an alternative interface here?

I disagree with many of the design decisions that have been made in the project, but I respect them. I don't have any intention of brainstorming MLJ interfaces at this point in time. As I mentioned in another issue, we are not using the project in our industrial applications anymore.

@ablaom
Member

ablaom commented Oct 27, 2023

In case it is useful, MLJBase measures were recently moved out to StatisticalMeasures.jl. These are based on a modified system of traits that are part of StatisticalMeasuresBase.jl.

@ParadaCarleton

In case it is useful, MLJBase measures were recently moved out to StatisticalMeasures.jl. These are based on a modified system of traits that are part of StatisticalMeasuresBase.jl.

Oh, this is great, it looks like the two interfaces are compatible now, so I can just use StatisticalMeasures.jl with LossFunctions.jl measures. Thank you for the hard work on this, Anthony!
