Interface compatibility with Distributions.jl? #8

Open
lrnv opened this issue Mar 22, 2023 · 5 comments

@lrnv

lrnv commented Mar 22, 2023

Hey,

Thanks for this great addition to the ecosystem!

From Distributions.jl, it seems like the first argument to the fit_mle function should be the distribution type and not an instance of the type:

julia> fit_mle(Gamma,rand(1000))
Gamma{Float64}(α=1.604973623956157, θ=0.3097630701718876)

julia> fit_mle(Gamma,rand(100))
Gamma{Float64}(α=1.6985042802071995, θ=0.31065614192888746)

julia> fit_mle(Gamma(1,1),rand(100))
ERROR: MethodError: no method matching fit_mle(::Gamma{Float64}, ::Vector{Float64})
Closest candidates are:
  fit_mle(::Type{<:LogNormal}, ::AbstractArray{T}) where T<:Real at C:\Users\lrnv\.julia\packages\Distributions\bQ6Gj\src\univariate\continuous\lognormal.jl:163
  fit_mle(::Type{<:Weibull}, ::AbstractArray{<:Real}; alpha0, maxiter, tol) at C:\Users\lrnv\.julia\packages\Distributions\bQ6Gj\src\univariate\continuous\weibull.jl:145
  fit_mle(::Type{<:Beta}, ::AbstractArray{T}; maxiter, tol) where T<:Real at C:\Users\lrnv\.julia\packages\Distributions\bQ6Gj\src\univariate\continuous\beta.jl:217
  ...
Stacktrace:
 [1] top-level scope
   @ REPL[14]:1

julia> 

This is not really a problem for you, as you are free to overload this function as you want, and your interface actually makes a lot of sense since you exploit the guesses in your algorithm. But would it be possible to add methods following this convention, maybe with automatic guesses? I have fit_mle bindings in Copulas.jl that assume this convention, and thus do not work directly with your package :(

Edit: I was trying to make a code example of what I would like, but I saw that mixture types do not include component types... More specifically, I would like to be able to write:

fit_mle(MixtureModel{Gamma,Gamma,Normal},data)

instead of

fit_mle(MixtureModel([Gamma(),Gamma(),Normal()],[1/3, 1/3, 1/3]),data)

Would that be possible?

It would allow composability, as I am currently using:

using Copulas, Distributions, ExpectationMaximization, Random
X₁ = MixtureModel([Gamma(2,3), LogNormal(1,1)],[1/2,1/2])
X₂ = Pareto()
X₃ = LogNormal(0,1)
C = FrankCopula(3,0.7) # A 3-variate Frank copula with θ = 0.7
D = SklarDist(C,(X₁,X₂,X₃)) # The final distribution

# This generates a (3,1000)-sized dataset from the multivariate distribution D
simu = rand(D,1000)

D̂ = fit(SklarDist{FrankCopula,Tuple{Gamma,Normal,LogNormal}}, simu) # works
# But how can I specify that I want a mixture for one of the variables?

which, under the hood, calls fit_mle(Marginal_Type, marginal_data) on each marginal.

@dmetivie
Owner

dmetivie commented Mar 23, 2023

Thanks for the great issue! I was actually starting to wonder about mixtures of copulas (or copulas of mixtures, as you wrote).
I believe we could make packages work together.
BTW, I just released v0.2 of the package with brand-new docs.

I address the type vs instance problem in the Instance vs Type section of the docs.
There is also a discussion in PR #1670 in Distributions.jl that I made. The main criticism from @devmotion was that it could add confusion. Maybe denoting the instance version fit_mle! would be clearer (however, one cannot mutate a distribution in place, so it would not be truly in place, which would also be confusing).
My point was that it would allow more flexibility and simpler writing (at the cost of a slight warning on what the function fit_mle does depending on the case).

As you noted: mixture models do not include the type of every component, so fit_mle(Type{mix}, y) cannot work.
Similarly, I noticed that ProductDistribution does not have types in it, so again fit_mle(Type{Product}, y) cannot work.

Based on these observations, using the instance version fit_mle(mix, y) (or product) made more sense.

  • For mixtures, the input mix is used as the initial state of the EM algo, so it is two-in-one.

I added this conversion method for interface compatibility with Distributions.jl:

function fit_mle(g::D, args...) where {D<:Distribution}
    fit_mle(typeof(g), args...)
end

So when encountering a g::Distribution, Julia will first search for a specific instance method (as defined for MixtureModel or Product); if it does not find one, it will fall back to the type version thanks to the previous definition.
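A minimal runnable sketch of this dispatch pattern, with dummy types (Dist, Expo, Mix here are illustrative stand-ins, not the actual Distributions.jl types, and the MLE/EM bodies are placeholders):

```julia
abstract type Dist end
struct Expo <: Dist
    θ::Float64
end
struct Mix <: Dist
    comps::Vector{Dist}
end

# Generic fallback: an instance forwards to the type version.
fit_mle(g::Dist, x) = fit_mle(typeof(g), x)

# Type version for Expo: the MLE of the mean is the sample mean.
fit_mle(::Type{Expo}, x) = Expo(sum(x) / length(x))

# Specific instance version for Mix: the instance is the EM initial guess,
# so dispatch picks it over the generic fallback.
fit_mle(g::Mix, x) = g  # placeholder: EM would refine g here

fit_mle(Expo(1.0), [1.0, 3.0])  # routes through the fallback to the type version
```

Calling fit_mle with an Expo instance hits the generic fallback and ends up in the type version, while a Mix instance keeps its specific EM-style method.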

Note that I did not define the same for the fit function, i.e. fit(g::D, args...) (only for fit_mle). I could very well add it.

In your (great) package, you have so far used the type convention, which IMO is a bit heavy to write for copulas (and other multivariate cases).
You could very well use the instance convention.

The following works in your example!

function Distributions.fit(D::SklarDist, x)
    # The first thing to do is to fit the marginals:
    @assert length(D) == size(x, 1)
    # One could use fit, but EM works with fit_mle only
    m = Tuple(Distributions.fit_mle(D.m[i], x[i, :]) for i in axes(x, 1))
    u = pseudos(x)

    # τ⁻¹(FrankCopula, τ) works but not τ⁻¹(FrankCopula{3, Float64}, τ)
    # I found https://discourse.julialang.org/t/extract-type-name-only-from-parametric-type/14188/20
    # to just extract the typename.
    # Again, defining fit(C::CT, u) where {CT<:ArchimedeanCopula} = fit(CT, u) would simplify the writing
    C = Distributions.fit(Base.typename(typeof(D.C)).wrapper, u)
    return SklarDist(C, m)
end

D̂mix = fit(D, simu)

Note that I had to define the weighted fit_mle(LogNormal, x, w) (only fit_mle(LogNormal, x) is in Distributions.jl).
I believe this does the job:

using StatsBase
function Distributions.fit_mle(::Type{<:LogNormal}, x::AbstractArray{T}, w::AbstractArray) where T<:Real
    lx = log.(x)
    μ, σ = mean_and_std(lx, weights(w))
    LogNormal(μ, σ)
end

(Note that the StochasticEM() version only uses the unweighted version of fit_mle.)

Hence, IMO, for compatibility between EM.jl and Copulas.jl, it is Copulas.jl that simply needs to add Distributions.fit(D::SklarDist, x) as shown here. I believe it should not have any consequences on your existing functions.

@lrnv
Author

lrnv commented Mar 23, 2023

I apologize for not reading your documentation and posting without restraint. Thanks for the clear explanation and the link to the PR; I can appreciate that you already put a lot of effort into this.

I agree with you that, for mixtures, the fit_mle(D::Dist,...) version makes more sense than the fit_mle(D::Type,...) one, since, as we noted, (1) mixture types do not include component types and (2) EM fitting uses guesses for the parameters.

I also agree with the PR reviewers that this adds different behaviors to the function. I see two outcomes:

  1. Your PR gets merged, and I adapt the Copulas.jl code to accept this new version as you proposed. Low-hanging fruit (now that you already did all the work and wrote the PR).
  2. We parametrize MixtureModel. Harder, but I think better. Let me make the case for it.

Plan

MixtureModel becomes MixtureModel{T1,T2,T3} for small mixtures, and potentially NMix{T,N}, like NTuple in Base, to represent a mixture of N components all of type T.

Issues

What if you want 100 Gaussians + a Dirac? Maybe via nesting, as MixtureModel{Dirac,NMix{Gaussian,100}}? This is unclear to me yet.

Also, this might be breaking...

Implications

Implementations of fit_mle that require guesses could take these guesses as a later argument, as in fit_mle(MixtureModel{T1,T2,T3}, guesses::MixtureModel{T1,T2,T3}, ...). Yes, this looks redundant, but you could also bind once and for all fit_mle(X::MixtureModel, ...) = fit_mle(typeof(X), X, ...) to keep your infrastructure in place. Also, you might allow the user NOT to provide guesses by using the default parameter values for distributions that are already implemented in Distributions.jl.
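A sketch of that idea with dummy types (Mix2, Gauss, Expo2, and default_guess are illustrative names, not Distributions.jl API): the type parameters supply default-constructed components as automatic guesses, which then seed the instance-based fit.

```julia
# Dummy stand-ins for parametric mixture components.
struct Gauss
    μ::Float64
    σ::Float64
end
Gauss() = Gauss(0.0, 1.0)  # default parameters, as Distributions.jl provides
struct Expo2
    θ::Float64
end
Expo2() = Expo2(1.0)

# A mixture type that carries its component types as type parameters.
struct Mix2{T1,T2}
    comps::Tuple{T1,T2}
    weights::Vector{Float64}
end

# Type version: build automatic guesses from the type parameters,
# then delegate to the instance version.
default_guess(::Type{Mix2{T1,T2}}) where {T1,T2} = Mix2((T1(), T2()), [0.5, 0.5])
fit_mle(T::Type{<:Mix2}, x) = fit_mle(default_guess(T), x)
fit_mle(g::Mix2, x) = g  # placeholder: EM would refine the guess here

fit_mle(Mix2{Gauss,Expo2}, [1.0, 2.0])  # no explicit guess needed
```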

A good argument

My best argument for this change of parametrization of the MixtureModel type is that other "modified" distributions in Distributions.jl already do it, including ProductDistribution:

julia> typeof(product_distribution(Gamma(),Normal()))
Distributions.ProductDistribution{1, 0, Tuple{Gamma{Float64}, Normal{Float64}}, Continuous, Float64}

julia> typeof(truncated(Normal(),0,Inf))
Truncated{Normal{Float64}, Continuous, Float64}

julia> typeof(censored(Normal(),0,Inf))
Distributions.Censored{Normal{Float64}, Continuous, Float64, Float64, Float64}

julia> 

Granted, the current type structure is kinda clunky (what are these 1 and 0 in the ProductDistribution type?). Expanding fit_mle to censored, truncated, or product distributions should therefore be possible for both interfaces.

Conclusion

Whatever gets decided, I will adapt Copulas.jl accordingly.

Post-scriptum

I agree with you that abusing the meaning of ! with fit_mle!(D::Dist,...) is probably not the best solution since, as you noted, distributions are immutable.

@dmetivie
Owner

dmetivie commented Apr 3, 2023

Thanks for your comments.

Since I just noticed (thanks to you) that Product has been updated, I now agree more with what you propose.
In particular, the part fit_mle(X::MixtureModel, ...) = fit_mle(typeof(X), X, ...)!

I think my main remaining point against it is that if one wants an algorithm that uses fit_mle for generic distributions (like EM), the arguments would have to change depending on the distribution:

  • fit_mle(Exponential, y)
  • fit_mle(Categorical, k, y). This one can be transformed into fit_mle(Categorical, y) using that, but in some unlucky scenarios the last category is not drawn and maximum(y) < k. To fix that type-wise, one might define some subtype like Categorical{k}?
  • Now we also have fit_mle(MixtureModel{T1,T2,...}, guesses, y)

These are 3 ways to use fit_mle depending on the distribution, so for a generic algo it does not seem convenient.
Whereas with the instance formulation, one always writes fit_mle(dist, y).
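To illustrate the point with dummy types (Fittable, ExpoH, CatH, and refit_all are illustrative, not Distributions.jl API): with the instance convention a generic routine can refit heterogeneous distributions through one uniform call shape, and the instance naturally carries information like the number of categories k.

```julia
abstract type Fittable end
struct ExpoH <: Fittable
    θ::Float64
end
struct CatH <: Fittable
    p::Vector{Float64}
end

# Uniform call shape: always fit_mle(dist, y).
fit_mle(g::ExpoH, y) = ExpoH(sum(y) / length(y))
# The instance carries k = length(g.p), so an unobserved last category is fine.
fit_mle(g::CatH, y) = CatH([count(==(i), y) / length(y) for i in 1:length(g.p)])

# A generic algorithm need not know which distribution it refits.
refit_all(dists, ys) = map(fit_mle, dists, ys)

refit_all((ExpoH(1.0), CatH([0.5, 0.5, 0.0])), ([2.0, 4.0], [1, 1, 2]))
```

Here category 3 never appears in the data, yet the refit CatH keeps three categories because k travels with the instance, not with extra arguments.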

Coming back to your two scenarios:

  1. It does not sound like it will be merged. Maybe I need to make a stronger case.
  2. Should we open a PR in Distributions for that? Why would it be breaking?

@lrnv
Author

lrnv commented Apr 3, 2023

About Categorical and type parameter conventions.

Some naming conventions so that we understand each other:

  • By type parameters, I mean exactly what Julia means: the T in MyType{T}.
  • By distribution parameters, I mean what statisticians mean: the $(\mu, \sigma)$ of a Gaussian.
  • By parameter field, I mean the set of all possible parametrizations of a distribution. For the Gaussian, this is something like $\mathbb R \times \mathbb R_+$.

I think in a dream world, I would like the parameter field to be clearly defined from the type parameters. This is in fact what is implicitly assumed by Distributions.jl's docs on this page:

[ The statement d = fit(D,x) ] fits a distribution of type D to a given dataset x, where x should be an array comprised of all samples. The fit function will choose a reasonable way to fit the distribution, which, in most cases, is maximum likelihood estimation.

Let's deal with the Categorical case as an example. The definition of Categorical in Distributions.jl's code is:

const Categorical{P<:Real,Ps<:AbstractVector{P}} = DiscreteNonParametric{Int,P,Base.OneTo{Int},Ps}

function Categorical{P,Ps}(p::Ps; check_args::Bool=true) where {P<:Real, Ps<:AbstractVector{P}}
    @check_args Categorical (p, isprobvec(p), "vector p is not a probability vector")
    return Categorical{P,Ps}(Base.OneTo(length(p)), p; check_args=check_args)
end

Categorical(p::AbstractVector{P}; check_args::Bool=true) where {P<:Real} =
    Categorical{P,typeof(p)}(p; check_args=check_args)

function Categorical(k::Integer; check_args::Bool=true)
    @check_args Categorical (k, k >= 1, "at least one category is required")
    return Categorical{Float64,Vector{Float64}}(Base.OneTo(k), fill(1/k, k); check_args=false)
end

Hence, the number of categories k is NOT part of the type parametrization, and thus can be considered a distribution parameter to be derived from the data, which is what fit(Categorical, y) is currently doing if I read you correctly. The parameter field of Categorical is thus $\mathbb N$ for the parameter k, times the probability simplex of size k.

If we introduce Categorical{k}, the number of categories would no longer be a distribution parameter but rather a type parameter, and thus fit(Categorical{k}, data) should respect it even if data does not contain the maximum possible value (and maybe throw an error if any(data .> k), for good measure?)

The version fit(Categorical, k, data) feels like it should not exist. I thus strongly agree with you that having one common interface fit(Distribution{EventuallyWithTypeParams}, data) is the way to go. If you want k not to be a distribution parameter, make it a type parameter. It just looks like the distribution we want, here Categorical{k}, does not exist yet.
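A sketch of what such a Categorical{k} could look like (CategoricalK and fit_categorical are hypothetical names, not Distributions.jl API): k lives in the type, so the fit respects it even when the top category is unobserved, and rejects out-of-range data.

```julia
# Hypothetical fixed-k categorical with k as a value type parameter.
struct CategoricalK{k}
    p::Vector{Float64}
end

function fit_categorical(::Type{CategoricalK{k}}, data::AbstractVector{<:Integer}) where {k}
    any(d -> d < 1 || d > k, data) && error("observation outside 1:$k")
    counts = zeros(k)
    for d in data
        counts[d] += 1
    end
    return CategoricalK{k}(counts ./ length(data))
end

# Category 3 is never drawn, yet the result still has k = 3 categories.
fit_categorical(CategoricalK{3}, [1, 1, 2])
```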

About MixtureModel

Same thing as for Categorical: the call fit(MixtureModel, data) is clearly dumb, as there is no way to infer the parameter field from the type MixtureModel: from the docs, even the number of components is not part of the type. On the other hand, in fit(MixtureModel{Gamma,Normal}, data), we can infer the parameter field from the type, and thus it should work. You could also argue that no one will really try fit(MixtureModel, data), as it does not make much sense.

PS: Currently, on this doc page, MixtureModel{Univariate,Continuous,Normal} represents mixtures of Gaussians with an unknown number of components. Thus, it would make sense to define fit(MixtureModel{Univariate,Continuous,Normal}, data) with an appropriate algorithm to fit Gaussian mixture models with an unknown number of components; I am sure there are packages that do this already (and which could use this binding!). However, the mixtures fitted in ExpectationMaximization.jl have no corresponding types that allow this (for the moment), which is why we are here.

About Your interface and going forward.

Yes, defining in your package

fit_mle(X::MixtureModel,..) = fit_mle(typeof(X),X,...)
fit_mle(T::Type{MixtureModel{A,B,C}},..) = fit_mle(T(),...) # Where T() instantiates a distribution with default parameters.

would probably be enough to 1) keep your existing code and 2) comply with the common interface, if and only if we make a strong case on Distributions.jl to change this stuff (namely, parametrize mixtures so that their parameter field is inferable from their type parameters, as the docs claim).

If you agree with that, I could try making the PR myself. Or maybe open an issue first to ask their opinion, linking to this discussion?

@dmetivie
Owner

dmetivie commented Apr 4, 2023

If you agree with that, I could try making the PR myself. Or maybe open an issue first to ask their opinion, linking to this discussion?

Yes, thank you!

About the initialization: other distributions could benefit from the notation fit_mle(typeof(X), X, ...) to warm-start.
Basically, all distributions where the MLE is not known explicitly and numerical optimization is needed, i.e. most distributions.
For example, the Gamma distribution needs a starting point for the $a$ parameter. Currently, $a_0$ is given as a kwarg.
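A sketch of that warm-start forwarding with a dummy Gam type (the solver body is a placeholder, not the actual Distributions.jl Gamma code): the instance version passes its own parameter as the starting point, sparing the caller the kwarg.

```julia
struct Gam
    a::Float64
    θ::Float64
end

# Type version with an explicit starting-point kwarg; the body stands in
# for the numerical solver, which would iterate from a0.
function fit_mle(::Type{Gam}, x; a0 = 1.0)
    # ... Newton/fixed-point iterations starting at a0 would go here ...
    return Gam(a0, sum(x) / (length(x) * a0))  # placeholder result
end

# Instance version: the instance supplies the warm start.
fit_mle(g::Gam, x) = fit_mle(Gam, x; a0 = g.a)

fit_mle(Gam(2.0, 1.0), [1.0, 3.0])  # the solver starts from a0 = 2.0
```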
