Proposed model #574

Open · azev77 opened this issue Jun 18, 2020 · 2 comments

azev77 (Contributor) commented Jun 18, 2020

Suppose I got a new dataset in the mail today and want to see which brand-name distribution in Distributions.jl best fits it.

using Distributions, Random, HypothesisTests;

Uni = subtypes(UnivariateDistribution)
#Cts_Uni = subtypes(ContinuousUnivariateDistribution)
DGP_True = LogNormal(17, 7);           # true data-generating process
Random.seed!(123);
const d_train = rand(DGP_True, 1_000)  # sample used for fitting
const d_test  = rand(DGP_True, 1_000)  # held-out sample used for scoring

Er = []; D_fit = [];
for d in Uni
    println(d)
    try
        D̂ = fit(d, d_train)   # fit the candidate family to the training sample
        Score = [loglikelihood(D̂, d_test),                    # held-out log-likelihood
                 OneSampleADTest(d_test, D̂)            |> pvalue,
                 ApproximateOneSampleKSTest(d_test, D̂) |> pvalue,
                 ExactOneSampleKSTest(d_test, D̂)       |> pvalue,
                 #PowerDivergenceTest(d_test, lambda=1)  Not working!!!
                 JarqueBeraTest(d_test)                |> pvalue   #Only Normal
        ];
        #Score = loglikelihood(D̂, ds) #TODO: compute a better score.
        push!(D_fit, [d, D̂, Score])
    catch e
        println(e, d)          # families without a fit method end up here
        push!(Er, (d, e))
    end
end

a = hcat(D_fit...)
M_names = a[1, :]; M_fit = a[2, :]; M_scores = a[3, :];
idx = sortperm(M_scores, rev=true);    # rank by score vector (held-out log-likelihood first)
Dfit_sort = hcat(M_names[idx], M_scores[idx])
julia> Dfit_sort
11×2 Array{Any,2}:
 LogNormal              [-20600.7, 0.823809, 0.789128, 0.781033, 0.0]
 Gamma                  [-21159.4, 6.0e-7, 2.45426e-68, 1.23247e-69, 0.0]
 Cauchy                 [-24823.3, 6.0e-7, 2.91142e-213, 8.6107e-227, 0.0]
 InverseGaussian        [-26918.1, 6.0e-7, 0.0, 0.0, 0.0]
 Exponential            [-33380.3, 6.0e-7, 0.0, 0.0, 0.0]
 Normal                 [-40611.5, 6.0e-7, 1.32495e-213, 3.51792e-227, 0.0]
 Rayleigh               [-61404.6, 6.0e-7, 0.0, 0.0, 0.0]
 Laplace                [-2.03419e9, 6.0e-7, 1.49234e-138, 5.47197e-144, 0.0]
 DiscreteNonParametric  [-Inf, 6.0e-7, 0.197933, 0.193494, 0.0]
 Pareto                 [-Inf, 6.0e-7, 6.69184e-108, 3.7704e-111, 0.0]
 Uniform                [-Inf, 6.0e-7, 0.0, 0.0, 0.0]

Basically this is predicting Y given X = constant, except the prediction here is not a number but an (unconditional) distribution.

ablaom (Member) commented Jun 19, 2020

In MLJ the plan is to view fitting a distribution as probabilistic supervised learning where the input is X = nothing - a single point carrying no information. The data you have above would be the target, labelled y, and yhat is then a single (probabilistic) prediction. The API is set up for this already - see https://alan-turing-institute.github.io/MLJ.jl/dev/adding_models_for_general_use/#Models-that-learn-a-probability-distribution-1 - but no one has contributed a model yet.
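For concreteness, here is a minimal sketch of what such a model could look like, assuming the usual MLJModelInterface fit/predict conventions with the input treated as nothing; the exact signatures in the (experimental) section linked above may differ, and LogNormalFitter is a hypothetical name (one model per family is only one possible design):

import MLJModelInterface
const MMI = MLJModelInterface
using Distributions

mutable struct LogNormalFitter <: MMI.Probabilistic end

# Training data is (X, y) with X = nothing; the fitresult is the fitted distribution.
function MMI.fit(::LogNormalFitter, verbosity, X, y)
    fitresult = Distributions.fit(LogNormal, y)
    cache, report = nothing, NamedTuple()
    return fitresult, cache, report
end

# Prediction ignores Xnew and returns the single learned distribution.
MMI.predict(::LogNormalFitter, fitresult, Xnew) = fitresult

With something like this in place, the comparison loop above reduces to calling predict and scoring the returned distribution against the held-out sample, e.g. with loglikelihood.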

Is this what you are after?

azev77 (Contributor, Author) commented Jun 20, 2020

Yes, that's it.
Btw, this case with X = nothing can be generalized.
For example: y = x*β + e, where e ~ iid F(θ), for a large class of probability distributions F for the error term.
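A minimal sketch of that generalization, outside of any MLJ API: maximum-likelihood estimation of y = X*β + e with e ~ iid F(θ), taking F = Laplace purely for illustration; the function name fit_linear_mle and the use of Optim.jl are assumptions of this sketch, not part of the proposal:

using Distributions, Optim, Random

# MLE of y = X*β + e with e ~ iid errdist(0, σ), optimized jointly over β and log σ.
function fit_linear_mle(X, y; errdist = Laplace)
    p = size(X, 2)
    negloglik(par) = begin
        β, logσ = par[1:p], par[end]
        -loglikelihood(errdist(0, exp(logσ)), y .- X * β)
    end
    init = vcat(X \ y, 0.0)                      # start from the least-squares fit
    res  = optimize(negloglik, init, BFGS())
    θ̂    = Optim.minimizer(res)
    return θ̂[1:p], errdist(0, exp(θ̂[end]))       # coefficients and fitted error distribution
end

Random.seed!(1)
X = hcat(ones(500), randn(500))
y = X * [1.0, 2.0] .+ rand(Laplace(0, 0.5), 500)
β̂, F̂ = fit_linear_mle(X, y)

The same scoring idea as above then applies: evaluate the log-likelihood of the fitted error distribution on held-out residuals.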
