
Mixture model converges to wrong result (randomly) #7

Open

hberger opened this issue Feb 21, 2023 · 1 comment

hberger commented Feb 21, 2023

Hi,
I have been trying to model an approximately Gaussian mixture (image data: background + foreground intensities) using normalmixEM with 2 or 3 components.

While this generally works, I quite often get results that make no sense given the input data distribution. In particular, the input distribution is clearly bimodal with an approximately 30/70 split in proportions, but the fitted components match neither of the modes, nor does the sum of the components fit the overall distribution well.

For certain data sets this happens about 1 in 5 runs for k=2 and about 1 in 100-200 runs for k=3. I can reproduce the behavior deterministically with specific random seeds for both k=2 and k=3. It does not seem to depend much on the data itself: removing data points or subsampling from 100k down to 1k points does not change the behavior, and neither does trimming the long right tail. I also tried varying the initial lambda values, without much effect.

Tested with mixtools 1.2.0 on R 4.1.2 (OpenBLAS) and mixtools 2.0 on R 4.2.2 (standard R BLAS), both on Linux.

To reproduce (example data have been attached below):

library(mixtools)

# d <- readRDS("100k_vector_for_normalmixEM.rds")  # full 100k data set
# 1k subsample from the 100k data
d <- readRDS("1k_vector_for_normalmixEM.rds")

hist(d, breaks = 100)

# k=2: problem occurs roughly 1 in 5 runs without setting a random seed
set.seed(1238)
mixmdl <- normalmixEM(d, k = 2)
summary(mixmdl)
plot(mixmdl, which = 2)  # density plot of the fitted mixture

# k=3: much less frequent without setting a random seed
set.seed(10319)
mixmdl <- normalmixEM(d, k = 3)
summary(mixmdl)
plot(mixmdl, which = 2)
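
Note that the log-likelihoods of the two runs below differ noticeably, which suggests EM is getting stuck in a poor local optimum. A minimal workaround sketch (my own, not part of mixtools): run several randomly initialized fits and keep the one with the highest log-likelihood. It assumes d is loaded as above.

n_restarts <- 10
best_fit <- NULL
for (i in seq_len(n_restarts)) {
  fit <- try(normalmixEM(d, k = 2), silent = TRUE)  # random initialization each run
  if (!inherits(fit, "try-error") &&
      (is.null(best_fit) || fit$loglik > best_fit$loglik)) {
    best_fit <- fit  # keep the fit with the highest log-likelihood so far
  }
}
summary(best_fit)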

Results from the 100k data set (1k looks similar) for k=2:

summary of normalmixEM object:
          comp 1     comp 2
lambda 0.8223187 0.17768126
mu     0.0119736 0.01614824
sigma  0.0050708 0.00156856
loglik at estimate:  391868.6 

[image: density plot of the bad k=2 fit from plot(mixmdl, which = 2)]

Output from a 'good' run (100k, k=2):

summary of normalmixEM object:
           comp 1     comp 2
lambda 0.29123448 0.70876552
mu     0.00687830 0.01511378
sigma  0.00119344 0.00370195
loglik at estimate:  402918.3 

example_data.zip

hberger commented Sep 21, 2023

Just an additional comment: I can fix the problem by using custom initialization values derived from k-means on the data:

# initialize from a 2-cluster k-means solution
km <- kmeans(d, centers = 2)
mu_km <- km$centers[, 1]            # cluster means as starting mu
sd_km <- tapply(d, km$cluster, sd)  # within-cluster SDs as starting sigma
mixmdl <- normalmixEM(d, k = 2, fast = TRUE, mu = mu_km, sigma = sd_km)

This yields stable estimates of the correct population parameters.
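
The same idea could presumably be extended to the mixing proportions (a sketch; using the k-means cluster sizes as starting lambda is my assumption, not something I have tested):

lambda_km <- as.vector(table(km$cluster)) / length(d)  # cluster proportions as starting lambda
mixmdl <- normalmixEM(d, k = 2, fast = TRUE,
                      lambda = lambda_km, mu = mu_km, sigma = sd_km)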
