First of all, thanks for the groundbreaking work!

I was trying to get more fine-grained topics with a higher k, both on the wiki corpus and on a corpus of my own (around 60,000 news articles). However, even though I set a higher k, I consistently get the following warning message:
Warning message:
In calculate_gmm(wiki_dfm_filtered, seed = 46709394) :
Cannot converge with a model with k = 20. Actual k = 3
I am not sure whether the low number of topics is due to the dimension reduction in step 3 (filter_dfm()) or to the GMM algorithm. I did try changing the multiplication_factor argument to retain more dimensions, but the result is no better.
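For context, the relevant calls look roughly like this (a sketch from memory; the name of the input embedded dfm and the exact argument layout of filter_dfm() are assumptions on my part, only k, multiplication_factor and the seed match the run above):

library(rectr)

## wiki_dfm is assumed to be the BERT-aligned embedded dfm created in the
## earlier steps; the filtered object name mirrors the warning message above.
wiki_dfm_filtered <- filter_dfm(wiki_dfm, k = 20, multiplication_factor = 2)
## I also tried larger multiplication_factor values to retain more dimensions.
wiki_gmm <- calculate_gmm(wiki_dfm_filtered, seed = 46709394)

Here's what I get: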
20-topic rectr model trained with a dfm with a dimension of 342 x 41 and de/en language(s).
Filtered with k = 20
Aligned word embeddings: bert
Defacto k = 3
I got similar results when I tried with my own corpus.
On a remotely related note, I am wondering if it would be possible to export the filtered dfm for other algorithms that might produce finer-grained topics, for example Gaussian LDA, an adaptation of LDA that takes word embeddings as input (here's a Python implementation: https://pypi.org/project/gaussianlda/; I can't find one in R).
Once again, thanks a lot for the great work!
Make sure your corpus actually has many cross-lingual topics (i.e. topics that exist in all languages). I am pretty sure the wiki corpus does not have 20. I don't know about your corpora.
In the paper, I wrote that the model can converge to a solution with fewer topics than your desired k when your corpus doesn't have enough variance to support a solution with a large k. GMM is pretty restrictive. You can't get fine-grained topics if your input is not fine-grained enough (as always, GIGO).
You can access the actual matrix of a filtered dfm object by adding $dfm after it (e.g. filtered_dfm$dfm). You can then use it anywhere.
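For instance, a minimal sketch of exporting it for the gaussianlda package mentioned above (the file name and the as.matrix() coercion are assumptions; only the $dfm accessor comes from the package):

## Pull the underlying matrix out of the filtered dfm object (the object
## name follows the warning message above) and write it to disk so it can
## be read from Python, e.g. for gaussianlda.
doc_matrix <- wiki_dfm_filtered$dfm
write.csv(as.matrix(doc_matrix), "wiki_dfm_filtered.csv", row.names = FALSE)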
Thanks for pointing this out! My corpus is a sample of news articles from the US and a few European countries, and I think the lack of cross-lingual topics is the reason (there could be over 20 topics in total, but I doubt that many of them are cross-lingual across all languages).