Obtaining fine-grained topics? #2

Closed

justinchuntingho opened this issue Sep 29, 2020 · 2 comments

@justinchuntingho
First of all, thanks for the groundbreaking work!

I was trying to get more fine-grained topics with a higher k, both on the wiki corpus and on a corpus of my own (around 60,000 news articles). However, even though I set a higher k, I constantly get the following warning message:

Warning message:
In calculate_gmm(wiki_dfm_filtered, seed = 46709394) :
  Cannot converge with a model with k = 20.  Actual k = 3

I am not sure if the low number of topics is due to the dimension reduction in step 3 (filter_dfm()) or to the GMM algorithm. I did try changing the multiplication_factor argument to retain more dimensions, but the results are no better.

Here's a reproducible example:

library("rectr")

wiki_corpus <- create_corpus(wiki$content, wiki$lang)
wiki_dfm <- transform_dfm_boe(wiki_corpus, noise = TRUE)
wiki_dfm

wiki_dfm_filtered <- filter_dfm(wiki_dfm, k = 20, multiplication_factor = 2)
wiki_dfm_filtered

wiki_gmm <- calculate_gmm(wiki_dfm_filtered, seed = 46709394)
wiki_gmm

Here's what I get:

20-topic rectr model trained with a dfm with a dimension of 342 x 41 and de/en language(s).
Filtered with k =  20
Aligned word embeddings: bert
Defacto k = 3

sessionInfo:

R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rectr_0.1.3

loaded via a namespace (and not attached):
 [1] reticulate_1.16    modeltools_0.2-23  tidyselect_1.1.0   remotes_2.2.0     
 [5] purrr_0.3.4        lattice_0.20-41    colorspace_1.4-1   vctrs_0.3.4       
 [9] generics_0.0.2     testthat_2.3.2     stats4_4.0.2       usethis_1.6.3     
[13] SnowballC_0.7.0    yaml_2.2.1         rlang_0.4.7        pkgbuild_1.1.0    
[17] pillar_1.4.6       glue_1.4.2         withr_2.3.0        sessioninfo_1.1.1 
[21] lifecycle_0.2.0    munsell_0.5.0      gtable_0.3.0       mvtnorm_1.1-1     
[25] devtools_2.3.2     memoise_1.1.0      callr_3.4.4        ps_1.3.4          
[29] flexmix_2.3-15     fansi_0.4.1        tokenizers_0.2.1   Rcpp_1.0.5        
[33] scales_1.1.1       backports_1.1.10   desc_1.2.0         pkgload_1.1.0     
[37] RcppParallel_5.0.2 jsonlite_1.7.1     RSpectra_0.16-0    fs_1.5.0          
[41] fastmatch_1.1-0    stopwords_2.0      ggplot2_3.3.2      digest_0.6.25     
[45] stringi_1.5.3      processx_3.4.4     dplyr_1.0.2        quanteda_2.1.1    
[49] grid_4.0.2         rprojroot_1.3-2    cli_2.0.2          tools_4.0.2       
[53] magrittr_1.5       tibble_3.0.3       crayon_1.3.4       pkgconfig_2.0.3   
[57] ellipsis_0.3.1     Matrix_1.2-18      data.table_1.13.0  prettyunits_1.1.1 
[61] assertthat_0.2.1   rstudioapi_0.11    R6_2.4.1           nnet_7.3-14       
[65] compiler_4.0.2  

I got similar results when I tried with my own corpus.

On a remotely related note, I am wondering if it would be possible to export the filtered dfm for other algorithms that might produce finer-grained topics? For example, Gaussian LDA, an adaptation of LDA that takes in word embeddings (here's a Python implementation: https://pypi.org/project/gaussianlda/; I can't find one in R).

Once again, thanks a lot for the great work!

@chainsawriot (Owner) commented Sep 29, 2020

Make sure your corpus actually has many cross-lingual topics (i.e. topics that exist in all languages). I am pretty sure the wiki corpus does not have 20; I don't know about your corpora.

In the paper, I wrote that it is possible to converge to a solution with fewer than the desired k when your corpus doesn't have enough variance to support a solution with a large k. GMM is pretty restrictive. You can't get fine-grained topics if your input is not fine-grained enough (as always, GIGO).
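
To illustrate the mechanism, here is a minimal sketch, not rectr's exact internals, assuming the GMM step behaves like a standard flexmix fit (flexmix appears in the loaded namespaces above): components whose prior shrinks below a minimum are dropped during EM, so the fitted k can end up smaller than the requested k when the data cannot support that many clusters.

library("flexmix")

set.seed(46709394)
# Pure noise with the same dimensions as the filtered wiki dfm (342 x 41):
# no real cluster structure, so a 20-component solution cannot be supported
x <- matrix(rnorm(342 * 41), ncol = 41)
fit <- flexmix(x ~ 1, k = 20, model = FLXMCmvnorm())
fit@k0  # requested number of components: 20
fit@k   # de facto number of components after EM, typically far fewer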

You can access the actual matrix of a filtered dfm object by appending $dfm (e.g. filtered_dfm$dfm). You can then use it anywhere.

The relevant line in rectr/R/rectr.R (line 182 at 649ac80):

input_dfm$dfm <- svd_dfm[, i:max_d]
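
For instance, a minimal sketch of exporting that matrix for use elsewhere (the output file name here is just a placeholder):

# Extract the dimension-reduced document matrix from the filtered dfm
mat <- wiki_dfm_filtered$dfm
# Write it out, e.g. as CSV, for downstream tools such as the Python
# gaussianlda package mentioned above (file name is hypothetical)
write.csv(mat, "wiki_dfm_filtered.csv", row.names = FALSE)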

@justinchuntingho (Author) commented

Thanks for pointing this out! My corpus is a sample of news articles from the US and a few European countries, and I think the lack of cross-lingual topics is the reason (there could be over 20 topics in total, but I doubt many of them are cross-lingual across all languages).
