## 'Grid Search' for Optimal Gensim LDA Parameters

### Determining fit based on topic circles of uniform size and minimum overlap
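The fit criterion here is visual, but it can be made roughly quantitative. The following is a minimal sketch, not part of the original analysis: it assumes each topic circle is available as a hypothetical `(x, y, radius)` tuple in a common unit, counts the pairs of circles that intersect, and measures how uniform the circle areas are. The helper names `overlapping_pairs` and `size_variation` are illustrative only.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def overlapping_pairs(circles):
    """Return index pairs (i, j) of circles whose interiors intersect.

    Each circle is an (x, y, radius) tuple; all values share one unit.
    """
    pairs = []
    for (i, (x1, y1, r1)), (j, (x2, y2, r2)) in combinations(enumerate(circles), 2):
        # Two circles overlap when their centers are closer than the sum of radii.
        if math.hypot(x2 - x1, y2 - y1) < r1 + r2:
            pairs.append((i, j))
    return pairs


def size_variation(circles):
    """Coefficient of variation of the circle areas (0.0 means all equal)."""
    areas = [math.pi * r * r for _, _, r in circles]
    return pstdev(areas) / mean(areas)


# Hypothetical layout: three equally sized topics, two of which overlap.
circles = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (5.0, 5.0, 1.0)]
print(overlapping_pairs(circles))
print(size_variation(circles))
```

Fewer overlapping pairs and a lower variation value would correspond to the "uniform size and minimum overlap" target used throughout this notebook.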

#### Testing the effect of the number of passes on model performance, using 6 topics, a chunksize of 277 (corpus length divided by 4 cores), and min_prob of 0.01 (default)
#### Results: topic circles become more uniform and overlap less as the number of passes increases. Best at 30 passes (the maximum used in the grid search)

In [11]:
import gensim

dictionary = gensim.corpora.Dictionary.load('lda_mod/well_docs.dict')
corpus = gensim.corpora.MmCorpus('lda_mod/well_docs.mm')
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 6, passes: 2, chunksize: 277, min probability: 0.01.model')

In [12]:
# Newer pyLDAvis releases expose this module as pyLDAvis.gensim_models.
import pyLDAvis.gensim

# Note: the topic numbers shown here are not the ones assigned by gensim.
# pyLDAvis renumbers topics in descending order of their share of tokens in the entire corpus.
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)

In [9]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 6, passes: 10, chunksize: 277, min probability: 0.01.model')

In [10]:
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)

In [17]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 6, passes: 20, chunksize: 277, min probability: 0.01.model')

In [18]:
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)

In [15]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 6, passes: 30, chunksize: 277, min probability: 0.01.model')

In [16]:
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)
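The four load-and-display steps above differ only in the `passes` value embedded in the model file name. As a sketch, that pattern could be factored into two small helpers. The names `model_path` and `show_model` are hypothetical and not part of the original notebook; the file-name pattern is copied from the paths used above.

```python
MODEL_DIR = 'lda_models'


def model_path(n_topics, passes, chunksize=277, min_prob=0.01):
    """Build the saved-model path using the naming pattern seen in this notebook."""
    return (f'{MODEL_DIR}/n_topics: {n_topics}, passes: {passes}, '
            f'chunksize: {chunksize}, min probability: {min_prob}.model')


def show_model(n_topics, passes, corpus, dictionary, chunksize=277, min_prob=0.01):
    """Load one saved model and return its pyLDAvis display object."""
    # Imported lazily so model_path() can be used without these packages installed.
    import gensim
    import pyLDAvis
    import pyLDAvis.gensim

    lda = gensim.models.ldamodel.LdaModel.load(
        model_path(n_topics, passes, chunksize, min_prob))
    return pyLDAvis.display(pyLDAvis.gensim.prepare(lda, corpus, dictionary))


# List the paths for the 6-topic models compared above.
for passes in (2, 10, 20, 30):
    print(model_path(6, passes))
```

In a notebook, `show_model(6, 30, corpus, dictionary)` as the last expression of a cell would render the same visualization as the cell above.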

#### Testing 7, 8, and 9 topics with 30 passes, chunksize of 277, and min prob of 0.01
#### Result: least overlap found at 7 topics

In [51]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 7, passes: 30, chunksize: 277, min probability: 0.01.model')
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)

In [52]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 8, passes: 30, chunksize: 277, min probability: 0.01.model')
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)

In [53]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 9, passes: 30, chunksize: 277, min probability: 0.01.model')
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)

#### Testing 7 topics with 30 passes and chunksize of 277, varying min prob over 0.01, 0.25, 0.5, and 0.75
#### Result: no noticeable changes in the visualization. Will leave min prob at its default and remove it from the grid search

In [54]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 7, passes: 30, chunksize: 277, min probability: 0.01.model')
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)

In [55]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 7, passes: 30, chunksize: 277, min probability: 0.25.model')
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)

In [56]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 7, passes: 30, chunksize: 277, min probability: 0.5.model')
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)

In [57]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 7, passes: 30, chunksize: 277, min probability: 0.75.model')
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)
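One possible explanation for the identical plots, assuming `min prob` maps to gensim's `minimum_probability` setting: that value acts as a threshold on which topics are reported for a document, rather than as something that changes the fitted topics. The sketch below mimics that kind of filtering in plain Python. It is not gensim's own code, and both `filter_topics` and the distribution values are hypothetical.

```python
def filter_topics(topic_dist, minimum_probability=0.01):
    """Keep (topic_id, probability) pairs at or above the threshold."""
    return [(topic_id, prob)
            for topic_id, prob in enumerate(topic_dist)
            if prob >= minimum_probability]


# Hypothetical per-document topic distribution over 7 topics (sums to 1).
topic_dist = [0.55, 0.30, 0.10, 0.03, 0.012, 0.005, 0.003]

# The underlying distribution is the same for every threshold;
# only the number of topics reported changes.
for threshold in (0.01, 0.25, 0.5, 0.75):
    print(threshold, filter_topics(topic_dist, threshold))
```

Under that assumption, the threshold would matter when querying document topics but not when comparing topic layouts, which is consistent with dropping it from the grid search.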

#### Testing the effect of the number of passes on model performance, using 10 topics, a chunksize of 277 (corpus length divided by 4 cores), and min_prob of 0.01 (default)
#### Results: topic circles become more uniform and overlap less as the number of passes increases. Best at 30 passes (the maximum used in the grid search)
#### 10 topics results in 2 medium-sized overlaps and 1 small overlap at 30 passes.
#### Will evaluate topic counts between 6 and 10 at 30 passes to see if overlap can be reduced.

In [39]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 10, passes: 2, chunksize: 277, min probability: 0.01.model')
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)

In [21]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 10, passes: 10, chunksize: 277, min probability: 0.01.model')

In [22]:
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)

In [23]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 10, passes: 20, chunksize: 277, min probability: 0.01.model')
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)

In [24]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 10, passes: 30, chunksize: 277, min probability: 0.01.model')
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)

#### Testing the effect of the number of passes on model performance, using 14 topics, a chunksize of 277 (corpus length divided by 4 cores), and min_prob of 0.01 (default)
#### Results similar to the 10-topic run: more uniformity and less overlap as the number of passes increases. Best at 30 passes (the maximum used in the grid search)
#### 14 topics results in 2 major, 1 medium-sized, and 3 small overlaps at 30 passes.

In [25]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 14, passes: 2, chunksize: 277, min probability: 0.01.model')
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)

In [58]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 14, passes: 10, chunksize: 277, min probability: 0.01.model')
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)

In [59]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 14, passes: 20, chunksize: 277, min probability: 0.01.model')
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)

In [28]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 14, passes: 30, chunksize: 277, min probability: 0.01.model')
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)

#### Testing the effect of the number of passes on model performance, using 20 topics, a chunksize of 277 (corpus length divided by 4 cores), and min_prob of 0.01 (default)
#### Results similar to the 10- and 14-topic runs: more uniformity and less overlap as the number of passes increases. Best at 30 passes (the maximum used in the grid search)
#### 20 topics results in several major overlaps.

In [29]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 20, passes: 2, chunksize: 277, min probability: 0.01.model')
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)

In [30]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 20, passes: 10, chunksize: 277, min probability: 0.01.model')
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)

In [31]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 20, passes: 20, chunksize: 277, min probability: 0.01.model')
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)

In [32]:
lda = gensim.models.ldamodel.LdaModel.load('lda_models/n_topics: 20, passes: 30, chunksize: 277, min probability: 0.01.model')
to_disp = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(to_disp)
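The models compared in this notebook span a small grid of topic counts and pass counts at a fixed chunksize. A minimal sketch of how such a grid might be enumerated follows. The helper names are hypothetical, and the corpus size of 1108 documents is an assumption chosen only so that dividing by 4 cores gives the chunksize of 277 used above; the actual corpus length is not stated here.

```python
import math
from itertools import product


def chunksize_for(n_docs, n_workers):
    """Split the corpus evenly across workers, rounding up."""
    return math.ceil(n_docs / n_workers)


def parameter_grid(topic_counts, pass_counts, chunksize):
    """Return one parameter dict per (num_topics, passes) combination."""
    return [{'num_topics': k, 'passes': p, 'chunksize': chunksize}
            for k, p in product(topic_counts, pass_counts)]


# Assumed corpus size (1108) and the topic/pass values explored above.
grid = parameter_grid([6, 10, 14, 20], [2, 10, 20, 30], chunksize_for(1108, 4))
print(len(grid))
print(grid[0])
```

Each dict uses keyword names accepted by gensim's `LdaModel`, so a training loop could pass it along as `gensim.models.LdaModel(corpus, id2word=dictionary, **params)`; whether the original models were trained this way is not shown in this notebook.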