# Topic modeling

The [topicmod module](api.rst#module-tmtoolkit.topicmod) offers a wide range of tools to facilitate [topic modeling](https://cacm.acm.org/magazines/2012/4/147361-probabilistic-topic-models/fulltext) with Python. This chapter will introduce the following techniques: 

- [parallel topic model computation for different corpora and/or parameter sets](#Computing-topic-models-in-parallel)
- evaluation of topic models (including finding the "optimal" number of topics)
- common statistics for topic models
- export of topic models and summaries to different file formats
- visualization of topic models

## An example document-term matrix

tmtoolkit supports topic models that are computed from document-term matrices (DTMs). Just as in the previous chapter, we will first generate a DTM. However, this time the sample will be bigger:

In [1]:
import random
random.seed(20191120)   # to make the sampling reproducible

import numpy as np
np.set_printoptions(precision=5)

from tmtoolkit.corpus import Corpus

corpus = Corpus.from_builtin_corpus('english-NewsArticles').sample(100)

We will now also generate two DTMs, because we later want to show how you can compute topic models for two different DTMs in parallel. First, we do some general preprocessing:

In [2]:
from tmtoolkit.preprocess import TMPreproc

preproc = TMPreproc(corpus)
preproc.pos_tag() \
    .lemmatize() \
    .tokens_to_lowercase() \
    .remove_special_chars_in_tokens()

<TMPreproc [100 documents]>

Now we first apply a more "relaxed" cleaning:

In [3]:
preproc_bigger = preproc.copy() \
    .clean_tokens(remove_shorter_than=2) \
    .remove_common_tokens(df_threshold=0.95) \
    .remove_uncommon_tokens(df_threshold=0.05)

preproc_bigger.n_docs, preproc_bigger.vocabulary_size

(100, 846)

Another copy of `preproc` will apply more aggressive cleaning and hence result in a smaller vocabulary size:

In [4]:
preproc_smaller = preproc.copy() \
    .filter_for_pos('N') \
    .clean_tokens(remove_numbers=True, remove_shorter_than=2) \
    .remove_common_tokens(df_threshold=0.9) \
    .remove_uncommon_tokens(df_threshold=0.1)

del preproc

preproc_smaller.n_docs, preproc_smaller.vocabulary_size

(100, 149)

We will create the document labels, vocabulary arrays and DTMs for both versions now:

In [5]:
# doc_labels are the same for both

doc_labels = np.array(preproc_bigger.doc_labels)
doc_labels[:10]

array(['NewsArticles-1032', 'NewsArticles-1036', 'NewsArticles-104',
       'NewsArticles-1043', 'NewsArticles-1048', 'NewsArticles-1090',
       'NewsArticles-1126', 'NewsArticles-113', 'NewsArticles-1137',
       'NewsArticles-1141'], dtype='<U17')

In [6]:
vocab_bg = np.array(preproc_bigger.vocabulary)
vocab_sm = np.array(preproc_smaller.vocabulary) 

In [7]:
dtm_bg = preproc_bigger.dtm
dtm_sm = preproc_smaller.dtm

del preproc_bigger, preproc_smaller  # don't need these any more

dtm_bg, dtm_sm

(<100x846 sparse matrix of type '<class 'numpy.int32'>'
 	with 10356 stored elements in Compressed Sparse Row format>,
 <100x149 sparse matrix of type '<class 'numpy.int32'>'
 	with 2482 stored elements in Compressed Sparse Row format>)

We now have two sparse DTMs, `dtm_bg` (from the bigger preprocessed data) and `dtm_sm` (from the smaller preprocessed data), an array of document labels `doc_labels` that represents the rows of both DTMs, and the vocabulary arrays `vocab_bg` and `vocab_sm` that represent the columns of the respective DTMs. We will use this data for the remainder of the chapter.
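
To make the relationship between document labels, vocabulary and DTM concrete, here is a minimal sketch that builds a tiny dense DTM from tokenized documents in plain Python. The helper `build_dtm` and the toy data are made up for illustration; tmtoolkit itself produces sparse matrices as shown above.

```python
from collections import Counter

def build_dtm(tokenized_docs):
    """Return (doc_labels, vocab, dtm) for a dict that maps document
    labels to token lists (hypothetical helper for illustration)."""
    doc_labels = sorted(tokenized_docs)
    vocab = sorted({tok for toks in tokenized_docs.values() for tok in toks})
    dtm = []
    for label in doc_labels:
        counts = Counter(tokenized_docs[label])
        # one row per document, one column per vocabulary term
        dtm.append([counts[term] for term in vocab])
    return doc_labels, vocab, dtm

toy = {
    'doc-b': ['market', 'china', 'market'],
    'doc-a': ['police', 'attack'],
}

toy_labels, toy_vocab, toy_dtm = build_dtm(toy)
print(toy_labels)   # row labels
print(toy_vocab)    # column labels
print(toy_dtm)      # term counts per document
```

Row `i` of the matrix belongs to `toy_labels[i]` and column `j` to `toy_vocab[j]`, which is the same alignment that `doc_labels`, `vocab_bg` and `vocab_sm` have with `dtm_bg` and `dtm_sm`.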

## Computing topic models in parallel

tmtoolkit allows you to compute topic models in parallel, making use of all processor cores in your machine. Parallelization can be done per input DTM, per hyperparameter set, or as a combination of both. Hyperparameters control the number of topics and their "granularity". We will later have a look at the role of hyperparameters and how to find an optimal combination for a given dataset by means of topic model evaluation.

For now, we will concentrate on computing the topic models for both of our DTMs in parallel. tmtoolkit supports three very popular packages for topic modeling, which do the actual work of computing the model from the input matrix. They can all be accessed in separate sub-modules of the [topicmod module](api.rst#module-tmtoolkit.topicmod):

- [topicmod.tm_lda](api.html#module-tmtoolkit.topicmod.tm_lda) provides an interface for the [lda](https://lda.readthedocs.io/en/latest/) package
- [topicmod.tm_sklearn](api.html#module-tmtoolkit.topicmod.tm_sklearn) provides an interface for the [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) package
- [topicmod.tm_gensim](api.html#module-tmtoolkit.topicmod.tm_gensim) provides an interface for the [Gensim](https://radimrehurek.com/gensim/) package

Each of these sub-modules offers at least two functions that work with the respective package: `compute_models_parallel()` for general parallel model computation and `evaluate_topic_models()` for parallel model computation and evaluation (discussed later). For now, we want to compute two models in parallel with the [lda](https://lda.readthedocs.io/en/latest/) package and hence use [compute_models_parallel()](api.rst#tmtoolkit.topicmod.tm_lda.compute_models_parallel) from [topicmod.tm_lda](api.html#module-tmtoolkit.topicmod.tm_lda).

We need to provide two things for this function: first, the input matrices as a dict that maps labels to the respective DTMs; second, the hyperparameters to use for the model computations. Note that each topic modeling package has different hyperparameters and you should refer to its documentation to find out which hyperparameters you can provide. For lda, we set the number of topics `n_topics` to 10 and the number of iterations for the Gibbs sampling process `n_iter` to 1000. We always want to use the same hyperparameters, so we pass these as `constant_parameters`. If we wanted to create models for a whole range of parameters, e.g. for different numbers of topics, we could provide `varying_parameters`. We will check this out later when we evaluate topic models.
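
How constant and varying parameters combine with the input DTMs can be sketched in a few lines of plain Python. This is only a conceptual illustration, not tmtoolkit's internal code: the helpers `build_parameter_sets` and `plan_computations` are hypothetical, and the sketch assumes that varying parameters are given as a list of dicts that are each merged with the constant parameters.

```python
def build_parameter_sets(constant_parameters, varying_parameters=None):
    """Merge the fixed hyperparameters into every varying parameter set
    (hypothetical helper for illustration)."""
    if not varying_parameters:
        return [dict(constant_parameters)]
    return [{**constant_parameters, **varying} for varying in varying_parameters]

def plan_computations(dtms, constant_parameters, varying_parameters=None):
    """List one (DTM label, parameter set) task per model to compute."""
    param_sets = build_parameter_sets(constant_parameters, varying_parameters)
    return [(label, params) for label in dtms for params in param_sets]

# placeholders stand in for the real sparse matrices
toy_dtms = {'bigger': None, 'smaller': None}

# only constant parameters: one model per DTM
fixed_only = plan_computations(toy_dtms, {'n_topics': 10, 'n_iter': 1000})

# constant plus varying parameters: one model per DTM and parameter set
varying = [{'n_topics': k} for k in (5, 10)]
tasks = plan_computations(toy_dtms, {'n_iter': 1000}, varying)

for label, params in tasks:
    print(label, params)
```

Each task is independent of the others, which is what makes it possible to distribute the computations across processor cores.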

<div class="alert alert-info">
    
Note that for proper topic modeling, we shouldn't just set the number of topics, but try to find it out via evaluation methods. We should also check if the algorithm converged using the provided likelihood estimations. We will do both later on, but now focus on `compute_models_parallel()`.

</div>

In [8]:
import logging
import warnings
from tmtoolkit.topicmod.tm_lda import compute_models_parallel

# suppress the "INFO" messages and warnings from lda
logger = logging.getLogger('lda')
logger.addHandler(logging.NullHandler())
logger.propagate = False

warnings.filterwarnings('ignore')

# set data to use
dtms = {
    'bigger': dtm_bg,
    'smaller': dtm_sm
}

# and fixed hyperparameters
lda_params = {
    'n_topics': 10,
    'n_iter': 1000
}

models = compute_models_parallel(dtms, constant_parameters=lda_params)
models

defaultdict(list,
            {'smaller': [({'n_topics': 10, 'n_iter': 1000},
               <lda.lda.LDA at 0x7fc12c740898>)],
             'bigger': [({'n_topics': 10, 'n_iter': 1000},
               <lda.lda.LDA at 0x7fc12c740278>)]})

As expected, two models were created. These can be accessed via the labels that we used to define the `dtms` dict:

In [9]:
models['smaller']

[({'n_topics': 10, 'n_iter': 1000}, <lda.lda.LDA at 0x7fc12c740898>)]

We can see that for each input DTM, we get a list of 2-tuples. The first element in each tuple is a dict of the hyperparameters that were used to compute the model; the second element is the actual topic model (the `<lda.lda.LDA ...>` object). This structure looks a bit complex because it also supports varying parameters. Since we only have one fixed set of hyperparameters per DTM, each list has length 1.
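
When a result contains several parameter sets per DTM, it can be convenient to look up a model by its hyperparameters instead of by list position. The following sketch shows one way to do this for a structure shaped like the output above. The helper `find_model` is hypothetical and not part of tmtoolkit, and plain strings stand in for the fitted model objects.

```python
from collections import defaultdict

def find_model(results, dtm_label, **wanted):
    """Return the first model for `dtm_label` whose parameter dict
    matches all keyword arguments (hypothetical helper)."""
    for params, model in results.get(dtm_label, []):
        if all(params.get(name) == value for name, value in wanted.items()):
            return model
    raise KeyError(f'no model for {dtm_label!r} with parameters {wanted}')

# same shape as the result shown above: label -> list of (params, model)
toy_results = defaultdict(list)
toy_results['smaller'].append(({'n_topics': 10, 'n_iter': 1000}, 'model-sm-10'))
toy_results['smaller'].append(({'n_topics': 20, 'n_iter': 1000}, 'model-sm-20'))
toy_results['bigger'].append(({'n_topics': 10, 'n_iter': 1000}, 'model-bg-10'))

picked = find_model(toy_results, 'smaller', n_topics=20)
print(picked)
```

With only one parameter set per DTM, as in this chapter, indexing with `[0][1]` is equivalent and shorter.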

We will now access the models and print the top words per topic by using [print_ldamodel_topic_words()](api.rst#tmtoolkit.topicmod.model_io.print_ldamodel_topic_words):

In [10]:
from tmtoolkit.topicmod.model_io import print_ldamodel_topic_words

model_sm = models['smaller'][0][1]
print_ldamodel_topic_words(model_sm.topic_word_, vocab_sm, top_n=3)

topic_1
> #1. people (0.137050)
> #2. country (0.101592)
> #3. year (0.075717)
topic_2
> #1. election (0.088436)
> #2. leader (0.088436)
> #3. party (0.088436)
topic_3
> #1. germany (0.094707)
> #2. percent (0.088910)
> #3. year (0.059924)
topic_4
> #1. mr (0.209053)
> #2. investigation (0.090129)
> #3. statement (0.082922)
topic_5
> #1. us (0.129454)
> #2. report (0.127012)
> #3. russia (0.112359)
topic_6
> #1. force (0.111813)
> #2. syria (0.089941)
> #3. source (0.077790)
topic_7
> #1. trump (0.202530)
> #2. president (0.104168)
> #3. house (0.090667)
topic_8
> #1. china (0.198046)
> #2. company (0.123237)
> #3. market (0.092433)
topic_9
> #1. day (0.117525)
> #2. child (0.102199)
> #3. family (0.084319)
topic_10
> #1. police (0.164607)
> #2. attack (0.119975)
> #3. officer (0.103239)


In [11]:
model_bg = models['bigger'][0][1]
print_ldamodel_topic_words(model_bg.topic_word_, vocab_bg, top_n=3)

topic_1
> #1. police (0.081255)
> #2. officer (0.055464)
> #3. man (0.043858)
topic_2
> #1. trump (0.086968)
> #2. president (0.053012)
> #3. house (0.048043)
topic_3
> #1. say (0.049042)
> #2. year (0.033772)
> #3. one (0.029955)
topic_4
> #1. china (0.058659)
> #2. company (0.048884)
> #3. market (0.029333)
topic_5
> #1. people (0.033917)
> #2. country (0.027558)
> #3. make (0.025136)
topic_6
> #1. would (0.044348)
> #2. leader (0.026519)
> #3. new (0.025605)
topic_7
> #1. say (0.145526)
> #2. report (0.036203)
> #3. mr (0.020999)
topic_8
> #1. people (0.037668)
> #2. force (0.037668)
> #3. syria (0.030300)
topic_9
> #1. election (0.068604)
> #2. vote (0.054450)
> #3. germany (0.053361)
topic_10
> #1. us (0.078872)
> #2. russian (0.052268)
> #3. russia (0.044667)
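
The listings above are derived from the topic-word distribution, in which each row holds one topic's probabilities over the vocabulary. As a rough sketch of the idea (not the implementation of `print_ldamodel_topic_words()`), the top words of a topic can be found by sorting its row and mapping the highest-ranked column indices back to the vocabulary. The helper name and the toy numbers below are made up for illustration.

```python
def top_words_per_topic(topic_word, vocab, top_n=3):
    """For each topic (row), return the top_n (word, probability) pairs
    in descending order of probability (hypothetical helper)."""
    result = []
    for row in topic_word:
        # column indices sorted by probability, highest first
        top_idx = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:top_n]
        result.append([(vocab[i], row[i]) for i in top_idx])
    return result

demo_vocab = ['china', 'market', 'police', 'trump']
demo_topic_word = [
    [0.50, 0.30, 0.15, 0.05],   # topic 1
    [0.05, 0.10, 0.25, 0.60],   # topic 2
]

demo_top = top_words_per_topic(demo_topic_word, demo_vocab, top_n=2)

for topic_num, words in enumerate(demo_top, start=1):
    print(f'topic_{topic_num}')
    for rank, (word, prob) in enumerate(words, start=1):
        print(f'> #{rank}. {word} ({prob:.6f})')
```

Note that the same word can rank highly in several topics, as "say" and "people" do in the model for the bigger DTM.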

