# Topics – Easy Topic Modeling in Python

The text mining technique **Topic Modeling** has become a popular statistical method for clustering documents. This [Jupyter notebook](http://jupyter.org/) introduces a step-by-step workflow covering data preprocessing, the actual topic modeling with **latent Dirichlet allocation** (LDA), which learns the relationships between words, topics, and documents, and some interactive visualizations to explore the model.

LDA, introduced in the context of text analysis in [2003](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf), is an instance of a more general class of models called **mixed-membership models**. Involving a number of distributions and parameters, the topic model is typically fitted using [Gibbs sampling](https://en.wikipedia.org/wiki/Gibbs_sampling) with conjugate priors and is based purely on word frequencies.
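To make the idea of a mixed-membership model more tangible, here is a minimal sketch of the generative story LDA assumes: every document mixes all topics in its own proportions, and every word is drawn from one of those topics. The topics, their word probabilities and the helper names (`sample_dirichlet`, `generate_document`) are made up for illustration and are not part of `dariah_topics` or MALLET. The sketch only generates toy data; it does not fit a model.

```python
import random


def sample_dirichlet(concentrations, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(c, 1.0) for c in concentrations]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, length, rng):
    """Generate one toy document following LDA's generative story."""
    # Each document mixes all topics in its own proportions.
    theta = sample_dirichlet([alpha] * len(topic_word), rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        topic = rng.choices(range(len(topic_word)), weights=theta)[0]
        vocabulary = list(topic_word[topic])
        weights = list(topic_word[topic].values())
        words.append(rng.choices(vocabulary, weights=weights)[0])
    return theta, words


# Two hypothetical topics, each a probability distribution over words.
toy_topics = [
    {"krieg": 0.5, "soldat": 0.3, "front": 0.2},
    {"brief": 0.4, "tagebuch": 0.4, "familie": 0.2},
]

rng = random.Random(42)
theta, words = generate_document(toy_topics, alpha=0.5, length=8, rng=rng)
print([round(p, 2) for p in theta])
print(words)
```

Fitting LDA means going the other way: given only the words, estimate the topic proportions per document and the word distributions per topic.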

Numerous introductions to topic modeling have been written for humanists (e.g. [this one](http://scottbot.net/topic-modeling-for-humanists-a-guided-tour/)), which provide more detail on its technical and epistemic properties.

For this workflow, you will need a corpus (a set of texts) as plain text (`.txt`) or [TEI XML](http://www.tei-c.org/index.xml) (`.xml`). Using the `dariah_topics` package, you also have the ability to process the output of [DARIAH-DKPro-Wrapper](https://github.com/DARIAH-DE/DARIAH-DKPro-Wrapper), a command-line tool for *natural language processing*.

Topic modeling works best with very large corpora. The [TextGrid Repository](https://textgridrep.org/) is a great place to start searching for text data. To demonstrate the technique, we provide one small text collection in the folder `grenzboten_sample` containing 15 diary excerpts and 15 war diary excerpts, which appeared in *Die Grenzboten*, a German newspaper of the late 19th and early 20th century.

**Of course, you can work with your own corpus in this notebook.**

We're relying on the LDA implementation by [Andrew McCallum](https://people.cs.umass.edu/~mccallum/), called [MALLET](http://mallet.cs.umass.edu/topics.php), which is known to be very robust. Aside from that, we provide two more Jupyter notebooks:

* [IntroducingGensim](IntroducingGensim.ipynb), using LDA by [Gensim](https://radimrehurek.com/project/gensim/), which is attractive because of its multi-core support.
* [IntroducingLda](IntroducingLda.ipynb), using LDA by [lda](http://pythonhosted.org/lda/index.html), which is very lightweight.

For more information in general, have a look at the [documentation](http://dev.digital-humanities.de/ci/job/DARIAH-Topics/doclinks/1/).

## First step: Installing dependencies

To work within this Jupyter notebook, you will have to import the `dariah_topics` library. As you do, `dariah_topics` also imports a couple of external libraries, which have to be installed first. `pip` is the preferred installer program in Python. Starting with Python 3.4, it is included by default with the Python binary installers. If you are interested in `pip`, have a look at [this website](https://docs.python.org/3/installing/index.html).

To install the `dariah_topics` library with all dependencies, open your command line, change to the folder `Topics` with `cd`, and run:

```
pip install -r requirements.txt
```

Alternatively, you can do:

```
python setup.py install
```

If you get any errors or are not able to install *all* dependencies properly, try [Stack Overflow](https://stackoverflow.com/questions/tagged/pip) for troubleshooting or create a new issue on our [GitHub page](https://github.com/DARIAH-DE/Topics).

**Important**: If you are on macOS or Linux, you will have to use `pip3` and `python3`.
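
If you are unsure whether the installation succeeded, a quick check along the following lines can help. This is only a sketch: the helper `missing_modules` is hypothetical, and the list of module names is taken from the import statements used later in this notebook, not from `requirements.txt`.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from `module_names` that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Top-level module names imported later in this notebook.
required = ["dariah_topics", "cophi_toolbox", "metadata_toolbox", "pandas", "bokeh"]
missing = missing_modules(required)
print("Missing modules:", missing or "none")
```

An empty result only tells you that the modules can be located, not that every dependency of those modules is installed in a compatible version.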

### Some final words
As you probably already know, code has to be written in the grey cells. You execute a cell by clicking the **Run** button (or pressing **Ctrl + Enter**). To run all cells of the notebook at once, click **Cell > Run All**, or **Kernel > Restart & Run All** if you want to restart the Python kernel first. To the left of an unexecuted cell you see `In [ ]:`. The empty brackets mean that the cell hasn't been executed yet. When you click **Run**, a star appears in the brackets (`In [*]:`), which means the process is running. In most cases, you won't see that star, because your computer is faster than your eyes. Only one cell can be executed at a time; all following executions wait in a queue. Once a cell has finished, a number appears in the brackets (`In [1]:`).

## Starting with topic modeling!

Execute the following cell to import modules from the `dariah_topics` library.

In [None]:
from cophi_toolbox import preprocessing
from dariah_topics import utils
from dariah_topics import postprocessing
from dariah_topics import visualization

Furthermore, we will need some additional functions from external libraries.

In [None]:
import metadata_toolbox.utils as metadata
import pandas as pd
from pathlib import Path

Let's not pay heed to any warnings right now and execute the following cell.

In [None]:
import warnings
warnings.filterwarnings('ignore')

## 1. Preprocessing

### 1.1. Reading a corpus of documents

#### Defining the path to the corpus folder

In the present example code, we are using the 30 diary excerpts from the folder `grenzboten_sample`. To use your own corpus, change the path accordingly.

In [None]:
path_to_corpus = Path('data', 'grenzboten_sample')

#### Specifying the pattern of filenames for metadata extraction

You can extract metadata from the filenames. For instance, if your text files are named like this:

```
goethe_1816_stella.txt
```

the pattern would look like this:

```
{author}_{year}_{title}
```

So, let's try this for the example corpus.

In [None]:
pattern = '{author}_{year}_{title}'
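
To illustrate how such a pattern can be matched against a filename, here is a minimal sketch based on regular expressions. It is not the implementation used by `metadata_toolbox`; the helpers `pattern_to_regex` and `parse_filename` are hypothetical names introduced for this example only.

```python
import re
from pathlib import Path


def pattern_to_regex(pattern):
    """Turn a pattern like '{author}_{year}_{title}' into a compiled regex."""
    # Split into literal text and {field} placeholders.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2:  # odd positions hold the field names
            regex += "(?P<{}>.+?)".format(part)
        else:  # even positions hold literal separators
            regex += re.escape(part)
    return re.compile(regex + "$")


def parse_filename(path, pattern):
    """Extract metadata fields from the stem of a filename."""
    match = pattern_to_regex(pattern).match(Path(path).stem)
    return match.groupdict() if match else {}


print(parse_filename("goethe_1816_stella.txt", "{author}_{year}_{title}"))
```

Because the last field runs to the end of the filename, a title may itself contain underscores, whereas author and year may not.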

#### Accessing file paths and metadata
We begin by creating a list of all the documents in the folder specified above. That list will tell the function `preprocessing.read_files` (see below) which text documents to read. Furthermore, based on filenames we can create some metadata, e.g. author and title.

In [None]:
meta = pd.concat([metadata.fname2metadata(str(path), pattern=pattern) for path in path_to_corpus.glob('*.txt')])
meta[:5] # by adding '[:5]' to the variable, only the first 5 elements will be printed

#### Read listed documents from folder

In [None]:
corpus = list(preprocessing.read_files(meta.index))
corpus[0][:255] # printing the first 255 characters of the first document

Your `corpus` contains as many elements (`documents`) as there are texts in your corpus. Each element of `corpus` is a list containing exactly one element, the text itself as one single string including all whitespace and punctuation:

```
[['This is the content of your first document.'],
 ['This is the content of your second document.'],
 ...
 ['This is the content of your last document.']]
```

### 1.3. Tokenize corpus
Now, your `documents` in `corpus` will be tokenized. Tokenization is the task of cutting a stream of characters into linguistic units, simply words or, more precisely, tokens. The tokenize function `dariah_topics` provides is a simple Unicode tokenizer. Depending on the corpus, it might be useful to use an external tokenizer function, or even develop your own, since its efficiency varies with language, epoch and text type.

In [None]:
tokenized_corpus = [list(preprocessing.tokenize(document)) for document in corpus]

At this point, each `document` is represented by a list of separate token strings. As above, have a look at the first document (which has the index `0`, as Python starts counting at 0) and show its first 13 tokens (indices 0 to 12, selected with the slice `0:13`, whose end index is exclusive).

In [None]:
tokenized_corpus[0][0:13]
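
If you want to experiment with your own tokenizer, the following sketch shows what a simple regex-based Unicode tokenizer might look like. It is not the tokenizer shipped with the library; the name `simple_tokenize`, the lowercasing and the handling of hyphens and apostrophes are assumptions made for this example.

```python
import re

# Runs of Unicode letters, optionally joined by a hyphen or apostrophe.
# [^\W\d_] matches any word character that is neither a digit nor an underscore.
TOKEN = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def simple_tokenize(text):
    """Yield lowercased word tokens from a string."""
    for match in TOKEN.finditer(text):
        yield match.group().lower()


print(list(simple_tokenize("Die Grenzboten erschienen 1841; Krieg, Tage-Buch, Übergänge!")))
```

Numbers and punctuation are dropped in this sketch, which is often what you want for topic modeling but may not suit every corpus.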

### 1.4 Create a document-term matrix

The LDA topic model is based on a [document-term matrix](https://en.wikipedia.org/wiki/Document-term_matrix) of the corpus. The matrix records how often each term occurs in each document: rows correspond to documents in the collection and columns correspond to terms. Because only these frequencies are kept and word order is discarded, the representation remains manageable even for large corpora.
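
As a rough illustration of the structure, the following sketch builds a tiny document-term matrix with the standard library only. The function `build_document_term_matrix` is a hypothetical stand-in and not the implementation behind `preprocessing.create_document_term_matrix`; the toy documents are made up.

```python
from collections import Counter


def build_document_term_matrix(tokenized_documents, labels):
    """Return (vocabulary, rows): one row of term counts per document."""
    counts = [Counter(tokens) for tokens in tokenized_documents]
    # Columns: every distinct term (type) in the collection, sorted for stability.
    vocabulary = sorted(set().union(*counts))
    rows = {label: [count[term] for term in vocabulary]
            for label, count in zip(labels, counts)}
    return vocabulary, rows


toy_docs = [["krieg", "front", "krieg"], ["brief", "front"]]
vocabulary, rows = build_document_term_matrix(toy_docs, ["doc_a", "doc_b"])
print(vocabulary)
for label, row in rows.items():
    print(label, row)
```

Most cells of a real matrix are zero, since any single document uses only a small part of the vocabulary.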

#### 1.4.1 Large corpus matrix

If you have a very large corpus, create a document-term matrix designed for large corpora.

In [None]:
document_term_matrix, document_ids, type_ids = preprocessing.create_document_term_matrix(tokenized_corpus,
                                                                                         meta['title'],
                                                                                         large_corpus=True)

#### 1.4.2 Small corpus matrix

Otherwise, use the document-term matrix designed for small corpora.

In [None]:
document_term_matrix = preprocessing.create_document_term_matrix(tokenized_corpus, meta['title'])
document_term_matrix[:5]

### 1.5. Feature removal

*Stopwords* (approximated here by the *most frequent tokens*) and *hapax legomena* (tokens that occur only once in the corpus) add noise to an LDA model and should be removed from the corpus or the document-term matrix, respectively. In this example, the 100 most frequent tokens will be categorized as stopwords.

**Hint**: Be careful when removing the most frequent tokens; you might remove tokens that are quite important for LDA. To get better results, it is highly recommended to use an external stopword list.

In this notebook, we combine the 100 most frequent tokens, hapax legomena and an external stopword list.

#### List the 100 most frequent words

If you have chosen the large corpus model, you will have to add `type_ids` to the function `preprocessing.list_mfw()`.

In [None]:
stopwords = preprocessing.list_mfw(document_term_matrix, most_frequent_tokens=100)

These are the five most frequent words:

In [None]:
stopwords[:5]

#### List hapax legomena

In [None]:
hapax_legomena = preprocessing.find_hapax_legomena(document_term_matrix)
print("Total number of types in corpus:", document_term_matrix.shape[1])
print("Total number of hapax legomena:", len(hapax_legomena))

#### Optional: Use external stopwordlist

In [None]:
path_to_stopwordlist = Path('data', 'stopwords', 'de.txt')
with path_to_stopwordlist.open('r', encoding='utf-8') as stopwordlist:
    external_stopwords = [line.strip() for line in stopwordlist]

#### Combine lists and remove content from `tokenized_corpus`

In [None]:
features = stopwords + hapax_legomena + external_stopwords
clean_tokenized_corpus = list(preprocessing.remove_features(features, tokenized_corpus=tokenized_corpus))
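
The steps of this section can be sketched on a toy corpus with the standard library alone. The function names below (`most_frequent_words`, `find_hapaxes`, `remove_tokens`) are hypothetical and only mirror the idea behind the `preprocessing` functions used above; they are not the library's implementation.

```python
from collections import Counter


def most_frequent_words(tokenized_documents, n):
    """Return the n most frequent tokens across all documents."""
    totals = Counter(token for doc in tokenized_documents for token in doc)
    return [token for token, _ in totals.most_common(n)]


def find_hapaxes(tokenized_documents):
    """Return tokens that occur exactly once in the whole collection."""
    totals = Counter(token for doc in tokenized_documents for token in doc)
    return [token for token, count in totals.items() if count == 1]


def remove_tokens(tokenized_documents, unwanted):
    """Drop every unwanted token from every document."""
    unwanted = set(unwanted)  # set membership keeps the filter fast
    return [[t for t in doc if t not in unwanted] for doc in tokenized_documents]


toy_docs = [["der", "krieg", "der", "front"], ["der", "brief", "krieg"]]
toy_features = most_frequent_words(toy_docs, 1) + find_hapaxes(toy_docs)
print(remove_tokens(toy_docs, toy_features))
```

In this toy case only `krieg` survives, which also shows why aggressive feature removal on a small corpus should be checked by hand.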

## 2. Model creation

#### Path to MALLET folder 

Now we must tell the library where to find the local instance of MALLET. If you managed to install MALLET, it is sufficient to set `path_to_mallet = 'mallet'`. If you store MALLET in a local folder, you have to specify the path to the binary explicitly (e.g. `path_to_mallet = 'C:/mallet-2.0.8/bin/mallet'`).

**Whitespaces are not allowed in the path!**

In [None]:
path_to_mallet = 'mallet'
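
Before instantiating anything, you may want to verify the path. The following sketch checks the two conditions mentioned above with the standard library; `check_mallet_path` is a hypothetical helper, not a function of `dariah_topics`, and how executables are detected may differ between operating systems.

```python
import shutil


def check_mallet_path(path):
    """Return a list of problems found with a MALLET path (empty if none)."""
    problems = []
    if any(character.isspace() for character in path):
        problems.append("path contains whitespace")
    # shutil.which resolves bare command names via PATH and also accepts
    # explicit paths to an executable file.
    if shutil.which(path) is None:
        problems.append("no executable found")
    return problems


print(check_mallet_path("mallet"))
```

An empty list means that neither problem was detected; it does not guarantee that MALLET itself runs correctly.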

### 2.1. Create `Mallet` object

Finally, we can instantiate the `Mallet` object.

In [None]:
Mallet = utils.Mallet(path_to_mallet)

The object `Mallet` has a method `import_tokenized_corpus()` to create a specific corpus file for MALLET.

In [None]:
mallet_corpus = Mallet.import_tokenized_corpus(clean_tokenized_corpus, meta['title'])

Furthermore, `Mallet` has the method `train_topics()` to create and train the LDA model. To create an LDA model, a couple of parameters have to be specified.

But first, if you are curious about any library, module, class or function, try `help()`. This can be very useful, because (at least in a well documented library) explanations of use and parameters will be printed. We're interested in the function `Mallet.train_topics()` in the module `dariah_topics.mallet`, so let's try:

```
help(Mallet.train_topics)
```

This will print something like this (in fact even more):

```
Help on method train_topics in module dariah_topics.mallet:

train_topics(mallet_binary, **kwargs) method of dariah_topics.mallet.Mallet instance
    Args:
        input_model (str): Absolute path to the binary topic model created by `output_model`.
        output_model (str): Write a serialized MALLET topic trainer object.
            This type of output is appropriate for pausing and restarting training,
            but does not produce data that can easily be analyzed.
        output_topic_keys (str): Write the top words for each topic and any
            Dirichlet parameters to file.
        topic_word_weights_file (str): Write unnormalized weights for every
            topic and word type.
        word_topic_counts_file (str): Write a sparse representation of topic-word
            assignments. By default this is null, indicating that no file will
            be written.
        output_doc_topics (str): Write the topic proportions per document, at
            the end of the iterations.
        num_topics (int): Number of topics. Defaults to 10.
        num_top_words (int): Number of keywords for each topic. Defaults to 10.
        num_iterations (int): Number of iterations. Defaults to 1000.
        num_threads (int): Number of threads for parallel training.  Defaults to 1.
        num_icm_iterations (int): Number of iterations of iterated conditional
            modes (topic maximization).  Defaults to 0.
        no_inference (bool): Load a saved model and create a report. Equivalent
            to `num_iterations = 0`. Defaults to False.
        random_seed (int): Random seed for the Gibbs sampler. Defaults to 0.
        optimize_interval (int): Number of iterations between reestimating
            dirichlet hyperparameters. Defaults to 0.
        optimize_burn_in (int): Number of iterations to run before first
            estimating dirichlet hyperparameters. Defaults to 200.
        use_symmetric_alpha (bool): Only optimize the concentration parameter of
            the prior over document-topic distributions. This may reduce the
            number of very small, poorly estimated topics, but may disperse common
            words over several topics. Defaults to False.
        alpha (float): Sum over topics of smoothing over doc-topic distributions.
            alpha_k = [this value] / [num topics]. Defaults to 5.0.
        beta (float): Smoothing parameter for each topic-word. Defaults to 0.01.
```

So, now you know how to define the number of topics and the number of sampling iterations. A higher number of iterations will probably yield a better model, but also increases processing time. `alpha` and `beta` are so-called *hyperparameters*. They influence the model's performance, so feel free to play around with them. In the present example, we will leave the default values. There are also various methods for hyperparameter optimization, e.g. grid search or Bayesian optimization (commonly based on Gaussian processes).
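
Two small calculations may make the hyperparameters easier to handle. The sketch below first applies the formula from the help text (`alpha_k = alpha / num_topics`) and then enumerates candidate settings for a grid search. The helpers `per_topic_alpha` and `parameter_grid` and the candidate values are assumptions made for this example; training and scoring a model for each combination is deliberately left out.

```python
from itertools import product


def per_topic_alpha(alpha_sum, num_topics):
    """Per-topic smoothing value: alpha_k = alpha / num_topics (see help text)."""
    return alpha_sum / num_topics


def parameter_grid(**options):
    """Yield one dict per combination of the given parameter values."""
    names = list(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))


print(per_topic_alpha(5.0, 10))  # defaults from the help text above

# Hypothetical candidate values; each dict could be passed on as keyword arguments.
grid = list(parameter_grid(num_topics=[10, 20], alpha=[1.0, 5.0], beta=[0.01, 0.1]))
print(len(grid), grid[0])
```

Keep in mind that every additional candidate value multiplies the number of models to train, so a grid quickly becomes expensive.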

**Warning: This step can take quite a while!** Depending on corpus size and the number of iterations, it can take anything from a few seconds to several hours. Our example corpus should be done within a minute or two at `num_iterations=1000`.

First, create an output folder:

In [None]:
output = Path('data', 'mallet_output')

# Create the folder unless it already exists
output.mkdir(exist_ok=True)

In [None]:
%%time

Mallet.train_topics(mallet_corpus,
                    output_topic_keys=str(Path(output, 'topic_keys.txt')),
                    output_doc_topics=str(Path(output, 'doc_topics.txt')),
                    num_topics=10,
                    num_iterations=1000)

If you are curious about MALLET's logging, have a look at the file `mallet.log`, which should have been created in the same directory as your notebook.

### 2.4. Create document-topic matrix

The generated model output can now be translated into a human-readable document-topic matrix (which is actually a pandas data frame) that constitutes our principal exchange format for topic modeling results. As a first step, we read the topics and their keywords from the file that MALLET has written:

In [None]:
topics = postprocessing.show_topics(topic_keys_file=str(Path(output, 'topic_keys.txt')))
topics
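
If you would like to inspect the topic keys file yourself, the following sketch parses it by hand. It assumes the tab-separated layout of topic number, Dirichlet parameter and space-separated top words; please check this against your own `topic_keys.txt`. The function `parse_topic_keys` is hypothetical, and the sample text is made up rather than real model output.

```python
def parse_topic_keys(text):
    """Parse tab-separated lines: topic id, Dirichlet parameter, top words."""
    topics = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        topic_id, weight, words = line.split("\t", 2)
        topics[int(topic_id)] = {"alpha": float(weight), "words": words.split()}
    return topics


# Made-up sample in the assumed layout, not real model output.
sample = "0\t0.5\tkrieg front soldat\n1\t0.5\tbrief tagebuch familie\n"
for topic_id, topic in parse_topic_keys(sample).items():
    print(topic_id, topic["alpha"], ", ".join(topic["words"]))
```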

## 3. Visualization

Each topic has a certain probability for each document in the corpus (have a look at the cell below). These probability distributions are visualized in an interactive **heatmap** (the darker the color, the higher the probability), which displays the kind of information that is presumably most useful to literary scholars. Going beyond pure exploration, this visualization can be used to show thematic developments over a set of texts as well as within a single text, akin to a dynamic topic model. What might become apparent here is that some topics correlate highly with a specific author or group of authors, while other topics correlate highly with a specific text or group of texts.

In [None]:
document_topics = postprocessing.show_document_topics(topics=topics,
                                                      doc_topics_file=str(Path(output, 'doc_topics.txt')))
document_topics[:5]
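
A heatmap shows all probabilities at once. As a quick complement, you can ask which topic dominates each document. The sketch below does this for made-up proportions stored in a plain nested dictionary, which is an assumption of this example and not the layout of the `document_topics` data frame; `dominant_topics` is a hypothetical helper.

```python
def dominant_topics(document_topics):
    """Map each document to its (topic, proportion) with the highest share."""
    return {document: max(shares.items(), key=lambda item: item[1])
            for document, shares in document_topics.items()}


# Made-up proportions; each document's shares sum to 1.
toy_document_topics = {
    "doc_a": {"krieg front": 0.7, "brief tagebuch": 0.3},
    "doc_b": {"krieg front": 0.2, "brief tagebuch": 0.8},
}
for document, (topic, share) in dominant_topics(toy_document_topics).items():
    print(document, topic, share)
```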

### 3.1. Distribution of topics

#### Distribution of topics over all documents

The distribution of topics over all documents can now be visualized in an interactive heatmap.

In [None]:
from bokeh.io import output_notebook, show
output_notebook()
%matplotlib inline

In [None]:
PlotDocumentTopics = visualization.PlotDocumentTopics(document_topics)
show(PlotDocumentTopics.interactive_heatmap(), notebook_handle=True)

Or a static heatmap:

In [None]:
static_heatmap = PlotDocumentTopics.static_heatmap()
static_heatmap.show()