# Additive Regularization of Topic Models

### Marina Gomtsyan, Valerii Likhosherstov, Aliaksandra Shysheya

# Topic Modeling

* document clustering
* feature selection
* information retrieval
* etc.

# Probabilistic latent semantic analysis

$D$ - set of texts (documents), $W$ - set of words, $T$ - set of latent topics.

$n_d$ - number of words in document $d$, $n_{dw}$ - number of times word $w$ appears in $d$.

* Text collection:  $(w_i, d_i, t_i) \sim p(w,d,t)$, $i = 1, \dots, n$.
* Assumption: $p(w | t) = p(w | d, t)$.
* $p(w | d) = \sum_{t \in T} p(t | d) p(w | t) = \sum_{t \in T} \phi_{wt} \theta_{td} \approx \frac{n_{dw}}{n_d}$.

# PLSA 

Problem statement:

$$ L (\Phi, \Theta) = \sum_{d \in D} \sum_{w \in W} n_{dw} \ln \sum_{t \in T} \phi_{wt} \theta_{td} \to \max_{\Phi, \Theta} $$
$$ \text{subject to} $$
$$ \sum_{w \in W} \phi_{wt} = 1, \quad \phi_{wt} \geq 0 $$
$$ \sum_{t \in T} \theta_{td} = 1, \quad \theta_{td} \geq 0 $$

Solution - EM-algorithm iterations.
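As a rough illustration of these iterations, here is a minimal NumPy sketch of EM for PLSA. The function names `plsa_em` and `log_likelihood`, the random initialization, and the `1e-12` guard against division by zero are assumptions made for this example, not the project's actual code.

```python
import numpy as np

def log_likelihood(n_dw, phi, theta):
    """L(Phi, Theta) = sum_d sum_w n_dw * ln sum_t phi_wt * theta_td."""
    p_dw = (phi @ theta).T                     # (D, W) model estimate of p(w | d)
    mask = n_dw > 0                            # zero counts contribute nothing
    return float(np.sum(n_dw[mask] * np.log(p_dw[mask])))

def plsa_em(n_dw, num_topics, num_iters=50, seed=0):
    """Fit PLSA by EM. n_dw is a (D, W) count matrix.

    Returns phi of shape (W, T) and theta of shape (T, D);
    every column of both matrices sums to 1.
    """
    rng = np.random.default_rng(seed)
    num_docs, num_words = n_dw.shape
    phi = rng.random((num_words, num_topics))
    phi /= phi.sum(axis=0, keepdims=True)
    theta = rng.random((num_topics, num_docs))
    theta /= theta.sum(axis=0, keepdims=True)
    for _ in range(num_iters):
        # E-step, folded into the counters:
        # p(t | d, w) = phi_wt * theta_td / sum_s phi_ws * theta_sd
        p_wd = phi @ theta                           # (W, D)
        ratio = n_dw.T / np.maximum(p_wd, 1e-12)     # n_dw / p(w | d)
        n_wt = phi * (ratio @ theta.T)               # expected word-topic counts
        n_td = theta * (phi.T @ ratio)               # expected topic-document counts
        # M-step: normalize the expected counts.
        phi = n_wt / np.maximum(n_wt.sum(axis=0, keepdims=True), 1e-12)
        theta = n_td / np.maximum(n_td.sum(axis=0, keepdims=True), 1e-12)
    return phi, theta
```

The E-step posteriors $p(t | d, w)$ are never stored explicitly; only the aggregated counters $n_{wt}$ and $n_{td}$ are kept, which avoids a $|D| \times |W| \times |T|$ array.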

# Additive regularization of topic models



Let's add regularization to the objective:

$$ L(\Phi, \Theta) + R(\Phi, \Theta) \to \max_{\Phi, \Theta} $$

* $ R( \Phi, \Theta) = \sum_{i=1}^r \tau_i R_i(\Phi, \Theta) $.
* For smooth $R$ the EM-algorithm still applies: only the M-step changes, by a term with the partial derivatives of $R$.
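For reference, the regularized iterations from the ARTM literature can be sketched in this deck's notation as follows. The symbols $p_{tdw}$, $n_{wt}$, $n_{td}$ and $(x)_+ = \max(x, 0)$ are introduced here for the sketch and do not appear elsewhere in the slides.

$$ p_{tdw} = \frac{\phi_{wt} \theta_{td}}{\sum_{s \in T} \phi_{ws} \theta_{sd}}, \quad n_{wt} = \sum_{d \in D} n_{dw} p_{tdw}, \quad n_{td} = \sum_{w \in W} n_{dw} p_{tdw} $$
$$ \phi_{wt} \propto \Big( n_{wt} + \phi_{wt} \frac{\partial R}{\partial \phi_{wt}} \Big)_+, \quad \theta_{td} \propto \Big( n_{td} + \theta_{td} \frac{\partial R}{\partial \theta_{td}} \Big)_+ $$

With $R = 0$ these reduce to the plain PLSA updates.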

# Example - sparsing regularization



$$ R(\Phi, \Theta) = -\beta_0 \sum_{t \in T} \sum_{w \in W} \beta_{w} \ln \phi_{wt} - \alpha_{0} \sum_{d \in D}\sum_{t \in T} \alpha_t \ln \theta_{td} $$

$\beta_w$ and $\alpha_t$ are uniform distributions over words and topics, $\beta_0, \alpha_0 \geq 0$ set the sparsing strength.
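To see why this regularizer produces zeros, note that $\phi_{wt} \frac{\partial R}{\partial \phi_{wt}} = -\beta_0 \beta_w$, so under the standard ARTM M-step the update becomes $\phi_{wt} \propto (n_{wt} - \beta_0 \beta_w)_+$: small counts are clipped to zero. A minimal sketch of that update, where the function name `sparse_m_step` and the toy numbers are illustrative assumptions:

```python
import numpy as np

def sparse_m_step(n_wt, beta0, beta_w):
    """Sparsing M-step for Phi: phi_wt ∝ max(n_wt - beta0 * beta_w, 0).

    n_wt   : (W, T) float array of expected word-topic counts from the E-step
    beta0  : non-negative regularization strength
    beta_w : (W,) distribution over words (uniform in the slides)
    """
    shifted = np.maximum(n_wt - beta0 * beta_w[:, None], 0.0)
    col_sums = shifted.sum(axis=0, keepdims=True)
    # A topic whose counts are all clipped keeps an all-zero column
    # (an assumption of this sketch; such a topic is effectively removed).
    return np.divide(shifted, col_sums,
                     out=np.zeros_like(shifted), where=col_sums > 0)

# Toy counts for 3 words and 2 topics (made-up numbers).
n_wt = np.array([[5.0, 0.2],
                 [3.0, 4.0],
                 [0.1, 6.0]])
beta_w = np.full(3, 1.0 / 3.0)
phi = sparse_m_step(n_wt, beta0=1.5, beta_w=beta_w)
```

Here every count is reduced by $1.5 \cdot \frac{1}{3} = 0.5$, so the two entries below 0.5 become exact zeros while the remaining entries are renormalized per topic. The same pattern applies to $\Theta$ with $\alpha_0 \alpha_t$.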

# Many other regularizers for:

* Topic covariance minimization
* Elimination of insignificant topics
* Classification regularizer
* etc.

# Our project

* Implement the EM-algorithm for ARTM and a set of regularizers.
* Set up several ARTM models.
* Choose a quality metric.
* Compare their quality on a dataset.

# Models

* **50 topics**
* Baseline model - PLSA
* Topic covariance minimization + Elimination of insignificant topics
* Sparsing regularization
* Classification using genre labels from the dataset.

# Metrics

### Perplexity (to test PLSA):

$$ \mathscr{P}(D, p) = \exp\Big(-\frac{1}{n} L(\Phi, \Theta)\Big) $$
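The perplexity formula can be computed directly from the count matrix and the two factors. This is a minimal sketch: the function name `perplexity` and the array layout, $(D, W)$ counts with $(W, T)$ and $(T, D)$ factors, are assumptions of the example, and $n$ is taken to be the total number of word occurrences.

```python
import numpy as np

def perplexity(n_dw, phi, theta):
    """exp(-L(Phi, Theta) / n), where n is the total number of word occurrences."""
    p_dw = (phi @ theta).T                 # (D, W) model estimate of p(w | d)
    mask = n_dw > 0                        # skip zero counts to avoid log(0) terms
    log_lik = np.sum(n_dw[mask] * np.log(p_dw[mask]))
    return float(np.exp(-log_lik / n_dw.sum()))
```

A useful sanity check: a model that assigns the uniform probability $1 / |W|$ to every word has perplexity exactly $|W|$, so any trained model should score below the vocabulary size.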

### Interpretability (main metric):

$$ \text{PMI}_t = \frac{1}{k(k-1)} \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \text{PMI}(w_i, w_j) $$

$w_1, \dots, w_k$ - the $k$ most probable words in topic $t$, $k = 100$.

$$ \text{PMI} (u, v) = \ln \frac{|D| N_{uv}}{N_u N_v} $$

$N_u$ - number of documents containing word $u$.

$N_{uv}$ - number of documents in which words $u$ and $v$ co-occur within a sliding window of 10 words.
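A minimal sketch of how these counts and the topic score could be computed from tokenized documents. The function names `document_counts` and `topic_pmi` are hypothetical, only pairs of distinct words are scored, and pairs that never co-occur are assumed to contribute 0 (the slides do not say how $N_{uv} = 0$ is handled).

```python
import math
from collections import Counter
from itertools import combinations

def document_counts(docs, vocab, window=10):
    """Return (N_u, N_uv): document frequencies of words and of word pairs
    that co-occur inside at least one sliding window of `window` tokens."""
    n_u, n_uv = Counter(), Counter()
    for tokens in docs:
        n_u.update(set(tokens) & vocab)          # each word counted once per document
        pairs = set()
        for start in range(max(1, len(tokens) - window + 1)):
            in_window = sorted(set(tokens[start:start + window]) & vocab)
            pairs.update(combinations(in_window, 2))
        n_uv.update(pairs)                       # each pair counted once per document
    return n_u, n_uv

def topic_pmi(top_words, n_u, n_uv, num_docs):
    """Sum PMI over pairs of distinct top words, normalized by k(k-1)."""
    k = len(top_words)
    total = 0.0
    for u, v in combinations(top_words, 2):
        joint = n_uv[tuple(sorted((u, v)))]      # pair keys are stored sorted
        if joint > 0:                            # assumption: unseen pairs add 0
            total += math.log(num_docs * joint / (n_u[u] * n_u[v]))
    return total / (k * (k - 1))

# Toy corpus of four tiny documents (made-up tokens).
docs = [["a", "b", "c"], ["a", "b"], ["c", "d"], ["a", "d"]]
n_u, n_uv = document_counts(docs, vocab={"a", "b", "c", "d"})
score = topic_pmi(["a", "b"], n_u, n_uv, num_docs=len(docs))
```

In the toy corpus, "a" occurs in 3 documents, "b" in 2, and they co-occur in 2, so $\text{PMI}(a, b) = \ln \frac{4 \cdot 2}{3 \cdot 2}$ and the two-word topic score is half of that.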

# Dataset

* From OpenCorpora - an open-source Russian-language corpus.
* 2610 documents, 59264 words (1900 after filtering).
* Manually lemmatized.

# EM-algorithm testing

Baseline model - PLSA

<img src="files/perplexity.png">

# Best model

Topic covariance minimization + Elimination of insignificant topics - ~5% improvement

<img src="files/best_model.png">

# Some topics for the best model

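* `['новый', 'искусство', 'эпоха', 'концепция', 'архитектура', 'разработка', 'личность', 'возрождение', 'здание', 'живопись']` (new, art, era, concept, architecture, development, personality, renaissance, building, painting)
* `['по', 'рубль', 'общий', 'каждый', 'сумма', 'проект', 'млн', 'сеть', 'евро', 'регион']` (per, ruble, total, each, sum, project, mln, network, euro, region)
* `['год', '2008', '2009', 'сообщил', '2004', 'миллиард', 'сша', 'германия', '60', 'продажа']` (year, 2008, 2009, reported, 2004, billion, USA, Germany, 60, sale)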

# Document clusters

<img src="files/tsne.png">