In [22]:
%matplotlib inline


Topics and Transformations
===========================

Introduces transformations and demonstrates their use on a toy corpus.



In [23]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Creating the Corpus
-------------------

In [24]:
from collections import defaultdict
from gensim import corpora

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

2020-04-22 12:24:07,552 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-04-22 12:24:07,553 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)


### Creating a transformation

In [25]:
from gensim import models
tfidf = models.TfidfModel(corpus)  # step 1 -- initialize a model

2020-04-22 12:24:28,654 : INFO : collecting document frequencies
2020-04-22 12:24:28,655 : INFO : PROGRESS: processing document #0
2020-04-22 12:24:28,657 : INFO : calculating IDF weights for 9 documents and 12 features (28 matrix non-zeros)


* We used the toy corpus from tutorial 1 (recreated above) to train the transformation model.
* Different transformations require different initialization parameters. TfIdf simply goes through the training corpus once and collects document frequencies; training other models, such as LSA or LDA, is much more involved.
* Note: transformations always convert between two specific vector spaces. The same vector space (= the same set of feature ids) must be used for training and for subsequent transformations. Failure to use the same input feature space, such as applying different string preprocessing, using different feature ids, or using bag-of-words vectors where TfIdf vectors are expected, will result in a feature mismatch during transformation calls, and consequently in garbage output and/or runtime exceptions.

### Transforming vectors

From now on, ``tfidf`` is treated as a read-only object that can be used to convert
any vector from the old representation (BoW integer counts) to the new representation
(TfIdf real-valued weights).

In [26]:
doc_bow = [(0, 1), (1, 1)]
print(tfidf[doc_bow])  # step 2 -- use the model to transform vectors

[(0, 0.7071067811865476), (1, 0.7071067811865476)]
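To make the numbers above less opaque, here is a minimal pure-Python sketch of the weighting. It assumes gensim's default scheme is raw term frequency times `log2(N / df)`, followed by scaling to unit Euclidean length. The helper `tfidf_weights` is a hypothetical name used only for illustration, and it works on tokens rather than integer feature ids.

```python
import math
from collections import Counter

# Toy corpus from this notebook, after stop-word and singleton removal.
texts = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]

# Document frequency: in how many documents does each token occur?
doc_freq = Counter(tok for text in texts for tok in set(text))


def tfidf_weights(tokens, doc_freq, num_docs):
    """Hypothetical helper: tf * log2(N / df), then L2 normalization (assumed default)."""
    counts = Counter(tokens)
    raw = {tok: tf * math.log2(num_docs / doc_freq[tok]) for tok, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {tok: w / norm for tok, w in raw.items()} if norm else {}


# Ids 0 and 1 in the dictionary above are 'computer' and 'human'.
weights = tfidf_weights(['computer', 'human'], doc_freq, len(texts))
print(weights)  # both tokens have the same document frequency, so both get 1/sqrt(2)
```

Under these assumptions the sketch agrees with the `tfidf[doc_bow]` output above; in the fourth document, "system" receives the largest weight because it occurs twice there.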


* Apply the transformation to the entire corpus:



In [27]:
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.3244870206138555)]
[(2, 0.5710059809418182), (5, 0.4170757362022777), (7, 0.4170757362022777), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]
[(3, 0.6282580468670046), (6, 0.6282580468670046), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(4, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]


* Once a transformation model is initialized, it can be used on any vectors (provided they come from the same vector space, of course), even if they were not used for training. This is done using a process called folding-in for LSA, by topic inference for LDA, etc.

* Note: calling ``model[corpus]`` only creates a wrapper around the old ``corpus``
  document stream -- actual conversions are done on-the-fly, during document iteration.
  We cannot convert the entire corpus at the time of calling ``corpus_transformed = model[corpus]``, because that would mean storing the result in main memory, which contradicts gensim's objective of memory independence.
* If you will be iterating over the transformed ``corpus_transformed`` multiple times, and the transformation is costly, **serialize the resulting corpus to disk first.** Transformations can also be stacked, one on top of another, in a sort of chain:



In [28]:
lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)  # initialize an LSI transformation
corpus_lsi = lsi_model[corpus_tfidf]  # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

2020-04-22 12:32:46,095 : INFO : using serial LSI version on this node
2020-04-22 12:32:46,097 : INFO : updating model with new documents
2020-04-22 12:32:46,099 : INFO : preparing a new chunk of documents
2020-04-22 12:32:46,100 : INFO : using 100 extra samples and 2 power iterations
2020-04-22 12:32:46,101 : INFO : 1st phase: constructing (12, 102) action matrix
2020-04-22 12:32:46,103 : INFO : orthonormalizing (12, 102) action matrix
2020-04-22 12:32:46,105 : INFO : 2nd phase: running dense svd on (12, 9) matrix
2020-04-22 12:32:46,106 : INFO : computing the final decomposition
2020-04-22 12:32:46,107 : INFO : keeping 2 factors (discarding 47.565% of energy spectrum)
2020-04-22 12:32:46,108 : INFO : processed documents up to #9
2020-04-22 12:32:46,108 : INFO : topic #0(1.594): 0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"time" + 0.060*"response" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"
2020-04-22 12:32:46,109 : INFO : topic #
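The on-the-fly behaviour of ``model[corpus]`` described earlier can be illustrated with a small pure-Python sketch. This is not gensim's implementation: `LazyTransformedCorpus` and `double_weights` are hypothetical names, and the "transformation" is a trivial stand-in that only counts how often it runs.

```python
class LazyTransformedCorpus:
    """Apply `transform` to each document only when the corpus is iterated."""

    def __init__(self, transform, corpus):
        self.transform = transform
        self.corpus = corpus

    def __iter__(self):
        for doc in self.corpus:
            yield self.transform(doc)


calls = 0


def double_weights(doc):
    """Stand-in for an expensive transformation; counts its invocations."""
    global calls
    calls += 1
    return [(term_id, 2.0 * weight) for term_id, weight in doc]


toy_corpus = [[(0, 1.0)], [(1, 1.0), (2, 3.0)]]

lazy = LazyTransformedCorpus(double_weights, toy_corpus)
calls_after_wrapping = calls       # 0: wrapping alone computes nothing

first_pass = list(lazy)
second_pass = list(lazy)
calls_after_two_passes = calls     # 4: two documents, transformed on each pass

print(calls_after_wrapping, calls_after_two_passes)
```

Every pass repeats the work, which is why storing the transformed corpus once pays off when the transformation is costly and the corpus is iterated several times.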

* Here we transformed our Tf-Idf corpus using [Latent Semantic Indexing](http://en.wikipedia.org/wiki/Latent_semantic_indexing) into a latent 2-D space (2-D because we set ``num_topics=2``). 
* What do these two latent dimensions stand for? Let's inspect them with ``lsi_model.print_topics``:



In [29]:
lsi_model.print_topics(2)

2020-04-22 12:36:20,935 : INFO : topic #0(1.594): 0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"time" + 0.060*"response" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"
2020-04-22 12:36:20,936 : INFO : topic #1(1.476): -0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"


[(0,
  '0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"time" + 0.060*"response" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"'),
 (1,
  '-0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"')]

* According to LSI:
    - "trees", "graph" and "minors" are all related words (and contribute the most to the first topic)
    - the second topic practically concerns itself with all the other words.
* As expected, the first five documents are more strongly related to the second topic, while the remaining four documents are more strongly related to the first topic:



In [30]:
# both bow->tfidf and tfidf->lsi transformations are done on the fly
for doc, as_text in zip(corpus_lsi, documents):
    print(doc, as_text)

[(0, 0.06600783396090451), (1, -0.5200703306361849)] Human machine interface for lab abc computer applications
[(0, 0.1966759285914269), (1, -0.7609563167700043)] A survey of user opinion of computer system response time
[(0, 0.08992639972446562), (1, -0.7241860626752503)] The EPS user interface management system
[(0, 0.07585847652178268), (1, -0.6320551586003422)] System and human system engineering testing of EPS
[(0, 0.10150299184980302), (1, -0.5737308483002957)] Relation of user perceived response time to error measurement
[(0, 0.7032108939378308), (1, 0.1611518021402595)] The generation of random binary unordered trees
[(0, 0.8774787673119828), (1, 0.16758906864659615)] The intersection graph of paths in trees
[(0, 0.9098624686818575), (1, 0.14086553628719237)] Graph minors IV Widths of trees and well quasi ordering
[(0, 0.6165825350569282), (1, -0.05392907566389199)] Graph minors A survey
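The split between the two groups of documents can be checked mechanically. The sketch below copies the LSI vectors from the output above (rounded to four decimals) and picks, for each document, the dimension with the largest absolute weight. The absolute value is used because the sign of an LSI dimension is arbitrary; `dominant_topic` is a hypothetical helper name.

```python
# LSI vectors copied from the output above, rounded to four decimals.
doc_vectors = [
    [(0, 0.0660), (1, -0.5201)],
    [(0, 0.1967), (1, -0.7610)],
    [(0, 0.0899), (1, -0.7242)],
    [(0, 0.0759), (1, -0.6321)],
    [(0, 0.1015), (1, -0.5737)],
    [(0, 0.7032), (1, 0.1612)],
    [(0, 0.8775), (1, 0.1676)],
    [(0, 0.9099), (1, 0.1409)],
    [(0, 0.6166), (1, -0.0539)],
]


def dominant_topic(vec):
    """Return the id of the dimension with the largest absolute weight."""
    return max(vec, key=lambda item: abs(item[1]))[0]


dominant = [dominant_topic(vec) for vec in doc_vectors]
print(dominant)  # [1, 1, 1, 1, 1, 0, 0, 0, 0]
```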


* Use the ``save`` and ``load`` methods to persist models.



In [31]:
import os
import tempfile

with tempfile.NamedTemporaryFile(
    prefix='model-', 
    suffix='.lsi', 
    delete=False) as tmp:
    lsi_model.save(tmp.name)  # same for tfidf, lda, ...

loaded_lsi_model = models.LsiModel.load(tmp.name)

os.unlink(tmp.name)

2020-04-22 12:39:14,701 : INFO : saving Projection object under /tmp/model-a77xtk9n.lsi.projection, separately None
2020-04-22 12:39:14,703 : INFO : saved /tmp/model-a77xtk9n.lsi.projection
2020-04-22 12:39:14,703 : INFO : saving LsiModel object under /tmp/model-a77xtk9n.lsi, separately None
2020-04-22 12:39:14,704 : INFO : not storing attribute projection
2020-04-22 12:39:14,705 : INFO : not storing attribute dispatcher
2020-04-22 12:39:14,706 : INFO : saved /tmp/model-a77xtk9n.lsi
2020-04-22 12:39:14,707 : INFO : loading LsiModel object from /tmp/model-a77xtk9n.lsi
2020-04-22 12:39:14,708 : INFO : loading id2word recursively from /tmp/model-a77xtk9n.lsi.id2word.* with mmap=None
2020-04-22 12:39:14,709 : INFO : setting ignored attribute projection to None
2020-04-22 12:39:14,710 : INFO : setting ignored attribute dispatcher to None
2020-04-22 12:39:14,710 : INFO : loaded /tmp/model-a77xtk9n.lsi
2020-04-22 12:39:14,711 : INFO : loading LsiModel object from /tmp/model-a77xtk9n.lsi.proje

In [34]:
!ls /tmp/*lsi*

/tmp/model-5xidaxof.lsi.projection  /tmp/model-si39wa43.lsi.projection
/tmp/model-a77xtk9n.lsi.projection


* How similar are those documents?

Available transformations
--------------------------

Gensim implements several popular Vector Space Model algorithms:

* [Term Frequency * Inverse Document Frequency, Tf-Idf](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) expects a bag-of-words (integer values) training corpus. It accepts a vector and returns another vector of the same dimensionality, in which features that were rare in the training corpus have their value increased. It therefore converts integer-valued vectors into real-valued ones, while leaving the number of dimensions intact. It can also optionally normalize the resulting vectors to (Euclidean) unit length.

``model = models.TfidfModel(corpus, normalize=True)``

* [Latent Semantic Indexing, LSI (sometimes LSA)](http://en.wikipedia.org/wiki/Latent_semantic_indexing) transforms documents from either bag-of-words or (preferably) TfIdf into a latent space of a lower dimensionality. For the toy corpus we used only 2 latent dimensions, but on real corpora, a target dimensionality of 200--500 is recommended [1].

``model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)``

* LSI training is unique in that we can continue "training" at any point, simply by providing more training documents. This is done by incremental updates to the underlying model (called **online training**). So the input document stream can be infinite -- just keep feeding LSI new documents as they arrive, while using the computed transformation model as read-only in the meantime!

``model.add_documents(another_tfidf_corpus)``  # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
``lsi_vec = model[tfidf_vec]``  # convert some new document into the LSI space, without affecting the model
``model.add_documents(more_documents)``  # tfidf_corpus + another_tfidf_corpus + more_documents
``lsi_vec = model[tfidf_vec]``

* See ``gensim.models.lsimodel`` docs to learn how to make LSI gradually "forget" old observations in infinite streams. If you want to get dirty, there are also parameters you can tweak that affect speed vs. memory footprint vs. numerical precision of the LSI algorithm.

* `gensim` uses an online incremental streamed distributed training algorithm, published in [5]. `gensim` also executes a stochastic multi-pass algorithm from Halko et al. [4] to accelerate in-core computations.
* See the wiki for clustering help.

* [Random Projections, RP](http://www.cis.hut.fi/ella/publications/randproj_kdd.pdf) aim to
  reduce vector space dimensionality. This is a very efficient (both memory- and
  CPU-friendly) approach to approximating TfIdf distances between documents, by throwing in a little randomness.
  Recommended target dimensionality is again in the hundreds/thousands, depending on your dataset.

``model = models.RpModel(tfidf_corpus, num_topics=500)``
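To give a feel for why a random matrix can approximate distances, here is a minimal pure-Python sketch of a random projection with entries of ±1/sqrt(k). It is only an illustration of the general technique under that assumption, not gensim's `RpModel` code; the helper names, the dimensions (1000 features projected to 300) and the seeds are arbitrary choices.

```python
import math
import random


def make_projection(num_features, num_topics, seed=0):
    """Random matrix with entries +-1/sqrt(num_topics), one row per output dimension."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(num_topics)
    return [[rng.choice((-scale, scale)) for _ in range(num_features)]
            for _ in range(num_topics)]


def project(matrix, vec):
    """Multiply the projection matrix by a dense vector."""
    return [sum(r * x for r, x in zip(row, vec)) for row in matrix]


def distance(a, b):
    """Euclidean distance between two dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(1)
doc_a = [rng.random() for _ in range(1000)]   # two made-up dense "documents"
doc_b = [rng.random() for _ in range(1000)]

matrix = make_projection(num_features=1000, num_topics=300)
proj_a, proj_b = project(matrix, doc_a), project(matrix, doc_b)

# Ratio of the projected distance to the original distance; expected to be near 1.
ratio = distance(proj_a, proj_b) / distance(doc_a, doc_b)
print(len(proj_a), round(ratio, 2))
```

The projected vectors have 300 components instead of 1000, while the distance between the two documents is roughly preserved.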

* [Latent Dirichlet Allocation, LDA](http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) also transforms vectors from BoW counts into a topic space of lower dimensionality.
* It is a probabilistic extension of LSA (also called multinomial PCA), so LDA's topics can be interpreted as probability distributions over words.
* `gensim`'s implementation of online LDA parameter estimation is based on [2], modified to run in distributed mode on clusters.

``model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)``

* [Hierarchical Dirichlet Process, HDP](http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf) is a non-parametric Bayesian method (note the missing number of requested topics). `gensim`'s implementation is based on [3]. The HDP model is a new addition to `gensim`, and still rough around its academic edges -- use with care.

``model = models.HdpModel(corpus, id2word=dictionary)``

* Adding new VSM (Vector Space Model) transformations (such as different weighting schemes):
    - see the API reference or the [Python code](https://github.com/piskvorky/gensim/blob/develop/gensim/models/tfidfmodel.py)

References
----------

[1] Bradford. 2008. An empirical study of required dimensionality for large-scale latent semantic indexing applications.

[2] Hoffman, Blei, Bach. 2010. Online learning for Latent Dirichlet Allocation.

[3] Wang, Paisley, Blei. 2011. Online variational inference for the hierarchical Dirichlet process.

[4] Halko, Martinsson, Tropp. 2009. Finding structure with randomness.

[5] Řehůřek. 2011. Subspace tracking for Latent Semantic Analysis.


