In [1]:
%matplotlib inline


Similarity Queries
==================

Demonstrates querying a corpus for similar documents.



In [2]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Creating the Corpus
-------------------

First, we need to create a corpus to work with.
This step is the same as in the previous tutorial;
if you completed it, feel free to skip to the next section.



In [3]:
from collections import defaultdict
from gensim import corpora

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

2020-04-22 12:57:44,169 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-04-22 12:57:44,169 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)


Similarity interface
--------------------
* We previously covered how to create a VSM corpus and how to transform it between different vector spaces.
* Next: determine **similarity between pairs of documents**, or **similarity between a specific document and a set of
other documents** (such as a user query vs. indexed documents).
* Basis: [Deerwester: Indexing by Latent Semantic Analysis](http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf) (1990).


In [4]:
from gensim import models
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

2020-04-22 13:00:21,887 : INFO : using serial LSI version on this node
2020-04-22 13:00:21,889 : INFO : updating model with new documents
2020-04-22 13:00:21,890 : INFO : preparing a new chunk of documents
2020-04-22 13:00:21,892 : INFO : using 100 extra samples and 2 power iterations
2020-04-22 13:00:21,893 : INFO : 1st phase: constructing (12, 102) action matrix
2020-04-22 13:00:21,894 : INFO : orthonormalizing (12, 102) action matrix
2020-04-22 13:00:21,895 : INFO : 2nd phase: running dense svd on (12, 9) matrix
2020-04-22 13:00:21,896 : INFO : computing the final decomposition
2020-04-22 13:00:21,897 : INFO : keeping 2 factors (discarding 43.156% of energy spectrum)
2020-04-22 13:00:21,898 : INFO : processed documents up to #9
2020-04-22 13:00:21,899 : INFO : topic #0(3.341): 0.644*"system" + 0.404*"user" + 0.301*"eps" + 0.265*"response" + 0.265*"time" + 0.240*"computer" + 0.221*"human" + 0.206*"survey" + 0.198*"interface" + 0.036*"graph"
2020-04-22 13:00:21,900 : INFO : topic #1(2

* LSI identifies patterns and relationships between terms and topics. Here the LSI space is two-dimensional (`num_topics=2`), so there are two topics; the choice of two is arbitrary. [more](https://en.wikipedia.org/wiki/Latent_semantic_indexing).

* Assume a user typed in the query `"Human computer interaction"`. We want to sort our nine corpus documents in decreasing order of relevance to this query. 

* Unlike modern search engines, we concentrate here on a single aspect of similarity: the apparent semantic relatedness of the documents' texts (words). No hyperlinks, no random-walk static ranks, just a semantic extension over the boolean keyword match:

In [5]:
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]  # convert the query to LSI space
print(vec_lsi)

[(0, 0.46182100453271735), (1, -0.07002766527899984)]
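
The query contains the word "interaction", which is not among the 12 dictionary tokens. With its default settings, `doc2bow` silently ignores unknown tokens, so only "human" and "computer" contribute to the query vector. The sketch below mimics that behaviour with the standard library only; the helper `to_bow` and the integer ids in `token2id` are hypothetical stand-ins and will not necessarily match the ids gensim assigns.

```python
from collections import Counter

# The 12 tokens kept by the preprocessing above (words occurring more than once).
vocabulary = {
    "computer", "human", "interface", "response", "survey", "system",
    "time", "user", "eps", "trees", "graph", "minors",
}

# Hypothetical ids, assigned alphabetically for illustration only.
token2id = {token: i for i, token in enumerate(sorted(vocabulary))}


def to_bow(text):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(w for w in text.lower().split() if w in token2id)
    return sorted((token2id[w], n) for w, n in counts.items())


query = "Human computer interaction"
dropped = [w for w in query.lower().split() if w not in token2id]

print(to_bow(query))  # pairs for "computer" and "human" only
print(dropped)        # tokens that never reach the LSI model
```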


* We'll use [cosine similarity](http://en.wikipedia.org/wiki/Cosine_similarity) to find the two vectors' similarity. 
* Cosine similarity is a standard measure in Vector Space Modeling. When the vectors represent probability distributions,
[other measures](http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Symmetrised_divergence) may be more appropriate.
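
As a minimal sketch of what cosine similarity computes, the snippet below implements it for small dense vectors in plain Python. The query vector is copied from the output above; `doc_vec` is a made-up document vector used only for illustration, and returning `0.0` for zero vectors is a convention chosen here, not something the tutorial specifies.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Query vector in the 2-D LSI space, copied from the output above.
query = [0.46182100453271735, -0.07002766527899984]
# Hypothetical document vector, for illustration only.
doc_vec = [0.9, -0.1]

# Scaling a vector does not change the cosine; flipping its sign negates it.
same_direction = cosine_similarity(query, [2 * x for x in query])
opposite = cosine_similarity(query, [-x for x in query])

print(cosine_similarity(query, doc_vec), same_direction, opposite)
```

Because only the direction of the vectors matters, a long and a short document about the same topic can still score close to 1.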

### Initializing query structures

To prepare for similarity queries, we need to enter all documents that we want
to compare against subsequent queries. In our case, they are the same nine documents
used for training LSI, converted to the 2-D LSI space. That is only incidental, though:
we might just as well be indexing a different corpus altogether.

In [6]:
from gensim import similarities
index = similarities.MatrixSimilarity(lsi[corpus])  # transform corpus to LSI space and index it

2020-04-22 13:10:50,202 : INFO : creating matrix with 9 documents and 2 features


<div class="alert alert-danger"><h4>Warning</h4><p>The class <code>similarities.MatrixSimilarity</code> is only appropriate when the whole
  set of vectors fits into memory. For example, a corpus of one million documents
  would require 2GB of RAM in a 256-dimensional LSI space, when used with this class.

  Without 2GB of free RAM, you would need to use the <code>similarities.Similarity</code> class.
  This class operates in fixed memory, by splitting the index across multiple files on disk, called shards.
  It uses <code>similarities.MatrixSimilarity</code> and <code>similarities.SparseMatrixSimilarity</code> internally,
  so it is still fast, although slightly more complex.</p></div>
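
A rough way to judge whether an in-memory index is feasible is to multiply documents by dimensions by bytes per stored value. The sketch below does that arithmetic for the one-million-document example. The helper name `dense_index_bytes` is hypothetical, the storage width per value is an assumption, and real memory use includes overhead that this estimate ignores.

```python
def dense_index_bytes(num_docs, num_features, bytes_per_value):
    """Raw size of a dense num_docs x num_features matrix, ignoring overhead."""
    return num_docs * num_features * bytes_per_value


num_docs = 1_000_000
num_features = 256

# Two assumed storage widths: 4-byte and 8-byte floating point values.
size_float32 = dense_index_bytes(num_docs, num_features, 4)
size_float64 = dense_index_bytes(num_docs, num_features, 8)

print(f"4-byte floats: {size_float32 / 1e9:.2f} GB")
print(f"8-byte floats: {size_float64 / 1e9:.2f} GB")
```

The 8-byte case lands near the 2GB figure quoted in the warning; the text does not say how that figure was derived, so treat the match as indicative only.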

* Index persistence is handled with the `save` and `load` functions.



In [7]:
index.save('/tmp/deerwester.index')
index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')

2020-04-22 13:11:59,713 : INFO : saving MatrixSimilarity object under /tmp/deerwester.index, separately None
2020-04-22 13:11:59,718 : INFO : saved /tmp/deerwester.index
2020-04-22 13:11:59,719 : INFO : loading MatrixSimilarity object from /tmp/deerwester.index
2020-04-22 13:11:59,721 : INFO : loaded /tmp/deerwester.index


In [8]:
!ls /tmp/*.index

/tmp/corpus.lda-c.index  /tmp/corpus.svmlight.index  /tmp/mymodel.index
/tmp/corpus.low.index	 /tmp/deerwester.index
/tmp/corpus.mm.index	 /tmp/deerwester.mm.index


Persistence with `save` and `load` works the same way for all similarity indexing classes
(`similarities.Similarity`, `similarities.MatrixSimilarity` and `similarities.SparseMatrixSimilarity`).
In the following, `index` can likewise be an object of any of these classes. When in doubt,
use `similarities.Similarity`, as it is the most scalable version and also
supports adding more documents to the index later.

### Performing queries

In [9]:
sims = index[vec_lsi]  # perform a similarity query against the corpus
print(list(enumerate(sims)))  # print (document_number, document_similarity) 2-tuples

[(0, 0.998093), (1, 0.93748635), (2, 0.9984453), (3, 0.9865886), (4, 0.90755945), (5, -0.12416792), (6, -0.10639259), (7, -0.09879464), (8, 0.050041765)]


* The cosine measure returns similarities in the range `[-1, 1]` (the greater the value, the more similar).
* Sort in descending order, and obtain the final answer to the query `"Human computer interaction"`:



In [10]:
# sort (document_number, similarity) pairs by decreasing similarity
sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, doc_score in sims:
    # look up the text by its document number, not by its rank in the sorted list
    print((doc_position, doc_score), documents[doc_position])

(2, 0.9984453) The EPS user interface management system
(0, 0.998093) Human machine interface for lab abc computer applications
(3, 0.9865886) System and human system engineering testing of EPS
(1, 0.93748635) A survey of user opinion of computer system response time
(4, 0.90755945) Relation of user perceived response time to error measurement
(8, 0.050041765) Graph minors A survey
(7, -0.09879464) Graph minors IV Widths of trees and well quasi ordering
(6, -0.10639259) The intersection graph of paths in trees
(5, -0.12416792) The generation of random binary unordered trees


* Note: documents no. 2 (`"The EPS user interface management system"`) and no. 4 (`"Relation of user perceived response time to error measurement"`) would never be returned by a standard boolean fulltext search, because they share no words with `"Human computer interaction"`. After applying LSI, however, both receive quite high similarity scores (no. 2 is actually the most similar!), which matches our intuition that they share a "computer-human" related topic with the query. In fact, this semantic generalization is the reason why we apply transformations and do topic modelling in the first place.
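
The claim about boolean search can be checked with a few lines of plain Python. The sketch below treats "boolean fulltext search" as an any-term match, meaning a document is a hit if it shares at least one non-stopword with the query; that reading, and the helper name `tokens`, are assumptions made for illustration.

```python
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
stoplist = set('for a of the and to in'.split())


def tokens(text):
    """Lowercase, split on whitespace and drop stopwords."""
    return {word for word in text.lower().split() if word not in stoplist}


query_tokens = tokens("Human computer interaction")

# Any-term boolean match: keep documents sharing at least one query word.
keyword_hits = [
    i for i, document in enumerate(documents)
    if tokens(document) & query_tokens
]

print(keyword_hits)  # → [0, 1, 3]
```

Documents 2 and 4 are missing from the keyword hits, yet both appear among the top five LSI results above.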

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
#img = mpimg.imread('run_similarity_queries.png')
#imgplot = plt.imshow(img)
#plt.axis('off')
#plt.show()