# Similarity Queries

## Similarity Interface

In [11]:
from os.path import join
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from gensim.corpora import Dictionary, MmCorpus
from gensim import models, similarities

dictionary = Dictionary.load('dewac_noun_tfidf.dict')
corpus = MmCorpus('dewac_noun_tfidf.mm')
print(corpus)

2018-11-28 20:08:03,753 : INFO : loading Dictionary object from dewac_noun_tfidf.dict
2018-11-28 20:08:03,792 : INFO : loaded dewac_noun_tfidf.dict
2018-11-28 20:08:03,968 : INFO : loaded corpus index from dewac_noun_tfidf.mm.index
2018-11-28 20:08:03,968 : INFO : initializing cython corpus reader from dewac_noun_tfidf.mm
2018-11-28 20:08:03,969 : INFO : accepted corpus with 1747499 documents, 100000 features, 188870159 non-zero entries


MmCorpus(1747499 documents, 100000 features, 188870159 non-zero entries)


In [27]:
lsi = models.LsiModel.load('dewac_LSImodel_100')

2018-11-28 20:13:57,368 : INFO : loading LsiModel object from dewac_LSImodel_100
2018-11-28 20:13:57,412 : INFO : loading id2word recursively from dewac_LSImodel_100.id2word.* with mmap=None
2018-11-28 20:13:57,413 : INFO : setting ignored attribute projection to None
2018-11-28 20:13:57,413 : INFO : setting ignored attribute dispatcher to None
2018-11-28 20:13:57,413 : INFO : loaded dewac_LSImodel_100
2018-11-28 20:13:57,414 : INFO : loading LsiModel object from dewac_LSImodel_100.projection
2018-11-28 20:13:57,800 : INFO : loaded dewac_LSImodel_100.projection


Now suppose a user typed in the query *“Computer Intelligenz”*. We would like to sort the 1,747,499 corpus documents in decreasing order of relevance to this query. Unlike modern search engines, we concentrate on a single aspect of similarity here: the apparent semantic relatedness of the texts (their words). No hyperlinks, no random-walk static ranks, just a semantic extension over the boolean keyword match:

In [44]:
doc = "Computer Intelligenz"
vec_bow = dictionary.doc2bow(doc.split())
print(vec_bow)
vec_lsi = lsi[vec_bow] # convert the query to LSI space
print(vec_lsi)

[(184, 1), (15991, 1)]
[(0, 0.029593182003438412), (1, 0.007539014057921945), (2, -0.02160713982309926), (3, -0.005549113087644178), (4, -0.0034824868502470394), (5, -0.027274555137378843), (6, -0.01935030923073703), (7, -0.031014676043065795), (8, -0.003704357303702175), (9, -0.01268476126847821), (10, -0.02895186390752488), (11, 0.001876340369223607), (12, 0.0062180141473300995), (13, -0.03633695742145685), (14, 0.017898365657640587), (15, 0.0023357964570057134), (16, -0.0358114602962959), (17, -0.013174186211293352), (18, 0.006075862561866364), (19, -0.008225892161631816), (20, -0.010348577723679761), (21, 0.04484566144346404), (22, -0.011065517935184215), (23, 0.013673016697683562), (24, 0.06806759374058592), (25, -0.039074340092499396), (26, -0.04555679800506657), (27, 0.004960719790031228), (28, -0.039189714492645626), (29, 0.0064260593211559695), (30, -0.014460057880298886), (31, 0.01307748894852384), (32, 0.002146679463034294), (33, -0.02440269856744703), (34, -0.05865720219414

In addition, we will be considering [cosine](http://en.wikipedia.org/wiki/Cosine_similarity) similarity to determine the similarity of two vectors. Cosine similarity is a standard measure in Vector Space Modeling, but wherever the vectors represent probability distributions, [different similarity measures](http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Symmetrised_divergence) may be more appropriate.
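As a minimal sketch of what the cosine measure computes, the function below evaluates it for two sparse vectors in the `(id, weight)` format used throughout this notebook. The name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of gensim's API; the index classes used later compute the same quantity in bulk.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors given as (id, weight) pairs."""
    a = dict(vec_a)
    b = dict(vec_b)
    # Only ids present in both vectors contribute to the dot product.
    dot = sum(weight * b[idx] for idx, weight in a.items() if idx in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


# Toy vectors (made up for illustration).
u = [(0, 1.0), (1, 2.0)]
v = [(0, 2.0), (1, 4.0)]  # same direction as u
w = [(2, 3.0)]            # shares no ids with u

print(cosine_similarity(u, v))
print(cosine_similarity(u, w))
```

Vectors pointing in the same direction score close to 1 regardless of their length, while vectors without any shared ids score 0.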

### Initializing query structures

To prepare for similarity queries, we need to index all documents that we want to compare against subsequent queries. In our case, they are the 1,747,499 documents of the corpus loaded above, converted to the 100-dimensional LSI space. But that’s only incidental; we might just as well index a different corpus altogether.

In [30]:
index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it

2018-11-28 20:19:04,946 : INFO : creating matrix with 1747499 documents and 100 features


> **Warning**:
> The class `similarities.MatrixSimilarity` is only appropriate when the whole set of vectors fits into memory. For example, a corpus of one million documents would require 2GB of RAM in a 256-dimensional LSI space, when used with this class.
> Without 2GB of free RAM, you would need to use the `similarities.Similarity` class. This class operates in fixed memory, by splitting the index across multiple files on disk, called shards. It uses `similarities.MatrixSimilarity` and `similarities.SparseMatrixSimilarity` internally, so it is still fast, although slightly more complex.
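The raw size of a dense index can be estimated before building it. The sketch below assumes one 32-bit float per document and feature (which I believe is the default dtype of `MatrixSimilarity`) and ignores all overhead, so it is a lower bound rather than a measurement; `dense_index_bytes` is a hypothetical helper, not a gensim function.

```python
def dense_index_bytes(num_docs, num_features, bytes_per_value=4):
    """Raw size of a dense num_docs x num_features matrix (float32 assumed by default)."""
    return num_docs * num_features * bytes_per_value


# Figures taken from the log output above: 1,747,499 documents, 100 LSI features.
size = dense_index_bytes(1747499, 100)
print(f"{size / 2**30:.2f} GiB")
```

Peak memory use while the index is built or queried can be noticeably higher than this figure, so it is worth leaving generous headroom.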

Index persistence is handled via the standard `save()` and `load()` methods:

In [31]:
index.save('dewac.index')
# To restore the saved index later:
# index = similarities.MatrixSimilarity.load('dewac.index')

2018-11-28 20:24:29,814 : INFO : saving MatrixSimilarity object under dewac.index, separately None
2018-11-28 20:24:29,815 : INFO : storing np array 'index' to dewac.index.index.npy
2018-11-28 20:24:30,116 : INFO : saved dewac.index


This is true for all similarity indexing classes (`similarities.Similarity`, `similarities.MatrixSimilarity` and `similarities.SparseMatrixSimilarity`). In what follows, `index` can be an object of any of these. When in doubt, use `similarities.Similarity`, as it is the most scalable version, and it also supports adding more documents to the index later.

### Performing queries

To obtain similarities of our query document against the indexed documents:

In [45]:
sims = index[vec_lsi] # perform a similarity query against the corpus
print(list(enumerate(sims))[:10]) # print (document_number, document_similarity) 2-tuples

[(0, 0.05882028), (1, 0.01627208), (2, -0.008270286), (3, -0.024598738), (4, 0.010590587), (5, -0.03251593), (6, 0.040067364), (7, -0.018309794), (8, 0.039401855), (9, 0.053282425)]


The cosine measure returns similarities in the range *<-1, 1>* (the greater, the more similar), so that the first document has a score of 0.05882028, and so on.

With some standard Python magic we sort these similarities into descending order. In the original gensim tutorial, which indexes a toy corpus of nine documents, this gives the final answer to the query *“Human computer interaction”*:

```
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims) # print sorted (document number, similarity score) 2-tuples

[(2, 0.99844527), # The EPS user interface management system
(0, 0.99809301), # Human machine interface for lab abc computer applications
(3, 0.9865886), # System and human system engineering testing of EPS
(1, 0.93748635), # A survey of user opinion of computer system response time
(4, 0.90755945), # Relation of user perceived response time to error measurement
(8, 0.050041795), # Graph minors A survey
(7, -0.098794639), # Graph minors IV Widths of trees and well quasi ordering
(6, -0.1063926), # The intersection graph of paths in trees
(5, -0.12416792)] # The generation of random binary unordered trees
```

(I added the original documents in their “string form” to the output comments, to improve clarity.)

The thing to note here is that documents no. 2 ("`The EPS user interface management system`") and 4 ("`Relation of user perceived response time to error measurement`") would never be returned by a standard boolean fulltext search, because they do not share any common words with "`Human computer interaction`". However, after applying LSI, we can observe that both of them received quite high similarity scores (no. 2 is actually the most similar!), which corresponds better to our intuition of them sharing a “computer-human” related topic with the query. In fact, this semantic generalization is the reason why we apply transformations and do topic modelling in the first place.

## Where next?

Congratulations, you have finished the tutorials – now you know how gensim works :-) To delve into more details, you can browse through the [API documentation](https://radimrehurek.com/gensim/apiref.html), see the [Wikipedia experiments](https://radimrehurek.com/gensim/wiki.html) or perhaps check out [distributed computing](https://radimrehurek.com/gensim/distributed.html) in gensim.

Gensim is a fairly mature package that has been used successfully by many individuals and companies, both for rapid prototyping and in production. That doesn’t mean it’s perfect though:

* there are parts that could be implemented more efficiently (in C, for example), or make better use of parallelism (multiple machines or cores)
* new algorithms are published all the time; help gensim keep up by [discussing them](http://groups.google.com/group/gensim) and [contributing code](https://github.com/piskvorky/gensim/wiki/Developer-page)
* your **feedback is most welcome** and appreciated (and it’s not just the code!): [idea contributions](https://github.com/piskvorky/gensim/wiki/Ideas-&-Features-proposals), [bug reports](https://github.com/piskvorky/gensim/issues), or [user stories and general questions](http://groups.google.com/group/gensim/topics).

Gensim has no ambition to become an all-encompassing framework, across all NLP (or even Machine Learning) subfields. Its mission is to help NLP practitioners try out popular topic modelling algorithms on large datasets easily, and to facilitate prototyping of new algorithms for researchers.

In [38]:
from pprint import pprint

sims = sorted(enumerate(sims), key=lambda item: -item[1]) # sort by descending similarity
pprint(sims[:10]) # print the 10 best (document number, similarity score) 2-tuples

[(518630, 0.9707641),
 (515343, 0.9681375),
 (319725, 0.9644809),
 (53222, 0.96094453),
 (1517932, 0.96056753),
 (1345555, 0.9604026),
 (686342, 0.9587432),
 (704754, 0.95739096),
 (249265, 0.95710075),
 (6164, 0.956691)]
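
Sorting all 1.7 million scores is more work than needed when only the ten best matches are of interest. A minimal sketch using the standard library's `heapq.nlargest` follows; `top_k` is a hypothetical helper, and the `scores` list is made-up stand-in data for the array returned by `index[vec_lsi]`.

```python
import heapq


def top_k(scores, k=10):
    """Return the k best (document_number, score) pairs, highest score first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda item: item[1])


# Stand-in similarity scores (made up); in the notebook this would be index[vec_lsi].
scores = [0.05, 0.91, -0.32, 0.40, 0.97, 0.12]
print(top_k(scores, k=3))
```

To my knowledge, gensim's index classes also accept a `num_best` argument that serves the same purpose, which may be the more convenient route in practice.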


In [39]:
corpus[518630]

[(414, 0.1365903136730849),
 (765, 0.17590438271376627),
 (780, 0.05713405394829304),
 (4423, 0.23041858142911847),
 (20803, 0.3372746962885819),
 (22589, 0.43506244688802703),
 (36560, 0.29564119039759584),
 (42951, 0.38746276962682985),
 (66615, 0.41586608786909596),
 (77043, 0.42489344990185823)]
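
The word ids in this vector are easier to interpret once they are mapped back to tokens. The sketch below assumes an id-to-token mapping that supports `mapping[id]` lookups; a gensim `Dictionary` allows this, so the `dictionary` loaded in the first cell should work in place of the toy `id2word` dict. `decode_vector` and the toy data are hypothetical.

```python
def decode_vector(vec, id2word):
    """Replace ids with tokens and order the pairs by descending weight."""
    decoded = [(id2word[idx], weight) for idx, weight in vec]
    return sorted(decoded, key=lambda item: item[1], reverse=True)


# Toy stand-ins (made up); in the notebook one would pass corpus[518630] and dictionary.
id2word = {0: "Computer", 1: "Intelligenz", 2: "Thema"}
vec = [(0, 0.2), (1, 0.9), (2, 0.4)]
print(decode_vector(vec, id2word))
```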

In [None]:
import json

with open('dewac_noun_texts.json', 'r') as fp:
    texts = json.load(fp)
    

In [43]:
texts[515343]

['Frage',
 'Schwangerschaft',
 'Jugendalter',
 'Thema',
 'Lösung',
 'Konflikt',
 'Einrichtung',
 'Unterstützung',
 'Beratung',
 'Gymnasium',
 'Mitschüler',
 'Rahmen',
 'Projekt',
 'Thema',
 'Jugendliche',
 'Schwangerschaftskonflikt\x93',
 'Schwerpunkt',
 'Schwerpunkt',
 'Recherche',
 'Thema',
 'Einrichtung',
 'Jugendliche',
 'Entscheidung',
 'Situation',
 'Ende',
 'Entscheidung',
 'Gruß',
 'Fenja',
 'Buddenberg',
 'Anmerkung',
 'Präsentation',
 'Ausarbeitung',
 'Projekt',
 'Überarbeitung',
 'Ergänzung',
 'Seite',
 'Material',
 'Recherche',
 'Thema',
 'Foto',
 'Sternipark',
 'Weile']