## Vector space model (VSM)

Cosine similarity in the vector space model [Salton et al., 1975]: $\cos(d_i,d_j) = \frac{d_i \cdot d_j}{\|d_i\|\,\|d_j\|} = \frac{\sum_k f_{ki} f_{kj}}{\sqrt{\sum_k f_{ki}^2}\,\sqrt{\sum_k f_{kj}^2}}$, where $f_{ki}$ is the frequency of term $k$ in document $i$.

If the vectors are normalized so that $\|d_i\|=\|d_j\|=1$, the formula reduces to the dot product: $\cos(d_i,d_j)=d_i \cdot d_j$.
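The definition can be checked on a toy example. The sketch below is only an illustration in plain Python: the vectors `d_i` and `d_j` and the helper names `cosine` and `normalize` are made up for this example and do not come from gensim. It computes the cosine directly and then again as a plain dot product of the normalized vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def normalize(u):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]

# Hypothetical term-frequency vectors of two documents over a 3-term vocabulary.
d_i = [1, 2, 0]
d_j = [2, 1, 1]

cos_raw = cosine(d_i, d_j)
# For unit-length vectors the dot product alone gives the same value.
cos_unit = sum(a * b for a, b in zip(normalize(d_i), normalize(d_j)))
print(cos_raw, cos_unit)
```

Both values agree up to floating-point rounding, which is why similarity indexes usually normalize document vectors once and then only compute dot products.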

Let us define a small collection of documents.

In [9]:
import warnings
warnings.filterwarnings('ignore')

from gensim import corpora, models, similarities

from pylab import pcolor, show, colorbar, xticks, yticks
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

Preprocessing

In [10]:
stoplist = set('for a of and to in'.split())  # stop words
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]  # remove stop words

from collections import defaultdict  # used to build a frequency dictionary

frequency = defaultdict(int)

for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1] for text in texts]  # drop tokens that occur only once in the collection
from pprint import pprint
pprint(texts)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['the', 'eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['the', 'trees'],
 ['the', 'graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


### Vector representation of the text collection

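We build the dictionary and then the vector representation of the texts. The vector space model is a term-document matrix:
$$
\begin{array}{c|cc}
 & d_1 & d_2 \\ \hline
w_1 & f_{11} & f_{12} \\
w_2 & f_{21} & f_{22} \\
w_3 & f_{31} & f_{32}
\end{array}
$$
$w_k$ - word (term) <br>
$d_i$ - document <br>
$f_{ki}$ - frequency of word $w_k$ in document $d_i$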


In [11]:
dictionary = corpora.Dictionary(texts)  # initialize the dictionary
print(dictionary)
print(dictionary.token2id)

Dictionary(13 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)
{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'the': 9, 'trees': 10, 'graph': 11, 'minors': 12}


In [21]:
corpus = [dictionary.doc2bow(text) for text in texts]  # the vector space model itself is stored here
# Representation of the documents in the corpus
for doc in corpus:
    print(doc)
#print(corpus)

[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1), (9, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(10, 1), (11, 1), (12, 1)]
[(4, 1), (11, 1), (12, 1)]
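Each printed line is a sparse column of the term-document matrix: a list of `(term_id, count)` pairs with the zero entries omitted. As a rough illustration, the sketch below expands that output into a dense matrix in plain Python. The names `corpus_bow` and `matrix` are chosen for this example; the data is copied from the output above. gensim also offers `gensim.matutils.corpus2dense` for the same purpose.

```python
# Bag-of-words corpus copied from the printed output:
# one list of (term_id, count) pairs per document.
corpus_bow = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1), (9, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(10, 1), (11, 1), (12, 1)],
    [(4, 1), (11, 1), (12, 1)],
]
num_terms = 13  # size of the dictionary

# Rows are terms, columns are documents; missing pairs stay zero.
matrix = [[0] * len(corpus_bow) for _ in range(num_terms)]
for doc_id, doc in enumerate(corpus_bow):
    for term_id, count in doc:
        matrix[term_id][doc_id] = count

for term_id, row in enumerate(matrix):
    print(term_id, row)
```

Row 5 (`system`) holds a 2 in column 3, because the fourth document mentions that word twice.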


### Query search

We look for the document closest to the query vector by cosine similarity.

In [22]:
q = 'human computer interaction'
vec = dictionary.doc2bow(q.lower().split())  # doc2bow converts a document into a bag of words (BoW)
print(vec)

[(0, 1), (1, 1)]
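The query vector has only two entries because `interaction` is not in the dictionary, so it is silently dropped. The sketch below imitates that behavior in plain Python. `doc2bow_sketch` is a hypothetical stand-in written for this example, not gensim's implementation; the `token2id` mapping is copied from the dictionary output shown earlier.

```python
from collections import Counter

# token -> id mapping copied from the printed dictionary.
token2id = {'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
            'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8,
            'the': 9, 'trees': 10, 'graph': 11, 'minors': 12}

def doc2bow_sketch(tokens, token2id):
    """Count tokens that are in the vocabulary; tokens outside it are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

query_vec = doc2bow_sketch('human computer interaction'.lower().split(), token2id)
print(query_vec)
```

One consequence is that a query made up entirely of unknown words yields an empty vector and cannot match any document.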


In [23]:
index = similarities.MatrixSimilarity(corpus)
print(index)

MatrixSimilarity<9 docs, 13 features>


In [27]:
sims = index[vec]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print('Q:', q)
for i in sims:
    print('doc', i[0], documents[i[0]], i[1])

Q: human computer interaction
doc 0 Human machine interface for lab abc computer applications 0.81649655
doc 1 A survey of user opinion of computer system response time 0.28867513
doc 3 System and human system engineering testing of EPS 0.28867513
doc 2 The EPS user interface management system 0.0
doc 4 Relation of user perceived response time to error measurement 0.0
doc 5 The generation of random binary unordered trees 0.0
doc 6 The intersection graph of paths in trees 0.0
doc 7 Graph minors IV Widths of trees and well quasi ordering 0.0
doc 8 Graph minors A survey 0.0
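The scores can be reproduced by hand from the bag-of-words vectors. For document 0 the dot product with the query is 2 and the norms are $\sqrt{2}$ and $\sqrt{3}$, giving $2/\sqrt{6} \approx 0.8165$. The sketch below repeats that calculation for every document in plain Python. `cosine_sparse`, `docs_bow` and `query_bow` are names chosen for this illustration, with the data copied from the outputs above; it is not how `MatrixSimilarity` is implemented internally.

```python
import math

docs_bow = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1), (9, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(10, 1), (11, 1), (12, 1)],
    [(4, 1), (11, 1), (12, 1)],
]
query_bow = [(0, 1), (1, 1)]  # 'human computer'; 'interaction' is out of vocabulary

def cosine_sparse(u, v):
    """Cosine similarity of two sparse vectors given as (term_id, weight) pairs."""
    u, v = dict(u), dict(v)
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

scores = [cosine_sparse(query_bow, doc) for doc in docs_bow]
# Sort by descending score; Python's stable sort keeps ties in document order.
ranking = sorted(enumerate(scores), key=lambda item: -item[1])
for doc_id, score in ranking:
    print('doc', doc_id, round(score, 8))
```

Documents 1 and 3 tie at $1/\sqrt{12} \approx 0.2887$: each shares exactly one term with the query and has a norm of $\sqrt{6}$.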
