## Vector space model (VSM)

Cosine similarity in the vector space model [Salton et al., 1975], where $f_{ki}$ is the frequency of word $k$ in document $i$: $\cos(d_i,d_j) = \frac{d_i\cdot d_j}{\|d_i\|\,\|d_j\|} = \frac{\sum_k f_{ki} f_{kj}}{\sqrt{\sum_k f_{ki}^2}\,\sqrt{\sum_k f_{kj}^2}}$

If the vectors are normalized so that $\|d_i\|=\|d_j\|=1$, then $\cos(d_i,d_j)=d_i\cdot d_j$.
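The two statements above can be checked numerically. The following is a minimal sketch in plain Python (no gensim); the helper names `cosine` and `normalize` and the two frequency vectors are hypothetical and chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def normalize(u):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]

# Hypothetical frequency vectors over a 4-word vocabulary
d_i = [1, 2, 0, 1]
d_j = [0, 1, 1, 1]

cos_ij = cosine(d_i, d_j)
# For unit-length vectors the cosine reduces to a plain dot product
dot_normalized = sum(a * b for a, b in zip(normalize(d_i), normalize(d_j)))
print(cos_ij, dot_normalized)
```

Both printed numbers should agree up to floating-point rounding, which is why similarity indexes typically store unit-length vectors and only compute dot products at query time.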

Let us define a small collection of documents.

In [1]:
import warnings
warnings.filterwarnings('ignore')

from gensim import corpora, models, similarities

from pylab import pcolor, show, colorbar, xticks, yticks
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

Preprocessing

In [2]:
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]


from collections import defaultdict  # build a token frequency dictionary

frequency = defaultdict(int)

for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1] for text in texts]  # drop tokens that occur only once
from pprint import pprint
pprint(texts)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


### Vector representation of the text collection

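We build the dictionary and the vector representation of the texts. The vector space model is a word-document matrix:
$$
\begin{array}{c|cc}
 & d_1 & d_2 \\ \hline
w_1 & f_{11} & f_{12} \\
w_2 & f_{21} & f_{22} \\
w_3 & f_{31} & f_{32}
\end{array}
$$
$w_k$ - word <br>
$d_i$ - document <br>
$f_{ki}$ - frequency of word $w_k$ in document $d_i$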
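To make the matrix concrete, here is a small sketch that builds it by hand for two made-up tokenized documents. The names `toy_texts`, `vocab`, and `matrix` are hypothetical, and the data is not the notebook's corpus.

```python
# Two hypothetical tokenized documents (not the notebook's corpus)
toy_texts = [
    ["graph", "trees", "graph"],
    ["trees", "survey"],
]

# Vocabulary in sorted order; row k of the matrix corresponds to vocab[k]
vocab = sorted({token for text in toy_texts for token in text})

# matrix[k][i] = frequency of word k in document i
matrix = [[text.count(word) for text in toy_texts] for word in vocab]

for word, row in zip(vocab, matrix):
    print(f"{word:>6}", row)
```

Each column is one document vector. A dense matrix like this is mostly zeros for real collections, which is why the gensim code in this notebook stores documents as sparse lists of `(word_id, frequency)` pairs instead.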


In [3]:
dictionary = corpora.Dictionary(texts)  # initialize the dictionary (token -> integer id)
print(dictionary)
print(dictionary.token2id)

Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)
{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}


In [4]:
corpus = [dictionary.doc2bow(text) for text in texts]  # the vector space model itself: sparse (id, frequency) vectors
# Representation of the documents in the corpus
for doc in corpus:
    print(doc)
#print(corpus)

[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]
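The sparse format printed above can be imitated in a few lines of plain Python. This is only a sketch of the idea, not gensim's implementation: `to_bow` is a hypothetical helper, and the `token2id` mapping is copied from the printed dictionary.

```python
from collections import Counter

# token -> id mapping copied from the printed dictionary.token2id
token2id = {'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
            'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8,
            'trees': 9, 'graph': 10, 'minors': 11}

def to_bow(tokens, token2id):
    """Hypothetical helper: count known tokens, return sorted (id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Fourth document of the corpus: "system" occurs twice
bow = to_bow(['system', 'human', 'system', 'eps'], token2id)
print(bow)

# Tokens missing from the mapping are silently skipped
bow_unknown = to_bow(['human', 'computer', 'interaction'], token2id)
print(bow_unknown)
```

The first result corresponds to the fourth row of the printed corpus, where id 5 (`system`) has frequency 2.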


### Query search

We look for the documents closest to the query vector by cosine similarity.

In [5]:
q = 'human computer interaction'
vec = dictionary.doc2bow(q.lower().split())  # doc2bow converts a document into a bag of words (BoW); tokens missing from the dictionary are ignored
# print(vec)

for word in vec:
    print(word[0], dictionary[word[0]])

0 computer
1 human


In [6]:
vec

[(0, 1), (1, 1)]

In [7]:
index = similarities.MatrixSimilarity(corpus)
print(index)

MatrixSimilarity<9 docs, 12 features>


In [8]:
sims = index[vec]
print(sims)
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print('Q:', q)
for i in sims:
    print('doc', i[0], documents[i[0]], i[1])

[0.81649655 0.28867513 0.         0.28867513 0.         0.
 0.         0.         0.        ]
Q: human computer interaction
doc 0 Human machine interface for lab abc computer applications 0.81649655
doc 1 A survey of user opinion of computer system response time 0.28867513
doc 3 System and human system engineering testing of EPS 0.28867513
doc 2 The EPS user interface management system 0.0
doc 4 Relation of user perceived response time to error measurement 0.0
doc 5 The generation of random binary unordered trees 0.0
doc 6 The intersection graph of paths in trees 0.0
doc 7 Graph minors IV Widths of trees and well quasi ordering 0.0
doc 8 Graph minors A survey 0.0
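The non-zero scores above can be reproduced by hand from the printed sparse vectors. A minimal sketch, assuming the similarity index computes a plain cosine on the raw counts; `sparse_cosine` is a hypothetical helper.

```python
import math

# Sparse bag-of-words vectors copied from the printed corpus and query
doc0 = {0: 1, 1: 1, 2: 1}
doc1 = {0: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1}
doc3 = {1: 1, 5: 2, 8: 1}
query = {0: 1, 1: 1}

def sparse_cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {id: weight}."""
    dot = sum(w * v[i] for i, w in u.items() if i in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

bow_sims = {name: sparse_cosine(query, doc)
            for name, doc in [("doc 0", doc0), ("doc 1", doc1), ("doc 3", doc3)]}
for name, value in bow_sims.items():
    print(name, round(value, 4))
```

Document 0 shares both query words and has norm $\sqrt{3}$, giving $2/\sqrt{6}$. Documents 1 and 3 share one word each and both have norm $\sqrt{6}$, so they tie at $1/\sqrt{12}$.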


Apply the TF-IDF transformation

In [9]:
tfidf = models.TfidfModel(corpus)

In [10]:
for word_id in dictionary:
    print(f'{dictionary[word_id]} - {tfidf.dfs[word_id]} - {tfidf.idfs[word_id]:.4f}')

computer - 2 - 2.1699
human - 2 - 2.1699
interface - 2 - 2.1699
response - 2 - 2.1699
survey - 2 - 2.1699
system - 3 - 1.5850
time - 2 - 2.1699
user - 3 - 1.5850
eps - 2 - 2.1699
trees - 3 - 1.5850
graph - 3 - 1.5850
minors - 2 - 2.1699
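The printed IDF values are consistent with the formula $\mathrm{idf}(w) = \log_2(N / \mathrm{df}(w))$ for $N = 9$ documents. That this is the weighting gensim applies by default is an assumption here, inferred from the numbers; the sketch below simply recomputes two of the rows.

```python
import math

num_docs = 9  # number of documents in the corpus above

# Document frequencies copied from the printed table
dfs = {"computer": 2, "system": 3}

# Assumed formula: idf = log2(N / df)
idfs = {word: math.log2(num_docs / df) for word, df in dfs.items()}

for word, df in dfs.items():
    print(f"{word} - {df} - {idfs[word]:.4f}")
```

Words that occur in fewer documents get a larger weight: `computer` (2 documents) outweighs `system` (3 documents).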


In [11]:
corpus[1]

[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]

In [12]:
tfidf[corpus[1]]

[(0, 0.44424552527467476),
 (3, 0.44424552527467476),
 (4, 0.44424552527467476),
 (5, 0.3244870206138555),
 (6, 0.44424552527467476),
 (7, 0.3244870206138555)]
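The weights above can be recovered by multiplying each term frequency by its IDF and then scaling the vector to unit length. A sketch under the assumption that IDF is $\log_2(N/\mathrm{df})$ and that the model normalizes to unit Euclidean norm, both inferred from the printed numbers:

```python
import math

num_docs = 9
# Document frequencies of the six terms in corpus[1], from the printed table
df = {0: 2, 3: 2, 4: 2, 5: 3, 6: 2, 7: 3}
bow = [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]

# Raw weight = term frequency * idf
raw = [(term_id, tf * math.log2(num_docs / df[term_id])) for term_id, tf in bow]

# Scale to unit Euclidean length
norm = math.sqrt(sum(w * w for _, w in raw))
weights = [(term_id, w / norm) for term_id, w in raw]

for term_id, w in weights:
    print(term_id, round(w, 4))
```

Terms 5 (`system`) and 7 (`user`) occur in three documents, so they end up with the smaller weight.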

In [13]:
vec_tfidf = tfidf[vec]

In [14]:
index = similarities.MatrixSimilarity(tfidf[corpus])
sims = index[vec_tfidf]
# print(sims)
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print('Q:', q)
for i in sims:
    print('doc', i[0], documents[i[0]], i[1])

Q: human computer interaction
doc 0 Human machine interface for lab abc computer applications 0.81649655
doc 3 System and human system engineering testing of EPS 0.3477732
doc 1 A survey of user opinion of computer system response time 0.31412902
doc 2 The EPS user interface management system 0.0
doc 4 Relation of user perceived response time to error measurement 0.0
doc 5 The generation of random binary unordered trees 0.0
doc 6 The intersection graph of paths in trees 0.0
doc 7 Graph minors IV Widths of trees and well quasi ordering 0.0
doc 8 Graph minors A survey 0.0
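Documents 1 and 3 tied on raw counts, but after TF-IDF weighting document 3 ranks higher. The sketch below recomputes both scores by hand, again assuming $\log_2(N/\mathrm{df})$ weights and unit-length vectors; `idf`, `unit`, and `dot` are hypothetical helpers.

```python
import math

num_docs = 9

def idf(df):
    """Assumed IDF formula: log2(N / df)."""
    return math.log2(num_docs / df)

def unit(vec):
    """Scale a sparse {id: weight} vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {i: w / norm for i, w in vec.items()}

def dot(u, v):
    return sum(w * v[i] for i, w in u.items() if i in v)

# tf * idf weights keyed by term id; document frequencies from the printed table
query = unit({0: idf(2), 1: idf(2)})                      # computer, human
doc1 = unit({0: idf(2), 3: idf(2), 4: idf(2),
             5: idf(3), 6: idf(2), 7: idf(3)})
doc3 = unit({1: idf(2), 5: 2 * idf(3), 8: idf(2)})        # "system" occurs twice

sim1 = dot(query, doc1)
sim3 = dot(query, doc3)
print(round(sim1, 4), round(sim3, 4))
```

Each document shares exactly one word with the query, and both shared words have the same IDF. Document 1 contains four rare words, so its vector is longer before normalization and the shared word receives a smaller share of the unit length than in document 3.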


## Dimensionality reduction

Singular value decomposition: $M = U\Sigma V^T$ <br>
Dimensionality reduction via the truncated singular value decomposition: $M_k = U_k\Sigma_k V_k^T$
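Written out term by term, the truncation keeps only the $k$ largest singular values and their singular vectors. This assumes the usual ordering $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$; the symbols $u_i$, $v_i$, $\sigma_i$ (columns of $U$ and $V$, diagonal entries of $\Sigma$) and the rank $r$ are notation introduced here.

$$
M = \sum_{i=1}^{r} \sigma_i\, u_i v_i^T, \qquad M_k = \sum_{i=1}^{k} \sigma_i\, u_i v_i^T, \quad k \le r
$$

By the Eckart-Young theorem, $M_k$ is the best rank-$k$ approximation of $M$ in the Frobenius norm. A document column $d$ can then be described by $k$ coordinates, for example $U_k^T d$, instead of one coordinate per word.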


In [15]:
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)  # build the LSI model with 2 topics
print(lsi)

LsiModel(num_terms=12, num_topics=2, decay=1.0, chunksize=20000)


In [16]:
corpus[0]

[(0, 1), (1, 1), (2, 1)]

In [17]:
lsi[corpus[0]]

[(0, -0.6594664059797392), (1, -0.14211544403729964)]

In [23]:
for i in range(len(corpus)):
    print(i, lsi[corpus[i]])

0 [(0, -0.6594664059797392), (1, -0.14211544403729964)]
1 [(0, -2.0245430433828755), (1, 0.4208875824630228)]
2 [(0, -1.5465535813286544), (1, -0.32358919425712007)]
3 [(0, -1.8111412473028827), (1, -0.5890524969932468)]
4 [(0, -0.9336738035634349), (1, 0.2713894049937518)]
5 [(0, -0.01274618303829467), (1, 0.4901617924531043)]
6 [(0, -0.04888203206047077), (1, 1.1129470269929551)]
7 [(0, -0.08063836099410687), (1, 1.5634559463442654)]
8 [(0, -0.2738100392127573), (1, 1.34694158495377)]


In [18]:
lsi[vec]

[(0, -0.4618210045327156), (1, -0.07002766527900038)]

In [20]:
index = similarities.MatrixSimilarity(lsi[corpus])  # index the source texts by their vectors in the reduced space

vec_lsi = lsi[vec]  # convert the query into the lower-dimensional space
print(vec_lsi)

sims = index[vec_lsi] 
sims_lsi = sorted(enumerate(sims), key=lambda item: -item[1])
print('Q:', q)
for i in sims_lsi:
    print('doc', i[0], documents[i[0]], i[1])

[(0, -0.4618210045327156), (1, -0.07002766527900038)]
Q: human computer interaction
doc 2 The EPS user interface management system 0.9984453
doc 0 Human machine interface for lab abc computer applications 0.998093
doc 3 System and human system engineering testing of EPS 0.9865886
doc 1 A survey of user opinion of computer system response time 0.93748635
doc 4 Relation of user perceived response time to error measurement 0.90755945
doc 8 Graph minors A survey 0.050041765
doc 7 Graph minors IV Widths of trees and well quasi ordering -0.09879464
doc 6 The intersection graph of paths in trees -0.10639259
doc 5 The generation of random binary unordered trees -0.12416792
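In the two-dimensional LSI space the ranking is again a plain cosine, this time between 2-D points. The sketch below recomputes the top and bottom scores from coordinates copied from the printed output; `cosine_2d` is a hypothetical helper.

```python
import math

# 2-D LSI coordinates copied from the printed output
query = (-0.4618210045327156, -0.07002766527900038)
docs = {
    2: (-1.5465535813286544, -0.32358919425712007),  # "The EPS user interface management system"
    5: (-0.01274618303829467, 0.4901617924531043),   # "The generation of random binary unordered trees"
}

def cosine_2d(u, v):
    """Cosine of the angle between two 2-D vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(*u) * math.hypot(*v))

lsi_sims = {doc_id: cosine_2d(query, vec) for doc_id, vec in docs.items()}
for doc_id, value in lsi_sims.items():
    print("doc", doc_id, round(value, 4))
```

Document 2 shares no word with the query, yet it points in almost the same direction in the reduced space, which is why it scored 0.0 on raw counts and ranks first here.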


In [21]:
lsi.show_topics()

[(0,
  '-0.644*"system" + -0.404*"user" + -0.301*"eps" + -0.265*"time" + -0.265*"response" + -0.240*"computer" + -0.221*"human" + -0.206*"survey" + -0.198*"interface" + -0.036*"graph"'),
 (1,
  '0.623*"graph" + 0.490*"trees" + 0.451*"minors" + 0.274*"survey" + -0.167*"system" + -0.141*"eps" + -0.113*"human" + 0.107*"time" + 0.107*"response" + -0.072*"interface"')]
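The topic weights also explain the document coordinates printed earlier. The numbers are consistent with `lsi[doc]` being the dot product of the bag-of-words vector with each topic vector. This is inferred from the output, not taken from the gensim documentation. A sketch for topic 0, using the rounded weights shown above:

```python
# Topic-0 weights copied from lsi.show_topics() (rounded to 3 decimals)
topic0 = {"system": -0.644, "user": -0.404, "eps": -0.301, "time": -0.265,
          "response": -0.265, "computer": -0.240, "human": -0.221,
          "survey": -0.206, "interface": -0.198, "graph": -0.036}

doc0_tokens = ["human", "interface", "computer"]  # first document, each word once
query_tokens = ["computer", "human"]              # query words found in the dictionary

# With all frequencies equal to 1, the projection is a sum of topic weights
doc0_coord = sum(topic0[t] for t in doc0_tokens)
query_coord = sum(topic0[t] for t in query_tokens)
print(round(doc0_coord, 3), round(query_coord, 3))
```

Up to the rounding of the displayed weights, these sums agree with the first coordinates of `lsi[corpus[0]]` and `lsi[vec]`.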