# Example: topic modeling with the gensim library

Bayes' theorem is so useful! :)

![comic1](http://imgs.xkcd.com/comics/seashell.png)

In [1]:
from gensim import corpora, models

In [2]:
# Load the data in UCI Bag of Words format
data = corpora.UciCorpus("docword.xkcd.txt", "vocab.xkcd.txt")
dictionary = data.create_dictionary()
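
For readers who have not seen the UCI Bag of Words layout, here is a minimal sketch that writes a toy corpus in that format using only the standard library. The file names `docword.toy.txt` and `vocab.toy.txt`, the helper `write_uci_corpus`, and the toy data are illustrative assumptions and not part of the xkcd dataset. The docword file starts with three header lines (number of documents, vocabulary size, number of non-zero entries), followed by `docID wordID count` triplets with 1-based ids; the vocab file holds one word per line.

```python
import os
import tempfile


def write_uci_corpus(docs, vocab, directory):
    """Write docs (lists of (word_id, count) with 0-based ids) in UCI layout."""
    nnz = sum(len(doc) for doc in docs)  # number of non-zero (doc, word) pairs
    docword_path = os.path.join(directory, "docword.toy.txt")
    vocab_path = os.path.join(directory, "vocab.toy.txt")
    with open(docword_path, "w") as f:
        # Header: documents, vocabulary size, non-zero entries
        f.write(f"{len(docs)}\n{len(vocab)}\n{nnz}\n")
        for doc_id, doc in enumerate(docs, start=1):
            for word_id, count in doc:
                # Both ids are 1-based in the file
                f.write(f"{doc_id} {word_id + 1} {count}\n")
    with open(vocab_path, "w") as f:
        f.write("\n".join(vocab) + "\n")
    return docword_path, vocab_path


# Hypothetical toy data: 2 documents over a 3-word vocabulary
toy_vocab = ["comic", "hat", "stick"]
toy_docs = [[(0, 2), (2, 1)], [(1, 3)]]
toy_dir = tempfile.mkdtemp()
docword_path, vocab_path = write_uci_corpus(toy_docs, toy_vocab, toy_dir)

with open(docword_path) as f:
    docword_lines = f.read().splitlines()
print(docword_lines)
```

A pair of files written this way should be loadable with `corpora.UciCorpus(docword_path, vocab_path)`, in the same way as the xkcd files in this cell.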

In [3]:
# Train the model
%time ldamodel = models.ldamodel.LdaModel(data, id2word=dictionary, num_topics=5, passes=20, alpha=1.25, eta=1.25)

Wall time: 31 s


In [4]:
# Save the model
ldamodel.save("ldamodel_xkcd")

In [5]:
# Load the model
ldamodel = models.ldamodel.LdaModel.load("ldamodel_xkcd")

In [7]:
# Print the top words for each topic
for t, top_words in ldamodel.print_topics(num_topics=10, num_words=10):
    print("Topic", t, ":", top_words)

Topic 0 : 0.002*"b'island'" + 0.002*"b'goggles'" + 0.001*"b'found'" + 0.001*"b'jelly'" + 0.001*"b'bean'" + 0.001*"b'link'" + 0.001*"b'acne'" + 0.001*"b'005'" + 0.001*"b'map'" + 0.001*"b'blogs'"
Topic 1 : 0.011*"b'guy'" + 0.004*"b'boy'" + 0.002*"b'wait'" + 0.002*"b'hat'" + 0.002*"b'paul'" + 0.001*"b'girl'" + 0.001*"b'peter'" + 0.001*"b'sagal'" + 0.001*"b'dont'" + 0.001*"b'ron'"
Topic 2 : 0.001*"b'base'" + 0.001*"b'nathan'" + 0.001*"b'bag'" + 0.001*"b'cop'" + 0.001*"b'turtle'" + 0.001*"b'hatboy'" + 0.001*"b'curie'" + 0.001*"b'marie'" + 0.001*"b'astley'" + 0.001*"b'rick'"
Topic 3 : 0.023*"b'man'" + 0.012*"b'text'" + 0.012*"b'person'" + 0.010*"b'title'" + 0.009*"b'woman'" + 0.007*"b'one'" + 0.005*"b'girl'" + 0.005*"b'just'" + 0.005*"b'guy'" + 0.005*"b'two'"
Topic 4 : 0.003*"b'figure'" + 0.003*"b'stick'" + 0.002*"b'exhibit'" + 0.001*"b'text'" + 0.001*"b'title'" + 0.001*"b'center'" + 0.001*"b'degree'" + 0.001*"b'mark'" + 0.001*"b'day'" + 0.001*"b'map'"
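
The topic strings above are awkward to work with directly, partly because each word is wrapped in a `b'...'` bytes representation. The following is a small sketch, using only the standard library, that parses such a string into `(weight, word)` pairs. The helper name `parse_topic` is an assumption made for illustration; the sample string is a shortened copy of Topic 0 from the output above.

```python
import re

# Matches fragments such as 0.002*"b'island'"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"(.*?)"')


def parse_topic(topic_string):
    """Split a print_topics-style string into (weight, word) pairs."""
    pairs = []
    for weight, token in TERM_PATTERN.findall(topic_string):
        # Strip the b'...' wrapper left over from bytes tokens, if present
        if token.startswith("b'") and token.endswith("'"):
            token = token[2:-1]
        pairs.append((float(weight), token))
    return pairs


sample = '0.002*"b\'island\'" + 0.002*"b\'goggles\'" + 0.001*"b\'found\'"'
parsed = parse_topic(sample)
print(parsed)
```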


In [18]:
ldamodel.print_topics(num_topics=10, num_words=10)

[(0,
  '0.002*"b\'island\'" + 0.002*"b\'goggles\'" + 0.001*"b\'found\'" + 0.001*"b\'jelly\'" + 0.001*"b\'bean\'" + 0.001*"b\'link\'" + 0.001*"b\'acne\'" + 0.001*"b\'005\'" + 0.001*"b\'map\'" + 0.001*"b\'blogs\'"'),
 (1,
  '0.011*"b\'guy\'" + 0.004*"b\'boy\'" + 0.002*"b\'wait\'" + 0.002*"b\'hat\'" + 0.002*"b\'paul\'" + 0.001*"b\'girl\'" + 0.001*"b\'peter\'" + 0.001*"b\'sagal\'" + 0.001*"b\'dont\'" + 0.001*"b\'ron\'"'),
 (2,
  '0.001*"b\'base\'" + 0.001*"b\'nathan\'" + 0.001*"b\'bag\'" + 0.001*"b\'cop\'" + 0.001*"b\'turtle\'" + 0.001*"b\'hatboy\'" + 0.001*"b\'curie\'" + 0.001*"b\'marie\'" + 0.001*"b\'astley\'" + 0.001*"b\'rick\'"'),
 (3,
  '0.023*"b\'man\'" + 0.012*"b\'text\'" + 0.012*"b\'person\'" + 0.010*"b\'title\'" + 0.009*"b\'woman\'" + 0.007*"b\'one\'" + 0.005*"b\'girl\'" + 0.005*"b\'just\'" + 0.005*"b\'guy\'" + 0.005*"b\'two\'"'),
 (4,
  '0.003*"b\'figure\'" + 0.003*"b\'stick\'" + 0.002*"b\'exhibit\'" + 0.001*"b\'text\'" + 0.001*"b\'title\'" + 0.001*"b\'center\'" + 0.001*"b\'degre

In [9]:
# Compute the log perplexity (per-word likelihood bound) and convert it to the conventional perplexity
per_word_bound = ldamodel.log_perplexity(list(data))
print(2**(-per_word_bound))

351.87577329987687
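
To make the conversion explicit, here is a minimal sketch of the relation used in this cell, perplexity = 2^(-bound), following the notebook's base-2 convention. The function name `perplexity_from_bound` is hypothetical; the only number used is the perplexity printed above, from which the implied per-word bound is recovered.

```python
import math


def perplexity_from_bound(per_word_bound):
    """Convert a per-word log2-likelihood bound into perplexity."""
    return 2.0 ** (-per_word_bound)


reported = 351.87577329987687          # perplexity printed by the cell above
implied_bound = -math.log2(reported)   # per-word bound that would produce it
print(implied_bound)
print(perplexity_from_bound(implied_bound))
```

The implied bound is negative, as expected for a log-likelihood, and a lower perplexity corresponds to a bound closer to zero.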


In [19]:
perp = ldamodel.bound(data)
2**(-perp/float(87409))

inf
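
An `inf` here most likely means the exponent was too large for a float: `bound` returns a total over the corpus, so if the divisor is much smaller than the actual number of tokens, `2**x` overflows. The sketch below, using only the standard library, shows one way to normalize by the token count taken from the corpus itself and to guard against overflow. The helpers `total_tokens` and `safe_perplexity`, the toy corpus, and the bound values are assumptions made for illustration, not gensim API or results from this notebook.

```python
import math


def total_tokens(corpus):
    """Sum the word counts over all documents in a bag-of-words corpus."""
    return sum(count for doc in corpus for _, count in doc)


def safe_perplexity(total_bound, n_tokens):
    """Turn a corpus-level log2 bound into perplexity, returning inf on overflow."""
    exponent = -total_bound / n_tokens
    try:
        return 2.0 ** exponent
    except OverflowError:
        return math.inf


# Hypothetical toy corpus: documents as lists of (word_id, count)
toy_corpus = [[(0, 2), (1, 1)], [(1, 3)]]
n_tokens = total_tokens(toy_corpus)

print(n_tokens)
print(safe_perplexity(-12.0, n_tokens))   # well-scaled bound
print(safe_perplexity(-1e9, n_tokens))    # badly scaled bound overflows
```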

In [11]:
# Add new documents from a new corpus data2 to the model (data2 must be loaded first; it is not defined in this notebook)
ldamodel.update(data2, passes=10)

NameError: name 'data2' is not defined

In [17]:
# Get the topic distribution for a specific document and check that the probabilities sum to 1
doc = list(data)[0]
summ = 0
for i in ldamodel.get_document_topics(doc):
    summ += i[1]
summ

0.9999999441206455

In [20]:
ldamodel.get_document_topics(doc)

[(0, 0.056059226),
 (1, 0.15659672),
 (2, 0.056978),
 (3, 0.6590242),
 (4, 0.07134193)]
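
A list of `(topic_id, probability)` pairs like this one is easy to post-process. As a small sketch, the snippet below copies the values printed above and picks the dominant topic; the variable names are illustrative.

```python
# Topic distribution copied from the output above
doc_topics = [
    (0, 0.056059226),
    (1, 0.15659672),
    (2, 0.056978),
    (3, 0.6590242),
    (4, 0.07134193),
]

# The dominant topic is the pair with the largest probability
dominant_topic, dominant_prob = max(doc_topics, key=lambda pair: pair[1])
total_prob = sum(prob for _, prob in doc_topics)

print(dominant_topic, dominant_prob)
print(total_prob)
```

For this document the dominant topic is 3, the one whose top words include "man", "text", and "person" in the listing earlier in the notebook.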

These people don't know about topic models:

![comic2](http://imgs.xkcd.com/comics/the_problem_with_wikipedia.png) | ![comic3](http://imgs.xkcd.com/comics/mystery_news.png)