# Building an expert ranking system with the BigARTM topic modeling library

To train the model, we used a database of experts built from HTML pages that contained the author's full name, the title and abstract of the article, and its keywords. The database holds 2637 documents. The input data was supplied in Vowpal Wabbit format, generated by a purpose-written Python script.
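The conversion script itself is not shown here. As a rough illustration of the target format, the sketch below builds one multi-modal Vowpal Wabbit line with the `@text` and `@author` modalities used later in this notebook. The function name `to_vw_line` and the cleaning rule are assumptions made for the example, not the original script.

```python
def to_vw_line(doc_id, text_tokens, authors):
    """Format one document as a multi-modal Vowpal Wabbit line (hypothetical helper)."""
    def clean(token):
        # Spaces, '|' and ':' act as separators in VW, so replace them.
        for ch in (" ", "|", ":"):
            token = token.replace(ch, "_")
        return token

    text_part = " ".join(clean(t) for t in text_tokens)
    author_part = " ".join(clean(a) for a in authors)
    return f"{doc_id} |@text {text_part} |@author {author_part}"


line = to_vw_line(1, ["сеть", "построение"], ["зусман в. г."])
print(line)
```

Each document becomes one line: a document id followed by one `|@modality` section per modality.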

Training the model.

Import the required libraries:

In [52]:
import artm 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Create a batch_vectorizer object, which builds the batches for training the model if you do not have them yet.

In [53]:
batch_vectorizer = artm.BatchVectorizer(data_path='C:/Users/azemlyan/Desktop/bigartm_test/training_hse_model/data/training_data.vw', data_format='vowpal_wabbit', 
                                        target_folder='C:/Users/azemlyan/Desktop/bigartm_test/training_hse_model/batch_file/')

If the batches already exist, use the cell below instead:

In [54]:
batch_vectorizer = artm.BatchVectorizer(data_path='C:/Users/azemlyan/Desktop/bigartm_test/training_hse_model/batch_file/',
                                        data_format='batches')

The next step is to create a dictionary that stores information about all unique words in the collection.

In [55]:
dictionary = artm.Dictionary()
dictionary.gather(data_path='C:/Users/azemlyan/Desktop/bigartm_test/training_hse_model/batch_file/',
                  vocab_file_path='C:/Users/azemlyan/Desktop/bigartm_test/training_hse_model/vocabulary/vocab.txt')
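To make "all unique words in the collection" concrete, here is a minimal pure-Python sketch that collects the unique (modality, token) pairs from Vowpal Wabbit lines. It only illustrates the idea and is not how `artm.Dictionary.gather` is implemented; the helper name `gather_vocabulary` and the sample lines are made up for the example.

```python
def gather_vocabulary(vw_lines):
    """Collect the unique (modality, token) pairs from Vowpal Wabbit lines."""
    vocab = set()
    for line in vw_lines:
        modality = None
        # The first field is the document id, so it is skipped.
        for field in line.split()[1:]:
            if field.startswith("|"):
                modality = field[1:]            # e.g. "@text"
            elif modality is not None:
                token = field.split(":")[0]     # drop an optional ":count" suffix
                vocab.add((modality, token))
    return vocab


lines = [
    "1 |@text сеть построение сеть |@author зусман_в._г.",
    "2 |@text построение:2 |@author at@kg.ru",
]
vocab = gather_vocabulary(lines)
print(sorted(vocab))
```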

Save the dictionary for convenience; from now on we will only load it.

In [57]:
dictionary.save(dictionary_path='C:/Users/azemlyan/Desktop/bigartm_test/training_hse_model/vocabulary/vocab.dict')
dictionary.save_text(dictionary_path='C:/Users/azemlyan/Desktop/bigartm_test/training_hse_model/vocabulary/my_dictionary.txt')

Load the dictionary from disk.

In [58]:
dictionary.load_text(dictionary_path='C:/Users/azemlyan/Desktop/bigartm_test/training_hse_model/vocabulary/my_dictionary.txt')

Create the model:

In [59]:
model = artm.ARTM(num_topics=10, topic_names=["topic_" + str(i) for i in range(10)], class_ids={'@author': 3.0, '@text': 10.0})
model.cache_theta = True

The model holds a matrix Φ of size (number of words in the dictionary) × (number of topics, here 10).
Quality metrics are set through the scores field of the ARTM class. We add the most probable tokens per topic and the sparsity of the matrices Φ and Θ. The collection has two modalities: @text is the text modality and @author is the author modality; the Φ-based scores below are computed for @text.
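For reference, the roles of the two matrices can be written out as the standard topic-model factorization. This is the textbook formulation and is stated here as background, with $W$ the vocabulary, $T$ the set of topics and $D$ the set of documents:

```latex
p(w \mid d) = \sum_{t \in T} \varphi_{wt}\,\theta_{td},
\qquad
\Phi = (\varphi_{wt}) \in \mathbb{R}^{|W| \times |T|},
\quad
\Theta = (\theta_{td}) \in \mathbb{R}^{|T| \times |D|},
\qquad
\varphi_{wt} = p(w \mid t), \quad \theta_{td} = p(t \mid d).
```

Each column of Θ is a probability distribution over topics, and within one modality each column of Φ is a probability distribution over that modality's tokens.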

In [60]:
model.scores.add(artm.TopTokensScore(name='top_words', num_tokens = 5,  class_id='@text'))
model.scores.add(artm.SparsityPhiScore(name='sparsity_phi_score', class_id='@text'))
model.scores.add(artm.SparsityThetaScore(name='sparsity_theta_score'))
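As a rough illustration of what a sparsity score measures, the sketch below computes the fraction of near-zero entries in a small matrix. The threshold `eps` and the toy matrix are assumptions for the example; the library applies its own threshold and works on the real Φ and Θ.

```python
def sparsity(matrix, eps=1e-6):
    """Fraction of entries whose absolute value is below eps."""
    total = sum(len(row) for row in matrix)
    near_zero = sum(1 for row in matrix for value in row if abs(value) < eps)
    return near_zero / total


# Toy 3-token x 3-topic matrix; every column sums to 1.
phi_toy = [
    [0.5, 0.0, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
print(sparsity(phi_toy))
```

A higher value means that more tokens have (almost) zero probability in a topic, which usually makes topics easier to read.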

Regularizers are used to improve the quality of the model; their purpose is to make the topics more interpretable.

In [61]:
model.regularizers.add(artm.DecorrelatorPhiRegularizer(name='decorrelator_phi_def', class_ids=['@author']))
model.regularizers.add(artm.DecorrelatorPhiRegularizer(name='decorrelator_phi_lab', class_ids=['@text']))
model.regularizers.add(artm.SmoothSparseThetaRegularizer(name='smooth_sparse_theta_regularizer', tau = 0.029011))
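To give an intuition for the decorrelation regularizer: it penalizes topics whose token distributions overlap. The sketch below computes that overlap as the sum of dot products between every pair of distinct topic columns of Φ. It illustrates the penalized quantity only, under the assumption of the usual ARTM formulation; it is not the library's implementation, and the function name is made up.

```python
def topic_overlap(phi):
    """Sum of dot products between every pair of distinct topic columns."""
    n_topics = len(phi[0])
    total = 0.0
    for t in range(n_topics):
        for s in range(t + 1, n_topics):
            total += sum(row[t] * row[s] for row in phi)
    return total


overlapping = [[0.5, 0.5], [0.5, 0.5]]   # two identical topics
distinct = [[1.0, 0.0], [0.0, 1.0]]      # two topics with disjoint tokens
print(topic_overlap(overlapping), topic_overlap(distinct))
```

The regularizer pushes the model toward the second situation, where each token is characteristic of few topics.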

Initialize the model, which fills the matrix Φ with random numbers.

In [62]:
model.initialize(dictionary)

Start training the model. Because the data volume is large, we use the fit_online() method.

In [63]:
model.fit_online(batch_vectorizer=batch_vectorizer)#, num_collection_passes=800)

We can now inspect the result of training, that is, the resulting matrices Φ and Θ, using the get_phi() and get_theta() methods.

In [64]:
phi_author = model.get_phi(class_ids=['@author'])
theta = model.get_theta()

The matrix Φ for the @author modality looks as follows:

In [65]:
phi_author

author,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9
elibrary.ru/author_items.asp?authorid=108750,0.000038,0.000064,0.000226,6.119387e-04,0.000521,0.000037,0.000376,1.184747e-05,0.000043,5.867846e-04
o.m.ogorodnikova@bk.ru,0.000012,0.000024,0.000025,5.453406e-04,0.000084,0.000053,0.000389,2.677785e-05,0.000013,9.679207e-05
"abrashkin_a._a.,_oshmarina_o._e.",0.000229,0.000633,0.000847,2.717823e-04,0.000410,0.000144,0.000059,1.650735e-04,0.000081,1.024190e-04
at@kg.ru,0.000035,0.000049,0.000047,8.679819e-05,0.000021,0.000354,0.000081,6.437257e-05,0.000600,1.935749e-04
"зусман_в._г.,_в._г._зусман",0.001067,0.000018,0.000087,3.644360e-06,0.000018,0.000053,0.000061,1.152018e-04,0.000125,4.559410e-05
"zusman_v.,_gronskaya_n._e.,_batishcheva_t.",0.000122,0.000041,0.000054,4.995708e-06,0.000548,0.000165,0.000011,1.690591e-04,0.000230,1.355336e-04
elibrary.ru/author_items.asp?authorid=234569,0.000290,0.000017,0.000028,1.990163e-04,0.000072,0.000101,0.000141,1.639241e-04,0.000392,4.901796e-05
elibrary.ru/author_items.asp?authorid=592352,0.000037,0.000120,0.000137,4.408215e-04,0.000009,0.000036,0.000252,1.033740e-04,0.000132,9.074202e-05
elibrary.ru/author_items.asp?authorid=527392,0.000310,0.000486,0.000169,8.705339e-07,0.000039,0.000733,0.000026,1.256353e-05,0.000045,1.869095e-05
"визгунов_а._н.,_савченко_а._в.",0.000209,0.000034,0.000020,4.083183e-05,0.000165,0.000177,0.000422,2.274315e-04,0.000070,1.108250e-04


The matrix Φ for the @text modality looks as follows:

In [66]:
phi_text = model.get_phi(class_ids=['@text'])

In [67]:
phi_text

token,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9
transmission,2.590658e-05,3.882828e-05,1.707290e-06,1.682468e-06,5.815480e-05,2.024640e-06,4.893015e-05,9.341925e-06,2.076288e-05,4.553125e-06
сеть,8.420338e-04,2.907556e-04,8.206717e-05,1.830937e-03,3.466067e-03,1.619271e-04,2.045593e-03,9.137110e-04,9.754939e-04,5.369938e-05
ideal,3.048514e-06,1.121965e-05,9.961649e-06,2.739769e-05,5.180666e-06,1.516992e-06,1.746118e-06,4.769998e-06,1.115907e-06,6.740835e-06
wren,6.911568e-06,7.942217e-06,6.748610e-07,3.852182e-06,4.590440e-06,4.432752e-06,7.974114e-06,8.431519e-06,2.064039e-08,5.112323e-07
приборный,4.033723e-07,8.272171e-07,5.833193e-06,1.961491e-05,5.695474e-06,2.530003e-08,6.646956e-06,1.314358e-07,2.210485e-06,9.643732e-06
выгладить,5.834614e-07,3.937947e-07,2.851472e-06,1.606316e-06,5.017119e-06,8.072992e-07,1.706958e-07,2.726740e-06,6.452501e-06,1.187356e-06
построение,1.242957e-03,3.014067e-04,1.804785e-04,1.321414e-03,5.252755e-04,3.856828e-04,1.401349e-03,5.284331e-04,1.725259e-03,1.588798e-03
attitude,1.485835e-06,2.872867e-05,4.702157e-06,2.602811e-06,1.837073e-06,1.166224e-05,9.628765e-06,3.563604e-06,3.825849e-07,8.020012e-06
admissible,1.694932e-08,1.032386e-05,8.645195e-07,2.167004e-07,3.464460e-05,2.316794e-08,7.269249e-07,2.132214e-08,1.963193e-08,2.653954e-08
пятнадцать,4.443397e-06,3.057447e-06,4.952768e-06,2.821595e-06,1.389761e-06,4.163360e-05,3.379511e-06,1.288842e-05,3.359934e-06,2.508248e-06


Now look at the matrix Θ, which shows the distribution of topics over documents.

In [68]:
theta

topic,4001,4002,4003,4004,4005,4006,4007,4008,4009,4010,...,2991,2992,2993,2994,2995,2996,2997,2998,2999,3000
topic_0,0.000535,0.043354,0.00076,0.012868,0.139058,0.262287,0.184035,0.030403,0.018266,0.201398,...,0.101567,0.280721,0.091683,0.311962,0.020526,0.313251,0.263394,0.037135,0.136657,0.001534
topic_1,0.000115,4.8e-05,0.000516,0.00026,7.6e-05,0.000164,8e-05,0.000109,7.7e-05,9e-05,...,0.052588,0.011607,0.003445,0.008625,0.00268,0.010771,0.006658,0.022612,0.008728,0.001361
topic_2,0.000371,0.015603,0.000488,0.00833,0.004483,0.001412,0.000433,0.003313,0.00033,0.003633,...,0.038003,0.015608,0.178857,0.136195,0.294865,0.134088,0.094863,0.250665,0.409013,0.00164
topic_3,0.000176,0.089861,0.010356,0.005434,0.001253,0.00155,0.011047,0.023873,0.000367,0.004183,...,0.012837,0.044459,0.020378,0.00734,0.01301,0.000153,0.026746,0.027465,0.02136,0.061417
topic_4,0.000102,0.006618,0.000595,0.036712,0.00036,0.001131,0.00011,9.1e-05,7.9e-05,0.000365,...,0.014278,0.00475,0.036688,0.032521,0.007351,0.003966,0.016404,0.241641,0.06177,0.328339
topic_5,0.00017,0.059893,0.002872,0.003599,0.004002,0.002575,0.02188,0.008125,0.00348,0.007752,...,0.204043,0.034375,0.062895,0.048153,0.009254,0.0847,0.019328,0.153629,0.00418,0.020129
topic_6,0.052805,0.056325,0.016768,0.181793,0.004374,0.018686,0.001753,0.00333,0.00926,0.039677,...,0.073709,0.142937,0.007228,0.035695,0.010418,0.008085,0.005095,0.03063,0.014257,0.019144
topic_7,0.50883,0.403284,0.620258,0.292374,0.783397,0.128164,0.39367,0.300786,0.241388,0.43308,...,0.333585,0.30403,0.473879,0.11973,0.564694,0.421597,0.029871,0.162193,0.232407,0.000854
topic_8,0.433577,0.289011,0.307417,0.406962,0.026897,0.553621,0.3833,0.57398,0.057464,0.280332,...,0.145038,0.141488,0.014355,0.038248,0.040092,0.008808,0.06424,0.016186,0.023135,0.001569
topic_9,0.003319,0.036003,0.03997,0.051668,0.0361,0.030409,0.003691,0.05599,0.669289,0.029489,...,0.024353,0.020025,0.110592,0.261532,0.037111,0.01458,0.473401,0.057844,0.088494,0.564011
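Each column of Θ is a probability distribution over topics for one document. The short check below uses the values printed above for document 4001 to confirm that the column sums to 1 and to find its dominant topic. The variable names are chosen for the example.

```python
topics = [f"topic_{i}" for i in range(10)]

# Column 4001 of theta, copied from the printed table (topic_0 .. topic_9).
theta_4001 = [0.000535, 0.000115, 0.000371, 0.000176, 0.000102,
              0.00017, 0.052805, 0.50883, 0.433577, 0.003319]

total = sum(theta_4001)
# Pair each weight with its topic name and take the pair with the largest weight.
dominant = max(zip(theta_4001, topics))[1]

print(round(total, 4), dominant)
```

For this document most of the probability mass falls on topic_7 and topic_8.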


Save the trained model to disk for later use, so that the steps above can be skipped.

In [69]:
model.save(filename='C:/Users/azemlyan/Desktop/bigartm_test/training_hse_model/build_model/my_model')

We now run an experiment whose goal is to obtain a ranked list of experts for an arbitrary text that is fed to the trained model.
For the new data we build new batches.

In [70]:
batch_vectorizer = artm.BatchVectorizer(data_path='C:/Users/azemlyan/Desktop/bigartm_test/testing_model/data/testing_data.vw', data_format='vowpal_wabbit', 
                                        target_folder='C:/Users/azemlyan/Desktop/bigartm_test/testing_model/batch_file/')

If the batches already exist, use the cell below instead:

In [71]:
batch_vectorizer = artm.BatchVectorizer(data_path='C:/Users/azemlyan/Desktop/bigartm_test/testing_model/batch_file/',
                                        data_format='batches')

Create a new dictionary that stores information about the new text collection.

In [72]:
dictionary = artm.Dictionary()
dictionary.gather(data_path='C:/Users/azemlyan/Desktop/bigartm_test/testing_model/batch_file/',
                  vocab_file_path='C:/Users/azemlyan/Desktop/bigartm_test/testing_model/vocabulary/vocab.txt')

Save the dictionary for convenience; from now on we will only load it.

In [73]:
dictionary.save(dictionary_path='C:/Users/azemlyan/Desktop/bigartm_test/testing_model/vocabulary/vocab.dict')
dictionary.save_text(dictionary_path='C:/Users/azemlyan/Desktop/bigartm_test/testing_model/vocabulary/my_dictionary.txt')

Load the dictionary from disk.

In [74]:
dictionary.load_text(dictionary_path='C:/Users/azemlyan/Desktop/bigartm_test/testing_model/vocabulary/my_dictionary.txt')

Create the model:

In [75]:
model = artm.ARTM(num_topics=10, class_ids={'@author': 20.0, '@text': 20.0},  cache_theta = True)

Load the trained model from disk:

In [76]:
model.load(filename='C:/Users/azemlyan/Desktop/bigartm_test/training_hse_model/build_model/my_model')

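Initialize the model with the new dictionary. Note that initialize() rebuilds Φ from this dictionary, so the Φ printed further down contains only the tokens of the new collection.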

In [77]:
model.initialize(dictionary)

Fit the model on the new data:

In [78]:
model.fit_online(batch_vectorizer=batch_vectorizer)

To obtain the matrix Θ (the topic distribution of each document), run:

In [79]:
theta_test = model.transform(batch_vectorizer=batch_vectorizer, predict_class_id='@author')

Extract the matrix:

In [80]:
theta_test = model.get_theta()

Look at its contents:

In [81]:
theta_test

topic,1
topic_0,0.104592
topic_1,0.102676
topic_2,0.154134
topic_3,0.072618
topic_4,0.103037
topic_5,0.079413
topic_6,0.11919
topic_7,0.082083
topic_8,0.104208
topic_9,0.078051


Extract the matrix Φ:

In [82]:
phi_test = model.get_phi()

Look at its contents:

In [83]:
phi_test

token,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9
влиянием,0.032896,0.018318,0.000427,0.033951,0.011325,0.037496,0.021393,0.008958,0.00689,0.000268
квантовой,0.057806,0.014236,0.071678,0.003495,0.059985,0.004008,0.020352,0.041987,0.018677,0.004032
создание,0.014876,0.017419,0.017938,0.008901,0.032304,0.026051,0.011382,0.006263,0.011596,0.015106
измерить,0.009146,0.025133,0.0197,0.026846,0.021285,0.024724,0.009627,0.016404,0.008226,0.003672
хрупкости,0.010013,0.007885,0.010668,0.012586,0.027479,0.013438,0.029672,0.019471,0.014945,0.018537
работоспособного,0.030953,0.000324,0.013061,0.036494,0.027744,0.001437,0.015174,0.023454,0.015196,0.000699
под,0.02355,0.023688,0.022021,0.019307,0.001202,0.022448,0.012121,0.015275,0.004966,0.021371
канала,0.007215,0.020875,0.016381,0.00805,0.005457,0.018693,0.025889,0.012624,0.0232,0.026396
изменяются,0.018926,0.015705,0.005152,0.024198,0.004331,0.018009,0.032193,0.005388,0.021713,0.023779
передачи,0.000826,0.045234,0.041824,0.020928,0.010775,0.020714,0.043004,0.002327,0.021087,0.027047


Convert the @author matrix Φ and the new matrix Θ to DataFrame format:

In [84]:
df_phi_author = pd.DataFrame(phi_author)
df_theta_test = pd.DataFrame(theta_test)

Extract the row labels of the matrix Φ and the column labels of the new matrix Θ:

In [85]:
row = list(df_phi_author.index)
col = list(df_theta_test)
# Prefix every document id with 'text_' to label the result columns.
col_t = ['text_' + str(c) for c in col]

Compute the matrix product of phi_author and theta_test:

In [86]:
# .as_matrix() was removed from pandas; .values returns the same NumPy array.
scalar_product = np.dot(phi_author.values, theta_test.values)

Create a new DataFrame from the matrix product, where the columns are the input text documents, the rows are the authors, and each cell holds the dot product of the corresponding author row and document column:

In [87]:
scalar_product_df = pd.DataFrame(data = scalar_product, index = row, columns = col_t)
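To make the product concrete, the sketch below recomputes the score of a few authors by hand as the sum over topics of φ(author, topic) · θ(topic, document). The numbers are copied from the phi_author and theta_test tables printed earlier; the names `phi_rows`, `theta_col` and `author_score` are chosen for the example.

```python
# theta_test column for the single input document (topic_0 .. topic_9).
theta_col = [0.104592, 0.102676, 0.154134, 0.072618, 0.103037,
             0.079413, 0.11919, 0.082083, 0.104208, 0.078051]

# Three rows of phi_author, copied from the printed table.
phi_rows = {
    "elibrary.ru/author_items.asp?authorid=108750": [
        0.000038, 0.000064, 0.000226, 6.119387e-04, 0.000521,
        0.000037, 0.000376, 1.184747e-05, 0.000043, 5.867846e-04],
    "o.m.ogorodnikova@bk.ru": [
        0.000012, 0.000024, 0.000025, 5.453406e-04, 0.000084,
        0.000053, 0.000389, 2.677785e-05, 0.000013, 9.679207e-05],
    "abrashkin_a._a.,_oshmarina_o._e.": [
        0.000229, 0.000633, 0.000847, 2.717823e-04, 0.000410,
        0.000144, 0.000059, 1.650735e-04, 0.000081, 1.024190e-04],
}


def author_score(phi_row, theta):
    """Dot product of one author's row of phi with one document's column of theta."""
    return sum(p * t for p, t in zip(phi_row, theta))


scores = {author: author_score(row, theta_col) for author, row in phi_rows.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
for author in ranked:
    print(author, scores[author])
```

Because the table entries are rounded for display, the hand-computed scores agree with the notebook's values only up to that rounding.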

This gives an unsorted vector of expert weights for the input document:

In [88]:
scalar_product_df

author,text_1
elibrary.ru/author_items.asp?authorid=108750,0.000243
o.m.ogorodnikova@bk.ru,0.000117
"abrashkin_a._a.,_oshmarina_o._e.",0.000330
at@kg.ru,0.000145
"зусман_в._г.,_в._г._зусман",0.000167
"zusman_v.,_gronskaya_n._e.,_batishcheva_t.",0.000145
elibrary.ru/author_items.asp?authorid=234569,0.000141
elibrary.ru/author_items.asp?authorid=592352,0.000132
elibrary.ru/author_items.asp?authorid=527392,0.000181
"визгунов_а._н.,_савченко_а._в.",0.000147


Sort it in descending order so that the most relevant experts appear at the top of the list:

In [89]:
# DataFrame.sort() was removed from pandas; sort_values() is its replacement.
sort_list_experts = scalar_product_df.sort_values('text_1', ascending=False, na_position='last')


We obtained a ranked list of experts for the input document, which looks as follows:

In [90]:
sort_list_experts

author,text_1
панченко_п._н.,0.021976
романова_т._в.,0.017965
ковтун_н._н.,0.012850
pelinovsky_e.,0.012078
радина_н._к.,0.009503
пчелкин_а._в.,0.008712
гапонова_о._с.,0.008436
цветкова_м._в.,0.007486
михеева_и._в.,0.006828
пелиновский_е._н.,0.006710
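When only the first few experts are needed, a full sort is not strictly required. The sketch below picks the top k entries with `heapq.nlargest` from the standard library, using the first five scores of the ranked list above. This is an alternative shown for illustration, not the approach used in this notebook.

```python
import heapq

# First five (author, score) pairs copied from the ranked list above.
scores = {
    "панченко_п._н.": 0.021976,
    "романова_т._в.": 0.017965,
    "ковтун_н._н.": 0.012850,
    "pelinovsky_e.": 0.012078,
    "радина_н._к.": 0.009503,
}

k = 3
# nlargest returns exactly k items, ordered from the highest score down.
top_k = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
for author, score in top_k:
    print(author, score)
```

The same count applies to slicing a sorted list: `ranked[0:k]` returns exactly k items, so a top 10 needs the slice `[0:10]`.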


Get the top 10 experts:

In [91]:
sort_list_experts[0:10]

author,text_1
панченко_п._н.,0.021976
романова_т._в.,0.017965
ковтун_н._н.,0.01285
pelinovsky_e.,0.012078
радина_н._к.,0.009503
пчелкин_а._в.,0.008712
гапонова_о._с.,0.008436
цветкова_м._в.,0.007486
михеева_и._в.,0.006828
пелиновский_е._н.,0.00671


Plot the expert rating.

In [92]:
# Slice from 0 so that the top-ranked expert is included in the chart.
sort_list_experts[0:10].plot(kind='barh')
plt.xlabel('Probability')
plt.ylabel('Expert ID')
plt.title('Expert rating')
plt.grid(True)

In [51]:
plt.show()

# Conclusions:
Using the BigARTM library, we implemented a recommender system for finding experts. For a document of interest, it returns a list of experts ranked by their relevance to that document.