## Text similarity methods ##

https://www.quora.com/What-are-the-most-popular-text-similarity-algorithms

1. doc2vec + cosine measure https://code.google.com/archive/p/word2vec/ 
<br>
    GloVe + cosine https://nlp.stanford.edu/projects/glove/
2. Jaccard similarity 
<br>
    https://en.wikipedia.org/wiki/Jaccard_index
<br>
    https://nickgrattan.wordpress.com/2014/02/18/jaccard-similarity-index-for-measuring-document-similarity/
3. Locality-sensitive hashing  
    https://en.wikipedia.org/wiki/Locality-sensitive_hashing 
<br>
    https://github.com/kayzhu/LSHash
<br>
    http://www.mmds.org/
4. Cosine Similarity and IDF Modified Cosine Similarity https://www.youtube.com/watch?v=C3Jt14Se9Cg&feature=youtu.be
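
Item 3 in the list rests on the idea that short hashed signatures can approximate Jaccard similarity. Below is a minimal MinHash sketch using only the standard library. The function names `minhash_signature` and `estimate_jaccard`, the use of MD5, and the choice of 128 hash functions are illustrative assumptions; none of them comes from LSHash or from this notebook.

```python
import hashlib

def _hash(token, seed):
    # Deterministic 64-bit hash of a token, salted with the seed.
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(tokens, num_hashes=128):
    # For every salted hash function keep the minimum value over the token set.
    # Assumes a non-empty token collection.
    tokens = set(tokens)
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    # The share of matching signature positions estimates the Jaccard similarity.
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

doc_a = "полярный сияние возникать благодаря солнечный ветер".split()
doc_b = "полярный сияние определять цвета солнечный ветер".split()
# The exact Jaccard similarity of these two word sets is 4 / 8 = 0.5;
# the printed value is an estimate of it.
print(estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))
```

The estimate gets tighter as `num_hashes` grows, at the cost of longer signatures.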

### Jaccard Similarity

In [1]:
import sys
sys.path.append(r'C:\Program Files\Anaconda3\Lib\site-packages')  # raw string keeps the backslashes literal

In [2]:
import pandas as pd
import re
#from nltk.corpus import stopwords
import sklearn
#from sklearn.pipeline import Pipeline

#stopwords_rus = stopwords.words('russian')
from stop_words import get_stop_words
stopwords = get_stop_words('russian')

In [3]:
from tqdm import tqdm

In [4]:
metadata = pd.read_csv('meta_rubrics_final.tsv', encoding = 'utf-8', sep = '\t')

In [121]:
#without lemmatization and stop-words
def simple_clean(texts):
    preprocessed_texts_list = []
    for text in tqdm(texts):
        #del_new_line = re.sub(r'\n', '', text.lower()) 
        extracted_text = re.findall(r'[a-zа-яё]+', text.lower())  # split into tokens
        extracted_text = ' '.join(extracted_text)
        preprocessed_texts_list.append(extracted_text)
    return preprocessed_texts_list
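
To see what the token pattern in `simple_clean` keeps and drops, here is a small sketch on a made-up sentence. The helper name `tokenize` and the sample string are assumptions for illustration only.

```python
import re

def tokenize(text):
    # Same pattern as in simple_clean: runs of Latin or Cyrillic letters only.
    # 'ё' is listed separately because it lies outside the а-я code point range.
    return re.findall(r'[a-zа-яё]+', text.lower())

sample = "CERN объявил: 750-ГэВ частицы нет!"
print(tokenize(sample))
# → ['cern', 'объявил', 'гэв', 'частицы', 'нет']
```

Digits and punctuation are discarded, so tokens such as `750` never reach the vectorizer.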

In [5]:
small = metadata[metadata.final_rubrics != 'Мусор'].head(100)  # skip the 'Мусор' ("Garbage") rubric

In [6]:
articles_small = pd.DataFrame(data = list(zip(small.path, small.title, small.tags, small.final_rubrics, small.number_of_rubrics)), 
                        columns = ['path', 'title', 'tags', 'rubrics', 'rubrics_number'])

In [7]:
articles_small.head()

Unnamed: 0,path,title,tags,rubrics,rubrics_number
0,chrdk.ru/articles/sci_10_salt_lakes.txt,Десять самых известных соленых озер,География_Экология,Науки о земле,One
1,chrdk.ru/articles/sci_33_fractures.txt,33 перелома,Российская наука_Антропология,История,One
2,chrdk.ru/articles/sci_46_chromosomes.txt,46 — норма?,Генетика_Медицина,Физиология человека,One
3,chrdk.ru/articles/sci_750gev.txt,Несбывшиеся надежды на новую физику,Физика_Интервью_Закрытия,Физика,One
4,chrdk.ru/articles/sci_alien_anatomy.txt,Анатомия каменных пришельцев,Геология_Космос,Космос,Multi


In [8]:
texts = []
for path in tqdm(articles_small.path):
    path = path.replace('\n','').replace('//','/').replace('?code=','-code=')
    try:
        with open('C:/Users/Анна/YandexDisk/popular_science_texts_store/' + path, encoding = 'utf-8') as f:
            texts.append(f.read())
    except OSError:
        texts.append('')

100%|███████████████████████████████████████| 100/100 [00:00<00:00, 518.11it/s]


In [9]:
articles_small['text'] = texts

In [10]:
articles_small.head()

Unnamed: 0,path,title,tags,rubrics,rubrics_number,text
0,chrdk.ru/articles/sci_10_salt_lakes.txt,Десять самых известных соленых озер,География_Экология,Науки о земле,One,"Возможно, не все об этом знают, но объемы воды..."
1,chrdk.ru/articles/sci_33_fractures.txt,33 перелома,Российская наука_Антропология,История,One,Останки мужчины с зажившими переломами нашли п...
2,chrdk.ru/articles/sci_46_chromosomes.txt,46 — норма?,Генетика_Медицина,Физиология человека,One,"В отличие от зубов, хромосом человеку положено..."
3,chrdk.ru/articles/sci_750gev.txt,Несбывшиеся надежды на новую физику,Физика_Интервью_Закрытия,Физика,One,"В начале августа CERN официально объявил, что ..."
4,chrdk.ru/articles/sci_alien_anatomy.txt,Анатомия каменных пришельцев,Геология_Космос,Космос,Multi,Как выглядят и чем отличаются друг от друга го...


In [11]:
import pymorphy2

In [12]:
morph = pymorphy2.MorphAnalyzer()

In [13]:
def preprocess(texts, stopwords):
    clean_texts = []
    for text in tqdm(texts):
        words = re.findall(r'[a-zа-яё]+', text.lower())  # split into tokens
        #words = re.findall(r'[а-яё]+', text.lower())
        lemmas = [morph.parse(word)[0].normal_form for word in words if word not in stopwords]
        clean_texts.append(' '.join(lemmas))
    return clean_texts

In [14]:
articles_small.text = preprocess(articles_small.text, stopwords)

100%|████████████████████████████████████████| 100/100 [00:52<00:00,  1.91it/s]


In [15]:
articles_small.head(50)

Unnamed: 0,path,title,tags,rubrics,rubrics_number,text
0,chrdk.ru/articles/sci_10_salt_lakes.txt,Десять самых известных соленых озер,География_Экология,Науки о земле,One,возможно знать объём вода пресный солёный озер...
1,chrdk.ru/articles/sci_33_fractures.txt,33 перелома,Российская наука_Антропология,История,One,останки мужчина зажить перелом найти раскопка ...
2,chrdk.ru/articles/sci_46_chromosomes.txt,46 — норма?,Генетика_Медицина,Физиология человека,One,отличие зуб хромосома человек положить строго ...
3,chrdk.ru/articles/sci_750gev.txt,Несбывшиеся надежды на новую физику,Физика_Интервью_Закрытия,Физика,One,начало август cern официально объявить частица...
4,chrdk.ru/articles/sci_alien_anatomy.txt,Анатомия каменных пришельцев,Геология_Космос,Космос,Multi,выглядеть отличаться друг друг гость различный...
5,chrdk.ru/articles/sci_almost_lifelike.txt,Почти как живой,Науки о живом_Молекулярная биология,Биология,One,бассейн очистный сооружение австрийский город ...
6,chrdk.ru/articles/sci_AlphaGo_vs_LiSedol.txt,"Игры, в которые играли люди",Информационные технологии_Футурология_Технолог...,Computer Science,One,вторник март сеул пройти последний встреча ком...
7,chrdk.ru/articles/sci_Alzheimers_news.txt,"Немец, от которого все без ума",Молекулярная биология_Медицина_Выбор редакции,Биология,Multi,болезнь альцгеймер болезнь паркинсон один самы...
8,chrdk.ru/articles/sci_animals_vs_climate.txt,Погоды не сделают,Науки о живом_Животные_Климат_Изменение климата,Биология,Multi,мир взаимосвязанный действие тянуть тысяча нит...
9,chrdk.ru/articles/sci_animal_dialects.txt,"На своем, на птичьем",Животные_Этология,Биология,One,животное вид разный часть планета говорят разн...


In [69]:
from sklearn.feature_extraction.text import CountVectorizer

In [70]:
vect = CountVectorizer(binary = True)
vectorized = vect.fit_transform(articles_small.text)

In [71]:
vectorized.toarray()[0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [72]:
vectorized

<100x12385 sparse matrix of type '<class 'numpy.int64'>'
	with 42410 stored elements in Compressed Sparse Row format>

In [73]:
vectorized.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [74]:
vect.vocabulary_.get('возможно')

1769

In [75]:
feature_names = vect.get_feature_names()  # removed in scikit-learn 1.2; newer versions provide get_feature_names_out()

In [76]:
print(feature_names)

['abc', 'academy', 'acanthaomoeba', 'acriia', 'acta', 'activator', 'ad', 'adjacent', 'adustus', 'advanced', 'advances', 'agilis', 'airway', 'akatsuki', 'al', 'alamos', 'aleph', 'allium', 'alpha', 'alphago', 'alpphago', 'amblyrhynchus', 'american', 'and', 'anecdotal', 'anopheles', 'antares', 'aplhago', 'aplysia', 'apo', 'apobec', 'appl', 'apple', 'applied', 'appstore', 'araks', 'archaeornithura', 'architectures', 'arcii', 'arciia', 'arduino', 'arik', 'artificial', 'arts', 'associated', 'ast', 'astro', 'astrobiology', 'astronomy', 'astrophysical', 'astrophysics', 'at', 'atlas', 'august', 'aurora', 'auroral', 'australis', 'axis', 'backyard', 'bacon', 'based', 'bbc', 'be', 'beneficium', 'berger', 'bets', 'betularia', 'big', 'bioinspir', 'bioinspired', 'biologically', 'biology', 'biomim', 'biorxiv', 'biserialis', 'biston', 'blow', 'blue', 'bluedeep', 'bluetooth', 'bmj', 'borealis', 'boson', 'bradfordcoccus', 'brain', 'breakthrough', 'breathing', 'bubbles', 'californica', 'caltech', 'campbel




In [47]:
# via the distance between vectors
# many zeros
#from sklearn.metrics import jaccard_similarity_score

In [55]:
#jaccard_similarity_score(vectorized.toarray()[5], vectorized.toarray()[18])

0.92603767430481942

In [77]:
from scipy.spatial.distance import jaccard, pdist, squareform

In [180]:
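# scipy's jaccard returns a distance (1 - similarity) and expects dense 1-D arrays,
# so the result for sparse rows like these should be treated with caution
jaccard(vectorized[58], vectorized[18])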

1.0
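
For reference, a sketch of how the SciPy functions imported in this cell behave on dense boolean vectors. The toy matrix `X` is an assumption standing in for something like `vectorized.toarray().astype(bool)`.

```python
import numpy as np
from scipy.spatial.distance import jaccard, pdist, squareform

# Toy binary document-term matrix (rows = documents, columns = terms).
X = np.array([
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 0, 1, 0, 0],
], dtype=bool)

# jaccard() returns a dissimilarity, so similarity is 1 - distance.
sim_01 = 1 - jaccard(X[0], X[1])

# All pairwise distances at once; squareform puts zero distances on the
# diagonal, which become similarities of 1 after the subtraction.
sim = 1 - squareform(pdist(X, metric='jaccard'))

print(sim_01)        # documents 0 and 1 share 2 of 4 terms
print(sim.round(2))
```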

In [110]:
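# via the intersection/union of the word sets of two documents
# NOTE: doc1 and doc2 are strings here, so set() yields sets of characters, not words;
# use set(doc1.split()) to compare word sets (the outputs below come from the character version)
def jaccard_similarity(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)
    return len(intersection)/len(union)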
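
Calling `set()` on a string produces a set of characters, so a function that receives whole texts ends up comparing alphabets rather than vocabularies. A hedged sketch of a word-level variant follows. The name `jaccard_similarity_words`, the two sample strings, and the convention of returning 0 for two empty documents are assumptions, not part of the notebook.

```python
def jaccard_similarity_words(doc1, doc2):
    # Split on whitespace so the sets contain words, not characters.
    words_doc1 = set(doc1.split())
    words_doc2 = set(doc2.split())
    union = words_doc1 | words_doc2
    if not union:
        return 0.0  # convention chosen here for two empty documents
    return len(words_doc1 & words_doc2) / len(union)

a = "славянский язычество учёный гипотеза"
b = "полярный сияние учёный атмосфера"

char_level = len(set(a) & set(b)) / len(set(a) | set(b))
word_level = jaccard_similarity_words(a, b)
print(char_level)  # high: the two strings use mostly the same letters
print(word_level)  # 1 shared word out of 7 distinct words
```

On lemmatized Russian text the character-level score stays high for almost any pair of long documents, because they all use nearly the whole alphabet.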

In [111]:
jaccard_similarity(articles_small.text[35], articles_small.text[36])

0.6666666666666666

In [112]:
jaccard_similarity(articles_small.text[58], articles_small.text[18])

0.6491228070175439

In [19]:
print(articles_small.text[58][:1000])

славянский язычество смотреть грозный взгляд волхв картина васнецов напоминать прыжок костра иван купала философ националист неоязыческий движение любить гордиться образ гордый могучий славянин глава велесом отбросить художественный образ откровенный выдумка сухой остаток наш знание славянский религия остаться учёный прекращать попытка извлечь скудный источник кроха информация верить славянин какой представлять мир славянин вопрос оставаться дискуссионный существовать гипотеза происхождение славянин целое учёный согласный славянин группа индоевропейский народ возникнуть территория центральный восточный европа ограниченный запад эльба одер север балтийский море восток волга юг адриатика предок славянин предположительно скотоводческо земледельческий племя культура шнуровой керамика iii ii тысяча наш эра расселиться северный причерноморье прикарпатие европа историк предполагать позднеантичный раннесредневековый автор упоминать праславянский объединение ант склавин венед ii v век наш э сла

In [20]:
print(articles_small.text[18][:1000])

полярный сияние aurora borealis aurora australis пазоря сполох название красочный явление племя индеец северный америка принимать свечение атмосфера свет фонарь нести дух разыскивать умерший охотник эскимос представлять небожитель играть подобие футбол череп морж чердак разбираться знать полярный сияние полярный сияние возникать благодаря солнечный ветер протон электрон родиться солнце преодолевать путь километр попадать верхний слоить атмосфера земля двигаться вдоль силовой линия магнитный поль планета переноситься сторона полюс частица опускаться высота километр уровень море начинать взаимодействовать атом азот кислород возбуждать вернуться первоначальный состояние атом излучать энергия вид квантовый свет ультрафиолетовый инфракрасный видимый свет спектр излучение вещество атмосфера определять цвета полярный сияние например зелёный цвета отвечать кислород фиолетовый азот некоторый наблюдатель отмечать интенсивный полярный сияние сопровождаться свистящий звук лёгкое треск учёный финск

In [198]:
jaccard_similarity(articles_small.text[11], articles_small.text[14])

0.6538461538461539

In [22]:
import numpy as np

In [113]:
def jaccard_matrix(text_list):
    n = len(text_list)
    similarity_matr = np.zeros((n, n))
    for i in range(n):
        # fill the lower triangle and mirror it, since the measure is symmetric
        for j in range(i+1):
            similarity_matr[i, j] = jaccard_similarity(text_list[i], text_list[j])
            similarity_matr[j, i] = similarity_matr[i, j]
    return similarity_matr
    

In [114]:
%%time
m = jaccard_matrix(articles_small.text)


Wall time: 12.3 s


In [115]:
jaccard_m = pd.DataFrame(m.round(2))

In [116]:
jaccard_m

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,1.00,0.95,0.70,0.89,0.69,0.91,0.86,0.88,0.74,0.90,...,0.95,0.94,0.64,0.86,0.91,0.83,0.89,0.89,0.93,0.85
1,0.95,1.00,0.65,0.84,0.68,0.89,0.88,0.83,0.73,0.91,...,0.96,0.89,0.60,0.84,0.89,0.82,0.91,0.91,0.91,0.80
2,0.70,0.65,1.00,0.71,0.60,0.68,0.67,0.65,0.77,0.65,...,0.66,0.74,0.92,0.69,0.65,0.66,0.66,0.64,0.66,0.75
3,0.89,0.84,0.71,1.00,0.71,0.87,0.82,0.90,0.83,0.82,...,0.84,0.90,0.72,0.81,0.84,0.82,0.81,0.85,0.88,0.81
4,0.69,0.68,0.60,0.71,1.00,0.70,0.72,0.77,0.65,0.70,...,0.66,0.70,0.61,0.75,0.74,0.76,0.72,0.70,0.72,0.67
5,0.91,0.89,0.68,0.87,0.70,1.00,0.88,0.89,0.72,0.91,...,0.89,0.85,0.65,0.87,0.89,0.81,0.87,0.95,0.91,0.83
6,0.86,0.88,0.67,0.82,0.72,0.88,1.00,0.88,0.77,0.93,...,0.88,0.81,0.64,0.89,0.88,0.80,0.86,0.89,0.86,0.79
7,0.88,0.83,0.65,0.90,0.77,0.89,0.88,1.00,0.75,0.88,...,0.83,0.85,0.65,0.84,0.89,0.85,0.84,0.88,0.87,0.80
8,0.74,0.73,0.77,0.83,0.65,0.72,0.77,0.75,1.00,0.72,...,0.73,0.78,0.83,0.74,0.76,0.74,0.70,0.75,0.77,0.76
9,0.90,0.91,0.65,0.82,0.70,0.91,0.93,0.88,0.72,1.00,...,0.95,0.84,0.60,0.89,0.91,0.81,0.89,0.96,0.86,0.82


In [66]:
m[58,18]

0.64912280701754388

In [65]:
m.min()

0.067796610169491525

In [68]:
m.mean()

0.76909648433903965

#### On all texts

In [82]:
articles = pd.DataFrame(data = list(zip(metadata.path, metadata.title, metadata.tags, metadata.final_rubrics, metadata.number_of_rubrics)), 
                        columns = ['path', 'title', 'tags', 'rubrics', 'rubrics_number'])

In [84]:
texts = []
for path in tqdm(metadata.path):
    path = path.replace('\n','').replace('//','/').replace('?code=','-code=')
    try:
        with open('C:/Users/Анна/YandexDisk/popular_science_texts_store/' + path, encoding = 'utf-8') as f:
            texts.append(f.read())
    except OSError:
        texts.append('')

100%|██████████████████████████████████| 30793/30793 [00:18<00:00, 1700.19it/s]


In [85]:
articles['text'] = texts

In [86]:
articles[1000:1005]

Unnamed: 0,path,title,tags,rubrics,rubrics_number,text
1000,chrdk.ru/news/news_magnetic-stem-cells.txt,Теперь стволовыми клетками можно управлять при...,,Мусор,One,Французские ученые ввели наночастицы железа в ...
1001,chrdk.ru/news/news_magnetite_vs_trombus.txt,Химики создали магнитоуправляемый препарат для...,Материаловедение_Российская наука_Медицина_Нау...,Физиология человека,Multi,Ученые университета ИТМО вместе с санкт-петерб...
1002,chrdk.ru/news/news_maiya_rasschitali_period_vr...,Майя рассчитали период вращения Венеры вокруг ...,Астрономия_Археология_История_Антропология,История,One,Исследуя древнейшую рукописную книгу майя — Др...
1003,chrdk.ru/news/news_makaka-popytalas-sparitsya-...,Японский макак попытался спариться с самкой оленя,Науки о живом_Биология_Животные_Видео,Биология,One,Ученые из Японии и Франции впервые зафиксирова...
1004,chrdk.ru/news/news_makaki-raspoznayut-raznymi-...,Макаки распознают знакомые и незнакомые лица р...,Нейронауки_Биология_Животные,Биология,One,Ученые из Рокфеллеровского университета выясни...


In [None]:
#articles.text = preprocess(articles.text, stopwords)

In [122]:
articles.text = simple_clean(articles.text)

100%|██████████████████████████████████| 29282/29282 [00:18<00:00, 1581.04it/s]


In [124]:
articles = articles[articles.text != '']

In [125]:
articles[articles.text == '']

Unnamed: 0,path,title,tags,rubrics,rubrics_number,text


In [126]:
articles_sample = articles.sample(n = 5000)

In [127]:
articles_sample.head()

Unnamed: 0,path,title,tags,rubrics,rubrics_number,text
796,chrdk.ru/news/news_green_winter.txt,В Москве потеплеет только к концу мая,Метеорология,Науки о земле,One,судя по прогнозу который дали чердаку в росгид...
13986,nplus1.ru/nplus1_news/nplus1.ru-news-2016-06-0...,В Москве началась крупнейшая конференция по ко...,_Наука_,Мусор,One,витраж в чичестерском соборе jph flickr соглас...
27785,polit.ru_proscience/proscience_news/news-2015-...,Астрономы наблюдают за рождением планет,_астрономия,Космос,One,с помощью радиотелескопа alma a acama large mi...
9750,nplus1.ru/nplus1_news/nplus1.ru-news-2015-07-1...,В Chrome появится поддержка проверки правописа...,_Технологии_IT_,Технологии,One,космический аппарат розетта изображение wikime...
6467,geektimes.ru/post_291255.txt,Спросите Итана: какая невозможная физика стала...,Физика_Научно-популярное_Астрономия,Космос,Multi,когда дебют сериала звёздный путь лет назад вп...


In [152]:
%%time
jac_matr = jaccard_matrix(articles_sample.text.values)

KeyboardInterrupt: 
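
The double Python loop needs about n²/2 set operations, which is probably why the run on 5000 texts was interrupted. One possible alternative is to compute all word-level Jaccard similarities with a single sparse matrix product. This is a sketch under assumptions: the function name `jaccard_matrix_sparse` is hypothetical, and `CountVectorizer` with its default tokenizer drops one-character tokens, so its word sets can differ slightly from a plain `split()`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def jaccard_matrix_sparse(texts):
    # Binary document-term matrix: 1 if the word occurs in the document.
    X = CountVectorizer(binary=True).fit_transform(texts).astype(np.int64)
    # (X @ X.T)[i, j] counts the words shared by documents i and j.
    intersection = (X @ X.T).toarray()
    # Row sums give the number of distinct words per document.
    sizes = np.asarray(X.sum(axis=1)).ravel()
    union = sizes[:, None] + sizes[None, :] - intersection
    # Avoid 0/0 for pairs of empty documents; their similarity is set to 0.
    return np.divide(intersection, union,
                     out=np.zeros(union.shape, dtype=float),
                     where=union > 0)

docs = ["полярный сияние солнечный ветер",
        "солнечный ветер магнитный поле",
        "славянский язычество"]
print(jaccard_matrix_sparse(docs).round(2))
```

The result is a dense n×n array, so memory still grows quadratically with the number of documents.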

### TFIDF cosine similarity

In [153]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [154]:
tfidf_vect = TfidfVectorizer(min_df = 2, stop_words = stopwords)
tfidf_vectorized = tfidf_vect.fit_transform(articles_small.text)

In [181]:
tfidf_vectorized.shape

(100, 5238)

In [156]:
tfidf_matr = tfidf_vectorized.toarray()

In [157]:
from scipy.spatial.distance import cosine, pdist, squareform

In [171]:
def cos_similarity(x, y):
    return (1 - cosine(x,y))

In [159]:
tfidf_df = pd.DataFrame(tfidf_matr)

In [199]:
# NOTE: tfidf_df[11] selects column 11 (a term), not document 11; use tfidf_df.iloc[11] for rows
cos_similarity(tfidf_df[11], tfidf_df[14]).round(2)

0.070000000000000007
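
With an integer-labelled DataFrame, `df[i]` returns the column labelled `i`, while `df.iloc[i]` returns row `i`. The sketch below shows the difference on a toy frame; `toy` is an assumed stand-in for `tfidf_df`, with documents as rows and terms as columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for tfidf_df: 3 documents (rows) x 4 terms (columns).
toy = pd.DataFrame(np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [1.0, 0.0, 2.0, 0.0],
]))

print(toy[2].shape)       # column 2: one value per document → (3,)
print(toy.iloc[2].shape)  # row 2: one value per term → (4,)
```

To compare two documents, their row vectors are needed, for example `toy.iloc[0]` and `toy.iloc[2]`.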

In [177]:
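#dists = pdist(tfidf_df, cos_similarity)
# pdist is given a similarity function here, so the off-diagonal cells hold similarities;
# squareform fills the diagonal with zeros
cosine_matr = pd.DataFrame(squareform(pdist(tfidf_df, cos_similarity)))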

  dist = 1.0 - np.dot(u, v) / (norm(u) * norm(v))


In [200]:
cosine_matr.round(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.00,0.03,0.02,0.03,0.02,0.07,0.03,0.03,0.17,0.07,...,0.03,0.04,0.02,0.03,0.04,0.05,0.05,0.03,0.04,0.02
1,0.03,0.00,0.03,0.04,0.01,0.04,0.02,0.02,0.02,0.05,...,0.03,0.03,0.03,0.02,0.05,0.04,0.07,0.03,0.03,0.03
2,0.02,0.03,0.00,0.03,0.01,0.06,0.03,0.08,0.04,0.05,...,0.02,0.21,0.02,0.04,0.07,0.03,0.06,0.03,0.06,0.02
3,0.03,0.04,0.03,0.00,0.02,0.05,0.04,0.04,0.04,0.08,...,0.10,0.03,0.08,0.03,0.05,0.09,0.04,0.44,0.04,0.06
4,0.02,0.01,0.01,0.02,0.00,0.04,0.04,0.02,0.02,0.10,...,0.06,0.01,0.06,0.02,0.04,0.05,0.02,0.01,0.03,0.04
5,0.07,0.04,0.06,0.05,0.04,0.00,0.06,0.09,0.08,0.07,...,0.05,0.15,0.04,0.04,0.11,0.06,0.12,0.04,0.07,0.03
6,0.03,0.02,0.03,0.04,0.04,0.06,0.00,0.06,0.06,0.09,...,0.06,0.04,0.05,0.03,0.09,0.02,0.04,0.03,0.04,0.04
7,0.03,0.02,0.08,0.04,0.02,0.09,0.06,0.00,0.03,0.05,...,0.03,0.11,0.04,0.06,0.05,0.05,0.02,0.03,0.11,0.04
8,0.17,0.02,0.04,0.04,0.02,0.08,0.06,0.03,0.00,0.23,...,0.03,0.06,0.03,0.05,0.04,0.04,0.07,0.05,0.08,0.06
9,0.07,0.05,0.05,0.08,0.10,0.07,0.09,0.05,0.23,0.00,...,0.07,0.10,0.08,0.07,0.09,0.14,0.07,0.06,0.07,0.03


In [184]:
def cos_matrix(tfidf_vects):
    # NOTE: tfidf_vects[i] selects row i of a NumPy array but column i of a DataFrame
    n, m = tfidf_vects.shape
    cos_sim_matr = np.zeros((n, n))
    for i in range(n):
        # fill the lower triangle and mirror it, computing each pair once
        for j in range(i+1):
            cos_sim_matr[i, j] = cos_similarity(tfidf_vects[i], tfidf_vects[j])
            cos_sim_matr[j, i] = cos_sim_matr[i, j]
    return cos_sim_matr

In [186]:
cos_m = cos_matrix(tfidf_df)

In [191]:
cos_m.mean()

0.050421367487594466

In [201]:
pd.DataFrame(cos_m.round(2))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,1.00,0.10,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.18,0.00,0.00,0.00,0.00
1,0.10,1.00,0.27,0.00,0.17,0.00,0.00,0.00,0.00,0.00,...,0.00,0.17,0.04,0.00,0.00,0.05,0.00,0.00,0.00,0.39
2,0.00,0.27,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
3,0.00,0.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
4,0.00,0.17,0.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,0.00,0.00
5,0.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0.00,0.12,...,0.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,0.00,0.00
6,0.00,0.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0.00,...,0.00,0.00,0.59,0.00,0.00,0.00,0.00,0.00,0.00,0.00
7,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
8,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
9,0.00,0.00,0.00,0.00,0.00,0.12,0.00,0.00,0.00,1.00,...,0.00,0.00,0.14,0.00,0.00,0.00,0.00,0.00,0.00,0.14


In [None]:
#cos_similarity(tfidf)  # unfinished cell: tfidf is not defined above and cos_similarity takes two vectors

In [202]:
articles_small.text[14][:1000]

'август считаться днём аспирин основный долгожитель наш аптечка век медик изучать свойство обыватель находить новое новое применение честь день рождение чудо лекарство рассказывать пять случай выписывать врач народный способ использовать назначение герой сегодняшний день небольшой белый таблетка стать один рекордсмен производство продажа последний столетие внести книга рекорд гиннес самый продавать обезболивать скорее ваш аптечка содержимое аспирин ацетилсалициловый кислота наверняка найтись состав какой комплексный препарат любитель домашний эксперимент труд обнаружить капнуть таблетка раствор медный купорос позеленеть дом заваляться хлорид железо iii реакция он цвета получиться фиолетовый работать аспирин следовать название молекула ацетилсалициловый кислота состоять часть салициловый кислота ацетильный группа действие аспирин организм обеспечиваться способность молекула навешивать ацетильный группа разный клеточный белка результат белка мишень изменять форма следствие активность так

In [203]:
articles_small.text[6][:1000]

'вторник март сеул пройти последний встреча компьютерный программа alphago южнокорейский профессионал го седоль итог пять матч машина работать технология нейронный сеть выиграть счёт означать искусственный разум окончательно превзойти человеческий представить август последний день мировой война ивамото каорать хасимото утаро японский мастер го играть важный партия местечко ицукаити один пригород хиросима ослепительный вспышка город вырастать зловещий гриб разрушительный ударный волна выбивать стекло дом раскидывать камень игровой доска невозмутимый ивамото хасимото память восстанавливать позиция доигрывать партия атомный бомба узнать вечером игра го терпеть суета позволять отвлекаться го возникнуть древний китай примерно тысяча наш эра быстро стать популярный страна нынешний юго восточный азия правило стратегический игра простой игрок очередь ставить игровой пол чёрный белые камень стараться отгородить большой территория тактик го столь многообразный сложный начинающий внятно объяснить

In [192]:
tfidf_vect = TfidfVectorizer(min_df = 2, stop_words = stopwords)
tfidf_vectorized = tfidf_vect.fit_transform(articles_sample.text)

In [193]:
tfidf_matr = tfidf_vectorized.toarray()
tfidf_df = pd.DataFrame(tfidf_matr)

In [197]:
%%time
cosine_matr = cos_matrix(tfidf_df)
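
For a sample of several thousand texts, the dense conversion and the double loop can be avoided. `TfidfVectorizer` L2-normalizes rows by default, so the dot product of two rows equals their cosine similarity, and one sparse product gives the whole matrix. The toy documents and the variable names `masked` and `nearest` below are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["полярный сияние солнечный ветер",
        "солнечный ветер полярный круг",
        "славянский язычество история",
        "славянский язычество религия"]

X = TfidfVectorizer().fit_transform(docs)  # rows are L2-normalized by default

# For unit-length rows the dot product equals the cosine similarity,
# so one sparse matrix product replaces the double Python loop.
sim = (X @ X.T).toarray()

# Index of the most similar *other* document for each row:
# mask the diagonal so a document is not matched with itself.
masked = sim.copy()
np.fill_diagonal(masked, -1.0)
nearest = masked.argmax(axis=1)
print(nearest)
```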