# Text Representation. Part 1.

<br>

## Tasks:
- Text preprocessing with **stemming** and **lemmatization**
- Text transformation with **One-Hot Encoding**
- Text transformation with **BoW (Bag of Words)**
- Text transformation with **BoN (Bag of N-Grams)**
- Text transformation with **TF-IDF (Term Frequency–Inverse Document Frequency)**
- Text transformation with **LSA (Latent Semantic Analysis)**

<hr>

#### Importing the required libraries

In [1]:
# Import the libraries required for the rest of the notebook
import json
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import warnings

from collections.abc import Mapping, Sequence
from itertools import chain
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from pathlib import Path
from regex import Pattern, compile as re_compile
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import OneHotEncoder
from spacy import load as spacy_load_model
from spacy.cli import download as spacy_download_model
from typing import Any

# Set additional options and settings
pd.set_option('display.max_columns', 250)
# nltk.download('punkt_tab')
warnings.filterwarnings('ignore')

<hr>
<br>

## Stemming and Lemmatization


- **Stemming** is the process of reducing a word to its stem by stripping endings and suffixes, so that different forms of the word map to the same string. The stem does not have to be a dictionary word. For example:

    `шапки, шапку → шапк`

    Stemming reduces the variety of word forms in a text, which can shrink the vocabulary and speed up downstream analysis.


- **Lemmatization** is the process of mapping all the different forms of a word to its dictionary form, the lemma. Although this definition sounds close to that of stemming, the two are different. For example:
  
    `позитивные → позитивный`

    Unlike stemming, lemmatization is a more complex process that relies on morphological analysis of the word. It is more accurate because the result is always a dictionary form rather than a truncated string.
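The difference between the two approaches can be sketched with a toy example. This is not how NLTK or spaCy work internally: the suffix list `SUFFIXES` and the mini-dictionary `LEMMA_DICT` are hypothetical and exist only for illustration.

```python
# Toy illustration only: real stemmers (Porter, Snowball) apply many ordered rules,
# and real lemmatizers rely on morphological analysis plus large dictionaries.

# Hypothetical suffix list for the toy stemmer
SUFFIXES = ("ing", "ly", "ed", "es", "s")

# Hypothetical mini-dictionary for the toy lemmatizer
LEMMA_DICT = {"is": "be", "said": "say", "woods": "wood", "walking": "walk"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; fall back to the lowercased word."""
    word = word.lower()
    return LEMMA_DICT.get(word, word)


words = ["Walking", "woods", "is", "said", "happily"]
stems = [toy_stem(w) for w in words]        # ['walk', 'wood', 'is', 'said', 'happi']
lemmas = [toy_lemmatize(w) for w in words]  # ['walk', 'wood', 'be', 'say', 'happily']
```

The stemmer produces the non-word `happi` and cannot relate `said` to `say`, while the dictionary lookup handles irregular forms but only for words it knows.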

<hr>

#### Function for loading spaCy models

To process text with the `spaCy` library, a trained model has to be loaded. The list of supported languages and available models can be found in the [documentation](https://spacy.io/usage/models#languages).

In [2]:
def spacy_model(model_name: str):
    """Load the given spaCy model, downloading it first if it is not installed."""

    try:
        return spacy_load_model(model_name)
    except OSError:
        spacy_download_model(model_name)
        return spacy_load_model(model_name)

Let's load the medium-sized models for English and Russian:
- [en_core_web_md](https://spacy.io/models/en#en_core_web_md) 31 MB
- [ru_core_news_md](https://spacy.io/models/ru#ru_core_news_md) 39 MB

In [3]:
en_spacy_model = spacy_model('en_core_web_md')
ru_spacy_model = spacy_model('ru_core_news_md')

<hr>

#### Custom tokenizer for Russian texts for BoW

For a better vector representation of text with `Bag of Words`, we define our own tokenizer that returns word lemmas. Reducing a word to its base form (lemma) reduces the dimensionality of the vectors.

The tokenizer supports the following functionality:
- removal of digits (option `remove_numbers`, can be disabled): phone numbers, years, etc.;
- removal of punctuation (option `remove_punctuation`, can be disabled): punctuation marks are noise in a vector model;
- removal of other symbols (option `remove_symbols`, can be disabled): emoji, mathematical operators, etc.;
- removal of extra whitespace (option `whitespace_normilize`, spelled as in the code, can be disabled);
- removal of stop words (enabled by default, turned off with `keep_stopwords=True`); stop words are common words of a language that usually carry little meaning (for example, "и", "в", "на");
- minimum lemma length (option `min_length`): lemmas with fewer characters than the threshold are removed; the default value of 1 effectively disables the filter.
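The cleanup steps listed above can be sketched with the standard library alone. The tokenizer in this notebook uses the `regex` package with `\p{Number}`, `\p{Punctuation}` and `\p{Symbol}`; the sketch below is a stand-in that checks Unicode categories via `unicodedata`, and the helper name `clean_text` is hypothetical.

```python
import unicodedata


def clean_text(text: str) -> str:
    """Replace numbers, punctuation and symbols with spaces, then collapse whitespace."""
    cleaned = []
    for char in text:
        # First letter of the Unicode category: N = number, P = punctuation, S = symbol
        major_category = unicodedata.category(char)[0]
        cleaned.append(" " if major_category in {"N", "P", "S"} else char)
    # str.split() with no arguments splits on any run of whitespace
    return " ".join("".join(cleaned).split())


example = clean_text("Мейн-куны: вес до 12 кг! 😋  2+2=4")
# example == 'Мейн куны вес до кг'
```

Replacing matches with a space rather than an empty string keeps hyphenated words such as `Мейн-куны` from being glued together.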

In [4]:
class RussianTokenizer:
    """RussianTokenizer.

    Tokenizes Russian texts with spaCy. Takes a string
    as input and returns a list of lemmas.
    """

    def __init__(
        self,
        remove_numbers: bool = True,
        remove_punctuation: bool = True,
        remove_symbols: bool = True,
        whitespace_normilize: bool = True,
        keep_stopwords: bool = False,
        min_length: int = 1,
    ):
        if remove_numbers:
            self._remove_numbers = re_compile(r'\p{Number}')
        if remove_punctuation:
            self._remove_punctuation = re_compile(r'\p{Punctuation}')
        if remove_symbols:
            self._remove_symbols = re_compile(r'\p{Symbol}')
        if whitespace_normilize:
            self._whitespace_normilize = re_compile(r'\s+')

        self._keep_stopwords = keep_stopwords
        self._min_length = min_length
        self._activated_patterns = tuple(
            pattern for attr, pattern in self.__dict__.items() if isinstance(pattern, Pattern)
        )

    def __call__(
        self,
        text_line: str,
        *args: Any,
        **kwargs: Any,
    ) -> Sequence[str]:
        # Replace every match of each enabled pattern with a space
        for regex_pattern in self._activated_patterns:
            text_line = regex_pattern.sub(' ', text_line)

        tokenized_doc = ru_spacy_model(text_line)
        # Keep lemmas that are long enough; drop stop words unless asked to keep them
        return [
            token.lemma_
            for token in tokenized_doc
            if len(token.lemma_) >= self._min_length
            and (self._keep_stopwords or not token.is_stop)
        ]

ru_tokenizer = RussianTokenizer(
    min_length=3,
)

#### Functions for loading document metadata and joining the document number with its domain and subject

In [5]:
def load_meta_data(meta_data_dir: Path) -> Mapping[str, Any]:
    """Load the document metadata from `meta_data.json`."""

    meta_data_path = next(meta_data_dir.glob('meta_data.json'))
    with open(meta_data_path, encoding='utf-8') as file_in:
        meta_data = json.load(file_in)
    return meta_data

######################################################################################

def map_doc_to_subject(meta_data: Mapping[str, Any]) -> Sequence[str]:
    """Join each document number with its domain (topic) and subject."""

    mapped = []
    for doc, info in meta_data.items():
        doc_topic = info.get('topic', '')
        doc_subject = info.get('subject', '')
        doc_number = int(doc.split('_')[0])
        mapped.append(f'{doc_number}_{doc_topic}_{doc_subject}')
    return mapped

<hr>
<br>

### Stemming with `NLTK`

English tokens are stemmed with [PorterStemmer](https://www.nltk.org/api/nltk.stem.snowball.html#nltk.stem.snowball.PorterStemmer). It is one of the most widely used English stemmers and is generally considered gentler than more aggressive alternatives such as the Lancaster stemmer, so it tends to over-stem less.

In [6]:
remove_punctuation = re_compile(r'\p{Punctuation}')
stemmer = PorterStemmer()
en_text = '"Walking in the woods is pleasant!" - he said happily!'
en_text_no_punct = remove_punctuation.sub('', en_text)
en_tokens = word_tokenize(en_text_no_punct)
en_stemmed_tokens = [stemmer.stem(token) for token in en_tokens]

en_stemming = pd.DataFrame.from_dict(
    data={
        'token': en_tokens,
        'stemmed token': en_stemmed_tokens
    }
)
en_stemming

Unnamed: 0,token,stemmed token
0,Walking,walk
1,in,in
2,the,the
3,woods,wood
4,is,is
5,pleasant,pleasant
6,he,he
7,said,said
8,happily,happili


**Stemming in Russian.**

Tokens in other languages are stemmed with [Snowball Stemmer](https://www.nltk.org/api/nltk.stem.snowball.html#nltk.stem.snowball.SnowballStemmer). Unlike Porter Stemmer, Snowball Stemmer is multilingual: it supports a range of languages, including Russian, and its algorithms are written in Snowball, a small string-processing language designed for creating stemming algorithms.

In [7]:
ru_stemmer = SnowballStemmer('russian')
ru_text = 'Съешь еще этих мягких французских булок да выпей чаю.'
ru_text_no_punct = remove_punctuation.sub('', ru_text)
ru_tokens = word_tokenize(ru_text_no_punct)
ru_stemmed_tokens = [ru_stemmer.stem(token) for token in ru_tokens]

ru_stemming = pd.DataFrame.from_dict(
    data={
        'token': ru_tokens,
        'stemmed token': ru_stemmed_tokens
    }
)
ru_stemming

Unnamed: 0,token,stemmed token
0,Съешь,съеш
1,еще,ещ
2,этих,эт
3,мягких,мягк
4,французских,французск
5,булок,булок
6,да,да
7,выпей,вып
8,чаю,ча


<hr>

### Lemmatization with `spaCy`

<div class="alert alert-info">

The full list of token attributes available in a tokenized text can be found in the [documentation](https://spacy.io/api/token#attributes).

In [8]:
# For the English sentence
en_doc = en_spacy_model('"Walking in the woods is pleasant!" - he said happily!')
en_tokens = [token for token in en_doc if not token.is_punct]

en_lemmatization = pd.DataFrame.from_dict(
    data={
        'token': en_tokens,
        'lemma': [token.lemma_ for token in en_tokens]
    }
)
en_lemmatization

Unnamed: 0,token,lemma
0,Walking,walk
1,in,in
2,the,the
3,woods,wood
4,is,be
5,pleasant,pleasant
6,he,he
7,said,say
8,happily,happily


<hr>

In [9]:
# For the Russian sentence
ru_doc = ru_spacy_model('Съешь еще этих мягких французских булок да выпей чаю.')
ru_tokens = [token for token in ru_doc if not token.is_punct]

ru_lemmatization = pd.DataFrame.from_dict(
    data={
        'token': ru_tokens,
        'lemma': [token.lemma_ for token in ru_tokens]
    }
)
ru_lemmatization

Unnamed: 0,token,lemma
0,Съешь,съешь
1,еще,ещё
2,этих,этот
3,мягких,мягкий
4,французских,французский
5,булок,булка
6,да,да
7,выпей,выпей
8,чаю,чай


<div class="alert alert-info">

`spaCy` does not provide stemming functionality; the usual explanation is that stemming is imprecise. `spaCy` is aimed primarily at production use, where such imprecision becomes a source of errors. Lemmatization does the same job more accurately, using language-specific dictionaries and rules, and returns the dictionary form (lemma) of a word rather than a truncated stem.

<hr>
<br>

## One-Hot Encoding

Each word `w` in the corpus vocabulary is assigned a unique integer identifier `id` in the range from 1 to `|V|`, where `V` is the vocabulary of unique words built from the corpus. Each word is then represented as a `|V|`-dimensional binary vector that contains a single one at the position given by its `id` and zeros elsewhere.
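A minimal sketch of this per-word encoding in plain Python, assuming a hypothetical three-word vocabulary (the helper `one_hot` is not part of any library):

```python
# Hypothetical toy vocabulary; ids start at 1 to match the description above
vocabulary = {"dog": 1, "bites": 2, "man": 3}


def one_hot(word: str, vocab: dict[str, int]) -> list[int]:
    """Return a |V|-dimensional vector with a single 1 at the word's position."""
    vector = [0] * len(vocab)
    vector[vocab[word] - 1] = 1  # ids are 1-based, list positions are 0-based
    return vector


sentence = ["dog", "bites", "man"]
encoded = [one_hot(word, vocabulary) for word in sentence]
# encoded == [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

A sentence is then a sequence of such vectors, one per word, so its representation grows with both the sentence length and the vocabulary size.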

In [10]:
sentences = [
    'We need a new truck.',
    'We painted the house green.',
    'We turned on the radio.',
    'Did you play tennis yesterday?'
]

# Build the list of unique words (tokens)
text_to_docs = map(lambda line: en_spacy_model(line), sentences)
tokenized_sents = [[token.lemma_ for token in doc if not token.is_punct] for doc in text_to_docs]
unique_tokens = sorted(set(chain.from_iterable(tokenized_sents)))

# Assign an index to every token and build the vocabulary
vocabulary = {token: idx for idx, token in enumerate(unique_tokens, 1)}
print(f'Vocabulary:\n{vocabulary}\n')

# Build the numerical representation of the text for the One-Hot encoder
# (this works only because every sentence here has exactly five tokens)
numerical_data = [[vocabulary[word] for word in doc] for doc in tokenized_sents]

# Vectorize with the One-Hot encoder. Note: OneHotEncoder treats each of the five
# token positions as a separate categorical feature, so every row is the
# concatenation of five per-position one-hot vectors (2 + 4 + 4 + 4 + 4 = 18 columns)
one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded_text = one_hot_encoder.fit_transform(numerical_data)
print('One-Hot Encoded Representation:')
for encoded_line, tokenized_line in zip(encoded_text, tokenized_sents):
    print(f'{encoded_line} | {tokenized_line}')

Vocabulary:
{'a': 1, 'do': 2, 'green': 3, 'house': 4, 'need': 5, 'new': 6, 'on': 7, 'paint': 8, 'play': 9, 'radio': 10, 'tennis': 11, 'the': 12, 'truck': 13, 'turn': 14, 'we': 15, 'yesterday': 16, 'you': 17}

One-Hot Encoded Representation:
[0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0.] | ['we', 'need', 'a', 'new', 'truck']
[0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0.] | ['we', 'paint', 'the', 'house', 'green']
[0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0.] | ['we', 'turn', 'on', 'the', 'radio']
[1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1.] | ['do', 'you', 'play', 'tennis', 'yesterday']


<div class="alert alert-danger"> 

### Disadvantages:
- the size of a one-hot vector is directly proportional to the size of the vocabulary, and large corpora produce large vocabularies. The result is a sparse representation in which most vector entries are zeros, which makes it inefficient to store, compute with, and learn from (sparsity can also contribute to overfitting);
- the out-of-vocabulary (OOV) problem: words that are not in the vocabulary cannot be represented.

<hr>
<br>

## BoW (Bag of Words)

The main idea is to represent a text as a bag (collection) of words, ignoring order and context. The underlying assumption is that texts belonging to a given class are characterized by a distinctive set of words. If two texts contain almost the same words, they are assumed to belong to the same class. Thus, by analyzing the words present in a piece of text, one can determine which class (bag) it belongs to.
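Before turning to a library implementation, the idea can be sketched in plain Python. The two pre-tokenized documents are hypothetical, and `bag_of_words` is an illustrative helper, not a library function.

```python
from collections import Counter

# Hypothetical pre-tokenized documents
docs = [
    ["cat", "eats", "fish"],
    ["dog", "eats", "meat", "meat"],
]

# The vocabulary is the sorted set of all tokens in the corpus
vocabulary = sorted({token for doc in docs for token in doc})


def bag_of_words(tokens: list[str]) -> list[int]:
    """Count how many times each vocabulary term occurs in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


vectors = [bag_of_words(doc) for doc in docs]
# vocabulary == ['cat', 'dog', 'eats', 'fish', 'meat']
# vectors == [[1, 0, 1, 1, 0], [0, 1, 1, 0, 2]]
```

Every document becomes a vector of the same length, one count per vocabulary term, regardless of how long the document is.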

`BoW` is implemented with [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#countvectorizer) from the `Scikit-learn` library.

Main parameters of the `sklearn.feature_extraction.text.CountVectorizer` class:
- `input`:
    - `filename`: the items passed to `fit` are file names whose contents are read and analyzed
    - `file`: the items are file-like objects with a `read` method
    - `content`: the items are strings or bytes
    - **default**=`content`
- `strip_accents`: removes accents and performs other character normalization, either `ascii` or `unicode`; **default**=`None`
- `lowercase`: converts the text to lowercase before tokenization; **default**=`True`
- `tokenizer`: a callable used to tokenize the text; **default**=`None`
- `max_features`: the vocabulary is limited to the `max_features` most frequent tokens in the corpus. If the parameter is not set, the vocabulary is built from all tokens; **default**=`None`
- `ngram_range`: lower and upper bounds of `n` for the word or character n-grams to be extracted; used to implement Bag of N-Grams; **default**=(1, 1)
- `max_df`: when building the vocabulary, terms whose document frequency is strictly higher than the threshold are ignored. A float is interpreted as a proportion of documents, and the threshold is computed as [max_df * n_doc](https://github.com/scikit-learn/scikit-learn/blob/ce4a40ffae5005ffa30f87b198b176dc6eb0f160/sklearn/feature_extraction/text.py#L1383), where `n_doc` is the number of documents (rows); an integer is interpreted as an absolute document count; **default**=`1.0`
- `min_df`: when building the vocabulary, terms whose document frequency is strictly lower than the threshold are ignored. A float is interpreted as a proportion of documents, and the threshold is computed as [min_df * n_doc](https://github.com/scikit-learn/scikit-learn/blob/ce4a40ffae5005ffa30f87b198b176dc6eb0f160/sklearn/feature_extraction/text.py#L1384), where `n_doc` is the number of documents (rows); an integer is interpreted as an absolute document count; **default**=`1`
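The effect of `min_df` and `max_df` can be sketched in plain Python. This is a simplified illustration of the thresholding rule described above, not the scikit-learn source; the helper `filter_by_document_frequency` and the toy documents are hypothetical.

```python
from collections import Counter
from numbers import Integral


def filter_by_document_frequency(docs, min_df=1, max_df=1.0):
    """Return the sorted terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain the term at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Integers are absolute document counts, floats are proportions of the corpus
    min_count = min_df if isinstance(min_df, Integral) else min_df * n_docs
    max_count = max_df if isinstance(max_df, Integral) else max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if min_count <= df <= max_count)


docs = [
    ["кошка", "вес", "окрас"],
    ["кошка", "вес", "размер"],
    ["кошка", "характер"],
    ["кошка", "тело", "вес"],
]
# With 4 documents: min_df=0.5 -> at least 2 documents, max_df=0.85 -> at most 3.4
kept = filter_by_document_frequency(docs, min_df=0.5, max_df=0.85)
# kept == ['вес']
```

Here `кошка` is dropped because it occurs in all four documents (too frequent to discriminate), and the remaining terms are dropped because each occurs in only one document.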

In [11]:
# Initialize the vectorizer
bow_vectorizer = CountVectorizer(
    lowercase=False,
    tokenizer=ru_tokenizer,
    analyzer='word',
    binary=False,
)

In [12]:
corpus = [
    'Яркой визитной карточкой сиамских кошек является их характерный окрас 😋',
    'Мейн-куны одни из самых крупных кошек, их вес может достигать 12 кг!',
    'По характеру сококе подвижные, задорные, любопытные и умные кошки.',
    'Животные средние по своим размерам, вес тела достигает 3-5 кг.',
    'хорошо развитое мускулистое тело среднего размера, вес от 2-х до 4-х килограмм',
]

# Transform the strings into a bag of words
bow = bow_vectorizer.fit_transform(corpus)
bow_df = pd.DataFrame(
    data=bow.toarray(),
    columns=bow_vectorizer.get_feature_names_out()
)
bow_df

Unnamed: 0,вес,визитный,достигать,животное,задорный,карточка,килограмм,кошка,крупный,куна,любопытный,мейн,мускулистый,окрас,подвижный,развитой,размер,сиамский,сококе,средний,тело,умный,характер,характерный,являться,яркий
0,0,1,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,1,1
1,1,0,1,0,0,0,0,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,1,1,0,0,0
3,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0
4,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,1,0,0,1,1,0,0,0,0,0


<br>

Let's look at the word frequencies in our "bag of words" and display the top 10 words in descending order of corpus frequency.

In [13]:
bow_df.sum(axis=0).to_frame(name='frequency').sort_values('frequency', ascending=False).head(10).T

Unnamed: 0,вес,кошка,достигать,тело,средний,размер,карточка,визитный,задорный,животное
frequency,3,3,2,2,2,2,1,1,1,1


The frequencies show that the word «вес» occurs 3 times in the corpus, in rows 2, 4, and 5, and the word «кошка» occurs in rows 1, 2, and 3.

Let's build a "bag of words" for the text files in the `./data` directory:

In [14]:
# Build the path to the files and sort the file list by the leading number in the name
data_dir = Path('data').resolve()
doc_files = sorted(data_dir.glob('*_doc*'), key=lambda file: int(file.name.split('_')[0]))

# Initialize the vectorizer with `input='filename'` so that it reads the files itself
docs_bow_vectorizer = CountVectorizer(
    input='filename',
    strip_accents='unicode',
    lowercase=True,
    tokenizer=ru_tokenizer,
    analyzer='word',
    max_features=350,
    binary=False,
    max_df=0.85,
    min_df=0.3,
)
# Transform the documents into a bag of words
docs_bow = docs_bow_vectorizer.fit_transform(doc_files)
docs_bow_df = pd.DataFrame(
    data=docs_bow.toarray(),
    columns=docs_bow_vectorizer.get_feature_names_out()
)
docs_bow_df

Unnamed: 0,беларусь,больший,большинство,большои,большой,век,вероятно,взрослои,взрослый,взять,вместе,внимание,вода,водоём,возникнуть,возраст,время,всеи,встречаться,второи,второй,вывести,выполнять,высокий,высоко,высота,высотои,выявить,главный,глаз,говорить,год,группа,давать,дать,два,деиствительно,делать,дерево,диапазон,длинный,дом,домашнеи,домашний,достаточно,дыхательный,единый,животное,жизнь,жить,зависеть,задний,значение,зона,зрение,игра,известный,иметься,использовать,использоваться,исследование,источник,каждый,качество,книга,количество,конец,континент,корень,костеи,кость,которои,которыи,котёнок,кошачий,кошка,краинеи,красный,крупный,лежать,линия,любопытный,людеи,маленький,малый,мелкий,мера,место,минимум,мир,мнение,многих,многочисленный,момент,мощный,название,называть,называться,найти,наличие,научный,начать,начинать,нашеи,небольшой,невозможный,неи,несколько,несмотря,новои,новый,обитание,обитать,обладать,обнаружить,образ,образовать,общий,ограничить,однои,оказаться,окрас,основа,основание,особенность,...,пища,площадь,поведение,поверхность,позволить,позволять,поздний,показать,покров,пол,полностью,полный,половина,получить,популярный,порода,последний,посмотреть,похожий,появиться,правда,правило,практически,предок,предполагать,представитель,представлять,приблизительно,приводить,примерно,принимать,природа,проблема,происходить,происхождение,простой,пространство,протяжение,проходить,пять,работа,работать,равный,радость,раз,развитие,различный,размер,разный,раионе,ранний,распространение,распространить,редкий,результат,речь,решить,род,роль,ряд,самые,самыи,свежий,свободный,своеи,связать,сделать,северный,сезон,семеиства,система,сказать,слово,сложный,служить,случай,собака,собои,современный,согласно,соединить,создать,составлять,состояние,сравнение,среда,старый,стиль,сто,стоить,суметь,существовать,считать,считаться,такои,тело,территория,точка,три,тыс,тысяча,увеличить,удаться,уровень,условие,установить,факт,форма,характер,характерный,хороший,цвет,центральный,частично,часть,человек,челюсть
,череп,число,чувствовать,чёрный,шерсть,этои,являться,язык
0,0,2,2,1,0,1,1,1,1,0,2,0,0,1,1,0,7,1,3,0,1,1,0,2,1,0,0,0,0,0,1,14,7,1,1,4,1,0,0,1,1,0,3,18,1,1,0,11,5,1,0,1,0,1,2,0,1,0,3,1,2,3,0,2,1,0,0,1,1,2,2,1,0,3,1,95,0,0,3,1,1,0,0,1,0,4,0,1,1,3,3,2,1,2,1,13,1,0,1,0,2,0,1,1,1,0,0,1,2,1,0,1,1,0,1,6,2,1,0,0,0,0,1,1,1,...,2,0,5,1,0,0,1,2,2,1,1,0,0,1,2,3,0,0,0,1,0,1,1,5,2,1,0,1,1,2,1,1,0,2,1,0,0,1,0,2,0,0,0,0,2,2,2,1,2,1,1,0,1,1,1,0,0,1,0,1,1,1,0,1,0,1,0,1,0,1,2,0,7,0,0,2,5,2,1,2,2,0,4,0,1,1,0,0,0,0,1,0,4,1,0,5,2,2,2,2,4,0,1,0,2,3,3,0,0,1,0,0,1,0,1,16,3,5,2,0,0,1,0,11,10
1,0,1,0,0,1,1,1,1,0,1,0,1,1,0,0,1,1,0,1,0,0,0,0,0,1,2,0,1,0,2,0,5,0,1,0,0,0,0,1,0,1,0,1,0,2,0,0,7,3,0,1,0,0,0,0,2,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,2,1,19,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,1,2,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3,0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,0,0,1,0,0,1,0,1,0,1,0,0,1,0,1,0,2,0,3,0,2,1,1,0,0,0,0,0,0,0,1,1,2,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,2,2,1,0,1,0,0,0,2,0,0,0,1,0,3,3,0,0
2,0,0,1,0,0,0,0,0,1,0,2,0,0,0,0,2,0,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,3,0,4,0,1,0,6,2,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0,0,2,0,0,0,0,0,0,0,2,2,12,0,0,0,0,0,0,1,0,2,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,0,1,2,0,0,1,...,0,0,0,0,2,1,0,0,1,1,3,0,0,0,0,5,0,0,0,0,1,1,0,0,0,1,0,0,0,0,2,1,0,0,0,0,0,0,0,0,0,0,1,1,0,3,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,3,1,0,0,0,1,1,0,0,1,0,0,1,0,0,1,0,0,0,2,0,1,0,0,0,2,1,0,0,0,1,1,1,0,0,0,0,1,0,0,2,0,0,0,0,1,0,0,0,0,0,3,0,0,0,0,0,7,0,0,0
3,0,0,0,1,1,0,1,0,0,0,0,0,0,0,1,5,5,0,2,1,0,1,0,0,1,0,1,2,0,0,0,23,4,0,1,2,0,1,0,0,0,0,3,8,0,0,1,6,1,1,0,0,1,0,3,1,1,1,0,2,3,3,0,0,0,0,0,1,0,2,2,0,0,0,0,0,1,0,2,0,3,0,1,1,0,0,1,4,2,4,3,0,0,0,1,1,0,2,1,0,2,1,1,0,1,0,0,1,1,0,2,0,0,0,2,1,0,1,1,0,0,0,1,3,0,...,1,0,3,0,0,1,0,5,0,2,0,1,0,0,1,5,1,0,1,0,0,0,0,9,4,2,1,1,0,0,0,0,0,4,4,0,0,1,0,1,1,0,0,0,0,0,1,2,6,0,2,0,1,1,4,2,0,0,1,1,0,0,0,0,0,1,2,0,0,1,0,1,5,0,1,0,62,1,4,1,0,0,0,0,0,1,2,0,0,0,0,3,3,1,0,0,1,3,0,14,3,0,0,1,0,1,1,0,1,0,0,0,1,0,0,10,1,1,3,0,0,0,1,4,2
4,0,4,3,1,0,0,0,1,6,0,1,0,5,2,0,0,5,1,0,0,0,0,0,2,1,1,0,1,0,1,0,0,1,0,0,2,0,0,2,1,2,0,0,1,0,1,1,6,5,3,1,7,1,0,0,0,0,1,1,0,1,1,0,1,0,0,1,1,0,1,7,1,0,0,0,0,1,1,0,1,0,0,0,0,0,1,1,0,0,0,0,1,1,0,1,2,1,1,0,1,0,0,0,0,1,0,0,2,1,1,0,2,4,3,2,3,1,0,1,1,1,1,0,0,2,...,1,2,1,3,0,7,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,5,2,0,0,3,2,0,1,0,0,0,0,0,0,0,2,0,2,0,0,0,0,0,0,3,3,1,1,1,0,1,2,1,0,1,0,3,0,0,0,0,0,1,0,0,0,1,1,3,0,0,0,1,1,1,0,2,0,0,1,1,1,1,2,2,0,0,0,0,0,1,0,0,0,2,0,0,1,0,0,1,0,0,1,0,0,0,0,1,1,1,1,1,3,2,1,4,0,0,1,0,1,1,1
5,0,1,1,0,0,0,1,0,0,0,1,2,0,0,0,0,4,0,0,1,1,0,1,0,0,0,0,0,1,1,5,6,2,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,0,0,0,0,0,0,1,0,0,0,3,0,0,0,0,1,1,0,0,0,2,1,1,0,1,0,0,0,1,0,2,0,0,1,0,0,0,3,0,2,0,0,0,0,1,1,0,0,0,0,2,1,1,1,3,0,0,0,0,2,0,3,0,4,1,0,0,0,0,...,0,0,0,0,1,0,2,1,0,0,0,0,1,1,0,0,1,1,1,1,2,0,1,0,0,0,0,0,0,2,0,0,2,1,0,1,0,0,0,0,1,1,0,0,3,1,0,0,0,0,0,0,0,0,0,1,0,0,3,1,0,1,1,0,0,0,2,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,1,0,2,1,0,0,0,0,1,0,2,0,0,0,1,1,1,0,0,1,0,0,0,3,0,0,2,1,4,0,0,1,1,1,0,0,1,0
6,0,1,1,0,0,2,0,0,0,1,0,0,0,0,1,0,0,2,0,0,1,0,1,1,0,3,1,0,1,0,0,6,0,0,0,0,1,1,0,0,0,3,0,0,1,0,1,0,2,1,0,0,0,0,0,0,1,0,0,1,0,0,1,1,1,2,1,0,3,0,0,1,2,0,0,0,0,0,1,0,1,1,0,1,0,0,0,1,1,1,3,0,0,0,0,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,2,0,0,0,1,0,0,1,1,0,...,0,3,0,1,2,0,0,1,0,0,0,0,0,2,0,0,0,0,2,0,1,0,0,0,2,0,2,0,0,0,0,1,2,0,0,0,1,0,0,2,0,2,1,0,0,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,3,0,3,0,0,0,1,0,0,0,0,1,0,2,1,0,0,0,1,0,0,1,0,2,1,0,0,2,1,0,3,0,0,0,1,0,2,0,0,0,0,0,0,2,1,0,0,0,0,0,2,3,0,0,0,0,0,0,0,3,0
7,3,3,0,0,3,3,0,0,0,1,0,2,3,3,0,0,2,0,0,1,0,0,0,2,0,0,0,0,1,0,0,5,3,0,0,1,0,1,0,0,0,2,0,0,0,0,0,0,0,2,0,0,0,2,0,1,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,1,0,0,0,0,0,1,1,1,1,2,1,1,1,0,6,0,0,0,0,1,1,0,2,0,0,0,0,0,1,0,0,1,1,1,0,0,0,0,1,2,1,0,0,0,0,0,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0,0,1,1,0,2,0,1,3,0,0,0,0,1,0,0,0,0,1,1,1,0,2,0,0,2,1,1,0,2,1,0,1,1,1,0,0,1,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,2,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,1,0,0,1,2,1,1,5,0,0,0,2,0,1,6,0,0,0,1,0,0,1,2,0,0,0,1,0,1,0,0,0,2,0,0,0,0,0,1,0,1,2,0
8,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,14,0,0,0,1,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,2,0,0,0,0,0,0,2,0,0,0,0,0,0,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0,1,1,0,2,1,0,1,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,1,0,0,2,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,3,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,1,7,0,0,0,0,1,2,0,0,0,0,1,0,0,0,0,1,3,1,1,0,0,2,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0
9,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,2,0,0,0,0,2,1,1,2,0,0,0,1,0,0,2,1,0,0,0,7,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,2,0,2,4,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,3,2,0,0,0,0,0,0,1,0,1,1,0,0,0,1,0,0,0,0,0,2,0,0,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0,2,0,2,0,0,1,0,0,1,1,1,0,2,0,0,1,0,1,0,0,0,1,1,0,0,1,0,0,0,1,0,0,0,0,2,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,7,0,0,2,1,0,0,1,0,0,0,0,1,1,0,0,0,0
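The header above contains near-duplicate columns such as `второи` / `второй` and `большои` / `большой`. A plausible explanation, stated here as an assumption rather than a verified diagnosis, is that `strip_accents='unicode'` is applied before tokenization and removes the breve from `й` (and the diaeresis from `ё`), after which the lemmatizer restores the original spelling only for some words. The sketch below reproduces the accent stripping with the standard library; `strip_accents_nfkd` is a hypothetical helper written for illustration.

```python
import unicodedata


def strip_accents_nfkd(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(char for char in decomposed if not unicodedata.combining(char))


stripped = [strip_accents_nfkd(word) for word in ["второй", "котёнок", "naïve"]]
# stripped == ['второи', 'котенок', 'naive']
```

For Russian text this kind of normalization changes real letters rather than decorative accents, so it is worth checking the resulting vocabulary before relying on it.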


Let's display the top 15 words in descending order of frequency across the text files.

In [15]:
docs_bow_df.sum(axis=0).to_frame(name='frequency').sort_values('frequency', ascending=False).head(15).T

Unnamed: 0,кошка,год,собака,человек,животное,домашний,время,название,порода,являться,жизнь,тыс,группа,предок,место
frequency,126,73,68,40,36,31,25,25,22,22,18,17,17,15,14


Let's display the 15 least frequent words in the text files.

In [16]:
docs_bow_df.sum(axis=0).to_frame(name='frequency').sort_values('frequency', ascending=True).head(15).T

Unnamed: 0,второи,второй,вывести,выполнять,удаться,уровень,сложный,суметь,служить,дыхательный,самые,ряд,самыи,дать,сто
frequency,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3


#### Comparing the documents with cosine similarity
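The cell below relies on `cosine_similarity` from scikit-learn. As a rough sketch of what that score measures, the same quantity can be computed for two count vectors in plain Python; the helper `cosine_sim` and the toy vectors are hypothetical.

```python
import math


def cosine_sim(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


doc_a = [1, 0, 1, 1, 0]
doc_b = [2, 0, 2, 2, 0]  # same direction as doc_a, twice the counts
doc_c = [0, 3, 0, 0, 1]  # no terms in common with doc_a

similar = cosine_sim(doc_a, doc_b)    # close to 1.0
unrelated = cosine_sim(doc_a, doc_c)  # 0.0
```

Because the score depends only on the direction of the vectors, a long document and a short one with the same word proportions are treated as identical.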

In [17]:
# `docs_bow` is already a SciPy sparse matrix, so it can be used as is
sparse_matrix = docs_bow
# Compute the pairwise cosine similarity between the document vectors
cosine_similarities = cosine_similarity(sparse_matrix)

meta_data = load_meta_data(data_dir)
doc_subject_names = map_doc_to_subject(meta_data)

# Build a DataFrame with the similarity scores
docs_similarity_df = pd.DataFrame(
    cosine_similarities,
    columns=doc_subject_names,
    index=doc_subject_names,
)

In [18]:
matrix_similarity = docs_similarity_df.where(
    np.tril(np.ones(docs_similarity_df.shape), k=-1).astype(bool)
)
styled_similarity = (matrix_similarity
    .style
    .background_gradient(cmap='YlGnBu')
    .highlight_null('white')
    .format("{:.2%}", na_rep="")
)
styled_similarity

Unnamed: 0,1_animals_cat,2_animals_cat,3_animals_cat,4_animals_dog,5_animals_frog,6_movie_thunderbolts*,7_real_estate_skyscraper,8_nature_lake_naroch,9_auto_geely,10_plants_birch
1_animals_cat,,,,,,,,,,
2_animals_cat,77.59%,,,,,,,,,
3_animals_cat,67.01%,72.39%,,,,,,,,
4_animals_dog,20.04%,16.33%,14.44%,,,,,,,
5_animals_frog,17.38%,19.59%,23.80%,11.51%,,,,,,
6_movie_thunderbolts*,18.15%,14.32%,10.37%,22.59%,20.81%,,,,,
7_real_estate_skyscraper,17.35%,20.59%,16.85%,23.77%,20.51%,37.79%,,,,
8_nature_lake_naroch,14.02%,17.59%,13.74%,17.78%,24.34%,32.58%,38.36%,,,
9_auto_geely,13.25%,17.64%,7.23%,27.53%,10.68%,42.59%,39.53%,30.93%,,
10_plants_birch,7.15%,13.38%,8.51%,6.15%,17.29%,15.22%,22.07%,23.05%,4.06%,


<hr>
<br>

<div class="alert alert-danger">

What if we do not bother implementing our own tokenizer and also drop Unicode normalization?

In [19]:
default_docs_bow_vectorizer = CountVectorizer(
    input='filename',
    lowercase=True,
    analyzer='word',
    binary=False,
)
default_docs_bow = default_docs_bow_vectorizer.fit_transform(doc_files)
default_docs_bow_df = pd.DataFrame(
    data=default_docs_bow.toarray(),
    columns=default_docs_bow_vectorizer.get_feature_names_out()
)

# The vectorizer output is already a SciPy sparse matrix
default_sparse_matrix = default_docs_bow
default_cosine_similarities = cosine_similarity(default_sparse_matrix)

default_docs_similarity_df = pd.DataFrame(
    default_cosine_similarities,
    columns=doc_subject_names,
    index=doc_subject_names,
)

default_matrix_similarity = default_docs_similarity_df.where(
    np.tril(np.ones(default_docs_similarity_df.shape), k=-1).astype(bool)
)
default_styled_similarity = (default_matrix_similarity
    .style
    .background_gradient(cmap='YlGnBu')
    .highlight_null('white')
    .format("{:.2%}", na_rep="")
)
default_styled_similarity

Unnamed: 0,1_animals_cat,2_animals_cat,3_animals_cat,4_animals_dog,5_animals_frog,6_movie_thunderbolts*,7_real_estate_skyscraper,8_nature_lake_naroch,9_auto_geely,10_plants_birch
1_animals_cat,,,,,,,,,,
2_animals_cat,51.19%,,,,,,,,,
3_animals_cat,34.99%,35.67%,,,,,,,,
4_animals_dog,41.66%,26.05%,28.06%,,,,,,,
5_animals_frog,30.44%,25.78%,30.03%,26.94%,,,,,,
6_movie_thunderbolts*,35.92%,29.72%,41.28%,36.90%,36.68%,,,,,
7_real_estate_skyscraper,32.30%,29.18%,40.32%,37.10%,28.70%,50.62%,,,,
8_nature_lake_naroch,26.26%,26.71%,30.54%,27.37%,27.73%,40.09%,38.55%,,,
9_auto_geely,23.13%,20.93%,28.64%,26.05%,23.58%,39.03%,33.64%,29.63%,,
10_plants_birch,24.39%,23.92%,25.73%,23.24%,26.17%,37.61%,33.06%,31.21%,25.76%,


<div class="alert alert-info">

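The similarity scores show that with the default parameters, documents from different domains have become more similar to each other (for example, `3_animals_cat` and `6_movie_thunderbolts*` rise from 10.37% to 41.28%), while the cat documents have become less similar to each other (for example, `1_animals_cat` and `2_animals_cat` drop from 77.59% to 51.19%). It follows that additional preprocessing and parameter tuning are an important part of the pipeline.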

<div class="alert alert-warning">

<strong>Is lemmatization needed?</strong> Let's look at the following examples.

In [20]:
# Bag of Words with lemmatization
lemmas_docs_bow_vectorizer = CountVectorizer(
    input='filename',
    lowercase=True,
    tokenizer=ru_tokenizer,
    analyzer='word',
)

lemma_docs_bow = lemmas_docs_bow_vectorizer.fit_transform(doc_files)
lemma_vocabulary = sorted(lemmas_docs_bow_vectorizer.get_feature_names_out())
lemma_vocabulary[767:777]

['животное',
 'жидкость',
 'жизнедеятельность',
 'жизненный',
 'жизнь',
 'жильё',
 'житель',
 'жить',
 'журналист',
 'жёлтый']

In [21]:
# Bag of Words without lemmatization
raw_docs_bow_vectorizer = CountVectorizer(
    input='filename',
    lowercase=True,
    analyzer='word',
)

raw_docs_bow = raw_docs_bow_vectorizer.fit_transform(doc_files)
raw_vocabulary = sorted(raw_docs_bow_vectorizer.get_feature_names_out())
raw_vocabulary[1304:1314]

['животного',
 'животное',
 'животные',
 'животным',
 'животных',
 'живут',
 'живущих',
 'живших',
 'жидкостей',
 'жидкости']

<div class="alert alert-info">


The vocabulary slices show that without lemmatization the `BoW` vocabulary holds several forms of the word `животное` → ['животного', 'животное', 'животные', 'животным', 'животных'], whereas the lemmatized `BoW` vocabulary holds the single word `животное`. Lemmatization therefore shrinks the vocabulary and, for a fixed vocabulary size, leaves room for a more diverse set of words, which matters for very large text corpora.
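The effect on vocabulary size can be sketched with a hypothetical lemma lookup table. In the notebook the lemmas come from the spaCy model; the `LEMMAS` dictionary below is a stand-in for illustration only.

```python
# Hypothetical lemma lookup; in the notebook this job is done by the spaCy model
LEMMAS = {
    "животного": "животное",
    "животные": "животное",
    "животным": "животное",
    "животных": "животное",
}

tokens = ["животное", "животного", "животные", "животным", "животных", "жидкости"]

# Without lemmatization every surface form is a separate vocabulary entry
raw_vocabulary = sorted(set(tokens))
# With lemmatization all forms of one word collapse into a single entry
lemma_vocabulary = sorted({LEMMAS.get(token, token) for token in tokens})
# len(raw_vocabulary) == 6, lemma_vocabulary == ['животное', 'жидкости']
```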

<div class="alert alert-danger"> 

### Disadvantages:
- the vector size grows with the vocabulary, and vector sparsity remains a problem. One way to limit it is to restrict the vocabulary to the `n` most frequent words (the `max_features` parameter);
- the out-of-vocabulary (OOV) problem: words that are not in the vocabulary cannot be represented;
- as the name suggests, this is a "bag of words": information about word order is lost in this representation.
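The loss of word order is easy to demonstrate with two hypothetical sentences that mean opposite things but contain the same words:

```python
from collections import Counter

first = "dog bites man".split()
second = "man bites dog".split()

# The bags (word counts) are identical even though the sentences differ
same_bag = Counter(first) == Counter(second)  # True
same_order = first == second                  # False
```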

<hr>
<br>

## BoN (Bag of N-Grams)


`Bag of N-Grams` helps capture some context, which `One-Hot Encoding` and `Bag of Words` cannot do. The words of the corpus are combined into `n`-grams (sequences of `n` adjacent tokens). The range of `n` is set via the `ngram_range` parameter. For example, with **ngram_range=(1, 2)** the vocabulary contains unigrams and bigrams; with **ngram_range=(1, 3)** it contains unigrams, bigrams, and trigrams.

With **ngram_range=(1, 2)** our vocabulary looks like this:
```
животное
животное активный
животное встречаться
животное выжить
животное говорить
животное индоевропейский
```
With **ngram_range=(1, 3)** our vocabulary looks like this:
```
животное
животное активный
животное активный умный
животное встречаться
животное встречаться крайний
животное выжить
животное выжить три
животное говорить
животное говорить многочисленный
```

<div class="alert alert-warning">

As a result, documents that share `n`-grams get vectors that lie closer together than documents with entirely different `n`-grams. The vector representation thus preserves some information about context and word order.

In [22]:
# Bag of N-Grams over lemmatized tokens: unigrams and bigrams
docs_bon_vectorizer = CountVectorizer(
    input='filename',
    strip_accents='unicode',
    lowercase=True,
    tokenizer=ru_tokenizer,
    ngram_range=(1, 2),
)
docs_bon = docs_bon_vectorizer.fit_transform(doc_files)
docs_bon_df = pd.DataFrame(
    data=docs_bon.toarray(),
    columns=docs_bon_vectorizer.get_feature_names_out()
)
docs_bon_df

Unnamed: 0,abys,abys bunny,amauensis,amauensis лягушка,ambon,ambon бурдее,arthroleptis,arthroleptis hematogaster,atlas,atlas pro,atlas знаковый,atlas литровый,atlas получить,atlas споилер,belgee,belgee автомобиль,belgee оказываться,betula,betula nana,betula pendula,betula pubescens,betulaceae,betulaceae широко,bose,bose динамик,boyue,boyue год,broadacre,broadacre city,bunny,bunny cat,caerulea,caerulea проводящеи,cancrivora,cancrivora обитать,canis,canis familiaris,canis lupus,carelica,carelica красивый,cat,cat абиссинский,cat арма,cath,cath ирл,catt,catt исп,catto,catto chat,cattus,cattus кошка,catus,catus domesticus,catus год,catus домашний,catus иоганн,catus независимо,catus оговорить,catus основание,champreveyres,champreveyres отриве,chat,chat первоначальныи,city,city отражать,comfort,comfort покупаться,contact,contact suv,continental,continental ultra,coolray,coolray вариант,coolray первый,coolray случай,crockford,crockford университет,cамое,cамое высокий,domestica,domestica слово,domesticus,domesticus felis,domesticus изначально,domesticus предложить,dtc,dtc ступенеи,erralla,erralla сестон,ewingii,ewingii основать,fallingwater,fallingwater дом,familiaris,familiaris домашний,familiaris лат,familiaris линнеем,familiaris собака,fastigiata,fastigiata узкопирамидальнои,fejervarya,fejervarya cancrivora,fel,fel fel,fel аллергия,fel высоко,fel прикрепляться,fel способствовать,felis,felis catus,felis domestica,felis domesticus,felis margarita,felis silvestris,felis близкий,felis время,flagship,flagship plus,gato,gato итал,gatto,gatto рут,geely,geely boyue,geely раз,...,эмоциональный привязка,эндрю,эндрю гарфилд,энергия,энергия микроуровень,энергия требоваться,эпизод,эпизод одомашнивание,эпителия,эпителия контакт,эпителия поверхность,эпоха,эпоха автор,эпоха верхний,эра,эра установить,эркслебеном,эркслебеном начало,этаж,этаж демпфер,этаж предполагаться,этаж третий,этажеи,этажеи лифт,эталонныи,эталонныи образец,этои,этои деревня,этои комплектация,этои лягушка,этои 
порода,этои теория,этолог,этолог конрад,эфиопия,эфиопия время,эфиопия название,эфиопия первый,эффект,эффект небоскрёб,эффективный,эффективный способ,эффектный,эффектный одиночный,юго,юго восточнои,южныи,южныи остров,южный,южный часть,юнг,юнг youngii,юноша,юноша мастер,юный,юный посетителеи,юта,юта тыс,явление,явление половой,являться,являться внутренний,являться диминутивом,являться древнеишим,являться европа,являться единственный,являться заповеднои,являться кошка,являться название,являться общепризнанный,являться одиночный,являться одновременно,являться одомашненным,являться полуодомашненным,являться популярный,являться приёмный,являться пять,являться родственный,являться следствие,являться территория,являться типичный,являться хищник,являться чистый,ягода,ягода гриб,ядернои,ядернои днк,ядовитость,ядовитость мимикрирующих,ядовиты,ядовиты окрас,язык,язык mao,язык вытягиваться,язык европа,язык исследователь,язык обозначать,язык означать,язык похожий,язык слово,язык согнуть,язык устремляться,язык являться,яицевиднои,яицевиднои форма,яндекс,яндекс музыка,яндекс навигатор,янцзы,янцзы юго,яркий,яркий актёрский,яркий симптом,ярко,ярко пурпурными,яростный,яростный адепт,ярчаиших,ярчаиших красный,ячменек,ячменек карасик,ёрник,ёрник территория,կատու,կատու katu
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,1,1,1,1,1,1,1,7,1,1,1,1,1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,3,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,19,5,1,1,1,9,1,1,0,0,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,11,0,1,0,0,0,0,1,1,0,1,0,1,1,1,0,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,10,1,1,1,1,1,1,0,2,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1
1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,3,0,0,0,3,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,4,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,2,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,4,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,2,2,0,0,0,0,0,0,0,0,0,0,1,1,0,0,4,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,2,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,3,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,0,0,0,0,2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0
8,0,0,0,0,0,0,0,0,8,4,1,1,1,1,2,1,1,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,3,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,2,1,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,1,2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0


In [23]:
vocabulary = sorted(docs_bon_vectorizer.get_feature_names_out())
vocabulary[2507:2518]

['животное',
 'животное активный',
 'животное встречаться',
 'животное выжить',
 'животное говорить',
 'животное индоевропеиских',
 'животное использовать',
 'животное компаньон',
 'животное концентрация',
 'животное кошка',
 'животное любить']

In [24]:
# docs_bon is already a SciPy sparse matrix, so no dense round trip through .toarray() is needed
bon_sparse_matrix = sparse.csr_matrix(docs_bon)
bon_cosine_similarities = cosine_similarity(bon_sparse_matrix)

bon_similarity_df = pd.DataFrame(
    bon_cosine_similarities,
    columns=doc_subject_names,
    index=doc_subject_names,
)

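# Keep only the strictly lower triangle; the builtin bool is used because some NumPy versions do not provide np.bool
bon_matrix_similarity = bon_similarity_df.where(
    np.tril(np.ones(bon_similarity_df.shape), k=-1).astype(bool)
)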
bon_styled_similarity = (bon_matrix_similarity
    .style
    .background_gradient(cmap='cividis')
    .highlight_null('white')
    .format("{:.2%}", na_rep="")
)
bon_styled_similarity

Unnamed: 0,1_animals_cat,2_animals_cat,3_animals_cat,4_animals_dog,5_animals_frog,6_movie_thunderbolts*,7_real_estate_skyscraper,8_nature_lake_naroch,9_auto_geely,10_plants_birch
1_animals_cat,,,,,,,,,,
2_animals_cat,49.06%,,,,,,,,,
3_animals_cat,27.79%,26.33%,,,,,,,,
4_animals_dog,18.42%,10.01%,5.95%,,,,,,,
5_animals_frog,8.53%,6.23%,5.67%,4.38%,,,,,,
6_movie_thunderbolts*,6.08%,3.70%,2.20%,7.43%,3.10%,,,,,
7_real_estate_skyscraper,5.79%,5.93%,3.55%,7.43%,3.43%,6.35%,,,,
8_nature_lake_naroch,3.91%,4.13%,2.24%,4.96%,3.60%,3.85%,5.41%,,,
9_auto_geely,5.01%,5.15%,1.86%,8.89%,2.16%,6.84%,6.06%,4.11%,,
10_plants_birch,2.68%,4.23%,2.31%,2.22%,5.84%,1.92%,3.03%,3.33%,0.93%,


<div class="alert alert-danger"> 

### Drawbacks:
- as `n` grows, the dimensionality of the vectors, and with it their sparsity, increases very quickly;
- this approach still does not solve the OOV problem.

<hr>
<br>

## TF-IDF (Term Frequency–Inverse Document Frequency)



All three approaches above treat every word in a text as equally important: they have no notion that some words in a document matter more than others. _TF-IDF_ (_term frequency_-_inverse document frequency_) addresses this by quantifying the importance of a word relative to the other words in the document and in the corpus.

The idea behind _TF-IDF_ is this: if a word $w$ occurs many times in document $d_i$ but rarely in the other documents $d_j$, then $w$ must be highly significant for $d_i$. The importance of $w$ should grow in proportion to its frequency in $d_i$ and, at the same time, shrink in proportion to its frequency in the other documents $d_j$. Mathematically this is captured by two quantities, _TF_ and _IDF_, which are then combined into the **TF-IDF** score.

_TF_ (term frequency) measures how often a term or word occurs in a given document. Because documents differ in length, a term may occur more often in a longer document than in a shorter one. To normalize these counts, we divide the number of occurrences by the document length:<br>

$TF = \frac{\text{number of occurrences of term } t \text{ in document } d}{\text{total number of terms in document } d}$

_IDF_ (inverse document frequency) measures the importance of a term across all documents. When _TF_ is computed, every term gets the same importance (weight). However, not all words are important even when they occur often (for example, prepositions and conjunctions). To account for this, _IDF_ lowers the weight of terms that occur in many documents and raises the weight of rare terms. The _IDF_ of a term $t$ is computed as follows:<br>

$IDF = \log_2\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right)$

The **TF-IDF** score is computed as $TF \times IDF$.

Consider the following example. The corpus has 4 documents, the word «кошка» occurs in 3 of them, and in the document under consideration it accounts for 1 of the 3 terms. The __TF-IDF__ score for «кошка» is then computed as follows:<br>

| Word    | TF score      | IDF score               | TF-IDF score          |
| ------- | ------------- | ----------------------- | --------------------- |
| `кошка` | 1 / 3 ≈ 0.333 | $\log_2$ (4/3) ≈ 0.415  | 0.333 × 0.415 ≈ 0.138 |
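The arithmetic in the table can be reproduced by hand. The sketch below assumes a made-up, already tokenized corpus of four three-term documents in which «кошка» occurs once in the first document and appears in three documents overall. The helper names `term_frequency` and `inverse_document_frequency` are hypothetical and not part of any library.

```python
import math

def term_frequency(term, doc_tokens):
    """Share of the document's tokens that equal `term`."""
    return doc_tokens.count(term) / len(doc_tokens)

def inverse_document_frequency(term, corpus):
    """log2(N / df); assumes the term occurs in at least one document."""
    docs_with_term = sum(1 for doc_tokens in corpus if term in doc_tokens)
    return math.log2(len(corpus) / docs_with_term)

# Hypothetical tokenized corpus: 4 documents, "кошка" present in 3 of them
toy_corpus = [
    ["кошка", "ловить", "мышь"],
    ["кошка", "спать", "диван"],
    ["собака", "видеть", "кошка"],
    ["берёза", "расти", "лес"],
]

tf_cat = term_frequency("кошка", toy_corpus[0])
idf_cat = inverse_document_frequency("кошка", toy_corpus)
tf_idf_cat = tf_cat * idf_cat

print(round(tf_cat, 3), round(idf_cat, 3), round(tf_idf_cat, 3))
```

The values produced by scikit-learn's `TfidfVectorizer` will not match this hand calculation: by default it uses a smoothed natural-logarithm IDF and L2-normalizes every document vector.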


In [25]:
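# TF-IDF over lemmatized tokens: keep terms found in 30% to 85% of the documents, at most 350 features
tfi_df_vectorizer = TfidfVectorizer(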
    input='filename',
    strip_accents='unicode',
    lowercase=True,
    tokenizer=ru_tokenizer,
    max_features=350,
    max_df=0.85,
    min_df=0.3,
)
tf_idf = tfi_df_vectorizer.fit_transform(doc_files)
tf_idf_df = pd.DataFrame(
    data=tf_idf.toarray(),
    columns=tfi_df_vectorizer.get_feature_names_out()
)
tf_idf_df

Unnamed: 0,беларусь,больший,большинство,большои,большой,век,вероятно,взрослои,взрослый,взять,вместе,внимание,вода,водоём,возникнуть,возраст,время,всеи,встречаться,второи,второй,вывести,выполнять,высокий,высоко,высота,высотои,выявить,главный,глаз,говорить,год,группа,давать,дать,два,деиствительно,делать,дерево,диапазон,длинный,дом,домашнеи,домашний,достаточно,дыхательный,единый,животное,жизнь,жить,зависеть,задний,значение,зона,зрение,игра,известный,иметься,использовать,использоваться,исследование,источник,каждый,качество,книга,количество,конец,континент,корень,костеи,кость,которои,которыи,котёнок,кошачий,кошка,краинеи,красный,крупный,лежать,линия,любопытный,людеи,маленький,малый,мелкий,мера,место,минимум,мир,мнение,многих,многочисленный,момент,мощный,название,называть,называться,найти,наличие,научный,начать,начинать,нашеи,небольшой,невозможный,неи,несколько,несмотря,новои,новый,обитание,обитать,обладать,обнаружить,образ,образовать,общий,ограничить,однои,оказаться,окрас,основа,основание,особенность,...,пища,площадь,поведение,поверхность,позволить,позволять,поздний,показать,покров,пол,полностью,полный,половина,получить,популярный,порода,последний,посмотреть,похожий,появиться,правда,правило,практически,предок,предполагать,представитель,представлять,приблизительно,приводить,примерно,принимать,природа,проблема,происходить,происхождение,простой,пространство,протяжение,проходить,пять,работа,работать,равный,радость,раз,развитие,различный,размер,разный,раионе,ранний,распространение,распространить,редкий,результат,речь,решить,род,роль,ряд,самые,самыи,свежий,свободный,своеи,связать,сделать,северный,сезон,семеиства,система,сказать,слово,сложный,служить,случай,собака,собои,современный,согласно,соединить,создать,составлять,состояние,сравнение,среда,старый,стиль,сто,стоить,суметь,существовать,считать,считаться,такои,тело,территория,точка,три,тыс,тысяча,увеличить,удаться,уровень,условие,установить,факт,форма,характер,характерный,хороший,цвет,центральный,частично,часть,человек,челюсть
,череп,число,чувствовать,чёрный,шерсть,этои,являться,язык
0,0.0,0.012854,0.015659,0.009806,0.0,0.008718,0.008718,0.009806,0.009806,0.0,0.017436,0.0,0.0,0.009806,0.009806,0.0,0.044989,0.009806,0.023488,0.0,0.009806,0.009806,0.0,0.014156,0.006427,0.0,0.0,0.0,0.0,0.0,0.009806,0.089979,0.054806,0.009806,0.009806,0.028312,0.008718,0.0,0.0,0.009806,0.008718,0.0,0.029418,0.156928,0.008718,0.009806,0.0,0.086124,0.03539,0.007829,0.0,0.009806,0.0,0.009806,0.019612,0.0,0.007078,0.0,0.026155,0.008718,0.019612,0.026155,0.0,0.019612,0.008718,0.0,0.0,0.008718,0.009806,0.019612,0.019612,0.007078,0.0,0.029418,0.009806,0.931566,0.0,0.0,0.023488,0.009806,0.007829,0.0,0.0,0.007829,0.0,0.034873,0.0,0.007078,0.009806,0.026155,0.029418,0.019612,0.009806,0.019612,0.009806,0.076088,0.007078,0.0,0.009806,0.0,0.019612,0.0,0.009806,0.009806,0.007829,0.0,0.0,0.008718,0.015659,0.009806,0.0,0.009806,0.009806,0.0,0.009806,0.046977,0.019612,0.009806,0.0,0.0,0.0,0.0,0.007829,0.008718,0.008718,...,0.019612,0.0,0.04903,0.009806,0.0,0.0,0.009806,0.017436,0.019612,0.009806,0.009806,0.0,0.0,0.008718,0.017436,0.026155,0.0,0.0,0.0,0.009806,0.0,0.008718,0.008718,0.04903,0.019612,0.007829,0.0,0.009806,0.008718,0.017436,0.009806,0.008718,0.0,0.019612,0.008718,0.0,0.0,0.009806,0.0,0.017436,0.0,0.0,0.0,0.0,0.019612,0.017436,0.015659,0.007078,0.015659,0.009806,0.009806,0.0,0.008718,0.009806,0.007829,0.0,0.0,0.009806,0.0,0.009806,0.009806,0.009806,0.0,0.009806,0.0,0.009806,0.0,0.009806,0.0,0.007829,0.019612,0.0,0.068642,0.0,0.0,0.015659,0.04903,0.014156,0.009806,0.019612,0.019612,0.0,0.028312,0.0,0.009806,0.007078,0.0,0.0,0.0,0.0,0.009806,0.0,0.039224,0.008718,0.0,0.03539,0.015659,0.019612,0.019612,0.019612,0.031318,0.0,0.009806,0.0,0.017436,0.029418,0.029418,0.0,0.0,0.008718,0.0,0.0,0.009806,0.0,0.007078,0.102833,0.029418,0.04903,0.017436,0.0,0.0,0.009806,0.0,0.077858,0.09806
1,0.0,0.026944,0.0,0.0,0.041109,0.036549,0.036549,0.041109,0.0,0.041109,0.0,0.036549,0.041109,0.0,0.0,0.036549,0.026944,0.0,0.032823,0.0,0.0,0.0,0.0,0.0,0.026944,0.073098,0.0,0.041109,0.0,0.082218,0.0,0.134719,0.0,0.041109,0.0,0.0,0.0,0.0,0.041109,0.0,0.036549,0.0,0.041109,0.0,0.073098,0.0,0.0,0.22976,0.089018,0.0,0.036549,0.0,0.0,0.0,0.0,0.082218,0.0,0.0,0.0,0.0,0.0,0.0,0.041109,0.0,0.0,0.036549,0.0,0.036549,0.0,0.0,0.0,0.0,0.0,0.082218,0.041109,0.78107,0.0,0.0,0.0,0.0,0.0,0.041109,0.0,0.0,0.0,0.0,0.0,0.029673,0.0,0.0,0.0,0.041109,0.0,0.0,0.0,0.024537,0.059345,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.032823,0.041109,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.032823,0.123327,0.0,0.0,0.036549,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.328939,0.0,0.0,0.036549,0.0,0.0,0.036549,0.0,0.041109,0.0,0.032823,0.0,0.0,0.036549,0.0,0.041109,0.0,0.082218,0.0,0.109646,0.0,0.073098,0.041109,0.041109,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.032823,0.029673,0.065646,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041109,0.0,0.0,0.0,0.0,0.0,0.0,0.082218,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041109,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041109,0.0,0.0,0.036549,0.0,0.0,0.0,0.036549,0.0,0.032823,0.0,0.036549,0.0,0.0,0.036549,0.029673,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.082218,0.073098,0.036549,0.0,0.041109,0.0,0.0,0.0,0.053888,0.0,0.0,0.0,0.041109,0.0,0.123327,0.098469,0.0,0.0
2,0.0,0.0,0.041595,0.0,0.0,0.0,0.0,0.0,0.052096,0.0,0.092634,0.0,0.0,0.0,0.0,0.092634,0.0,0.0,0.041595,0.0,0.0,0.0,0.0,0.037603,0.034145,0.0,0.0,0.0,0.046317,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.046317,0.0,0.0,0.0,0.0,0.156288,0.0,0.185269,0.0,0.052096,0.0,0.249573,0.075207,0.0,0.046317,0.0,0.0,0.0,0.0,0.0,0.037603,0.0,0.046317,0.0,0.0,0.046317,0.0,0.0,0.0,0.092634,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.104192,0.104192,0.625153,0.0,0.0,0.0,0.0,0.0,0.0,0.052096,0.0,0.104192,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037603,0.0,0.0,0.052096,0.0,0.0,0.0,0.052096,0.0,0.0,0.046317,0.0,0.041595,0.0,0.0,0.0,0.0,0.046317,0.0,0.0,0.0,0.0,0.052096,0.0,0.041595,0.104192,0.0,0.0,0.046317,...,0.0,0.0,0.0,0.0,0.104192,0.052096,0.0,0.0,0.052096,0.052096,0.156288,0.0,0.0,0.0,0.0,0.231586,0.0,0.0,0.0,0.0,0.046317,0.046317,0.0,0.0,0.0,0.041595,0.0,0.0,0.0,0.0,0.104192,0.046317,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052096,0.052096,0.0,0.138952,0.0,0.0,0.0,0.0,0.0,0.052096,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052096,0.0,0.0,0.0,0.138952,0.052096,0.0,0.0,0.0,0.041595,0.052096,0.0,0.0,0.052096,0.0,0.0,0.052096,0.0,0.0,0.052096,0.0,0.0,0.0,0.092634,0.0,0.037603,0.0,0.0,0.0,0.083191,0.052096,0.0,0.0,0.0,0.046317,0.037603,0.041595,0.0,0.0,0.0,0.0,0.046317,0.0,0.0,0.092634,0.0,0.0,0.0,0.0,0.046317,0.0,0.0,0.0,0.0,0.0,0.102435,0.0,0.0,0.0,0.0,0.0,0.364673,0.0,0.0,0.0
3,0.0,0.0,0.0,0.014229,0.014229,0.0,0.012651,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014229,0.063253,0.04663,0.0,0.022722,0.014229,0.0,0.014229,0.0,0.0,0.009326,0.0,0.014229,0.028458,0.0,0.0,0.0,0.214497,0.045443,0.0,0.014229,0.020541,0.0,0.012651,0.0,0.0,0.0,0.0,0.042687,0.101204,0.0,0.0,0.014229,0.068165,0.01027,0.011361,0.0,0.0,0.014229,0.0,0.042687,0.014229,0.01027,0.014229,0.0,0.025301,0.042687,0.037952,0.0,0.0,0.0,0.0,0.0,0.012651,0.0,0.028458,0.028458,0.0,0.0,0.0,0.0,0.0,0.014229,0.0,0.022722,0.0,0.034083,0.0,0.014229,0.011361,0.0,0.0,0.014229,0.041082,0.028458,0.050602,0.042687,0.0,0.0,0.0,0.014229,0.008493,0.0,0.028458,0.014229,0.0,0.028458,0.012651,0.014229,0.0,0.011361,0.0,0.0,0.012651,0.011361,0.0,0.022722,0.0,0.0,0.0,0.028458,0.011361,0.0,0.014229,0.014229,0.0,0.0,0.0,0.011361,0.037952,0.0,...,0.014229,0.0,0.042687,0.0,0.0,0.014229,0.0,0.063253,0.0,0.028458,0.0,0.014229,0.0,0.0,0.012651,0.063253,0.012651,0.0,0.012651,0.0,0.0,0.0,0.0,0.12806,0.056916,0.022722,0.012651,0.014229,0.0,0.0,0.0,0.0,0.0,0.056916,0.050602,0.0,0.0,0.014229,0.0,0.012651,0.014229,0.0,0.0,0.0,0.0,0.0,0.011361,0.020541,0.068165,0.0,0.028458,0.0,0.012651,0.014229,0.045443,0.025301,0.0,0.0,0.014229,0.014229,0.0,0.0,0.0,0.0,0.0,0.014229,0.025301,0.0,0.0,0.011361,0.0,0.014229,0.071144,0.0,0.014229,0.0,0.882191,0.01027,0.056916,0.014229,0.0,0.0,0.0,0.0,0.0,0.01027,0.028458,0.0,0.0,0.0,0.0,0.037952,0.042687,0.012651,0.0,0.0,0.011361,0.042687,0.0,0.199205,0.034083,0.0,0.0,0.014229,0.0,0.014229,0.014229,0.0,0.012651,0.0,0.0,0.0,0.014229,0.0,0.0,0.09326,0.014229,0.014229,0.037952,0.0,0.0,0.0,0.011361,0.041082,0.028458
4,0.0,0.107284,0.09802,0.040921,0.0,0.0,0.0,0.040921,0.245529,0.0,0.036382,0.0,0.204607,0.081843,0.0,0.0,0.134105,0.040921,0.0,0.0,0.0,0.0,0.0,0.059075,0.026821,0.036382,0.0,0.040921,0.0,0.040921,0.0,0.0,0.032673,0.0,0.0,0.059075,0.0,0.0,0.081843,0.040921,0.072764,0.0,0.0,0.036382,0.0,0.040921,0.040921,0.196039,0.147687,0.09802,0.036382,0.28645,0.040921,0.0,0.0,0.0,0.0,0.040921,0.036382,0.0,0.040921,0.036382,0.0,0.040921,0.0,0.0,0.040921,0.036382,0.0,0.040921,0.28645,0.029537,0.0,0.0,0.0,0.0,0.040921,0.040921,0.0,0.040921,0.0,0.0,0.0,0.0,0.0,0.036382,0.040921,0.0,0.0,0.0,0.0,0.040921,0.040921,0.0,0.040921,0.04885,0.029537,0.040921,0.0,0.040921,0.0,0.0,0.0,0.0,0.032673,0.0,0.0,0.072764,0.032673,0.040921,0.0,0.081843,0.163686,0.109146,0.081843,0.09802,0.040921,0.0,0.040921,0.040921,0.032673,0.040921,0.0,0.0,0.072764,...,0.040921,0.081843,0.040921,0.122764,0.0,0.28645,0.0,0.0,0.040921,0.0,0.0,0.0,0.040921,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.181911,0.072764,0.0,0.0,0.09802,0.072764,0.0,0.036382,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.072764,0.0,0.081843,0.0,0.0,0.0,0.0,0.0,0.0,0.109146,0.09802,0.029537,0.032673,0.040921,0.0,0.040921,0.072764,0.040921,0.0,0.036382,0.0,0.122764,0.0,0.0,0.0,0.0,0.0,0.040921,0.0,0.0,0.0,0.040921,0.040921,0.09802,0.0,0.0,0.0,0.040921,0.040921,0.032673,0.0,0.059075,0.0,0.0,0.040921,0.040921,0.029537,0.036382,0.081843,0.059075,0.0,0.0,0.0,0.0,0.0,0.036382,0.0,0.0,0.0,0.059075,0.0,0.0,0.040921,0.0,0.0,0.036382,0.0,0.0,0.036382,0.0,0.0,0.0,0.0,0.036382,0.032673,0.040921,0.040921,0.040921,0.088612,0.053642,0.040921,0.163686,0.0,0.0,0.036382,0.0,0.032673,0.029537,0.040921
5,0.0,0.042301,0.051531,0.0,0.0,0.0,0.057381,0.0,0.0,0.0,0.057381,0.114762,0.0,0.0,0.0,0.0,0.169205,0.0,0.0,0.06454,0.06454,0.0,0.06454,0.0,0.0,0.0,0.0,0.0,0.057381,0.06454,0.3227,0.253807,0.103062,0.0,0.0,0.046585,0.057381,0.057381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.12908,0.0,0.12908,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06454,0.0,0.0,0.0,0.19362,0.0,0.0,0.0,0.0,0.046585,0.057381,0.0,0.0,0.0,0.12908,0.06454,0.051531,0.0,0.051531,0.0,0.0,0.0,0.06454,0.0,0.12908,0.0,0.0,0.057381,0.0,0.0,0.0,0.19362,0.0,0.077045,0.0,0.0,0.0,0.0,0.06454,0.057381,0.0,0.0,0.0,0.0,0.114762,0.057381,0.051531,0.06454,0.154594,0.0,0.0,0.0,0.0,0.103062,0.0,0.19362,0.0,0.25816,0.051531,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.06454,0.0,0.12908,0.057381,0.0,0.0,0.0,0.0,0.06454,0.057381,0.0,0.0,0.057381,0.06454,0.057381,0.06454,0.114762,0.0,0.057381,0.0,0.0,0.0,0.0,0.0,0.0,0.114762,0.0,0.0,0.12908,0.06454,0.0,0.06454,0.0,0.0,0.0,0.0,0.06454,0.057381,0.0,0.0,0.19362,0.057381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.057381,0.0,0.0,0.19362,0.06454,0.0,0.06454,0.06454,0.0,0.0,0.0,0.114762,0.0,0.06454,0.0,0.0,0.06454,0.0,0.0,0.0,0.051531,0.0,0.0,0.0,0.0,0.0,0.0,0.046585,0.0,0.06454,0.0,0.0,0.057381,0.0,0.103062,0.06454,0.0,0.0,0.0,0.0,0.046585,0.0,0.12908,0.0,0.0,0.0,0.057381,0.06454,0.06454,0.0,0.0,0.06454,0.0,0.0,0.0,0.154594,0.0,0.0,0.12908,0.046585,0.169205,0.0,0.0,0.057381,0.06454,0.057381,0.0,0.0,0.046585,0.0
6,0.0,0.046491,0.056635,0.0,0.0,0.126127,0.0,0.0,0.0,0.070932,0.0,0.0,0.0,0.0,0.070932,0.0,0.0,0.141864,0.0,0.0,0.070932,0.0,0.070932,0.051199,0.0,0.189191,0.070932,0.0,0.063064,0.0,0.0,0.278944,0.0,0.0,0.0,0.0,0.063064,0.063064,0.0,0.0,0.0,0.212796,0.0,0.0,0.063064,0.0,0.070932,0.0,0.102398,0.056635,0.0,0.0,0.0,0.0,0.0,0.0,0.051199,0.0,0.0,0.063064,0.0,0.0,0.070932,0.070932,0.063064,0.126127,0.070932,0.0,0.212796,0.0,0.0,0.051199,0.126127,0.0,0.0,0.0,0.0,0.0,0.056635,0.0,0.056635,0.070932,0.0,0.056635,0.0,0.0,0.0,0.051199,0.070932,0.063064,0.212796,0.0,0.0,0.0,0.0,0.042337,0.051199,0.070932,0.0,0.070932,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.056635,0.0,0.0,0.063064,0.0,0.113269,0.0,0.0,0.0,0.070932,0.0,0.0,0.056635,0.063064,0.0,...,0.0,0.212796,0.0,0.070932,0.141864,0.0,0.0,0.063064,0.0,0.0,0.0,0.0,0.0,0.126127,0.0,0.0,0.0,0.0,0.126127,0.0,0.063064,0.0,0.0,0.0,0.141864,0.0,0.126127,0.0,0.0,0.0,0.0,0.063064,0.141864,0.0,0.0,0.0,0.063064,0.0,0.0,0.126127,0.0,0.126127,0.070932,0.0,0.0,0.0,0.0,0.051199,0.0,0.0,0.0,0.070932,0.0,0.0,0.056635,0.0,0.070932,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.189191,0.0,0.189191,0.0,0.0,0.0,0.070932,0.0,0.0,0.0,0.0,0.056635,0.0,0.102398,0.070932,0.0,0.0,0.0,0.051199,0.0,0.0,0.051199,0.0,0.126127,0.070932,0.0,0.0,0.126127,0.070932,0.0,0.189191,0.0,0.0,0.0,0.070932,0.0,0.113269,0.0,0.0,0.0,0.0,0.0,0.0,0.141864,0.063064,0.0,0.0,0.0,0.0,0.0,0.102398,0.139472,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.153597,0.0
7,0.194074,0.127201,0.0,0.0,0.194074,0.172546,0.0,0.0,0.0,0.064691,0.0,0.115031,0.194074,0.194074,0.0,0.0,0.084801,0.0,0.0,0.064691,0.0,0.0,0.0,0.093389,0.0,0.0,0.0,0.0,0.057515,0.0,0.0,0.212002,0.154956,0.0,0.0,0.046695,0.0,0.057515,0.0,0.0,0.0,0.129383,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.103304,0.0,0.0,0.0,0.129383,0.0,0.064691,0.046695,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.057515,0.0,0.0,0.064691,0.0,0.0,0.046695,0.057515,0.0,0.0,0.0,0.0,0.0,0.051652,0.064691,0.051652,0.064691,0.129383,0.051652,0.064691,0.057515,0.0,0.280168,0.0,0.0,0.0,0.0,0.064691,0.064691,0.0,0.077225,0.0,0.0,0.0,0.0,0.0,0.057515,0.0,0.0,0.051652,0.064691,0.057515,0.0,0.0,0.0,0.0,0.064691,0.129383,0.057515,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.051652,0.0,0.0,...,0.0,0.064691,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.064691,0.064691,0.0,0.115031,0.0,0.057515,0.194074,0.0,0.0,0.0,0.0,0.057515,0.0,0.0,0.0,0.0,0.064691,0.057515,0.057515,0.0,0.115031,0.0,0.0,0.115031,0.064691,0.057515,0.0,0.129383,0.057515,0.0,0.057515,0.064691,0.064691,0.0,0.0,0.051652,0.0,0.0,0.064691,0.0,0.0,0.0,0.0,0.051652,0.0,0.064691,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.115031,0.0,0.0,0.0,0.064691,0.0,0.0,0.0,0.064691,0.0,0.0,0.0,0.0,0.046695,0.0,0.0,0.0,0.064691,0.046695,0.0,0.0,0.046695,0.129383,0.057515,0.064691,0.25826,0.0,0.0,0.0,0.115031,0.0,0.046695,0.309912,0.0,0.0,0.0,0.051652,0.0,0.0,0.064691,0.115031,0.0,0.0,0.0,0.057515,0.0,0.051652,0.0,0.0,0.0,0.093389,0.0,0.0,0.0,0.0,0.0,0.057515,0.0,0.051652,0.093389,0.0
8,0.12776,0.041868,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.113588,0.0,0.0,0.0,0.0,0.0,0.06388,0.06388,0.0,0.041868,0.0,0.0,0.0,0.0,0.0,0.0,0.586159,0.0,0.0,0.0,0.046109,0.0,0.0,0.0,0.12776,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.056794,0.06388,0.0,0.0,0.0,0.0,0.0,0.12776,0.0,0.0,0.0,0.0,0.0,0.0,0.113588,0.0,0.0,0.0,0.0,0.0,0.0,0.092218,0.056794,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06388,0.0,0.0,0.056794,0.06388,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.102008,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.051004,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.06388,0.0,0.0,0.0,0.06388,0.06388,0.0,0.113588,0.056794,0.0,0.056794,0.06388,0.0,0.06388,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.056794,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.12776,0.056794,0.0,0.0,0.12776,0.0,0.0,0.0,0.0,0.0,0.06388,0.0,0.0,0.0,0.0,0.056794,0.0,0.0,0.06388,0.0,0.0,0.0,0.06388,0.0,0.0,0.0,0.170381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.102008,0.0,0.0,0.0,0.0,0.0,0.0,0.046109,0.397557,0.0,0.0,0.0,0.0,0.06388,0.102008,0.0,0.0,0.0,0.0,0.056794,0.0,0.0,0.0,0.0,0.06388,0.153012,0.056794,0.06388,0.0,0.0,0.12776,0.0,0.0,0.0,0.0,0.051004,0.0,0.0,0.06388,0.0,0.0,0.0,0.0,0.056794,0.0,0.0,0.0,0.051004,0.0,0.0
9,0.143748,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.063901,0.0,0.0,0.0,0.0,0.047108,0.0,0.114774,0.0,0.0,0.0,0.0,0.103758,0.047108,0.063901,0.143748,0.0,0.0,0.0,0.071874,0.0,0.0,0.143748,0.071874,0.0,0.0,0.0,0.503118,0.0,0.063901,0.0,0.0,0.0,0.063901,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071874,0.0,0.0,0.103758,0.0,0.127802,0.255604,0.0,0.0,0.0,0.0,0.063901,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071874,0.0,0.0,0.0,0.0,0.0,0.057387,0.0,0.063901,0.0,0.051879,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.128699,0.103758,0.0,0.0,0.0,0.0,0.0,0.0,0.071874,0.0,0.071874,0.063901,0.0,0.0,0.0,0.057387,0.0,0.0,0.0,0.0,0.0,0.143748,0.0,0.0,0.0,0.057387,0.0,0.0,0.063901,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.063901,0.0,0.0,0.0,0.0,0.0,0.063901,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071874,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071874,0.0,0.0,0.0,0.051879,0.057387,0.0,0.0,0.0,0.127802,0.0,0.114774,0.0,0.0,0.071874,0.0,0.0,0.071874,0.071874,0.071874,0.0,0.127802,0.0,0.0,0.071874,0.0,0.057387,0.0,0.0,0.0,0.071874,0.071874,0.0,0.0,0.051879,0.0,0.0,0.0,0.071874,0.0,0.0,0.0,0.0,0.143748,0.0,0.0,0.0,0.0,0.0,0.0,0.063901,0.0,0.0,0.057387,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.503118,0.0,0.0,0.114774,0.071874,0.0,0.0,0.051879,0.0,0.0,0.0,0.0,0.071874,0.063901,0.0,0.0,0.0,0.0


In [26]:
# tf_idf is already a SciPy sparse matrix, so no dense round trip through .toarray() is needed
tf_idf_sparse_matrix = sparse.csr_matrix(tf_idf)
tf_idf_cosine_similarities = cosine_similarity(tf_idf_sparse_matrix)

tf_idf_similarity_df = pd.DataFrame(
    tf_idf_cosine_similarities,
    columns=doc_subject_names,
    index=doc_subject_names,
)

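# Keep only the strictly lower triangle; the builtin bool is used because some NumPy versions do not provide np.bool
tf_idf_matrix_similarity = tf_idf_similarity_df.where(
    np.tril(np.ones(tf_idf_similarity_df.shape), k=-1).astype(bool)
)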
tf_idf_styled_similarity = (tf_idf_matrix_similarity
    .style
    .background_gradient(cmap='cividis')
    .highlight_null('white')
    .format("{:.2%}", na_rep="")
)
tf_idf_styled_similarity

Unnamed: 0,1_animals_cat,2_animals_cat,3_animals_cat,4_animals_dog,5_animals_frog,6_movie_thunderbolts*,7_real_estate_skyscraper,8_nature_lake_naroch,9_auto_geely,10_plants_birch
1_animals_cat,,,,,,,,,,
2_animals_cat,80.22%,,,,,,,,,
3_animals_cat,68.92%,73.87%,,,,,,,,
4_animals_dog,15.55%,10.89%,12.39%,,,,,,,
5_animals_frog,13.51%,15.44%,19.76%,9.33%,,,,,,
6_movie_thunderbolts*,11.78%,9.45%,8.28%,14.78%,16.62%,,,,,
7_real_estate_skyscraper,11.56%,15.89%,15.62%,16.22%,17.27%,30.49%,,,,
8_nature_lake_naroch,9.62%,14.07%,13.23%,12.04%,22.34%,26.12%,33.52%,,,
9_auto_geely,9.16%,11.58%,7.51%,18.37%,12.34%,36.12%,30.37%,24.87%,,
10_plants_birch,4.63%,12.13%,6.98%,4.60%,14.91%,12.68%,20.85%,20.25%,4.39%,


<hr>
<br>

## Latent Semantic Analysis (LSA)


Latent semantic analysis (LSA) is a technique used in natural language processing and information retrieval to uncover latent (hidden) relationships between words and concepts in text.

The most common variant of LSA is based on the singular value decomposition (SVD) of the document-term matrix. The decomposition helps reveal patterns in the relationships between words and documents.

SVD is a matrix factorization that decomposes the original matrix $\mathbf{X}$ into a product of three matrices: $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{T}$

For an $m \times n$ matrix $\mathbf{X}$ of rank $r$, the matrices $\mathbf{U}$ and $\mathbf{V}$ have orthonormal columns and sizes $m \times r$ and $n \times r$ respectively, and $\mathbf{\Sigma}$ is an $r \times r$ diagonal matrix that holds the singular values. The rationale is that each of the $r$ singular values corresponds to one high-level component (latent concept) found in the document collection and indicates how important that component is across the whole collection.
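The decomposition can be checked numerically. The snippet below is a minimal sketch that uses `numpy.linalg.svd` on a small made-up count matrix (the values in `X` and the cut-off `k` are assumptions for illustration only). Note that NumPy's thin SVD returns `min(m, n)` singular values, some of which may be zero when the rank is lower.

```python
import numpy as np

# Toy count matrix: 4 documents (rows) x 5 terms (columns); values are illustrative.
X = np.array([
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
], dtype=float)

# Thin SVD: U is m x k, S holds k singular values, Vt is k x n, with k = min(m, n).
U, S, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, S.shape, Vt.shape)

# Singular values are returned in descending order.
print(np.round(S, 3))

# Multiplying the three factors back together recovers X.
X_restored = U @ np.diag(S) @ Vt
print(np.allclose(X, X_restored))

# Keeping only the k largest singular values gives the low-rank
# approximation that LSA works with.
k = 2  # hypothetical number of latent concepts
X_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
print(np.round(X_k, 2))
```

`TruncatedSVD` computes only the leading `k` components instead of the full factorization, which is what makes it practical for large sparse matrices.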

<img src='../images/svd.jpg' width="428" height="428" >

An implementation of truncated `SVD` is available as [**TruncatedSVD**](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) in the `Scikit-learn` library.

For example, the related terms "doctor" and "physician" end up close to each other in the low-dimensional latent space built by LSA, because they express the same concept.

Latent semantic indexing (LSI) is a special case of latent semantic analysis: LSA applied to document indexing and search. LSI is one of the foundational techniques for understanding and retrieving documents. Because it is simple and computationally cheap, the approach is still in use.

In [27]:
# Initialize the vectorizer
doc_vectorizer = CountVectorizer(
    input='content',
    lowercase=True,
    tokenizer=ru_tokenizer,
    analyzer='word',
    binary=False,
)

# Texts to analyze
docs = [
    'Кошки и собаки - прекрасные домашние животные.',
    'Собаки - преданные питомцы.',
    'Домашние животные приносят радость и счастье.',
    'Счастье и радость наполняют жизнь смыслом.',
]

# Transform the texts into count vectors
docs_vectors = doc_vectorizer.fit_transform(docs)
docs_df = pd.DataFrame(
    data=docs_vectors.toarray(),
    columns=doc_vectorizer.get_feature_names_out()
)
docs_df

Unnamed: 0,домашний,животное,жизнь,кошка,наполнять,питомец,преданный,прекрасный,приносить,радость,смысл,собака,счастие
0,1,1,0,1,0,0,0,1,0,0,0,1,0
1,0,0,0,0,0,1,1,0,0,0,0,1,0
2,1,1,0,0,0,0,0,0,1,1,0,0,1
3,0,0,1,0,1,0,0,0,0,1,1,0,1


In [28]:
# Build the LSA matrix; n_components is the number of latent topics
svd = TruncatedSVD(n_components=2, random_state=42)
lsa_matrix = svd.fit_transform(docs_vectors)

print(f'Singular Values (Concept Strength):\n{svd.singular_values_}\n{"-"*30}')
print(f'Document-Concept Similarity Matrix:\n{lsa_matrix}\n{"-"*30}')
print(f'Term-Concept Similarity Matrix:\n{svd.components_.T}')

Singular Values (Concept Strength):
[2.80749364 2.28548946]
------------------------------
Document-Concept Similarity Matrix:
[[ 1.46019866  1.52688008]
 [ 0.2990972   0.6867129 ]
 [ 1.95461268 -0.17275656]
 [ 1.35641828 -1.54618235]]
------------------------------
Term-Concept Similarity Matrix:
[[ 0.4332406   0.2592387 ]
 [ 0.4332406   0.2592387 ]
 [ 0.17209017 -0.29600719]
 [ 0.1852569   0.29231189]
 [ 0.17209017 -0.29600719]
 [ 0.03794677  0.131467  ]
 [ 0.03794677  0.131467  ]
 [ 0.1852569   0.29231189]
 [ 0.24798371 -0.03307319]
 [ 0.42007388 -0.32908038]
 [ 0.17209017 -0.29600719]
 [ 0.22320366  0.42377889]
 [ 0.42007388 -0.32908038]]
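The term-concept matrix is easier to read when each concept is described by its strongest terms. The sketch below is self-contained: it re-enters the counts and lemmas from the `docs_df` table above by hand instead of calling `ru_tokenizer`, and `top_n` is a hypothetical cut-off chosen only for readability. The sign of a whole component is arbitrary in SVD, so terms are ranked by absolute weight.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Lemmas and counts copied from the docs_df table (4 documents x 13 terms).
terms = [
    'домашний', 'животное', 'жизнь', 'кошка', 'наполнять', 'питомец', 'преданный',
    'прекрасный', 'приносить', 'радость', 'смысл', 'собака', 'счастие',
]
X = np.array([
    [1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1],
    [0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1],
], dtype=float)

svd = TruncatedSVD(n_components=2, random_state=42)
doc_concepts = svd.fit_transform(X)

# Cross-check: the exact SVD should give the same leading singular values.
exact = np.linalg.svd(X, compute_uv=False)[:2]
print(np.round(svd.singular_values_, 4), np.round(exact, 4))

# For each concept, list the terms with the largest absolute weight.
top_n = 4  # hypothetical cut-off
top_terms = []
for concept_idx, weights in enumerate(svd.components_):
    order = np.argsort(-np.abs(weights))[:top_n]
    top_terms.append([terms[i] for i in order])
    pairs = ', '.join(f'{terms[i]} ({weights[i]:+.2f})' for i in order)
    print(f'Concept {concept_idx + 1}: {pairs}')
```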


With LSA, documents can be retrieved and ranked by their relevance to a query.

In [29]:
# Query
query = 'Здоровые питомцы приносят счастье.'

# Convert the query into a count vector using the fitted vocabulary
query_vector = doc_vectorizer.transform([query])

# Project the query into the LSA concept space
query_lsa = svd.transform(query_vector)

# Compare the query with the documents by cosine similarity
similarities = cosine_similarity(query_lsa, lsa_matrix)

# Sort documents from most to least similar
doc_indices = np.argsort(similarities[0])[::-1]

for idx in doc_indices:
    print(f'Doc {idx + 1}: similarity — {similarities[0][idx]:.2%} | {docs[idx]}')

Doc 3: similarity — 97.42% | Домашние животные приносят радость и счастье.
Doc 4: similarity — 86.03% | Счастье и радость наполняют жизнь смыслом.
Doc 1: similarity — 43.25% | Кошки и собаки - прекрасные домашние животные.
Doc 2: similarity — 9.48% | Собаки - преданные питомцы.
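What the ranking does internally can be spelled out with plain NumPy. The sketch below is an illustration under stated assumptions: the count matrix is re-entered by hand from the `docs_df` table, and the query vector assumes that the tokenizer maps the query to the known lemmas «питомец», «приносить» and «счастие» while dropping the out-of-vocabulary word. It shows that `svd.transform` is a projection onto the concept axes and that the ranking is the cosine of the angle between the projected vectors.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Counts copied from the docs_df table (4 documents x 13 terms, alphabetical lemma order).
X = np.array([
    [1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1],
    [0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1],
], dtype=float)

svd = TruncatedSVD(n_components=2, random_state=42)
docs_lsa = svd.fit_transform(X)

# Assumed query vector: columns 5, 8 and 12 are 'питомец', 'приносить', 'счастие'.
q = np.zeros((1, X.shape[1]))
q[0, [5, 8, 12]] = 1.0

# transform() projects the counts onto the concept axes: q @ V.
q_lsa = q @ svd.components_.T
print(np.allclose(q_lsa, svd.transform(q)))

# Cosine similarity: dot product divided by the product of vector lengths.
sims = (docs_lsa @ q_lsa.ravel()) / (
    np.linalg.norm(docs_lsa, axis=1) * np.linalg.norm(q_lsa)
)

# Document indices from most to least similar (0-based).
order = np.argsort(-sims)
for idx in order:
    print(f'Doc {idx + 1}: {sims[idx]:.2%}')
```

Because the same sign flip would apply to both the documents and the query, the arbitrary sign of SVD components does not affect the similarities.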


<br>

**Let us apply LSA to a small set of documents on different topics.**

In [30]:
doc_files_vectorizer = CountVectorizer(
    input='filename',
    strip_accents='unicode',
    lowercase=True,
    tokenizer=ru_tokenizer,
    analyzer='word',
    max_features=1000,
    binary=False,
)
doc_files_vectors = doc_files_vectorizer.fit_transform(doc_files)

lsa_model = TruncatedSVD(n_components=7, random_state=42)
lsa_docs_matrix = lsa_model.fit_transform(doc_files_vectors)
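One way to sanity-check the choice of `n_components` is to look at how much variance the retained components capture. The snippet below is a sketch on a synthetic Poisson count matrix (the data, the matrix size, and the 80% `target` are assumptions, not values from this notebook); the same attribute, `explained_variance_ratio_`, is available on a fitted `TruncatedSVD` model such as `lsa_model`.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Synthetic stand-in for a document-term count matrix: 10 documents x 50 terms.
rng = np.random.default_rng(42)
X = rng.poisson(lam=1.0, size=(10, 50)).astype(float)

svd = TruncatedSVD(n_components=9, random_state=42)
svd.fit(X)

# Cumulative share of variance captured by the first k components.
cumulative = np.cumsum(svd.explained_variance_ratio_)
for k, share in enumerate(cumulative, start=1):
    print(f'{k} components: {share:.1%} of the variance')

# Smallest number of components that reaches the target share, if any does.
target = 0.8  # hypothetical threshold
reached = np.flatnonzero(cumulative >= target)
n_enough = int(reached[0]) + 1 if reached.size else None
print(n_enough)
```

`TruncatedSVD` does not center the data, so the individual ratios are not guaranteed to decrease monotonically; the cumulative sum is the more reliable quantity to inspect.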

In [31]:
docs_similarities = cosine_similarity(lsa_docs_matrix)

lsa_similarity_df = pd.DataFrame(
    docs_similarities,
    columns=doc_subject_names,
    index=doc_subject_names,
)

# Keep only the strictly lower triangle (hide the diagonal and the duplicate upper half)
lsa_matrix_similarity = lsa_similarity_df.where(
    np.tril(np.ones(lsa_similarity_df.shape), k=-1).astype(bool)
)
lsa_styled_similarity = (lsa_matrix_similarity
    .style
    .background_gradient(cmap='cividis')
    .highlight_null('white')
    .format("{:.2%}", na_rep="")
)
lsa_styled_similarity

Unnamed: 0,1_animals_cat,2_animals_cat,3_animals_cat,4_animals_dog,5_animals_frog,6_movie_thunderbolts*,7_real_estate_skyscraper,8_nature_lake_naroch,9_auto_geely,10_plants_birch
1_animals_cat,,,,,,,,,,
2_animals_cat,99.11%,,,,,,,,,
3_animals_cat,97.88%,99.20%,,,,,,,,
4_animals_dog,21.14%,19.51%,18.48%,,,,,,,
5_animals_frog,9.98%,13.04%,20.99%,5.19%,,,,,,
6_movie_thunderbolts*,10.26%,14.86%,20.24%,12.90%,5.51%,,,,,
7_real_estate_skyscraper,14.14%,20.57%,25.84%,18.74%,8.56%,98.14%,,,,
8_nature_lake_naroch,5.18%,10.38%,10.03%,6.86%,4.97%,6.28%,18.66%,,,
9_auto_geely,6.81%,13.52%,7.44%,12.46%,3.06%,12.87%,20.31%,6.47%,,
10_plants_birch,3.26%,12.51%,17.04%,2.86%,7.87%,1.21%,13.52%,4.79%,1.29%,


<div class="alert alert-warning">

The resulting matrix shows that the documents about cats are very similar to one another: they form a cluster of similar documents.