### Task
Implement a classification task using a BERT-like model and KNN on the Russian Intents Dataset from Kaggle.

### Solution

#### Import libraries

In [2]:
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import torch
import os
from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel, AutoModelForMaskedLM



#### Load the data

In [3]:
base_path = './dz__13'

df_train = pd.read_csv(os.path.join(base_path, "dataset_train.tsv"), sep='\t', header=None)
df_test = pd.read_csv(os.path.join(base_path, "dataset_test.tsv"), sep='\t', header=None)

In [24]:
df_test.head()

|   | 0 | 1 |
|---|---|---|
| 0 | как получить справку | statement_general |
| 1 | мне нужна справка | statement_general |
| 2 | справка студента эф петь | conform |
| 3 | справка студента фф оформлять | conform |
| 4 | как мне заказать справка об обучении | conform |


In [25]:
df_train.head()

|   | 0 | 1 |
|---|---|---|
| 0 | мне нужна справка | statement_general |
| 1 | оформить справку | statement_general |
| 2 | взять справку | statement_general |
| 3 | справку как получить | statement_general |
| 4 | справку ммф где получаться | statement_general |


In [12]:
Train_data = torch.utils.data.DataLoader(df_train.to_records(index=False).tolist(), batch_size=10)
Test_data = torch.utils.data.DataLoader(df_test.to_records(index=False).tolist(), batch_size=10)
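
The embedding loop later in this notebook iterates with `for X, Y in Data`, so each batch has to arrive as a group of texts plus a group of labels rather than as a list of (text, label) rows. The regrouping can be sketched in plain Python as follows. This is only an illustration of the idea, assuming rows are (text, label) tuples; `iter_batches` is a hypothetical helper, not part of PyTorch.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield (texts, labels) tuples, regrouping row-wise pairs column-wise."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        # zip(*chunk) turns [(text, label), ...] into (texts, labels).
        texts, labels = zip(*chunk)
        yield texts, labels


# Rows taken from the df_train.head() preview.
rows = [
    ("мне нужна справка", "statement_general"),
    ("оформить справку", "statement_general"),
    ("взять справку", "statement_general"),
]

batches = list(iter_batches(rows, batch_size=2))
print(len(batches))
print(batches[0])
```

The last batch can be shorter than `batch_size`, so the number of batches is the row count divided by the batch size, rounded up.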

#### Define the models

Standard BERT (multilingual, cased)

In [13]:
tokenizer_1 = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model_1 = BertModel.from_pretrained("bert-base-multilingual-cased")



TwHIN-BERT

In [14]:
tokenizer_2 = AutoTokenizer.from_pretrained('Twitter/twhin-bert-base')
model_2 = AutoModel.from_pretrained('Twitter/twhin-bert-base')

Some weights of BertModel were not initialized from the model checkpoint at Twitter/twhin-bert-base and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sber BERT large model multitask (cased) for Sentence Embeddings in Russian language.

In [15]:
tokenizer_3 = AutoTokenizer.from_pretrained("ai-forever/sbert_large_mt_nlu_ru")
model_3 = AutoModel.from_pretrained("ai-forever/sbert_large_mt_nlu_ru")



#### Compute features from the train and test datasets

In [16]:
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask


def get_embeddings(Data, tokenizer, model):
    target_array = []
    num = 1
    for X, Y in Data:
        encoded_input = tokenizer(X, return_tensors='pt', padding=True, truncation=True, max_length=30)
        with torch.no_grad():
            output = model(**encoded_input)

        sentence_embeddings = mean_pooling(output, encoded_input['attention_mask'])

        for i, sentence in enumerate(sentence_embeddings):
            target_array.append((sentence, Y[i]))
        print(f"Done {num} out of {len(Data)}", end='\r')
        num += 1
    print('\n')
    
    return target_array
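
To make the role of the attention mask in `mean_pooling` concrete, here is a minimal pure-Python sketch of the same idea for a single sentence. It is an illustration rather than the notebook's implementation: `masked_mean` is a hypothetical helper and the token vectors are invented toy values.

```python
def masked_mean(token_vectors, mask):
    """Average token vectors where mask == 1, ignoring padded positions."""
    dim = len(token_vectors[0])
    sums = [0.0] * dim
    for vec, keep in zip(token_vectors, mask):
        if keep:
            for j in range(dim):
                sums[j] += vec[j]
    # Same guard as torch.clamp(..., min=1e-9): avoids division by zero.
    count = max(sum(mask), 1e-9)
    return [s / count for s in sums]


# One toy "sentence": two real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector `[100.0, 100.0]` does not affect the result, which is why sentences of different lengths in one batch still get comparable embeddings.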

In [17]:
Bert_base_train = get_embeddings(Train_data, tokenizer_1, model_1)
Bert_base_test = get_embeddings(Test_data, tokenizer_1, model_1)

Done 1323 out of 1323

Done 89 out of 89



In [18]:
Bert_twit_train = get_embeddings(Train_data, tokenizer_2, model_2)
Bert_twit_test = get_embeddings(Test_data, tokenizer_2, model_2)

Done 1323 out of 1323

Done 89 out of 89



In [19]:
Sber_train = get_embeddings(Train_data, tokenizer_3, model_3)
Sber_test = get_embeddings(Test_data, tokenizer_3, model_3)

Done 1323 out of 1323

Done 89 out of 89



#### Classify the test dataset with KNN and compute accuracy

In [20]:

def test_accuracy(name, Data_train, Data_test, neigh):
    X_train = [row[0].detach().numpy() for row in Data_train]
    Y_train = [row[1] for row in Data_train]
    neigh.fit(X_train, Y_train)
    X_test = [row[0].detach().numpy() for row in Data_test]
    Y_test = [row[1] for row in Data_test]
    result = neigh.predict(X_test)
    accuracy = result == Y_test
    print(f"For model {name}, accuracy is {sum(accuracy)/len(accuracy)*100} %")
    return accuracy
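
`test_accuracy` reports one overall number. If a per-intent breakdown were of interest, the true and predicted labels (`Y_test` and `result` inside the function) could be grouped by class, roughly as in the sketch below. `per_class_accuracy` is a hypothetical helper, and the predictions here are invented for illustration.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of samples with that label predicted correctly}."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


# Label names come from the dataset preview; the predictions are made up.
y_true = ["conform", "conform", "statement_general", "statement_general"]
y_pred = ["conform", "statement_general", "statement_general", "statement_general"]

scores = per_class_accuracy(y_true, y_pred)
print(scores)
```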

Standard BERT (multilingual, cased)

In [21]:
neigh_1 = KNeighborsClassifier(n_neighbors=10, weights='distance', n_jobs=-1)
acc_1 = test_accuracy('Base BERT', Bert_base_train, Bert_base_test, neigh_1)




For model Base BERT, accuracy is 88.56172140430351 %
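
The classifiers in this section use `weights='distance'`, meaning closer neighbours count for more in the vote. The sketch below illustrates that idea in plain Python with inverse-distance weights. It is not scikit-learn's implementation: `weighted_knn_predict` is a hypothetical helper, the data is a one-dimensional toy set, and the handling of exact matches is simplified.

```python
import math
from collections import defaultdict


def weighted_knn_predict(train_points, train_labels, query, k):
    """Predict a label by inverse-distance-weighted vote among the k nearest points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]

    # Simplification: an exact match (distance 0) decides the label outright.
    for dist, label in nearest:
        if dist == 0:
            return label

    votes = defaultdict(float)
    for dist, label in nearest:
        votes[label] += 1.0 / dist
    return max(votes, key=votes.get)


train_points = [(0.0,), (3.0,), (4.0,)]
train_labels = ["a", "b", "b"]

# A plain majority vote among the 3 neighbours would give "b" (2 vs 1),
# but the single "a" point is much closer: 1/1 > 1/2 + 1/3.
prediction = weighted_knn_predict(train_points, train_labels, (1.0,), k=3)
print(prediction)  # a
```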


TwHIN-BERT

In [22]:
neigh_2 = KNeighborsClassifier(n_neighbors=10, weights='distance', n_jobs=-1)
acc_2 = test_accuracy('Twitter BERT', Bert_twit_train, Bert_twit_test, neigh_2)

For model Twitter BERT, accuracy is 86.8629671574179 %


Sber BERT large model multitask (cased) for Sentence Embeddings in Russian language.

In [23]:
neigh_3 = KNeighborsClassifier(n_neighbors=10, weights='distance', n_jobs=-1)
acc_3 = test_accuracy('Sber BERT', Sber_train, Sber_test, neigh_3)

For model Sber BERT, accuracy is 87.76896942242357 %
