# Project for «Викишоп» with BERT

The goal of this project is to classify comments as toxic or non-toxic. We have a labeled dataset that contains each comment (raw text) and a target indicating whether it is toxic.

## Imports and library installation

In [1]:
!pip install transformers


In [2]:
!pip install -q pandarallel catboost nltk spacy


In [3]:
import re
from tqdm import notebook

import numpy as np
import pandas as pd
from pandarallel import pandarallel

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import torch
import transformers
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

In [4]:
nltk.download('stopwords')
pandarallel.initialize(progress_bar=True)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # fall back to CPU when no GPU is present
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...


INFO: Pandarallel will run on 1 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


True

## Data preprocessing

In [5]:
df = pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv')
df = df[:(len(df) // 100) * 100]

In [6]:
df['toxic'].value_counts()

0    143020
1     16180
Name: toxic, dtype: int64

To keep every embedding batch full and avoid leftover targets, I truncate the data up front to a multiple of 100 rows, which matches the batch size used for the embeddings.
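The truncation `df[:(len(df) // 100) * 100]` can be illustrated with a small sketch. The function name and the row count below are hypothetical and chosen only for illustration; the batch size of 100 is the one used later in the notebook.

```python
def truncate_to_full_batches(n_rows: int, batch_size: int) -> int:
    """Return the largest row count <= n_rows that is a multiple of batch_size."""
    return (n_rows // batch_size) * batch_size


# Hypothetical row count, chosen only for illustration.
n_rows = 1234
batch_size = 100

kept = truncate_to_full_batches(n_rows, batch_size)
n_batches = kept // batch_size
print(kept, n_batches)  # 1200 12
```

With this rule every batch contains exactly `batch_size` rows, so the number of embeddings equals the number of targets.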

In [7]:
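token_pattern = re.compile(r'\w+')
# Lowercased word tokens, used as input for lemmatization
df['preprocessed_text'] = df['text'].parallel_apply(lambda x: token_pattern.findall(x.lower()))
# Original-case text with punctuation removed (intended for BERT)
df['text'] = df['text'].parallel_apply(lambda x: ' '.join(token_pattern.findall(x)))
lemmatizer = WordNetLemmatizer()
# Lemmatize every token and join the tokens back into a single string
df['preprocessed_text'] = df['preprocessed_text'].parallel_apply(
    lambda x: ' '.join(lemmatizer.lemmatize(s) for s in x))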


In [8]:
# Length of the lemmatized text in characters
df['len'] = df['preprocessed_text'].parallel_apply(len)
df['len'].describe()


count    159200.000000
mean        371.725515
std         558.875260
min           0.000000
25%          89.000000
50%         193.000000
75%         412.000000
max        5000.000000
Name: len, dtype: float64

In [9]:
df

| | Unnamed: 0 | text | toxic | preprocessed_text | len |
|---|---|---|---|---|---|
| 0 | 0 | Explanation Why the edits made under my userna... | 0 | explanation why the edits made under my userna... | 259 |
| 1 | 1 | D aww He matches this background colour I m se... | 0 | d aww he match this background colour i m seem... | 100 |
| 2 | 2 | Hey man I m really not trying to edit war It s... | 0 | hey man i m really not trying to edit war it s... | 229 |
| 3 | 3 | More I can t make any real suggestions on impr... | 0 | more i can t make any real suggestion on impro... | 591 |
| 4 | 4 | You sir are my hero Any chance you remember wh... | 0 | you sir are my hero any chance you remember wh... | 63 |
| ... | ... | ... | ... | ... | ... |
| 159195 | 159354 | LaserActive Hi SchuminWeb I am confused by you... | 0 | laseractive hi schuminweb i am confused by you... | 210 |
| 159196 | 159355 | Lists of Islamic Jihads | 0 | list of islamic jihad | 21 |
| 159197 | 159356 | URL Update update to my previous and only post... | 0 | url update update to my previous and only post... | 197 |
| 159198 | 159357 | A person living in Canada has just as much rig... | 0 | a person living in canada ha just a much right... | 437 |


## BERT and embeddings

In [10]:
tokenizer = transformers.BertTokenizer(
    vocab_file='../input/bertmodels/bert-base-uncased-vocab.txt')

# Encode every comment, truncating to at most 512 tokens
tokenized = df['preprocessed_text'].parallel_apply(
    lambda x: tokenizer.encode(x, add_special_tokens=True, max_length=512, truncation=True))

# Right-pad every sequence with 0 up to the length of the longest one
max_len = max(len(i) for i in tokenized.values)
padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized.values])

# 1 for real tokens, 0 for padding
attention_mask = np.where(padded != 0, 1, 0)
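The padding and masking step above can be sketched on toy data in plain Python. The helper name `pad_and_mask` and the token ids are made up for illustration; only the padding value 0 comes from the notebook. The sketch derives the mask from the sequence lengths, which gives the same result as `padded != 0` as long as no real token has id 0.

```python
def pad_and_mask(sequences, pad_id=0):
    """Right-pad token-id lists to a common length and build 0/1 attention masks."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


# Made-up token ids for three short "comments" of different lengths.
toy_sequences = [[101, 7592, 102], [101, 102], [101, 2054, 2003, 102]]

toy_padded, toy_mask = pad_and_mask(toy_sequences)
print(toy_padded)  # [[101, 7592, 102, 0], [101, 102, 0, 0], [101, 2054, 2003, 102]]
print(toy_mask)    # [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
```

The mask tells the model which positions hold real tokens, so the padding does not influence the embeddings.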


In [11]:
config = transformers.BertConfig.from_json_file(
    '../input/bertmodels/bert-base-uncased-config.json')
model = transformers.BertModel.from_pretrained(
    'unitary/toxic-bert', config=config, ignore_mismatched_sizes=True).to(device)


Some weights of the model checkpoint at unitary/toxic-bert were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [12]:
batch_size = 100
embeddings = []
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
    # Slice out the next batch of token ids and its attention mask
    batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)]).to(device)
    attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)]).to(device)

    # Inference only, so gradients are not needed
    with torch.no_grad():
        batch_embeddings = model(batch, attention_mask=attention_mask_batch)

    # Keep the hidden state of the first ([CLS]) token as the comment embedding
    embeddings.append(batch_embeddings[0][:, 0, :].cpu().numpy())
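The indexing `batch_embeddings[0][:, 0, :]` in the loop above keeps, for every comment, only the hidden vector at position 0. That position holds the `[CLS]` token because the tokenizer was called with `add_special_tokens=True`. Below is a plain-Python stand-in for that slice on nested lists; the helper name and the numbers are made up for illustration.

```python
def first_token_vectors(last_hidden_state):
    """Keep only position 0 of every sequence: (batch, seq_len, hidden) -> (batch, hidden)."""
    return [sequence[0] for sequence in last_hidden_state]


# Toy "hidden states": 2 sequences, 3 positions each, hidden size 2 (made-up numbers).
toy_hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
]

toy_features = first_token_vectors(toy_hidden)
print(toy_features)  # [[0.1, 0.2], [0.7, 0.8]]
```

Each comment is thus reduced to one fixed-size vector, which is what the classifiers below take as input.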


In [13]:
# These splits are not seeded, so the partition changes between runs
features = np.concatenate(embeddings)
features_train, features_valid, y_train, y_valid = train_test_split(
    features, df['toxic'][:len(features)], test_size=0.4)
# Split the held-out 40% in half: 20% validation, 20% test
features_valid, features_test, y_valid, y_test = train_test_split(
    features_valid, y_valid, test_size=0.5)

In [14]:
%%time
log_reg = LogisticRegression(random_state=42, max_iter=1000, class_weight='balanced', n_jobs=-1)
log_reg.fit(features_train, y_train)

CPU times: user 833 ms, sys: 1.09 s, total: 1.93 s
Wall time: 24min 48s


LogisticRegression(class_weight='balanced', max_iter=1000, n_jobs=-1,
                   random_state=42)

In [15]:
f1_score(y_valid, log_reg.predict(features_valid))

0.8789685290447803

In [16]:
%%time
cat = CatBoostClassifier(verbose=False, iterations=100)
cat.fit(features_train, y_train)

CPU times: user 1min 41s, sys: 796 ms, total: 1min 42s
Wall time: 55.5 s


<catboost.core.CatBoostClassifier at 0x7f0a6eb43950>

In [17]:
f1_score(y_valid, cat.predict(features_valid))

0.9047324466439839

## TF-IDF

In [18]:
data_train, data_valid, target_train, target_valid = train_test_split(
    df['preprocessed_text'], df['toxic'],
    test_size=0.2, stratify=df['toxic'], random_state=42)
# Split the held-out 20% in half: 10% validation, 10% test
data_valid, data_test, target_valid, target_test = train_test_split(
    data_valid, target_valid, test_size=0.5, random_state=42)

We split the data first and only then vectorize it with a TfidfVectorizer fitted on the training data, so that the vocabulary and IDF weights are not influenced by the validation and test sets.
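The idea of fitting on the training split only can be shown with a minimal, simplified TF-IDF in plain Python. This is a sketch under stated assumptions, not a reimplementation of `TfidfVectorizer`: the function names and toy documents are hypothetical, the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` is chosen for illustration, and no normalization is applied.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def transform(doc, idf):
    """Weight term counts by the learned IDF; terms unseen during fitting are dropped."""
    counts = Counter(doc.split())
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


# Toy documents, made up for illustration.
train_docs = ["you are great", "you are rude"]
valid_docs = ["you are awful"]

idf = fit_idf(train_docs)
valid_vectors = [transform(doc, idf) for doc in valid_docs]
print(sorted(valid_vectors[0]))  # ['are', 'you']
```

The word "awful" never occurs in the training documents, so it has no learned weight and is ignored at transform time. Nothing from the validation set leaks into the fitted statistics.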

In [19]:
%%time
stop_words = stopwords.words('english')
vectorizer = TfidfVectorizer(stop_words=stop_words)
data_train = vectorizer.fit_transform(data_train)
data_valid = vectorizer.transform(data_valid)
data_test = vectorizer.transform(data_test)

CPU times: user 7.34 s, sys: 60 ms, total: 7.4 s
Wall time: 7.43 s


To avoid cluttering the model with frequent but uninformative words, we pass the English stop-word list from the NLTK corpus to the vectorizer.
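The effect of a stop-word list can be sketched as a simple token filter. The list below is a tiny hypothetical one and the helper name is made up; the notebook itself passes the full English list from NLTK to `TfidfVectorizer`.

```python
# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"the", "is", "a", "of", "and"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word set, keeping the original order."""
    return [token for token in text.split() if token not in stop_words]


filtered = remove_stop_words("the author of the edit is a vandal")
print(filtered)  # ['author', 'edit', 'vandal']
```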

In [20]:
%%time
tf_idf_log_reg = LogisticRegression(random_state=42, max_iter=1000,
                                    class_weight='balanced', n_jobs=-1)
tf_idf_log_reg.fit(data_train, target_train)

CPU times: user 37.5 ms, sys: 101 ms, total: 138 ms
Wall time: 8.3 s


LogisticRegression(class_weight='balanced', max_iter=1000, n_jobs=-1,
                   random_state=42)

In [21]:
f1_score(target_valid, tf_idf_log_reg.predict(data_valid))

0.7565824825362707

In [22]:
%%time
cat = CatBoostClassifier(verbose=False, iterations=20)
cat.fit(data_train, target_train)

CPU times: user 2min 35s, sys: 1.91 s, total: 2min 37s
Wall time: 1min 26s


<catboost.core.CatBoostClassifier at 0x7f0a618563d0>

In [23]:
f1_score(target_valid, cat.predict(data_valid))

0.6742683390345876

## Testing

CatBoost trained on the features produced by BERT achieved the highest F1 on the validation set (0.905), so I keep it for prediction on the test set.
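The selection step can be written down explicitly. The sketch below copies the four validation F1 scores from the cell outputs above (rounded to four decimals) and picks the largest. The dictionary keys are labels introduced here for illustration only.

```python
# Validation F1 scores copied from the cell outputs above, rounded to four decimals.
validation_f1 = {
    "BERT + LogisticRegression": 0.8790,
    "BERT + CatBoost": 0.9047,
    "TF-IDF + LogisticRegression": 0.7566,
    "TF-IDF + CatBoost": 0.6743,
}

# Pick the model with the highest validation score.
best_model = max(validation_f1, key=validation_f1.get)
print(best_model)  # BERT + CatBoost
```

Note that the BERT and TF-IDF models were validated on different splits (20% and 10% of the data, respectively), so the comparison between the two feature sets is approximate.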

In [25]:
%%time
cat = CatBoostClassifier(verbose=False, iterations=100)
cat.fit(features_train, y_train)

CPU times: user 1min 42s, sys: 521 ms, total: 1min 42s
Wall time: 56.8 s


<catboost.core.CatBoostClassifier at 0x7f0a9a384590>

In [26]:
f1_score(y_test, cat.predict(features_test))

0.897133220910624

## Conclusion

In this work, TF-IDF features and embeddings from a BERT model pretrained for text toxicity classification were combined with gradient boosting and logistic regression.
The best result came from the model that combines BERT features with CatBoost: its F1 on the test set is 0.897.