# Wikishop project with BERT

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.
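
For reference, *F1* is the harmonic mean of precision and recall for the toxic class. The standard definition, written in terms of true positives (TP), false positives (FP) and false negatives (FN), is:

$$
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}
$$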

**Project instructions**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you may try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

In [1]:
!pip install datasets -q

In [2]:
!pip install imblearn -q

In [3]:
!pip install huggingface_hub==0.21.2 -q

In [4]:
import os
import re

import numpy as np
import pandas as pd
import torch

from transformers import (AutoModelForSequenceClassification, AutoTokenizer, BertModel, BertTokenizer, 
                          DataCollatorWithPadding, Trainer, TrainingArguments)

from datasets import Dataset

from imblearn.under_sampling import RandomUnderSampler

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords as nltk_stopwords
from nltk.corpus import wordnet

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split

import spacy

from tqdm import notebook, tqdm

import warnings
warnings.filterwarnings('ignore')



In [None]:
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

In [5]:
os.environ['WANDB_DISABLED'] = 'true'

In [6]:
tqdm.pandas()

In [7]:
RANDOM_STATE = 42

In [8]:
if torch.cuda.is_available():
    device = torch.device('cuda:0')
else:
    device = torch.device('cpu')

In [9]:
def clear_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
    text = ' '.join(text.split())
    return text
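
A quick check of what `clear_text` keeps and drops may be useful here. The sketch below repeats the function and applies it to a made-up string (the sample is an assumption, not a row from the dataset). Apostrophes and non-ASCII letters such as `ü` become spaces, which splits some words.

```python
import re


def clear_text(text):
    # Same steps as the notebook's function: lowercase, replace every
    # character outside [a-zA-Z0-9] with a space, collapse whitespace
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
    return ' '.join(text.split())


sample = "D'aww! Jawohl, mein Führer."   # hypothetical input for illustration
cleaned = clear_text(sample)
print(cleaned)
```

If such splits matter for the downstream model, a Unicode-aware pattern could be considered instead; the notebook keeps the ASCII-only version.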

In [10]:
def lemmatize(text):
    return ' '.join([token.lemma_ for token in nlp(text)])

In [11]:
def preprocess_tokenizer(df):
    return tokenizer(df['text'], truncation=True)

## Preparation

### Load the CSV file into a DataFrame

In [12]:
try:
    df_source = pd.read_csv('/datasets/toxic_comments.csv')
except FileNotFoundError:
    # Fall back to the remote copy when the local file is missing
    df_source = pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv')

### Explore general information about the data

In [13]:
df = df_source.copy()
df

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
...,...,...,...
159287,159446,""":::::And for the second time of asking, when ...",0
159288,159447,You should be ashamed of yourself \n\nThat is ...,0
159289,159448,"Spitzer \n\nUmm, theres no actual article for ...",0
159290,159449,And it looks like it was actually you who put ...,0


In [14]:
# The values in 'Unnamed: 0' do not match the index, so drop the column as unneeded.

df = df.drop(['Unnamed: 0'], axis=1)

In [15]:
# Check for duplicates

print(df.duplicated().sum())
df = df.drop_duplicates()

0


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 159292 entries, 0 to 159291

Data columns (total 2 columns):

 #   Column  Non-Null Count   Dtype 

---  ------  --------------   ----- 

 0   text    159292 non-null  object

 1   toxic   159292 non-null  int64 

dtypes: int64(1), object(1)

memory usage: 2.4+ MB


In [17]:
round(df['toxic'].value_counts() / df.shape[0], 3)

toxic
0    0.898
1    0.102
Name: count, dtype: float64

**Interim conclusions:**
- The dataset contains about 159 thousand comments.
- There are no missing values or duplicates.
- The classes are clearly imbalanced: about 10% toxic comments versus 90% non-toxic.
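
With this class ratio, accuracy says little on its own, which is one reason the task is scored with *F1*. The sketch below uses a made-up toy set with the same 89.8% / 10.2% split (the lists and the `f1_from_counts` helper are illustrative assumptions) and shows that a model that always answers "non-toxic" gets high accuracy but zero F1.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for the positive (toxic) class from confusion counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy set with the same class ratio as the data: 898 non-toxic, 102 toxic
y_true = [0] * 898 + [1] * 102
y_pred = [0] * len(y_true)          # baseline: always predict "non-toxic"

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = f1_from_counts(tp, fp, fn)
print(accuracy, f1)
```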


To solve the task, we take two approaches:

1. Take a 10% random sample of the dataset, clean and lemmatize the text, remove stop words, and use toxic-bert embeddings as features for a logistic regression.
2. Fine-tune DistilBERT, a smaller version of BERT, via `AutoModelForSequenceClassification` on the raw text without preprocessing.

In [18]:
df_bert1 = df.copy()
df_bert2 = df.copy()

### Preparing the dataset for toxic-bert

In [19]:
df_bert1['text'] = df_bert1['text'].str.lower()

In [20]:
# Fix the seed so that the 10% sample is reproducible
df_sample = df_bert1.sample(int(df_bert1.shape[0] * 0.1), random_state=RANDOM_STATE)

In [21]:
df_sample['toxic'].value_counts()

toxic
0    14354
1     1575
Name: count, dtype: int64

In [22]:
df_sample.sample(5)

Unnamed: 0,text,toxic
84416,freedom that must be held dear in this country,0
87103,"""\n\nfurthermore i can find where it says this...",0
88589,the attempt to delete this article was created...,0
43069,edwin van der sar/andres palop \n\nre: your ed...,0
153080,semi-protection requests\nif you want to reque...,0


In [24]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner']) 

In [25]:
for w in nltk_stopwords.words('english'):
    nlp.vocab[w].is_stop = True

In [26]:
%%time

# Add columns with the cleaned text, the lemmatized text,
# and the lemmatized text without stop words

df_sample['clear_text'] = df_sample['text'].progress_apply(clear_text)
df_sample['lemma_text'] = df_sample['clear_text'].progress_apply(lemmatize)
df_sample['lemma_text_no_sw'] = df_sample['lemma_text'].progress_apply(
    lambda text: ' '.join(token.lemma_ for token in nlp(text) if not token.is_stop))

100%|██████████| 15929/15929 [00:00<00:00, 19224.09it/s]

100%|██████████| 15929/15929 [02:24<00:00, 110.38it/s]

100%|██████████| 15929/15929 [02:14<00:00, 118.86it/s]

CPU times: user 4min 38s, sys: 1.68 s, total: 4min 40s

Wall time: 4min 39s





In [27]:
df_sample

Unnamed: 0,text,toxic,clear_text,lemma_text,lemma_text_no_sw
25595,"picture. \n\ni uploaded that picture, however ...",0,picture i uploaded that picture however it doe...,picture I upload that picture however it do no...,picture upload picture copywright know stuff f...
153110,your bondpedia has the potential to cover ever...,0,your bondpedia has the potential to cover ever...,your bondpedia have the potential to cover eve...,bondpedia potential cover tiny bit bond like c...
61765,"""\n\n""""yworo"""" keeps incorrectly referring to ...",0,yworo keeps incorrectly referring to expresso ...,yworo keep incorrectly refer to expresso as a ...,yworo incorrectly refer expresso misspelling f...
126760,stop to be a crybaby. you messed the article w...,1,stop to be a crybaby you messed the article wi...,stop to be a crybaby you mess the article with...,stop crybaby mess article wrong information fi...
54178,"""\n\njawohl, mein führer. yeah like your an ex...",0,jawohl mein f hrer yeah like your an expert yo...,jawohl mein f hrer yeah like your an expert yo...,jawohl mein f hrer yeah like expert opinionate...
...,...,...,...,...,...
158699,blocked. \n blocked. \n\ni was told by the con...,0,blocked blocked i was told by the contact wiki...,block block I be tell by the contact wikipedia...,block block tell contact wikipedia page e mail...
58430,"""\n\nwarangal fort\n\ni deleted warangal fort ...",0,warangal fort i deleted warangal fort as it co...,warangal fort I delete warangal fort as it con...,warangal fort delete warangal fort contain pic...
100505,lack of variety in news sources \n\nalmost eve...,0,lack of variety in news sources almost every r...,lack of variety in news source almost every re...,lack variety news source reference reuters yne...
38187,i think this should be merged with the article...,0,i think this should be merged with the article...,I think this should be merge with the article ...,think merge article port salut cheese describe...


In [28]:
# Check for duplicates in the final 'lemma_text_no_sw' column and drop them if present

print(df_sample[df_sample['lemma_text_no_sw'].duplicated()].shape)
df_sample = df_sample.drop_duplicates(subset='lemma_text_no_sw')

(55, 5)


### Preparing the dataset for DistilBERT

In [29]:
df_bert2.rename(columns = {'toxic': 'label'}, inplace = True)

## Training

### Training with toxic-bert

In [30]:
# Load the pretrained tokenizer

tokenizer = BertTokenizer.from_pretrained('unitary/toxic-bert', do_lower_case=True)


In [31]:
# Load the pretrained model

model = BertModel.from_pretrained('unitary/toxic-bert')
model = model.to(device)


In [32]:
%%time

tokenized = df_sample['lemma_text_no_sw'].progress_apply(
    lambda x: tokenizer.encode(x, add_special_tokens=True, max_length=512, truncation=True))

# Length of the longest tokenized comment (at most 512 after truncation)
max_len = max(len(ids) for ids in tokenized.values)

100%|██████████| 15874/15874 [00:27<00:00, 582.74it/s]

CPU times: user 27.2 s, sys: 109 ms, total: 27.3 s

Wall time: 27.2 s





In [33]:
padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])

In [34]:
attention_mask = np.where(padded != 0, 1, 0)
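
To make the padding and masking steps concrete, here is a minimal sketch on three made-up token ID sequences (the IDs are assumptions for illustration). It relies on `0` being the `[PAD]` id in BERT's vocabulary, which is also why comparing with `0` is enough to build the mask.

```python
import numpy as np

# Hypothetical token ID sequences of different lengths (101/102 are BERT's [CLS]/[SEP])
tokenized = [[101, 7592, 102], [101, 2023, 2003, 1037, 102], [101, 102]]

max_len = max(len(ids) for ids in tokenized)

# Right-pad every sequence with 0, the [PAD] id in BERT's vocabulary
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized])

# 1 marks real tokens, 0 marks padding that the model should ignore
attention_mask = np.where(padded != 0, 1, 0)

print(padded)
print(attention_mask)
```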

In [35]:
batch_size = 100

In [36]:
%%time

embeddings = []

for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
    batch = torch.LongTensor(padded[batch_size * i:batch_size * (i+1)])
    attention_mask_batch = torch.LongTensor(attention_mask[batch_size * i:batch_size * (i+1)])
    with torch.no_grad():
        batch_embeddings = model(batch.to(device), attention_mask=attention_mask_batch.to(device))
    embeddings.append(batch_embeddings[0][:,0,:].cpu().numpy())

  0%|          | 0/158 [00:00<?, ?it/s]

CPU times: user 4min 36s, sys: 228 ms, total: 4min 36s

Wall time: 4min 36s


The batch loop takes about 4.5 minutes (wall time 4 min 36 s).

In [37]:
target = df_sample['toxic'][: padded.shape[0] // batch_size * batch_size]
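
The embedding loop iterates over `padded.shape[0] // batch_size` full batches, so the trailing rows are skipped and `target` is truncated to match. Below is a small sketch of that arithmetic, plus a hypothetical `batch_slices` helper (not part of the notebook) that would also cover the final short batch.

```python
def batch_slices(n_rows, batch_size):
    """Yield (start, stop) pairs covering all rows, including a short last batch."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


n_rows, batch_size = 15874, 100          # sizes from the notebook's run

full_batches = n_rows // batch_size      # what the embedding loop iterates over
dropped = n_rows - full_batches * batch_size

slices = list(batch_slices(n_rows, batch_size))
print(full_batches, dropped, len(slices), slices[-1])
```

Dropping 74 of 15,874 rows barely affects the result here; iterating over such slices would only matter if every row needed an embedding.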

In [38]:
features = np.concatenate(embeddings)

X_train, X_test, y_train, y_test = train_test_split(features, target, train_size=0.8, random_state=RANDOM_STATE)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

predictions = log_reg.predict(X_test)

In [39]:
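# Store the score under a new name so the imported f1_score function is not shadowed;
# scikit-learn expects the true labels first
f1 = f1_score(y_test, predictions)
f1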

0.8311258278145697

F1 = 0.83 is a decent result for BERT embeddings on a 10% random sample.

### Training with DistilBERT

In [40]:
train, test = train_test_split(df_bert2, train_size=0.7, random_state=RANDOM_STATE)

In [41]:
dataset_train = Dataset.from_pandas(train)
dataset_test = Dataset.from_pandas(test)

In [42]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')


In [43]:
train_tokenized = dataset_train.map(preprocess_tokenizer, batched=True)
test_tokenized = dataset_test.map(preprocess_tokenizer, batched=True)

Map:   0%|          | 0/111504 [00:00<?, ? examples/s]

Map:   0%|          | 0/47788 [00:00<?, ? examples/s]

In [44]:
# The collator pads the tokenized examples of each batch to a common length and stacks them into tensors

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
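
`DataCollatorWithPadding` pads each batch to the length of its longest sequence at collation time, rather than padding the whole dataset up front. The sketch below imitates that idea in plain Python; `pad_batch` and the toy sequences are illustrative assumptions, not the library's implementation.

```python
def pad_batch(batch, pad_id=0):
    """Pad sequences to the longest one in this batch (dynamic padding)."""
    longest = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (longest - len(ids)) for ids in batch]


# Hypothetical token ID sequences with lengths 3, 4, 12 and 5
sequences = [[1] * 3, [1] * 4, [1] * 12, [1] * 5]

# Padding everything to the global maximum length
global_len = max(len(s) for s in sequences)
global_cells = global_len * len(sequences)

# Padding each batch of two sequences separately
batches = [sequences[i:i + 2] for i in range(0, len(sequences), 2)]
dynamic_cells = sum(len(row) for b in batches for row in pad_batch(b))

print(global_cells, dynamic_cells)
```

Fewer padded cells mean less wasted computation per step, which is the main motivation for padding per batch.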

In [45]:
# Binary classification, so num_labels = 2

model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels = 2)
model = model.to(device)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']

You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [46]:
training_args = TrainingArguments(
    output_dir = './result',
    per_device_train_batch_size = 16,
    learning_rate = 2e-5,
    num_train_epochs = 1,
    weight_decay = 0.01)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [47]:
trainer = Trainer(
    model = model,
    args = training_args,
    data_collator = data_collator,
    train_dataset = train_tokenized,
    tokenizer = tokenizer)

In [48]:
trainer.train()

Step,Training Loss
500,0.1482
1000,0.1223
1500,0.1148
2000,0.1072
2500,0.106
3000,0.0993
3500,0.112
4000,0.0958
4500,0.1004
5000,0.0928


TrainOutput(global_step=6969, training_loss=0.10316115319258445, metrics={'train_runtime': 2089.086, 'train_samples_per_second': 53.375, 'train_steps_per_second': 3.336, 'total_flos': 1.0104617309085888e+16, 'train_loss': 0.10316115319258445, 'epoch': 1.0})

On a P100 GPU, training takes about 35 minutes.
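
The step count and runtime in the training output can be cross-checked with a little arithmetic. The sketch assumes a single device and no gradient accumulation, which is consistent with the `TrainingArguments` above as far as they are specified.

```python
import math

train_rows = 111504        # size of the training split reported by Dataset.map
batch_size = 16            # per_device_train_batch_size
epochs = 1                 # num_train_epochs
runtime_s = 2089.086       # 'train_runtime' from TrainOutput

# Assumes a single device and no gradient accumulation
steps = math.ceil(train_rows / batch_size) * epochs
minutes = runtime_s / 60

print(steps, round(minutes, 1))
```

The result matches `global_step=6969` in the recorded `TrainOutput`.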

In [49]:
predictions = trainer.predict(test_tokenized)

In [50]:
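# trainer.predict returns logits in .predictions and the true labels in .label_ids.
# The report recorded below came from the original call, which passed label_ids as
# the predictions, so its perfect scores compare the labels with themselves.
y_pred = np.argmax(predictions.predictions, axis=1)
print(classification_report(test_tokenized['label'], y_pred))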

              precision    recall  f1-score   support



           0       1.00      1.00      1.00     42967

           1       1.00      1.00      1.00      4821



    accuracy                           1.00     47788

   macro avg       1.00      1.00      1.00     47788

weighted avg       1.00      1.00      1.00     47788
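
`trainer.predict` returns a `PredictionOutput` whose `predictions` field holds the logits and whose `label_ids` field holds the true labels, so the predicted classes have to be derived from the logits. Comparing `label_ids` with the true labels always gives perfect scores. A minimal sketch with made-up logits (the arrays are assumptions for illustration, not model output):

```python
import numpy as np

# Hypothetical logits for five comments, shaped (n_samples, num_labels)
logits = np.array([[2.1, -1.3],
                   [0.2, 0.9],
                   [1.5, -0.4],
                   [-0.7, 1.8],
                   [0.6, 0.1]])
label_ids = np.array([0, 1, 0, 1, 1])      # true labels

# Predicted class = index of the largest logit in each row
y_pred = logits.argmax(axis=1)

# F1 for the toxic class from confusion counts
tp = int(((y_pred == 1) & (label_ids == 1)).sum())
fp = int(((y_pred == 1) & (label_ids == 0)).sum())
fn = int(((y_pred == 0) & (label_ids == 1)).sum())
f1 = 2 * tp / (2 * tp + fp + fn)

print(y_pred, f1)
```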




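## Conclusions

- The dataset contains about 159 thousand comments.
- There are no missing values or duplicates.
- The classes are clearly imbalanced: about 10% toxic comments versus 90% non-toxic.
- Logistic regression on toxic-bert embeddings reached F1 = 0.83 on a 10% random sample, above the 0.75 target.
- The DistilBERT report shows F1 = 1, but that figure compares the true labels with `predictions.label_ids`, which holds the same labels, so it does not measure the model's quality. DistilBERT's actual F1 has to be computed from the argmax of `predictions.predictions`.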