# Project for «Викишоп» with BERT


An online store is building a tool that finds toxic comments so they can be sent for moderation. \
We have a dataset in which edits are labeled as toxic or not. \
The task is to build a model that separates toxic comments from the rest. The target quality is an *F1* score of at least 0.75.

The *text* column contains the comment text, and *toxic* is the target feature.

The project explores ways of processing the comments with BERT.

## Preliminaries

### Installation

In [2]:
!pip install --user --upgrade catboost
!pip install --user --upgrade ipywidgets
!jupyter nbextension enable --py widgetsnbextension
!pip install lightgbm
!pip install torch
!pip install transformers



Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: ok

### Importing libraries

In [3]:
import numpy as np
import pandas as pd

In [4]:
import catboost
import lightgbm as lgb
import torch
import transformers as ppb
from tqdm import notebook
import nltk

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import  RandomForestClassifier
from sklearn.utils import shuffle
from sklearn.model_selection import RandomizedSearchCV
from catboost import CatBoostClassifier, Pool, metrics, cv
from lightgbm import LGBMClassifier

from sklearn.model_selection import train_test_split

In [5]:
from sklearn.metrics import f1_score

In [6]:
import matplotlib.pyplot as plt

In [7]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer 

In [8]:
print('Cat Boost version', catboost.__version__)  # print the library versions
print('LightGBM version', lgb.__version__)

Cat Boost version 1.1
LightGBM version 3.3.2


In [9]:
from pymystem3 import Mystem
m = Mystem()

In [10]:
import re

In [11]:
import nltk 
from nltk.stem import WordNetLemmatizer 

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt') 

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ivano\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ivano\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ivano\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Dataset

Load the dataset.

In [12]:
df_tweets = pd.read_csv(r'C:\Users\ivano\Downloads\toxic_comments.csv')

In [13]:
df_tweets

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
...,...,...,...
159287,159446,""":::::And for the second time of asking, when ...",0
159288,159447,You should be ashamed of yourself \n\nThat is ...,0
159289,159448,"Spitzer \n\nUmm, theres no actual article for ...",0
159290,159449,And it looks like it was actually you who put ...,0


## Preparation

This project is an attempt at training with a BERT model.

It was also interesting to prepare the dataset with TF-IDF.

So I used both methods and compared them on a logistic regression model, in order to train the later models on whichever features turned out better.

Check the class balance.

In [14]:
df_tweets['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64

The classes are heavily imbalanced (roughly 9 to 1), which hurts model training. We downsample the majority class.

Create the training and test sets.

In [15]:
train, test = train_test_split(df_tweets, test_size =0.2)

The comments collected for training are in English. We need the `english` set of `stopwords` and `language = "english"` as the tokenization parameter.

For lemmatization we use `nltk`.

In [16]:
nltk.download('wordnet') 
lemmatize = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ivano\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [78]:
stopwords = set(nltk_stopwords.words('english'))
count_tf_idf = TfidfVectorizer(min_df = 0.0001, stop_words='english')

Write the preprocessing routine.

In [18]:
train_clear = []
for value in train['text']:
    text = re.sub("[^a-zA-Z]", " ", value)  # replace non-alphabetic characters with spaces
    tokens = nltk.word_tokenize(text, language="english")  # split into word tokens
    lemmas = [lemmatize.lemmatize(word) for word in tokens]  # lemmatize each token
    train_clear.append(" ".join(lemmas))  # join the lemmas back into one string and store it
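
The same steps can also be wrapped in a reusable function. The sketch below is a simplified, library-free variant: `preprocess` and its `lemmatize` parameter are hypothetical names, whitespace splitting stands in for `nltk.word_tokenize`, and the default lemmatizer is the identity (in this notebook one would pass `WordNetLemmatizer().lemmatize` instead).

```python
import re
from typing import Callable

def preprocess(text: str, lemmatize: Callable[[str], str] = lambda w: w) -> str:
    """Clean one comment: keep letters only, split into tokens, lemmatize, re-join."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)   # non-letters become spaces
    tokens = letters_only.split()                    # whitespace split (stand-in for nltk.word_tokenize)
    lemmas = [lemmatize(token) for token in tokens]  # apply the supplied lemmatizer to each token
    return " ".join(lemmas)                          # single space between words

cleaned = preprocess("Hey man, I'm really not trying to edit war!")
print(cleaned)
```

Applied to a column, this would look like `train['text'].apply(preprocess)`.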


In [19]:
train.shape

(127433, 3)

In [20]:
def downsample(features, target, fraction, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones]*repeat)
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones]*repeat)
    
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345)
    
    return features_downsampled, target_downsampled

In [21]:
features_train, target_train = downsample(train['text'], train['toxic'], 0.12, 1)

In [22]:
features_train = pd.DataFrame(data=features_train)

In [23]:
target_train.value_counts()

0    13737
1    12960
Name: toxic, dtype: int64
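
The fraction 0.12 is close to the ratio of toxic to non-toxic rows in the training split. As a rough check, the sketch below derives the fraction that would equalize the classes exactly. `balancing_fraction` is a hypothetical helper, and the non-toxic count is inferred from the split size (127433 rows) minus the 12960 toxic rows.

```python
def balancing_fraction(n_majority: int, n_minority: int) -> float:
    """Fraction of the majority class to keep so both classes have equal size."""
    if n_majority <= 0 or n_minority <= 0:
        raise ValueError("both class counts must be positive")
    return min(1.0, n_minority / n_majority)

n_toxic = 12960                # class 1 rows in the training split
n_clean = 127433 - n_toxic     # class 0 rows: 114473
frac = balancing_fraction(n_clean, n_toxic)
print(round(frac, 4))          # about 0.1132
print(round(n_clean * 0.12))   # 13737, the class 0 count obtained with fraction=0.12
```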

In [24]:
features_test = test['text']
target_test = test['toxic']

### TF-IDF

Vectorize the features. The vectorizer is fitted on `features_train['text']`, the downsampled comment text.

In [79]:
tf_idf_train = count_tf_idf.fit_transform(features_train['text']).toarray()

In [80]:
tf_idf_test = count_tf_idf.transform(features_test).toarray()

Convert the resulting matrices to DataFrames so they can be passed to the models.

In [81]:
train_tf_idf = pd.DataFrame(data=tf_idf_train)
tf_idf_test = pd.DataFrame(data=tf_idf_test)
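
Calling `.toarray()` turns the sparse TF-IDF matrix into a dense one, which is costly at this size. Below is a back-of-the-envelope estimate, assuming 8-byte floats and the shapes that appear in this notebook's outputs (26697 training rows, 31859 test rows, 17178 TF-IDF columns). `dense_gib` is a hypothetical helper.

```python
def dense_gib(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> float:
    """Memory needed by a dense n_rows x n_cols matrix, in GiB."""
    return n_rows * n_cols * bytes_per_value / 1024**3

train_gib = dense_gib(26697, 17178)
test_gib = dense_gib(31859, 17178)
print(f"train: {train_gib:.2f} GiB, test: {test_gib:.2f} GiB")
```

scikit-learn estimators such as `LogisticRegression` accept SciPy sparse matrices directly, so the dense conversion can often be skipped.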

#### Checking F1 on a baseline model with default settings

In [82]:
model = LogisticRegression(random_state=12345)
model.fit(train_tf_idf,target_train)

In [83]:
prediction = model.predict(tf_idf_test)

In [84]:
f1_score(target_test , prediction)

0.6902146779830255
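
For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. Below is a minimal pure-Python sketch of the quantity that `f1_score` reports for binary labels. `f1_binary` is a hypothetical helper with toy data, not the scikit-learn implementation.

```python
from typing import Sequence

def f1_binary(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1); returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)   # share of predicted toxic comments that are toxic
    recall = tp / (tp + fn)      # share of toxic comments that were found
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # toy labels
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]   # toy predictions: 2 hits, 1 miss, 1 false alarm
print(f1_binary(y_true, y_pred))
```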

### Bert

We use Bert (DistilBert, a lighter variant of the model) to extract text embeddings. Only a small part of the dataset is used here, because even DistilBert takes a long time to run.

In [44]:
train_bert, test_bert = train_test_split(df_tweets, test_size =0.2)

In [45]:
train_bert['toxic'].value_counts()

0    114446
1     12987
Name: toxic, dtype: int64

In [46]:
print('Test set size', test_bert.shape)
print('Class balance', "\n", test_bert['toxic'].value_counts())

Test set size (31859, 3)
Class balance 
 0    28660
1     3199
Name: toxic, dtype: int64


In [47]:
features_train_bert, target_train_bert = downsample(train_bert['text'], train_bert['toxic'], 0.12, 1)

In [48]:
target_train_bert.value_counts()

0    13734
1    12987
Name: toxic, dtype: int64

The class balance in the training set has been restored.

In [49]:
features_train_bert = pd.DataFrame(data=features_train_bert)

After several attempts at running Bert I settled on a sample of 3500 objects so that processing takes a reasonable amount of time.

In [50]:
features_train_bert = features_train_bert[:3500]
target_train_bert=  target_train_bert[:3500]

In [51]:
features_train_bert

Unnamed: 0,text
84482,"Hey asshole, she's 18 years old. The briefing ..."
119639,Dear Wiki Editors. I still have another coupl...
92489,What did i vandalise?\n- william@iamfake.com
108811,Ты нацистская сволочь и арабская подстилка. Ид...
55498,"""\n\n A barnstar for you! \n\n The Real Life ..."
...,...
85368,fucking bitch suck my Vick prick cracker
31206,List of assholes \n\n]
108727,Grant\nHi Berean Hunter. More opinions are nee...
140754,"""\n\n Please do not vandalize pages, as you di..."


Assign the model class, the tokenizer class and the pretrained weights name to variables.

In [52]:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')


In [53]:
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [54]:
tokenized = features_train_bert['text'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True, truncation=True)))

In [55]:
tokenized

84482     [101, 4931, 22052, 1010, 2016, 1005, 1055, 232...
119639    [101, 6203, 15536, 3211, 10195, 1012, 1045, 21...
92489     [101, 2054, 2106, 1045, 3158, 9305, 5562, 1029...
108811    [101, 1197, 29113, 1192, 10260, 29751, 10325, ...
55498     [101, 1000, 1037, 25684, 7559, 2005, 2017, 999...
                                ...                        
85368     [101, 8239, 7743, 11891, 2026, 10967, 2243, 24...
31206             [101, 2862, 1997, 22052, 2015, 1033, 102]
108727    [101, 3946, 7632, 2022, 16416, 2078, 4477, 101...
140754    [101, 1000, 3531, 2079, 2025, 3158, 9305, 4697...
86429     [101, 1000, 3308, 2592, 1999, 1996, 3720, 2009...
Name: text, Length: 3500, dtype: object

In [56]:
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

In [57]:
np.array(padded).shape

(3500, 512)

In [58]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(3500, 512)
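
What the padding and masking steps do is easier to see on a toy input. This is a pure-Python sketch with made-up token ids. `pad_and_mask` is a hypothetical helper, and it assumes, as in the cells above, that id 0 is the padding token.

```python
from typing import List, Tuple

def pad_and_mask(sequences: List[List[int]], pad_id: int = 0) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask

toy_ids = [[101, 2862, 102], [101, 2054, 2106, 1045, 102]]  # made-up token ids
padded_toy, mask_toy = pad_and_mask(toy_ids)
print(padded_toy)
print(mask_toy)
```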

In [59]:
batch_size = 100
embeddings = []
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
        batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)]) 
        attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)])
        
        with torch.no_grad():
            batch_embeddings = model(batch, attention_mask=attention_mask_batch)
        
        embeddings.append(batch_embeddings[0][:,0,:].numpy())

  0%|          | 0/35 [00:00<?, ?it/s]
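
The loop runs over `padded.shape[0] // batch_size` batches, so any rows beyond the last full batch would be skipped when the row count is not a multiple of `batch_size` (3500 and 100 divide evenly here). Below is a sketch of batch boundaries that also cover a final partial batch. `batch_slices` is a hypothetical helper.

```python
from typing import Iterator, Tuple

def batch_slices(n_rows: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) index pairs covering all rows, including a short last batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

print(len(list(batch_slices(3500, 100))))  # 35 batches, same as the loop above
print(list(batch_slices(250, 100)))        # [(0, 100), (100, 200), (200, 250)]
```

Inside the loop, `padded[start:stop]` would then take the place of `padded[batch_size*i:batch_size*(i+1)]`.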

In [60]:
features_emb_bert = np.concatenate(embeddings)

In [61]:
df_features_emb_bert = pd.DataFrame(data = features_emb_bert)

In [62]:
target_train_bert = pd.DataFrame(data = target_train_bert)

#### Preparing the test set

In [63]:
features_test_bert = test_bert['text']
target_test_bert= test_bert['toxic']

In [64]:
tokenized_test = features_test_bert.apply((lambda x: tokenizer.encode(x, add_special_tokens=True, truncation=True)))

In [65]:
tokenized_test = tokenized_test[:2000]
target_test_bert = target_test_bert[:2000]

In [66]:
max_len = 0
for i in tokenized_test.values:
    if len(i) > max_len:
        max_len = len(i)

padded_test = np.array([i + [0]*(max_len-len(i)) for i in tokenized_test.values])

In [67]:
np.array(padded_test).shape

(2000, 512)

In [68]:
attention_mask_test = np.where(padded_test != 0, 1, 0)
attention_mask_test.shape

(2000, 512)

In [69]:
batch_size = 100
embeddings_test = []
for i in notebook.tqdm(range(padded_test.shape[0] // batch_size)):
        batch_test = torch.LongTensor(padded_test[batch_size*i:batch_size*(i+1)]) 
        attention_mask_batch_test = torch.LongTensor(attention_mask_test[batch_size*i:batch_size*(i+1)])
        
        with torch.no_grad():
            batch_embeddings_test = model(batch_test, attention_mask=attention_mask_batch_test)
        
        embeddings_test.append(batch_embeddings_test[0][:,0,:].numpy())

  0%|          | 0/20 [00:00<?, ?it/s]

In [70]:
features_emb_test = np.concatenate(embeddings_test)

In [71]:
features_emb_test = pd.DataFrame(data = features_emb_test)

In [72]:
target_test_bert = pd.DataFrame(data = target_test_bert)

In [73]:
tokenized_test = pd.DataFrame( data=tokenized_test)

In [74]:
tokenized_test

Unnamed: 0,text
123130,"[101, 2204, 2005, 2017, 1012, 25591, 12845, 10..."
106713,"[101, 1045, 1005, 2310, 2196, 2657, 1997, 1996..."
142414,"[101, 1011, 2339, 1029, 2339, 2079, 2017, 2031..."
12565,"[101, 1006, 1037, 2978, 2125, 8476, 1007, 1045..."
121402,"[101, 3287, 24918, 1528, 2831, 3931, 1528, 585..."
...,...
73254,"[101, 3504, 2066, 1996, 2060, 2518, 2017, 2123..."
157888,"[101, 2417, 7442, 6593, 2831, 1024, 2198, 2798..."
144281,"[101, 2178, 3437, 2000, 14163, 23102, 1045, 20..."
9508,"[101, 1000, 1045, 1005, 1049, 2145, 2025, 2469..."


#### Checking F1 on a baseline model with default settings

The training sample here is much smaller, and the model scores lower than the TF-IDF baseline (F1 of 0.62 versus 0.69).

In [75]:
model_lr = LogisticRegression(random_state=12345)
model_lr.fit(features_emb_bert,target_train_bert)

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [76]:
prediction = model_lr.predict(features_emb_test)

In [77]:
f1_score(target_test_bert, prediction)

0.6240601503759399

### Conclusion

To train the models we use the TF-IDF features. On this sample, and given the available computing power, this method gave better results.

## Training

### CatBoost

Create the Pool objects.

In [92]:
train_pool = Pool(
    data=train_tf_idf,
    label=target_train
)

test_pool = Pool(
    data=tf_idf_test,
    label=target_test
)

In [93]:
model = CatBoostClassifier(
    task_type='CPU',
    iterations=5000,
    eval_metric='F1',
    od_type='Iter',
    learning_rate = 0.1,
    early_stopping_rounds =100
)

In [94]:
model.fit(
    train_pool,
    eval_set=test_pool,
    verbose=50,
    plot=True,
    use_best_model=True
)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

0:	learn: 0.5003742	test: 0.4785860	best: 0.4785860 (0)	total: 744ms	remaining: 1h 1m 57s
50:	learn: 0.7941882	test: 0.7020210	best: 0.7031654 (49)	total: 18.8s	remaining: 30m 27s
100:	learn: 0.8339278	test: 0.7160345	best: 0.7160345 (100)	total: 36.4s	remaining: 29m 25s
150:	learn: 0.8583848	test: 0.7254844	best: 0.7282939 (119)	total: 53.9s	remaining: 28m 50s
200:	learn: 0.8773696	test: 0.7339291	best: 0.7354988 (190)	total: 1m 11s	remaining: 28m 36s
250:	learn: 0.8872964	test: 0.7411211	best: 0.7411849 (249)	total: 1m 29s	remaining: 28m 12s
300:	learn: 0.8981378	test: 0.7410651	best: 0.7449126 (253)	total: 1m 47s	remaining: 27m 52s
350:	learn: 0.9049558	test: 0.7419899	best: 0.7449126 (253)	total: 2m 5s	remaining: 27m 37s
Stopped by overfitting detector  (100 iterations wait)

bestTest = 0.7449125824
bestIteration = 253

Shrink model to first 254 iterations.


<catboost.core.CatBoostClassifier at 0x1d2904e7e50>

### LightGBM

In [95]:
params = {"max_depth": 50,
          "learning_rate" : 0.1,
          "num_leaves": 500,
          "n_estimators": 1000,
          'objective': 'binary',
          "metric" : 'F1'
         }

lgb_train = lgb.Dataset(train_tf_idf, target_train)
lgb_test = lgb.Dataset(tf_idf_test, target_test)

In [96]:
lgbm_model_1 = lgb.LGBMClassifier()

In [97]:
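def lgb_f1_score(predict_y, df):
    true_y = df.get_label()  # labels of the dataset being evaluated
    predict_y = np.round(predict_y)  # probabilities -> 0/1 at a 0.5 threshold
    return 'f1', f1_score(true_y, predict_y), True  # (name, value, is_higher_better)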

In [98]:
%%time
lgbm_model_1 = lgb.train(params,
                         train_set=lgb_train,
                         valid_sets=lgb_test,
                         num_boost_round=500,
                         feval=lgb_f1_score
)



[LightGBM] [Info] Number of positive: 12960, number of negative: 13737
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 142767
[LightGBM] [Info] Number of data points in the train set: 26697, number of used features: 4204
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.485448 -> initscore=-0.058225
[LightGBM] [Info] Start training from score -0.058225
[1]	valid_0's f1: 0.64584
[2]	valid_0's f1: 0.654856
[3]	valid_0's f1: 0.655085
[4]	valid_0's f1: 0.639764
[5]	valid_0's f1: 0.657092
[6]	valid_0's f1: 0.666171
[7]	valid_0's f1: 0.667847
[8]	valid_0's f1: 0.669725
[9]	valid_0's f1: 0.662111
[10]	valid_0's f1: 0.663554
[11]	valid_0's f1: 0.668705
[12]	valid_0's f1: 0.670142
[13]	valid_0's f1: 0.672867
[14]	valid_0's f1: 0.674011
[15]	valid_0's f1: 0.675185
[16]	valid_0's f1: 0.677373
[17]	valid_0's f1: 0.681798
[18]	valid_0's f1: 0.684955
[19]	valid_0's f1: 0.687699
[20]	valid_0's f1: 0

0.71

In [99]:
lgbm_predict = lgbm_model_1.predict(tf_idf_test)

In [100]:
tf_idf_test

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,17168,17169,17170,17171,17172,17173,17174,17175,17176,17177
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31854,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
31855,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
31856,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
31857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [101]:
lgbm_predict_1 = np.where(lgbm_predict > 0.5, 1, 0)

In [102]:
f1_score(target_test, lgbm_predict_1)

0.6575441123514584

In [109]:
lgbm_predict_2 = np.where(lgbm_predict > 0.97, 1, 0)

In [110]:
f1_score(target_test, lgbm_predict_2)

0.7504898758981058
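
The 0.97 cutoff above was found by hand. One way to pick a threshold more systematically is to sweep candidate values and keep the one with the best F1, ideally on a validation split kept separate from the final test set. Below is a pure-Python sketch with toy scores. `f1_at`, `best_threshold` and the data are hypothetical.

```python
from typing import List, Sequence, Tuple

def f1_at(y_true: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """F1 of the positive class when predicting 1 for score > threshold."""
    pred = [1 if s > threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, pred))
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def best_threshold(y_true: Sequence[int], scores: Sequence[float],
                   candidates: List[float]) -> Tuple[float, float]:
    """Return (threshold, f1) with the highest F1 among the candidates."""
    return max(((t, f1_at(y_true, scores, t)) for t in candidates),
               key=lambda pair: pair[1])

y_val = [0, 0, 0, 0, 1, 1, 0, 1]                          # toy validation labels
val_scores = [0.1, 0.6, 0.7, 0.4, 0.9, 0.95, 0.8, 0.85]   # toy predicted probabilities
threshold, f1 = best_threshold(y_val, val_scores, [0.5, 0.75, 0.9])
print(threshold, round(f1, 3))
```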

### RandomForest

#### Model selection

In [31]:
def F1 (target, predict):
    F1 =f1_score(target, predict)
    return F1

In [32]:
from sklearn import model_selection

In [33]:
parameter_space = {'bootstrap': [True, False],
'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
'max_features': ['auto', 'sqrt'],
'min_samples_leaf': [1, 2, 4],
'min_samples_split': [2, 5, 10],
'n_estimators': [200, 400, 600, 800, 1000, 2000]
}

In [34]:
ensemble = RandomForestClassifier(random_state=12345)

In [35]:
model_rf = model_selection.RandomizedSearchCV(
    estimator =ensemble,
    param_distributions= parameter_space,
    n_iter=10,
    scoring = 'f1',
    verbose =10,
    n_jobs = -1,
    cv = 4
)
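
For scale: the parameter space above defines thousands of combinations, of which `RandomizedSearchCV` samples only `n_iter`. Below is a small sketch of the arithmetic, using the values from the cells above. The variable names are illustrative.

```python
from math import prod

# Number of options per hyperparameter, copied from parameter_space above
options = {
    "bootstrap": 2,
    "max_depth": 10,
    "max_features": 2,
    "min_samples_leaf": 3,
    "min_samples_split": 3,
    "n_estimators": 6,
}
n_iter, cv = 10, 4

total_combinations = prod(options.values())  # size of the full grid
total_fits = n_iter * cv                     # fits run by the search (the final refit is extra)
print(total_combinations, total_fits)        # 2160 40
```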

In [None]:
%%time
model_rf.fit(train_tf_idf,target_train)

In [None]:
print (model_rf.best_score_)
print(model_rf.best_estimator_.get_params())

#### Training

In [40]:
ensemble_1 = RandomForestClassifier(bootstrap= False, max_depth = 40, criterion= 'gini', max_features='sqrt',
                                    min_samples_leaf=2, min_samples_split = 5, n_estimators = 200, random_state=12345)

In [41]:
%%time
ensemble_1.fit(train_tf_idf,target_train)

Wall time: 44.1 s


In [42]:
predict_ens1 = ensemble_1.predict(tf_idf_test)

In [43]:
f1_score(target_test, predict_ens1)

0.617176997759522

## Conclusions

The preprocessing covered tokenization, lemmatization and removal of non-alphabetic characters. DistilBert gives good results, but a personal computer is not powerful enough to use it on the full dataset. Among the trained models, LightGBM showed the best result.

F1 was used as the quality metric. With a plain logistic regression with default settings, Bert embeddings reach 0.62, compared with 0.69 for TF-IDF vectorization.

**After training, the models score between 0.62 and 0.75 F1 on the test set. The best result came from LightGBM (0.75 with the decision threshold raised to 0.97, 0.66 at the default 0.5).**