# Проект для «Викишоп» c BERT 

Интернет-магазин «Викишоп» запускает новый сервис. Теперь пользователи могут редактировать и дополнять описания товаров, как в вики-сообществах. То есть клиенты предлагают свои правки и комментируют изменения других. Магазину нужен инструмент, который будет искать токсичные комментарии и отправлять их на модерацию. 

В проекте обучается модель классифицикации комментариев интернет-магазина «Викишоп»  на позитивные и негативные. В распоряжении набор данных с разметкой о токсичности правок.

Целевое значением метрики качества *F1* - не меньше 0.75. 

В проекта используются три модели 
- частотный анализ TF-IDF + logistic regression
- частотный анализ TF-IDF + CatBoost
- *BERT* + logistic regression 


**Описание данных**

Данные находятся в файле `toxic_comments.csv`. Столбец *text* в нём содержит текст комментария, а *toxic* — целевой признак.m

## Подготовка

In [35]:
# !pip install transformers
# !pip install catboost

In [36]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
#from fast_ml.model_development import train_valid_test_split

In [37]:
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

import torch
import transformers

from tqdm import notebook
from tqdm.notebook import tqdm

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score


from catboost import CatBoostClassifier
from catboost import cv, Pool


import re
import spacy
import time

In [38]:
try:
    data = pd.read_csv('toxic_comments.csv',index_col=[0])
except:
    data=pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv',index_col=[0])

In [39]:
full_data_length=data.shape[0]
full_data_length

159292

### Сleaning

In [40]:
def clear_text(text):
    # only letters
    t = re.sub(r"[^a-zA-Z ]", ' ', text)
    # no one letter words
    t = re.sub(r'\s+[a-zA-Z]\s+', ' ', t)
    t = re.sub(r'\^[a-zA-Z]\s+', ' ', t)
    t = t.lower()
    t =' '.join(t.split())
    return t

In [41]:
tqdm.pandas()

data['clean_text'] = data['text'].progress_apply(lambda x: clear_text(x))

  0%|          | 0/159292 [00:00<?, ?it/s]

### Lemmatization

In [42]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
def lemmatize(text):
    iter = nlp(text)
    lemma = " ".join([token.lemma_ for token in iter])
    return lemma

In [43]:
#data.to_csv('tweets_lemm.csv')
#data = pd.read_csv('tweets_lemm.csv', index_col=0)

In [44]:
#try:
#    data = pd.read_csv('tweets_lemm.csv',index_col=[0])
#except:
#    data['lemma'] = data['clean_text'].progress_apply(lambda x: lemmatize(x))

In [45]:
data=data.dropna()

### Stopwords

In [46]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [47]:
def remove_stopwords(text):
  words=text.split()
  filtered_words = [w for w in words if w not in stopwords.words('english')]
  return ' '.join(filtered_words)

In [48]:
#data['no_stopwords'] = data['lemma'].progress_apply(lambda x: remove_stopwords(x))

In [49]:
#data.to_csv('tweets_no_stop.csv')

In [50]:
try:
    data = pd.read_csv('/kaggle/input/tweets/tweets_no_stop.csv',index_col=[0])
except:
    data['no_stopwords'] = data['lemma'].progress_apply(lambda x: remove_stopwords(x))

### Stopwords и TF-IDF

In [51]:
count_tf_idf = TfidfVectorizer(stop_words=list(stop_words))

In [52]:
data=data[['no_stopwords','toxic','text']]

In [53]:
X, y='no_stopwords', 'toxic'

In [54]:
data[data[X].isna()].head(5)

Unnamed: 0,no_stopwords,toxic,text
2091,,0,"No, it doesn´t.80.228.65.162"
2400,,0,"Here, here and here."
3983,,0,From here\n\nFrom here 160.80.2.8
8837,,0,What is I 78.146.102.144
9386,,0,They do too. their ... -


In [55]:
data=data.dropna()

In [56]:
print('Доля удаленных из анализа сообщений состовляет:',1-data.shape[0]/full_data_length)

Доля удаленных из анализа сообщений состовляет: 0.00039550008788891144


### Balance and sets

In [57]:
print('Дисбаланс классов, доля токсичных сообщений состовляет:',100*data['toxic'].mean().round(1),'%')

Дисбаланс классов, доля токсичных сообщений состовляет: 10.0 %


In [58]:
train_valid, test = train_test_split(data, test_size=0.2, random_state=12345, stratify=data[y])
train, valid  = train_test_split(train_valid, test_size=0.2, random_state=12345, stratify=train_valid[y])

In [59]:
# Dispalance in sets in their size
print(train_valid[y].mean(),train_valid.shape)
print(train[y].mean(),train.shape)
print(valid[y].mean(),valid.shape)
print(test[y].mean(),test.shape)

0.10164621652810815 (127383, 3)
0.10164269032245403 (101906, 3)
0.1016603210739098 (25477, 3)
0.10164541857690133 (31846, 3)


### TF-IDF 

In [60]:
#Создадим матрицу cо значениями TF-IDF
#count_tf_idf = TfidfVectorizer(min_df=1,stop_words=list(stopwords))
#count_tf_idf
tf_idf_train_valid = count_tf_idf.fit_transform(train_valid[X])
tf_idf_train = count_tf_idf.transform(train[X])
tf_idf_valid = count_tf_idf.transform(valid[X])
tf_idf_test = count_tf_idf.transform(test[X])

In [61]:
tf_idf_train.shape

(101906, 132838)

### Выводы

Выводы

Процесс предобработки состоял из следущих этапов:
- чистка текста ( оставили только слова, убрали символы и слова из одной буквы)
- лемматизация (привели слова к основаной форме )
- убрали стоп слова
- посчитали метрики TF-IDF для ее последущего использования как признак для обучения


Наблюдается дисбаланс классов: доля токсичных сообщений состовляет: 10 %, нетоксичные сообщения 90 %

## Обучение

### LogisticRegression

In [62]:
%%time

model=LogisticRegression()
model.fit(tf_idf_train,train[y])

CPU times: user 17.1 s, sys: 17.4 s, total: 34.5 s
Wall time: 8.92 s


In [63]:
prediction=model.predict(tf_idf_valid)

f1_score(valid[y], prediction)

0.714115686741252

In [64]:
predition_prob=model.predict_proba(tf_idf_valid)
predition_prob_test=model.predict_proba(tf_idf_test)

In [65]:
best_thr=0
best_score=0
for i in range(150,500):
    thr=i/1000
    pred=(predition_prob[:,1]>thr).astype('int')
    score=f1_score(valid[y], pred)
    if score>best_score:
        best_score=score
        best_thr=thr
print('validation: best_score=',best_score)
print('best_score=',best_thr)

validation: best_score= 0.778119237861094
best_score= 0.262


In [66]:
prediction_test=(predition_prob_test[:,1]>best_thr).astype('int')
print('test set score:',f1_score(test[y], prediction_test))

test set score: 0.7814102564102564


### CatBoost

In [67]:
tf_idf_test = count_tf_idf.fit_transform(test[X])
tf_idf_train_valid = count_tf_idf.transform(train_valid[X])
tf_idf_valid = count_tf_idf.transform(valid[X])
tf_idf_train = count_tf_idf.transform(train[X])

In [68]:
train_valid_data = Pool(data=tf_idf_train_valid,
                  label=train_valid[y]
                 )
train_data = Pool(data=tf_idf_train,
                  label=train[y]
                 )
valid_data = Pool(data=tf_idf_valid,
                  label=valid[y]
                 )
test_data = Pool(data=tf_idf_test,
                  label=test[y]
                 )

In [None]:
### start_time=time.time()
rate=0.2 #0.25
params = {'eval_metric': 'F1',
          'loss_function': 'Logloss',
              'learning_rate': rate,
              'random_seed': 12345,
              'verbose':100,
          'iterations': 500  #300
          #,  'task_type':"GPU"
             # ,'devices':'0:1'
             } # test -> 

model = CatBoostClassifier(**params)

In [None]:
%%time
model.fit(train_data)

In [74]:
prediction=model.predict(valid_data)
score=f1_score(valid[y], prediction)
print('for rate=',rate ,
          'score=', score, 
          ' model.best_iteration_= ',model.best_iteration_,
          ' calc time', time.time()-start_time)

for rate= 0.25 score= 0.7472278796107716  model.best_iteration_=  None  calc time 464.0869896411896


In [76]:
prediction=model.predict(test_data)
score=f1_score(test[y], prediction)
score

0.7403476669716377

In [82]:
%%time

params = {'eval_metric': 'F1',
          'loss_function': 'Logloss',
              'learning_rate': 1,
              'random_seed': 12345,
              'verbose':50,
          'iterations': 300}  # test -> 0.765

params = {'eval_metric': 'F1',
          'loss_function': 'Logloss',
              'learning_rate': 0.2,
              'random_seed': 12345,
              'verbose':50,
          'iterations': 1000} # test -> 0.769

model = CatBoostClassifier(**params)
model.fit(train_data)


0:	learn: 0.4691124	total: 1.07s	remaining: 17m 45s
50:	learn: 0.6824441	total: 41.2s	remaining: 12m 46s
100:	learn: 0.7283135	total: 1m 20s	remaining: 11m 53s
150:	learn: 0.7542138	total: 1m 58s	remaining: 11m 6s
200:	learn: 0.7734166	total: 2m 36s	remaining: 10m 23s
250:	learn: 0.7855576	total: 3m 15s	remaining: 9m 43s
300:	learn: 0.7956017	total: 3m 54s	remaining: 9m 4s
350:	learn: 0.8080752	total: 4m 33s	remaining: 8m 24s
400:	learn: 0.8153567	total: 5m 11s	remaining: 7m 45s
450:	learn: 0.8209765	total: 5m 49s	remaining: 7m 5s
500:	learn: 0.8236323	total: 6m 27s	remaining: 6m 25s
550:	learn: 0.8255909	total: 7m 4s	remaining: 5m 46s
600:	learn: 0.8273644	total: 7m 42s	remaining: 5m 6s
650:	learn: 0.8288965	total: 8m 20s	remaining: 4m 28s
700:	learn: 0.8301805	total: 8m 58s	remaining: 3m 49s
750:	learn: 0.8340379	total: 9m 35s	remaining: 3m 10s
800:	learn: 0.8372647	total: 10m 13s	remaining: 2m 32s
850:	learn: 0.8386922	total: 10m 51s	remaining: 1m 54s
900:	learn: 0.8422471	total: 11

<catboost.core.CatBoostClassifier at 0x79c97e50e5c0>

In [83]:
model.best_iteration_

In [84]:
prediction=model.predict(test_data)
score=f1_score(test[y], prediction)
score

0.7581512002866355

In [85]:
prediction=model.predict(valid_data)
score=f1_score(valid[y], prediction)
score

0.7566125805734608

### (Distil) BERT model + logistic regression

In [5]:
# Select the device for training (use GPU if you have one)
#device = torch.device('cpu')
device = torch.device('cuda:0')

In [6]:
torch.cuda.is_available()

True

In [7]:
#model_class, tokenizer_class, pretrained_weights = (transformers.BertModel, transformers.BertTokenizer, 'bert-base-uncased')

#tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
#model = model_class.from_pretrained(pretrained_weights).to(device)

In [8]:
from transformers import DistilBertTokenizer, DistilBertModel
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained("distilbert-base-uncased").to(device)

#text = "Replace me by any text you'd like."
#encoded_input = tokenizer(text, return_tensors='pt')
#output = model(**encoded_input)

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
try:
    data_all = pd.read_csv('toxic_comments.csv',index_col=[0])
except:
    data_all=pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv',index_col=[0])

In [17]:
long_subset=data_all[(np.array([len(s.split()) for s in data_all[X]])>512)]
long_subset.count()

text     1969
toxic    1969
dtype: int64

In [18]:
data_all.shape

(159292, 2)

In [97]:
# full set settings
DATA_SIZE=data_all.shape[0]
BERT_SAMPLES = DATA_SIZE
BATCH_SIZE = 100
data_sample=data_all

# test settings
#DATA_SIZE=data_all.shape[0]
#BERT_SAMPLES = (DATA_SIZE//200)*2
#BATCH_SIZE = 100
#data_sample=data_all.groupby('toxic', group_keys=False).apply(lambda x: x.sample(min(len(x), BERT_SAMPLES//2)))

print('DATA_SIZE=',len(data_sample))
X='text'
y='toxic'
data_sample=data_sample[[X,y]].reset_index()

DATA_SIZE= 159292


In [98]:
tqdm.pandas()

tokenized = data_sample[X].progress_apply(lambda x: tokenizer.encode(x, add_special_tokens=True, padding='max_length', truncation=True))

  0%|          | 0/159292 [00:00<?, ?it/s]

In [99]:
data_sample.shape

(159292, 3)

In [100]:
tokenized.index.shape

(159292,)

In [101]:
BERT_SAMPLES=len(tokenized)

In [102]:
tokens = []
target = []
for i in range(len(tokenized)):
    if len(tokenized[i]) <= 512:
        tokens.append(tokenized[i])
        target.append(data_sample['toxic'][i])
tokens = (pd.Series(tokens)).head(BERT_SAMPLES)
target = (pd.Series(target)).head(BERT_SAMPLES)

In [103]:
max_len = 0
for i in tqdm(tokens.values):
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len - len(i)) for i in tqdm(tokens.values)])
attention_mask = np.where(padded != 0, 1, 0)

  0%|          | 0/159292 [00:00<?, ?it/s]

  0%|          | 0/159292 [00:00<?, ?it/s]

In [104]:
display(padded.shape, attention_mask.shape)


(159292, 512)

(159292, 512)

In [105]:
batch_size = BATCH_SIZE
embeddings = []
num_of_batches=padded.shape[0] // batch_size
num_of_batches

1592

In [106]:
print('estimaiton of calculation duration, in min',num_of_batches*3/60)

estimaiton of calculation duration, in min 79.6


In [107]:
start_time=-time.time()
for i in range(num_of_batches+1):
    if not((i+1)%100):
        print('starting i =',i+1,'/',num_of_batches+1)
        calculation_time=start_time+time.time()
        print('everage time per batch=',calculation_time/(i+1))
    #torch.from_numpy(img).float().to(device)
    #batch = torch.tensor(pad,device=device)
    #attention_mask_batch = torch.from_numpy(att).to(device)
    pad=padded[batch_size*i:batch_size*(i+1)]
    batch = torch.from_numpy(pad).float().to(device)
    att=attention_mask[batch_size*i:batch_size*(i+1)]
    attention_mask_batch = torch.tensor(att).to(device)
    with torch.no_grad():
        batch_embeddings = model(batch.to(torch.int64), attention_mask=attention_mask_batch)
    
    embeddings.append(batch_embeddings[0][:,0,:].cpu().numpy())
features_bert = np.concatenate(embeddings)

starting i = 100 / 1593
everage time per batch= 1.609344344139099
starting i = 200 / 1593
everage time per batch= 1.6109972870349885
starting i = 300 / 1593
everage time per batch= 1.6092165740331015
starting i = 400 / 1593
everage time per batch= 1.6083427864313125
starting i = 500 / 1593
everage time per batch= 1.6086905508041383
starting i = 600 / 1593
everage time per batch= 1.610033076206843
starting i = 700 / 1593
everage time per batch= 1.6095545772143773
starting i = 800 / 1593
everage time per batch= 1.6095227658748628
starting i = 900 / 1593
everage time per batch= 1.6099046823713514
starting i = 1000 / 1593
everage time per batch= 1.6103076775074006
starting i = 1100 / 1593
everage time per batch= 1.610702857320959
starting i = 1200 / 1593
everage time per batch= 1.6110158067941667
starting i = 1300 / 1593
everage time per batch= 1.6113205302678621
starting i = 1400 / 1593
everage time per batch= 1.6116161847114563
starting i = 1500 / 1593
everage time per batch= 1.611825504

In [109]:
features_bert.shape

(159292, 768)

In [108]:
from numpy import savetxt
# save to csv file
savetxt('features_bert_dist.csv', features_bert, delimiter=',')

In [110]:
len(data_sample['toxic'])

159292

In [111]:
y_data=data_sample['toxic'].iloc[:len(features_bert)]
len(y_data)

159292

In [112]:
features_train_valid, features_test, target_train_valid, target_test = train_test_split(features_bert,y_data, test_size=0.2, random_state=12345, stratify=y_data) 
features_train, features_valid, target_train, target_valid = train_test_split(features_train_valid, target_train_valid, test_size=0.2, random_state=12345, stratify=target_train_valid)

In [120]:
%%time

model_lin=LogisticRegression(solver='lbfgs', max_iter=200,random_state=12345, penalty='l2', class_weight='balanced')
#model_lin=LogisticRegression(solver='liblinear', max_iter=100,random_state=12345, penalty='l2', class_weight='balanced')
model_lin.fit(features_train,target_train)

CPU times: user 44.6 s, sys: 2.32 s, total: 46.9 s
Wall time: 25.4 s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [121]:
prediction=model_lin.predict(features_valid)
f1_score(target_valid, prediction)

0.6664758906853627

In [122]:
prediction=model_lin.predict(features_test)
f1_score(target_test, prediction)

0.668832675794848

In [123]:
predition_prob=model_lin.predict_proba(features_valid)
predition_prob_test=model_lin.predict_proba(features_test)

In [124]:
best_thr=0
best_score=0
for i in range(1,100):
    thr=i/100
    pred=(predition_prob[:,1]>thr).astype('int')
    score=f1_score(target_valid, pred)
    if score>best_score:
        best_score=score
        best_thr=thr
print('validation: best_score=',best_score)
print('best threshold=',best_thr)

validation: best_score= 0.752406623026569
best threshold= 0.84


In [125]:
prediction_test=(predition_prob_test[:,1]>best_thr).astype('int')
print('test set score:',f1_score(target_test, prediction_test))

test set score: 0.7567651632970451


### BERT model + logistic regression on cleaned data

In [12]:
# Select the device for training (use GPU if you have one)
#device = torch.device('cpu')
device = torch.device('cuda:0')

In [13]:
torch.cuda.is_available()

True

In [14]:
model_class, tokenizer_class, pretrained_weights = (transformers.BertModel, transformers.BertTokenizer, 'bert-base-uncased')

# Загрузка предобученной модели/токенизатора
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights).to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [17]:
data_bert=data.copy()
data_bert=data_bert.fillna('none')
#DATA_SIZE=400

DATA_SIZE=data_bert.shape[0]
BERT_SAMPLES = DATA_SIZE
BATCH_SIZE = 100

# test settings
#DATA_SIZE=data_bert.shape[0]
#BERT_SAMPLES = (DATA_SIZE//200)*2
#BATCH_SIZE = 100

# test set
#data_sample=data_bert.groupby('toxic', group_keys=False).apply(lambda x: x.sample(min(len(x), BERT_SAMPLES//2)))
# full set
data_sample=data_bert
print('DATA_SIZE=',len(data_sample))
X='no_stopwords'
y='toxic'
data_sample=data_sample[[X,y]].reset_index()

DATA_SIZE= 159281


In [18]:
data_sample['toxic'].mean()

0.10161287284735782

In [19]:
data_sample.shape

(159281, 3)

In [20]:
long_subset=data_bert[(np.array([len(s.split()) for s in data_bert[X]])>512)]

In [21]:
tqdm.pandas()

#tokenized = dat_sampl[X].progress_apply(lambda x: tokenizer.encode(x, add_special_tokens=True))
tokenized = data_sample[X].progress_apply(lambda x: tokenizer.encode(x, add_special_tokens=True, padding='max_length', truncation=True))
#tokenized = long_subset[X].progress_apply(lambda x: tokenizer.encode(x, add_special_tokens=True, padding='max_length', truncation=True))
#tokenized = data_bert[X].progress_apply(lambda x: tokenizer.encode(x, add_special_tokens=True, padding='max_length', truncation=True))

  0%|          | 0/159281 [00:00<?, ?it/s]

In [22]:
data_sample.shape

(159281, 3)

In [23]:
tokenized.index.shape

(159281,)

In [24]:
BERT_SAMPLES=len(tokenized)

In [25]:
tokens = []
target = []
for i in range(len(tokenized)):
    if len(tokenized[i]) <= 512:
        tokens.append(tokenized[i])
        target.append(data_sample['toxic'][i])
tokens = (pd.Series(tokens)).head(BERT_SAMPLES)
target = (pd.Series(target)).head(BERT_SAMPLES)

In [26]:
max_len = 0
for i in tqdm(tokens.values):
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len - len(i)) for i in tqdm(tokens.values)])
attention_mask = np.where(padded != 0, 1, 0)

  0%|          | 0/159281 [00:01<?, ?it/s]

  0%|          | 0/159281 [00:00<?, ?it/s]

In [27]:
display(padded.shape, attention_mask.shape)


(159281, 512)

(159281, 512)

In [28]:
#https://www.kaggle.com/code/atulanandjha/distilbert-on-gpu-tutorial-classification-problem

In [None]:
# one batch example
#%time 

#input_ids = torch.tensor(padded).to(device)  
#attention_mask_t = torch.tensor(attention_mask).to(device)
#
#with torch.no_grad():
#        last_hidden_states = model(input_ids, attention_mask=attention_mask_t)# .to(device)        

In [30]:
batch_size = BATCH_SIZE
embeddings = []
num_of_batches=padded.shape[0] // batch_size
num_of_batches

1592

In [31]:
print('estimaiton of calculation duration, in min',num_of_batches*3/60)

estimaiton of calculation duration, in min 79.6


In [34]:
start_time=-time.time()
for i in range(num_of_batches+1):
    if not((i+1)%100):
        print('starting i =',i+1,'/',num_of_batches+1)
        calculation_time=start_time+time.time()
        print('everage time per batch=',calculation_time/(i+1))
    #torch.from_numpy(img).float().to(device)
    #batch = torch.tensor(pad,device=device)
    #attention_mask_batch = torch.from_numpy(att).to(device)
    pad=padded[batch_size*i:batch_size*(i+1)]
    batch = torch.from_numpy(pad).float().to(device)
    att=attention_mask[batch_size*i:batch_size*(i+1)]
    attention_mask_batch = torch.tensor(att).to(device)
    with torch.no_grad():
        batch_embeddings = model(batch.to(torch.int64), attention_mask=attention_mask_batch)
    
    embeddings.append(batch_embeddings[0][:,0,:].cpu().numpy())
features_bert = np.concatenate(embeddings)

starting i = 100 / 1593
everage time ber batch= 2.963857488632202
starting i = 200 / 1593
everage time ber batch= 2.967119460105896
starting i = 300 / 1593
everage time ber batch= 2.968395978609721
starting i = 400 / 1593
everage time ber batch= 2.9693312203884124
starting i = 500 / 1593
everage time ber batch= 2.969329417705536
starting i = 600 / 1593
everage time ber batch= 2.9696555324395497
starting i = 700 / 1593
everage time ber batch= 2.970000205039978
starting i = 800 / 1593
everage time ber batch= 2.9703501117229463
starting i = 900 / 1593
everage time ber batch= 2.970766903029548
starting i = 1000 / 1593
everage time ber batch= 2.9709809548854826
starting i = 1100 / 1593
everage time ber batch= 2.9709358113462274
starting i = 1200 / 1593
everage time ber batch= 2.9711649388074877
starting i = 1300 / 1593
everage time ber batch= 2.971395261287689
starting i = 1400 / 1593
everage time ber batch= 2.971281086206436
starting i = 1500 / 1593
everage time ber batch= 2.97132792218526

In [37]:
features_bert.shape

(159281, 768)

In [38]:
type(features_bert)

numpy.ndarray

In [39]:
from numpy import savetxt
# save to csv file
savetxt('features_bert.csv', features_bert, delimiter=',')

In [41]:
len(data_sample['toxic'])

159281

In [42]:
y_data=data_sample['toxic'].iloc[:len(features_bert)]
len(y_data)

159281

In [43]:
features_train_valid, features_test, target_train_valid, target_test = train_test_split(features_bert,y_data, test_size=0.2, random_state=12345, stratify=y_data) 
features_train, features_valid, target_train, target_valid = train_test_split(features_train_valid, target_train_valid, test_size=0.2, random_state=12345, stratify=target_train_valid)

In [92]:
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition

In [99]:
%%time 
use_pca=True
if use_pca:
    pca = decomposition.PCA(n_components=100)
    features_train_dec=pca.fit_transform(features_train)
    features_test_dec=pca.transform(features_test)
    features_valid_dec=pca.transform(features_valid)

CPU times: user 8.59 s, sys: 2.07 s, total: 10.7 s
Wall time: 6.86 s


In [100]:
features_train.shape

(101939, 768)

In [101]:
model_lin=LogisticRegression(solver='lbfgs', max_iter=200,random_state=12345, penalty='l2', class_weight='balanced')
for n in range(100,701,100):
    pca = decomposition.PCA(n_components=n)
    f=pca.fit_transform(features_train)
    v=pca.transform(features_valid)
    model_lin.fit(f,target_train)
    prediction=model_lin.predict(v)
    print('n=',n,'; f1=',f1_score(target_valid, prediction))
    

n= 100 ; f1= 0.5763157894736841
n= 200 ; f1= 0.5868869936034116
n= 300 ; f1= 0.5924538399785924
n= 400 ; f1= 0.5949705724986624
n= 500 ; f1= 0.5977691170541595
n= 600 ; f1= 0.602246582758154
n= 700 ; f1= 0.6021245125722737


In [72]:
use_scaler=True
if use_scaler:
    scalar= StandardScaler()
    pd.options.mode.chained_assignment = None
    features_train_valid_norm=scalar.fit_transform(features_train_valid)
    features_train_norm=scalar.transform(features_train)
    features_valid_norm=scalar.transform(features_valid)
    features_test_norm=scalar.transform(features_test)
else:
    features_train_valid_norm=features_train_valid
    features_train_norm=features_train
    features_valid_norm=features_valid
    features_test_norm=features_test

In [88]:
%%time

model_lin=LogisticRegression(solver='lbfgs', max_iter=200,random_state=12345, penalty='l2', class_weight='balanced')
#model_lin=LogisticRegression(solver='liblinear', max_iter=100,random_state=12345, penalty='l2', class_weight='balanced')
model_lin.fit(features_train_norm,target_train)

CPU times: user 43 s, sys: 2.7 s, total: 45.7 s
Wall time: 26 s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [89]:
prediction=model_lin.predict(features_valid_norm)
f1_score(target_valid, prediction)

0.5998651382333109

In [81]:
prediction=model_lin.predict(features_valid_norm)
f1_score(target_valid, prediction)

0.6001348617666892

In [82]:
prediction=model_lin.predict(features_test_norm)
f1_score(target_test, prediction)

0.6019793459552496

In [83]:
predition_prob=model_lin.predict_proba(features_valid_norm)
predition_prob_test=model_lin.predict_proba(features_test_norm)

In [84]:
best_thr=0
best_score=0
for i in range(1,100):
    thr=i/100
    pred=(predition_prob[:,1]>thr).astype('int')
    score=f1_score(target_valid, pred)
    if score>best_score:
        best_score=score
        best_thr=thr
print('validation: best_score=',best_score)
print('best threshold=',best_thr)

validation: best_score= 0.6932333793647663
best threshold= 0.83


In [85]:
prediction_test=(predition_prob_test[:,1]>best_thr).astype('int')
print('test set score:',f1_score(target_test, prediction_test))

test set score: 0.6997167138810199


### CatBoost

In [104]:
%%time
train_valid_data = Pool(data=features_train_valid,
                  label=target_train_valid
                 )
train_data = Pool(data=features_train,
                  label=target_train
                 )
valid_data = Pool(data=features_valid,
                  label=target_valid
                 )
test_data = Pool(data=features_test,
                  label=target_test
                 )

CPU times: user 8.96 s, sys: 165 ms, total: 9.12 s
Wall time: 9.11 s


In [118]:
params = {'eval_metric': 'F1',
          'loss_function': 'Logloss',
              'learning_rate': 0.05,
              'random_seed': 12345,
              'verbose':100}

In [None]:
%%time
cv_data = cv(
    params = params,
    pool = train_data,
    fold_count=3,
    shuffle=True,
    partition_random_seed=0,
    stratified=True,
    verbose=50,
    early_stopping_rounds=20
)

In [111]:
cv_data.head()

Unnamed: 0,iterations,test-F1-mean,test-F1-std,train-F1-mean,train-F1-std,test-Logloss-mean,test-Logloss-std,train-Logloss-mean,train-Logloss-std
0,0,0.297363,0.024913,0.298578,0.042514,0.454869,0.006194,0.454276,0.005791
1,1,0.382409,0.008908,0.384345,0.010206,0.337901,0.002512,0.337172,0.001577
2,2,0.397161,0.00883,0.398214,0.000814,0.279702,0.000546,0.278434,0.001642
3,3,0.412977,0.00474,0.418067,0.001049,0.248891,0.001034,0.247482,0.001695
4,4,0.437561,0.009633,0.442204,0.005164,0.230168,0.001166,0.228219,0.001111


In [114]:
cv_data[cv_data['test-F1-mean'] == cv_data['test-F1-mean'].max()]

Unnamed: 0,iterations,test-F1-mean,test-F1-std,train-F1-mean,train-F1-std,test-Logloss-mean,test-Logloss-std,train-Logloss-mean,train-Logloss-std
155,155,0.648197,0.004679,0.800733,0.001011,0.162707,0.002198,0.103998,0.001324


In [113]:
cv_data[cv_data['train-F1-mean'] == cv_data['train-F1-mean'].max()]

Unnamed: 0,iterations,test-F1-mean,test-F1-std,train-F1-mean,train-F1-std,test-Logloss-mean,test-Logloss-std,train-Logloss-mean,train-Logloss-std
184,184,0.647549,0.003435,0.820591,0.004122,0.162748,0.002192,0.096784,0.002681


In [127]:
%%time

params = {'eval_metric': 'F1',
          'loss_function': 'Logloss',
              'learning_rate': 1,
              'random_seed': 12345,
              'verbose':50,
          'iterations': 300}  # test -> 

params = {'eval_metric': 'F1',
          'loss_function': 'Logloss',
              'learning_rate': 0.05,
              'random_seed': 12345,
              'verbose':50,
          'iterations': 500,
            'task_type':"GPU",
            'devices':'0:1'} # test -> 

model = CatBoostClassifier(**params)
model.fit(train_data)




0:	learn: 0.4333942	total: 140ms	remaining: 1m 9s
50:	learn: 0.5253857	total: 6.64s	remaining: 58.5s
100:	learn: 0.5948537	total: 12s	remaining: 47.3s
150:	learn: 0.6264898	total: 17.4s	remaining: 40.2s
200:	learn: 0.6445446	total: 22.7s	remaining: 33.7s
250:	learn: 0.6570986	total: 28.2s	remaining: 27.9s
300:	learn: 0.6666667	total: 34.4s	remaining: 22.7s
350:	learn: 0.6751991	total: 40.5s	remaining: 17.2s
400:	learn: 0.6846488	total: 46.1s	remaining: 11.4s
450:	learn: 0.6932601	total: 51.9s	remaining: 5.64s
499:	learn: 0.7012987	total: 57.2s	remaining: 0us
CPU times: user 1min 34s, sys: 23.1 s, total: 1min 58s
Wall time: 1min 5s


<catboost.core.CatBoostClassifier at 0x7cddacaa5780>

In [129]:
prediction=model.predict(test_data)
score=f1_score(target_test, prediction)
score

0.6467735919433047

In [130]:
prediction=model.predict(valid_data)
score=f1_score(target_valid, prediction)
score

0.6442194490228397

In [None]:
%%time
score_list_1000=[]
#for r in [0.01,0.02,0.03,0.04,0.05,0.1,0,15,0.2,0.25,0.3]:
for r in [0.15,0.2,0.25,0.3]:
    start_time=time.time()
    params = {'eval_metric': 'F1',
          'loss_function': 'Logloss',
              'learning_rate': r,
              'random_seed': 12345,
              'verbose':0,
          'iterations': 1000,
            'task_type':"GPU"
             # ,'devices':'0:1'
             } # test -> 

    model = CatBoostClassifier(**params)
    model.fit(train_data)
    prediction=model.predict(valid_data)
    score=f1_score(target_valid, prediction)
    print('for rate=',r ,'score=', score, ' iteration time', time.time()-start_time)
    score_list_1000.append([r,score])    

In [136]:
score_list_1000

[[0.01, 0.6175115207373272],
 [0.02, 0.6385998107852413],
 [0.03, 0.6439802863177658],
 [0.04, 0.6503382318637743],
 [0.05, 0.6573686657368666],
 [0.1, 0.6632864715372205]]

In [106]:
%%time

params = {'eval_metric': 'F1',
          'loss_function': 'Logloss',
              'learning_rate': 1,
              'random_seed': 12345,
              'verbose':50,
          'iterations': 300}  # test -> 

params = {'eval_metric': 'F1',
          'loss_function': 'Logloss',
              'learning_rate': 0.1,
              'random_seed': 12345,
              'verbose':50,
          'iterations': 300} # test -> 

model = CatBoostClassifier(**params)
model.fit(train_data)


0:	learn: 0.4064386	total: 764ms	remaining: 6m 21s
50:	learn: 0.6559394	total: 35.2s	remaining: 5m 9s
100:	learn: 0.7148682	total: 1m 5s	remaining: 4m 19s
150:	learn: 0.7562285	total: 1m 36s	remaining: 3m 42s
200:	learn: 0.7916319	total: 2m 6s	remaining: 3m 7s
250:	learn: 0.8247840	total: 2m 37s	remaining: 2m 35s
300:	learn: 0.8507124	total: 3m 7s	remaining: 2m 3s
350:	learn: 0.8710605	total: 3m 36s	remaining: 1m 31s
400:	learn: 0.8912735	total: 4m 6s	remaining: 1m
450:	learn: 0.9079748	total: 4m 37s	remaining: 30.2s
499:	learn: 0.9227048	total: 5m 7s	remaining: 0us
CPU times: user 10min 1s, sys: 2.57 s, total: 10min 4s
Wall time: 5min 16s


<catboost.core.CatBoostClassifier at 0x7cdd9a035330>

In [None]:
params = {'eval_metric': 'F1',
          'loss_function': 'Logloss',
              'learning_rate': 0.25,
              'random_seed': 12345,
              'verbose':100}

In [None]:
cv_data = cv(
    params = params,
    pool = train_data,
    fold_count=5,
    shuffle=True,
    partition_random_seed=0,
    stratified=True,
    verbose=50,
    early_stopping_rounds=20
)

In [None]:
%%time

params = {'eval_metric': 'F1',
          'loss_function': 'Logloss',
              'learning_rate': 1,
              'random_seed': 12345,
              'verbose':50,
          'iterations': 300}  # test -> 0.765

params = {'eval_metric': 'F1',
          'loss_function': 'Logloss',
              'learning_rate': 0.25,
              'random_seed': 12345,
              'verbose':50,
          'iterations': 500} # test -> 0.769

model = CatBoostClassifier(**params)
model.fit(train_data)


In [None]:
prediction=model.predict(test_data)
score=f1_score(test[y], prediction)
score

In [None]:
%%time

params = {'eval_metric': 'F1',
          'loss_function': 'Logloss',
              'learning_rate': 1,
              'random_seed': 12345,
              'verbose':50,
          'iterations': 300}  # test -> 0.765

params = {'eval_metric': 'F1',
          'loss_function': 'Logloss',
              'learning_rate': 0.25,
              'random_seed': 12345,
              'verbose':50,
          'iterations': 500} # test -> 0.769

model = CatBoostClassifier(**params)
model.fit(train_data)


In [107]:
prediction=model.predict(valid_data)
score=f1_score(target_valid, prediction)
score

0.6562641338760742

In [109]:
prediction=model.predict(test_data)
score=f1_score(target_test, prediction)
score

0.6698027863216935

### Выводы по обучению моделей

Выводы

Построены и обучены следущие модели :

В проекта используются пять моделей 
- частотный анализ TF-IDF + logistic regression   (F1 на тестовой выборке 0.78)
- частотный анализ TF-IDF + CatBoost (F1 на тестовой выборке 0.757)
- *BERT* + logistic regression на исходных данных  (F1 на тестовой выборке 0.757)
- *BERT* + logistic regression на данных после предобработки очистки, лемматизации, без стоп-слов  (F1 на тестовой выборке 0.7)
- *BERT* + CatBoost на данных после предобработки очистки, лемматизации, без стоп-слов  (F1 на тестовой выборке 0.67)

Дисбаланс классов для логической регрессии решался подбором порога вероятности. 

## Выводы

Наблюдается дисбаланс классов: доля токсичных сообщений состовляет: 10 %, нетоксичные сообщения 90 

Для обоих способов векторицазии токенизированых фраз логическая регрессия работает достаточно хорошо и гораздо быстрее чем CatBoost. 
Частотные признаки TF-IDF показали немного лучше чем BERT. 
В случаи использования BERT предобработка фраз привела к ухучшению метрики, поэтому в работе использовались исходные сообщения.  Частотный анализ строился после предобработки текста(чистка текста, лемматизация,  убрали стоп слова)

Предлагаем две модели с метрикой F1 выше целевой 0.75:
- TF-IDF + logistic regression (0.78)
- BERT + logistic regression  (0.757)
  
Модель TF-IDF + logistic regression (0.78) более быстра в исспользовании.