# Detecting Comment Toxicity using BERT

In this project, we will solve the task of text classification using supervised machine learning methods.

**Project description:**

The online store "Wikishop" is launching a new service. Now users can edit and supplement product descriptions, similar to wiki communities. This means that customers can suggest their edits and comment on changes made by others. The store needs a tool that can detect toxic comments and send them for moderation.

Our goal is to train a model that classifies comments as toxic (1) or non-toxic (0). The dataset is labeled: the `toxic` column marks the class of each comment.

The quality metric is F1, and the model must reach a score of at least 0.75.
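
As a reminder of what the metric measures, here is a minimal sketch that computes F1 from confusion counts. The function name `f1_from_counts` and the counts are hypothetical and chosen only for illustration; the rest of the notebook uses `sklearn.metrics.f1_score`.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if tp == 0:
        # no true positives: precision or recall is zero, so F1 is zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# hypothetical counts, not taken from the dataset
score = f1_from_counts(tp=80, fp=20, fn=30)
print(round(score, 4))  # 0.7619
```

Because F1 ignores true negatives, it is not inflated by the large share of non-toxic comments.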

## Data preprocessing

In [None]:
pip install spacy --quiet

In [None]:
pip install catboost --quiet

In [None]:
pip install transformers --quiet

In [None]:
pip install fast_ml --quiet

In [None]:
import pandas as pd
import numpy as np
import re
import string
import spacy
import nltk
import torch
import keras

from fast_ml.model_development import train_valid_test_split
from sklearn.metrics import f1_score
from catboost import CatBoostClassifier
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline
from tqdm import tqdm

In [None]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
try:
    df = pd.read_csv("https://code.s3.yandex.net/datasets/toxic_comments.csv")
except Exception:
    # falling back to the local copy if the remote file is unavailable
    df = pd.read_csv("/datasets/toxic_comments.csv")

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [None]:
df.sample(20)

Unnamed: 0.1,Unnamed: 0,text,toxic
96324,96417,BIG UPDATE!! This site has so many facts wron...,0
113080,113178,"""\n\nI shall ignore your stupid personal insul...",0
53072,53133,"Not done, comedy request.",0
133024,133162,I'm so set on getting the article deleted beca...,0
48936,48991,"Blackworm, I know why. (: hehehehehehe it has ...",0
137144,137282,"if the algae lipid factor is 40%, one need 2.5...",0
52577,52634,"""\n\nIndeed they have. And you are right about...",0
134487,134625,Major reorganization and rewriting \n\nI think...,0
115809,115908,not to mention they are mixing objective and o...,0
32873,32913,dimension 1 || dimension 2 || dimension,0


We can see that we have to work with unlemmatized and uncleaned text in English.

In [None]:
# making sure that there are no missing values
df.isna().sum()

Unnamed: 0    0
text          0
toxic         0
dtype: int64

In [None]:
# making sure that there are no duplicates
df.duplicated().sum()

0

In [None]:
df['text'].duplicated().sum()

0

In [None]:
# let's take a look at the classes ratio
df["toxic"].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64

We observe a significant class imbalance: about 90% of the comments are non-toxic (143,106 versus 16,186 toxic).
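
To see why this imbalance matters for the choice of metric, consider a trivial baseline that labels every comment as non-toxic. The sketch below rebuilds the label vector from the class counts printed above; the baseline itself is an illustration, not part of the project pipeline.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# class counts taken from the value_counts() output above
n_negative, n_positive = 143106, 16186
y_true = np.concatenate([
    np.zeros(n_negative, dtype=int),
    np.ones(n_positive, dtype=int),
])

# a trivial "model" that labels every comment as non-toxic
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_f1 = f1_score(y_true, y_pred, zero_division=0)
print(f"accuracy={baseline_accuracy:.3f}, f1={baseline_f1:.3f}")
# accuracy=0.898, f1=0.000
```

Accuracy looks high even though the baseline never finds a toxic comment, which is why F1 on the toxic class is the more informative metric here.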

In [None]:
# initializing the lemmatizer
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

In [None]:
# `re`, `spacy`, `stopwords` and `word_tokenize` are already imported above, and `nlp`
# is already loaded with the parser and NER disabled, which speeds up lemmatization

def clean_text(text):
    # converting all words to lowercase
    text = text.lower()

    # replacing contractions with full forms
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"it's", "it is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"how's", "how is", text)
    text = re.sub(r"'ll", " will", text)
    text = re.sub(r"'ve", " have", text)
    text = re.sub(r"'re", " are", text)
    text = re.sub(r"'d", " would", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)
    # restoring a dropped final "g" ("goin'" -> "going") without touching possessives
    text = re.sub(r"in'(?!\w)", "ing", text)
    text = re.sub(r"'bout", "about", text)
    text = re.sub(r"'til", "until", text)

    # removing special characters and punctuation
    text = re.sub(r'[^\w\s]', '', text)

    # tokenizing the text
    tokens = word_tokenize(text)

    # removing stop words (a set gives constant-time membership checks)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]

    # lemmatizing
    lemmatized_tokens = [token.lemma_ for token in nlp(" ".join(filtered_tokens))]

    cleaned_text = ' '.join(lemmatized_tokens)

    return cleaned_text
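
The order of the substitutions in `clean_text` matters: specific forms such as "won't" and "can't" must be expanded before the generic "n't" rule, otherwise they would turn into "wo not" and "ca not". The standalone sketch below shows this on one sentence. It uses only `re`, covers a small subset of the rules, and the helper name `expand_contractions` is hypothetical.

```python
import re

# ordered (pattern, replacement) pairs: specific forms come before the generic "n't"
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ll", " will"),
]


def expand_contractions(text: str) -> str:
    text = text.lower()
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    # dropping everything that is neither a word character nor whitespace
    return re.sub(r"[^\w\s]", "", text)


example = expand_contractions("You're wrong, I won't revert and they don't care!")
print(example)
# you are wrong i will not revert and they do not care
```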


In [None]:
%%time

# saving the cleaned text in a separate column
df["clean"] = df['text'].apply(clean_text)
df['clean'].head(20)

CPU times: user 17min 9s, sys: 9.41 s, total: 17min 18s
Wall time: 17min 42s


0     explanation edit make username hardcore metall...
1     daww match background colour seemingly stuck t...
2     hey man really try edit war guy constantly rem...
3     make real suggestion improvement wonder sectio...
4                         sir hero chance remember page
5                congratulation well use tool well talk
6                           cocksucker piss around work
7     vandalism matt shirvington article revert plea...
8     sorry word nonsense offensive anyway intend wr...
9                  alignment subject contrary dulithgow
10    fair use rationale imagewonjujpg thank upload ...
11                    bbq man let discuss itmaybe phone
12    hey talk exclusive group wp talibanswho good d...
14    oh girl start argument stick nose belong belie...
15    juelz santanas age 2002 juelz santana 18 year ...
16                bye look come think comme back tosser
17       redirect talkvoydan pop georgiev chernodrinski
18    mitsurugi point make sense argue include h

In [None]:
# defining features
target = df["toxic"]
features = df["clean"]

In [None]:
# initializing TfidfVectorizer
vect = TfidfVectorizer(ngram_range=(1, 3), min_df=3, max_df=0.9, use_idf=True,
                       smooth_idf=True, sublinear_tf=True, stop_words=stopwords.words('english'))

In [None]:
# splitting the data
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.25, random_state=42,
                                                                            stratify=target)
print(features_train.shape)
print(target_train.shape)
print(features_test.shape)
print(target_test.shape)

(119469,)
(119469,)
(39823,)
(39823,)
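
The `stratify=target` argument keeps the share of toxic comments the same in both parts of the split. The following sketch checks this on synthetic labels with a 90/10 imbalance similar to the dataset; the data is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic labels with roughly the same imbalance as the dataset
labels = np.array([0] * 900 + [1] * 100)
samples = np.arange(len(labels))

_, _, y_train, y_test = train_test_split(
    samples, labels, test_size=0.25, random_state=42, stratify=labels
)

# the share of the positive class should be close to 0.1 in both parts
print(y_train.mean(), y_test.mean())
```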


In [None]:
# applying TfidfVectorizer
features_train_tfidf = vect.fit_transform(features_train)
features_test_tfidf = vect.transform(features_test)

We are now prepared for further training: we cleaned the data, split the dataset into sets, and applied a vectorizer.
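
The vectorizer is fitted on the training part only and then reused on the test part, so no information from the test set leaks into the vocabulary or the IDF weights. A toy sketch of the same pattern, with invented texts and a default `TfidfVectorizer` (the name `toy_vect` is an assumption for this example):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpora, invented for illustration
train_texts = ["good edit thank", "bad edit revert", "thank revert"]
test_texts = ["good revert vandal"]

toy_vect = TfidfVectorizer()
train_matrix = toy_vect.fit_transform(train_texts)  # learns vocabulary and IDF weights
test_matrix = toy_vect.transform(test_texts)        # reuses them, learns nothing new

vocabulary = sorted(toy_vect.vocabulary_)
print(vocabulary)
print(train_matrix.shape, test_matrix.shape)
```

The word "vandal" appears only in the test text, so it is absent from the vocabulary and is ignored when the test text is transformed.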

## Model Training

### LogisticRegression

In [None]:
# keeping the full-data test labels: `target_test` is overwritten in the sections below
target_test_lr = target_test

In [None]:
%%time

# model initialization
logreg = LogisticRegression()

# hyperparameters for RandomizedSearchCV
param_grid = {
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga']
}

random_search = RandomizedSearchCV(
    logreg,
    param_distributions=param_grid,
    n_iter=10,
    scoring='f1',
    cv=5,
    verbose=1,
    random_state=42
)

random_search.fit(features_train_tfidf, target_train)

best_params_lr = random_search.best_params_
best_model_lr = random_search.best_estimator_

f1_lr = random_search.best_score_

print("Best model parameters:", best_params_lr)
print("F1 score:", f1_lr)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
Best model parameters: {'solver': 'liblinear', 'penalty': 'l1'}
F1 score: 0.7662832728708431
CPU times: user 5min 1s, sys: 10.5 s, total: 5min 11s
Wall time: 5min 3s
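
The log line "Fitting 5 folds for each of 4 candidates" follows from the size of the search space: two penalties times two solvers give four combinations, fewer than `n_iter=10`, so the randomized search ends up trying all of them. A short sketch that enumerates the same grid with `ParameterGrid`:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "penalty": ["l1", "l2"],
    "solver": ["liblinear", "saga"],
}

# every combination that the search can draw from
candidates = list(ParameterGrid(param_grid))
print(len(candidates))  # 4
for params in candidates:
    print(params)
```

With a grid this small, `GridSearchCV` would evaluate the same four candidates.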


### CatBoost

In [None]:
# let's train the model on 10% of the original dataframe for faster training

the_rest, df_sample = train_test_split(
    df, test_size = 0.1, random_state=42,
    stratify=df['toxic']
    )

data = df_sample.copy()
data = data.reset_index(drop=True)

del the_rest
del df_sample

print(data['toxic'].value_counts())
print(data['toxic'].value_counts(normalize=True))

0    14311
1     1619
Name: toxic, dtype: int64
0    0.898368
1    0.101632
Name: toxic, dtype: float64


In [None]:
# defining features
target = data["toxic"]
features = data["clean"]

# splitting data
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.25, random_state=42,
                                                                            stratify=target)
print(features_train.shape)
print(target_train.shape)
print(features_test.shape)
print(target_test.shape)

(11947,)
(11947,)
(3983,)
(3983,)


In [None]:
# applying TfidfVectorizer
features_train_tfidf_new = vect.fit_transform(features_train)
features_test_tfidf_new = vect.transform(features_test)

In [None]:
%%time

# initializing the model
model_cat = CatBoostClassifier(random_state=42)

# params for RandomizedSearchCV
params_cat = {'learning_rate': [0.5],
              'iterations': [50]
              }


grid_cat = RandomizedSearchCV(model_cat, param_distributions = params_cat, scoring='f1',
                              random_state=42)

grid_cat.fit(features_train_tfidf_new, target_train, verbose=20)
f1_cat = grid_cat.best_score_

print("Best F1:", f1_cat)
print("Best params:", grid_cat.best_params_)

0:	learn: 0.3657707	total: 568ms	remaining: 27.9s
20:	learn: 0.1523059	total: 8.92s	remaining: 12.3s
40:	learn: 0.1179191	total: 16.1s	remaining: 3.52s
49:	learn: 0.1082264	total: 20.3s	remaining: 0us
0:	learn: 0.3615681	total: 414ms	remaining: 20.3s
20:	learn: 0.1502752	total: 6.93s	remaining: 9.57s
40:	learn: 0.1158423	total: 17.1s	remaining: 3.75s
49:	learn: 0.1060319	total: 19.9s	remaining: 0us
0:	learn: 0.3644701	total: 423ms	remaining: 20.7s
20:	learn: 0.1495252	total: 8.9s	remaining: 12.3s
40:	learn: 0.1176080	total: 15.9s	remaining: 3.48s
49:	learn: 0.1073704	total: 20.4s	remaining: 0us
0:	learn: 0.3652083	total: 423ms	remaining: 20.7s
20:	learn: 0.1526434	total: 7.03s	remaining: 9.7s
40:	learn: 0.1180618	total: 15.5s	remaining: 3.41s
49:	learn: 0.1069502	total: 19s	remaining: 0us
0:	learn: 0.3643950	total: 411ms	remaining: 20.1s
20:	learn: 0.1500231	total: 8.91s	remaining: 12.3s
40:	learn: 0.1141776	total: 15.5s	remaining: 3.41s
49:	learn: 0.1038057	total: 20.2s	remaining: 0us

### BERT toxic-comment-model

In [None]:
!pip install tqdm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# let's create a smaller sample of the dataset for faster inference
df_bert = df.sample(3000, random_state=42)
df_bert['toxic'].value_counts()

0    2688
1     312
Name: toxic, dtype: int64

In [None]:
# defining features
target = df_bert["toxic"]
features = df_bert["clean"]

In [None]:
# splitting data
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.25, random_state=42,
                                                                            stratify=target)
print(features_train.shape)
print(target_train.shape)
print(features_test.shape)
print(target_test.shape)

(2250,)
(2250,)
(750,)
(750,)


In [None]:
%%time

# loading the pretrained model and tokenizer
model_name = "martin-ha/toxic-comment-model"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding=True,
                       truncation=True,
                       add_special_tokens=True,
                       return_tensors="pt")
model = AutoModelForSequenceClassification.from_pretrained(model_name)

pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer)

target_pred = []

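# the model is used as-is (no fine-tuning), so we only run inference;
# note that text[:512] keeps the first 512 characters, not 512 tokens
for text in tqdm(features_train):
    pred = pipeline(text[:512])
    target_pred.append(1 if pred[0]['label'] == 'toxic' else 0)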


Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
100%|██████████| 2250/2250 [04:17<00:00,  8.73it/s]

CPU times: user 3min 55s, sys: 1.43 s, total: 3min 56s
Wall time: 4min 23s





In [None]:
# let's check the class balance in the predictions
pd.Series(target_pred).value_counts()

0    2115
1     135
dtype: int64

In [None]:
# calculating F1 on the 2,250 comments scored above (the pretrained model is not fine-tuned here)
f1_bert = f1_score(target_train, target_pred)
print("F1:", f1_bert)

F1: 0.6178861788617886
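
For each input, the pipeline returns a list of dictionaries with a `label` and a `score`, and the loop above maps the label of the first entry to 1 or 0. The sketch below isolates that mapping. The helper `labels_to_binary` is hypothetical, and the mock predictions and their scores are made up so that the example runs without downloading the model.

```python
from typing import Dict, List, Union

Prediction = Dict[str, Union[str, float]]


def labels_to_binary(predictions: List[Prediction], positive_label: str = "toxic") -> List[int]:
    """Map pipeline-style predictions to 1 (toxic) or 0 (anything else)."""
    return [1 if pred["label"] == positive_label else 0 for pred in predictions]


# hypothetical outputs in the {'label': ..., 'score': ...} shape; scores are made up
mock_predictions = [
    {"label": "non-toxic", "score": 0.98},
    {"label": "toxic", "score": 0.91},
    {"label": "non-toxic", "score": 0.75},
]
binary = labels_to_binary(mock_predictions)
print(binary)  # [0, 1, 0]
```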


### Training results

In [None]:
data = {"Model": ["LogisticRegression", "CatboostClassifier", "BERT"], "F1 score": [f1_lr, f1_cat, f1_bert]}
results = pd.DataFrame(data)
results

Unnamed: 0,Model,F1 score
0,LogisticRegression,0.766283
1,CatboostClassifier,0.655702
2,BERT,0.617886


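Logistic Regression shows the best results: its cross-validated F1 of 0.766 is the only score above the 0.75 target. The scores are not strictly comparable, though: CatBoost was tuned on a 10% subsample, and the pretrained BERT model was evaluated without fine-tuning on 2,250 comments from a 3,000-comment sample.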

### Testing

In [None]:
# getting predictions
predictions_test = best_model_lr.predict(features_test_tfidf)

In [None]:
# calculating f1
f1_final = f1_score(target_test_lr, predictions_test)
print("F1 Score:", f1_final)

F1 Score: 0.7724660030842562


## Summary

In this study, we compared three models for recognizing toxic comments: logistic regression and CatBoost on TF-IDF features, and a pretrained BERT variant. Logistic regression met the target of an F1 score of at least 0.75, scoring 0.77 on the held-out test set.