# Project for Wikishop

The Wikishop online store is launching a new service. Now users can edit and add product descriptions, just like in wiki communities. That is, clients suggest their edits and comment on the changes of others. The store needs a tool that will look for toxic comments and send them for moderation.

Train a model to classify comments as toxic or non-toxic. You have a dataset in which each edit is labeled for toxicity.

Build a model with an *F1* score of at least 0.75.

**Instructions for completing the project**

1. Download and prepare data.
2. Train different models.
3. Draw conclusions.

It is not necessary to use *BERT* to complete the project, but you can try.

**Description of data**

The data is in the file `toxic_comments.csv`. The *text* column in it contains the text of the comment, and *toxic* is the target attribute.

## Preparation

In [1]:
#!pip install spacy

In [2]:
import pandas as pd
import numpy as np
import re
import nltk
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from lightgbm import LGBMClassifier
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from tqdm.notebook import tqdm
tqdm.pandas()

In [3]:
data = pd.read_csv('datasets/toxic_comments.csv')
data.info()
display(data.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB

|   | Unnamed: 0 | text | toxic |
|---|---|---|---|
| 0 | 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | 4 | You, sir, are my hero. Any chance you remember... | 0 |


There are no missing values in the data. Next, we lemmatize and clean the text.

In [4]:
!python3 -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [5]:
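def lemmatize_text(text):
    """Lowercase the text, lemmatize it with spaCy, and keep only Latin letters."""
    text = text.lower()
    doc = nlp(text)
    lemm_text = " ".join([token.lemma_ for token in doc])
    # replace every character that is not a Latin letter with a space
    cleared_text = re.sub(r'[^a-zA-Z]', ' ', lemm_text)
    # collapse repeated whitespace into single spaces
    return " ".join(cleared_text.split())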

In [6]:
#Let's check that lemmatization works correctly
sentence1 = "The striped bats are hanging on their feet for best"
sentence2 = "you should be ashamed of yourself went worked"
df_my = pd.DataFrame([sentence1, sentence2], columns = ['text'])
print(df_my)


print(df_my['text'].apply(lemmatize_text))

                                                text
0  The striped bats are hanging on their feet for...
1      you should be ashamed of yourself went worked
0    the stripe bat be hang on their foot for good
1        you should be ashamed of yourself go work
Name: text, dtype: object
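The cleaning step inside `lemmatize_text` can be looked at in isolation, without spaCy. The sketch below is only an illustration: the helper name `clean_letters_only` and the sample string are made up for this example and are not part of the project code.

```python
import re


def clean_letters_only(text: str) -> str:
    """Replace every non-Latin-letter character with a space, then collapse whitespace."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)
    return " ".join(letters_only.split())


# Hypothetical sample with punctuation, a newline and digits
sample = "d'aww! he matches\nthis colour, 100%"
print(clean_letters_only(sample))  # d aww he matches this colour
```

Note that apostrophes are treated like any other punctuation, so a token such as `d'aww` is split into two fragments.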


In [7]:
data['lemm_text'] = data['text'].progress_apply(lemmatize_text)


Let's explore class balance:

In [8]:
data['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64

The classes are strongly imbalanced: only about 10% of the comments (16,186 of 159,292) are toxic. This needs to be taken into account when building the models.
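To get a feel for the numbers, the sketch below computes the share of toxic comments and the per-class weights that the `'balanced'` heuristic in scikit-learn produces, `n_samples / (n_classes * count)`. It uses the counts for the whole dataset from the output above; in the notebook the weights are derived from the training sample only, so the actual values will differ slightly.

```python
# Class counts taken from the value_counts() output above
counts = {0: 143106, 1: 16186}

n_samples = sum(counts.values())
n_classes = len(counts)

toxic_share = counts[1] / n_samples

# 'balanced' heuristic: weight_c = n_samples / (n_classes * count_c)
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

print(f"toxic share: {toxic_share:.3f}")
for label, weight in balanced_weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

With these counts, an error on a toxic comment costs roughly nine times as much as an error on a non-toxic one.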

## Training

In [9]:
# select the target and split the data into training, validation, and test samples
target = data['toxic']
features = data['lemm_text']
# hold out 20% of the data as the test sample
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=12345)
# from the remaining 80%, hold out 25% (20% of all data) as the validation sample
features_train, features_valid, target_train, target_valid = train_test_split(
    features_train, target_train, test_size=0.25, random_state=12345)

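nltk.download('stopwords')
# pass the stop words as a list: recent scikit-learn versions reject a set here
stopwords = list(nltk_stopwords.words('english'))
count_tf_idf = TfidfVectorizer(stop_words=stopwords)
# fit the vectorizer on the training sample only, then transform both samples
count_tf_idf.fit(features_train)
features_train = count_tf_idf.transform(features_train)
features_valid = count_tf_idf.transform(features_valid)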


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/artembonchuk/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
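The vectorizer is fitted on the training sample and only applied to the other samples. The pure-Python sketch below shows what that means, assuming the default `TfidfVectorizer` weighting (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation). It is a simplification: tokens are split on whitespace, stop words are not removed, and the function names and toy corpus are made up for illustration.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def transform(doc, idf):
    """Map one document to an L2-normalised TF-IDF vector; unseen terms are dropped."""
    term_counts = Counter(term for term in doc.split() if term in idf)
    weights = {term: count * idf[term] for term, count in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


train_docs = ["you be great", "you be rude", "thank you"]  # toy corpus, made up
idf = fit_idf(train_docs)

# 'awful' never occurred in the training documents, so it gets no feature
print(transform("you be awful", idf))
```

Because the vocabulary and IDF weights come from the training documents alone, the validation and test samples do not influence the features.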


Let's train a logistic regression model that accounts for class imbalance.

In [10]:
model=LogisticRegression(random_state=12345,class_weight='balanced')
model.fit(features_train,target_train)
predictions = model.predict(features_valid)

f1 = f1_score(target_valid, predictions)
print('Logistic regression on validation set:', f1)

Logistic regression on validation set: 0.7408412483039348
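For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it from confusion counts; the function name and the counts are hypothetical and are not taken from the notebook.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive class from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 toxic comments caught, 40 false alarms, 20 missed
print(round(f1_from_counts(tp=80, fp=40, fn=20), 3))
```

True negatives do not appear in the formula, so the large non-toxic class cannot inflate the score the way it would inflate accuracy.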


Let's tune a hyperparameter: the inverse regularization strength `C`.

In [11]:
best_score = 0
best_c = 0
for i in [1, 10, 50, 100, 200]:
    model = LogisticRegression(C=i,random_state=12345,class_weight='balanced')
    model.fit(features_train, target_train)
    prediction = model.predict(features_valid)
    f1=f1_score(target_valid,prediction)
    if f1 > best_score:
        best_score = f1
        best_c = i
print('Logistic regression on validation set:',best_score)
print(best_c)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Logistic regression on validation set: 0.7566007791083538
10




With `C=10` the validation F1 improved from 0.741 to 0.757.
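As a reminder of what `C` controls, the L2-regularised logistic regression objective can be written roughly as follows. This is a sketch in the form commonly given for scikit-learn, with labels $y_i \in \{-1, 1\}$, weight vector $w$, and intercept $b$; the per-sample factor $s_i$ stands for the class weight and is a notational assumption made here.

$$
\min_{w,\, b} \;\; \frac{1}{2} w^\top w \;+\; C \sum_{i=1}^{n} s_i \,\log\!\left(1 + \exp\!\left(-y_i \left(x_i^\top w + b\right)\right)\right)
$$

A larger `C` puts more weight on the data-fit term relative to the penalty $\frac{1}{2} w^\top w$, which means weaker regularisation.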

Let's train a LightGBM model that accounts for class imbalance.

In [12]:
model_gbm=LGBMClassifier(random_state=12345,class_weight='balanced')
model_gbm.fit(features_train,target_train)
predictions = model_gbm.predict(features_valid)

f1 = f1_score(target_valid, predictions)
print('LightGBM on validation set:', f1)

LightGBM on validation set: 0.7391365888181174


The result is not good enough. Let's tune the number of trees (`n_estimators`, default 100) and the number of leaves (`num_leaves`, default 31).

In [13]:
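# note: class_weight is not set here, so the search runs without class balancing
model_gbm = LGBMClassifier(random_state=12345)
params = {
    'n_estimators': [100, 200, 300],
    'num_leaves': [21, 41],
}
# grid search with cross-validation on the training sample, scored by F1
grid_gbm = GridSearchCV(model_gbm,
                        params,
                        n_jobs=-1,
                        scoring='f1')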


In [14]:
grid_gbm.fit(features_train, target_train)

grid_gbm_best_score = grid_gbm.best_score_ 
grid_gbm_best_params = grid_gbm.best_params_
print(f'best_score: {grid_gbm_best_score}')
print(f'best_params: {grid_gbm_best_params}')

best_score: 0.7715078293138841
best_params: {'n_estimators': 300, 'num_leaves': 41}
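The size of this search can be estimated by enumerating the grid by hand. The sketch below assumes 5-fold cross-validation, the default `cv` of `GridSearchCV` in recent scikit-learn releases; `cv_folds` is a name introduced only for this illustration.

```python
from itertools import product

params = {
    'n_estimators': [100, 200, 300],
    'num_leaves': [21, 41],
}
cv_folds = 5  # assumption: default cv of GridSearchCV in recent scikit-learn releases

# every combination of the listed values is one candidate model
candidates = [dict(zip(params, values)) for values in product(*params.values())]

print(len(candidates), "candidates,", len(candidates) * cv_folds, "fits in total")
for candidate in candidates:
    print(candidate)
```

With the default `refit=True`, one more fit follows on the whole training sample with the best parameters. Note that `best_score_` is the mean cross-validated score of the best candidate, so it is not directly comparable to a score on the separate validation sample.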


Let's also train a random forest model that accounts for class imbalance and tune its hyperparameters.

In [15]:

best_score = 0
best_depth = 0
best_est = 0
for depth in range(2, 15, 2):
    for est in range(51, 251, 50):
        model = RandomForestClassifier(n_estimators=est, max_depth=depth, random_state=12345,class_weight='balanced')
        model.fit(features_train, target_train)
        prediction = model.predict(features_valid)
        f1=f1_score(target_valid,prediction)
        if f1 > best_score:
            best_score = f1
            best_depth = depth
            best_est = est
print('Random forest model on validation set:',best_score)
print(best_depth)
print(best_est)

Random forest model on validation set: 0.3775293229988339
14
201


The LGBMClassifier model shows the best result (a cross-validated F1 of 0.772), so let's check it on the test sample.

In [16]:
#vectorize the text using tf_idf trained on the training set
features_test = count_tf_idf.transform(features_test)
#train and test the model taking into account the selected hyperparameters and class imbalance
model_gbm=LGBMClassifier(n_estimators=300, num_leaves=41, random_state=12345,class_weight='balanced')
model_gbm.fit(features_train,target_train)
predictions = model_gbm.predict(features_test)

f1 = f1_score(target_test, predictions)
print('LightGBM on test dataset:', f1)

LightGBM on test dataset: 0.7637448501207558


## Conclusions

After preprocessing the text, we trained LogisticRegression, LGBMClassifier, and RandomForestClassifier models. Preprocessing revealed a strong class imbalance (about 10% of the comments are toxic), which was taken into account when building the models. After hyperparameter tuning, LogisticRegression reached F1 = 0.757 on the validation set, and LGBMClassifier reached a mean cross-validated F1 of 0.772 in the grid search on the training set. RandomForestClassifier reached only F1 = 0.378 on the validation set. LGBMClassifier showed the best result, so it was evaluated on the test set, where it reached F1 = 0.764.

Result: the LGBMClassifier model (`n_estimators=300`, `num_leaves=41`, `class_weight='balanced'`) can be recommended for classifying comments. It reaches F1 = 0.764 on the test set, above the required 0.75.