# Project: classification of reviews for an online store

The store needs a tool that finds toxic comments left by users in the store's product descriptions and sends them for moderation.

*Project goal:* Train a model to classify comments as positive or negative (toxic).

We have a dataset in which each edit is labeled as toxic or not.

The F1 score must be above 0.75.

*Data Description*

The `text` column contains the text of the comment, and `toxic` is the target attribute.

**1. Loading data and libraries**

In [59]:
from transformers import pipeline
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
from sklearn.metrics import f1_score

In [60]:
df_comm = pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv')

In [61]:
df_comm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB
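
The `Unnamed: 0` column in the output above looks like a saved row index rather than a feature. The following is a minimal sketch of reading such a file with `index_col=0` and checking the class balance. It uses a tiny in-memory CSV as a stand-in for the real file; the three sample rows and the names `sample_csv`, `sample` and `class_share` are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics the layout of toxic_comments.csv.
sample_csv = io.StringIO(
    ",text,toxic\n"
    "0,Thanks for the quick delivery,0\n"
    "1,You are an idiot,1\n"
    "2,Works as described,0\n"
)

# index_col=0 turns the unnamed first column into the index
# instead of keeping it as an 'Unnamed: 0' data column.
sample = pd.read_csv(sample_csv, index_col=0)

# Share of each class in the target column.
class_share = sample["toxic"].value_counts(normalize=True)

print(list(sample.columns))
print(class_share)
```

Knowing the share of toxic comments helps when interpreting the F1 score later on.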


We will use a pretrained language model to predict the toxicity of each comment.

In [62]:
X = df_comm['text']
y = df_comm['toxic']


In [63]:
pipe = pipeline(model='minuva/MiniLMv2-toxic-jigsaw', task='text-classification', max_length=512, truncation=True)

In [64]:
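def predict(text):
    # The pipeline returns a list with one {'label', 'score'} dict for the input text.
    return pipe(text)

tqdm.pandas()
# Store the results under a different name so the function is not overwritten.
pred = X.progress_apply(predict)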

  0%|          | 0/159292 [00:00<?, ?it/s]
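
Calling the model once per comment is simple but slow on roughly 159 thousand rows. A possible alternative is to send the texts in chunks. The sketch below assumes that the classifier accepts a list of strings and returns one dictionary per text, which is how a text-classification pipeline is expected to behave. `predict_in_batches` and `fake_classify` are hypothetical helpers, and the stand-in classifier exists only so that the example runs without downloading a model.

```python
from typing import Callable, Iterable, List


def predict_in_batches(texts: Iterable[str],
                       classify: Callable[[List[str]], list],
                       batch_size: int = 32) -> list:
    """Run `classify` on consecutive chunks of `texts` and concatenate the results."""
    texts = list(texts)
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        results.extend(classify(batch))
    return results


def fake_classify(batch: List[str]) -> list:
    """Stand-in classifier (assumption): one pipeline-style dict per input text."""
    return [{"label": "toxic", "score": 0.9 if "idiot" in t else 0.1} for t in batch]


demo = predict_in_batches(["nice product", "you idiot", "ok"], fake_classify, batch_size=2)
print(demo)
```

In the notebook, `classify` would be `pipe` and `texts` would be `X`. The result is already a flat list, so no unpacking step is needed.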

In [65]:
pred_unpacked = [item for sublist in pred for item in sublist]

df1 = pd.DataFrame(pred_unpacked)

In [66]:
df1.head()

   label     score
0  toxic  0.002888
1  toxic  0.002521
2  toxic  0.004538
3  toxic  0.002215
4  toxic  0.003816


Now we have a predicted toxicity probability for every comment in the dataset.
We will treat comments with a probability above 0.5 as toxic.

In [67]:
# 1 if the predicted score is above 0.5, otherwise 0
df1['toxic'] = (df1['score'] > 0.5).astype(int)
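
The thresholding rule above can be illustrated on a few toy predictions shaped like the pipeline output. This is a sketch only: `to_binary` is a hypothetical helper and the scores are chosen to show the boundary case. Because the comparison is strict, a score of exactly 0.5 is not counted as toxic.

```python
THRESHOLD = 0.5  # same cut-off as in the notebook


def to_binary(predictions, threshold=THRESHOLD):
    """Map pipeline-style dicts to 1 (toxic) or 0 (not toxic) by score."""
    return [1 if p["score"] > threshold else 0 for p in predictions]


# Toy scores for illustration only.
toy_predictions = [
    {"label": "toxic", "score": 0.002888},
    {"label": "toxic", "score": 0.5},
    {"label": "toxic", "score": 0.97},
]

toy_labels = to_binary(toy_predictions)
print(toy_labels)  # [0, 0, 1]
```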

Now we compute the F1 score of the model's predictions.

In [68]:
pred1 = df1['toxic']

In [69]:
f1_score(y, pred1)

0.9090797658811632

The F1 score is 0.909, well above the 0.75 target. The model performs very well, and we can recommend using it in practice.
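
To make the metric concrete, here is a small sketch that computes F1 by hand as the harmonic mean of precision and recall for the positive (toxic) class. The helper `f1_from_labels` and the toy labels are hypothetical; on real data the notebook's `f1_score` call does the same job.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 2 true positives, 1 false positive, 1 false negative.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]

toy_f1 = f1_from_labels(toy_true, toy_pred)
print(round(toy_f1, 3))  # 0.667
```

Here precision and recall are both 2/3, so F1 is also 2/3. The five correctly predicted negatives do not enter the formula at all.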

# Conclusions

In line with the project goal, to obtain a model that identifies toxic comments on the store's website, we loaded the necessary libraries and the dataset and carried out an initial assessment of the data.

We used a language model pretrained on toxic comments to get the best possible result.

Comments with a score above 0.5 were treated as toxic; the resulting F1 score was 0.909 (90.9%).

Thus, pretrained language models perform very well on this task and can be used to flag toxic comments.