# Comment classification

**Customer** is an online store launching a new service that lets users edit and extend product descriptions, as in wiki communities. The store needs a tool that finds toxic comments and sends them for moderation.

**The main goal of the project** is to build a model that classifies comments as toxic or non-toxic. We have a dataset in which edits are labeled for toxicity.

**Description of the data**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target attribute.

## Data preprocessing

In [19]:
import pandas as pd
import numpy as np
import re

import nltk
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

In [22]:
data = pd.read_csv('/content/toxic_comments.csv')
data.head()

| | Unnamed: 0 | text | toxic |
|---|---|---|---|
| 0 | 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | 4 | You, sir, are my hero. Any chance you remember... | 0 |


In [23]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


Let's remove the column with the duplicated index.

In [3]:
data.drop(columns=['Unnamed: 0'], inplace=True)

Before training the models, the text needs preprocessing: convert it to lowercase, strip punctuation and other non-letter characters, remove stop words, reduce each word to its base form (lemmatization), and convert the result to vectors.

In [4]:
# stop words
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [5]:
# cleaning
def clear_text(text, stopwords):
    words = re.sub(r'[^a-z]', ' ', text.lower()).split()
    words_without_stopwords = [word for word in words if word not in stopwords]

    return ' '.join(words_without_stopwords)
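To see what the cleaning step does in isolation, here is a minimal sketch that applies the same logic to one comment. The tiny `demo_stopwords` set is an assumption made for illustration; the notebook itself uses the full NLTK English list.

```python
import re

def clear_text(text, stopwords):
    # Lowercase, replace every non-letter with a space, then split on whitespace
    words = re.sub(r'[^a-z]', ' ', text.lower()).split()
    return ' '.join(word for word in words if word not in stopwords)

# Hypothetical mini stop-word set, used only for this demo
demo_stopwords = {'i', 'm', 'he', 'this', 'the'}

cleaned = clear_text(
    "D'aww! He matches this background colour I'm seemingly stuck with.",
    demo_stopwords,
)
print(cleaned)
```

Replacing the apostrophe with a space splits contractions into fragments such as `d` and `m`, so these survive unless the stop-word set happens to contain them.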

In [6]:
# lemmatization
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):

    tokens = word_tokenize(text)                                   # tokenization
    lemm_list = [lemmatizer.lemmatize(token) for token in tokens]  # lemmatization
    lemm_text = ' '.join(lemm_list)

    return lemm_text

In [8]:
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [9]:
# check
text = data['text'][3]
print(text)
print()
lemmatize_text(clear_text(text, stopwords))

"
More
I can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of ""types of accidents""  -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.

There appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport  "



'make real suggestion improvement wondered section statistic later subsection type accident think reference may need tidying exact format ie date format etc later one else first preference formatting style reference want please let know appears backlog article review guess may delay reviewer turn listed relevant form eg wikipedia good article nomination transport'

In [10]:
%%time

data['lemm_text'] = data['text'].apply(lambda x: lemmatize_text(clear_text(x, stopwords)))
data.head()

CPU times: user 1min 11s, sys: 269 ms, total: 1min 12s
Wall time: 1min 16s


| | text | toxic | lemm_text |
|---|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 | explanation edits made username hardcore metal... |
| 1 | D'aww! He matches this background colour I'm s... | 0 | aww match background colour seemingly stuck th... |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 | hey man really trying edit war guy constantly ... |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 | make real suggestion improvement wondered sect... |
| 4 | You, sir, are my hero. Any chance you remember... | 0 | sir hero chance remember page |


Splitting the data:

In [11]:
data_train, data_test = train_test_split(data, test_size=0.25, random_state=123)
data_train.shape, data_test.shape

((119469, 3), (39823, 3))
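The split above is random and does not fix the share of toxic comments in each part. If the classes are imbalanced (the later use of `class_weight='balanced'` suggests this was a concern), one option is the `stratify` argument of `train_test_split`. A minimal sketch on hypothetical toy data, where the 80/20 label ratio is an assumption and not the real dataset's ratio:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 comments, 20 of them labeled toxic
toy_texts = [f'comment {i}' for i in range(100)]
toy_labels = [0] * 80 + [1] * 20

texts_train, texts_test, y_train, y_test = train_test_split(
    toy_texts, toy_labels,
    test_size=0.25, random_state=123,
    stratify=toy_labels,  # keep the toxic share equal in both parts
)

# Share of toxic comments in each part
print(sum(y_train) / len(y_train), sum(y_test) / len(y_test))
```

In the notebook this would correspond to passing `stratify=data['toxic']`, which would change the split and therefore the scores reported here.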

In [12]:
# train set
corpus_train = data_train['lemm_text'].values             # texts
target_train = data_train['toxic'].values                 # labels

# test set
corpus_test = data_test['lemm_text'].values               # texts
target_test = data_test['toxic'].values                   # labels

Converting the texts to TF-IDF vectors (the vectorizer is fitted on the training set only):

In [13]:
count_tf_idf = TfidfVectorizer()

In [14]:
%%time

count_tf_idf.fit(corpus_train)

train_tf_idf = count_tf_idf.transform(corpus_train)
test_tf_idf = count_tf_idf.transform(corpus_test)

CPU times: user 11 s, sys: 141 ms, total: 11.2 s
Wall time: 11.3 s
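For intuition about what the vectorizer computes, the sketch below reproduces the weighting by hand under `TfidfVectorizer`'s default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. It is a simplification: tokens are split on whitespace, whereas the default scikit-learn tokenizer also drops single-character tokens. The function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    docs = [doc.split() for doc in corpus]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {term: tf[term] * idf[term] for term in tf}
        # Avoid division by zero for empty documents
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

vectors = tfidf_vectors(['edit war page', 'page page hero'])
print(vectors[0])
```

In the first document, `page` gets a lower weight than `edit` because it also occurs in the second document and therefore has a smaller IDF.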


## Model training

Let's try two algorithms for this task: *logistic regression* and *random forest*.

In [15]:
%%time

# Logistic regression
model = LogisticRegression(class_weight='balanced', random_state=123, max_iter=500)
params = {'solver': ['newton-cg', 'liblinear', 'saga']}

grid = GridSearchCV(model, params, cv=5, scoring='f1')
grid.fit(train_tf_idf, target_train)
lr_model = grid.best_estimator_

print(grid.best_params_)
print(grid.best_score_)

{'solver': 'newton-cg'}
0.750300783546443
CPU times: user 3min 48s, sys: 30.1 s, total: 4min 18s
Wall time: 3min 48s
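In the search above, the TF-IDF vocabulary and IDF weights were fitted on the whole training set before cross-validation, so every validation fold had already contributed to the features. One common way to avoid this is to wrap the vectorizer and the classifier in a single `Pipeline`, so the vectorizer is refitted inside each fold. A minimal sketch on a hypothetical toy corpus; the texts, labels, `cv=2`, and the shortened solver grid are assumptions for illustration, not the setup used in this notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; in the notebook this would be corpus_train / target_train
toy_texts = [
    'thanks helpful edit', 'nice article thanks', 'good source thanks', 'helpful page good',
    'stupid idiot edit', 'idiot article stupid', 'hate stupid page', 'idiot hate source',
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(class_weight='balanced', random_state=123, max_iter=500)),
])
# Parameters of a pipeline step are addressed as <step name>__<parameter>
params = {'clf__solver': ['newton-cg', 'saga']}

grid = GridSearchCV(pipeline, params, cv=2, scoring='f1')
grid.fit(toy_texts, toy_labels)  # raw texts go in; TF-IDF is fitted per fold
print(grid.best_params_)
```

The fitted `grid` then accepts raw (already cleaned and lemmatized) strings in `predict`, so no separate `transform` call is needed at test time.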


In [18]:
%%time

# Random Forest
model = RandomForestClassifier(class_weight='balanced', random_state=123)
params = {'n_estimators': [200, 300, 400],
          'max_depth': [7, 9, 11]}

grid = GridSearchCV(model, params, cv=3, scoring='f1')
grid.fit(train_tf_idf, target_train)
rf_model = grid.best_estimator_

print(grid.best_params_)
print(grid.best_score_)

{'max_depth': 11, 'n_estimators': 400}
0.3780097561427285
CPU times: user 30min 49s, sys: 1.88 s, total: 30min 51s
Wall time: 30min 59s
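Both grid searches above pass `class_weight='balanced'`. In scikit-learn this weights each class by `n_samples / (n_classes * count_of_that_class)`, so errors on the rarer class cost more. A small sketch of that formula, using hypothetical toy labels rather than the real class counts:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights as in scikit-learn's 'balanced' mode: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical toy labels: nine non-toxic comments and one toxic comment
toy_labels = [0] * 9 + [1]
weights = balanced_class_weights(toy_labels)
print(weights)
```

With these toy labels, the single toxic example gets a weight of 5.0 and each non-toxic example a weight of about 0.56.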


Logistic regression performed far better in cross-validation (F1 ≈ 0.75 versus ≈ 0.38 for the random forest). Let's check the selected model on the test set:

In [20]:
preds = lr_model.predict(test_tf_idf)
f1_score(target_test, preds)

0.7504366812227075
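By default, the `f1_score` call above reports the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes the same quantity by hand on hypothetical toy labels, to make explicit what the 0.75 measures:

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1), computed from TP, FP and FN counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy labels: 3 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
f1 = f1_binary(y_true, y_pred)
print(f1)
```

True negatives do not enter the formula, which is why F1 is more informative than accuracy when toxic comments are the minority.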

## Conclusion

The goal of this project was to build a model that detects toxic comments. Before training, the text was prepared: non-letter characters and stop words were removed, all words were lemmatized, and the result was converted to TF-IDF vectors.

Logistic regression gave the best result: an F1 score of about 0.75 (just above it in both cases) in cross-validation on the training set and on the test set. The random forest reached only about 0.38 in cross-validation.