# Project description

An online store is launching a new service that lets users edit product descriptions, as in wiki communities: customers suggest their own edits and comment on changes made by others. The store needs a tool that detects toxic comments and sends them for moderation.

Train a model to classify comments as toxic or non-toxic. A dataset of comments with toxicity labels is provided. Build two models with different architectures.

Build a model with an F1-score of at least 0.75.
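
For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The helper below is a minimal sketch of that definition; the function name `f1_from_counts` and the counts passed to it are hypothetical and are not taken from the dataset.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for the positive class from true positives, false positives and false negatives."""
    if tp == 0:
        # No toxic comment was found, so precision and recall are both zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: balanced precision and recall of 0.75 give F1 = 0.75
balanced = f1_from_counts(tp=75, fp=25, fn=25)

# Hypothetical counts: precision 0.9 but recall 0.6 give F1 = 0.72, below the target
skewed = f1_from_counts(tp=90, fp=10, fn=60)

print(round(balanced, 2), round(skewed, 2))
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach 0.75 by being precise alone; it also has to find most of the toxic comments.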

In [1]:
import re
import warnings

import nltk
import numpy as np
import pandas as pd
import spacy
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from tqdm import tqdm

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

## Data preparation

In [2]:
toxic_comments = pd.read_csv('/datasets/toxic_comments.csv', index_col=0)

In [3]:
toxic_comments.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [4]:
toxic_comments['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64
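
The classes are heavily imbalanced, which is why F1 rather than accuracy is the target metric. The short calculation below is a sketch that uses only the two counts printed above; the baseline it describes (labelling every comment as non-toxic) is a hypothetical reference point, not a model trained in this notebook.

```python
# Class counts copied from the value_counts() output above
n_non_toxic, n_toxic = 143106, 16186
total = n_non_toxic + n_toxic

toxic_share = n_toxic / total

# Hypothetical baseline: predict "non-toxic" for every comment
baseline_accuracy = n_non_toxic / total

# The baseline finds no toxic comments, so tp = 0 and F1 for the toxic class is 0
tp, fp, fn = 0, 0, n_toxic
baseline_f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

print(f"toxic share: {toxic_share:.3f}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"baseline F1: {baseline_f1}")
```

About one comment in ten is toxic, so the do-nothing baseline is right roughly 90% of the time while being useless for moderation.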

In [5]:
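# Remove every character that is not a Latin letter or whitespace (digits and punctuation
# included), then turn newlines into spaces. Deleting instead of replacing with a space
# fuses words separated only by punctuation, e.g. "riflemen/Urainian" -> "riflemenurainian".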
pattern = r"[^a-zA-Z\s]"

toxic_comments['text_cleaned'] = toxic_comments['text'].apply(lambda x: re.sub(pattern, '', x).replace('\n', ' ').strip(' '))

In [6]:
# Lowercase the cleaned text
toxic_comments['text_cleaned'] = toxic_comments['text_cleaned'].str.lower()
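
To make the effect of the cleaning step concrete, here is a small sketch on a made-up string. `clean_delete` mirrors the two cells above; `clean_space` is a hypothetical variant (not used in this notebook) that replaces removed characters with a space so that neighbouring words stay separate.

```python
import re

PATTERN = r"[^a-zA-Z\s]"  # same pattern as in the notebook


def clean_delete(text):
    """Notebook behaviour: delete non-letters, turn newlines into spaces, lowercase."""
    return re.sub(PATTERN, "", text).replace("\n", " ").strip(" ").lower()


def clean_space(text):
    """Hypothetical variant: replace non-letters with a space, then collapse whitespace."""
    spaced = re.sub(PATTERN, " ", text)
    return re.sub(r"\s+", " ", spaced).strip().lower()


sample = "Sich riflemen/Ukrainian\nI'm here!"  # made-up example string

print(clean_delete(sample))  # sich riflemenukrainian im here
print(clean_space(sample))   # sich riflemen ukrainian i m here
```

Whether the variant helps the final score is untested here; it mainly avoids creating fused tokens that `min_df` would later discard as rare words.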

### Text lemmatization

In [7]:
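# Load the small English pipeline; the parser and NER are disabled because only lemmas are needed
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Lemmatize each cleaned comment and join the lemmas back into one string
filtered = []
for text in tqdm(toxic_comments['text_cleaned']):
    doc = nlp(text)
    filtered.append(" ".join(token.lemma_ for token in doc))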

100%|██████████| 159292/159292 [15:52<00:00, 167.24it/s]


In [8]:
toxic_comments['text_filtered'] = filtered
toxic_comments.head()

Unnamed: 0,text,toxic,text_cleaned,text_filtered
0,Explanation\nWhy the edits made under my usern...,0,explanation why the edits made under my userna...,explanation why the edit make under my usernam...
1,D'aww! He matches this background colour I'm s...,0,daww he matches this background colour im seem...,daww he match this background colour I m seemi...
2,"Hey man, I'm really not trying to edit war. It...",0,hey man im really not trying to edit war its j...,hey man I m really not try to edit war its jus...
3,"""\nMore\nI can't make any real suggestions on ...",0,more i cant make any real suggestions on impro...,more I can not make any real suggestion on imp...
4,"You, sir, are my hero. Any chance you remember...",0,you sir are my hero any chance you remember wh...,you sir be my hero any chance you remember wha...


Typos and proper names are inevitable in the comments. I remove such rare words with the min_df argument of the TF-IDF vectorizer. Note that min_df is a document frequency: with min_df=5, a word is dropped if it appears in fewer than 5 comments, which matches "occurs fewer than 5 times" only under the assumption that a word appears about once per comment. This threshold leaves between 20,000 and 30,000 unique words, which, as far as I know, is roughly the active vocabulary of an average English speaker.
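
The difference between document frequency and raw word count can be checked on a toy corpus. The three documents below are made up for illustration; only `TfidfVectorizer` and its `min_df` argument come from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "spam" occurs five times but in one document only,
# "ham" occurs twice, once in each of two documents
docs = [
    "spam spam spam spam spam",
    "ham and eggs",
    "ham on toast",
]

# Keep only words that appear in at least 2 documents
vec_min2 = TfidfVectorizer(min_df=2).fit(docs)

# Default min_df=1 keeps every word
vec_all = TfidfVectorizer().fit(docs)

print(sorted(vec_min2.vocabulary_))  # ['ham']
print(len(vec_all.vocabulary_))      # 6
```

"spam" is dropped despite its five occurrences, because `min_df` counts the documents that contain a word rather than the occurrences themselves.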

### Stopwords

In [9]:
nltk.download('stopwords')
stopwords = set(nltk_stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [10]:
# TfidfVectorizer documents stop_words as a list, so convert the set
stopwords = list(stopwords)

### Train-test split

In [11]:
train, test = train_test_split(toxic_comments, test_size=.35, random_state=42, stratify=toxic_comments['toxic'])
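
The `stratify` argument keeps the share of toxic comments the same in both parts of the split, which matters with roughly 10% positives. Below is a minimal sketch on synthetic labels (1,000 made-up rows with a 10% positive rate, not the real data) using the same `test_size` and `random_state`.

```python
from sklearn.model_selection import train_test_split

# Synthetic labels: 100 positives and 900 negatives (illustrative only)
labels = [1] * 100 + [0] * 900
rows = list(range(len(labels)))

train_rows, test_rows, y_tr, y_te = train_test_split(
    rows, labels, test_size=0.35, random_state=42, stratify=labels
)

# Share of positives in each part of the split
train_share = sum(y_tr) / len(y_tr)
test_share = sum(y_te) / len(y_te)

print(len(y_tr), len(y_te))
print(round(train_share, 2), round(test_share, 2))
```

Without stratification the test share of the minority class would vary from split to split, and the F1 estimate with it.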

In [12]:
train.head()

Unnamed: 0,text,toxic,text_cleaned,text_filtered
70971,Here is a short list of popular flood geology ...,0,here is a short list of popular flood geology ...,here be a short list of popular flood geology ...
73618,"""\n\nI noticed the comment about the IOC; I di...",0,i noticed the comment about the ioc i discount...,I notice the comment about the ioc I discount ...
132475,"""\n\n Super Bowl halftime \n\nBro, I couldn't ...",0,super bowl halftime bro i couldnt help but n...,super bowl halftime bro I could not help bu...
83502,"""\n\nThe word """"airbases"""" appears in the arti...",0,the word airbases appears in the article histo...,the word airbase appear in the article history...
140265,"""\n\n Sich riflemen/Urainian Sich Riflemen/Ukr...",0,sich riflemenurainian sich riflemenukranian so...,sich riflemenurainian sich riflemenukranian so...


### TF-IDF calculation

In [13]:
# Fit the vocabulary and IDF weights on the training set only, then reuse them for the test set
count_tf_idf = TfidfVectorizer(stop_words=stopwords, lowercase=True, min_df=5)

X_train = count_tf_idf.fit_transform(train['text_filtered'])
X_test = count_tf_idf.transform(test['text_filtered'])

y_train = train['toxic']
y_test = test['toxic']

In [14]:
X_train.shape

(103539, 22049)

## Model training

### Logistic regression (with parameter search)

In [15]:
param_grid = {
    'C': [0.1, 1],
}


grid_search_lr = GridSearchCV(LogisticRegression(), param_grid, cv=3, scoring='f1', verbose=2)


grid_search_lr.fit(X_train, y_train)


print("Best cross-validation score: {:.2f}".format(grid_search_lr.best_score_))

Fitting 3 folds for each of 2 candidates, totalling 6 fits
[CV] END ..............................................C=0.1; total time=  10.1s
[CV] END ..............................................C=0.1; total time=   7.7s
[CV] END ..............................................C=0.1; total time=   9.0s
[CV] END ................................................C=1; total time=  20.2s
[CV] END ................................................C=1; total time=  18.7s
[CV] END ................................................C=1; total time=  19.5s
Best cross-validation score: 0.71
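
In the search above the TF-IDF matrix was built once on the whole training set, so each validation fold has already influenced the vocabulary and IDF weights. The effect is probably small, but one way to avoid it is to put the vectorizer and the classifier into a single `Pipeline` and search over that. The sketch below shows the idea on a made-up toy corpus; the texts, the wider `C` grid, and the `class_weight` option are my assumptions and were not part of the run reported here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in for the lemmatized comments (made-up data, for illustration only)
texts = [
    "you be a stupid idiot", "thank you for the helpful edit",
    "shut up you idiot", "please add a source for this claim",
    "stupid edit by a stupid user", "I agree with the change above",
    "what an idiot move", "the article look good now",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# The vectorizer is refit inside every cross-validation fold
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# Step-prefixed names route each parameter to the right pipeline step
param_grid = {
    "clf__C": [0.1, 1, 10],
    "clf__class_weight": [None, "balanced"],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1")
search.fit(texts, labels)

print(search.best_params_)
```

On the real data the pipeline would take `train['text_filtered']` directly instead of `X_train`, at the cost of recomputing TF-IDF for every fit.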


### Random forest (with parameter search)

In [16]:
model = RandomForestClassifier(random_state=42)

param_grid = {
    'n_estimators': [50, 450],
    'bootstrap': [True, False]
}


grid_search = GridSearchCV(model, param_grid, cv=3, scoring='f1', verbose=2)


grid_search.fit(X_train, y_train)


print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] END ....................bootstrap=True, n_estimators=50; total time=  53.4s
[CV] END ....................bootstrap=True, n_estimators=50; total time=  52.7s
[CV] END ....................bootstrap=True, n_estimators=50; total time=  54.3s
[CV] END ...................bootstrap=True, n_estimators=450; total time= 8.1min
[CV] END ...................bootstrap=True, n_estimators=450; total time= 8.1min
[CV] END ...................bootstrap=True, n_estimators=450; total time= 7.9min
[CV] END ...................bootstrap=False, n_estimators=50; total time= 1.2min
[CV] END ...................bootstrap=False, n_estimators=50; total time= 1.1min
[CV] END ...................bootstrap=False, n_estimators=50; total time= 1.2min
[CV] END ..................bootstrap=False, n_estimators=450; total time=10.7min
[CV] END ..................bootstrap=False, n_estimators=450; total time=10.5min
[CV] END ..................bootstrap=False, n_est

### Prediction on the test set

In [17]:
# Evaluate the tuned random forest (refit on the full training set) on the held-out test set
y_pred = grid_search.best_estimator_.predict(X_test)
f1_score(y_test, y_pred)

0.7501520989657271

## Conclusion

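The comments were cleaned (non-letter characters removed, text lowercased), lemmatized with spaCy, and converted to TF-IDF features with English stopwords and rare words (min_df=5) removed. This reduced the vocabulary to 22,049 features while keeping the meaning of the text, which made it practical to train the models.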

Logistic regression and random forest models were tuned with a 3-fold grid search. Logistic regression reached a cross-validated F1 of 0.71. The random forest, chosen as the best model, scored F1 = 0.7502 on the test set, just above the required 0.75.