
# Toxic Comments Classification

## Project Description

The online store "Magazinchik Wik" is launching a new service that lets users edit and supplement product descriptions: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive (non-toxic) or negative (toxic). A dataset with toxicity labels for the edits is available.

**Build a model with a quality metric *F1* of at least 0.75.**
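
As a reminder of what the target metric measures, here is a minimal sketch of F1 computed from confusion counts. The function name and the counts are illustrative assumptions, not values from this project.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only
score = f1_from_counts(tp=80, fp=20, fn=30)
print(round(score, 3))  # 0.762
```

Because F1 ignores true negatives, it suits a task where the class of interest (toxic comments) is rare.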

---

**Instructions for the implementation of the project:**

1. Download and prepare data.
2. Train different models.
3. Draw conclusions.

**Data Description**

The data is in the `toxic_comments.csv` file. The *text* column contains the text of the comment, and *toxic* is the target attribute.

## Imports

In [2]:
import re

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier
from tqdm import notebook

# Stop-word list used by the TF-IDF step; nltk.word_tokenize (used later)
# additionally needs NLTK's Punkt tokenizer data to be available
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Dataset Loading & Exploration

In [3]:
df = pd.read_csv('/datasets/toxic_comments.csv')

display(df.tail(10))
df.info()  # prints the summary itself and returns None, so display() is not needed
display(df.describe())

| | Unnamed: 0 | text | toxic |
|---|---|---|---|
| 159282 | 159441 | "\nNo he did not, read it again (I would have ... | 0 |
| 159283 | 159442 | "\n Auto guides and the motoring press are not... | 0 |
| 159284 | 159443 | "\nplease identify what part of BLP applies be... | 0 |
| 159285 | 159444 | Catalan independentism is the social movement ... | 0 |
| 159286 | 159445 | The numbers in parentheses are the additional ... | 0 |
| 159287 | 159446 | ":::::And for the second time of asking, when ... | 0 |
| 159288 | 159447 | You should be ashamed of yourself \n\nThat is ... | 0 |
| 159289 | 159448 | Spitzer \n\nUmm, theres no actual article for ... | 0 |
| 159290 | 159449 | And it looks like it was actually you who put ... | 0 |
| 159291 | 159450 | "\nAnd ... I really don't think you understand... | 0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


| | Unnamed: 0 | toxic |
|---|---|---|
| count | 159292.0 | 159292.0 |
| mean | 79725.697242 | 0.101612 |
| std | 46028.837471 | 0.302139 |
| min | 0.0 | 0.0 |
| 25% | 39872.75 | 0.0 |
| 50% | 79721.5 | 0.0 |
| 75% | 119573.25 | 0.0 |
| max | 159450.0 | 1.0 |


### Conclusions 1.0

- Apart from `Unnamed: 0`, a leftover index column, the dataset contains 2 fields:
    - `text` - the client comment
    - `toxic` - the target label marking whether the comment is toxic (1) or not (0)

- The comments are in English

- There are no missing values

- The classes are imbalanced: the mean of `toxic` is about 0.10, so roughly 10% of comments are toxic

- Duplicates need to be checked

- The data must be split into training and test sets

- The `text` field must be processed before training:
    - convert texts to Unicode strings (`U`)
    - remove extra characters
    - convert all words to lowercase
    - exclude stop words
    - reduce words to a base form (stemming or lemmatization)

## Data Preprocessing

### Check for Duplicates

In [4]:
print('Number of duplicates:', df.duplicated().sum())

Number of duplicates: 0


### `text` field

#### Text Cleaning

We keep only Latin letters and spaces in each comment; every other character is replaced with a space.

In [5]:
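# Function for text cleaning

def clean_text(text):
    # Replace everything except Latin letters and spaces with a space
    letters_only = re.sub(r'[^a-zA-Z ]', ' ', text)
    # Collapse runs of whitespace into single spaces
    return ' '.join(letters_only.split())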

In [6]:
df['text_clean'] = df['text'].apply(clean_text)

df.head(5)

| Unnamed: 0 | text | toxic | text_clean |
|---|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 | Explanation Why the edits made under my userna... |
| 1 | D'aww! He matches this background colour I'm s... | 0 | D aww He matches this background colour I m se... |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 | Hey man I m really not trying to edit war It s... |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 | More I can t make any real suggestions on impr... |
| 4 | You, sir, are my hero. Any chance you remember... | 0 | You sir are my hero Any chance you remember wh... |
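
The cleaned output above shows contractions split into fragments ("I m", "can t"), because the apostrophe is replaced with a space. A possible variant, sketched here under the assumption that dropping apostrophes is acceptable for this task, removes them before the main substitution. The function name is hypothetical, and this variant also lowercases the text.

```python
import re


def clean_text_keep_contractions(text: str) -> str:
    # Drop apostrophes first so "I'm" becomes "Im" rather than "I m"
    no_apostrophes = re.sub(r"['’]", "", text)
    # Replace every remaining non-letter, non-space character with a space
    letters_only = re.sub(r"[^a-zA-Z ]", " ", no_apostrophes)
    # Collapse whitespace and lowercase
    return " ".join(letters_only.split()).lower()


print(clean_text_keep_contractions("D'aww! He matches this background colour I'm"))
# daww he matches this background colour im
```

Switching to this variant would change the vocabulary and therefore the scores reported in this notebook.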


#### Lowercase

In [7]:
df['text_clean'] = df['text_clean'].str.lower()

df.head(5)

| Unnamed: 0 | text | toxic | text_clean |
|---|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 | explanation why the edits made under my userna... |
| 1 | D'aww! He matches this background colour I'm s... | 0 | d aww he matches this background colour i m se... |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 | hey man i m really not trying to edit war it s... |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 | more i can t make any real suggestions on impr... |
| 4 | You, sir, are my hero. Any chance you remember... | 0 | you sir are my hero any chance you remember wh... |


#### Samples Formation

In [8]:
features = df['text_clean']
target = df['toxic']

features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.1, random_state=12345
)
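
The split above is not stratified. With only about 10% toxic comments, a stratified split keeps the class ratio the same in both parts. The sketch below uses toy data (the `texts` and `labels` lists are made up) and the `stratify` parameter of `train_test_split`; applying it to this notebook would change the reported scores.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 100 comments, 10% labelled positive (illustrative only)
texts = [f"comment {i}" for i in range(100)]
labels = [1] * 10 + [0] * 90

x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.1, random_state=12345, stratify=labels
)

# Both parts keep the 10% positive share
print(Counter(y_train), Counter(y_test))
```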

#### Stemming & Format Setting

In [11]:
stemmer = SnowballStemmer('english')

In [12]:
comments_train = features_train.values.astype('U')

for i in notebook.tqdm(range(len(comments_train))):
    words = nltk.word_tokenize(comments_train[i])
    comments_train[i] = ' '.join([stemmer.stem(w) for w in words])


In [13]:
comments_test = features_test.values.astype('U')

for i in notebook.tqdm(range(len(comments_test))):
    words = nltk.word_tokenize(comments_test[i])
    comments_test[i] = ' '.join([stemmer.stem(w) for w in words])
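
The two stemming loops above are identical apart from their input. A small helper, sketched below, avoids the duplication. It assumes plain `str.split()` is an adequate tokenizer because the text already contains only lowercase letters and spaces; results may differ slightly from `nltk.word_tokenize`, which also splits some words such as "cannot". The helper names are hypothetical.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')


def stem_comment(comment: str) -> str:
    # Whitespace tokenization is enough for text made of letters and spaces
    return ' '.join(stemmer.stem(word) for word in comment.split())


def stem_comments(comments):
    # Apply the same stemming to any iterable of comments (train or test)
    return [stem_comment(comment) for comment in comments]


print(stem_comments(['he matches this background colour', 'running dogs']))
```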


#### Stopword Elimination & TF-IDF Calculation

In [14]:
%%time
# %%time must be the first line of the cell to time the whole cell

# TfidfVectorizer expects the stop words as a list
stopwords = nltk_stopwords.words('english')

tf_idf_count = TfidfVectorizer(stop_words=stopwords)
# Fit on the training comments only, so no test-set statistics leak into the features
tf_idf_value = tf_idf_count.fit(comments_train)

train_features = tf_idf_value.transform(comments_train)
test_features = tf_idf_value.transform(comments_test)
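
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch of the formula `TfidfVectorizer` documents for its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document. It is a simplification: it tokenizes on whitespace and skips the stop-word and token-pattern handling of the real vectorizer. The toy documents are made up.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {term: count * idf[term] for term, count in counts.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors


docs = ['edit war edit', 'thank you edit', 'you sir hero']
vectors = tfidf(docs)
print(vectors[1])
```

In the second document, "thank" appears in only one document and so receives a higher weight than "edit" or "you", which each appear in two.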


## Model Training | Cross-validation

### DecisionTreeClassifier

In [15]:
best_model_cross_tree = None
best_depth_cross_tree = 0
best_f1_cross_tree = 0.5

for depth in notebook.tqdm(range(1, 101, 20)):
    
    model_cross_tree = DecisionTreeClassifier(random_state=12345, class_weight='balanced',
                                              max_depth=depth)

    mean_f1_tree = cross_val_score(model_cross_tree, train_features, target_train, cv=5,
                                   scoring='f1').mean()
    
    if mean_f1_tree > best_f1_cross_tree:
        best_model_cross_tree = model_cross_tree
        best_depth_cross_tree = depth
        best_f1_cross_tree = mean_f1_tree
        
print('\033[1m' + 'Model: ' + '\033[0m' + type(best_model_cross_tree).__name__ )
print('\033[1m' + 'F1-score: ' + '\033[0m', best_f1_cross_tree) 
print('max_depth:', best_depth_cross_tree)

Model: DecisionTreeClassifier
F1-score: 0.6527422531088689
max_depth: 61
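
The tree above, like the models that follow, is built with `class_weight='balanced'`. scikit-learn documents this mode as weighting each class by `n_samples / (n_classes * count_of_class)`. The sketch below reproduces that formula in plain Python on made-up labels; the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels with a 90/10 split, similar in spirit to the toxic share in this dataset
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With a 90/10 split, an error on the rare class costs nine times as much as an error on the common one, which pushes the model toward higher recall on toxic comments.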


### RandomForestClassifier

In [17]:
best_model_cross_forest = None
best_est_cross_forest = 0
best_depth_cross_forest = 0
# Baseline: a forest is recorded only if it beats the best decision tree,
# so best_model_cross_forest stays None when no configuration does
best_f1_cross_forest = best_f1_cross_tree

for est in notebook.tqdm(range(1, 11, 1)):
    for depth in range(1, 101, 20):

        model_cross_forest = RandomForestClassifier(random_state=12345, class_weight='balanced',
                                                    n_estimators=est, max_depth=depth)

        mean_f1_forest = cross_val_score(model_cross_forest, train_features, target_train, cv=5,
                                         scoring='f1').mean()
        if mean_f1_forest > best_f1_cross_forest:
            best_model_cross_forest = model_cross_forest
            best_est_cross_forest = est
            best_depth_cross_forest = depth
            best_f1_cross_forest = mean_f1_forest
        
print('\033[1m' + 'Model: ' + '\033[0m' + type(best_model_cross_forest).__name__ )
print('\033[1m' + 'F1-score: ' + '\033[0m', best_f1_cross_forest) 
print('n_estimators:', best_est_cross_forest)
print('max_depth:', best_depth_cross_forest)


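Model: NoneType
F1-score: 0.6527422531088689
n_estimators: 0
max_depth: 0

No forest configuration in this grid exceeded the decision tree baseline, so no model was selected (`NoneType`). The F1-score printed above is the baseline carried over from the decision tree, not a score achieved by a forest.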
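
Because the forest search starts from the decision tree's score, its printed F1 is only a baseline whenever no forest beats it. One way to see each model family's own best result is to start the search from negative infinity. The helper and the scores below are hypothetical stand-ins, not results from this project.

```python
import math


def pick_best(candidates, score_fn):
    """Return (best_candidate, best_score); the candidate is None only if there are no candidates."""
    best_candidate, best_score = None, -math.inf
    for candidate in candidates:
        score = score_fn(candidate)
        if score > best_score:
            best_candidate, best_score = candidate, score
    return best_candidate, best_score


# Hypothetical (n_estimators, max_depth) -> mean F1 values, for illustration only
toy_scores = {(1, 1): 0.12, (5, 41): 0.31, (10, 81): 0.38}
best_params, best_score = pick_best(toy_scores, toy_scores.get)
print(best_params, best_score)  # (10, 81) 0.38
```

Models can then be compared against each other, and against the 0.75 target, after every search has reported its own best score.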


### LogisticRegression

In [20]:
best_model_cross_logit = None
best_C_cross_logit = 0
best_f1_cross_logit = best_f1_cross_tree

for c in notebook.tqdm(np.arange(1, 10, 0.5)):

    model_cross_logit = LogisticRegression(random_state=12345, class_weight='balanced', C=c)

    mean_f1_logit = cross_val_score(model_cross_logit, train_features, target_train, cv=5,
                                    scoring='f1').mean()

    if mean_f1_logit > best_f1_cross_logit:
        best_model_cross_logit = model_cross_logit
        best_C_cross_logit = c
        best_f1_cross_logit = mean_f1_logit

# Report the best model found, not the last one tried
print('\033[1m' + 'Model: ' + '\033[0m' + type(best_model_cross_logit).__name__)
print('\033[1m' + 'F1-score: ' + '\033[0m', best_f1_cross_logit)
print('C:', best_C_cross_logit)

Model: LogisticRegression
F1-score: 0.7651641772954468
C: 6.0
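
The manual search loops above could also be expressed with scikit-learn's `Pipeline` and `GridSearchCV`. Putting the vectorizer inside the pipeline means TF-IDF is refitted on each training fold, whereas above it was fitted once on the whole training set before cross-validation. The sketch below runs on made-up toy comments; the grid values, `cv=2`, and `max_iter=1000` are assumptions chosen to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(class_weight='balanced', random_state=12345, max_iter=1000)),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
search = GridSearchCV(pipeline, param_grid={'clf__C': [1.0, 6.0]}, scoring='f1', cv=2)

# Toy comments and labels, for illustration only
toy_texts = [
    'you are an idiot', 'thank you for the edit', 'stupid idiot go away',
    'nice article thank you', 'idiot stupid fool', 'please see the talk page',
    'you stupid fool', 'good edit thank you',
]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]

search.fit(toy_texts, toy_labels)
print(search.best_params_, search.best_score_)
```

After fitting, `search.best_estimator_` holds the pipeline refitted on all the data passed to `fit`, so it can be applied directly to raw test comments.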


## Models Analysis

In [25]:
models_cross = ['DecisionTreeClassifier', 'RandomForestClassifier', 'LogisticRegression']
f1_results_cross = [best_f1_cross_tree.round(2), best_f1_cross_forest.round(2),
                    best_f1_cross_logit.round(2)]

pd.DataFrame({'Model for Cross Valid': models_cross,
              'F1-score': f1_results_cross}).sort_values(by='F1-score', ascending=False)

| | Model for Cross Valid | F1-score |
|---|---|---|
| 2 | LogisticRegression | 0.77 |
| 0 | DecisionTreeClassifier | 0.65 |
| 1 | RandomForestClassifier | 0.65 |


### Model Testing

Let's take the model with the best cross-validation score, Logistic Regression with C = 6, and evaluate it on the test set.

In [22]:
%%time
# %%time must be the first line of the cell to time the whole cell

model_check = LogisticRegression(random_state=12345, class_weight='balanced', C=6)

model_check.fit(train_features, target_train)
predicted_test = model_check.predict(test_features)
# f1_score expects the true labels first, then the predictions
f1_test_logit = f1_score(target_test, predicted_test)

print('F1-score on the test set:', f1_test_logit.round(2))

F1-score on the test set: 0.76


## Conclusions

* The best F1-score was achieved by **Logistic Regression** with **`class_weight` = balanced** (class balancing) and **C = 6**: 0.77 in cross-validation and 0.76 on the test set, which meets the required **F1-score >= 0.75**


* **DecisionTreeClassifier** reached only 0.65 in cross-validation, and no **RandomForestClassifier** configuration in the tested grid exceeded that value, so neither model met the condition **F1-score >= 0.75**


* Other recommendations:
    1. It is worth trying more complex models based on gradient boosting, which may outperform the simpler models used here
    
    2. One could also try lemmatization in place of the Snowball stemming used here, since it may normalize words more accurately