# NLP Toxic Comment Analysis

## Project Description — Introduction:
<br>

### The Dataset:

The dataset was originally featured in the Jigsaw Toxic Comment Classification Challenge. The comments represent moderated Wikipedia comments are property of Wikipedia and/or Jigsaw AI. The current project uses a simplified version of the dataset, with only 1 column representing toxicity (rather than distinguishing between obscene, threatening, severely toxic comments, etc., as in the original dataset).
<br>
### Technical Task:

The task consists of teaching a model to classify Wikipedia comments as either toxic or non-toxic in order to automatise their moderation on the website.
<br>
### Data Description:

Our data is labelled and contained in the `toxic_comments.csv` file. There are two columns (features) in the dataset — the comments themselves, represented as a `string` (`object`), and their toxicity rating — `0` (`False`) or `1` (`True`)
<br>
### Project Goal:

Apply NLP techniques in order to pre-process data, teach classification models, and accurately classify comments based on the F1–score.
<br>
### Specific Objectives, Metrics, and Implementation Plan:

**Implementation Plan:**

1. EDA — downloading, cleaning, and preparing data.
2. Teaching and testing ML models.
3. Draw conclusions.

**Metrics:**
We would like to achieve model results with the F1–score ≥ 0.75, according to our project specifications.

**Models:**
We will be applying classification models: logistic regression, random forest, and the LightGBM classifier.

In addition, our course contained specification and practice use cases of the BERT NLP technique / model, but we will use simpler models in this project given their sufficient performance and our limited computational resources.

<br>

### WARNING :

*THE NOTEBOOK MAY CONTAIN OBSCENE AND INAPPROPRIATE CONTENT BECAUSE OF THE NATURE OF THE DATASET BEING USED. READER DISCRETION IS ADVISED.*

## EDA and Pre-processing

Note that some packages need to be additionally installed in order to run the Jupyter Notebook code. The Mac M1 chips require a special command to enable LightGBM installation (can also be run in the Terminal):

```
conda install \
   --yes \
   -c conda-forge \
   'lightgbm>=3.3.3'
```

In [1]:
# Installing libraries, if necessary:
!pip install scikit-learn
!pip install pymystem3
!pip install nltk

# Importing libraries and frameworks:
import pandas as pd
import numpy as np
import time
import seaborn as sns

from sklearn.metrics import f1_score
from sklearn.metrics import make_scorer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
import lightgbm as lgb

from pymystem3 import Mystem
import re
import nltk

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
import warnings
warnings.filterwarnings('ignore')



In [2]:
df = pd.read_csv('toxic_comments.csv', index_col = 'Unnamed: 0')
display(df.head(), df.info(), df['toxic'].value_counts())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB


Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


None

0    143106
1     16186
Name: toxic, dtype: int64

We have a dataset of approximately 160'000 entries, which is imbalanced in the target feature. We will need to bear this in mind when applying our models: we will use the automatic `class_weight = balanced` parameter, although we could also experiment with up- and down-sampling techniques.

The target variable passes our sanity check — it contains values of either `0` or `1`.

Let us check for duplicates and missing values:

In [3]:
display(df.duplicated().value_counts(), df.isna().sum())

False    159292
dtype: int64

text     0
toxic    0
dtype: int64

There are no duplicates or missing values in the columns, and we can begin NLP data pre-processing:
1. We will first lemmatise all the comments — remove extraneous punctuation and spaces, converting all to type `U`;
2. We will divide the data into the training, validation, and testing datasets with a ratio of 70:15:15;
3. We will apply vectorisation using the bag-of-words technique and the `TfidfVectorizer`, being careful not to use the `fit` method on our validation or testing dataset.

In [4]:
# Applying lemmatisation:

m = Mystem()

def lemmatisation(comments):
    comments = comments.lower()
    comments = ' '.join(m.lemmatize(comments))
    comments = re.sub(r'[^aA-zZ]', ' ', comments)
    comments = ' '.join(comments.split())
    return comments

df['text'] = df['text'].apply(lemmatisation)
df.head()

Unnamed: 0,text,toxic
0,explanation why the edits made under my userna...,0
1,d aww he matches this background colour i m se...,0
2,hey man i m really not trying to edit war it s...,0
3,more i can t make any real suggestions on impr...,0
4,you sir are my hero any chance you remember wh...,0


Lemmatisation has been applied correctly.

We can now split our data — because we are dealing with an imbalanced dataset, let us apply the `stratify = target` parameter in order to divide our classes equally between datasets. An example of using the `stratify` parameter is shown [here](https://gist.github.com/SHi-ON/63839f3a3647051a180cb03af0f7d0d9).

In [5]:
# Dividing into subsets:
features = df.drop('toxic', axis = 1)
target = df['toxic']

def splitter(features, target, ratio):
    return train_test_split(features, target, test_size = ratio, stratify=target, random_state = 42)

features_train, preliminary_features_test, target_train, preliminary_target_test = splitter(features, target, 0.3)
features_valid, features_test, target_valid, target_test = splitter(preliminary_features_test,
                                                                              preliminary_target_test, 0.5)
features_train.shape, features_valid.shape, features_test.shape, target_train.shape, target_valid.shape, target_test.shape

((111504, 1), (23894, 1), (23894, 1), (111504,), (23894,), (23894,))

Our data has been split successfully.

Next, let us convert our data to the correct value-type:

In [6]:
features_train = features_train['text'].values.astype('U')
features_valid = features_valid['text'].values.astype('U')
features_test = features_test['text'].values.astype('U')

display(features_train.dtype, features_valid.dtype, features_test.dtype)

dtype('<U5000')

dtype('<U5000')

dtype('<U5382')

Moving on to feature vectorisation and stop-word deletion:

In [7]:
stop_words = set(stopwords.words('english'))
count_tf_idf = TfidfVectorizer(stop_words=list(stop_words))

features_train = count_tf_idf.fit_transform(features_train)

features_valid = count_tf_idf.transform(features_valid)

features_test = count_tf_idf.transform(features_test)

features_train.shape, features_valid.shape, features_test.shape

((111504, 137027), (23894, 137027), (23894, 137027))

**Conclusion:**

In this section, we have lemmatised our comments and divided them into subsets, where we have accounted for class imbalance and stratified our sets. We have converted our data to the correct datasetype and deleted English-language stop-words. We have also used the bag-of-words technique in order to vectorise our comments.

We can now move on to training, tuning, and testing our models.

## ML Models

We are faced with a classification task, and we will be using logistic regression, random forest, and LightGBM frameworks / models in order to solve it.

We will train the models and examine their performance on the validation dataset, train and tune them using the validation dataset if necessary, and finally test them in implementing sentiment analysis on the (entirely unseen) testing dataset.

In [8]:
# Let us create a dataframe for storing our results:

df_results = pd.DataFrame(index = ['f1_score', 'learning_time', 'prediction_time'])
df_results

f1_score
learning_time
prediction_time


Let us now create a function in order to train our models without tuning, using the validation dataset to test the results:

In [9]:
pd.set_option('display.float_format', '{:.5f}'.format)

def modelling(model, features_train, target_train, features_valid, target_valid):
    start_time = time.time()
    model.fit(features_train, target_train)
    learning_time = time.time() - start_time
    predictions = model.predict(features_valid)
    prediction_time = time.time() - learning_time - start_time
    f1 = f1_score(target_valid, predictions)
    return [f1, learning_time, prediction_time]

# We are using practically default parameters in our models:
df_results = df_results.assign(LogisticRegression = modelling(LogisticRegression(random_state = 42, class_weight='balanced'), features_train, target_train, features_valid, target_valid))
df_results = df_results.assign(RandomForestClassifier = modelling(RandomForestClassifier(random_state = 42, class_weight='balanced'), features_train, target_train, features_valid, target_valid))
df_results = df_results.assign(LGBMClassifier = modelling(lgb.LGBMClassifier(random_state = 42, boosting_type='gbdt', is_unbalance=True), features_train, target_train, features_test, target_test))

df_results

Unnamed: 0,LogisticRegression,RandomForestClassifier,LGBMClassifier
f1_score,0.75188,0.63013,0.72598
learning_time,6.82944,472.46407,24.19027
prediction_time,0.00621,3.61217,0.40085


Our simplest model (logistic regression) has already allowed us to achieve the necessary F1–score, and it is also the fastest performing model. We could try to improve on it using a different `solver` parameter. [This](https://stackoverflow.com/questions/38640109/logistic-regression-python-solvers-definitions) post recommends either `liblinear` or `sag` for relatively large datasets — let us compare those:

In [10]:
df_results_lr = pd.DataFrame(index = ['f1_score', 'learning_time', 'prediction_time'])
df_results_lr = df_results_lr.assign(LogisticRegression_default = modelling(LogisticRegression(random_state = 42, class_weight='balanced'), features_train, target_train, features_valid, target_valid))
df_results_lr = df_results_lr.assign(LogisticRegression_liblinear = modelling(LogisticRegression(random_state = 42, class_weight='balanced', solver = 'liblinear'), features_train, target_train, features_valid, target_valid))
df_results_lr = df_results_lr.assign(LogisticRegression_sag = modelling(LogisticRegression(random_state = 42, class_weight='balanced', solver = 'sag'), features_train, target_train, features_valid, target_valid))
df_results_lr

Unnamed: 0,LogisticRegression_default,LogisticRegression_liblinear,LogisticRegression_sag
f1_score,0.75188,0.75147,0.75147
learning_time,6.61875,1.64428,3.49507
prediction_time,0.00717,0.00552,0.00302


Both of the new solvers show better performance on the F1–score, and the much shorter learning time of the `liblinear` solver makes it the preferred option.

Let's add the `liblinear` model to our results table:

In [11]:
df_results = df_results.assign(LogisticRegression_tuned = modelling(LogisticRegression(random_state = 42, class_weight='balanced', solver = 'liblinear'), features_train, target_train, features_valid, target_valid))
df_results

Unnamed: 0,LogisticRegression,RandomForestClassifier,LGBMClassifier,LogisticRegression_tuned
f1_score,0.75188,0.63013,0.72598,0.75147
learning_time,6.82944,472.46407,24.19027,1.54695
prediction_time,0.00621,3.61217,0.40085,0.00682


We could now also attempt to tune our two other models: random forest and the LGBM classifier. Of these, LGBM seems like a more promising option — it already performs better and faster than random forest.

Let us therefore experiment with the LGBM model hyperparameters, using k-fold cross-validation on the training dataset to find the best combination:

In [12]:
folds = KFold(n_splits = 4, shuffle = True, random_state = 42).split(features_train, target_train)

param_grid = {
    'n_estimators': [50, 250, 500],
    'learning_rate' : [0.3, 0.1, 0.01]
    }

lgbm_estimator = lgb.LGBMClassifier(random_state = 42, boosting_type='gbdt', max_depth = -1, is_unbalance = True)

gsearch = GridSearchCV(estimator = lgbm_estimator, param_grid = param_grid, cv = folds, scoring = 'f1')
lgb_model = gsearch.fit(X = features_train, y = target_train)

display(lgb_model.best_params_, lgb_model.best_score_)

{'learning_rate': 0.3, 'n_estimators': 500}

0.7644255822406555

Let's now test these best hyperparameters on the validation dataset and add the result to our results table:

In [13]:
df_results = df_results.assign(LGBMClassifier_tuned = modelling(lgb.LGBMClassifier(random_state = 42, boosting_type='gbdt', is_unbalance=True, max_depth = -1, n_estimators = 500, learning_rate = 0.3), features_train, target_train, features_valid, target_valid))

# Let us highlight the best results:
cm = sns.cubehelix_palette(as_cmap = True)
df_results.T.apply(pd.to_numeric).style.background_gradient(cmap=cm)

Unnamed: 0,f1_score,learning_time,prediction_time
LogisticRegression,0.751877,6.829438,0.006215
RandomForestClassifier,0.63013,472.46407,3.612165
LGBMClassifier,0.725976,24.190273,0.400845
LogisticRegression_tuned,0.751466,1.546955,0.006816
LGBMClassifier_tuned,0.766565,67.843642,1.942461


The tuned LGBM classifier is more than twice as slow as the out-of-the-box model, but it shows the best results so far. Given its slowness, let us forgo random forest tuning and move on to testing our 2 final models, the tuned logistic regression and the LGBM classifier, on our testing dataset.

## Testing

Let us now test our best tuned models on the testing subset and display the results in a new dataframe:

In [14]:
test_results = pd.DataFrame(index = ['f1_score', 'learning_time', 'prediction_time'])
test_results = test_results.assign(LogisticRegression_final = modelling(LogisticRegression(random_state = 42, class_weight='balanced', solver = 'liblinear'), features_train, target_train, features_test, target_test))
test_results = test_results.assign(LGBMClassifier_final = modelling(lgb.LGBMClassifier(random_state = 42, boosting_type='gbdt', is_unbalance=True, max_depth = -1, n_estimators = 500, learning_rate = 0.3), features_train, target_train, features_test, target_test))

test_results.T.apply(pd.to_numeric).style.background_gradient(cmap=cm)

Unnamed: 0,f1_score,learning_time,prediction_time
LogisticRegression_final,0.751693,1.718711,0.016241
LGBMClassifier_final,0.762895,68.963549,1.918973


We can see that our best F1–score on the test dataset was attained with the LGBM classifier, although its performance takes longer than logistic regression. Both logistic regression and our LGBM classifier achieved the necessary metric of ≥ 0.75 on the F1–score.

## Outcome Discussion

Our highest F1–score was achieved using the LGBM classifier with the hyperparameters:
- `boosting_type = 'gbdt'`;
- `is_unbalance = True`;
- `max_depth = -1`;
- `n_estimators = 500`;
- `learning_rate = 0.3`;
- `random_state = 42`.

Our final result was F1–score = 0.763, which is above the necessary benchmark, showing that our model is able to successfully predict (classify) toxic comments.

Our other models — logistic regression and random forest showed worse results, with random forest significantly underperforming in both F1–score and time spent on training. We would thus recommend using either the LGBM classifier or logistic regression, which works faster and whose results are comparative with the LGBM classifier.

Regarding model improvements, we could recommend further hyperparameter tuning, using a wider range of options or more folds with cross-validation. We could also test out hyperparameter tuning with random forest and analyse its underperformance. In addition, other models / frameworks like CatBoost may be attempted.

In terms of overall data analysis and preparation, we could test out different ways of lemmatising our data (e.g., substituting spaces with empty symbols), and we could also either down- or up-sample our data instead of using hyperparameters to off-set class-imbalance.

Overall, achieving F1–score ≥ 0.75 on our testing dataset constitutes a successful task completion.