# MALIGNANT COMMENTS CLASSIFICATION

#### Problem Statement

The proliferation of social media enables people to express their opinions widely online. At the same time, it has led to conflict and hate, making online environments uninviting for users. Although researchers have found that hate is a problem across multiple platforms, there is a lack of models for online hate detection.
Online hate, which covers abusive language, aggression, cyberbullying, hatefulness and more, has been identified as a major threat on online social media platforms. Social media platforms are the most prominent grounds for such toxic behaviour.  
There has been a remarkable increase in cyberbullying and trolling on various social media platforms. Many celebrities and influencers face backlash and encounter hateful and offensive comments. This can take a toll on anyone and affect them mentally, leading to depression, mental illness, self-hatred and suicidal thoughts.  
Internet comments are bastions of hatred and vitriol. While online anonymity has provided a new outlet for aggression and hate speech, machine learning can be used to fight it. The problem we sought to solve was the tagging of internet comments that are aggressive towards other users. This means that insults to third parties such as celebrities will be tagged as inoffensive, but “u are an idiot” is clearly offensive.

Our goal is to build a prototype classifier for online hate and abuse that can be used to flag hateful and offensive comments, so that they can be controlled and restricted from spreading hatred and cyberbullying.


# Importing the libraries

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

from sklearn.model_selection import train_test_split

from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

%matplotlib inline
import warnings
warnings.filterwarnings("ignore")


# Loading the Dataset


In [2]:
# train dataset

df_train = pd.read_csv("D:/Malignant Comments Classifier Project/train.csv")
df_train.head()

Unnamed: 0,id,comment_text,malignant,highly_malignant,rude,threat,abuse,loathe
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [3]:
#test dataset

df_test = pd.read_csv("D:/Malignant Comments Classifier Project/test.csv")
df_test.head()

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.


### Shape of the dataset

#### The training dataset has 159571 records and 8 columns, including 6 target columns

In [4]:
df_train.shape

(159571, 8)

In [11]:
# list of column names

df_train.columns

Index(['id', 'comment_text', 'malignant', 'highly_malignant', 'rude', 'threat',
       'abuse', 'loathe'],
      dtype='object')

# Preprocessing


### We have 2 object type columns ("id" and "comment_text") and 6 int64 type columns. The 6 target columns ("malignant", "highly_malignant", "rude", "threat", "abuse" and "loathe") are all int64 type

In [5]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 8 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   id                159571 non-null  object
 1   comment_text      159571 non-null  object
 2   malignant         159571 non-null  int64 
 3   highly_malignant  159571 non-null  int64 
 4   rude              159571 non-null  int64 
 5   threat            159571 non-null  int64 
 6   abuse             159571 non-null  int64 
 7   loathe            159571 non-null  int64 
dtypes: int64(6), object(2)
memory usage: 9.7+ MB


In [6]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153164 entries, 0 to 153163
Data columns (total 2 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            153164 non-null  object
 1   comment_text  153164 non-null  object
dtypes: object(2)
memory usage: 2.3+ MB


#### Let's check both datasets for null values

#### Neither the train nor the test dataset has any missing values

In [16]:
df_train.isnull().sum()

id                  0
comment_text        0
malignant           0
highly_malignant    0
rude                0
threat              0
abuse               0
loathe              0
dtype: int64

In [18]:
df_test.isnull().sum()

id              0
comment_text    0
dtype: int64

### Drop the ID columns

In [None]:
df_train.drop(columns = ["id"], axis=1, inplace=True)
df_test.drop(columns = ["id"], axis=1, inplace=True)

# Encoding the dataset


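### I have used LabelEncoder to convert the "comment_text" column to its corresponding numerical value.
### Note that LabelEncoder assigns one integer per distinct comment, so the encoded value identifies a comment but carries no information about its wording. The train and test datasets are encoded separately, so the same integer does not refer to the same text in both.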

In [23]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# Encode the training dataset

df_train.comment_text = encoder.fit_transform(df_train.comment_text)


# Encode the test dataset

df_test.comment_text = encoder.fit_transform(df_test.comment_text)
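
For context, a common alternative for free text is a bag-of-words representation, which keeps information about the words in each comment. The sketch below contrasts `LabelEncoder` with `TfidfVectorizer` on a few made-up comments. The sample strings and variable names are illustrative assumptions and are not part of the dataset or of the pipeline used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample comments, used only for illustration
train_comments = ["you are an idiot", "thanks for the helpful edit", "you are my hero"]
test_comments = ["you are helpful"]

# LabelEncoder: one integer per distinct string (sorted order), no word information
ids = LabelEncoder().fit_transform(train_comments)
print(ids)

# TF-IDF: learn the vocabulary on the training text only, then reuse it for the test text
vectorizer = TfidfVectorizer()
X_train_text = vectorizer.fit_transform(train_comments)
X_test_text = vectorizer.transform(test_comments)

# Both matrices share the same columns (one per vocabulary word)
print(X_train_text.shape, X_test_text.shape)
```

Fitting the vectorizer once and calling `transform` on the test text keeps the feature columns aligned between the two datasets.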

# Let's now observe the stats of the dataset

### All values are accounted for and there are no missing values

#### The target columns are 0/1 indicators, so each mean is the share of comments carrying that label. All six labels are rare (each is set on fewer than 10% of comments)

In [24]:
df_train.describe()

Unnamed: 0,comment_text,malignant,highly_malignant,rude,threat,abuse,loathe
count,159571.0,159571.0,159571.0,159571.0,159571.0,159571.0,159571.0
mean,79785.0,0.095844,0.009996,0.052948,0.002996,0.049364,0.008805
std,46064.32424,0.294379,0.099477,0.223931,0.05465,0.216627,0.09342
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,39892.5,0.0,0.0,0.0,0.0,0.0,0.0
50%,79785.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,119677.5,0.0,0.0,0.0,0.0,0.0,0.0
max,159570.0,1.0,1.0,1.0,1.0,1.0,1.0
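
Because the targets are 0/1 indicators, the column means in the table above can be read as positive rates (about 9.6% for `malignant` and about 0.3% for `threat`). As a minimal sketch of how such rates can be computed directly, the toy frame below stands in for the target columns; its values are made up for illustration.

```python
import pandas as pd

# Toy label frame standing in for the target columns (values are made up)
labels = pd.DataFrame({
    "malignant": [0, 1, 0, 0, 1],
    "threat":    [0, 0, 0, 0, 1],
})

# Share of rows with each label set
positive_rate = labels.mean()

# Share of rows with no label at all
no_label_rate = (labels.sum(axis=1) == 0).mean()

print(positive_rate)
print(no_label_rate)
```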


In [25]:
df_test.describe()

Unnamed: 0,comment_text
count,153164.0
mean,76581.5
std,44214.782652
min,0.0
25%,38290.75
50%,76581.5
75%,114872.25
max,153163.0


# Using pandas profiling for a quick analysis

In [63]:
from pandas_profiling import ProfileReport

profile = ProfileReport(df_train, title="My Data Profile Report")

In [64]:
# run this line if you are not able to see the report in the Jupyter notebook; download the HTML and view it in a separate window

profile.to_file("Malignant_Classifier_report.html")


# Splitting up of dataset between x (features) and y (target column)

#### Since there are 6 target columns of the same data type, "y" holds all of them together as a multi-output target

In [27]:
# train dataset with features only
x = df_train.drop(columns=["malignant", "highly_malignant", "rude", "threat", "abuse", "loathe"])

# all six target columns together (multi-output target)
y = df_train[["malignant", "highly_malignant", "rude", "threat", "abuse", "loathe"]]

# test dataset with features only
x1 = df_test

# Let us now scale the data for further processing

#### We have used StandardScaler to scale the data

In [28]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
x_scaled

array([[-0.15385056],
       [-0.24804523],
       [-0.00414639],
       ...,
       [ 0.92084566],
       [-0.42870621],
       [-1.05676472]])

# split the dataset into train and test data set

#### I have chosen random_state = 200, and 30% of the data is held out as the test dataset

In [29]:
x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.30, random_state = 200)

# Create multi output classification models
#### I have considered 4 ML models in this scenario

### 1) KNeighborsClassifier

In [43]:
from sklearn.neighbors import KNeighborsClassifier

k_neigh = KNeighborsClassifier()
k_neigh.fit(x_train,y_train)

y_pred = k_neigh.predict(x_test)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.8879094251336899
              precision    recall  f1-score   support

           0       0.56      0.22      0.32      4598
           1       0.41      0.14      0.20       465
           2       0.58      0.20      0.29      2491
           3       0.14      0.01      0.01       144
           4       0.54      0.19      0.28      2305
           5       0.27      0.02      0.03       420

   micro avg       0.55      0.19      0.29     10423
   macro avg       0.42      0.13      0.19     10423
weighted avg       0.53      0.19      0.28     10423
 samples avg       0.02      0.02      0.02     10423



### 2) RandomForestClassifier

In [45]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(x_train,y_train)

y_pred = rf.predict(x_test)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.8388201871657754
              precision    recall  f1-score   support

           0       0.33      0.32      0.32      4598
           1       0.20      0.21      0.21       465
           2       0.28      0.29      0.28      2491
           3       0.11      0.12      0.11       144
           4       0.26      0.27      0.27      2305
           5       0.07      0.07      0.07       420

   micro avg       0.28      0.28      0.28     10423
   macro avg       0.21      0.21      0.21     10423
weighted avg       0.28      0.28      0.28     10423
 samples avg       0.03      0.03      0.02     10423



### 3) DecisionTreeClassifier

In [47]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
dt.fit(x_train,y_train)

y_pred = dt.predict(x_test)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.8386112967914439
              precision    recall  f1-score   support

           0       0.32      0.32      0.32      4598
           1       0.20      0.21      0.21       465
           2       0.28      0.29      0.28      2491
           3       0.11      0.12      0.11       144
           4       0.26      0.27      0.27      2305
           5       0.07      0.07      0.07       420

   micro avg       0.28      0.28      0.28     10423
   macro avg       0.21      0.21      0.21     10423
weighted avg       0.28      0.28      0.28     10423
 samples avg       0.03      0.03      0.02     10423



### 4) ExtraTreesClassifier

In [32]:
from sklearn.ensemble import ExtraTreesClassifier

ext_reg = ExtraTreesClassifier()
ext_reg.fit(x_train,y_train)

y_pred = ext_reg.predict(x_test)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.8412224264705882
              precision    recall  f1-score   support

           0       0.33      0.31      0.32      4598
           1       0.20      0.20      0.20       465
           2       0.28      0.28      0.28      2491
           3       0.08      0.08      0.08       144
           4       0.27      0.27      0.27      2305
           5       0.07      0.06      0.06       420

   micro avg       0.29      0.28      0.28     10423
   macro avg       0.21      0.20      0.20     10423
weighted avg       0.28      0.28      0.28     10423
 samples avg       0.03      0.03      0.02     10423
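
For multi-label targets, `accuracy_score` reports subset accuracy: a row counts as correct only when every label matches. With labels this sparse, a model that never predicts any label can already score highly, so a baseline comparison helps put the accuracy figures above in context. The sketch below uses a small made-up label matrix (not the project's data) to compare the subset accuracy and macro F1 of an all-zeros prediction.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up multi-label ground truth: 10 rows, 3 labels, mostly zeros
y_true = np.zeros((10, 3), dtype=int)
y_true[0, 0] = 1
y_true[1, [0, 2]] = 1

# Baseline that never predicts any label
y_zero = np.zeros_like(y_true)

# Exact-match (subset) accuracy over rows
subset_acc = accuracy_score(y_true, y_zero)

# F1 per label, averaged over labels; undefined cases are counted as 0
macro_f1 = f1_score(y_true, y_zero, average="macro", zero_division=0)

print(subset_acc)  # 8 of 10 rows have no label, so the baseline scores 0.8
print(macro_f1)    # no positive is ever found, so F1 is 0.0
```

This is why the per-label precision and recall in the classification reports are more informative here than the single accuracy number.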



# Cross validation to check if the models are overfitting

In [54]:
from sklearn.model_selection import cross_val_score

In [55]:
scr = cross_val_score(k_neigh, x, y, cv=5)
print("Cross Validation score of KNeighborsClassifier model is:", scr.mean())

Cross Validation score of KNeighborsClassifier model is: 0.8875484996195173


In [56]:
scr = cross_val_score(rf, x, y, cv=5)
print("Cross Validation score of RandomForestClassifier model is:", scr.mean())

Cross Validation score of RandomForestClassifier model is: 0.8385232995015166


In [57]:
scr = cross_val_score(dt, x, y, cv=5)
print("Cross Validation score of DecisionTreeClassifier model is:", scr.mean())

Cross Validation score of DecisionTreeClassifier model is: 0.8383227607494531


In [59]:
scr = cross_val_score(ext_reg, x, y, cv=5)
print("Cross Validation score of ExtraTreesClassifier model is:", scr.mean())

Cross Validation score of ExtraTreesClassifier model is: 0.8413935122190315
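
`cross_val_score` uses the estimator's default scorer, which for these classifiers is the same subset accuracy as above. As a hedged sketch, the `scoring` argument can request a label-aware metric instead. The synthetic data from `make_multilabel_classification` and the parameter values below are assumptions for illustration only.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Synthetic multi-label data standing in for the real features and targets
X_demo, Y_demo = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=3, random_state=0
)

clf = ExtraTreesClassifier(n_estimators=50, random_state=0)

# Default scorer: subset accuracy
acc_scores = cross_val_score(clf, X_demo, Y_demo, cv=5)

# "f1_macro": F1 computed per label, then averaged over the labels
f1_scores = cross_val_score(clf, X_demo, Y_demo, cv=5, scoring="f1_macro")

print(acc_scores.mean(), f1_scores.mean())
```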


# Selecting the best ML model for this dataset

### From the above algorithms, ExtraTreesClassifier is selected as an appropriate model for this dataset.

### Its cross validation score is marginally higher than its hold-out accuracy, whereas the other three models score slightly lower under cross validation, so it shows the least sign of overfitting

### All 4 algorithms natively support multi-output classification in scikit-learn

| Sr.No | ML Models used | Accuracy Score | Cross Validation Score | Difference (Accuracy - CV) |
| --- | --- | --- | --- | --- |
| 1 | KNeighborsClassifier | 0.887909 | 0.887548 | 0.000361 |
| 2 | RandomForestClassifier | 0.838820 | 0.838523 | 0.000297 |
| 3 | DecisionTreeClassifier | 0.838611 | 0.838323 | 0.000289 |
| 4 | ExtraTreesClassifier | 0.840136 | 0.841394 | -0.001257 |


## Use "MultiOutputClassifier"

### This strategy consists of fitting one classifier per target. It is a simple strategy for extending classifiers that do not natively support multi-target classification.

In [35]:
from sklearn.multioutput import MultiOutputClassifier

In [36]:
multiclassifier = MultiOutputClassifier(ext_reg)

In [37]:
multiclassifier.fit(x_train,y_train)

MultiOutputClassifier(estimator=ExtraTreesClassifier())

In [63]:
multiclassifier.score(x_train,y_train)

1.0
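
A training score of 1.0 mainly shows that the trees can memorise the training rows; the score on held-out data is the more informative figure. Below is a minimal sketch comparing the two, run on synthetic data (an assumption for illustration, not the project's data) with made-up variable names.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

# Synthetic multi-label data standing in for the real features and targets
X_demo, Y_demo = make_multilabel_classification(
    n_samples=300, n_features=10, n_classes=3, random_state=0
)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.30, random_state=200
)

# One ExtraTreesClassifier per target column
demo_model = MultiOutputClassifier(ExtraTreesClassifier(n_estimators=50, random_state=0))
demo_model.fit(X_tr, Y_tr)

# score() returns subset accuracy; compare seen vs unseen rows
train_score = demo_model.score(X_tr, Y_tr)
test_score = demo_model.score(X_te, Y_te)

print(train_score, test_score)
```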

# Hyper Parameter Tuning

### Let us try to tune the proposed model (ExtraTreesClassifier) to get better accuracy, if possible

##### The parameters have been selected from the scikit-learn library and I have considered 6 parameters

In [38]:
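parameters = {"criterion":["gini", "entropy"],
              # "auto" was removed in scikit-learn 1.3; for classifiers it behaved like "sqrt"
              "max_features":["auto", "sqrt", "log2"],
              "class_weight":["balanced", "balanced_subsample"],
              # oob_score=True needs bootstrap=True; with the default bootstrap=False those candidates fail to fit
              "oob_score":[True, False],
              "random_state":[30, 50, 70, 100, 120],
              "n_estimators":[100, 130, 150, 170, 200]
              }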

### RandomizedSearchCV is used to tune the parameters by fitting it to the training dataset

#### We have used it because the dataset is too large to process with GridSearchCV, and the search runs 10 iterations before returning the best params

In [65]:
from sklearn.model_selection import RandomizedSearchCV
RCV = RandomizedSearchCV(ExtraTreesClassifier(), parameters, cv=5, n_iter=10)

In [66]:
RCV.fit(x_train, y_train)

RandomizedSearchCV(cv=5, estimator=ExtraTreesClassifier(),
                   param_distributions={'class_weight': ['balanced',
                                                         'balanced_subsample'],
                                        'criterion': ['gini', 'entropy'],
                                        'max_features': ['auto', 'sqrt',
                                                         'log2'],
                                        'n_estimators': [100, 130, 150, 170,
                                                         200],
                                        'oob_score': [True, False],
                                        'random_state': [30, 50, 70, 100, 120]})

In [67]:
RCV.best_params_

{'random_state': 120,
 'oob_score': False,
 'n_estimators': 130,
 'max_features': 'auto',
 'criterion': 'gini',
 'class_weight': 'balanced_subsample'}
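
Two entries in the search space above are worth noting: `oob_score=True` only works together with `bootstrap=True` (the default for `ExtraTreesClassifier` is `bootstrap=False`), and `max_features="auto"` is not accepted by scikit-learn 1.3 and later. The sketch below shows one way to write a search space that avoids both. The synthetic data, the reduced grid and the `f1_macro` scoring choice are assumptions for illustration, not the settings used in this notebook.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic multi-label data standing in for the real features and targets
X_demo, Y_demo = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=3, random_state=0
)

param_distributions = {
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", "log2"],
    "n_estimators": [50, 100],
    "bootstrap": [True],          # required whenever oob_score=True is sampled
    "oob_score": [True, False],
}

# The seed is fixed on the estimator rather than searched over
search = RandomizedSearchCV(
    ExtraTreesClassifier(random_state=0),
    param_distributions,
    n_iter=5,
    cv=3,
    scoring="f1_macro",
    random_state=0,
)
search.fit(X_demo, Y_demo)

print(search.best_params_)
```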

### Rebuild the model using the params we received from best_params_


#### The tuned model again scores 1.0 on the training data, the same as the untuned model, so tuning shows no improvement on this measure

In [40]:
mod_final = MultiOutputClassifier(ExtraTreesClassifier(random_state = 120, oob_score = False, n_estimators = 130,
                                                       max_features= "auto", criterion= "gini", class_weight = "balanced_subsample"))

mod_final.fit(x_train,y_train)
pred = mod_final.predict(x_test)

In [69]:
mod_final.score(x_train,y_train)

1.0

# Saving the model (using joblib)

In [70]:
import joblib
 
joblib.dump(mod_final,"Malignant_Comments.pkl")

['Malignant_Comments.pkl']
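
Saving only the classifier means that any preprocessing has to be re-applied by hand at prediction time. One hedged option is to bundle the preprocessing and the model in a scikit-learn `Pipeline` and save that single object. The toy comments, labels, file name and the use of `TfidfVectorizer` below are made up for illustration and differ from the preprocessing used in this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up comments and two made-up label columns
comments = ["you are an idiot", "thanks for the edit", "i will hurt you", "nice article"]
labels = np.array([[1, 0], [0, 0], [1, 1], [0, 0]])

# Text vectorizer and classifier fitted together as one object
pipe = make_pipeline(
    TfidfVectorizer(),
    ExtraTreesClassifier(n_estimators=20, random_state=0),
)
pipe.fit(comments, labels)

# Save and reload the whole pipeline, then predict directly on raw text
path = os.path.join(tempfile.mkdtemp(), "demo_pipeline.pkl")
joblib.dump(pipe, path)
restored = joblib.load(path)

pred = restored.predict(["you are an idiot"])
print(pred.shape)  # (1, 2): one row, one column per label
```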

# Loading the saved model

In [72]:
model = joblib.load("Malignant_Comments.pkl")

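# the model was trained on scaled features, so apply the same fitted scaler to the test features
prediction = model.predict(scaler.transform(df_test))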

prediction=pd.DataFrame(prediction)
prediction

Unnamed: 0,0,1,2,3,4,5
0,0,0,0,0,0,0
1,0,0,0,0,0,0
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,0,0,0,0,0,0
...,...,...,...,...,...,...
153159,0,0,0,0,0,0
153160,0,0,0,0,0,0
153161,0,0,0,0,0,0
153162,0,0,0,0,0,0
