Dataset Source: https://zenodo.org/record/2586669

In [113]:
import string

import nltk
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Download the NLTK resources used below (stopword list and Punkt tokenizer models)
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [114]:
ds = pd.read_csv('youtoxic_english.csv', sep=',')

In [115]:
ds.shape

(1000, 15)

In [116]:
ds.head(5)

Unnamed: 0,CommentId,VideoId,Text,IsToxic,IsAbusive,IsThreat,IsProvocative,IsObscene,IsHatespeech,IsRacist,IsNationalist,IsSexist,IsHomophobic,IsReligiousHate,IsRadicalism
0,Ugg2KwwX0V8-aXgCoAEC,04kJtp6pVXI,If only people would just take a step back and...,False,False,False,False,False,False,False,False,False,False,False,False
1,Ugg2s5AzSPioEXgCoAEC,04kJtp6pVXI,Law enforcement is not trained to shoot to app...,True,True,False,False,False,False,False,False,False,False,False,False
2,Ugg3dWTOxryFfHgCoAEC,04kJtp6pVXI,\nDont you reckon them 'black lives matter' ba...,True,True,False,False,True,False,False,False,False,False,False,False
3,Ugg7Gd006w1MPngCoAEC,04kJtp6pVXI,There are a very large number of people who do...,False,False,False,False,False,False,False,False,False,False,False,False
4,Ugg8FfTbbNF8IngCoAEC,04kJtp6pVXI,"The Arab dude is absolutely right, he should h...",False,False,False,False,False,False,False,False,False,False,False,False


For this investigation I will use only two columns, Text and IsToxic. Let's drop the rest.

In [117]:
columns_to_drop = ['CommentId', 'VideoId', 'IsAbusive', 'IsThreat', 'IsProvocative', 'IsObscene', 'IsHatespeech', 'IsRacist', 'IsNationalist', 'IsSexist', 'IsHomophobic', 'IsReligiousHate', 'IsRadicalism']
ds = ds.drop(columns=columns_to_drop)

In [118]:
ds.head(5)

Unnamed: 0,Text,IsToxic
0,If only people would just take a step back and...,False
1,Law enforcement is not trained to shoot to app...,True
2,\nDont you reckon them 'black lives matter' ba...,True
3,There are a very large number of people who do...,False
4,"The Arab dude is absolutely right, he should h...",False


In [119]:
ds['IsToxic'].value_counts()

False    538
True     462
Name: IsToxic, dtype: int64

In [120]:
for comment in ds[ds['IsToxic'] == True]['Text'][:5]:
  print(comment)

Law enforcement is not trained to shoot to apprehend. They are trained to shoot to kill. And I thank Wilson for killing that punk bitch.

Dont you reckon them 'black lives matter' banners being held by white cunts is kinda patronizing and ironically racist. could they have not come up with somethin better.. or is it just what white folks do to give them selves pride. 'ooo look at me im being nice for the black people' why does it always have to be about race actually the whole world is pussyfootin around for fear of being racist. its fuckin daft man.
here people his facebook is https://www.facebook.com/bassem.masri.520 he has ties with isis and other terrorist groups he is a muslim extremist
Check out this you tube post. "Black man goes on an epic rant against Ferguson rioters."

Although his message is delivered with childish, cartoon-ish emotions.... He is one of the very few African American's who gets it.
I would LOVE to see this pussy go to Staten Island and spit on a cop

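Data preparation: remove newline characters so that each comment prints on a single line. Note that replacing '\n' with an empty string fuses the words on either side of the break (see rioters."Although in the fourth comment below).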

In [121]:
ds['Text'] = ds['Text'].str.replace('\n', '')

In [122]:
for comment in ds[ds['IsToxic'] == True]['Text'][:5]:
  print(comment, '\n')

Law enforcement is not trained to shoot to apprehend. They are trained to shoot to kill. And I thank Wilson for killing that punk bitch.

Dont you reckon them 'black lives matter' banners being held by white cunts is kinda patronizing and ironically racist. could they have not come up with somethin better.. or is it just what white folks do to give them selves pride. 'ooo look at me im being nice for the black people' why does it always have to be about race actually the whole world is pussyfootin around for fear of being racist. its fuckin daft man.

here people his facebook is https://www.facebook.com/bassem.masri.520 he has ties with isis and other terrorist groups he is a muslim extremist

Check out this you tube post. "Black man goes on an epic rant against Ferguson rioters."Although his message is delivered with childish, cartoon-ish emotions.... He is one of the very few African American's who gets it. 

I would LOVE to see this pussy go to Staten Island and spit on a cop
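Replacing '\n' with an empty string joins the last word of one line to the first word of the next, as in rioters."Although above. A minimal sketch of an alternative, assuming it is acceptable to collapse every run of whitespace into a single space; the helper name normalize_whitespace is hypothetical and not part of the notebook:

```python
def normalize_whitespace(text: str) -> str:
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    # and drops leading and trailing whitespace.
    return ' '.join(text.split())


sample = 'rioters."\n\nAlthough his message'
print(normalize_whitespace(sample))
```

On the DataFrame, the same idea could probably be written as ds['Text'].str.replace(r'\s+', ' ', regex=True).str.strip(). Either variant changes the text the model sees, so the scores further down would not be exactly reproduced.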

Let's use 800 samples for training and hold out the remaining 200 for testing

In [123]:
train_ds, test_ds = train_test_split(ds, test_size=200)

In [124]:
test_ds.shape

(200, 2)

The split is not stratified, so the share of toxic comments differs slightly between the test set (98 of 200, 49%) and the training set (364 of 800, 45.5%)

In [125]:
test_ds['IsToxic'].value_counts()

False    102
True      98
Name: IsToxic, dtype: int64

In [126]:
train_ds['IsToxic'].value_counts()

False    436
True     364
Name: IsToxic, dtype: int64
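If equal class proportions in both splits matter, train_test_split accepts a stratify argument. The sketch below uses toy labels with the same class counts as the dataset (538 non-toxic, 462 toxic) rather than the real comments; the random_state value is an arbitrary assumption added for reproducibility, and the notebook itself does not set one.

```python
from sklearn.model_selection import train_test_split

# Toy data with the same class counts as the dataset (538 non-toxic, 462 toxic).
labels = [False] * 538 + [True] * 462
texts = [f'comment {i}' for i in range(len(labels))]

# stratify=labels keeps the toxic share (46.2%) nearly equal in both splits;
# random_state=0 is an arbitrary choice that makes the split reproducible.
train_x, test_x, train_y, test_y = train_test_split(
    texts, labels, test_size=200, stratify=labels, random_state=0
)

print(sum(train_y) / len(train_y), sum(test_y) / len(test_y))
```

For the notebook's DataFrame, the equivalent call would pass stratify=ds['IsToxic'].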

In [132]:
sentence_example = ds.iloc[1]['Text']
tokens = word_tokenize(sentence_example, language='english')
tokens_without_punctuation = [i for i in tokens if i not in string.punctuation]
print(len(tokens_without_punctuation))

25


In [128]:
print(f'Tokens without punctuation: {tokens_without_punctuation}')

Tokens without punctuation: ['Law', 'enforcement', 'is', 'not', 'trained', 'to', 'shoot', 'to', 'apprehend', 'They', 'are', 'trained', 'to', 'shoot', 'to', 'kill', 'And', 'I', 'thank', 'Wilson', 'for', 'killing', 'that', 'punk', 'bitch']


In [133]:
english_stopwords = stopwords.words('english')
tokens_without_punctuation_and_stopwords = [i for i in tokens_without_punctuation if i not in english_stopwords]
print(len(tokens_without_punctuation_and_stopwords))

16


In [134]:
print(f'Tokens without punctuation and stopwords: {tokens_without_punctuation_and_stopwords}')

Tokens without punctuation and stopwords: ['Law', 'enforcement', 'trained', 'shoot', 'apprehend', 'They', 'trained', 'shoot', 'kill', 'And', 'I', 'thank', 'Wilson', 'killing', 'punk', 'bitch']
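'They', 'And' and 'I' survive the filter above because the NLTK stopword list is lowercase and the comparison is case-sensitive. Below is a sketch of a case-insensitive filter. It uses a tiny hand-written stopword set (toy_stopwords, a stand-in for stopwords.words('english')) so that it runs without NLTK data:

```python
# Stand-in for stopwords.words('english'); the real list is much longer.
toy_stopwords = {'is', 'not', 'to', 'they', 'are', 'and', 'i', 'for', 'that'}

tokens = ['Law', 'enforcement', 'is', 'not', 'trained', 'to', 'shoot',
          'They', 'are', 'trained', 'And', 'I', 'thank', 'Wilson']

# Compare the lowercased token against the list, but keep the original token.
filtered = [t for t in tokens if t.lower() not in toy_stopwords]
print(filtered)
```

Filtering this way would change the features, and therefore the scores reported later in the notebook.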


In [135]:
snowball = SnowballStemmer(language='english')

Strip suffixes with stemming

In [137]:
stemmed_tokens = [snowball.stem(i) for i in tokens_without_punctuation_and_stopwords]

In [139]:
print(*stemmed_tokens)

law enforc train shoot apprehend they train shoot kill and i thank wilson kill punk bitch


A function that applies tokenization, stopword removal and stemming to a sample

In [142]:
snowball = SnowballStemmer(language='english')
english_stopwords = stopwords.words('english')

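def tokenize_sentence(sentence: str, remove_stop_words: bool = True) -> list:
  """Tokenize a sentence, drop punctuation (and optionally stopwords), then stem."""
  tokens = word_tokenize(sentence, language='english')
  # Substring test against string.punctuation: removes single punctuation marks
  tokens = [i for i in tokens if i not in string.punctuation]
  if remove_stop_words:
    # Case-sensitive comparison against NLTK's lowercase stopword list
    tokens = [i for i in tokens if i not in english_stopwords]
  # The Snowball stemmer also lowercases each token
  tokens = [snowball.stem(i) for i in tokens]
  return tokens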

Let's test the function

In [152]:
sentence_example = ds.iloc[2]['Text']
print(sentence_example)
print(*tokenize_sentence(sentence_example), sep=', ')

Dont you reckon them 'black lives matter' banners being held by white cunts is kinda patronizing and ironically racist. could they have not come up with somethin better.. or is it just what white folks do to give them selves pride. 'ooo look at me im being nice for the black people' why does it always have to be about race actually the whole world is pussyfootin around for fear of being racist. its fuckin daft man.
dont, reckon, black, live, matter, banner, held, white, cunt, kinda, patron, iron, racist, could, come, somethin, better, .., white, folk, give, selv, pride, ooo, look, im, nice, black, peopl, alway, race, actual, whole, world, pussyfootin, around, fear, racist, fuckin, daft, man
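The '..' token in the output above slips through because the check i not in string.punctuation is a substring test against one long string, and '..' is not a substring of it. A sketch of a stricter check that treats a token as punctuation when all of its characters are punctuation marks; the helper name is_punctuation is hypothetical:

```python
import string


def is_punctuation(token: str) -> bool:
    """True if every character of the token is an ASCII punctuation mark."""
    return all(ch in string.punctuation for ch in token)


tokens = ['somethin', 'better', '..', 'or', '...', "'ooo", '.']
print([t for t in tokens if not is_punctuation(t)])
```

A token such as 'ooo with a leading quote is kept, because it also contains letters.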


We will use LogisticRegression, which needs numeric input, so the text first has to be transformed into TF-IDF vectors:

In [154]:
vectorizer = TfidfVectorizer(tokenizer=lambda x: tokenize_sentence(x, remove_stop_words = True))

In [157]:
features = vectorizer.fit_transform(train_ds['Text'])
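To make the vectorization step less of a black box, here is a hand-rolled sketch of TF-IDF weighting on a made-up, already tokenized corpus. It follows the smoothed formula that scikit-learn documents for its default settings (idf = ln((1 + n) / (1 + df)) + 1, then L2 normalization per document). It is an illustration, not the library's implementation.

```python
import math
from collections import Counter

# Made-up corpus of already tokenized documents.
docs = [['kill', 'punk'], ['good', 'person'], ['bad', 'person', 'kill']]

n_docs = len(docs)
# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in docs for term in set(doc))
# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}


def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    weights = {t: counts[t] * idf[t] for t in counts}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


print(tfidf(docs[0]))
```

'punk' occurs in only one document, so it receives a higher weight than 'kill', which occurs in two.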

In [160]:
model = LogisticRegression(random_state=0)
model.fit(features, train_ds['IsToxic'])

Let's see what the model predicts for a single sample

In [161]:
model.predict(features[0])

array([ True])

The model predicted that the first training comment is toxic.

In [163]:
train_ds['Text'].iloc[0]

'Dumb ass people run over them all 😡'

The prediction is correct, although this comment was part of the training data, so this is only a sanity check

In [165]:
model_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(tokenizer=lambda x: tokenize_sentence(x, remove_stop_words = True))),
    ('model', LogisticRegression(random_state=0))
])

In [166]:
model_pipeline.fit(train_ds['Text'], train_ds['IsToxic'])



Let's test the model

In [168]:
model_pipeline.predict(['He is a bad person. He should kill himself!!!'])

array([ True])

The prediction was correct

In [170]:
model_pipeline.predict(['He is a good person. I wish him all the best!!!'])

array([False])

The prediction was correct
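A hard True/False label hides how confident the model is. A fitted pipeline that ends in LogisticRegression also exposes predict_proba. The sketch below uses a tiny made-up training set and the default tokenizer, purely so that it runs on its own; toy_pipeline is a hypothetical name and is not the notebook's model_pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up training set, only to make the example runnable.
texts = ['you are an idiot', 'what an idiot', 'have a nice day', 'nice video, thanks']
labels = [True, True, False, False]

toy_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('model', LogisticRegression(random_state=0)),
])
toy_pipeline.fit(texts, labels)

# One row per input text; columns follow toy_pipeline.classes_ (False, then True).
proba = toy_pipeline.predict_proba(['what a nice idiot'])
print(toy_pipeline.classes_, proba)
```

With the notebook's own pipeline the call would be model_pipeline.predict_proba([...]), and the second column would be the estimated probability that a comment is toxic.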

Evaluate the model on the test set using precision and recall

In [172]:
precision_score(y_true=test_ds['IsToxic'], y_pred = model_pipeline.predict(test_ds['Text']))

0.7763157894736842

In [173]:
recall_score(y_true=test_ds['IsToxic'], y_pred = model_pipeline.predict(test_ds['Text']))

0.6020408163265306
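Both scores can be traced back to confusion-matrix counts. The counts below are not printed by the notebook. They are inferred from the two reported scores and the 98 toxic comments in the test set (59/76 ≈ 0.776 and 59/98 ≈ 0.602), so treat them as an assumption:

```python
# Counts inferred from the reported scores and the 98 toxic test comments;
# the notebook itself does not print them.
tp, fp, fn = 59, 17, 39

precision = tp / (tp + fp)  # share of predicted-toxic comments that are toxic
recall = tp / (tp + fn)     # share of toxic comments that the model found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Under these counts the model flags 76 comments as toxic, 17 of them wrongly, and misses 39 of the 98 toxic ones.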