**DESCRIPTION**

Using NLP and ML, build a model that identifies hate speech (racist or sexist tweets) on Twitter.

**Problem Statement**:  

Twitter is a major platform where anyone can have their views heard. Some of these voices spread hate and negativity, and Twitter is wary of its platform being used as a medium to spread hate.

You are a data scientist at Twitter, and you will help Twitter identify tweets that contain hate speech so they can be removed from the platform. You will use NLP techniques, perform cleanup specific to tweet data, and build a robust model.

**Domain: Social Media**

Analysis to be done: Clean up the tweets and build a classification model using NLP techniques, cleanup specific to tweet data, regularization, and hyperparameter tuning with stratified k-fold cross-validation to find the best model.

In [127]:
import pandas as pd	
from collections import Counter 
import re
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
from nltk import word_tokenize
from nltk import pos_tag
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV 
from sklearn.metrics import f1_score
from sklearn import metrics

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [24]:
# Load the tweets file using read_csv function from Pandas package. 
data = pd.read_csv('TwitterHate.csv')
data

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation
...,...,...,...
31957,31958,0,ate @user isz that youuu?...
31958,31959,0,to see nina turner on the airwaves trying to...
31959,31960,0,listening to sad songs on a monday morning otw...
31960,31961,1,"@user #sikh #temple vandalised in in #calgary,..."


In [25]:
# Get the tweets into a list for easy text cleanup and manipulation.
tweet = data['tweet'].tolist()
tweet[:1]

[' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run']

In [26]:
# Normalize the casing: tokenize every tweet, then print the first tweet's tokens in lowercase.
tokenized_sents = [word_tokenize(i) for i in tweet]
for word in tokenized_sents[0]:
    print(word.lower())

@
user
when
a
father
is
dysfunctional
and
is
so
selfish
he
drags
his
kids
into
his
dysfunction
.
#
run


In [27]:
# Remove user handles, which begin with '@'.
# Note: word_tokenize splits '@user' into '@' and 'user', so this filter drops only the '@' token.
tokenized_sents = [word_tokenize(i) for i in tweet]
for word in tokenized_sents[0]:
    if not word.startswith('@'):
        print(word)

user
when
a
father
is
dysfunctional
and
is
so
selfish
he
drags
his
kids
into
his
dysfunction
.
#
run
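
Because `word_tokenize` splits `@user` into `@` and `user`, filtering on tokens that start with `@` leaves the handle text behind. One possible alternative, sketched here as an assumption rather than the notebook's own method, is to strip handles from the raw string with a regular expression before tokenizing. The helper name `remove_handles` and the sample string are hypothetical.

```python
import re

# Hypothetical helper: a handle is taken to be '@' followed by word characters.
HANDLE_RE = re.compile(r"@\w+")

def remove_handles(text):
    """Return text with @handles removed and runs of whitespace collapsed."""
    without_handles = HANDLE_RE.sub("", text)
    return " ".join(without_handles.split())

sample = " @user when a father is dysfunctional   #run"
print(remove_handles(sample))
```

Applied to the whole corpus, this would look like `tweet = [remove_handles(t) for t in tweet]`. Note that this simple pattern would also clip the domain part of an e-mail address, which may or may not matter for tweet data.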


In [32]:
# Using regular expressions, remove URLs (any token text starting with 'http').
# Tokens that begin with '@' are skipped as well.
tokenized_sents = [word_tokenize(i) for i in tweet]
for word in tokenized_sents[0]:
    if not word.startswith('@'):
        print(re.sub(r"http\S+", "", word))

user
when
a
father
is
dysfunctional
and
is
so
selfish
he
drags
his
kids
into
his
dysfunction
.
#
run
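
A tokenizer may split a URL into several tokens, in which case a per-token `http\S+` substitution would miss some of the pieces. A minimal sketch of the same idea applied to the raw tweet string is shown below. The helper name `remove_urls`, the extra `www.` alternative, and the example URL are assumptions added for illustration.

```python
import re

# Hypothetical helper: match http(s) links and bare www links up to the next whitespace.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")

def remove_urls(text):
    """Return text with URLs removed and runs of whitespace collapsed."""
    cleaned = URL_RE.sub("", text)
    return " ".join(cleaned.split())

print(remove_urls("read this https://t.co/abc123 now"))
```

Running this before tokenization keeps the later token-level steps from having to deal with URL fragments.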


In [34]:
# Using TweetTokenizer from NLTK, tokenize the tweets into individual terms.
from nltk.tokenize import TweetTokenizer 
tk = TweetTokenizer() 
geek = tk.tokenize(data['tweet'].tolist()[0])  
geek

['@user',
 'when',
 'a',
 'father',
 'is',
 'dysfunctional',
 'and',
 'is',
 'so',
 'selfish',
 'he',
 'drags',
 'his',
 'kids',
 'into',
 'his',
 'dysfunction',
 '.',
 '#run']

In [35]:
# Remove stop words (English list, built once as a set for fast lookups).
stop_words = set(stopwords.words('english'))
tokens_without = [word for word in tokenized_sents[0] if word not in stop_words]
tokens_without

['@',
 'user',
 'father',
 'dysfunctional',
 'selfish',
 'drags',
 'kids',
 'dysfunction',
 '.',
 '#',
 'run']

In [42]:
# Remove redundant terms like 'amp' and 'rt'.
tokenized_sents = [word_tokenize(i) for i in tweet]
for word in tokenized_sents[0]:
    if word not in ('amp', 'rt'):
        print(word)

@
user
when
a
father
is
dysfunctional
and
is
so
selfish
he
drags
his
kids
into
his
dysfunction
.
#
run


In [43]:
# Remove '#' symbols from the tweet while retaining the term.
# Tokens that begin with '@' are skipped as well; each '#' is replaced by a space.
tokenized_sents = [word_tokenize(i) for i in tweet]
for word in tokenized_sents[0]:
    if not word.startswith('@'):
        print(word.replace("#", " "))

user
when
a
father
is
dysfunctional
and
is
so
selfish
he
drags
his
kids
into
his
dysfunction
.
 
run


In [47]:
# Extra cleanup by removing terms with a length of 1.
tokenized_sents = [word_tokenize(i) for i in tweet]
for word in tokenized_sents[0]:
    if len(word) > 1:
        print(word)

user
when
father
is
dysfunctional
and
is
so
selfish
he
drags
his
kids
into
his
dysfunction
run
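
The token-level steps above (lowercasing, stripping `#`, dropping redundant terms, dropping one-character terms) can also be combined into a single pass over a token list. The following is only a sketch under stated assumptions: the function name `clean_tokens`, the `REDUNDANT_TERMS` set, and the sample tokens are hypothetical, and the input is assumed to come from a tokenizer that keeps hashtags together, such as `TweetTokenizer`.

```python
# Assumed set of redundant terms; extend it to suit the data.
REDUNDANT_TERMS = {"amp", "rt"}

def clean_tokens(tokens):
    """Lowercase tokens, strip leading '#', and drop redundant or one-character terms."""
    cleaned = []
    for token in tokens:
        term = token.lower().lstrip("#")
        if len(term) > 1 and term not in REDUNDANT_TERMS:
            cleaned.append(term)
    return cleaned

tokens = ["RT", "when", "a", "father", "&", "amp", ".", "#run"]
print(clean_tokens(tokens))
```

Returning a new list rather than printing makes the result reusable, for example `cleaned_sents = [clean_tokens(sent) for sent in tokenized_sents]`.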


In [52]:
# First, tokenize every tweet. tokenized_sents is a list of token lists, one per tweet (the first is shown).
tokenized_sents = [word_tokenize(i) for i in tweet]
tokenized_sents[0]

['@',
 'user',
 'when',
 'a',
 'father',
 'is',
 'dysfunctional',
 'and',
 'is',
 'so',
 'selfish',
 'he',
 'drags',
 'his',
 'kids',
 'into',
 'his',
 'dysfunction',
 '.',
 '#',
 'run']

In [69]:
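# Use Counter to find the 10 most common terms.
# Note: the Counter is rebuilt for every tweet, so most_occur ends up holding
# the counts for the last tweet only, not for the whole corpus.
tokenized_sents = [word_tokenize(i) for i in tweet]
for sent in tokenized_sents:
    word_count = Counter(sent)
    most_occur = word_count.most_common(10)
most_occur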

[('you', 2), ('thank', 1), ('@', 1), ('user', 1), ('for', 1), ('follow', 1)]
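
The output above reflects a single tweet, because the counter is recreated on every loop iteration. To count terms across the whole corpus, one option is to flatten the per-tweet token lists into one list and count once. This sketch uses a small made-up `tokenized_sents` as a stand-in for the real data, so the numbers it prints say nothing about the actual dataset.

```python
from collections import Counter
from itertools import chain

# Toy stand-in for tokenized_sents (a list of token lists); not the real data.
tokenized_sents = [
    ["thank", "you", "for", "you", "follow"],
    ["love", "you"],
    ["thank", "god"],
]

# Flatten the per-tweet lists into one long list of terms, then count once.
all_terms = list(chain.from_iterable(tokenized_sents))
term_counts = Counter(all_terms)

# most_common(n) returns the n most frequent (term, count) pairs.
most_common_terms = term_counts.most_common(2)
print(most_common_terms)
```

With the real `tokenized_sents`, `term_counts.most_common(10)` would give the ten most frequent terms over all tweets.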

In [111]:
# Perform train_test_split using sklearn.
X_train, X_test, y_train, y_test = train_test_split(data["tweet"], data["label"], test_size = 0.2, random_state = 0)

In [88]:
# Instantiate with a maximum of 5000 features (character 2- and 3-grams) in the vocabulary.
# token_pattern is omitted because it is ignored when analyzer='char'.
tfidf = TfidfVectorizer(analyzer='char', ngram_range=(2, 3), max_features=5000)
train_tfidf = tfidf.fit_transform(X_train)
test_tfidf = tfidf.transform(X_test)
test_tfidf

<6393x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 813994 stored elements in Compressed Sparse Row format>

In [98]:
# Model building: Ordinary Logistic Regression
clf = LogisticRegression(C=1.0)
clf.fit(train_tfidf, y_train)
pred = clf.predict(test_tfidf)
pred

array([0, 0, 0, ..., 0, 0, 0])

In [102]:
# Model evaluation: 5-fold cross-validated F1 score on the training set.
scores = cross_val_score(clf, train_tfidf, y_train, cv=5, scoring="f1")
scores

array([0.52427184, 0.53510436, 0.52895753, 0.52071006, 0.54166667])

In [113]:
# Use StratifiedKFold because of the class imbalance: each fold keeps roughly the same class ratio.
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for train_ix, test_ix in kfold.split(data["tweet"],data["label"]):
	train_X, test_X = data["tweet"][train_ix], data["tweet"][test_ix]
	train_y, test_y = data["label"][train_ix], data["label"][test_ix]
	train_0, train_1 = len(train_y[train_y==0]), len(train_y[train_y==1])
	test_0, test_1 = len(test_y[test_y==0]), len(test_y[test_y==1])
	print('>Train: 0=%d, 1=%d, Test: 0=%d, 1=%d' % (train_0, train_1, test_0, test_1))

>Train: 0=23776, 1=1793, Test: 0=5944, 1=449
>Train: 0=23776, 1=1793, Test: 0=5944, 1=449
>Train: 0=23776, 1=1794, Test: 0=5944, 1=448
>Train: 0=23776, 1=1794, Test: 0=5944, 1=448
>Train: 0=23776, 1=1794, Test: 0=5944, 1=448


In [114]:
# Class counts for a plain train/test split. It is not stratified, so the class ratios differ slightly.
trainX, testX, trainy, testy = train_test_split(data["tweet"], data["label"], test_size = 0.2, random_state = 0)
train_0, train_1 = len(trainy[trainy==0]), len(trainy[trainy==1])
test_0, test_1 = len(testy[testy==0]), len(testy[testy==1])
print('>Train: 0=%d, 1=%d, Test: 0=%d, 1=%d' % (train_0, train_1, test_0, test_1))

>Train: 0=23735, 1=1834, Test: 0=5985, 1=408


In [121]:
# Hyperparameter tuning: grid search over class weights, scored by ROC AUC with stratified 5-fold CV.
balance = [{0:100,1:1}, {0:10,1:1}, {0:1,1:1}, {0:1,1:10}, {0:1,1:100}]
param_grid = dict(class_weight=balance)
grid = GridSearchCV(estimator=clf, param_grid=param_grid, n_jobs=-1, cv=kfold, scoring='roc_auc')
grid_result = grid.fit(train_tfidf, y_train)
# report the best configuration
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
# report all configurations
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.956948 using {'class_weight': {0: 1, 1: 10}}
0.945012 (0.005574) with: {'class_weight': {0: 100, 1: 1}}
0.947798 (0.005304) with: {'class_weight': {0: 10, 1: 1}}
0.948890 (0.005000) with: {'class_weight': {0: 1, 1: 1}}
0.956948 (0.003562) with: {'class_weight': {0: 1, 1: 10}}
0.956823 (0.003340) with: {'class_weight': {0: 1, 1: 100}}
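
For comparison with the hand-picked grid, the worked arithmetic below computes the weights that scikit-learn's `class_weight='balanced'` heuristic would assign, using the formula `n_samples / (n_classes * count_of_class)`. The class counts are the training-split counts printed earlier in the notebook. Treat this as a sanity check on the grid, not as a tuned result.

```python
# Class counts of the training split, as printed by the notebook.
class_counts = {0: 23735, 1: 1834}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# 'balanced' heuristic: n_samples / (n_classes * count_of_class)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

# Relative weight of the minority class versus the majority class.
ratio = balanced_weights[1] / balanced_weights[0]
print(balanced_weights)
print(round(ratio, 2))
```

The resulting ratio is close to 13 to 1, which is in the same range as the `{0: 1, 1: 10}` setting that the grid search selected.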


In [123]:
# What are the best parameters?
grid_result.best_estimator_

LogisticRegression(C=1.0, class_weight={0: 1, 1: 10}, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [125]:
# Predict and evaluate using the best estimator.
prediction = grid_result.best_estimator_.predict(test_tfidf)
prediction

array([0, 0, 0, ..., 0, 1, 0])

In [124]:
# What is the 5-fold cross-validated F1 score of the best estimator on the training set?
scores = cross_val_score(grid_result.best_estimator_, train_tfidf, y_train, cv=5, scoring="f1")
scores

array([0.63522013, 0.65274725, 0.63921993, 0.62582057, 0.61636557])

In [126]:
# What is the f_1 score?
score = f1_score(y_test, prediction, average='binary')
print('F-Measure: %.3f' % score)

F-Measure: 0.623


In [129]:
# Compute the same F1 score with metrics.f1_score (unrounded).
print(metrics.f1_score(y_test, prediction))

0.6227106227106227


In [131]:
# Evaluate the predictions on the test set: accuracy, precision, and recall.
accuracy = metrics.accuracy_score(y_test, prediction)
print('Accuracy: %f' % accuracy)
precision = metrics.precision_score(y_test, prediction)
print('Precision: %f' % precision)
recall = metrics.recall_score(y_test, prediction)
print('Recall: %f' % recall)

Accuracy: 0.935555
Precision: 0.497076
Recall: 0.833333
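
As a quick consistency check, the F1 score is the harmonic mean of precision and recall, so the value reported earlier can be recovered from the two numbers printed above. The inputs below are copied from the notebook output.

```python
# Precision and recall on the test set, as printed by the notebook.
precision = 0.497076
recall = 0.833333

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))
```

The result agrees with the F-measure of 0.623 reported by `f1_score`. Precision near 0.5 means about half of the tweets flagged as hate speech are false positives, which is the cost of weighting the minority class more heavily to raise recall.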
