The goal of this analysis is to experiment with the application of NLP algorithms. The data set used in this project consists of tweets labelled for detecting trolls on Twitter.

Let us start by importing some of the libraries which will prove useful in this analysis.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing

Next come the ML libraries needed to set up and run the NLP pipeline.

In [2]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import nltk
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

I will now read the dataset. Note that it is stored as a JSON file with one record per line (hence lines=True), so it is worth inspecting its structure first. Some fields may require further manipulation later on in the project.

In [3]:
ds = pd.read_json('Dataset for Detection of Cyber-Trolls.json', lines=True)
ds.head()

Unnamed: 0,annotation,content,extras
0,"{'notes': '', 'label': ['1']}",Get fucking real dude.,
1,"{'notes': '', 'label': ['1']}",She is as dirty as they come and that crook ...,
2,"{'notes': '', 'label': ['1']}",why did you fuck it up. I could do it all day...,
3,"{'notes': '', 'label': ['1']}",Dude they dont finish enclosing the fucking s...,
4,"{'notes': '', 'label': ['1']}",WTF are you talking about Men? No men thats n...,


The annotation column holds the label for each tweet in the content column: 0 means "not negative" and 1 means "negative" in terms of the sentiment behind the tweet.

What is required now is to develop a 'bag of words' for the NLP, that is, a representation of each tweet as counts of the words it contains, which makes the content efficient to analyse. The first step is to construct a corpus of cleaned tweets.

In [5]:
corpus = []
ps = PorterStemmer()                                        #Creating the stemmer once, outside the loop
stop_words = set(stopwords.words('english'))                #Building the stop-word set once (needs nltk.download('stopwords'))

for i in range(0, len(ds)):                                 #Iterating over each tweet
    review = re.sub('[^a-zA-Z]', ' ', ds['content'][i])     #Replacing every non-letter character with a space
    review = review.lower()                                 #Converting everything to lower case
    review = review.split()                                 #Splitting the tweet into a list of words
    review = [ps.stem(word) for word in review if word not in stop_words]   #Dropping stop words and stemming the rest
    review = ' '.join(review)                               #Joining the words back into a single string
    corpus.append(review)                                   #Adding the cleaned tweet to the corpus

We will not view the corpus for the sake of simplicity of this document, as it will essentially just be a massive amount of text.
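
To give a feel for what one corpus entry looks like, here is a minimal sketch of the cleaning step applied to a single made-up sentence. It uses only the standard library: the small `STOP_WORDS` set and the `clean` helper are hypothetical stand-ins for NLTK's English stop-word list and the loop above, and stemming is left out.

```python
import re

# Hypothetical, hand-picked stop words for illustration only;
# the notebook uses NLTK's full English list instead.
STOP_WORDS = {"are", "you", "about", "no", "thats", "not", "the", "a", "is"}

def clean(text, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, and drop stop words (no stemming)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)

example = "WTF are you talking about, Men? No 2nd chances!"
print(clean(example))  # wtf talking men nd chances
```

Note that stripping non-letters turns "2nd" into the fragment "nd", so this kind of cleaning can leave a few meaningless tokens in the corpus.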

In [6]:
cv = CountVectorizer(max_features = len(ds))
X = cv.fit_transform(corpus).toarray()
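
Conceptually, the matrix X has one row per tweet and one column per vocabulary word, with each cell holding a word count. The sketch below rebuilds that idea in plain Python on a toy corpus; `bag_of_words` is a hypothetical helper for illustration, not how CountVectorizer is implemented.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for whitespace-tokenised documents."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocab])
    return vocab, rows

toy_corpus = ["get real dude", "dude dude talk", "real talk"]
vocab, matrix = bag_of_words(toy_corpus)
print(vocab)        # ['dude', 'get', 'real', 'talk']
for row in matrix:
    print(row)      # [1, 1, 1, 0] then [2, 0, 0, 1] then [0, 0, 1, 1]
```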

In [7]:
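y = []
for i in range(0, len(ds)):
    y.append(ds.annotation[i]['label'][0])   #Each label is a one-element list such as ['1']; keep the string itself so y is one-dimensional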

It's now time to split the data into a training set and a test set.

In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [9]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)


GaussianNB(priors=None, var_smoothing=1e-09)

In [10]:
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

In [11]:
cm

array([[1112, 1934],
       [  37, 1918]])

The model almost never misses a 'negative' statement (1918 of 1955 are caught), but it wrongly flags 1934 of the 3046 'not negative' statements as 'negative'.

In [12]:
total=sum(sum(cm))

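# cm rows are true labels and columns are predicted labels,
# so dividing by a column total gives precision, not sensitivity or specificity.
precision_0 = cm[0,0]/(cm[0,0]+cm[1,0])
print('Precision (class 0) : ', precision_0)

precision_1 = cm[1,1]/(cm[1,1]+cm[0,1])
print('Precision (class 1) : ', precision_1)

Precision (class 0) :  0.9677980852915579
Precision (class 1) :  0.4979231568016615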
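
scikit-learn's confusion_matrix puts true labels on the rows and predicted labels on the columns, so dividing by a column total yields precision rather than recall. A minimal sketch of sensitivity and specificity under that layout, assuming class 1 ('negative') is treated as the positive class and copying the matrix printed above:

```python
# Confusion matrix copied from the GaussianNB output above:
# rows are true labels, columns are predicted labels (scikit-learn's layout).
cm = [[1112, 1934],
      [37, 1918]]

tn, fp = cm[0]   # true class 0: predicted 0, predicted 1
fn, tp = cm[1]   # true class 1: predicted 0, predicted 1

sensitivity = tp / (tp + fn)   # share of true 'negative' tweets that are caught
specificity = tn / (tn + fp)   # share of true 'not negative' tweets left alone
print(f"Sensitivity: {sensitivity:.4f}")   # Sensitivity: 0.9811
print(f"Specificity: {specificity:.4f}")   # Specificity: 0.3651
```

These values correspond to the per-class recall of the model: high for class 1, low for class 0.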


It would be useful to see a more detailed report of the model's performance.

In [13]:
from sklearn.metrics import classification_report, accuracy_score
print(classification_report(y_test,y_pred))   #Results
print(accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.37      0.53      3046
           1       0.50      0.98      0.66      1955

   micro avg       0.61      0.61      0.61      5001
   macro avg       0.73      0.67      0.60      5001
weighted avg       0.78      0.61      0.58      5001

0.6058788242351529


The overall accuracy of 0.6059 is no better than always predicting the majority class (3046 of the 5001 test tweets are 'not negative'), so there is clear room for improvement, especially in the model's ability to predict accurately whether a statement is 'not negative'.

Applying deep learning to decide whether a tweet is 'negative' or 'not negative' would probably give much better results. Our model, as is, does not pick up on instances of sarcasm and is merely trained on the bag of words.

My next challenge would be to analyse the problem with more powerful deep learning techniques, such as an ANN, but for now I will try a different approach within Naive Bayes: fitting a Multinomial Naive Bayes model.
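
One way to put the 0.6059 accuracy in context is to compare it with a trivial baseline that always predicts the most frequent class. A small sketch, using the class supports from the classification report above (the variable names are illustrative):

```python
# Class supports taken from the classification report above.
support = {"0": 3046, "1": 1955}

total = sum(support.values())
majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / total
print(majority_label, round(baseline_accuracy, 4))   # 0 0.6091
```

A classifier needs to beat this figure before its accuracy says much about what it has learned.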

In [14]:
from sklearn.naive_bayes import MultinomialNB

In [15]:
classifier = MultinomialNB()   #Using the Naive Bayes algorithm (a common method in NLP)
classifier.fit(X_train,y_train)      #training the model
y_pred = classifier.predict(X_test)  #Predicting our test label


In [16]:
print(classification_report(y_test,y_pred))   #Results
print(accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.80      0.80      3046
           1       0.69      0.70      0.70      1955

   micro avg       0.76      0.76      0.76      5001
   macro avg       0.75      0.75      0.75      5001
weighted avg       0.76      0.76      0.76      5001

0.7618476304739052


In [17]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

In [18]:
cm

array([[2442,  604],
       [ 587, 1368]])

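Switching from Gaussian Naive Bayes to Multinomial Naive Bayes raises accuracy from 0.61 to 0.76. This is plausible because the features are word counts, which the multinomial model is designed for, whereas the Gaussian model assumes continuous, normally distributed features. It will be interesting to compare the accuracy of Multinomial Naive Bayes with that of deep learning approaches.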
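
The two confusion matrices also show where the change in performance comes from. A small sketch in plain Python, using the matrices printed above and assuming scikit-learn's layout of true labels on rows (the helper name `summarise` is hypothetical):

```python
def summarise(cm):
    """Return accuracy and per-class recall for a 2x2 confusion matrix
    laid out as rows = true labels, columns = predicted labels."""
    total = sum(sum(row) for row in cm)
    accuracy = (cm[0][0] + cm[1][1]) / total
    recall_0 = cm[0][0] / sum(cm[0])
    recall_1 = cm[1][1] / sum(cm[1])
    return accuracy, recall_0, recall_1

gaussian = [[1112, 1934], [37, 1918]]
multinomial = [[2442, 604], [587, 1368]]

for name, cm in [("GaussianNB", gaussian), ("MultinomialNB", multinomial)]:
    acc, r0, r1 = summarise(cm)
    print(f"{name}: accuracy={acc:.2f}, recall_0={r0:.2f}, recall_1={r1:.2f}")
```

The gain comes from class 0 recall rising from about 0.37 to 0.80, at the cost of class 1 recall dropping from about 0.98 to 0.70.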