# Ex2.1 Categorizing Reviews with an FNN

In [1]:
import time
#Ignore the next statement -- it is just to estimate how long the exercise takes
start = time.perf_counter()

We shall use a neural network to categorize user reviews of articles in Wikipedia. The aim is to identify the reviews which contain personal attacks.

The dataset we will use includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it contains a personal attack.

In [2]:
import pandas as pd
import re
import urllib
import sklearn
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/student/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/student/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Loading the data
There are two files, one with the comments and another with annotations made by reviewers as to whether the comments contain personal attacks.

In [3]:
comments = pd.read_csv('attack_annotated_comments.tsv', sep = '\t', index_col = 0)
annotations = pd.read_csv('attack_annotations.tsv',  sep = '\t')

## Examining the data
First we look at the comments dataframe.

In [4]:
comments.columns

Index(['comment', 'year', 'logged_in', 'ns', 'sample', 'split'], dtype='object')

In [5]:
comments.head()

rev_id,comment,year,logged_in,ns,sample,split
37675,`-NEWLINE_TOKENThis is not ``creative``. Thos...,2002,False,article,random,train
44816,`NEWLINE_TOKENNEWLINE_TOKEN:: the term ``stand...,2002,False,article,random,train
49851,"NEWLINE_TOKENNEWLINE_TOKENTrue or false, the s...",2002,False,article,random,train
89320,"Next, maybe you could work on being less cond...",2002,True,article,random,dev
93890,This page will need disambiguation.,2002,True,article,random,train


The first column is the review ID. Each user review of an article has a rev_id. The other column of interest is the comment column. It will need a bit of cleaning. The other columns are irrelevant to our purpose.

We now look at the annotations dataframe.

In [6]:
annotations.columns

Index(['rev_id', 'worker_id', 'quoting_attack', 'recipient_attack',
       'third_party_attack', 'other_attack', 'attack'],
      dtype='object')

In [7]:
annotations.head()

,rev_id,worker_id,quoting_attack,recipient_attack,third_party_attack,other_attack,attack
0,37675,1362,0.0,0.0,0.0,0.0,0.0
1,37675,2408,0.0,0.0,0.0,0.0,0.0
2,37675,1493,0.0,0.0,0.0,0.0,0.0
3,37675,1439,0.0,0.0,0.0,0.0,0.0
4,37675,170,0.0,0.0,0.0,0.0,0.0


Each comment was given to multiple "workers", who scored it for various types of attack. The results are in the second dataset, and rev_id is the link between the two datasets. A `1` means that the worker considered the comment to be an attack. The last column, "attack", will be `1` if any of the columns for specific types of attack is `1`.

Let's find some records where the attack column has a 1.

In [8]:
annotations[annotations["attack"]==1.0]

,rev_id,worker_id,quoting_attack,recipient_attack,third_party_attack,other_attack,attack
33,89320,3341,0.0,1.0,0.0,0.0,1.0
35,89320,3338,0.0,1.0,0.0,0.0,1.0
36,89320,2101,0.0,0.0,0.0,1.0,1.0
37,89320,673,0.0,0.0,0.0,1.0,1.0
127,155243,214,0.0,0.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...
1365161,699822249,3301,0.0,1.0,0.0,0.0,1.0
1365173,699848324,3512,0.0,0.0,1.0,0.0,1.0
1365187,699851288,2231,0.0,1.0,0.0,0.0,1.0
1365200,699891012,1553,0.0,1.0,0.0,0.0,1.0


Consider, for example, the comment with rev_id = 89320.

Several workers, the ones with worker_id 3341, 3338, 2101 and 673, thought it contained some kind of personal attack.

In [9]:
comments.loc[89320]["comment"]

' Next, maybe you could work on being less condescending with your suggestions about reading the naming conventions and FDL, both of which I read quite a while ago, thanks. I really liked the bit where you were explaining why you had no interest in fixing things I complained about because you felt insulted, yet you were being extremely insulting at the time. With any luck, you can learn to be less of a jerk. GregLindahlNEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKEN '

We can look at the results from all the workers who scored this comment.

In [10]:
annotations[annotations["rev_id"]==89320]

,rev_id,worker_id,quoting_attack,recipient_attack,third_party_attack,other_attack,attack
29,89320,3307,0.0,0.0,0.0,0.0,0.0
30,89320,4000,0.0,0.0,0.0,0.0,0.0
31,89320,3262,0.0,0.0,0.0,0.0,0.0
32,89320,3376,0.0,0.0,0.0,0.0,0.0
33,89320,3341,0.0,1.0,0.0,0.0,1.0
34,89320,3340,0.0,0.0,0.0,0.0,0.0
35,89320,3338,0.0,1.0,0.0,0.0,1.0
36,89320,2101,0.0,0.0,0.0,1.0,1.0
37,89320,673,0.0,0.0,0.0,1.0,1.0


We shall consider a comment to be an attack if its mean score in the attack column is above 0.5. We create a boolean Series called `label`: we group the annotations by rev_id, and if the mean of the attack column is above 0.5 the label is true, otherwise it is false. Joining that Series to the comments dataset adds it as a column named attack.
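To make the rule concrete, the sketch below applies it by hand to the nine annotations shown above for rev_id 89320. It uses plain Python rather than pandas, and the variable names are illustrative only.

```python
# The "attack" scores the nine workers gave to rev_id 89320 (copied from the table above).
attack_scores = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0]

mean_score = sum(attack_scores) / len(attack_scores)   # 4 / 9
is_attack = mean_score > 0.5

print(round(mean_score, 3), is_attack)   # 0.444 False
```

Only four of the nine workers flagged the comment, so its mean stays below the threshold and it is labeled False, even though some readers saw an attack in it.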

In [11]:
label = annotations.groupby('rev_id')['attack'].mean() > 0.5

In [12]:
label

rev_id
37675        False
44816        False
49851        False
89320        False
93890        False
             ...  
699848324    False
699851288    False
699857133    False
699891012    False
699897151    False
Name: attack, Length: 115864, dtype: bool

We join comments and labels.

In [13]:
comments = comments.join(label)

In [14]:
comments.head(10)

rev_id,comment,year,logged_in,ns,sample,split,attack
37675,`-NEWLINE_TOKENThis is not ``creative``. Thos...,2002,False,article,random,train,False
44816,`NEWLINE_TOKENNEWLINE_TOKEN:: the term ``stand...,2002,False,article,random,train,False
49851,"NEWLINE_TOKENNEWLINE_TOKENTrue or false, the s...",2002,False,article,random,train,False
89320,"Next, maybe you could work on being less cond...",2002,True,article,random,dev,False
93890,This page will need disambiguation.,2002,True,article,random,train,False
102817,NEWLINE_TOKEN-NEWLINE_TOKENNEWLINE_TOKENImport...,2002,True,user,random,train,False
103624,I removed the following:NEWLINE_TOKENNEWLINE_T...,2002,True,article,random,train,False
111032,`:If you ever claimed in a Judaic studies prog...,2002,True,article,random,dev,False
120283,NEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENMy apol...,2002,True,article,random,dev,False
128532,"`Someone wrote:NEWLINE_TOKENMore recognizable,...",2002,True,article,random,train,False


In [15]:
# Skip this cell if you would prefer not to read some rather unpleasant comments!
comments[comments["attack"] == True]

rev_id,comment,year,logged_in,ns,sample,split,attack
801279,Iraq is not good ===NEWLINE_TOKENNEWLINE_TO...,2003,False,article,random,train,True
2702703,NEWLINE_TOKENNEWLINE_TOKEN____NEWLINE_TOKENfuc...,2004,False,user,random,train,True
4632658,"i have a dick, its bigger than yours! hahaha",2004,False,article,blocked,train,True
6545332,NEWLINE_TOKENNEWLINE_TOKEN== renault ==NEWLINE...,2004,True,user,blocked,train,True
6545351,NEWLINE_TOKENNEWLINE_TOKEN== renault ==NEWLINE...,2004,True,user,blocked,test,True
...,...,...,...,...,...,...,...
699645524,Brandon Semenuk has won the event four times ...,2016,True,user,blocked,train,True
699659494,im soory since when is google images not allow...,2016,True,user,blocked,dev,True
699660419,what ever you fuggin fagNEWLINE_TOKENQuestion ...,2016,True,user,blocked,test,True
699661020,NEWLINE_TOKENNEWLINE_TOKEN== Nice try but no c...,2016,True,user,blocked,train,True


## Preprocessing the data

We remove the "NEWLINE_TOKEN" and "TAB_TOKEN" substrings in the comments as well as the punctuation. We shall lower case the words and remove stop words and numbers.
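The cleaning steps can be tried on a single string first. The sketch below is a minimal illustration, assuming the steps run in the order listed; `clean_comment` is a hypothetical helper name, not part of the exercise.

```python
import re

def clean_comment(text):
    # Remove the placeholder tokens first: they are upper case and contain "_"
    # (which \w matches), so lower-casing or punctuation removal would not catch them.
    text = text.replace("NEWLINE_TOKEN", "").replace("TAB_TOKEN", "")
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)   # drop punctuation
    text = re.sub(r"\d+", "", text)       # drop digits
    return text

print(repr(clean_comment("NEWLINE_TOKENTrue or false, as of March 2002!")))
# 'true or false as of march '
```

Note that removing characters does not collapse the surrounding whitespace, which is why extra spaces remain in the cleaned comments.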

In [16]:
comments['comment'].head()

rev_id
37675    `-NEWLINE_TOKENThis is not ``creative``.  Thos...
44816    `NEWLINE_TOKENNEWLINE_TOKEN:: the term ``stand...
49851    NEWLINE_TOKENNEWLINE_TOKENTrue or false, the s...
89320     Next, maybe you could work on being less cond...
93890                 This page will need disambiguation. 
Name: comment, dtype: object

In [17]:
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", ""))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", ""))
comments['comment'] = comments['comment'].apply(lambda x: x.lower())
comments['comment'] = comments['comment'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
comments['comment'] = comments['comment'].apply(lambda x: re.sub(r'\d+', '', x))

In [18]:
comments['comment'].head(20)

rev_id
37675     this is not creative  those are the dictionary...
44816      the term standard model is itself less npov t...
49851     true or false the situation as of march  was s...
89320      next maybe you could work on being less conde...
93890                   this page will need disambiguation 
102817    important note for all sysops there is a bug i...
103624    i removed the followingall names of early poli...
111032    if you ever claimed in a judaic studies progra...
120283    my apologies  im english i watch cricket i kno...
128532    someone wrotemore recognizable perhaps is a ty...
133562    correct full biographical details will put dow...
138117    care should be taken to distinguish when and i...
155243    if i may butt in  ive spent the last  hour fol...
177310    on my  you will find the apology that i owe yo...
192579    i fail to see the distinction  who better than...
201190                        gets far more tendentious yet
208009    as a person who has don

In [19]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))

In [20]:
def remove_stop_words(comment):
    # Tokenize, then keep only the tokens that are not stop words.
    # The comparison is case-sensitive: the comments were lower-cased earlier,
    # so they match the lower-case NLTK stop word list.
    word_tokens = word_tokenize(comment)
    filtered_comment = ""
    for w in word_tokens:
        if w not in stop_words:
            filtered_comment = filtered_comment + " " + w
    return filtered_comment

In [21]:
remove_stop_words("This is a test comment")

' This test comment'
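The word "This" survives in the output above because the stop word list is lower case and the membership test is case-sensitive. That is harmless here, since the comments were lower-cased beforehand. A case-insensitive variant might look like the sketch below; the function name and the three-word stop list are hypothetical, and a plain `split()` stands in for `word_tokenize`.

```python
# Tiny hypothetical stop word list, for illustration only.
STOP_WORDS = {"this", "is", "a"}

def remove_stop_words_ci(comment, stop_words=STOP_WORDS):
    # Compare the lower-cased token, but keep the token's original spelling.
    kept = [w for w in comment.split() if w.lower() not in stop_words]
    return " ".join(kept)

print(remove_stop_words_ci("This is a test comment"))   # test comment
```

Joining with `" ".join` also avoids the leading space that string concatenation in a loop produces.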

In [22]:
%%time
# This takes about 30 seconds
comments['comment'] = comments['comment'].apply(lambda x: remove_stop_words(x))

CPU times: user 28.8 s, sys: 4.56 ms, total: 28.8 s
Wall time: 28.8 s


In [23]:
comments['comment'].head(20)

rev_id
37675      creative dictionary definitions terms insuran...
44816      term standard model less npov think wed prefe...
49851      true false situation march saudi proposal lan...
89320      next maybe could work less condescending sugg...
93890                              page need disambiguation
102817     important note sysops bug administrative move...
103624     removed followingall names early polish ruler...
111032     ever claimed judaic studies program ultraorth...
120283     apologies im english watch cricket know nothi...
128532     someone wrotemore recognizable perhaps type g...
133562     correct full biographical details put birth d...
138117     care taken distinguish definitions express sp...
155243     may butt ive spent last hour following andre ...
177310           find apology owe shuffles feet looks floor
192579     fail see distinction better legal scholars de...
201190                             gets far tendentious yet
208009     person done activity s

We only need the comment and attack columns.

In [24]:
df = pd.concat([comments["comment"],comments["attack"]], axis=1)

In [25]:
df.head()

rev_id,comment,attack
37675,creative dictionary definitions terms insuran...,False
44816,term standard model less npov think wed prefe...,False
49851,true false situation march saudi proposal lan...,False
89320,next maybe could work less condescending sugg...,False
93890,page need disambiguation,False


In [26]:
df.shape

(115864, 2)

In [27]:
df[df["attack"] == True].shape

(13590, 2)

## Splitting the data into a training set and a test set

In [28]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df['comment'], 
    df['attack'], 
    test_size = 0.2,
    random_state = 1278)

As a sanity check, print out the shapes of the dataframes.

In [29]:
print("Training features and labels")
print("X_train shape: ",X_train.shape)
print("y_train shape: ",y_train.shape)
print()
print("Testing features and labels")      
print("X_test shape: ",X_test.shape)
print("y_test shape: ",y_test.shape)

Training features and labels
X_train shape:  (92691,)
y_train shape:  (92691,)

Testing features and labels
X_test shape:  (23173,)
y_test shape:  (23173,)
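The shapes can also be checked by arithmetic. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows, which agrees with the shapes printed above.

```python
import math

n_samples = 115864          # rows in df
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # round the test share up
n_train = n_samples - n_test

print(n_train, n_test)   # 92691 23173
```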


Vectorize the features.

The features need to be expressed as vectors. We shall use CountVectorizer, which builds a vector for each document from the words it contains. With `binary = True` each entry records only whether a word is present (1) or absent (0), not how often it occurs. To avoid very long vectors we keep only the 5000 most frequent words as features (`max_features = 5000`). Stop words are removed first, as they are very frequent but carry little information about the document. We also drop rare words, those that appear in fewer than 3 documents (`min_df = 3`), and words that are too common, those present in more than 90% of the documents (`max_df = 0.9`).
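The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's implementation: the function names are hypothetical, tokenization is a bare `split()`, and words are ranked by the number of documents they occur in.

```python
from collections import Counter

def fit_binary_vocab(docs, min_df=1, max_df=1.0, max_features=None):
    """Return {word: column index} for a presence/absence bag of words."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs at least once.
    df = Counter(word for doc in docs for word in set(doc.split()))
    kept = [w for w, c in df.items() if c >= min_df and c / n_docs <= max_df]
    # Keep the most widespread words, then index them alphabetically.
    kept.sort(key=lambda w: (-df[w], w))
    kept = sorted(kept[:max_features])
    return {w: i for i, w in enumerate(kept)}

def to_binary_vector(doc, vocab):
    # 1 where the word occurs in the document, 0 elsewhere.
    vec = [0] * len(vocab)
    for word in set(doc.split()):
        if word in vocab:
            vec[vocab[word]] = 1
    return vec

docs = ["page need disambiguation", "page vandal", "vandal vandal page", "need sources"]
vocab = fit_binary_vocab(docs, min_df=2, max_df=0.9)
print(vocab)                                         # {'need': 0, 'page': 1, 'vandal': 2}
print(to_binary_vector("vandal vandal page", vocab)) # [0, 1, 1]
```

Words seen in only one document ("disambiguation", "sources") are dropped by `min_df=2`, and the repeated "vandal" still produces a single 1 because the vector is binary.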

In [30]:
from sklearn.feature_extraction.text import CountVectorizer

In [31]:
vectorizer = CountVectorizer(binary = True, 
                             stop_words = stopwords.words('english'), 
                             lowercase = True, 
                             min_df = 3, 
                             max_df = 0.9, 
                             max_features = 5000)
X_train_vectorized = vectorizer.fit_transform(X_train)

The matrix produced by CountVectorizer is sparse; `toarray()` converts it to a dense NumPy array.

In [32]:
print (X_train_vectorized.toarray().shape)
print(X_train_vectorized.toarray()[5,:])

(92691, 5000)
[0 0 0 ... 0 0 0]


Each one of the 5000 words being used to characterize the comments has an index. If the word is present in the document the value at that index will be a 1, otherwise it will be a 0. There are relatively few 1's so a sparse matrix is an efficient way to store the array.
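A back-of-the-envelope calculation shows why the sparse format matters. The sketch below estimates the size of the training matrix once `toarray()` makes it dense. It assumes 8 bytes per entry (a 64-bit integer); the real footprint depends on the dtype actually in use.

```python
n_rows, n_cols = 92691, 5000      # shape printed above
bytes_per_entry = 8               # assumption: 64-bit integers

dense_entries = n_rows * n_cols
dense_gib = dense_entries * bytes_per_entry / 2**30

print(f"{dense_entries:,} entries, about {dense_gib:.2f} GiB when dense")
```

Under that assumption the dense matrix needs several gigabytes, almost all of it spent on zeros, whereas the sparse matrix stores only the positions of the 1's.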

In [33]:
# These are the first 20 mappings of the form word => index.
print(list(vectorizer.vocabulary_.items())[:20])

[('july', 2433), ('vandalizing', 4722), ('pages', 3120), ('removing', 3679), ('sourced', 4128), ('content', 962), ('cases', 654), ('wholly', 4861), ('deleting', 1200), ('article', 308), ('check', 716), ('recent', 3581), ('history', 2092), ('also', 175), ('reported', 3697), ('one', 3049), ('thought', 4471), ('im', 2179), ('really', 3571), ('sure', 4341)]


Some of the words used:

In [34]:
vectorizer.get_feature_names_out()[1200:1250]
# Be warned as you explore that some words from media will be unpleasant

array(['deleting', 'deletion', 'deletions', 'deliberate', 'deliberately',
       'delivered', 'demand', 'demands', 'democracy', 'democratic',
       'demonstrate', 'demonstrated', 'demonstrates', 'denial', 'denied',
       'dennis', 'deny', 'denying', 'department', 'depending', 'depends',
       'depth', 'derived', 'derogatory', 'descendants', 'descent',
       'describe', 'described', 'describes', 'describing', 'description',
       'descriptions', 'descriptive', 'deserve', 'deserved', 'deserves',
       'design', 'designed', 'desire', 'desired', 'desist', 'desk',
       'desperate', 'despite', 'destroy', 'destroyed', 'destroying',
       'destruction', 'destructive', 'detail'], dtype=object)

## Defining the model (neural network)

In [35]:
from keras.models import Sequential
from keras.layers import Dense

# Sequential is a container for the other components.
# You add layers, in order, to an instance of Sequential.

nn = Sequential()

# The 5000 features plus a bias, which makes 5001 items, will be fed to a dense layer with 500 nodes.
# This layer calculates 5001 * 500 = 2,500,500 weights.

nn.add(Dense(units = 500, activation = 'relu', input_dim = len(vectorizer.get_feature_names_out())))

# You get an output from each node, 500 outputs in all.
# The 500 outputs of the hidden layer plus a bias go to the single node of the
# output layer. This makes 501 weights to calculate in this layer.

nn.add(Dense(units=1, activation='sigmoid'))
  
# Binary cross entropy is a popular loss function for binary type (yes/no) situations.
nn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

2023-09-05 02:26:07.462681: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-05 02:26:12.960648: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-05 02:26:12.964239: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
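For reference, binary cross-entropy scores a predicted probability p against a true label y (0 or 1) as -(y log p + (1 - y) log(1 - p)), averaged over the examples. The sketch below is a plain-Python illustration of that formula, not the Keras implementation; the clipping constant `eps` is an assumption added to avoid log(0).

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)   # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

good = binary_cross_entropy([1, 0], [0.9, 0.1])   # confident and right
bad = binary_cross_entropy([1, 0], [0.1, 0.9])    # confident and wrong
print(round(good, 4), round(bad, 4))
```

The loss is small when the predicted probability agrees with the label and grows quickly when a confident prediction is wrong.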


In [36]:
nn.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 500)               2500500   
                                                                 
 dense_1 (Dense)             (None, 1)                 501       
                                                                 
Total params: 2,501,001
Trainable params: 2,501,001
Non-trainable params: 0
_________________________________________________________________


This very small network has 2.5 million parameters to calculate.
Make sure there are no other kernels running; otherwise your kernel is likely to crash for lack of resources.
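The parameter counts in the summary can be reproduced by hand: a dense layer has one weight per input for every node, plus one bias per node. The helper below is a hypothetical illustration of that arithmetic.

```python
def dense_params(n_inputs, n_units):
    # One weight per input for every unit, plus one bias per unit.
    return (n_inputs + 1) * n_units

hidden = dense_params(5000, 500)   # first Dense layer
output = dense_params(500, 1)      # single sigmoid node

print(hidden, output, hidden + output)   # 2500500 501 2501001
```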

## Training the model

This takes a few minutes (up to about 7), depending on the resources available.

In [37]:
%%time
# Takes 3 to 7 min depending on resources available.
# The last 2000 rows of the training data are used for validation.
nn.fit(X_train_vectorized[:-2000].toarray(), y_train[:-2000], 
          epochs = 4, batch_size = 128, verbose = 1, 
          validation_data = (X_train_vectorized[-2000:].toarray(), y_train[-2000:]))

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
CPU times: user 6min 44s, sys: 4min 6s, total: 10min 50s
Wall time: 2min 36s


<keras.callbacks.History at 0x7f873c1e0e20>

## Evaluating the model performance

We prepare vectors for the test data set and use the `evaluate()` method to see how well the model performs on unseen data.

In [38]:
scores = nn.evaluate(vectorizer.transform(X_test).toarray(), y_test, verbose = 1)
# scores holds the loss followed by the compiled metrics. The value at position 1 is the accuracy.
print("Accuracy:", scores[1])

Accuracy: 0.9379018545150757
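Accuracy needs a baseline to be interpreted, because the classes are unbalanced. Using the counts printed earlier (13590 attacks among 115864 comments), the sketch below computes what a classifier that always answers "not an attack" would score. The figure is for the whole dataset; the test split may differ slightly.

```python
n_comments = 115864   # df.shape[0]
n_attacks = 13590     # rows where attack is True

attack_rate = n_attacks / n_comments
baseline_accuracy = 1 - attack_rate   # always predict "not an attack"

print(f"attack rate: {attack_rate:.3f}, always-negative accuracy: {baseline_accuracy:.3f}")
# attack rate: 0.117, always-negative accuracy: 0.883
```

The model's accuracy should therefore be read against roughly 88%, not against 50%.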


We try out the model with our own comment. We need to pre-process the new comment like we did the training comments. This is best put in a function or pipeline.

In [39]:
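def prepareData(comment):
    # String methods return new strings, so each result must be assigned back.
    comment = comment.replace("NEWLINE_TOKEN", "")
    comment = comment.replace("TAB_TOKEN", "")
    comment = comment.lower()
    comment = re.sub(r'[^\w\s]', '', comment)
    comment = re.sub(r'\d+', '', comment)
    return comment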

In [40]:
def comment_analysis(raw_comment):
    prepared_comment = pd.array([prepareData(raw_comment)])
    vectorized_comment = vectorizer.transform(prepared_comment)
    print("Input: ", raw_comment)
    print("Probability that it is a personal attack :", nn.predict(vectorized_comment))

In [41]:
comment_analysis("This is a terrible article. Whoever wrote it is a total fool")

Input:  This is a terrible article. Whoever wrote it is a total fool
Probability that it is a personal attack : [[0.9549549]]


In [42]:
comment_analysis("This is the best article on this topic. Thank you for writing it")

Input:  This is the best article on this topic. Thank you for writing it
Probability that it is a personal attack : [[0.00014433]]


*Try some comments of your own*

In [43]:
end = time.perf_counter()
print("Time taken in min:", (end - start)/60)

Time taken in min: 5.3006080187663125
