# Ex1.1 Categorizing Reviews with an FNN

In [None]:
import time
#Ignore the next statement -- it is just to estimate how long the exercise takes
start = time.perf_counter()

We shall use a neural network to categorize user reviews of articles in Wikipedia. The aim is to identify the reviews which contain personal attacks.

The dataset we will use includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it contains a personal attack.

In [None]:
import pandas as pd
import re
import urllib
import sklearn
import nltk
nltk.download('stopwords')
nltk.download('punkt')

## Loading the data
There are two files, one with the comments and another with annotations made by reviewers as to whether the comments contain personal attacks.

In [None]:
comments = pd.read_csv('attack_annotated_comments.tsv', sep = '\t', index_col = 0)
annotations = pd.read_csv('attack_annotations.tsv',  sep = '\t')

## Examining the data
First we look at the comments dataframe.

In [None]:
comments.columns

In [None]:
comments.head()

The first column is the review ID. Each user review of an article has a rev_id. The other column of interest is the comment column. It will need a bit of cleaning. The other columns are irrelevant to our purpose.

We now look at the annotations dataframe.

In [None]:
annotations.columns

In [None]:
annotations.head()

Each comment was given to multiple "workers", who scored it for various types of attack. The results are in the second dataset. The rev_id is the link between the two datasets. A `1` means that the worker considered the comment to be an attack. The last column, "attack", will be a `1` if any of the other columns for specific types of attack are `1`.

Let's find some records where the attack column has a 1.

In [None]:
annotations[annotations["attack"]==1.0]

Consider, for example, the comment with rev_id = 89320.

Several workers, the ones with worker_id 3341, 3338, 2101 and 673, thought it contained some kind of personal attack.

In [None]:
comments.loc[89320]["comment"]

We can look at the results from all the workers who scored this comment.

In [None]:
annotations[annotations["rev_id"]==89320]

We shall consider a comment to be an attack if its mean score in the attack column is above 0.5. We group the annotations by rev_id and create a boolean series called `label`: it is true if the mean of the attack column is above 0.5 and false otherwise.
The series keeps the name `attack`, so joining it to the comments dataset adds a boolean column called `attack`.
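
The thresholding rule can be tried out on a toy table first. This is a minimal sketch with made-up scores; the names `toy_annotations`, `toy_mean` and `toy_label` are illustrative and not part of the exercise.

```python
import pandas as pd

# Toy annotations (made-up values): three workers score each of two comments.
toy_annotations = pd.DataFrame({
    "rev_id": [1, 1, 1, 2, 2, 2],
    "attack": [1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
})

# Mean attack score per comment: 2/3 for rev_id 1 and 1/3 for rev_id 2.
toy_mean = toy_annotations.groupby("rev_id")["attack"].mean()

# A comment is labeled an attack only if more than half of its workers said so.
toy_label = toy_mean > 0.5
print(toy_label)
```

The result is indexed by rev_id, which is what allows it to be joined to a dataframe that also uses rev_id as its index.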

In [None]:
label = annotations.groupby('rev_id')['attack'].mean() > 0.5

In [None]:
label

We join comments and labels.

In [None]:
comments = comments.join(label)

In [None]:
comments.head(10)

In [None]:
# Skip this cell if you would rather not read rather unpleasant comments!
comments[comments["attack"] == True]

## Preprocessing the data

We remove the "NEWLINE_TOKEN" and "TAB_TOKEN" substrings in the comments as well as the punctuation. We shall lower case the words and remove stop words and numbers.
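
To see what these steps do, here is a small sketch that applies them to a single string. The raw comment is invented for illustration; the real cells below apply the same operations to the whole column.

```python
import re

# Hypothetical raw comment written in the style of the dataset.
raw = "NEWLINE_TOKEN You are WRONG!! TAB_TOKEN See section 42, please."

cleaned = raw.replace("NEWLINE_TOKEN", "").replace("TAB_TOKEN", "")
cleaned = cleaned.lower()                    # lowercase every word
cleaned = re.sub(r"[^\w\s]", "", cleaned)    # drop punctuation
cleaned = re.sub(r"\d+", "", cleaned)        # drop numbers
print(cleaned.split())
```

Stop words are not handled here; they are removed in a separate step with NLTK.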

In [None]:
comments['comment'].head()

In [None]:
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", ""))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", ""))
comments['comment'] = comments['comment'].apply(lambda x: x.lower())
comments['comment'] = comments['comment'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
comments['comment'] = comments['comment'].apply(lambda x: re.sub(r'\d+', '', x))

In [None]:
comments['comment'].head(20)

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))

In [None]:
def remove_stop_words(comment):
    # Tokenize the comment and keep only the words that are not stop words.
    word_tokens = word_tokenize(comment)
    filtered_words = [w for w in word_tokens if w.lower() not in stop_words]
    return " ".join(filtered_words)

In [None]:
remove_stop_words("This is a test comment")

In [None]:
%%time
# This takes about 30 seconds
comments['comment'] = comments['comment'].apply(lambda x: remove_stop_words(x))

In [None]:
comments['comment'].head(20)

We only need the comment and attack columns.

In [None]:
df = pd.concat([comments["comment"],comments["attack"]], axis=1)

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df[df["attack"] == True].shape

## Splitting the data into a training set and a test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df['comment'], 
    df['attack'], 
    test_size = 0.2,
    random_state = 1278)

As a sanity check, print out the shapes of the dataframes.

In [None]:
print("Training features and labels")
print("X_train shape: ",X_train.shape)
print("y_train shape: ",y_train.shape)
print()
print("Testing features and labels")      
print("X_test shape: ",X_test.shape)
print("y_test shape: ",y_test.shape)

Vectorize the features.

The features need to be expressed as vectors. We shall use CountVectorizer, which builds a vocabulary from the training comments and creates a vector for each document. With `binary = True` the vector records whether each word is present in the document rather than how often it occurs. To avoid very long vectors we shall use only the 5000 most frequent words as features. This is after removing stop words, as they are very frequent but carry little information about the document. We also will not use rare words, those that appear in fewer than 3 documents (`min_df = 3`). Further, we will not use words that are too common, those present in more than 90% of the documents (`max_df = 0.9`).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
vectorizer = CountVectorizer(binary = True, 
                             stop_words = stopwords.words('english'), 
                             lowercase = True, 
                             min_df = 3, 
                             max_df = 0.9, 
                             max_features = 5000)
X_train_vectorized = vectorizer.fit_transform(X_train)

The array produced by CountVectorizer is a sparse array.

In [None]:
print(X_train_vectorized.shape)
# Convert only one row to a dense array; converting the whole matrix uses far more memory.
print(X_train_vectorized[5].toarray())

Each one of the 5000 words being used to characterize the comments has an index. If the word is present in the document the value at that index will be a 1, otherwise it will be a 0. There are relatively few 1's so a sparse matrix is an efficient way to store the array.
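
As a rough illustration of why sparse storage helps, the sketch below builds a small, mostly-zero matrix and counts how many values a sparse format actually stores. The 4 x 10 matrix is invented; it assumes SciPy is available, which scikit-learn already requires.

```python
import numpy as np
from scipy import sparse

# Hypothetical 4 x 10 presence matrix with only three nonzero entries.
dense = np.zeros((4, 10), dtype=np.int64)
dense[0, 2] = 1
dense[1, 7] = 1
dense[3, 2] = 1

# CSR format stores only the nonzero values and their positions.
compact = sparse.csr_matrix(dense)

print("stored values:", compact.nnz, "of", dense.size)
print("density:", compact.nnz / dense.size)
```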

In [None]:
# These are the first 20 mappings of the form word => index.
print(list(vectorizer.vocabulary_.items())[:20])

Some of the words used:

In [None]:
vectorizer.get_feature_names_out()[1200:1250]
# Be warned as you explore that some of the words will be unpleasant

## Defining the model (neural network)

In [None]:
from keras.models import Sequential
from keras.layers import Dense

# Sequential is a container for the other components.
# You add layers, in order, to an instance of Sequential.

nn = Sequential()

# The 5000 features plus a bias, which makes 5001 items, will be fed to a dense layer with 500 nodes.
# This layer calculates 5001 * 500 = 2,500,500 weights.

nn.add(Dense(units = 500, activation = 'relu', input_dim = len(vectorizer.get_feature_names_out())))

# You get an output from each node, 500 outputs in all.
# The 500 outputs of the hidden layer plus a bias will go to the one node of the 
# output layer. This makes 501 weights to calculate in this layer.

nn.add(Dense(units=1, activation='sigmoid'))
  
# Binary cross entropy is a popular loss function for binary type (yes/no) situations.
nn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
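
To give a feel for what the loss measures, here is a plain-Python sketch of binary cross-entropy. It is not the Keras implementation, and the function name and sample values are made up for illustration.

```python
import math

def binary_cross_entropy(y_true, p_pred):
    """Mean binary cross-entropy for labels in {0, 1} and probabilities in (0, 1)."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(y_true, p_pred)
    ]
    return sum(losses) / len(losses)

# Confident and correct predictions give a small loss ...
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
# ... while confident and wrong predictions are penalized heavily.
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```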

In [None]:
nn.summary()

This very small network has about 2.5 million parameters to calculate.
Make sure there are no other kernels running, otherwise your kernel is likely to crash for lack of resources.
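
The parameter count can be checked by hand. The sketch below assumes the vocabulary reaches the full 5000 words allowed by `max_features`; the variable names are illustrative.

```python
n_features = 5000   # vocabulary size, assuming max_features is reached
n_hidden = 500      # units in the hidden layer
n_output = 1        # single sigmoid output unit

# Each unit has one weight per input plus one bias term.
hidden_params = (n_features + 1) * n_hidden
output_params = (n_hidden + 1) * n_output
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)
```

The total should agree with the figure reported by `nn.summary()` when the input has 5000 features.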

## Training the model

This will take about 3 to 7 minutes, depending on the resources available.

In [None]:
%%time
# Takes 3 to 7 min depending on resources available.
# The last 2000 rows of the training data are used for validation.
nn.fit(X_train_vectorized[:-2000].toarray(), y_train[:-2000], 
          epochs = 4, batch_size = 128, verbose = 1, 
          validation_data = (X_train_vectorized[-2000:].toarray(), y_train[-2000:]))

## Evaluating the model performance

We prepare vectors for the test data set and use the `evaluate()` method to see how good the model is with unseen data.
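
For a sigmoid output, accuracy is usually obtained by thresholding each predicted probability at 0.5 and comparing the result with the true label. The sketch below shows that calculation on invented numbers; it illustrates the idea and is not the Keras code.

```python
# Hypothetical sigmoid outputs and true labels for five comments.
probabilities = [0.92, 0.08, 0.55, 0.30, 0.61]
true_labels = [1, 0, 0, 0, 1]

# Threshold at 0.5 to turn each probability into a yes/no prediction.
predicted = [1 if p > 0.5 else 0 for p in probabilities]

# Accuracy is the fraction of predictions that match the labels.
correct = sum(int(p == t) for p, t in zip(predicted, true_labels))
accuracy = correct / len(true_labels)
print(predicted, accuracy)
```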

In [None]:
scores = nn.evaluate(vectorizer.transform(X_test).toarray(), y_test, verbose = 1)
# scores holds the loss in position 0 and the accuracy in position 1.
print("Accuracy:", scores[1])

We try out the model with our own comment. We need to pre-process the new comment like we did the training comments. This is best put in a function or pipeline.

In [None]:
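def prepareData(comment):
    # String methods return new strings, so each result must be assigned back.
    comment = comment.replace("NEWLINE_TOKEN", "")
    comment = comment.replace("TAB_TOKEN", "")
    comment = comment.lower()
    comment = re.sub(r'[^\w\s]', '', comment)
    comment = re.sub(r'\d+', '', comment)
    return comment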

In [None]:
def comment_analysis(raw_comment):
    prepared_comment = [prepareData(raw_comment)]
    # Convert to a dense array, as was done for the training and test data.
    vectorized_comment = vectorizer.transform(prepared_comment).toarray()
    print("Input: ", raw_comment)
    print("Probability that it is a personal attack:", nn.predict(vectorized_comment)[0][0])

In [None]:
comment_analysis("This is a terrible article. Whoever wrote it is a total fool")

In [None]:
comment_analysis("This is the best article on this topic. Thank you for writing it")

*Try some comments of your own*

In [None]:
end = time.perf_counter()
print("Time taken (min):", (end - start)/60)