# **Spooky Author Identification - Sequence classification with NLP**

# **Description of the task:**

As I scurried across the candlelit chamber, manuscripts in hand, I thought I'd made it. Nothing would be able to hurt me anymore. Little did I know there was one last fright lurking around the corner.

DING! My phone pinged me with a disturbing notification. It was Will, the scariest of Kaggle moderators, sharing news of another data leak.

"ph’nglui mglw’nafh Cthulhu R’lyeh wgah’nagl fhtagn!" I cried as I clumsily dropped my crate of unbound, spooky books. Pages scattered across the chamber floor. How will I ever figure out how to put them back together according to the authors who wrote them? Or are they lost, forevermore? Wait, I thought... I know, machine learning!

In this year's Halloween playground competition, you're challenged to predict the author of excerpts from horror stories by Edgar Allan Poe, Mary Shelley, and HP Lovecraft.

---

The notebook uses pre-trained GloVe word vectors, developed and published by researchers at the Stanford NLP Group.

This example will teach you how to:


1.   Preprocess data for sequence classification
2.   Create a word embeddings matrix from a pre-trained GloVe model
3.   Train a single layer LSTM model and a Bidirectional LSTM model

After the example, you will be asked to **complete exercises** based on the notebook.


For further reading:

* StanfordNLP: https://stanfordnlp.github.io/
* GloVe: https://nlp.stanford.edu/pubs/glove.pdf
* LSTM and GRU: https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21


# **About the authors**

   **EAP - [Edgar Allan Poe](https://en.wikipedia.org/wiki/Edgar_Allan_Poe)** : American writer of poetry and short stories that revolve around mystery, the grisly, and the grim. Arguably his most famous work is the poem "The Raven", and he is also widely considered a pioneer of detective fiction.

   **HPL - [H.P. Lovecraft](https://en.wikipedia.org/wiki/H._P._Lovecraft)** : Best known for his works of horror fiction. The stories he is most celebrated for revolve around the fictional mythology of the infamous creature "Cthulhu", a chimera with an octopus-like head, a humanoid body, and wings on its back.

   **MWS - [Mary Shelley](https://en.wikipedia.org/wiki/Mary_Shelley)** : Pursued a whole panoply of literary careers: novelist, dramatist, travel writer, biographer. She is most celebrated for the classic novel *Frankenstein; or, The Modern Prometheus*, in which the scientist Victor Frankenstein creates the Monster that has come to be associated with his name.


# **Import data**

To download the dataset for this exercise, you need a Kaggle account (https://www.kaggle.com/). You can either register a new account or sign in with a Google account.

After you have done that you can download the dataset via this link:
https://www.kaggle.com/c/spooky-author-identification/download/train.zip

Then upload this file to Google Colab: open the **Table of contents** tab on the left side, choose **Files** on the same line, and press the **Upload** button.

After you have done this, running the following cell unzips the training dataset and downloads the pre-trained GloVe embedding created by Stanford.

If you don't see the newly created files, please press **Refresh** under **Files**.

In [1]:
import requests, zipfile, io

print("Unzipping the training dataset...")

z = zipfile.ZipFile("/content/train.zip")
z.extractall()

print("Downloading pre-trained GloVe embedding...")

url = "http://nlp.stanford.edu/data/glove.6B.zip"
r = requests.get(url)

print("Unzipping pre-trained GloVe embedding...")

z = zipfile.ZipFile(io.BytesIO(r.content))
z.extract("glove.6B.50d.txt")

Unzipping the training dataset...
Downloading pre-trained GloVe embedding...
Unzipping pre-trained GloVe embedding...


'/content/glove.6B.50d.txt'

# **A look at the data**

## **First let's see the most commonly used words of the authors:**

![alt text](https://raw.githubusercontent.com/barnabaskocsis/spookyauthornlp/master/eap_raven_wordcloud.png)

![alt text](https://raw.githubusercontent.com/barnabaskocsis/spookyauthornlp/master/hpl_cthulhu_wordcloud.png)

![alt text](https://raw.githubusercontent.com/barnabaskocsis/spookyauthornlp/master/mws_frankenstein_wordcloud.png)

## **Read data**

We use pandas to read the provided .csv file, which contains the text data and the corresponding labels.

In [2]:
import pandas as pd

train = pd.read_csv("/content/train.csv")

In [3]:
train.head()

|   | id | text | author |
|---|---|---|---|
| 0 | id26305 | This process, however, afforded me no means of... | EAP |
| 1 | id17569 | It never once occurred to me that the fumbling... | HPL |
| 2 | id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| 3 | id27763 | How lovely is spring As we looked from Windsor... | MWS |
| 4 | id12958 | Finding nothing else, not even gold, the Super... | HPL |


# Theory behind GloVe


GloVe (Global Vectors for Word Representation) is a tool released by Stanford NLP Group researchers Jeffrey Pennington, Richard Socher, and Chris Manning for learning continuous-space vector representations of words.

The GloVe model learns word vectors by examining word co-occurrences within a text corpus. Before the actual model is trained, a co-occurrence matrix needs to be constructed, in which each cell represents how often one word appears in the context of another.

We run through our corpus just once to build the matrix and from then on use this co-occurrence data in place of the actual corpus.
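
The counting step can be sketched in a few lines of plain Python. This is a simplified illustration, not the GloVe implementation: the function name `build_cooccurrence`, the toy corpus, and the window size are assumptions made for the example, and every co-occurrence counts as 1 regardless of distance (the original GloVe paper gives closer words a larger weight).

```python
from collections import defaultdict

def build_cooccurrence(sentences, window=2):
    """Count how often each word appears within `window` tokens of another."""
    counts = defaultdict(float)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Look at the neighbours to the right and mirror each count,
            # so that the resulting matrix is symmetric.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                counts[(word, tokens[j])] += 1.0
                counts[(tokens[j], word)] += 1.0
    return dict(counts)

# Toy corpus (made up for illustration)
toy_corpus = ["the raven sat", "the raven spoke"]
toy_X = build_cooccurrence(toy_corpus, window=2)

print(toy_X[("the", "raven")])  # 2.0
print(toy_X[("the", "sat")])    # 1.0
```

The matrix is stored sparsely as a dictionary keyed by word pairs, since most pairs never co-occur.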

Let the co-occurrence matrix be $X$.

For each pair of words $i$ and $j$, the model aims to satisfy:

$$w_i^Tw_j+b_i+b_j = \log X_{ij}$$

where $w_i$ and $w_j$ are the word vectors and $b_i$ and $b_j$ are scalar bias terms.

We want to build word vectors that retain some useful information about how every pair of words $i$ and $j$ co-occur. We'll do this by minimizing an objective function $J$, which sums the squared errors of the above equation over all word pairs, each weighted by a function $f$:

$$J = \displaystyle\sum_{i=1}^V \sum_{j=1}^V f(X_{ij}) \left(w_i^Tw_j+b_i+b_j - \log X_{ij}\right)^2$$

where $V$ is the size of the vocabulary and we choose an $f$ that prevents common words from skewing our objective too much.
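
As a rough sketch, the objective can be evaluated directly: for every observed pair, take the squared difference between $w_i^Tw_j+b_i+b_j$ and $\log X_{ij}$ and weight it by $f(X_{ij})$. The names `glove_loss` and `capped_linear`, the toy vectors, and the cap value of 100 are assumptions made for the example. The weighting function shown is only a placeholder, not a recommendation.

```python
import math

def glove_loss(X, w, b, f):
    """Weighted squared error summed over all observed co-occurrence pairs."""
    total = 0.0
    for (i, j), x_ij in X.items():
        if x_ij <= 0:
            continue  # log(0) is undefined, so unobserved pairs are skipped
        dot = sum(a * c for a, c in zip(w[i], w[j]))
        error = dot + b[i] + b[j] - math.log(x_ij)
        total += f(x_ij) * error ** 2
    return total

def capped_linear(x, x_max=100.0):
    """Placeholder weighting: grows linearly, then saturates at 1."""
    return min(x / x_max, 1.0)

# Toy data: two words that co-occur 10 times (made up for illustration)
toy_X = {(0, 1): 10.0, (1, 0): 10.0}
toy_w = {0: [1.0, 0.0], 1: [math.log(10.0), 0.5]}
toy_b = {0: 0.0, 1: 0.0}

# The vectors were chosen so that w_0 . w_1 equals log(10): the loss is zero.
print(glove_loss(toy_X, toy_w, toy_b, capped_linear))
```

Shifting one bias away from this perfect fit makes the loss positive, and the size of the penalty is scaled by $f(X_{ij})$.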

![alt text](https://www.researchgate.net/profile/Le-Lu-9/publication/303376372/figure/fig6/AS:668376489816091@1536364781736/Example-words-embedded-in-the-vector-space-using-word-to-vector-modeling.png)

# **Text Preprocessing and embedding matrix initialization**

Using LabelEncoder from sklearn we transform our text labels to numbers for the neural network and split our data into train and validation sets.

In [4]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

le = preprocessing.LabelEncoder()
labels = le.fit_transform(train.author.values)  # EAP -> 0, HPL -> 1, MWS -> 2
data = train.text.values

x_train, x_val, y_train, y_val = train_test_split(data, labels, test_size=0.1, shuffle=True, stratify=labels)
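
To make the label mapping explicit, here is a minimal plain-Python sketch of the idea behind label encoding: the distinct class names are sorted and each one is replaced by its position. The toy author list mirrors the first five rows shown above, and the variable names are assumptions for the example.

```python
# Authors of the first five rows of the training data
toy_authors = ["EAP", "HPL", "EAP", "MWS", "HPL"]

# Sort the distinct class names and number them from 0
toy_classes = sorted(set(toy_authors))
label_of = {name: idx for idx, name in enumerate(toy_classes)}
toy_encoded = [label_of[a] for a in toy_authors]

print(toy_classes)  # ['EAP', 'HPL', 'MWS']
print(toy_encoded)  # [0, 1, 0, 2, 1]
```

In the notebook itself, `le.classes_` holds the class names in this order, and `le.inverse_transform` maps integers back to author names.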

In [34]:
print(x_train.shape)
print(y_train.shape)
print(x_val.shape)
print(y_val.shape)
print("\n")
print("Sample train: " + x_train[1])
print("Sample validation: " + x_val[1])

(17621,)
(17621,)
(1958,)
(1958,)


Sample train: they wanted to mix like they done with the Kanakys, an' he fer one didn't feel baound to stop 'em.
Sample validation: Most philosophers, upon many points of philosophy, are still very unphilosophical.


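We create an embeddings index dictionary from the pre-trained GloVe file `glove.6B.50d.txt`: the keys are words and the values are their 50-dimensional vectors. The vectors were trained on a corpus of 6 billion tokens and cover a vocabulary of 400,000 words. These are the only words for which our model will have a pre-trained vector.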

In [40]:
import numpy as np

embeddings_index = {}

with open("/content/glove.6B.50d.txt", 'r', encoding='utf-8') as file:
  for line in file:
    # Each line holds a word followed by its 50 vector components
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
    
print('Found %s word vectors.' % len(embeddings_index))
print(list(embeddings_index.items())[0])

Found 400000 word vectors.
('the', array([ 4.1800e-01,  2.4968e-01, -4.1242e-01,  1.2170e-01,  3.4527e-01,
       -4.4457e-02, -4.9688e-01, -1.7862e-01, -6.6023e-04, -6.5660e-01,
        2.7843e-01, -1.4767e-01, -5.5677e-01,  1.4658e-01, -9.5095e-03,
        1.1658e-02,  1.0204e-01, -1.2792e-01, -8.4430e-01, -1.2181e-01,
       -1.6801e-02, -3.3279e-01, -1.5520e-01, -2.3131e-01, -1.9181e-01,
       -1.8823e+00, -7.6746e-01,  9.9051e-02, -4.2125e-01, -1.9526e-01,
        4.0071e+00, -1.8594e-01, -5.2287e-01, -3.1681e-01,  5.9213e-04,
        7.4449e-03,  1.7778e-01, -1.5897e-01,  1.2041e-02, -5.4223e-02,
       -2.9871e-01, -1.5749e-01, -3.4758e-01, -4.5637e-02, -4.4251e-01,
        1.8785e-01,  2.7849e-03, -1.8411e-01, -1.1514e-01, -7.8581e-01],
      dtype=float32))


Now we have to create an embedding matrix from our embeddings index. It will later serve as the weights of the Keras Embedding layer.

The Keras preprocessing library contains the Tokenizer we are going to use. It vectorizes our text by building a dictionary of all the words in our data and turning each text into a sequence of integers, where every integer represents a word in the dictionary.

After this we need to pad our sequences for the Recurrent Neural Network. We set the target sentence length to 50: longer sequences are truncated and shorter sequences are padded with zeros. By default, `pad_sequences` pads and truncates at the start of each sequence, as the output below shows.
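
The two steps can be sketched without Keras to show what happens to a sentence. This is a simplified re-implementation for illustration only: the helper names (`simple_tokens`, `fit_word_index`, `to_sequence`, `pad_pre`) and the toy texts are assumptions, and the real `Tokenizer` has more options than shown here.

```python
import string
from collections import Counter

# Punctuation to strip; the apostrophe is kept so "didn't" stays one token
PUNCTUATION = string.punctuation.replace("'", "")

def simple_tokens(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    table = str.maketrans({c: " " for c in PUNCTUATION})
    return text.lower().translate(table).split()

def fit_word_index(texts):
    """Map each word to a rank (1 = most frequent); 0 is left for padding."""
    counts = Counter(tok for t in texts for tok in simple_tokens(t))
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, index):
    """Replace each known word by its index; unknown words are dropped."""
    return [index[tok] for tok in simple_tokens(text) if tok in index]

def pad_pre(seq, maxlen):
    """Keep the last `maxlen` items and left-pad with zeros."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

# Toy texts (made up for illustration)
toy_texts = ["The raven, the bust.", "The chamber door"]
toy_index = fit_word_index(toy_texts)
toy_seq = to_sequence(toy_texts[0], toy_index)

print(toy_seq)              # [1, 2, 1, 3]
print(pad_pre(toy_seq, 6))  # [0, 0, 1, 2, 1, 3]
print(pad_pre(toy_seq, 3))  # [2, 1, 3]
```

Index 0 never appears in the word index, which is why it can safely serve as the padding value.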

In [33]:
from keras.preprocessing import sequence, text

tokenizer = text.Tokenizer(num_words=None)
max_sent_len = 50

tokenizer.fit_on_texts(list(x_train) + list(x_val))

x_train_seq = tokenizer.texts_to_sequences(x_train)
x_val_seq = tokenizer.texts_to_sequences(x_val)

x_train_pad = sequence.pad_sequences(x_train_seq, maxlen=max_sent_len)
x_val_pad = sequence.pad_sequences(x_val_seq, maxlen=max_sent_len)

word_index = tokenizer.word_index

print("Train sample data:" )
print(x_train[1])
print(x_train_seq[1])
print("\n")
print(x_train_pad[1])
print("\n")
print("Validation sample data:" )
print(x_val[1])
print(x_val_seq[1])
print("\n")
print(x_val_pad[1])
print("\n")
print(list(word_index.items())[:10])

Train sample data:
they wanted to mix like they done with the Kanakys, an' he fer one didn't feel baound to stop 'em.
[43, 2353, 4, 10108, 82, 43, 339, 14, 1, 5390, 252, 13, 2195, 38, 2112, 302, 16051, 4, 2032, 1299]


[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0    43  2353     4 10108    82    43
   339    14     1  5390   252    13  2195    38  2112   302 16051     4
  2032  1299]


Validation sample data:
Most philosophers, upon many points of philosophy, are still very unphilosophical.
[86, 4321, 44, 113, 1123, 2, 1395, 56, 104, 60, 12261]


[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0    86  4321    44   113  1123     2  1395    56   104
    60 12261]


[

The labels have to be one-hot encoded.

In [42]:
from keras.utils import np_utils

y_train_cat = np_utils.to_categorical(y_train)
y_val_cat = np_utils.to_categorical(y_val)

print("Authors:")
print(y_train_cat)

Authors:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 ...
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]
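
One-hot encoding itself is simple enough to write by hand. The sketch below shows the transformation for the three author labels; the helper `one_hot` is a hypothetical stand-in for illustration, not the Keras function.

```python
def one_hot(label, num_classes=3):
    """Return a vector of zeros with a single 1.0 at position `label`."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

toy_labels = [0, 1, 2]  # EAP, HPL, MWS
toy_one_hot = [one_hot(y) for y in toy_labels]
print(toy_one_hot)  # [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

Each target vector has the same length as the three-unit softmax output of the models defined later, so the `categorical_crossentropy` loss can compare them position by position.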


Here we create the embedding matrix as a NumPy array. Its dimensions are the number of words in our dictionary plus one (index 0 is reserved for padding) and the length of our word embedding vectors.

In [50]:
embedding_matrix = np.zeros((len(word_index) + 1, 50))

for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    # Words not found in the embeddings index keep their all-zero row
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

print("The word 'the': \n")
print("Embedding index: \n")
print(embeddings_index.get("the"))
print("\nEmbedding matrix: \n")
print(embedding_matrix[1])

The word 'the': 

Embedding index: 

[ 4.1800e-01  2.4968e-01 -4.1242e-01  1.2170e-01  3.4527e-01 -4.4457e-02
 -4.9688e-01 -1.7862e-01 -6.6023e-04 -6.5660e-01  2.7843e-01 -1.4767e-01
 -5.5677e-01  1.4658e-01 -9.5095e-03  1.1658e-02  1.0204e-01 -1.2792e-01
 -8.4430e-01 -1.2181e-01 -1.6801e-02 -3.3279e-01 -1.5520e-01 -2.3131e-01
 -1.9181e-01 -1.8823e+00 -7.6746e-01  9.9051e-02 -4.2125e-01 -1.9526e-01
  4.0071e+00 -1.8594e-01 -5.2287e-01 -3.1681e-01  5.9213e-04  7.4449e-03
  1.7778e-01 -1.5897e-01  1.2041e-02 -5.4223e-02 -2.9871e-01 -1.5749e-01
 -3.4758e-01 -4.5637e-02 -4.4251e-01  1.8785e-01  2.7849e-03 -1.8411e-01
 -1.1514e-01 -7.8581e-01]

Embedding matrix: 

[ 4.18000013e-01  2.49679998e-01 -4.12420005e-01  1.21699996e-01
  3.45270008e-01 -4.44569997e-02 -4.96879995e-01 -1.78619996e-01
 -6.60229998e-04 -6.56599998e-01  2.78430015e-01 -1.47670001e-01
 -5.56770027e-01  1.46579996e-01 -9.50950012e-03  1.16579998e-02
  1.02040000e-01 -1.27920002e-01 -8.44299972e-01 -1.21809997e-01
 -1.680
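
Words that are missing from GloVe keep an all-zero row, so it can be useful to check how much of the vocabulary is actually covered. The following sketch shows one way to measure this. The function name `embedding_coverage` and the toy dictionaries are assumptions for the example, and the toy values say nothing about what GloVe really contains.

```python
def embedding_coverage(word_index, embeddings_index):
    """Return the share of vocabulary words with a vector, and the missing words."""
    missing = [w for w in word_index if w not in embeddings_index]
    covered = len(word_index) - len(missing)
    return covered / len(word_index), missing

# Toy dictionaries (made up for illustration)
toy_word_index = {"the": 1, "raven": 2, "baound": 3}
toy_embeddings = {"the": [0.418, 0.24968], "raven": [0.1, -0.2]}

toy_coverage, toy_missing = embedding_coverage(toy_word_index, toy_embeddings)
print(toy_missing)  # ['baound']
```

In the notebook, the same function could be called with `word_index` and `embeddings_index`. Dialect spellings and rare proper nouns are the most likely candidates for missing vectors.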

# **Single layer LSTM Embedding**

Our single layer model starts with an Embedding layer, which outputs a 3D tensor of shape (batch_size, sequence_length, output_dim); this is the input for the LSTM layer. We use the embedding matrix we created previously from the GloVe embedding as its weights, and set the trainable property of the Embedding layer to False so that training does not change them. The LSTM layer is followed by a Dense layer with a softmax activation, which outputs one probability per author.

In [10]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding, Bidirectional
from keras.callbacks import EarlyStopping

model = Sequential()
model.add(Embedding(input_dim = len(word_index) + 1,
                    output_dim = 50,
                    weights = [embedding_matrix],
                    input_length = max_sent_len,
                    trainable = False))
model.add(LSTM(128, dropout=0.3, recurrent_dropout=0.3))
model.add(Dense(3, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 50, 50)            1297200   
                                                                 
 lstm (LSTM)                 (None, 128)               91648     
                                                                 
 dense (Dense)               (None, 3)                 387       
                                                                 
Total params: 1,389,235
Trainable params: 92,035
Non-trainable params: 1,297,200
_________________________________________________________________
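
The parameter counts in the summary can be reproduced by hand, which is a useful check on how each layer is built. The helper functions below are written for this illustration only. The vocabulary size of 25,944 is not stated in the notebook; it is inferred from the summary as 1,297,200 / 50.

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four weight sets (input, forget and output gates plus the cell candidate),
    # each with an input kernel, a recurrent kernel and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

print(embedding_params(25944, 50))  # 1297200
print(lstm_params(50, 128))         # 91648
print(dense_params(128, 3))         # 387
```

Only the LSTM and Dense parameters are trainable (91,648 + 387 = 92,035), because the Embedding layer is frozen.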


In [11]:
earlystop = EarlyStopping(monitor='val_loss', patience=5, verbose=0, mode='auto')

model.fit(x_train_pad, y_train_cat, batch_size=256, epochs=100, 
          verbose=1, validation_data=(x_val_pad, y_val_cat), callbacks=[earlystop])

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100


<keras.callbacks.History at 0x7f7eb0023d50>
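
Training stopped well before the 100 requested epochs because of the `EarlyStopping` callback. Its patience logic can be sketched as follows. This is a simplified illustration that assumes the callback's default settings, and the function name and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch after which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss  # new best validation loss: reset the counter
            wait = 0
        else:
            wait += 1    # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)  # never triggered: all epochs are run

# Toy validation losses (made up): the best value is reached at epoch 3
toy_losses = [1.0, 0.9, 0.8, 0.85, 0.82, 0.81, 0.83, 0.84, 0.7]
print(early_stopping_epoch(toy_losses, patience=5))  # 8
```

Under this logic, a run that stops after epoch 37 with a patience of 5 last improved its validation loss at epoch 32. Keras also offers a `restore_best_weights` argument on `EarlyStopping`; it is not used in this notebook.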

In [12]:
# x_val_pad is the held-out validation split, not a separate test set
score = model.evaluate(x_val_pad, y_val_cat, verbose=0)
print('Validation loss:', score[0])
print('Validation accuracy:', score[1])

Validation loss: 0.7060660719871521
Validation accuracy: 0.7114402651786804


In [14]:
preds = model.predict(x_val_pad)
print(x_val_pad)
print(preds[0])

[[    0     0     0 ...  1700    25  9854]
 [    0     0     0 ...    35 24931  3080]
 [    0     0     0 ...     2    55  4243]
 ...
 [    0     0     0 ...    19     5  4008]
 [    0     0     0 ...  1320     3  5364]
 [    0     0     0 ...     2    29   221]]
[0.94635147 0.0451295  0.00851901]
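
Each prediction is a vector of three probabilities, one per author. To turn it into an author name, take the position of the largest value and look it up in the class order from the label encoding step (EAP 0, HPL 1, MWS 2). The sketch below does this for the probabilities printed above; the variable names are assumptions for the example.

```python
toy_class_names = ["EAP", "HPL", "MWS"]
toy_probs = [0.94635147, 0.0451295, 0.00851901]  # preds[0] from the output above

# Index of the largest probability
best = max(range(len(toy_probs)), key=toy_probs.__getitem__)
print(toy_class_names[best])  # EAP
```

In the notebook, the same result for all validation samples could be obtained with something like `le.inverse_transform(preds.argmax(axis=1))`.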


# **Bidirectional LSTM Embedding**

In [None]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding, Bidirectional
from keras.callbacks import EarlyStopping

model = Sequential()
model.add(Embedding(input_dim = len(word_index) + 1,
                    output_dim = 50,
                    weights = [embedding_matrix],
                    input_length = max_sent_len,
                    trainable = False))
model.add(Bidirectional(LSTM(128, dropout=0.3, recurrent_dropout=0.3)))
model.add(Dense(3, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 50, 50)            1297200   
_________________________________________________________________
bidirectional_2 (Bidirection (None, 256)               183296    
_________________________________________________________________
dense_2 (Dense)              (None, 3)                 771       
Total params: 1,481,267
Trainable params: 184,067
Non-trainable params: 1,297,200
_________________________________________________________________
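
The `Bidirectional` wrapper runs two independent recurrent layers, one reading the sequence from left to right and one from right to left, and by default concatenates their final states. This is why the summary reports 256 outputs and 2 × 91,648 = 183,296 parameters. The idea can be sketched with a toy recurrent cell. Note that this uses a plain tanh cell with scalar weights rather than an LSTM, and all names and values are made up for the illustration.

```python
import math

def run_rnn(xs, w, u):
    """Toy recurrent cell: h_t = tanh(w * x_t + u * h_{t-1}); returns the final state."""
    h = 0.0
    for x in xs:
        h = math.tanh(w * x + u * h)
    return h

def bidirectional(xs, fwd, bwd):
    """Run one cell forwards and another backwards, then concatenate the results."""
    h_forward = run_rnn(xs, *fwd)
    h_backward = run_rnn(list(reversed(xs)), *bwd)
    return [h_forward, h_backward]

# Each direction has its own weights (w, u)
toy_out = bidirectional([1.0, 0.0, -1.0], fwd=(0.5, 0.8), bwd=(0.3, 0.6))
print(len(toy_out))  # 2
```

The forward state has seen the end of the sentence last, while the backward state has seen the beginning last, so the concatenated vector summarizes the sequence from both ends.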


In [None]:
earlystop = EarlyStopping(monitor='val_loss', patience=5, verbose=0, mode='auto')

model.fit(x_train_pad, y_train_cat, batch_size=256, epochs=100, 
          verbose=1, validation_data=(x_val_pad, y_val_cat), callbacks=[earlystop])

Train on 17621 samples, validate on 1958 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100


<keras.callbacks.History at 0x7f0797514ba8>

In [None]:
# x_val_pad is the held-out validation split, not a separate test set
score = model.evaluate(x_val_pad, y_val_cat, verbose=0)
print('Validation loss:', score[0])
print('Validation accuracy:', score[1])

Validation loss: 0.6221312910348816
Validation accuracy: 0.7502553626757964


In [None]:
preds = model.predict(x_val_pad)
print(preds[0])

[0.1559595  0.8405729  0.00346767]


# Exercises


Swap out the LSTM layer with only densely connected layers. 

*   How did this affect the accuracy of the model?
*   If the performance was worse, why?



Modify the LSTM model to have 2 hidden RNN layers. 

*   Did the accuracy of the model improve?
*   If not, why?



Swap out the LSTM layer with GRU.

*   Did the accuracy of the model improve?



**Bonus exercise**:

Think about a good weighting function $f$ for the GloVe objective $J$.