# Section 2. Model Classification for different authors' books

## This section of the project uses an LSTM neural network to classify authors from the text of their books.

### First, import the necessary packages.

In [27]:
import pandas as pd 
import json
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Flatten, LSTM, Conv1D, MaxPooling1D, Dropout, Activation
from keras.layers.embeddings import Embedding
import nltk
import string
import numpy as np
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
import re
from nltk.stem.snowball import SnowballStemmer 
from keras.constraints import max_norm
import tensorflow as tf

# Data Collection

### The three authors are Jonathan Swift, Jane Austen, and Mary Shelley. Three popular books were collected for each author, so the dataset contains 9 books in total. To simplify model fitting, each input unit is a single paragraph of a book.

### Below are the functions that generate a dataframe for each author.

In [3]:
# import text file and create dataframe
def Swift(textfile):
    # textfile has been edited so that it contains only body of the text
    with open(textfile) as f:
        lines = f.read()
    book = lines.split("\n\n") #split by paragraph
    text = pd.Series(book, index = range(len(book)))
    author = pd.Series(['Jonathan_Swift'] * len(book), index = range(len(book)))
    df = pd.DataFrame({'author':author,
                    'text':text})
    return df

In [4]:
def Austen(textfile):
    # textfile has been edited so that it contains only body of the text
    with open(textfile) as f:
        lines = f.read()
    book = lines.split("\n\n") #split by paragraph
    text = pd.Series(book, index = range(len(book)))
    author = pd.Series(['Jane_Austen'] * len(book), index = range(len(book)))
    df = pd.DataFrame({'author':author,
                    'text':text})
    return df

In [5]:
def Shelley(textfile):
    # textfile has been edited so that it contains only body of the text
    with open(textfile) as f:
        lines = f.read()
    book = lines.split("\n\n") #split by paragraph
    text = pd.Series(book, index = range(len(book)))
    author = pd.Series(['Mary_Shelley'] * len(book), index = range(len(book)))
    df = pd.DataFrame({'author':author,
                    'text':text})
    return df
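The three functions above differ only in the author label, so they could be folded into one helper. The following is a minimal sketch of that idea, not the code used to produce the results in this notebook. The names `paragraphs_with_author` and `load_author` and the `utf-8` encoding are assumptions made for illustration, and the sketch returns plain records instead of a dataframe so that it runs without extra packages.

```python
def paragraphs_with_author(raw_text, author):
    """Split raw book text on blank lines and label each paragraph."""
    paragraphs = raw_text.split("\n\n")  # same paragraph rule as above
    return [{"author": author, "text": p} for p in paragraphs]


def load_author(textfile, author):
    """Read one book file (encoding is an assumption) and label its paragraphs."""
    with open(textfile, encoding="utf-8") as f:
        return paragraphs_with_author(f.read(), author)


rows = paragraphs_with_author("First paragraph.\n\nSecond paragraph.", "Jane_Austen")
print(rows)
```

Passing such a list of records to `pd.DataFrame` would give the same 'author' and 'text' columns as the functions above.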

In [30]:
swift_1 = Swift('Swift_1.txt')
swift_2 = Swift('Swift_2.txt')
swift_3 = Swift('Swift_3.txt')

In [31]:
austen_1 = Austen('Austen_1.txt')
austen_2 = Austen('Austen_2.txt')
austen_3 = Austen('Austen_3.txt')

In [32]:
shelley_1 = Shelley('Shelley_1.txt')
shelley_2 = Shelley('Shelley_2.txt')
shelley_3 = Shelley('Shelley_3.txt')

### Then, concatenate the data for all authors into one large dataframe. It has 10818 rows, one per paragraph. The first few rows are shown below.

In [33]:
df = pd.concat([swift_1,swift_2,swift_3,austen_1,austen_2,austen_3,shelley_1,shelley_2,shelley_3],ignore_index=True)

In [34]:
df.shape

(10818, 2)

In [35]:
df.head()

|   | author | text |
|---|--------|------|
| 0 | Jonathan_Swift | It is a melancholy object to those, who walk t... |
| 1 | Jonathan_Swift | I think it is agreed by all parties, that this... |
| 2 | Jonathan_Swift | But my intention is very far from being confin... |
| 3 | Jonathan_Swift | As to my own part, having turned my thoughts f... |
| 4 | Jonathan_Swift | There is likewise another great advantage in m... |


## Data Cleaning

### After the text has been stored, it needs to be cleaned. The function below replaces punctuation and symbols and removes stopwords. The text is converted to lower case, and digits are removed afterwards.

In [12]:
df = df.reset_index(drop=True)
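# Characters that are replaced by a space
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
# Anything other than digits, lower-case letters, space, '#', '+' and '_' is deleted
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')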
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    text = BAD_SYMBOLS_RE.sub('', text)
    # Note: this removes every letter 'x', including inside ordinary words
    text = text.replace('x', '')
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)
    return text


In [13]:
df['text'] = df['text'].apply(clean_text)
df['text'] = df['text'].str.replace(r'\d+', '', regex=True)

### The 'text' column below is now the cleaned version, which can be compared with the original text above. Only the non-stopword content remains. Next, this text is tokenized and passed through a word embedding for model fitting.

In [14]:
df.head()

|   | author | text |
|---|--------|------|
| 0 | Jonathan_Swift | melancholy object walk great town travel count... |
| 1 | Jonathan_Swift | think agreed parties prodigious number ofchild... |
| 2 | Jonathan_Swift | intention far confined provide thechildren pro... |
| 3 | Jonathan_Swift | part turned thoughts many years upon thisimpor... |
| 4 | Jonathan_Swift | likewise another great advantage scheme willpr... |
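The cleaned sample above contains merged tokens such as "thechildren" and "ofchild...". This is what happens when a line break inside a paragraph is deleted by BAD_SYMBOLS_RE without a space being inserted. The sketch below shows one possible variant that collapses whitespace first. It is an illustration only: the function name `clean_text_keep_boundaries` and the tiny stopword set are hypothetical, and the numbers reported in this notebook were produced with the original function, not with this one.

```python
import re

# Hypothetical, tiny stopword list so the sketch runs without NLTK data;
# the notebook itself uses stopwords.words('english').
STOPWORDS = {"the", "of", "a", "is", "it", "to", "in", "and"}

WHITESPACE_RE = re.compile(r"\s+")
REPLACE_BY_SPACE_RE = re.compile(r"[/(){}\[\]|@,;]")
BAD_SYMBOLS_RE = re.compile(r"[^a-z #+_]")  # digits are dropped here as well


def clean_text_keep_boundaries(text):
    """Clean a paragraph while keeping word boundaries at line breaks."""
    text = text.lower()
    text = WHITESPACE_RE.sub(" ", text)        # newlines and tabs become spaces
    text = REPLACE_BY_SPACE_RE.sub(" ", text)
    text = BAD_SYMBOLS_RE.sub("", text)
    return " ".join(w for w in text.split() if w not in STOPWORDS)


sample = "The number of\nchildren is 12, in the next\ntown."
cleaned = clean_text_keep_boundaries(sample)
print(cleaned)
```

Unlike the original function, this variant does not strip the letter 'x', so a word like "next" survives intact.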


### The vocabulary is capped at 5000 words, keeping the most frequent words in the data. The maximum number of words per paragraph is set to 500, since a single paragraph is unlikely to be longer than that. The embedding dimension is set to 100.

### The 'Tokenizer' class splits the text into words and maps each word to an integer index. The books contain 67501 unique words.

In [15]:
MAX_NB_WORDS = 5000
MAX_SEQUENCE_LENGTH = 500
EMBEDDING_DIM = 100
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(df['text'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 67501 unique tokens.
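To make the tokenizing step more concrete, here is a simplified pure-Python sketch of a frequency-ranked word index. It is not the Keras implementation: `build_word_index` and `texts_to_sequences` are hypothetical helpers, and the rule that only indices below `num_words` are kept reflects my understanding of the Keras behaviour, so treat it as an assumption.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to a rank: 1 for the most frequent word, 2 for the next, ..."""
    counts = Counter(word for text in texts for word in text.split())
    # Index 0 is left unused so it can serve as the padding value later.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their index, dropping words ranked num_words or beyond."""
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]


texts = ["great town great", "great town travel"]
word_index = build_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=3)
print(word_index)
print(sequences)
```

In this toy example the rare word "travel" falls outside the vocabulary limit and is dropped from the second sequence, which is the same effect the 5000-word cap has on the 67501 unique tokens.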


### Because the paragraphs differ in length, each sequence is padded (or truncated) to the maximum sequence length, which I set to 500.

In [16]:
X = tokenizer.texts_to_sequences(df['text'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)

Shape of data tensor: (10818, 500)
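The effect of padding on a single sequence can be sketched in a few lines. This is a simplified illustration that assumes the default 'pre' padding and 'pre' truncation of 'pad_sequences' and a positive `maxlen`; the helper name `pad_pre` is hypothetical.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad a sequence with `value`, or keep only its last `maxlen` items."""
    seq = list(seq)[-maxlen:]                 # truncate from the front if too long
    return [value] * (maxlen - len(seq)) + seq


short = pad_pre([1, 2, 3], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
print(short)
print(long)
```

Every paragraph therefore ends up as a vector of the same length, which is why the data tensor has 500 columns.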


### The Y vector currently holds the names of the authors, so it also needs to be converted into numeric vectors. The 'get_dummies' function generates a one-hot vector for each author automatically.

In [40]:
Y = pd.get_dummies(df['author']).values
print('Shape of label tensor:', Y.shape)
Y

Shape of label tensor: (10818, 3)


array([[0, 1, 0],
       [0, 1, 0],
       [0, 1, 0],
       ...,
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1]], dtype=uint8)
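The printed array shows the Swift rows as [0, 1, 0] and the Shelley rows as [0, 0, 1], which is consistent with the columns being sorted alphabetically by author name. The sketch below reproduces that encoding in plain Python to make the column order explicit. The helper `one_hot` is hypothetical and only mimics the idea behind 'get_dummies'.

```python
def one_hot(labels):
    """Return the sorted class names and a one-hot vector for each label."""
    classes = sorted(set(labels))
    position = {name: i for i, name in enumerate(classes)}
    vectors = [[1 if position[label] == i else 0 for i in range(len(classes))]
               for label in labels]
    return classes, vectors


classes, vectors = one_hot(["Jonathan_Swift", "Jane_Austen", "Mary_Shelley"])
print(classes)
print(vectors)
```

The column order matters later on, because the class indices returned by 'argmax' refer to these columns.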

### Now the dataset is ready to be split into training and testing sets. The training set is 90% of the data, and the remaining 10% is the testing set. Setting 'random_state' fixes the seed for the split, so that running this code always produces the same training and testing sets.

In [18]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.10, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(9736, 500) (9736, 3)
(1082, 500) (1082, 3)
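The set sizes can be checked by hand. The arithmetic below assumes that the test size is rounded up and that 'validation_split' in the later 'fit' calls holds out the final 10% of the training rows, with the fitting portion rounded down. Both rounding rules are assumptions about library behaviour, but they reproduce the sizes reported in this notebook.

```python
import math

n_total = 10818                      # paragraphs in the full dataframe
n_test = math.ceil(0.10 * n_total)   # test rows, rounded up
n_train = n_total - n_test           # rows passed to model.fit

n_fit = int(n_train * (1 - 0.1))     # rows actually used for weight updates
n_val = n_train - n_fit              # rows held out by validation_split=0.1

print(n_train, n_test, n_fit, n_val)
```

So only 8762 of the 10818 paragraphs are used to update the weights; the rest are split between validation and testing.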


In [19]:
X.shape[1]

500

# Baseline model
### Below is the baseline LSTM neural network model. The LSTM layer uses the 'tanh' activation function. The baseline adds a dropout of 0.5, since an LSTM network can easily overfit. The final Dense layer has 3 units, one for each author to be classified, and uses a 'softmax' activation. The model is compiled with the 'categorical_crossentropy' loss, since the target has three categories (the three authors).

In [20]:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(LSTM(32, activation = 'tanh'))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

epochs = 10
batch_size = 128

history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,validation_split=0.1)

Train on 8762 samples, validate on 974 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [21]:
accr = model.evaluate(X_test,Y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))

Test set
  Loss: 0.272
  Accuracy: 0.918


In [22]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 100)          500000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                17024     
_________________________________________________________________
dropout_1 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 3)                 99        
Total params: 517,123
Trainable params: 517,123
Non-trainable params: 0
_________________________________________________________________
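The parameter counts in the summary can be derived from the layer sizes. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the function names are hypothetical and exist only for this check.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding vector per vocabulary entry."""
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    """Four gates, each with an input kernel, a recurrent kernel and a bias."""
    return 4 * (input_dim * units + units * units + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units


counts = {
    "embedding": embedding_params(5000, 100),
    "lstm": lstm_params(100, 32),
    "dense": dense_params(32, 3),
}
total = sum(counts.values())
print(counts, total)
```

Almost all of the parameters sit in the embedding layer, and the dropout layer adds none.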


### From the above result, the model is overfitting, judging by the accuracy on the validation dataset.

# LSTM Neural Network model
### Here, weight constraints are added in order to reduce the overfitting. Instead of dropout, this model constrains the maximum norm of the LSTM layer's kernel, recurrent, and bias weights. I used 3 as the maximum norm for all three. If the norm of a weight vector exceeds 3, the vector is rescaled so that its norm equals 3.
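As an illustration of what a max-norm constraint does to one weight vector, consider the following sketch. It is a simplification: Keras applies the constraint along an axis of each weight matrix after every update, whereas the hypothetical helper `apply_max_norm` here handles a single vector.

```python
import math


def apply_max_norm(weights, max_value=3.0):
    """Rescale a weight vector so that its L2 norm is at most max_value."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_value:
        return list(weights)          # already within the limit, left unchanged
    scale = max_value / norm
    return [w * scale for w in weights]


too_large = apply_max_norm([3.0, 4.0])   # norm 5 is scaled down to norm 3
small = apply_max_norm([1.0, 2.0])       # norm about 2.24 is unchanged
print(too_large, small)
```

The direction of the vector is preserved; only its length is capped, which limits how large individual weights can grow during training.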

In [23]:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(LSTM(32, kernel_constraint=max_norm(3), recurrent_constraint=max_norm(3), 
               bias_constraint=max_norm(3)))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

epochs = 6
batch_size = 128

history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,validation_split=0.1)

Train on 8762 samples, validate on 974 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


In [24]:
accr = model.evaluate(X_test,Y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))

Test set
  Loss: 0.220
  Accuracy: 0.915


In [25]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 100)          500000    
_________________________________________________________________
lstm_2 (LSTM)                (None, 32)                17024     
_________________________________________________________________
dense_2 (Dense)              (None, 3)                 99        
Total params: 517,123
Trainable params: 517,123
Non-trainable params: 0
_________________________________________________________________


### This model performs well and shows little sign of overfitting: the test accuracy is 0.915, against a training accuracy of 0.9618.

# Confusion Matrix
### After fitting the model, look at its confusion matrix to see the prediction accuracy for each author in detail.

In [26]:
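# Predicted class index for each test paragraph
Y_pred = model.predict_classes(X_test)
# Number of correctly classified test paragraphs
sum(Y_pred == Y_test.argmax(axis = 1))
# Rows are true labels, columns are predicted labels
con_mat = tf.confusion_matrix(labels=Y_test.argmax(axis = 1), predictions=Y_pred)
sess = tf.Session()
with sess.as_default():
    print(sess.run(con_mat))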

[[615   1  47]
 [  7  78  16]
 [ 17   4 297]]


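### In the confusion matrix above, rows are true labels and columns are predictions. The class order follows the columns produced by 'get_dummies', which is alphabetical: Jane Austen, Jonathan Swift, and Mary Shelley. Jonathan Swift has far fewer paragraphs in the test set than the other two authors (101, against 663 for Austen and 318 for Shelley), and only 5 paragraphs by the other authors were wrongly assigned to him. However, 23 of his 101 paragraphs were misclassified, so his paragraphs are recognised less reliably (about 77%) than Austen's and Shelley's (about 93% each). Overall, the performance is good for all three authors' paragraphs.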
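Per-author figures can be read off the confusion matrix directly. The sketch below assumes that rows are true labels and columns are predictions, and that the class order is Jane_Austen, Jonathan_Swift, Mary_Shelley. That order is taken from the one-hot array printed earlier, in which the Swift rows are [0, 1, 0]. The helper `per_class_metrics` is hypothetical.

```python
con_mat = [[615, 1, 47],
           [7, 78, 16],
           [17, 4, 297]]
authors = ["Jane_Austen", "Jonathan_Swift", "Mary_Shelley"]


def per_class_metrics(matrix, names):
    """Return recall and precision per class from a square confusion matrix."""
    recall, precision = {}, {}
    for i, name in enumerate(names):
        row_total = sum(matrix[i])                      # true paragraphs of class i
        col_total = sum(row[i] for row in matrix)       # paragraphs predicted as i
        recall[name] = matrix[i][i] / row_total
        precision[name] = matrix[i][i] / col_total
    return recall, precision


recall, precision = per_class_metrics(con_mat, authors)
correct = sum(con_mat[i][i] for i in range(len(authors)))
accuracy = correct / sum(sum(row) for row in con_mat)
print(recall, precision, accuracy)
```

The overall accuracy computed this way agrees with the test accuracy reported by 'model.evaluate'.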

Resources for the LSTM model: https://towardsdatascience.com/multi-class-text-classification-with-lstm-1590bee1bd17