### LSTMs in Keras

---

#### Add the imports

In [1]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense, LSTM
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras import backend as K
import numpy as np
from sklearn.datasets import fetch_20newsgroups
import spacy
import tqdm

---

#### Generate some training data

In [2]:
categories = ['alt.atheism', 'sci.space']
data = fetch_20newsgroups(categories=categories)

In [5]:
#print(data['DESCR'])

In [3]:
data.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [6]:
set(data['target'])

{0, 1}

In [7]:
data['target'][0]

0

In [9]:
print(data['data'][0])

From: bil@okcforum.osrhe.edu (Bill Conner)
Subject: Re: Not the Omni!
Nntp-Posting-Host: okcforum.osrhe.edu
Organization: Okcforum Unix Users Group
X-Newsreader: TIN [version 1.1 PL6]
Lines: 18

Charley Wingate (mangoe@cs.umd.edu) wrote:
: 
: >> Please enlighten me.  How is omnipotence contradictory?
: 
: >By definition, all that can occur in the universe is governed by the rules
: >of nature. Thus god cannot break them. Anything that god does must be allowed
: >in the rules somewhere. Therefore, omnipotence CANNOT exist! It contradicts
: >the rules of nature.
: 
: Obviously, an omnipotent god can change the rules.

When you say, "By definition", what exactly is being defined;
certainly not omnipotence. You seem to be saying that the "rules of
nature" are pre-existant somehow, that they not only define nature but
actually cause it. If that's what you mean I'd like to hear your
further thoughts on the question.

Bill



In [None]:
* Clean the text - remove stop words, special characters, lowercase, lemmatize
* Vectorize our text - turn our text into number equivalents - create a dictionary mapping from words to numbers
                        Use Keras Embedding to make word vectors
* Create our LSTM model 
* Train and test it

In [10]:
X = data['data']
y = data['target']

In [12]:
def clean_my_text(text):
    lemmatized = []
    text = text.lower()
    tokens = model(text)
    for word in tokens:
        if not word.is_stop and word.is_alpha:
            lemmatized.append(word.lemma_)
    return lemmatized

In [14]:
clean_my_text(X[0])

['bill',
 'conner',
 'subject',
 'omni',
 'nntp',
 'post',
 'host',
 'organization',
 'okcforum',
 'unix',
 'user',
 'group',
 'x',
 'newsreader',
 'tin',
 'version',
 'line',
 'charley',
 'wingate',
 'write',
 'enlighten',
 'omnipotence',
 'contradictory',
 'definition',
 'occur',
 'universe',
 'govern',
 'rule',
 'nature',
 'god',
 'break',
 'god',
 'allow',
 'rule',
 'omnipotence',
 'exist',
 'contradict',
 'rule',
 'nature',
 'obviously',
 'omnipotent',
 'god',
 'change',
 'rule',
 'definition',
 'exactly',
 'define',
 'certainly',
 'omnipotence',
 'say',
 'rule',
 'nature',
 'pre',
 'existant',
 'define',
 'nature',
 'actually',
 'cause',
 'mean',
 'like',
 'hear',
 'thought',
 'question',
 'bill']

In [11]:
model = spacy.load('en_core_web_sm')

In [20]:
clean_X = []

for text in tqdm.tqdm(X):
    results = clean_my_text(text)
    clean_X.append(results)

100%|██████████| 1073/1073 [02:01<00:00,  8.83it/s]


In [23]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(word_vec_X,y)

#### We have some data, but now we need to preprocess it

In [81]:
vocab_list = ['']
for text in clean_X:
    for word in text:
        vocab_list.append(word)
vocab_list = list(set(vocab_list))

In [83]:
word_to_num = {}
num_to_word = {}
for i, word in enumerate(vocab_list):
    num_to_word[i] = word
    word_to_num[vocab_list[i]] = i

#### Now turn our reviews into 'word vectors', and pad the text so they are all the same size

In [92]:
word_vec_X = [[word_to_num[word] for word in text] for text in clean_X]

In [93]:
max_length = len(sorted(clean_X)[0])

In [94]:
word_vec_X = sequence.pad_sequences(word_vec_X, maxlen=max_length, padding='pre')

In [96]:
len(vocab_list)

14465

In [95]:
word_vec_X[0]

array([    0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,  7427,  2220,
        1328, 10245, 13222,  8694,  5973, 13306, 11764,  7452, 11119,
        7222, 10153,  4872,  3876, 11801, 11986,  1095,  7324, 13844,
        5429,  7780,  7150,  4479,   243, 11496,  2407,  8606, 12239,
        2188,  2324,  2188, 11015,  8606,  7780,  8759,  2317,  8606,
       12239,  8229,  3096,  2188, 11077,  8606,  4479,  8705,  2647,
       12405,  7780,  6836,  8606, 12239, 12003,  8439,  2647, 12239,
         572,  5788, 13161, 11390, 10003, 13408, 10681,  7427],
      dtype=int32)

In [97]:
vocab_size= len(vocab_list) + 1

---

#### Now lets create the model - its a standard Sequential with an Embedding and an LSTM layer added

Embedding:
This layer takes 3 parameters - the size of the vocab (input_dims), the no. of dimensions of each word embedding (output_dim), and the length of each document (input_length), which we've standardised above. It returns a 2d matrix, with rows equal to each word in the document, and columns equal to the number of dimensions in the word embedding. 

*Actually its 3D, cos the batch_size is the first dimension in both input and output, but I find that confuses things more than it clarifies*
Put another way 

The embedding **takes in** a factorized corpus, e.g.:

**[The, cat, sat, on, the, mat]**    becomes    **[1,2,3,4,1,5]**

And **outputs** a word embedded corpus:

**[1,2,3,4,1,5]**    becomes (lets assume output_dim=2)   **[[0.2,0.7], [0.6,0.3], [0.1,0.8], [0.2,0.1], [0.2,0.7], [0.4,0.9]]**

In [98]:
model = Sequential()
model.add(Embedding(vocab_size,64,input_length=max_length))
model.add(LSTM(512))
model.add(Dense(1, activation='sigmoid'))

In [99]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 125, 64)           925824    
_________________________________________________________________
lstm (LSTM)                  (None, 512)               1181696   
_________________________________________________________________
dense (Dense)                (None, 1)                 513       
Total params: 2,108,033
Trainable params: 2,108,033
Non-trainable params: 0
_________________________________________________________________


---

In [101]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(word_vec_X,y)

### Now train and test

In [100]:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

In [102]:
model.fit(Xtrain,ytrain, epochs=5, batch_size=128, validation_split=0.2)

Train on 643 samples, validate on 161 samples
Epoch 1/5
128/643 [====>.........................] - ETA: 35s

KeyboardInterrupt: 

#### Two ways of evaluating your model on unseeen data

In [None]:
model.evaluate(Xtest,ytest)

In [None]:
test_results = model.predict(Xtest)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(np.argmax(test_results), Xtest)

---

## TEST - Try with test data thats genuinely new

### e.g twitter data, newspaper articles