### LSTMs in Keras

---


#### Add the imports

In [4]:
import tensorflow.keras as keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense, LSTM
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras import backend as K
import numpy as np
from sklearn.datasets import fetch_20newsgroups
import spacy
import tqdm

---

#### Generate some training data

In [5]:
categories = ['alt.atheism', 'sci.space']
data = fetch_20newsgroups(categories=categories)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [7]:
print(data['DESCR'])

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features       

In [6]:
data.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [8]:
set(data['target'])

{0, 1}

In [9]:
print(data['data'][0])

From: bil@okcforum.osrhe.edu (Bill Conner)
Subject: Re: Not the Omni!
Nntp-Posting-Host: okcforum.osrhe.edu
Organization: Okcforum Unix Users Group
X-Newsreader: TIN [version 1.1 PL6]
Lines: 18

Charley Wingate (mangoe@cs.umd.edu) wrote:
: 
: >> Please enlighten me.  How is omnipotence contradictory?
: 
: >By definition, all that can occur in the universe is governed by the rules
: >of nature. Thus god cannot break them. Anything that god does must be allowed
: >in the rules somewhere. Therefore, omnipotence CANNOT exist! It contradicts
: >the rules of nature.
: 
: Obviously, an omnipotent god can change the rules.

When you say, "By definition", what exactly is being defined;
certainly not omnipotence. You seem to be saying that the "rules of
nature" are pre-existant somehow, that they not only define nature but
actually cause it. If that's what you mean I'd like to hear your
further thoughts on the question.

Bill



In [10]:
# Plan:
# 1. Clean the text: lowercase, remove stop words and special characters, lemmatize
# 2. Vectorize the text: create a dictionary mapping from words to numbers
# 3. Use a Keras Embedding layer to turn those numbers into word vectors
# 4. Create the LSTM model
# 5. Train and test it

In [11]:
X=data['data']
y=data['target']

In [14]:
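def clean_my_text(text):
    """Lowercase the text and return the lemmas of its alphabetic, non-stop-word tokens."""
    # `model` is the spaCy pipeline created by spacy.load('en_core_web_sm');
    # that cell has to be run before this function is called.
    lemmatized = []
    text = text.lower()
    tokens = model(text)
    for word in tokens:
        if not word.is_stop and word.is_alpha:
            lemmatized.append(word.lemma_)
    return lemmatized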

In [15]:
clean_my_text(X[0])

['bill',
 'conner',
 'subject',
 'omni',
 'nntp',
 'post',
 'host',
 'organization',
 'okcforum',
 'unix',
 'user',
 'group',
 'x',
 'newsreader',
 'tin',
 'version',
 'line',
 'charley',
 'wingate',
 'write',
 'enlighten',
 'omnipotence',
 'contradictory',
 'definition',
 'occur',
 'universe',
 'govern',
 'rule',
 'nature',
 'god',
 'break',
 'god',
 'allow',
 'rule',
 'omnipotence',
 'exist',
 'contradict',
 'rule',
 'nature',
 'obviously',
 'omnipotent',
 'god',
 'change',
 'rule',
 'definition',
 'exactly',
 'define',
 'certainly',
 'omnipotence',
 'say',
 'rule',
 'nature',
 'pre',
 'existant',
 'define',
 'nature',
 'actually',
 'cause',
 'mean',
 'like',
 'hear',
 'thought',
 'question',
 'bill']

In [12]:
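# spaCy English pipeline used by clean_my_text (the name `model` is rebound to the Keras model later on)
model = spacy.load('en_core_web_sm')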

In [18]:
clean_X = []

for text in tqdm.tqdm(X):
    results = clean_my_text(text)
    clean_X.append(results)
    

100%|██████████████████████████████████████████████████████████████████████████████| 1073/1073 [01:40<00:00, 10.69it/s]


In [51]:
from sklearn.model_selection import train_test_split

In [52]:
# This split of the raw token lists is overwritten later by the split on the padded sequences
Xtrain, Xtest, ytrain, ytest = train_test_split(clean_X, y)

In [53]:
vocab_list = ['']
for text in clean_X:
    for word in text:
        vocab_list.append(word)

In [54]:
vocab_list[:100]

['',
 'bill',
 'conner',
 'subject',
 'omni',
 'nntp',
 'post',
 'host',
 'organization',
 'okcforum',
 'unix',
 'user',
 'group',
 'x',
 'newsreader',
 'tin',
 'version',
 'line',
 'charley',
 'wingate',
 'write',
 'enlighten',
 'omnipotence',
 'contradictory',
 'definition',
 'occur',
 'universe',
 'govern',
 'rule',
 'nature',
 'god',
 'break',
 'god',
 'allow',
 'rule',
 'omnipotence',
 'exist',
 'contradict',
 'rule',
 'nature',
 'obviously',
 'omnipotent',
 'god',
 'change',
 'rule',
 'definition',
 'exactly',
 'define',
 'certainly',
 'omnipotence',
 'say',
 'rule',
 'nature',
 'pre',
 'existant',
 'define',
 'nature',
 'actually',
 'cause',
 'mean',
 'like',
 'hear',
 'thought',
 'question',
 'bill',
 'jurriaan',
 'wittenberg',
 'subject',
 'magellan',
 'update',
 'organization',
 'utrecht',
 'university',
 'dept',
 'computer',
 'science',
 'keyword',
 'magellan',
 'jpl',
 'line',
 'ron',
 'baalke',
 'write',
 'forward',
 'doug',
 'griffith',
 'magellan',
 'project',
 'manager'

In [55]:
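# Deduplicate. Set order is arbitrary, so '' ending up at index 0 (the padding value) is not guaranteed.
vocab_list = list(set(vocab_list))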

#### We have some data, but now we need to preprocess it

In [56]:
word_to_num = {}
num_to_word = {}
for i, word in enumerate(vocab_list):
    num_to_word[i] = word
    word_to_num[word] = i
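
Because a `set` has no guaranteed order, the index each word receives can change between runs, and the padding token `''` landing on index 0 is an implementation detail rather than a promise. A minimal sketch of a reproducible alternative, assuming a toy corpus in place of `clean_X` (all `toy_*` names and `PAD` are hypothetical, not part of the notebook):

```python
# Hypothetical toy corpus standing in for clean_X (lists of lemmatized tokens).
toy_docs = [["god", "rule", "nature"], ["magellan", "update", "rule"]]

PAD = ""  # the same padding token the notebook puts in vocab_list

# Sort the unique words so the mapping is identical on every run,
# and put the padding token first so it always receives index 0.
unique_words = sorted({word for doc in toy_docs for word in doc} - {PAD})
toy_vocab = [PAD] + unique_words

toy_word_to_num = {word: i for i, word in enumerate(toy_vocab)}
toy_num_to_word = {i: word for word, i in toy_word_to_num.items()}

print(toy_word_to_num)
# {'': 0, 'god': 1, 'magellan': 2, 'nature': 3, 'rule': 4, 'update': 5}
```

Sorting costs a little time on a large vocabulary, but it makes saved models and saved mappings line up across sessions.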

In [57]:
vocab_list[0]

''

In [58]:
word_to_num

{'': 0,
 'exotic': 1,
 'airplane': 2,
 'center': 3,
 'karman': 4,
 'illegitimate': 5,
 'meng': 6,
 'zach': 7,
 'tomayko': 8,
 'disturb': 9,
 'ahem': 10,
 'neuron': 11,
 'deletia': 12,
 'pixel': 13,
 'cuff': 14,
 'warmly': 15,
 'tedious': 16,
 'durham': 17,
 'renoir': 18,
 'vw': 19,
 'peopel': 20,
 'hominum': 21,
 'stansbery': 22,
 'microfilm': 23,
 'subjectively': 24,
 'vast': 25,
 'bradford': 26,
 'mastercard': 27,
 'infiltrate': 28,
 'establishment': 29,
 'polytheist': 30,
 'widget': 31,
 'technische': 32,
 'partook': 33,
 'ringlet': 34,
 'specifieke': 35,
 'night': 36,
 'disagreement': 37,
 'fairman': 38,
 'administer': 39,
 'rebroadcast': 40,
 'funeral': 41,
 'overweight': 42,
 'plane': 43,
 'city': 44,
 'lecture': 45,
 'academic': 46,
 'blood': 47,
 'announcement': 48,
 'marketing': 49,
 'presently': 50,
 'obesity': 51,
 'heavy': 52,
 'anchor': 53,
 'magnetic': 54,
 'installation': 55,
 'tube': 56,
 'criminal': 57,
 'library': 58,
 'muddy': 59,
 'sofia': 60,
 'san': 61,
 'nuke': 6

In [59]:
num_to_word

{0: '',
 1: 'exotic',
 2: 'airplane',
 3: 'center',
 4: 'karman',
 5: 'illegitimate',
 6: 'meng',
 7: 'zach',
 8: 'tomayko',
 9: 'disturb',
 10: 'ahem',
 11: 'neuron',
 12: 'deletia',
 13: 'pixel',
 14: 'cuff',
 15: 'warmly',
 16: 'tedious',
 17: 'durham',
 18: 'renoir',
 19: 'vw',
 20: 'peopel',
 21: 'hominum',
 22: 'stansbery',
 23: 'microfilm',
 24: 'subjectively',
 25: 'vast',
 26: 'bradford',
 27: 'mastercard',
 28: 'infiltrate',
 29: 'establishment',
 30: 'polytheist',
 31: 'widget',
 32: 'technische',
 33: 'partook',
 34: 'ringlet',
 35: 'specifieke',
 36: 'night',
 37: 'disagreement',
 38: 'fairman',
 39: 'administer',
 40: 'rebroadcast',
 41: 'funeral',
 42: 'overweight',
 43: 'plane',
 44: 'city',
 45: 'lecture',
 46: 'academic',
 47: 'blood',
 48: 'announcement',
 49: 'marketing',
 50: 'presently',
 51: 'obesity',
 52: 'heavy',
 53: 'anchor',
 54: 'magnetic',
 55: 'installation',
 56: 'tube',
 57: 'criminal',
 58: 'library',
 59: 'muddy',
 60: 'sofia',
 61: 'san',
 62: 'nuke

#### Now turn our posts into sequences of word indices, and pad them so they are all the same length

In [60]:
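# Note: without key=len this sorts the token lists alphabetically (lexicographically), not by length
sorted_list = sorted(clean_X, reverse=True)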

In [61]:
sorted_list[0]

['yevgeny',
 'gene',
 'kilman',
 'subject',
 'usatoday',
 'ad',
 'family',
 'value',
 'organization',
 'florida',
 'international',
 'university',
 'miami',
 'line',
 'article',
 'dan',
 'e',
 'babcock',
 'write',
 'funny',
 'ad',
 'usatoday',
 'american',
 'family',
 'association',
 'post',
 'choice',
 'part',
 'enjoyment',
 'emphasis',
 'ad',
 'add',
 'typo',
 'dan',
 'article',
 'delete',
 'find',
 'add',
 'local',
 'sunday',
 'newspaper',
 'add',
 'place',
 'cartoon',
 'section',
 'perfect',
 'place']

---

In [62]:
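# Caution: sorted(clean_X) is alphabetical, so this is the length of the alphabetically first post, not the longest; pad_sequences truncates anything longer
max_length = len(sorted(clean_X)[0])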

In [63]:
max_length

125
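
Sorting lists of strings compares them word by word, so `sorted(clean_X)[0]` is the alphabetically first post rather than the longest one. If the goal is a padding length based on document size, one option is to look at the lengths directly. A small sketch, assuming toy token lists in place of `clean_X` (the names and the 0.75 coverage fraction are illustrative choices, not the notebook's):

```python
# Hypothetical token lists standing in for clean_X.
toy_docs = [["a"] * 3, ["b"] * 10, ["c"] * 5, ["d"] * 4]

lengths = sorted(len(doc) for doc in toy_docs)

# Length of the longest document: nothing would be truncated.
longest = lengths[-1]

# A cheaper alternative: a cutoff that covers roughly 75% of the documents.
# The 0.75 fraction is an arbitrary value chosen for illustration.
cutoff_index = int(0.75 * (len(lengths) - 1))
covering_cutoff = lengths[cutoff_index]

print(longest, covering_cutoff)  # 10 5
```

Padding to the longest post keeps every token but can make most rows almost entirely zeros; a coverage-based cutoff trades a little truncation for shorter sequences.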

In [64]:
word_vec_X = [[word_to_num[word] for word in text] for text in clean_X]

In [68]:
word_vec_X = sequence.pad_sequences(word_vec_X, maxlen=max_length, padding='pre')

In [69]:
word_vec_X[0]

array([    0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,  2800, 13303,
       11793, 11154, 11925,  5456, 14146,  4417,  6961,  5795,   784,
        5715,  5051, 11231, 12346,  9729,  8500, 13745,  9546,  1688,
       13202,  8214,  1645,  5568,  4860,  4995,  7732,  9256,  6091,
        2920, 13039,  2920,  7416,  9256,  8214,  1028,   136,  9256,
        6091,  4062,  1288,  2920, 10538,  9256,  5568,  1854,  8780,
        9124,  8214,   185,  9256,  6091,  9206,  4001,  8780,  6091,
        1518,  5460,   180, 10329, 13842,  2897,  3005,  2800])
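
The leading zeros in the output come from `padding='pre'`. With the default `truncating='pre'`, posts longer than `maxlen` lose tokens from the front as well. The following pure-Python sketch mimics that behaviour for a single sequence; `pad_pre` is a hypothetical helper written for illustration (it assumes `maxlen >= 1`) and is not the Keras implementation:

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad (or left-truncate) one sequence of ints to exactly maxlen items."""
    if len(seq) >= maxlen:
        # Keep the LAST maxlen items, mirroring truncating='pre'.
        return list(seq[-maxlen:])
    # Fill the front with the padding value, mirroring padding='pre'.
    return [value] * (maxlen - len(seq)) + list(seq)

short = pad_pre([7, 8, 9], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 7, 8, 9]
print(long_)  # [3, 4, 5, 6, 7]
```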

In [70]:
vocab_size=len(vocab_list)+1

#### Now let's create the model - it's a standard Sequential with an Embedding and an LSTM layer added

Embedding:
This layer takes 3 parameters: the size of the vocab (input_dim), the number of dimensions of each word embedding (output_dim), and the length of each document (input_length), which we've standardised above. It returns a 2D matrix with one row per word in the document and one column per dimension of the word embedding.

*Strictly speaking it's 3D, because the batch size is the first dimension of both input and output, but I find that confuses things more than it clarifies.*

Put another way:

The embedding **takes in** a factorized corpus, e.g.:

**[The, cat, sat, on, the, mat]**    becomes    **[1,2,3,4,1,5]**

And **outputs** a word embedded corpus:

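**[1,2,3,4,1,5]**    becomes (let's assume output_dim=2)   **[[0.2,0.7], [0.6,0.3], [0.1,0.8], [0.2,0.1], [0.2,0.7], [0.4,0.9]]**

Both occurrences of index 1 map to the same vector, because the embedding is a lookup on the word index.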
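
To make the lookup idea concrete, here is a minimal pure-Python sketch of what an embedding does at inference time: it replaces each word index with a row from a table. The table values below are made up for illustration; in Keras the rows start out random and are learned during training. `embedding_table` and `embed` are hypothetical names.

```python
# Hypothetical 2-dimensional embedding table; row i is the vector for word index i.
# Values are invented for illustration only.
embedding_table = [
    [0.0, 0.0],  # 0: padding
    [0.2, 0.7],  # 1: the
    [0.6, 0.3],  # 2: cat
    [0.1, 0.8],  # 3: sat
    [0.2, 0.1],  # 4: on
    [0.4, 0.9],  # 5: mat
]

def embed(indices):
    """Replace each word index with its row of the table."""
    return [embedding_table[i] for i in indices]

vectors = embed([1, 2, 3, 4, 1, 5])
print(len(vectors), len(vectors[0]))  # 6 2
print(vectors[0] == vectors[4])       # True: the same word gives the same vector
```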

In [71]:
model = Sequential()

In [72]:
model.add(Embedding(vocab_size, 64, input_length=max_length))
model.add(LSTM(512))
model.add(Dense(1, activation='sigmoid'))


In [73]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 125, 64)           925824    
_________________________________________________________________
lstm (LSTM)                  (None, 512)               1181696   
_________________________________________________________________
dense (Dense)                (None, 1)                 513       
Total params: 2,108,033
Trainable params: 2,108,033
Non-trainable params: 0
_________________________________________________________________
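
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias) and takes the vocabulary size implied by the summary, 925,824 / 64 = 14,466:

```python
vocab_size = 14466   # implied by the summary: 925,824 / 64
embed_dim = 64       # output_dim of the Embedding layer
lstm_units = 512

# One embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# 4 gates, each with input weights, recurrent weights and a bias vector.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# One weight per LSTM unit plus a single bias for the sigmoid output.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 925824 1181696 513 2108033
```

The LSTM holds more than half of the weights even though the embedding table covers the whole vocabulary, which is worth keeping in mind when training on only about a thousand posts.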


---

### Now train and test

In [74]:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

In [75]:
Xtrain, Xtest, ytrain, ytest = train_test_split(word_vec_X, y)

In [76]:
model.fit(Xtrain, ytrain, epochs=2, batch_size=128, validation_split=0.2)

Train on 643 samples, validate on 161 samples
Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x1d49ad89a08>

In [78]:
test_results=model.predict(Xtest)

In [79]:
from sklearn.metrics import accuracy_score

In [82]:
# Threshold the sigmoid probabilities at 0.5 to get 0/1 labels, then compare them with the true labels
accuracy_score(ytest, (test_results > 0.5).astype(int).ravel())

In [81]:
model.evaluate(Xtest, ytest)



[0.6394562774431307, 0.866171]
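
`model.evaluate` returns the loss followed by the accuracy metric. With a single sigmoid output unit, `model.predict` returns one probability per post rather than a class label, so the probabilities need thresholding before they can be compared with `ytest`. A minimal pure-Python sketch of that step, using made-up probabilities (`to_labels`, `accuracy` and the `toy_*` values are hypothetical, not results from this model):

```python
def to_labels(probabilities, threshold=0.5):
    """Turn sigmoid outputs (floats in [0, 1]) into 0/1 class labels."""
    return [1 if p > threshold else 0 for p in probabilities]

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Invented values for illustration only.
toy_probs = [0.91, 0.12, 0.55, 0.40]
toy_true = [1, 0, 0, 0]

toy_pred = to_labels(toy_probs)
print(toy_pred, accuracy(toy_true, toy_pred))  # [1, 0, 1, 0] 0.75
```

Keras returns the probabilities with shape `(n, 1)`, so in practice they would be flattened to one dimension before being passed to a function like this.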