# Natural Language Processing (NLP) using a Recurrent Neural Network

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
from scipy.spatial.distance import cdist

In [3]:
# Note: `from tf.keras.models import Sequential` does not work, because `tf` is
# only an alias created by `import tensorflow as tf`, not a package name.
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, GRU, Embedding
from tensorflow.python.keras.optimizers import Adam
from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences

In [4]:
tf.__version__

'1.8.0'

In [5]:
import keras
keras.__version__

Using TensorFlow backend.


'2.1.6'

In [6]:
import imdb

In [7]:
imdb.maybe_download_and_extract()

Data has apparently already been downloaded and unpacked.


In [8]:
x_train_text, y_train = imdb.load_data(train=True)
x_test_text, y_test = imdb.load_data(train=False)

In [9]:
print("Train-set size: ", len(x_train_text))
print("Test-set size:  ", len(x_test_text))

Train-set size:  25000
Test-set size:   25000


In [10]:
data_text = x_train_text + x_test_text

In [11]:
x_train_text[1]

'Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. Most people think of the homeless as just a lost cause while worrying about things such as racism, the war on Iraq, pressuring kids to succeed, technology, the elections, inflation, or worrying if they\'ll be next to end up on the streets.<br /><br />But what if you were given a bet to live on the streets for a month without the luxuries you once had from a home, the entertainment sets, a bathroom, pictures on the wall, a computer, and everything you once treasure to see what it\'s like to be homeless? That is Goddard Bolt\'s lesson.<br /><br />Mel Brooks (who directs) who stars as Bolt plays a rich man who has everything in the world until deciding to make a bet with a sissy rival (Jeffery Tambor) to see if he can live in the streets for thirty days withou

In [12]:
x_train_text[2]

'Brilliant over-acting by Lesley Ann Warren. Best dramatic hobo lady I have ever seen, and love scenes in clothes warehouse are second to none. The corn on face is a classic, as good as anything in Blazing Saddles. The take on lawyers is also superb. After being accused of being a turncoat, selling out his boss, and being dishonest the lawyer of Pepto Bolt shrugs indifferently "I\'m a lawyer" he says. Three funny words. Jeffrey Tambor, a favorite from the later Larry Sanders show, is fantastic here too as a mad millionaire who wants to crush the ghetto. His character is more malevolent than usual. The hospital scene, and the scene where the homeless invade a demolition site, are all-time classics. Look for the legs scene and the two big diggers fighting (one bleeds). This movie gets better each time I see it (which is quite often).'

In [13]:
y_train[1]

1.0

# Tokenizer

In [14]:
num_words = 10000

In [15]:
tokenizer = Tokenizer(num_words=num_words) 

In [16]:
%%time
tokenizer.fit_on_texts(data_text)

Wall time: 15.1 s


In [17]:
if num_words is None:
    num_words = len(tokenizer.word_index)

We can then inspect the vocabulary that has been gathered by the tokenizer. The words are numbered by how often they occur in the data-set, so the most frequent word gets the lowest number. These integers are called word indices or "tokens" because they uniquely identify each word in the vocabulary.

In [18]:
tokenizer.word_index

{'psychoactive': 50949,
 "1690's": 68757,
 'tooney': 67545,
 "reality's": 64089,
 'guillotines': 32430,
 'haruhiko': 56932,
 "cunningham's": 30128,
 'catalunya': 43979,
 'dawes': 71023,
 'keyboardists': 61866,
 '50ft': 94995,
 'serbia': 21033,
 'katzman': 41248,
 'flitted': 82863,
 'nazies': 116034,
 'enhancement': 24624,
 'anisio': 37536,
 'aavjo': 75803,
 'dissimilar': 20788,
 'consented': 44171,
 'prezzo': 47134,
 "sibrel's": 45838,
 'hilarity': 5652,
 '32nd': 77148,
 "herb's": 122218,
 'nominees': 15889,
 'ojo': 102756,
 'chinaman': 38483,
 'roaming': 12620,
 'remonstration': 96473,
 'repeats': 7978,
 'grossbach': 108986,
 'sadiki': 56148,
 'reinstate': 59553,
 'limped': 34173,
 'reassertion': 94106,
 'kya': 49101,
 'snails”': 120611,
 'amorós': 57187,
 'incentives': 46952,
 'ouies': 116389,
 "pros's": 87938,
 'krusty': 64264,
 'categorical': 61067,
 'engulf': 22892,
 '\x84old': 74886,
 "rybak's": 107186,
 'ohwon': 31078,
 'stroesser': 72354,
 'faulker': 110788,
 'horrortitles': 87

We can then use the tokenizer to convert all texts in the training-set to lists of these tokens.

In [19]:
x_train_tokens = tokenizer.texts_to_sequences(x_train_text)

For example, here is the original text from the training-set:

In [20]:
x_train_text[1]

'Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. Most people think of the homeless as just a lost cause while worrying about things such as racism, the war on Iraq, pressuring kids to succeed, technology, the elections, inflation, or worrying if they\'ll be next to end up on the streets.<br /><br />But what if you were given a bet to live on the streets for a month without the luxuries you once had from a home, the entertainment sets, a bathroom, pictures on the wall, a computer, and everything you once treasure to see what it\'s like to be homeless? That is Goddard Bolt\'s lesson.<br /><br />Mel Brooks (who directs) who stars as Bolt plays a rich man who has everything in the world until deciding to make a bet with a sissy rival (Jeffery Tambor) to see if he can live in the streets for thirty days withou




This text corresponds to the following list of tokens:

In [21]:
np.array(x_train_tokens[1])

array([  38,   14,  744, 3506,   45,   75,   32, 1771,   15,  153,   18,
        110,    3, 1344,    5,  343,  143,   20,    1,  920,   12,   70,
        281, 1228,  395,   35,  115,  267,   36,  166,    5,  368,  158,
         38, 2058,   15,    1,  504,   88,   83,  101,    4,    1, 4339,
         14,   39,    3,  432, 1148,  136, 8697,   42,  177,  138,   14,
       2791,    1,  295,   20, 5276,  351,    5, 3029, 2310,    1,   38,
       8697,   43, 3611,   26,  365,    5,  127,   53,   20,    1, 2032,
          7,    7,   18,   48,   43,   22,   70,  358,    3, 2343,    5,
        420,   20,    1, 2032,   15,    3, 3346,  208,    1,   22,  281,
         66,   36,    3,  344,    1,  728,  730,    3, 3864, 1320,   20,
          1, 1543,    3, 1293,    2,  267,   22,  281, 2734,    5,   63,
         48,   44,   37,    5,   26, 4339,   12,    6, 2079,    7,    7,
       3425, 2891,   35, 4446,   35,  405,   14,  297,    3,  986,  128,
         35,   45,  267,    8,    1,  181,  366, 69

### We also need to convert the texts in the test-set to tokens.

In [22]:
x_test_tokens = tokenizer.texts_to_sequences(x_test_text)
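
As a rough illustration of what the tokenizer does, the sketch below builds a frequency-ordered word index and converts texts to token lists in plain Python. It is a simplified stand-in, not the Keras implementation: the names `build_word_index` and `texts_to_tokens`, the regular expression used to split words, and the toy texts are all assumptions made for this example. Words whose index is not below `num_words` are dropped, which is why rare words go missing from the token lists.

```python
import re
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer, with 1 for the most frequent word."""
    counts = Counter()
    for text in texts:
        # Lower-case, then keep runs of letters, digits and apostrophes.
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # Index 0 is left unused so it can serve as the padding value.
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}


def texts_to_tokens(texts, word_index, num_words=None):
    """Convert texts to lists of tokens, dropping words outside the vocabulary."""
    sequences = []
    for text in texts:
        words = re.findall(r"[a-z0-9']+", text.lower())
        tokens = [word_index[w] for w in words if w in word_index]
        if num_words is not None:
            # Keep only the most frequent words (index below num_words).
            tokens = [t for t in tokens if t < num_words]
        sequences.append(tokens)
    return sequences


# Hypothetical toy data, only for illustration.
toy_texts = ["Good movie", "Bad movie", "Good good film"]
toy_index = build_word_index(toy_texts)
toy_tokens = texts_to_tokens(toy_texts, toy_index, num_words=3)

print(toy_index)
print(toy_tokens)  # [[1, 2], [2], [1, 1]]
```

In the toy data "bad" and "film" fall outside the three-word limit, so the second and third texts lose a word each.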





# Padding and Truncating Data

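The Recurrent Neural Network can take sequences of arbitrary length as input, but all sequences in a batch must have the same length. If we use the length of the longest sequence in the data-set, then we are wasting a lot of memory. This is particularly important for larger data-sets.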

So in order to make a compromise, we will use a sequence-length that covers most sequences in the data-set, and we will then truncate longer sequences and pad shorter sequences.

First we count the number of tokens in all the sequences in the data-set.


In [23]:
num_tokens = [len(tokens) for tokens in x_train_tokens + x_test_tokens]
num_tokens = np.array(num_tokens)

In [24]:
np.mean(num_tokens)  # The average number of tokens in a sequence.

221.27716

In [25]:
np.max(num_tokens)  # The maximum number of tokens in a sequence.

2209

In [26]:
# The max number of tokens we will allow is set to the average plus 2 standard deviations.

max_tokens = np.mean(num_tokens) + 2 * np.std(num_tokens)
max_tokens = int(max_tokens)
max_tokens

544

In [27]:
np.sum(num_tokens < max_tokens) / len(num_tokens)  # This covers about 95% of the data-set.

0.9453

When padding or truncating the sequences that have a different length, we need to determine if we want to do this padding or truncating 'pre' or 'post'. If a sequence is truncated, it means that a part of the sequence is simply thrown away. If a sequence is padded, it means that zeros are added to the sequence.

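So the choice of 'pre' or 'post' can be important because it determines whether we throw away the first or last part of a sequence when truncating, and it determines whether we add zeros to the beginning or end of the sequence when padding. This matters for a Recurrent Neural Network because it processes the sequence in order: with 'pre' padding the zeros come first and the actual words are the last thing the network sees before producing its output.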
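
To make the difference between 'pre' and 'post' concrete, here is a minimal NumPy sketch that pads or truncates a single toy sequence. The function `pad_or_truncate` is a hypothetical helper written for this illustration and only mimics the behaviour described above; the notebook itself uses `pad_sequences` from Keras.

```python
import numpy as np


def pad_or_truncate(tokens, maxlen, mode='pre'):
    """Return a sequence of exactly maxlen tokens (hypothetical helper)."""
    tokens = list(tokens)

    if len(tokens) > maxlen:
        # 'pre' throws away the first part, 'post' throws away the last part.
        tokens = tokens[-maxlen:] if mode == 'pre' else tokens[:maxlen]

    # Zeros are added at the beginning ('pre') or at the end ('post').
    zeros = [0] * (maxlen - len(tokens))
    padded = zeros + tokens if mode == 'pre' else tokens + zeros

    return np.array(padded, dtype=np.int32)


short_seq = [5, 6, 7]
long_seq = [1, 2, 3, 4, 5, 6, 7]

print(pad_or_truncate(short_seq, maxlen=5, mode='pre'))   # [0 0 5 6 7]
print(pad_or_truncate(short_seq, maxlen=5, mode='post'))  # [5 6 7 0 0]
print(pad_or_truncate(long_seq, maxlen=5, mode='pre'))    # [3 4 5 6 7]
print(pad_or_truncate(long_seq, maxlen=5, mode='post'))   # [1 2 3 4 5]
```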

In [29]:
pad = 'pre'

In [30]:
x_train_pad = pad_sequences(x_train_tokens, maxlen=max_tokens,
                            padding=pad, truncating=pad)

In [31]:
x_test_pad = pad_sequences(x_test_tokens, maxlen=max_tokens,
                           padding=pad, truncating=pad)

In [32]:
x_train_pad.shape

(25000, 544)

In [33]:
# Example: the tokens for one text before padding.

np.array(x_train_tokens[1])

array([  38,   14,  744, 3506,   45,   75,   32, 1771,   15,  153,   18,
        110,    3, 1344,    5,  343,  143,   20,    1,  920,   12,   70,
        281, 1228,  395,   35,  115,  267,   36,  166,    5,  368,  158,
         38, 2058,   15,    1,  504,   88,   83,  101,    4,    1, 4339,
         14,   39,    3,  432, 1148,  136, 8697,   42,  177,  138,   14,
       2791,    1,  295,   20, 5276,  351,    5, 3029, 2310,    1,   38,
       8697,   43, 3611,   26,  365,    5,  127,   53,   20,    1, 2032,
          7,    7,   18,   48,   43,   22,   70,  358,    3, 2343,    5,
        420,   20,    1, 2032,   15,    3, 3346,  208,    1,   22,  281,
         66,   36,    3,  344,    1,  728,  730,    3, 3864, 1320,   20,
          1, 1543,    3, 1293,    2,  267,   22,  281, 2734,    5,   63,
         48,   44,   37,    5,   26, 4339,   12,    6, 2079,    7,    7,
       3425, 2891,   35, 4446,   35,  405,   14,  297,    3,  986,  128,
         35,   45,  267,    8,    1,  181,  366, 69

In [34]:
# The same text after padding: zeros have been added at the beginning.

x_train_pad[1]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
         38,   14,  744, 3506,   45,   75,   32, 17

### Tokenizer Inverse Map

The Keras tokenizer in the version used here does not seem to provide the inverse mapping from integer-tokens back to words, which is needed to reconstruct text-strings from lists of tokens. So we make that mapping here.

In [35]:
idx = tokenizer.word_index
inverse_map = dict(zip(idx.values(), idx.keys()))

In [36]:
def tokens_to_string(tokens):
    # Map from tokens back to words.
    words = [inverse_map[token] for token in tokens if token != 0]
    
    # Concatenate all words.
    text = " ".join(words)

    return text

In [39]:
# For example, this is the original text from the data-set:

x_train_text[1]

'Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. Most people think of the homeless as just a lost cause while worrying about things such as racism, the war on Iraq, pressuring kids to succeed, technology, the elections, inflation, or worrying if they\'ll be next to end up on the streets.<br /><br />But what if you were given a bet to live on the streets for a month without the luxuries you once had from a home, the entertainment sets, a bathroom, pictures on the wall, a computer, and everything you once treasure to see what it\'s like to be homeless? That is Goddard Bolt\'s lesson.<br /><br />Mel Brooks (who directs) who stars as Bolt plays a rich man who has everything in the world until deciding to make a bet with a sissy rival (Jeffery Tambor) to see if he can live in the streets for thirty days withou

In [40]:
tokens_to_string(x_train_tokens[1])

"or as george stated has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school work or vote for the matter most people think of the homeless as just a lost cause while worrying about things such as racism the war on iraq kids to succeed technology the or worrying if they'll be next to end up on the streets br br but what if you were given a bet to live on the streets for a month without the you once had from a home the entertainment sets a bathroom pictures on the wall a computer and everything you once treasure to see what it's like to be homeless that is lesson br br mel brooks who directs who stars as plays a rich man who has everything in the world until deciding to make a bet with a sissy rival to see if he can live in the streets for thirty days without the if succeeds he can do what he wants with a future project of making more buildings the on where is thrown on the street with a on his leg t

# Creating the Recurrent Neural Network

In [41]:
model = Sequential()

In [42]:
embedding_size = 8

In [43]:
model.add(Embedding(input_dim=num_words,
                    output_dim=embedding_size,
                    input_length=max_tokens,
                    name='layer_embedding'))

In [44]:
# We can now add the first Gated Recurrent Unit (GRU) to the network.
# This will have 16 outputs. Because we will add a second GRU after this one,
# we need to return sequences of data because the next GRU expects sequences as its input.

model.add(GRU(units=16, return_sequences=True))
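
The effect of `return_sequences` can be seen in a small NumPy sketch of a GRU forward pass. This is a textbook-style formulation with random weights, written only to show the shapes involved; it is not the Keras implementation, and the function `gru_forward` and the toy input are assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_forward(inputs, units, return_sequences=False, seed=0):
    """Run a GRU with random weights over inputs of shape (time_steps, input_dim)."""
    rng = np.random.default_rng(seed)
    input_dim = inputs.shape[1]

    # One input-kernel, recurrent-kernel and bias for each of the
    # update gate (z), reset gate (r) and candidate state (h).
    W = {g: rng.normal(scale=0.1, size=(input_dim, units)) for g in "zrh"}
    U = {g: rng.normal(scale=0.1, size=(units, units)) for g in "zrh"}
    b = {g: np.zeros(units) for g in "zrh"}

    state = np.zeros(units)
    states = []
    for x in inputs:
        z = sigmoid(x @ W["z"] + state @ U["z"] + b["z"])
        r = sigmoid(x @ W["r"] + state @ U["r"] + b["r"])
        candidate = np.tanh(x @ W["h"] + (r * state) @ U["h"] + b["h"])
        # Blend the old state and the candidate state using the update gate.
        state = z * state + (1.0 - z) * candidate
        states.append(state)

    states = np.array(states)
    # Either the state for every time-step, or only the final state.
    return states if return_sequences else states[-1]


# Hypothetical input: 5 time-steps with 8 values each (like 8-dim embeddings).
toy_inputs = np.random.default_rng(1).normal(size=(5, 8))

all_states = gru_forward(toy_inputs, units=16, return_sequences=True)
last_state = gru_forward(toy_inputs, units=16, return_sequences=False)

print(all_states.shape)  # (5, 16)
print(last_state.shape)  # (16,)
```

With `return_sequences=True` there is one output vector per time-step, which is what a following recurrent layer needs as its input. Without it only the final state is returned.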

In [45]:
# This adds the second GRU with 8 output units.
# This will be followed by another GRU so it must also return sequences.


model.add(GRU(units=8, return_sequences=True))

In [46]:
model.add(GRU(units=4))

#### Adding a fully-connected / dense layer which computes a value between 0.0 and 1.0 that will be used as the classification output.

In [49]:
# The sigmoid activation squashes the output to a value between 0.0 and 1.0.

model.add(Dense(1, activation='sigmoid'))

In [50]:
# Adam optimizer with the given learning-rate.

optimizer = Adam(lr=1e-3)

#### Compile the Keras model so it is ready for training.

In [51]:
model.compile(loss='binary_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])
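
For intuition about the sigmoid output and the `binary_crossentropy` loss chosen above, here is a small NumPy sketch of both formulas. The function names, the clipping constant `eps` and the toy values are assumptions made for this illustration; Keras computes the loss internally during training.

```python
import numpy as np


def sigmoid(x):
    """Squash any real number into the range between 0.0 and 1.0."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=np.float64)))


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between true classes and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=np.float64)
    # Clip to avoid taking the logarithm of zero.
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))


# Hypothetical raw outputs of the last layer before the sigmoid.
toy_raw = np.array([2.0, -1.0, 0.0])
toy_probs = sigmoid(toy_raw)
toy_labels = np.array([1.0, 0.0, 1.0])

print(toy_probs)
print(binary_crossentropy(toy_labels, toy_probs))
```

The loss is small when the predicted probability is close to the true class and grows quickly when the model is confidently wrong.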

In [52]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
layer_embedding (Embedding)  (None, 544, 8)            80000     
_________________________________________________________________
gru_1 (GRU)                  (None, 544, 16)           1200      
_________________________________________________________________
gru_2 (GRU)                  (None, 544, 8)            600       
_________________________________________________________________
gru_3 (GRU)                  (None, 4)                 156       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 5         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 2         
Total params: 81,963
Trainable params: 81,963
Non-trainable params: 0
_________________________________________________________________
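
The parameter counts in the summary can be checked by hand. The sketch below assumes the classic GRU formulation with three weight sets (update gate, reset gate and candidate state), each consisting of an input-kernel, a recurrent-kernel and one bias vector; other GRU variants give different counts. The helper functions are hypothetical and exist only for this calculation.

```python
def embedding_params(vocab_size, output_dim):
    # One vector of length output_dim for each token in the vocabulary.
    return vocab_size * output_dim


def gru_params(input_dim, units):
    # Three weight sets, each with an input-kernel, a recurrent-kernel and a bias.
    return 3 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # A weight for each input-output pair plus one bias per output.
    return input_dim * units + units


print(embedding_params(10000, 8))  # 80000
print(gru_params(8, 16))           # 1200
print(gru_params(16, 8))           # 600
print(gru_params(8, 4))            # 156
print(dense_params(4, 1))          # 5
```

These values agree with the rows for `layer_embedding`, `gru_1`, `gru_2`, `gru_3` and `dense_1` in the summary.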


# Training the Recurrent Neural Network

In [53]:
%%time
model.fit(x_train_pad, y_train,
          validation_split=0.05, epochs=3, batch_size=64)

Train on 23750 samples, validate on 1250 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Wall time: 25min 2s


<tensorflow.python.keras._impl.keras.callbacks.History at 0x138881fcba8>

# Performance on Test-Set
Now that the model has been trained we can calculate its classification accuracy on the test-set.

In [54]:
%%time
result = model.evaluate(x_test_pad, y_test)

Wall time: 2min 49s


In [55]:
# ACCURACY

print("Accuracy: {0:.2%}".format(result[1]))

Accuracy: 61.25%


## Example of Mis-Classified Text

In order to show an example of mis-classified text, we first calculate the predicted sentiment for the first 1000 texts in the test-set.

In [56]:
%%time
y_pred = model.predict(x=x_test_pad[0:1000])
y_pred = y_pred.T[0]

Wall time: 5.65 s


These predicted numbers fall between 0.0 and 1.0. We use a cutoff / threshold and say that all values above 0.5 are taken to be 1.0 and all values below 0.5 are taken to be 0.0. This gives us a predicted "class" of either 0.0 or 1.0.

In [57]:
cls_pred = np.array([1.0 if p>0.5 else 0.0 for p in y_pred])

In [58]:
# The true classes for the first 1000 texts in the test-set are needed for comparison.

cls_true = np.array(y_test[0:1000])

In [59]:
# We can then get indices for all the texts that were incorrectly classified
# by comparing all the "classes" of these two arrays.

incorrect = np.where(cls_pred != cls_true)
incorrect = incorrect[0]

In [60]:
len(incorrect)

20
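
The thresholding and comparison steps above can also be written with vectorized NumPy operations. The sketch below uses made-up predictions and classes (all names prefixed with `toy_` are hypothetical) to show how the mis-classified indices and the accuracy relate.

```python
import numpy as np

# Hypothetical predicted probabilities and true classes.
toy_pred = np.array([0.91, 0.12, 0.55, 0.49, 0.73])
toy_true = np.array([1.0, 0.0, 0.0, 1.0, 1.0])

# Values above the 0.5 threshold become class 1.0, the rest class 0.0.
toy_cls = (toy_pred > 0.5).astype(np.float64)

# Indices of the texts where predicted and true class differ.
toy_incorrect = np.where(toy_cls != toy_true)[0]

# Fraction of texts that were classified correctly.
toy_accuracy = np.mean(toy_cls == toy_true)

print(toy_cls)        # [1. 0. 1. 0. 1.]
print(toy_incorrect)  # [2 3]
print(toy_accuracy)   # 0.6
```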

Let us look at the first mis-classified text. We will use its index several times.

In [61]:
idx = incorrect[0]
idx

49

The mis-classified text is:

In [62]:
text = x_test_text[idx]
text

'George Armstrong Custer is known through history as an inept General who led his rgiment to their death at the battle of Little Big Horn. "They Died with their boots on," paints a different picture of General Custer. In this movie he is portrayed as a Flamboyant soldier whose mistakes, and misdeeds are mostly ue to his love for adventure.<br /><br />Errol Flynn plays George Armstrong Custer who we first meet as an over confident recruit at West Point. Custer quickily distinguishes himself from other cadets as beeing a poor student who always seems to be in trouble. Somehow this never appears to bother Custer and only seems to confuse him as he genuinely does not know how he gets into such predicaments. In spite of his poor standing, he eventualy graduates and becomes an officer in the United States Army. Through an error, Custer receives a promotion in rank. Before this can be corrected, he leads a Union regiment into battle against the Confederates. His campaign is successful and Cus

In [63]:
# These are the predicted and true classes for the text:

y_pred[idx]

0.44949216

In [64]:
cls_true[idx]

1.0

## New Data

Let us try and classify new texts that we make up. Some of these are obvious, while others use negation and sarcasm to try and confuse the model into mis-classifying the text.

In [65]:
text1 = "This movie is fantastic! I really like it because it is so good!"
text2 = "Good movie!"
text3 = "Maybe I like this movie."
text4 = "Meh ..."
text5 = "If I were a drunk teenager then this movie might be good."
text6 = "Bad movie!"
text7 = "Not a good movie!"
text8 = "This movie really sucks! Can I get my money back please?"
texts = [text1, text2, text3, text4, text5, text6, text7, text8]

We first convert these texts to arrays of integer-tokens because that is needed by the model.

In [66]:
tokens = tokenizer.texts_to_sequences(texts)

In [67]:
# To input texts with different lengths into the model, we also need to pad and truncate them.

tokens_pad = pad_sequences(tokens, maxlen=max_tokens,
                           padding=pad, truncating=pad)
tokens_pad.shape

(8, 544)

In [68]:
model.predict(tokens_pad)

array([[0.55837977],
       [0.55845195],
       [0.5578019 ],
       [0.5584939 ],
       [0.5581721 ],
       [0.5568378 ],
       [0.55841994],
       [0.5292616 ]], dtype=float32)

A value close to 0.0 means a negative sentiment and a value close to 1.0 means a positive sentiment. These numbers will vary every time you train the model.

## Embeddings

The model cannot work on integer-tokens directly, because they are integer values that may range between 0 and the number of words in our vocabulary, e.g. 10000. So we need to convert the integer-tokens into vectors of values that are roughly between -1.0 and 1.0 which can be used as input to a neural network.

This mapping from integer-tokens to real-valued vectors is also called an "embedding". It is essentially just a matrix where each row contains the vector-mapping of a single token. This means we can quickly lookup the mapping of each integer-token by simply using the token as an index into the matrix. The embeddings are learned along with the rest of the model during training.
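
The lookup described above can be sketched in a few lines of NumPy. The matrix here is filled with random numbers and the sizes are made up for the illustration, whereas the real embedding-matrix is learned during training.

```python
import numpy as np

# Hypothetical sizes, much smaller than in the actual model.
toy_vocab_size = 10
toy_embedding_size = 4

# Each row is the embedding-vector for one integer-token.
rng = np.random.default_rng(0)
toy_embedding = rng.uniform(-1.0, 1.0, size=(toy_vocab_size, toy_embedding_size))

# A padded sequence of integer-tokens.
toy_tokens = np.array([0, 0, 3, 7, 3])

# Using the tokens as row-indices gives one vector per token.
toy_vectors = toy_embedding[toy_tokens]

print(toy_vectors.shape)  # (5, 4)
```

The token 3 occurs twice in the sequence, so the same row of the matrix appears twice in the result.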

Ideally the embedding would learn a mapping where words that are similar in meaning also have similar embedding-values. Let us investigate if that has happened here.

In [69]:
# First we need to get the embedding-layer from the model:

layer_embedding = model.get_layer('layer_embedding')

In [70]:
# We can then get the weights used for the mapping done by the embedding-layer.

weights_embedding = layer_embedding.get_weights()[0]

In [71]:
# The weights are just a matrix with one row for each word in the
# vocabulary and one column for each element of the embedding-vector.

weights_embedding.shape

(10000, 8)

In [72]:
# Let us get the integer-token for the word 'good', which is just an index into the vocabulary.

token_good = tokenizer.word_index['good']
token_good

49

In [73]:
# Let us also get the integer-token for the word 'great'.

token_great = tokenizer.word_index['great']
token_great

78

These integer-tokens may be far apart and will depend on the frequency of those words in the data-set.

Now let us compare the vector-embeddings for the words 'good' and 'great'. Several of these values are similar, although some values are quite different. Note that these values will change every time you train the model.

In [74]:
weights_embedding[token_good]

array([0.978171  , 0.5937643 , 0.43536103, 0.5319829 , 0.08964466,
       0.6342454 , 0.5165227 , 0.6295241 ], dtype=float32)

In [75]:
weights_embedding[token_great]

array([-0.05571837,  0.65573746, -0.14962612,  0.7024609 , -0.17788634,
        0.39176622, -0.01695074,  0.9315679 ], dtype=float32)

In [76]:


# Similarly, we can compare the embeddings for the words 'bad' and 'horrible'.


token_bad = tokenizer.word_index['bad']
token_horrible = tokenizer.word_index['horrible']

In [77]:
weights_embedding[token_bad]

array([ 0.86898375, -0.57986397,  0.45750946,  0.11470479,  0.9568155 ,
        0.24495511,  0.15580253, -0.3214397 ], dtype=float32)

In [78]:
weights_embedding[token_horrible]

array([ 0.7288629 ,  0.48718736,  0.5936784 , -0.35243008,  0.77482307,
        0.35306722,  0.3102466 ,  0.39246282], dtype=float32)




### Sorted Words

We can also sort all the words in the vocabulary according to their "similarity" in the embedding-space. We want to see if words that have similar embedding-vectors also have similar meanings.

Similarity of embedding-vectors can be measured by different metrics, e.g. Euclidean distance or cosine distance.
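
As a small illustration of the two metrics, the sketch below computes them directly with NumPy for the embedding-vectors of 'good' and 'great' that were printed above. The functions `cosine_distance` and `euclidean_distance` are hypothetical helpers for this example; the notebook itself uses `cdist` from SciPy. Cosine distance is taken here as one minus the cosine of the angle between the vectors.

```python
import numpy as np


def cosine_distance(a, b):
    """One minus the cosine of the angle between the two vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def euclidean_distance(a, b):
    """Length of the straight line between the two vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return np.linalg.norm(a - b)


# Embedding-vectors copied from the outputs shown earlier.
vec_good = [0.978171, 0.5937643, 0.43536103, 0.5319829,
            0.08964466, 0.6342454, 0.5165227, 0.6295241]
vec_great = [-0.05571837, 0.65573746, -0.14962612, 0.7024609,
             -0.17788634, 0.39176622, -0.01695074, 0.9315679]

print("cosine:    {0:.3f}".format(cosine_distance(vec_good, vec_great)))
print("euclidean: {0:.3f}".format(euclidean_distance(vec_good, vec_great)))
```

Cosine distance ignores the length of the vectors and only compares their direction, while Euclidean distance is sensitive to both.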

We have a helper-function for calculating these distances and printing the words in sorted order:

In [79]:
def print_sorted_words(word, metric='cosine'):
    """
    Print the words in the vocabulary sorted according to their
    embedding-distance to the given word.
    Different metrics can be used, e.g. 'cosine' or 'euclidean'.
    """

    # Get the token (i.e. integer ID) for the given word.
    token = tokenizer.word_index[word]

    # Get the embedding for the given word. Note that the
    # embedding-weight-matrix is indexed by the word-tokens
    # which are integer IDs.
    embedding = weights_embedding[token]

    # Calculate the distance between the embeddings for
    # this word and all other words in the vocabulary.
    distances = cdist(weights_embedding, [embedding],
                      metric=metric).T[0]
    
    # Get an index sorted according to the embedding-distances.
    # These are the tokens (integer IDs) for words in the vocabulary.
    sorted_index = np.argsort(distances)

    # Remove token 0, which is reserved for padding and has no word.
    # Doing this before sorting the distances keeps the words and
    # the distances aligned with each other.
    sorted_index = sorted_index[sorted_index != 0]

    # Sort the embedding-distances.
    sorted_distances = distances[sorted_index]

    # Sort all the words in the vocabulary according to their
    # embedding-distance. This is a bit excessive because we
    # will only print the top and bottom words.
    sorted_words = [inverse_map[token] for token in sorted_index]

    # Helper-function for printing words and embedding-distances.
    def _print_words(words, distances):
        for word, distance in zip(words, distances):
            print("{0:.3f} - {1}".format(distance, word))

    # Number of words to print from the top and bottom of the list.
    k = 10

    print("Distance from '{0}':".format(word))

    # Print the words with smallest embedding-distance.
    _print_words(sorted_words[0:k], sorted_distances[0:k])

    print("...")

    # Print the words with highest embedding-distance.
    _print_words(sorted_words[-k:], sorted_distances[-k:])

We can then print the words that are near and far from the word 'great' in terms of their vector-embeddings. Note that these may change each time you train the model.

In [80]:
print_sorted_words('great', metric='cosine')

Distance from 'great':
0.000 - great
0.045 - definitely
0.067 - excellent
0.081 - oops
0.086 - beautiful
0.086 - adventure
0.091 - esteem
0.095 - slipped
0.095 - months
0.097 - masterpiece
...
1.152 - turd
1.157 - stinker
1.176 - avoid
1.235 - skip
1.245 - money
1.246 - d
1.328 - waste
1.338 - awful
1.338 - 1
1.360 - bad


In [81]:
print_sorted_words('worst', metric='cosine')

Distance from 'worst':
0.000 - worst
0.086 - offensive
0.092 - mildly
0.096 - happens
0.103 - idiotic
0.110 - teenagers
0.112 - evan
0.116 - amusing
0.119 - failing
0.120 - misses
...
1.082 - times
1.092 - casts
1.108 - finding
1.108 - cedric
1.116 - developing
1.118 - since
1.124 - realization
1.127 - captured
1.160 - yesterday
1.216 - relentless


# Conclusion

This notebook showed the basic methods for doing Natural Language Processing (NLP) using a Recurrent Neural Network with integer-tokens and an embedding layer. These were used to do sentiment analysis of movie reviews from IMDB.