### Tokenizer
The Tokenizer class in Keras has various methods which help to prepare text so it can be used in neural network models.

In [2]:
# Import tokenizer
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


Here the top-n word limit (nb_words) does not truncate the vocabulary built from the input; it only limits which words are used when texts are converted to sequences. In this example the limit is set to 3.

In [3]:

nb_words = 3  # top-n limit: only the most frequent words are used when converting texts
tokenizer = Tokenizer(num_words=nb_words)  # num_words replaces the deprecated nb_words argument
 



- Training is done with the fit_on_texts method, and the resulting word index is available through the word_index property.
- Note that there is a basic filtering of punctuation (exclamation marks and such).

In [4]:
tokenizer.fit_on_texts(["The sun is shining in June!","September is grey.","Life is beautiful in August.","I like it","This and other things?"])
print(tokenizer.word_index)

{'is': 1, 'in': 2, 'the': 3, 'sun': 4, 'shining': 5, 'june': 6, 'september': 7, 'grey': 8, 'life': 9, 'beautiful': 10, 'august': 11, 'i': 12, 'like': 13, 'it': 14, 'this': 15, 'and': 16, 'other': 17, 'things': 18}
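The ordering of the word index can be reproduced in a few lines of plain Python. This is only a sketch, not Keras's implementation: `build_word_index` is a hypothetical helper, and it assumes a simplified tokenizer that lower-cases the text and keeps runs of letters. It shows that indices start at 1 and are assigned by descending frequency, with ties broken by first appearance.

```python
import re
from collections import Counter

texts = ["The sun is shining in June!", "September is grey.",
         "Life is beautiful in August.", "I like it", "This and other things?"]

def build_word_index(texts):
    """Hypothetical helper: rank words by frequency, most frequent first."""
    counts = Counter()
    for text in texts:
        # Simplified tokenization (an assumption): lower-case, letters only.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    # most_common() sorts by count, descending; ties keep first-seen order.
    ranked = counts.most_common()
    # Index 0 is left unused, so ranks start at 1.
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

word_index = build_word_index(texts)
print(word_index)
```

'is' occurs three times and 'in' twice, which is why they get indices 1 and 2; every other word occurs once and keeps its order of appearance.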


- You can see that the value 3 (top-3 words) passed to the Tokenizer object is clearly not respected in the sense of limiting the dictionary.
- It is respected, however, in the <b>texts_to_sequences method</b>, which turns input into <b>numerical arrays</b>:

In [6]:
tokenizer.texts_to_sequences(["June is beautiful and I like it!"])

[[1]]

You need to read this as: keep only words with an index strictly less than 3 (the constructor parameter), so of the words in this sentence only 'is' (index 1) survives. A parameter-less constructor yields the full sequences:



In [9]:
tokenizer = Tokenizer()
texts = ["The sun is shining in June!","September is grey.","Life is beautiful in August.","I like it","This and other things?"]
tokenizer.fit_on_texts(texts) # list of texts to train on
print(tokenizer.word_index) # Word index
tokenizer.texts_to_sequences(["June is beautiful and I like it!"]) # applying text index to sequences.

{'is': 1, 'in': 2, 'the': 3, 'sun': 4, 'shining': 5, 'june': 6, 'september': 7, 'grey': 8, 'life': 9, 'beautiful': 10, 'august': 11, 'i': 12, 'like': 13, 'it': 14, 'this': 15, 'and': 16, 'other': 17, 'things': 18}


[[6, 1, 10, 16, 12, 13, 14]]
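The difference between the limited and the unlimited tokenizer can be sketched in plain Python. This is an illustration under assumptions, not Keras's code: `to_sequence` is a hypothetical helper with a simplified tokenizer, and it assumes (as the `[[1]]` output earlier suggests) that an index is kept only when it is strictly below the limit and that unknown words are dropped.

```python
import re

word_index = {'is': 1, 'in': 2, 'the': 3, 'sun': 4, 'shining': 5, 'june': 6,
              'september': 7, 'grey': 8, 'life': 9, 'beautiful': 10,
              'august': 11, 'i': 12, 'like': 13, 'it': 14, 'this': 15,
              'and': 16, 'other': 17, 'things': 18}

def to_sequence(text, word_index, num_words=None):
    """Hypothetical helper mimicking the top-n filter of texts_to_sequences."""
    sequence = []
    for token in re.findall(r"[a-z]+", text.lower()):
        index = word_index.get(token)
        if index is None:
            continue  # word was never seen during fitting: dropped
        if num_words is not None and index >= num_words:
            continue  # index outside the top-n range: dropped
        sequence.append(index)
    return sequence

sentence = "June is beautiful and I like it!"
limited = to_sequence(sentence, word_index, num_words=3)
full = to_sequence(sentence, word_index)
print(limited)  # [1]
print(full)     # [6, 1, 10, 16, 12, 13, 14]
```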

In [22]:
print("Length of word_index:",len(tokenizer.word_index))

Length of word_index: 18


To check whether lower-casing was applied and how many sentences were used to train:

In [10]:
print("Was lower-case applied to %s sentences?: %s"%(tokenizer.document_count,tokenizer.lower))

Was lower-case applied to 5 sentences?: True


If you want to feed sentences to a network you can’t use arrays of variable lengths, corresponding to variable length sentences. So, the trick is to use the texts_to_matrix method to convert the sentences directly to equal size arrays:

In [23]:
tokenizer.texts_to_matrix(["June is beautiful and I like it!","Like August"])

array([[0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 1., 1., 0.,
        1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.,
        0., 0., 0.]])

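This creates one row per sentence and one column per vocabulary index (19 columns here: the 18 words plus index 0, which is never assigned to a word). A 1 marks each word that occurs in the sentence.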
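A minimal NumPy sketch of how such a binary matrix can be built from index sequences is shown here. `to_binary_matrix` is a hypothetical helper, not part of Keras, and it only covers the binary mode seen in the output above.

```python
import numpy as np

def to_binary_matrix(sequences, vocab_size):
    """Hypothetical helper: one row per sequence, one column per word index."""
    matrix = np.zeros((len(sequences), vocab_size))
    for row, indices in enumerate(sequences):
        matrix[row, indices] = 1.0  # mark every word index that occurs
    return matrix

# Index sequences for "June is beautiful and I like it!" and "Like August"
sequences = [[6, 1, 10, 16, 12, 13, 14], [13, 11]]
matrix = to_binary_matrix(sequences, vocab_size=19)
print(matrix.shape)  # (2, 19)
print(matrix)
```

Word order and repetition are lost in this representation: each row only records which words are present.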

Now the data is in a form that you can feed to a network.

### Basic network with textual data
For example, let’s say you want to detect the word ‘shining’ in the sequences above.
The most basic way would be to use a layer with some nodes like so:

In [24]:
# Importing necessary libraries
from keras.preprocessing import sequence
from keras.models import Sequential
# All layers can be imported from keras.layers directly
from keras.layers import Dense, Activation, Flatten, TimeDistributed, Embedding, LSTM

- If you want to feed sentences to a network, you can't use arrays of variable lengths.
- So use the texts_to_matrix method to convert the sentences directly to equal size arrays.

In [31]:
X = tokenizer.texts_to_matrix(texts) # Having equal size arrays of the texts

In [32]:
print(X)

[[0. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1.]]


In [33]:
y = [1,0,0,0,0]

In [35]:
tokenizer.word_index # Will display index of the words

{'and': 16,
 'august': 11,
 'beautiful': 10,
 'grey': 8,
 'i': 12,
 'in': 2,
 'is': 1,
 'it': 14,
 'june': 6,
 'life': 9,
 'like': 13,
 'other': 17,
 'september': 7,
 'shining': 5,
 'sun': 4,
 'the': 3,
 'things': 18,
 'this': 15}

In [36]:
print('Total unique words present in the texts:',len(tokenizer.word_index))

Total unique words present in the texts: 18


In [37]:
# Initializing the vocab_size
vocab_size = len(tokenizer.word_index) + 1  # 19

In [42]:
model = Sequential()

# 2 -> Output array of shape(*,2)
# (vocab_size) 19 -> Input array of shape(*,19)

model.add(Dense(2, input_dim=vocab_size))

# After the 1st layer you don't need to specify the size of the input anymore.
model.add(Dense(1, activation='sigmoid'))
 
 
model.compile(loss='binary_crossentropy', optimizer='rmsprop')

In [45]:
# fit -> trains the model for a fixed number of epochs (iterations over the dataset)

# x -> numpy array of training data
# y -> numpy array of target(label) data

# validation_split -> Float between 0 and 1. Fraction of the training data to be used as validation data.
# The model will set apart this fraction of the training data, will not train on it,
# and will evaluate the loss and any model metrics on this data at the end of each epoch.

history = model.fit(X, y=y, batch_size=200, epochs=700, verbose=0, validation_split=0.2, shuffle=True)
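With only five samples it helps to see what validation_split=0.2 sets aside. The sketch below assumes the documented Keras behaviour that the validation data is taken from the end of the arrays, before any shuffling; `split_for_validation` is a hypothetical helper, not a Keras function.

```python
import math

def split_for_validation(samples, validation_split):
    """Hypothetical helper: hold out the last fraction of the samples."""
    split_at = math.floor(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]

y = [1, 0, 0, 0, 0]
train_y, val_y = split_for_validation(y, validation_split=0.2)
print(train_y)  # [1, 0, 0, 0]
print(val_y)    # [0]
```

So the network trains on the first four sentences, and the fifth sentence is used only for validation.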

You can check that this indeed learned the word:

In [47]:
import numpy as np
np.round(model.predict(X))

array([[1.],
       [0.],
       [0.],
       [0.],
       [0.]], dtype=float32)
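To make the shapes in this small network concrete, here is a NumPy sketch of the forward pass of a Dense(2) layer followed by a Dense(1) layer with a sigmoid. The weights and the input are random stand-ins (assumptions for illustration), so the numbers themselves are meaningless; only the shapes and the value range matter.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(5, 19)).astype("float32")  # stand-in for X

# Hypothetical, untrained weights with the same shapes as in the model above
W1, b1 = rng.normal(size=(19, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)

hidden = X_demo @ W1 + b1            # Dense(2), no activation -> (5, 2)
output = sigmoid(hidden @ W2 + b2)   # Dense(1, sigmoid)       -> (5, 1)

print(hidden.shape, output.shape)
```

Rounding the sigmoid output, as done with np.round above, turns each probability into a 0/1 decision.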

If the vocabulary is very large the numerical sequences turn into sparse arrays and it’s more efficient to cast everything to a lower dimension with the Embedding layer.

### Embedding

Assume you have a sparse vector [0,1,0,1,1,0,0] of dimension seven. You can turn it into a non-sparse 2d vector like so:

In [54]:
model = Sequential()
# Embedding layer -> Turns positive integers (indexes) into dense vectors of fixed size.
# Example: [[4],[20]] --> [[0.25,0.1],[0.6,-0.2]]
# This Embedding layer --> can only be used as the first layer in a model.

# here
# input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1. ---> 2 -> Because we use binary vector as input.
# output_dim: int >= 0. Dimension of the dense embedding.---> 2 --> Target dimension.

model.add(Embedding(2, 2, input_length=7))
model.compile('rmsprop', 'mse')
model.predict(np.array([[0,1,0,1,1,0,0]]))
 

array([[[-0.01094497, -0.04730806],
        [-0.03950269, -0.01636547],
        [-0.01094497, -0.04730806],
        [-0.03950269, -0.01636547],
        [-0.03950269, -0.01636547],
        [-0.01094497, -0.04730806],
        [-0.01094497, -0.04730806]]], dtype=float32)

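Where do these numbers come from? The layer is a lookup table: each input integer selects a row of the layer's weight matrix, which is randomly initialized and has not been trained here. In the output above, 0 maps to [-0.01094497, -0.04730806] and 1 maps to [-0.03950269, -0.01636547].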
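The lookup can be reproduced with plain NumPy indexing. The sketch below copies the two weight rows from the output above into an array (in a real model you would read them from the layer) and indexes it with the input sequence.

```python
import numpy as np

# The two rows visible in the output above: row 0 for input 0, row 1 for input 1
weights = np.array([[-0.01094497, -0.04730806],
                    [-0.03950269, -0.01636547]], dtype="float32")

inputs = np.array([[0, 1, 0, 1, 1, 0, 0]])

# Integer-array indexing replaces every integer by the matching weight row
embedded = weights[inputs]
print(embedded.shape)  # (1, 7, 2)
print(embedded)
```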

If you feed X through an embedding of dimension 10, the output of the embedding layer has shape (5, 19, 10). This works well with LSTM or GRU (see below), but if you want a binary classifier on top of a Dense layer you need to flatten this to (5, 19*10):

In [64]:
model = Sequential()
model.add(Embedding(3, 10, input_length= X.shape[1] ))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
model.fit(X, y=y, batch_size=200, epochs=700, verbose=0, validation_split=0.2, shuffle=True)
 

<keras.callbacks.History at 0x243f0c8be48>

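It detects ‘shining’ in the first sentence. Note, however, that the last sentence also scores close to 1 although its label is 0; with validation_split=0.2 that sentence is held out for validation and never trained on: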

In [65]:
model.predict(X)

array([[1.0000000e+00],
       [9.7526389e-08],
       [9.8490979e-08],
       [2.2081014e-08],
       [9.9351764e-01]], dtype=float32)
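The shape bookkeeping of the Flatten step can be checked with NumPy alone. This is a sketch with a dummy array standing in for the embedding output; it only illustrates how (5, 19, 10) becomes (5, 190).

```python
import numpy as np

batch, steps, dim = 5, 19, 10

# Dummy stand-in for the embedding output, filled with distinct values
embedded = np.arange(batch * steps * dim, dtype="float32").reshape(batch, steps, dim)

# Flatten keeps the batch axis and concatenates the per-position vectors
flattened = embedded.reshape(batch, steps * dim)
print(flattened.shape)  # (5, 190)
```

The first 10 entries of each flattened row are the vector for position 0, the next 10 the vector for position 1, and so on.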

An LSTM layer processes its input step by step and keeps an internal memory, so it can consume the 3D output of the embedding directly; there is no need to flatten things:



In [67]:
model = Sequential()
 
model.add(Embedding(vocab_size, 10))
model.add(LSTM(5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
model.fit(X, y=y,epochs=500, verbose=0, validation_split=0.2, shuffle=True)

<keras.callbacks.History at 0x243f4623198>

It also separates the first sentence from the rest:

In [70]:
model.predict(X)

array([[0.98991185],
       [0.00538749],
       [0.0053875 ],
       [0.00538749],
       [0.01543748]], dtype=float32)

### Using word2vec

<b> 1. Embedding class:</b>
- It maps discrete labels (i.e. words) into a continuous vector space.
- On its own, it does not take the semantic similarity of words into account.

<b> 2. Word2Vec: </b>
- This tool takes a text corpus as input and produces word vectors as output.
- It first creates a vocabulary from the training text data and then learns vector representations of the words.
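What "semantic similarity" means for word vectors can be illustrated with cosine similarity. The three-dimensional vectors below are made up for this sketch (they are not taken from any trained model); with pretrained vectors, related words typically end up with a higher cosine similarity than unrelated ones.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up toy vectors, for illustration only
vectors = {
    "june":   np.array([0.9, 0.1, 0.0]),
    "august": np.array([0.8, 0.2, 0.1]),
    "grey":   np.array([0.0, 0.3, 0.9]),
}

sim_months = cosine_similarity(vectors["june"], vectors["august"])
sim_other = cosine_similarity(vectors["june"], vectors["grey"])
print(sim_months, sim_other)
```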


In [73]:
# Loading the GloVe set is straightforward

embeddings_index = {}  # Maps each word to its vector
glove_data = './data/glove.6B/glove.6B.50d.txt'
with open(glove_data, encoding="utf8") as f:  # the file is closed automatically
    for line in f:
        values = line.split()  # Split the line at whitespace
        word = values[0]  # The first field is the word itself
        value = np.asarray(values[1:], dtype='float32')  # The remaining fields are its vector
        embeddings_index[word] = value  # Add the word and its vector to the dictionary
 
print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 400000 word vectors.


In [81]:
embedding_dimension = 10
word_index = tokenizer.word_index
print(word_index)
print("\nLength of word_index:",len(word_index))

{'is': 1, 'in': 2, 'the': 3, 'sun': 4, 'shining': 5, 'june': 6, 'september': 7, 'grey': 8, 'life': 9, 'beautiful': 10, 'august': 11, 'i': 12, 'like': 13, 'it': 14, 'this': 15, 'and': 16, 'other': 17, 'things': 18}

Length of word_index: 18


The embedding_matrix maps word indices to vectors in the specified embedding dimension (here 10):

In [88]:
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dimension)) 
# The embedding matrix is default filled with zero in the dimension of (19,10)

[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]


In [None]:
# If a word from word_index is present in embeddings_index (the GloVe dictionary), copy its vector into the embedding matrix
for word, i in word_index.items():  # items() yields the dict's (key, value) pairs
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in the embedding index stay all-zeros.
        embedding_matrix[i] = embedding_vector[:embedding_dimension]  # keep only the first 10 components

Now you have an embedding matrix of 19 words into dimension 10:

In [83]:
embedding_matrix.shape

(19, 10)

The embedding matrix has 19 rows and 10 columns

In [85]:
# First 5 rows of the embedding matrix (row 0 belongs to no word and stays all-zeros)
for i in range(0,5):
    print(embedding_matrix[i])

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0.61849999  0.64253998 -0.46551999  0.3757      0.74838001  0.53738999
  0.0022239  -0.60576999  0.26407999  0.11703   ]
[ 0.33041999  0.24995001 -0.60873997  0.10923     0.036372    0.15099999
 -0.55083001 -0.074239   -0.092307   -0.32821   ]
[ 0.41800001  0.24968    -0.41242     0.1217      0.34527001 -0.044457
 -0.49687999 -0.17862    -0.00066023 -0.6566    ]
[-0.37233001  1.66799998  0.65616     1.28429997 -0.070946   -1.61530006
 -0.28654    -0.67316997 -0.49886999  0.10531   ]


In [91]:
embedding_layer = Embedding(embedding_matrix.shape[0],
                            embedding_matrix.shape[1],
                            weights=[embedding_matrix],
                            input_length=12) # 19,10,embedding_matrix,input_length

In order to use this new embedding you need to convert the training data X from the binary matrix back to the basic word-to-index sequences, padded to a common length:

In [89]:
from keras.preprocessing.sequence import pad_sequences
X = tokenizer.texts_to_sequences(texts)
X = pad_sequences(X, maxlen=12)

We have used a fixed length of 12 here, but any length that is at least as long as the longest sentence works, as long as it matches the input_length of the embedding layer. Now the integer sequences of word indices are mapped to a 10-dimensional vector space using the pretrained GloVe vectors and we're good to go:

In [92]:
model = Sequential()
model.add(embedding_layer)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.layers[0].trainable=False
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
model.fit(X, y=y, batch_size=20, epochs=700, verbose=0, validation_split=0.2, shuffle=True)

<keras.callbacks.History at 0x243fadcfb00>

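In this recorded run the prediction is not sharp: every sentence gets the same score of about 0.33 (see the output below). The cell numbers suggest that the embedding matrix was reset to zeros (In [88]) after it had been filled, and that the embedding layer (In [91]) was then built from the empty matrix; rerunning the cells in order should avoid this. I know all of the above networks are overkill for the simple datasets but the intention was to show you the way to use the various NLP functionalities.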

In [93]:
model.predict(X)

array([[0.33196414],
       [0.33196414],
       [0.33196414],
       [0.33196414],
       [0.33196414]], dtype=float32)
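Identical predictions for every input are what you would see if the frozen embedding contained only zeros, because the Dense layer then receives the same all-zero input each time and returns the sigmoid of its bias. The NumPy sketch below illustrates that mechanism under this assumption; the input sequences, the Dense weights, and the bias of -0.7 are hypothetical values chosen for illustration, not values read from the model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

embedding_matrix = np.zeros((19, 10))         # assumption: never filled with vectors
X_demo = rng.integers(0, 19, size=(5, 12))    # stand-in for the padded sequences

# Hypothetical Dense(1) parameters
W = rng.normal(size=(12 * 10, 1))
b = -0.7

flattened = embedding_matrix[X_demo].reshape(len(X_demo), -1)  # (5, 120), all zeros
predictions = sigmoid(flattened @ W + b)

# Every row equals sigmoid(b), whatever the input sequence was
print(predictions.ravel())
```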

In [96]:
X.shape

(5, 12)
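The (5, 12) shape comes from padding every index sequence to the same length. The sketch below mimics the default behaviour of pad_sequences in plain Python (zeros are added at the front, and sequences that are too long lose their first entries); `pad_left` is a hypothetical helper, not the Keras function.

```python
def pad_left(sequences, maxlen, value=0):
    """Hypothetical helper: left-pad (and left-truncate) to a fixed length."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                        # keep the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

# Index sequences of the five training sentences
sequences = [[3, 4, 1, 5, 2, 6], [7, 1, 8], [9, 1, 10, 2, 11],
             [12, 13, 14], [15, 16, 17, 18]]

padded = pad_left(sequences, maxlen=12)
for row in padded:
    print(row)
```

Index 0 is never assigned to a word, which is why it is safe to use as the padding value.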