`Tokenizer` is a class that you instantiate and fit, much like scikit-learn's random forest estimator.<br>
First create an object (say, `tokenizer`).<br>
If the corpus is a list of texts, call `fit_on_texts`; each call updates the internal vocabulary, extending any vocabulary built by earlier calls.<br>
To convert texts into lists of word indices, call `texts_to_sequences`.<br>
To convert texts into an encoded matrix with one row per text (binary by default), call `texts_to_matrix`.<br>
To see the word-to-index mapping, read the `word_index` attribute.
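
The steps above can be sketched in plain Python to show what the tokenizer keeps track of. This is a minimal sketch, not the Keras implementation: `MiniTokenizer` is a hypothetical name, and it assumes that words are lowercased, stripped of punctuation, and indexed from 1 in order of decreasing frequency.

```python
import re
from collections import Counter


class MiniTokenizer:
    """Hypothetical stand-in for illustration; not the Keras Tokenizer."""

    def __init__(self):
        self.word_counts = Counter()   # running word frequencies
        self.word_index = {}           # word -> integer index (starts at 1)

    @staticmethod
    def _split(text):
        # Lowercase and keep only runs of letters, digits and apostrophes.
        return re.findall(r"[a-z0-9']+", text.lower())

    def fit_on_texts(self, texts):
        # Counts accumulate across calls, so a second call extends the vocabulary.
        for text in texts:
            self.word_counts.update(self._split(text))
        # The most frequent word gets index 1; ties keep first-seen order.
        ranked = sorted(self.word_counts.items(), key=lambda kv: -kv[1])
        self.word_index = {word: i + 1 for i, (word, _) in enumerate(ranked)}

    def texts_to_sequences(self, texts):
        # Words that were never seen during fitting are dropped.
        return [[self.word_index[w] for w in self._split(t) if w in self.word_index]
                for t in texts]


tok = MiniTokenizer()
tok.fit_on_texts(['The cat sat on the mat.', 'The dog ate my homework.'])
print(tok.word_index)
print(tok.texts_to_sequences(['The cat sat on the mat.']))
```

Index 0 is never assigned to a word, which is why the number of possible tokens is one more than the largest index.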

Whereas the vectors obtained through one-hot encoding are binary, sparse, and very high-dimensional (one dimension per word in the vocabulary), word embeddings are low-dimensional floating-point vectors (dense vectors).
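
To make the contrast concrete, here is a small sketch in plain Python. The vocabulary size, the embedding dimension and the random values are made-up assumptions; in a real model the dense vectors are learned during training.

```python
import random

vocab_size = 10      # assumed toy vocabulary size
embedding_dim = 3    # assumed embedding dimensionality


def one_hot(index, size):
    # Binary and sparse: a single 1 at position `index`, zeros elsewhere.
    return [1 if i == index else 0 for i in range(size)]


random.seed(0)
# Dense: one short float vector per word (random here, learned in practice).
dense_vectors = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
                 for _ in range(vocab_size)]

word_id = 4
print(one_hot(word_id, vocab_size))   # length 10, a single nonzero entry
print(dense_vectors[word_id])         # length 3, all floats
```

The one-hot vector grows with the vocabulary, while the dense vector keeps the same small length no matter how many words there are.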

In [1]:
from keras.preprocessing.text import Tokenizer

In [25]:
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
samples1 = ['I am the best man']

In [21]:
seq_samples = [[1,2,3], [4,1,3]]
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_sequences(seq_samples)
seq = tokenizer.sequences_to_matrix(seq_samples)
seq[0]

array([ 0.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0
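
The row printed above has ones at positions 1, 2 and 3 because those indices occur in the first sequence. A rough pure-Python equivalent of the default binary mode, written as a sketch rather than the library code (`sequences_to_binary_matrix` is a hypothetical helper), looks like this:

```python
def sequences_to_binary_matrix(sequences, num_words):
    # One row per sequence, one column per possible word index.
    matrix = [[0.0] * num_words for _ in sequences]
    for row, seq in zip(matrix, sequences):
        for index in seq:
            if index < num_words:   # indices beyond the vocabulary limit are ignored
                row[index] = 1.0    # marks presence only; order and counts are lost
    return matrix


rows = sequences_to_binary_matrix([[1, 2, 3], [4, 1, 3]], num_words=1000)
print(rows[0][:6])
print(rows[1][:6])
```

Column 0 stays empty in both rows because word indices start at 1.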

In [26]:
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(samples)
sequences = tokenizer.texts_to_sequences(samples)
tokenizer.sequences_to_matrix(sequences)
tokenizer.word_index

{'ate': 7,
 'cat': 2,
 'dog': 6,
 'homework': 9,
 'mat': 5,
 'my': 8,
 'on': 4,
 'sat': 3,
 'the': 1}

In [27]:
tokenizer.fit_on_texts(samples1)
tokenizer.word_index

{'am': 11,
 'ate': 7,
 'best': 12,
 'cat': 2,
 'dog': 6,
 'homework': 9,
 'i': 10,
 'man': 13,
 'mat': 5,
 'my': 8,
 'on': 4,
 'sat': 3,
 'the': 1}

`Embedding` takes at least two arguments: the number of possible tokens (1 + maximum word index) and the dimensionality of the embeddings.<br>
It takes a 2D tensor of integers as input, of shape (`sample size`, `sequence_length`).<br>
It outputs a 3D floating-point tensor of shape (`sample size`, `sequence_length`, `embedding dimensionality`).
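
Conceptually the layer is a lookup table indexed by token id. The sketch below uses plain Python lists and made-up sizes to show how a 2D batch of integers becomes a 3D batch of vectors; `embed` is a hypothetical helper, and the table values are random stand-ins for learned weights.

```python
import random

random.seed(0)
num_tokens = 6        # 1 + maximum word index (assumed toy value)
embedding_dim = 4     # assumed embedding dimensionality

# One row of floats per token id.
table = [[random.random() for _ in range(embedding_dim)] for _ in range(num_tokens)]


def embed(batch):
    # Replace every integer id with its row from the table.
    return [[table[token] for token in sequence] for sequence in batch]


batch = [[1, 2, 3], [4, 1, 5]]   # shape (sample size = 2, sequence_length = 3)
out = embed(batch)
print(len(out), len(out[0]), len(out[0][0]))   # 2 3 4
```

The same token id always maps to the same vector, wherever it appears in the batch.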

In [35]:
from keras.datasets import imdb
from keras import preprocessing

max_features = 10000
maxlen = 20

(x_train, y_train), (x_test, y_test) = imdb.load_data(
    num_words=max_features)

In [37]:
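# Keep only the last `maxlen` tokens of each review: longer reviews are truncated
# from the beginning and shorter ones are padded with zeros at the beginning.
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)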
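
With its default settings, `pad_sequences` both truncates and pads at the start of each sequence. The following pure-Python sketch mimics that default behaviour; `pad_pre` is a hypothetical helper written for illustration, not the Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                   # keep the last `maxlen` items
        fill = [value] * (maxlen - len(seq))        # left padding for short sequences
        padded.append(fill + seq)
    return padded


print(pad_pre([[1, 2, 3, 4, 5], [6, 7]], maxlen=3))
# [[3, 4, 5], [0, 6, 7]]
```

Every row ends up with the same length, which is what lets the batch be stored as a single 2D integer tensor.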

The first layer in a `Sequential` model (and only the first) needs to be told its input shape.<br>
Depending on the layer type, this can be done with one of three arguments: `input_shape`, `batch_input_shape`, or `input_dim` (together with `input_length` for some temporal layers).<br>
[Check this for detail](https://faroit.github.io/keras-docs/1.0.1/getting-started/sequential-model-guide/)

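`Dense` and `Embedding` take 2D input, of shape (`batch_size`, `input_dim`) for `Dense` and (`batch_size`, `sequence_length`) for `Embedding`, while `LSTM` takes 3D input of shape (`batch_size`, `input_length`, `input_dim`). <br>

The `Flatten` layer does not affect the batch size (the first dimension); it merges all remaining dimensions into one.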
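
As a small illustration of that behaviour, the sketch below flattens a nested list per sample. `flatten_per_sample` is a hypothetical helper, and the toy batch of shape (2, 3, 2) is an assumption chosen to keep the output readable.

```python
def flatten_per_sample(batch):
    # Keep the first (batch) dimension; merge the remaining two into one.
    return [[x for step in sample for x in step] for sample in batch]


batch = [[[1, 2], [3, 4], [5, 6]],       # sample 0: 3 steps x 2 features
         [[7, 8], [9, 10], [11, 12]]]    # sample 1: 3 steps x 2 features
flat = flatten_per_sample(batch)
print(flat)
# [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]
```

The shape goes from (2, 3, 2) to (2, 6): still two samples, each now a single vector.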

In [52]:
from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding

model = Sequential()
model.add(Embedding(max_features, 8, input_length=maxlen))
# Embedding output has shape (sample size, input_length, embedding dimension)

model.add(Flatten())
# Flatten output has shape (sample size, input_length * embedding dimension)

model.add(Dense(1, activation='sigmoid'))

In [53]:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_4 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten_1 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 161       
=================================================================
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
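
The parameter counts in the summary can be checked by hand. The arithmetic below uses only the values already set in this notebook (`max_features`, `maxlen`, and the embedding size of 8); the intermediate variable names are introduced here for illustration.

```python
max_features = 10000   # vocabulary size passed to Embedding
embedding_dim = 8      # second argument passed to Embedding
maxlen = 20            # input_length

embedding_params = max_features * embedding_dim   # one 8-dimensional vector per token
flatten_units = maxlen * embedding_dim            # 20 * 8 values per sample, no weights
dense_params = flatten_units * 1 + 1              # one weight per input plus one bias

print(embedding_params, flatten_units, dense_params, embedding_params + dense_params)
# 80000 160 161 80161
```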


In [55]:
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
