<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#word-embeddings" data-toc-modified-id="word-embeddings-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>word embeddings</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Word-embeddings" data-toc-modified-id="Word-embeddings-1.0.1"><span class="toc-item-num">1.0.1&nbsp;&nbsp;</span>Word embeddings</a></span></li></ul></li></ul></li><li><span><a href="#The-Embedding-layer" data-toc-modified-id="The-Embedding-layer-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>The Embedding layer</a></span></li><li><span><a href="#Text-preprocessing" data-toc-modified-id="Text-preprocessing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Text preprocessing</a></span></li></ul></div>

To perform machine learning on text documents, the raw text cannot be fed directly to the algorithms, because they expect numerical feature vectors. The text content must first be turned into numerical feature vectors.

From the [scikit-learn documentation](https://scikit-learn.org/stable/modules/feature_extraction.html):
<b>
We call vectorization the general process of turning a collection of text documents into numerical feature vectors.
</b>
Vectorizing text is therefore the process of transforming text into numeric tensors.

Vectorization can be done in multiple ways:
- Segment text into words, and transform each word into a vector.
- Segment text into characters, and transform each character into a vector.
- Extract n-grams of words or characters, and transform each n-gram into a vector.
N-grams are overlapping groups of multiple consecutive words or characters.
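The three options above can be sketched in plain Python. This is only an illustration: the helper names `word_tokens`, `char_tokens`, and `ngrams` are made up for this example, and splitting on whitespace is a simplifying assumption.

```python
def word_tokens(text):
    """Split text into lowercase word tokens on whitespace."""
    return text.lower().split()

def char_tokens(text):
    """Split text into lowercase character tokens."""
    return list(text.lower())

def ngrams(tokens, n):
    """Return all overlapping groups of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The cat sat on the mat"
words = word_tokens(sentence)
bigrams = ngrams(words, 2)

print(words)               # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(char_tokens("cat"))  # ['c', 'a', 't']
print(bigrams[:2])         # [('the', 'cat'), ('cat', 'sat')]
```

The same `ngrams` helper works on character tokens, since it only needs a list of tokens.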

<b>Tokenization: </b> the segmentation of text into words or characters.

***Text-vectorization processes consist of applying some tokenization scheme to the text then associating numeric vectors with the generated tokens.***
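As a minimal sketch of this two-step process, the snippet below tokenizes a sentence on whitespace and then associates each token with a one-hot vector. The sorted vocabulary and the `one_hot` helper are assumptions made for the illustration, not part of any library.

```python
sentence = "The cat sat on the mat"

# Step 1: tokenization (lowercase, split on whitespace).
tokens = sentence.lower().split()

# Step 2: associate a numeric vector with each token.
# The vocabulary is the sorted set of unique tokens.
vocab = sorted(set(tokens))
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

vectors = [one_hot(t) for t in tokens]

print(vocab)           # ['cat', 'mat', 'on', 'sat', 'the']
print(one_hot("cat"))  # [1, 0, 0, 0, 0]
```

Each vector is as long as the vocabulary and contains a single non-zero entry, which is why one-hot vectors become very sparse for large vocabularies.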

## word embeddings

Another popular and powerful way to associate a vector with a word is the use of dense
word vectors, also called word embeddings. Unlike one-hot encodings, which are sparse and fixed by the vocabulary, word embeddings are learned from data.


<P>Consider the sentence "The cat sat on the mat". The vocabulary (or unique words) in this sentence is (cat, mat, on, sat, the). </P>
From <a href="https://www.tensorflow.org/tutorials/text/word_embeddings">TensorFlow documentation: Word embeddings</a>:

#### Word embeddings

> Word embeddings give us a way to use an efficient, dense representation in which similar words have a similar encoding. Importantly, you do not have to specify this encoding by hand. An embedding is a dense vector of floating point values (the length of the vector is a parameter you specify). Instead of specifying the values for the embedding manually, they are trainable parameters (weights learned by the model during training, in the same way a model learns weights for a dense layer). It is common to see word embeddings that are 8-dimensional (for small datasets), up to 1024-dimensions when working with large datasets. A higher dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.
<img src="images/embedding2.png?raw=1" alt="Diagram of an embedding" width="400"/>
Above is a diagram for a word embedding. Each word is represented as a 4-dimensional vector of floating point values. Another way to think of an embedding is as a "lookup table". After these weights have been learned, you can encode each word by looking up the dense vector it corresponds to in the table.
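The lookup-table view can be sketched without any deep learning library. In the snippet below the table rows are random placeholders standing in for learned weights, and the names `table`, `word_to_index`, and `embed` are chosen for this illustration only.

```python
import random

random.seed(0)  # only to make the sketch repeatable

vocab = ["cat", "mat", "on", "sat", "the"]
embedding_dim = 4  # matches the 4-dimensional vectors in the diagram

# One row of small random floats per word. In a real model these rows
# are trainable weights, not fixed random numbers.
table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
         for _ in vocab]
word_to_index = {word: i for i, word in enumerate(vocab)}

def embed(word):
    """Look up the dense vector stored for a word."""
    return table[word_to_index[word]]

sentence = ["the", "cat", "sat", "on", "the", "mat"]
encoded = [embed(word) for word in sentence]

print(len(encoded), len(encoded[0]))  # 6 4
```

Both occurrences of "the" map to the same row, so `encoded[0]` and `encoded[4]` are identical.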





## The Embedding layer
From <a href="https://www.tensorflow.org/tutorials/text/word_embeddings">TensorFlow documentation: Word embeddings</a>:
> Keras makes it easy to use word embeddings. Take a look at the Embedding layer.
The Embedding layer can be understood as a lookup table that maps from integer indices (which stand for specific words) to dense vectors (their embeddings). The dimensionality (or width) of the embedding is a parameter you can experiment with to see what works well for your problem, much in the same way you would experiment with the number of neurons in a Dense layer.

> When you create an Embedding layer, the weights for the embedding are randomly initialized (just like any other layer). During training, they are gradually adjusted via backpropagation. Once trained, the learned word embeddings will roughly encode similarities between words (as they were learned for the specific problem your model is trained on). If you pass an integer to an embedding layer, the result replaces each integer with the vector from the embedding table:

In [1]:
from tensorflow.keras.layers import Embedding,Layer,Flatten,Dense,Dropout
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import tensorflow as tf
from tensorflow.keras.models import Sequential
import os,re,string
import numpy as np

In [2]:
tf.keras.__version__

'2.4.0'

In [3]:
tf.__version__

'2.4.1'

For text or sequence problems, the Embedding layer takes as input a 2D tensor of integers of shape (samples, sequence_length) and returns a 3D floating-point tensor of shape (samples, sequence_length, embedding_dimensionality). Because a batch is packed into a single tensor, all of its sequences must have the same length: sequences shorter than the chosen length are padded with zeros, and longer ones are truncated.
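A minimal sketch of the padding and truncation step, in plain Python. The helper `pad_or_truncate` is hypothetical, and padding or cutting at the end of the sequence is an assumption made for this sketch (libraries differ on which end they use by default).

```python
def pad_or_truncate(seq, length, pad_value=0):
    """Cut seq to `length` items, or right-pad it with pad_value."""
    return list(seq[:length]) + [pad_value] * max(0, length - len(seq))

# Three integer-encoded sequences of different lengths.
batch = [[5, 8, 2], [7], [3, 9, 4, 1, 6]]
fixed = [pad_or_truncate(seq, 4) for seq in batch]

print(fixed)  # [[5, 8, 2, 0], [7, 0, 0, 0], [3, 9, 4, 1]]
```

After this step the batch has shape (3, 4), so an embedding of width 5 would turn it into a tensor of shape (3, 4, 5).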

In [4]:
embedding_layer=Embedding(10,5)

In [5]:
result = embedding_layer(tf.constant([[1,2],[2,4]]))
result.numpy()

array([[[ 0.02173385, -0.01164292, -0.01864962, -0.04104736,
         -0.03320467],
        [ 0.02778766,  0.01905247,  0.0294116 ,  0.02264091,
         -0.03183525]],

       [[ 0.02778766,  0.01905247,  0.0294116 ,  0.02264091,
         -0.03183525],
        [ 0.00019507,  0.0349418 ,  0.0323015 , -0.03063896,
          0.00229504]]], dtype=float32)

# Loading the IMDB data for use with an Embedding layer

You’ll restrict the movie reviews to the 15,000 most common words and consider only the first 30 words of each review. The network will learn a 16-dimensional embedding for each of the 15,000 words.

In [6]:
!ls  './data/aclImdb/train/'

labeledBow.feat
neg
pos
unsupBow.feat
urls_neg.txt
urls_pos.txt
urls_unsup.txt


In [7]:
dataset_dir=os.path.dirname('./data/aclImdb/')

In [8]:
dataset_dir

'./data/aclImdb'

In [9]:
batch_size=100
seed = 100
train_data=tf.keras.preprocessing.text_dataset_from_directory('./data/aclImdb/train/',
                                                            batch_size=batch_size,
                                                              validation_split=0.25,
                                                          subset='training',
                                                              seed=seed)

val_data=tf.keras.preprocessing.text_dataset_from_directory('./data/aclImdb/train/',
                                                            batch_size=batch_size,
                                                           validation_split=0.25,
                                                            subset='validation', seed=seed)

Found 25000 files belonging to 2 classes.
Using 18750 files for training.
Found 25000 files belonging to 2 classes.
Using 6250 files for validation.


The output above shows 25,000 examples in the training folder, of which 75% (18,750) are used for training and 25% (6,250) are held out as a validation set.
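The split sizes can be checked with a little arithmetic. The sketch below assumes the validation count is the split fraction times the total, rounded down, which reproduces the numbers reported above; the batch counts follow from the batch size of 100 used when creating the datasets.

```python
import math

total_files = 25000
validation_split = 0.25
batch_size = 100

num_val = int(validation_split * total_files)  # 6250
num_train = total_files - num_val              # 18750

# The last batch of each dataset is only partially filled.
train_batches = math.ceil(num_train / batch_size)  # 188
val_batches = math.ceil(num_val / batch_size)      # 63

print(num_train, num_val, train_batches, val_batches)
```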

# names of the labels

In [10]:
for i,label in enumerate(train_data.class_names):
    print('index' ,i,"corresponds to ", label)

index 0 corresponds to  neg
index 1 corresponds to  pos


`cache` keeps the elements in memory after the first pass, and `prefetch` creates a `Dataset` that prefetches elements from this dataset:

In [11]:
AUTOTUNE=tf.data.AUTOTUNE
# Batching was already applied when the dataset was created, so no batch size is set here.
train_data=train_data.cache().prefetch(buffer_size=AUTOTUNE)
#val_data=val_data.cache().prefetch(buffer_size=AUTOTUNE)

In [12]:
for x,y in train_data.take(1):
    for i in range(1):
        print(x[i].numpy(),'\n \n',y[i].numpy())

b"Neither the total disaster the UK critics claimed nor the misunderstood masterpiece its few fanboys insist, Revolver is at the very least an admirable attempt by Guy Ritchie to add a little substance to his conman capers. But then, nothing is more despised than an ambitious film that bites off more than it can chew, especially one using the gangster/con-artist movie framework. As might be expected from Luc Besson's name on the credits as producer, there's a definite element of 'Cinema de look' about it: set in a kind of realistic fantasy world where America and Britain overlap, it looks great, has a couple of superbly edited and conceived action sequences and oozes style, all of which mark it up as a disposable entertainment. But Ritchie clearly wants to do more than simply rehash his own movies for a fast buck, and he's spent a lot of time thinking and reading about life, the universe and everything. If anything its problem is that he's trying to throw in too many influences (a bit 

## Text preprocessing

As noted earlier, the reviews are restricted to the 15,000 most common words, and only the first 30 words of each review are used. The network learns a 16-dimensional embedding for each of the 15,000 words.

In [13]:
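def remove_br(input_data):
    # Lowercase the text, drop '<br />' tags, then strip punctuation.
    lowercase = tf.strings.lower(input_data)
    # Note: replacing with '' joins words that touch the tag; use ' ' to keep them apart.
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', '')
    return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation), '')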
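To see what this standardization does to a single string, here is a plain-Python sketch of the same three steps using only the standard library. The function name `standardize` is made up for this illustration; it is not the TensorFlow function above, only a mirror of its logic.

```python
import re
import string

def standardize(text):
    """Lowercase, remove '<br />' tags, and strip ASCII punctuation."""
    lowered = text.lower()
    no_breaks = lowered.replace("<br />", "")
    return re.sub("[%s]" % re.escape(string.punctuation), "", no_breaks)

print(standardize("Great film! <br />Loved it."))  # great film loved it
```

When a tag has no whitespace around it, removing it joins the neighbouring words: `standardize("one<br />two")` gives `onetwo`.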


In [14]:
max_features=15000
embedding_dim=16
max_length=30
encoder=TextVectorization(max_tokens=max_features,standardize=remove_br,
                          output_mode='int',
                          output_sequence_length=max_length)

In [15]:
encoder.adapt(train_data.map(lambda x,y:x))

The .adapt method sets the layer's vocabulary. Here are the first 70 tokens. After the padding and unknown tokens they're sorted by frequency:

In [16]:
vocab=np.array(encoder.get_vocabulary())
vocab[:70]

array(['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it',
       'this', 'i', 'that', 'was', 'as', 'with', 'for', 'movie', 'but',
       'film', 'on', 'not', 'you', 'are', 'his', 'have', 'he', 'be',
       'one', 'its', 'at', 'all', 'by', 'an', 'they', 'from', 'who',
       'like', 'so', 'her', 'or', 'just', 'about', 'has', 'out', 'if',
       'some', 'what', 'there', 'good', 'more', 'very', 'when', 'she',
       'even', 'no', 'up', 'would', 'my', 'only', 'time', 'which',
       'really', 'story', 'their', 'were', 'had', 'see', 'can', 'me'],
      dtype='<U17')
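The idea behind the vocabulary can be sketched in plain Python: reserve index 0 for padding and index 1 for unknown words, then list the remaining words by descending frequency. The helpers `build_vocabulary` and `encode` are hypothetical and simplified; they are not the layer's actual implementation.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Padding and unknown tokens first, then words by descending count."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(text, vocabulary, length):
    """Map words to indices (1 for unknown), then cut or pad to length."""
    index = {word: i for i, word in enumerate(vocabulary)}
    ids = [index.get(word, 1) for word in text.split()][:length]
    return ids + [0] * (length - len(ids))

texts = ["the movie was good", "the film was bad", "the plot"]
vocabulary = build_vocabulary(texts, max_tokens=4)
encoded = encode("the movie was great", vocabulary, length=6)

print(vocabulary)  # ['', '[UNK]', 'the', 'was']
print(encoded)     # [2, 1, 3, 1, 0, 0]
```

Words outside the vocabulary ("movie" and "great" here) all map to index 1, and the trailing zeros are padding.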

In [17]:
len(encoder.get_vocabulary())==max_features

True

In [18]:
model=Sequential()
model.add(encoder)
model.add(Embedding(input_dim=max_features,output_dim=embedding_dim,input_length=max_length))
model.add(Flatten())
model.add(Dense(14,activation='relu'))
model.add(Dropout(0.05))
model.add(Dense(1,activation='sigmoid'))
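The number of trainable weights in this model can be worked out by hand. The sketch below assumes the usual counts: one row per vocabulary entry for the embedding, and a weight matrix plus one bias per unit for each Dense layer. The text vectorization, Flatten, and Dropout layers contribute no trainable weights.

```python
max_features = 15000
embedding_dim = 16
max_length = 30
hidden_units = 14

# Embedding: one 16-dimensional row per vocabulary entry.
embedding_params = max_features * embedding_dim                # 240000

# Flatten turns (30, 16) into a vector of 480 values.
flattened_size = max_length * embedding_dim                    # 480

# Dense layers: weights plus one bias per unit.
hidden_params = flattened_size * hidden_units + hidden_units   # 6734
output_params = hidden_units * 1 + 1                           # 15

total_params = embedding_params + hidden_params + output_params
print(total_params)  # 246749
```

Almost all of the weights sit in the embedding table, which is typical for small text classifiers.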

In [19]:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
#model.summary()

In [20]:
# train_data is already batched (100 reviews per batch), so batch_size is not passed to fit.
history = model.fit(train_data,epochs=4,validation_data=val_data)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


In [21]:
# predict on a sample text without padding.

sample_text = ('the animation and graphics were very good but The movie was too bad'
               ' so am confused whether to recommend this movie or not.')

predictions = model.predict(np.array([sample_text]))
result='positive review' if predictions>0.5 else 'negative review'
print(result)

negative review


In [22]:
# predict on several sample texts without padding.

text = [
  "The movie was great!",
  "This is a confused movie.",
  "The movie was terrible..."
]
for i in text:
    predictions = model.predict(np.array([i]))
    result='positive review' if predictions>0.5 else 'negative review'
    print(result)

positive review
negative review
negative review
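The decision rule used in the last two cells can be isolated as a small helper. The function `label_review` is hypothetical, and the probabilities below are made-up placeholders, not outputs of the trained model.

```python
def label_review(probability, threshold=0.5):
    """Map a sigmoid output in [0, 1] to a sentiment label."""
    return 'positive review' if probability > threshold else 'negative review'

# Hypothetical probabilities, not outputs of the model trained above.
example_probabilities = [0.91, 0.50, 0.12]
labels = [label_review(p) for p in example_probabilities]

print(labels)  # ['positive review', 'negative review', 'negative review']
```

A probability exactly at the threshold is labelled negative, because the comparison is strict.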
