<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Word-embeddings" data-toc-modified-id="Word-embeddings-0.0.1"><span class="toc-item-num">0.0.1&nbsp;&nbsp;</span>Word embeddings</a></span></li></ul></li></ul></li><li><span><a href="#The-Embedding-layer" data-toc-modified-id="The-Embedding-layer-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>The Embedding layer</a></span></li></ul></div>

To perform machine learning on text documents, the raw text cannot be fed directly to the algorithms, because they expect numerical feature vectors. We therefore need to turn the text content into numerical feature vectors.

From the [scikit-learn documentation](https://scikit-learn.org/stable/modules/feature_extraction.html):
<b>
We call vectorization the general process of turning a collection of text documents into numerical feature vectors.
</b>
In other words, vectorizing text is the process of transforming text into numeric tensors.

Vectorization can be done in multiple ways:
- Segment text into words, and transform each word into a vector.
- Segment text into characters, and transform each character into a vector.
- Extract n-grams of words or characters, and transform each n-gram into a vector.

N-grams are overlapping groups of multiple consecutive words or characters.
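
The three segmentation schemes above can be sketched in plain Python. This is a minimal illustration that assumes whitespace splitting is good enough for word tokens; the helper names `word_tokens`, `char_tokens`, and `ngrams` are hypothetical and not part of any library.

```python
def word_tokens(text):
    """Split text into lowercase word tokens on whitespace."""
    return text.lower().split()


def char_tokens(text):
    """Split text into single-character tokens."""
    return list(text)


def ngrams(tokens, n):
    """Return the overlapping n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "The cat sat on the mat"
words = word_tokens(sentence)
print(words)                          # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(ngrams(words, 2))               # word bigrams, starting with ('the', 'cat')
print(ngrams(char_tokens("cat"), 2))  # [('c', 'a'), ('a', 't')]
```

Each token (word, character, or n-gram) would then be mapped to a vector in a later step.
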

<b>Tokenization:</b> the segmentation of text into words or characters. The resulting units are called tokens.

***Text-vectorization processes consist of applying some tokenization scheme to the text then associating numeric vectors with the generated tokens.***

Besides one-hot encoding, a popular and powerful way to associate a vector with a word is the use of dense word vectors, also called word embeddings. Unlike one-hot vectors, word embeddings are learned from data.


<P>Consider the sentence "The cat sat on the mat". The vocabulary (or unique words) in this sentence is (cat, mat, on, sat, the). </P>
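
For comparison with the dense embeddings discussed next, the one-hot encoding of this sentence can be sketched as follows. It is a minimal illustration that assumes lowercase whitespace tokenization; the names `word_to_index` and `one_hot` are hypothetical.

```python
sentence = "The cat sat on the mat"
tokens = sentence.lower().split()

# Sorted unique words: ['cat', 'mat', 'on', 'sat', 'the']
vocabulary = sorted(set(tokens))
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector


encoded = [one_hot(word) for word in tokens]
print(vocabulary)
print(one_hot("cat"))  # [1, 0, 0, 0, 0]
print(one_hot("the"))  # [0, 0, 0, 0, 1]
```

Every vector is as long as the vocabulary and contains a single non-zero entry, so one-hot vectors become very sparse for realistic vocabularies.
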
From the <a href="https://www.tensorflow.org/tutorials/text/word_embeddings">TensorFlow documentation on word embeddings</a>:

#### Word embeddings

> Word embeddings give us a way to use an efficient, dense representation in which similar words have a similar encoding. Importantly, you do not have to specify this encoding by hand. An embedding is a dense vector of floating point values (the length of the vector is a parameter you specify). Instead of specifying the values for the embedding manually, they are trainable parameters (weights learned by the model during training, in the same way a model learns weights for a dense layer). It is common to see word embeddings that are 8-dimensional (for small datasets), up to 1024-dimensions when working with large datasets. A higher dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.
<img src="images/embedding2.png?raw=1" alt="Diagram of an embedding" width="400"/>
Above is a diagram for a word embedding. Each word is represented as a 4-dimensional vector of floating point values. Another way to think of an embedding is as a "lookup table". After these weights have been learned, you can encode each word by looking up the dense vector it corresponds to in the table.




## The Embedding layer
From the <a href="https://www.tensorflow.org/tutorials/text/word_embeddings">TensorFlow documentation on word embeddings</a>:
> Keras makes it easy to use word embeddings. Take a look at the Embedding layer.
The Embedding layer can be understood as a lookup table that maps from integer indices (which stand for specific words) to dense vectors (their embeddings). The dimensionality (or width) of the embedding is a parameter you can experiment with to see what works well for your problem, much in the same way you would experiment with the number of neurons in a Dense layer.

> When you create an Embedding layer, the weights for the embedding are randomly initialized (just like any other layer). During training, they are gradually adjusted via backpropagation. Once trained, the learned word embeddings will roughly encode similarities between words (as they were learned for the specific problem your model is trained on). If you pass an integer to an embedding layer, the result replaces each integer with the vector from the embedding table:
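
The lookup behaviour can be sketched without Keras. The snippet below is only an illustration in plain Python, not the Keras implementation: the table sizes, the initialization range, and the names `toy_vocab_size`, `toy_dim`, and `embed` are assumptions made for the example, and no training takes place.

```python
import random

toy_vocab_size = 5  # number of distinct integer indices (hypothetical)
toy_dim = 4         # length of each dense vector (hypothetical)

# Randomly initialized table: one row of floats per integer index.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-0.05, 0.05) for _ in range(toy_dim)]
    for _ in range(toy_vocab_size)
]


def embed(indices):
    """Replace each integer index with its row of the embedding table."""
    return [embedding_table[i] for i in indices]


result = embed([1, 2, 3])
print(len(result), len(result[0]))  # 3 4
```

In a real `Embedding` layer the rows of the table are trainable weights that are updated by backpropagation, which this sketch leaves out.
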

In [1]:
import os

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Embedding, Flatten, Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

In [2]:

data_dir=os.path.dirname('./data/aclImdb/')
os.listdir(data_dir)

['imdb.vocab', 'imdbEr.txt', 'README', 'test', 'train']

In [3]:
def load_data(is_train=True):
    """Read the reviews of the train or test split and return (texts, labels)."""
    texts = []
    labels = []
    split_dir = os.path.join(data_dir, 'train' if is_train else 'test')
    for label in ['neg', 'pos']:
        dir_name = os.path.join(split_dir, label)
        for fname in os.listdir(dir_name):
            # the with-block closes each file after reading it
            with open(os.path.join(dir_name, fname), 'r', encoding='utf-8') as f:
                # lowercase the review and drop newlines
                texts.append(f.read().lower().replace('\n', ''))
            # label 1 for positive reviews, 0 for negative reviews
            labels.append(1 if label == 'pos' else 0)
    return texts, labels
        


In [4]:
train_texts,labels =load_data()

In [5]:
print(labels[0:20])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [6]:
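# vocabulary size: only word indices below 15000 (the most frequent words) are kept
max_features=15000
# size of each word vector learned by the Embedding layer
embedding_dim=16
tokenizer=Tokenizer(num_words=max_features)
# build the word index from the training texts
tokenizer.fit_on_texts(train_texts)
def preprocess_text(text):
    # convert each text into a list of integer word indices
    sequence=tokenizer.texts_to_sequences(text)
    return sequence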

In [7]:
sequence=preprocess_text(train_texts)

In [8]:
print(sequence[0])

[62, 4, 3, 129, 34, 44, 7576, 1414, 15, 3, 4252, 514, 43, 16, 3, 633, 133, 12, 6, 3, 1301, 459, 4, 1751, 209, 3, 10785, 7693, 308, 6, 676, 80, 32, 2137, 1110, 3008, 31, 1, 929, 4, 42, 5120, 469, 9, 2665, 1751, 1, 223, 55, 16, 54, 828, 1318, 847, 228, 9, 40, 96, 122, 1484, 57, 145, 36, 1, 996, 141, 27, 676, 122, 1, 13886, 411, 59, 94, 2278, 303, 772, 5, 3, 837, 11037, 20, 3, 1755, 646, 42, 125, 71, 22, 235, 101, 16, 46, 49, 624, 31, 702, 84, 702, 378, 3493, 12997, 2, 8422, 67, 27, 107, 3348]
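
The integers above are frequency ranks: the Keras `Tokenizer` gives the most frequent word index 1, reserves index 0 (used later for padding), and with `num_words` set keeps only indices below that value. The following plain-Python sketch mimics this idea under simplifying assumptions (whitespace splitting, no punctuation filtering); `fit_word_index` and `to_sequences` are hypothetical helpers, not Keras functions.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


toy_texts = ["the cat sat on the mat", "the dog sat"]
toy_index = fit_word_index(toy_texts)
print(toy_index)
print(to_sequences(toy_texts, toy_index, num_words=3))  # [[1, 2, 1], [1, 2]]
```

Words that fall outside the kept vocabulary are silently dropped from the sequences, which is why a text decoded from its sequence can be missing rare words.
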


In [9]:
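def sent_from_seq(seq):
    # map indices back to words; skip indices without a word (e.g. 0, used for padding)
    words=[tokenizer.index_word[i] for i in seq if i in tokenizer.index_word]
    return ' '.join(words)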
        

In [10]:
sent=sent_from_seq(sequence[0])
sent

"story of a man who has unnatural feelings for a pig starts out with a opening scene that is a terrific example of absurd comedy a formal orchestra audience is turned into an insane violent mob by the crazy of it's singers unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting even those from the era should be turned off the cryptic dialogue would make shakespeare seem easy to a third grader on a technical level it's better than you might think with some good cinematography by future great future stars sally kirkland and forrest can be seen briefly"

<b>Number of tokens in each sequence.</b> The mean length is used as the padded sequence length, so reviews longer than the mean are truncated.

In [11]:
num_tokens=[len(tokens) for tokens in sequence]
max_tokens=int(np.mean(num_tokens))
max_tokens

228

In [12]:
train_sequence_pad=pad_sequences(sequence,maxlen=max_tokens)
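
With its default arguments (`padding='pre'`, `truncating='pre'`), `pad_sequences` adds zeros at the front of short sequences and drops tokens from the front of long ones. The sketch below imitates that behaviour for a single sequence in plain Python; `pad_or_truncate` is a hypothetical helper, not part of Keras.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` items of long ones."""
    if len(seq) >= maxlen:
        return seq[len(seq) - maxlen:]
    return [value] * (maxlen - len(seq)) + seq


print(pad_or_truncate([5, 8], 4))              # [0, 0, 5, 8]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```

Because `max_tokens` is the mean review length here, many of the training reviews lose their opening words in this step.
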

In [13]:
model=Sequential()
model.add(Embedding(input_dim=max_features,output_dim=embedding_dim,input_length=max_tokens))
model.add(Flatten())
model.add(Dense(34,activation='relu'))
model.add(Dropout(0.07))
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 228, 16)           240000    
_________________________________________________________________
flatten (Flatten)            (None, 3648)              0         
_________________________________________________________________
dense (Dense)                (None, 34)                124066    
_________________________________________________________________
dropout (Dropout)            (None, 34)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 35        
Total params: 364,101
Trainable params: 364,101
Non-trainable params: 0
_________________________________________________________________
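
The parameter counts in the summary can be checked by hand. The arithmetic below is a short worked example that uses the values from this notebook; `hidden_units` is just a name introduced here for the 34 units of the first Dense layer.

```python
max_features = 15000  # vocabulary size (Embedding input_dim)
embedding_dim = 16    # Embedding output_dim
max_tokens = 228      # padded sequence length
hidden_units = 34     # units in the first Dense layer

# Embedding: one vector of length embedding_dim per vocabulary index.
embedding_params = max_features * embedding_dim

# Flatten: (228, 16) becomes a single vector, with no parameters.
flattened_size = max_tokens * embedding_dim

# Dense layers: one weight per input per unit, plus one bias per unit.
dense_params = flattened_size * hidden_units + hidden_units
output_params = hidden_units * 1 + 1

total_params = embedding_params + dense_params + output_params
print(embedding_params, flattened_size, dense_params, output_params, total_params)
# 240000 3648 124066 35 364101
```

About two thirds of the parameters sit in the embedding table, and most of the rest connect the flattened embeddings to the first Dense layer.
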


In [14]:
labels=np.array(labels)

In [15]:
train_sequence_pad.shape,labels.shape

((25000, 228), (25000,))

In [16]:
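# validation_split takes the last 20% of the samples before any shuffling, and
# load_data returns all 'neg' reviews before the 'pos' ones, so shuffle first
shuffle_idx = np.random.permutation(len(labels))
history = model.fit(train_sequence_pad[shuffle_idx], labels[shuffle_idx],epochs=10,batch_size=100,validation_split=0.2)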

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


# Evaluating the trained model on the test data

In [17]:
test_texts,y_test=load_data(is_train=False)  


In [18]:
def preprocess__test_text(text):
    sequence=tokenizer.texts_to_sequences(text)
    test_sequence_pad=pad_sequences(sequence,maxlen=max_tokens)
    return test_sequence_pad

In [19]:
x_text=preprocess__test_text(test_texts)
x_text

array([[   0,    0,    0, ...,   32,  531,    8],
       [   4,  135,    1, ...,  176,  467,  155],
       [   0,    0,    0, ...,    8,    1,  174],
       ...,
       [ 144,  320,    4, ...,   34,  314,   38],
       [   0,    0,    0, ...,   28, 1156, 5894],
       [   0,    0,    0, ...,   58,  104, 3194]])

In [20]:
x_text.shape

(25000, 228)

In [21]:
y_test=np.array(y_test)

In [22]:
y_test.shape

(25000,)

In [23]:
loss,accu=model.evaluate(x_text,y_test)



In [24]:
print("Accuracy: {0:.2%}".format(accu))

Accuracy: 84.75%


In [42]:
text1 = "Not a good movie!"
text2="The movie was great!"
text3="The movie was terrible..."
text4="This is a confused movie."
text5 = "This movie is fantastic! I really like it because it is so good!"
text6 = "This movie really sucks! Can I get my money back please?"
text7='the animation and graphics were very good but The movie was too bad so am confused whether to recommend this movie or not.'
textss = [text1, text2, text3, text4, text5,text6,text7]

In [43]:
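def pred(texts):
    """Label each text as a positive or negative review."""
    # use the texts passed in rather than the global list
    seq=preprocess__test_text(texts)
    # one sigmoid probability per text
    probs=model.predict(seq).ravel()
    result=[text+': ==== >positive review' if p>0.5 else text + ': ===>negative review'  for text,p in zip(texts,probs)]
    return result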

In [44]:
pred(textss)

['Not a good movie!: ===>negative review',
 'The movie was great!: ===>negative review',
 'The movie was terrible...: ===>negative review',
 'This is a confused movie.: ===>negative review',
 'This movie is fantastic! I really like it because it is so good!: ==== >positive review',
 'This movie really sucks! Can I get my money back please?: ===>negative review',
 'the animation and graphics were very good but The movie was too bad so am confused whether to recommend this movie or not.: ===>negative review']