## Word Embeddings
___
#### Description:

Word embeddings are vector representations of words similar to one-hot encodings, but they are better in that they hold semantic meaning. The cosine distance between two similar words will be small whereas the cosine distance between two very different words will be large. The two most popular methods used to learn word embeddings are Word2Vec and GloVe. The papers to these methods can be found here:

Word2Vec: https://arxiv.org/pdf/1301.3781.pdf

GloVe: https://nlp.stanford.edu/pubs/glove.pdf


Training good word embeddings using Word2Vec or GloVe requires a lot of data. Fortunately there are many pre-trained word embeddings publicly available online. Using transfer learning, we can take these pre-trained word embeddings and use them in our model. In this case I use 50-dimensional GloVe word embeddings. Note that if we don't use pre-trained word embeddings and simply train a Keras Embedding layer from scratch, the embeddings will likely be inferior for two reasons: it is trained on less data, and the embedding is trained with the goal of trying to minimize the loss of the entire network instead of trying to capture semantics of words. 

___
#### Dataset:

The dataset contains 1000 restaurant reviews along with a rating of '0' or '1'. A '1' if the reviewer liked the food and a '0' if they did not. The dataset can be obtained from: https://www.kaggle.com/hj5992/restaurantreviews/data

___
#### Reference:

I use the 50-dimensional word embeddings trained on 6 billion words, which can be downloaded here:
https://nlp.stanford.edu/projects/glove/

Also, this resource was used as a helpful guide: https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

In [1]:
# Import dependencies
import numpy as np
import pandas as pd
import re

In [2]:
# Read the data
df = pd.read_table('Restaurant_Reviews.tsv')

In [3]:
df.head()

Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
Review    1000 non-null object
Liked     1000 non-null int64
dtypes: int64(1), object(1)
memory usage: 15.7+ KB


In [5]:
# Separate data into reviews and sentiments
reviews = df['Review']
sentiments = df['Liked']

In [6]:
# Display a review to get an idea of how to preprocess
reviews[np.random.randint(len(reviews))]

'Although I very much liked the look and sound of this place, the actual experience was a bit disappointing.'

In [7]:
# Read in the GloVe word embeddings as a dictionary
embeddings = {}

with open('glove.6B/glove.6B.50d.txt', 'r', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        embedding = np.array(values[1:], dtype='float32')
        embeddings[word] = embedding

In [8]:
# We will not be using every word embedding
print('Vocab size of original file:', len(embeddings))

Vocab size of original file: 400000


In [9]:
# Define a function to clean a review
def clean_review(review):
    review = review.lower() # Make letters lowercase
    review.replace('-', ' ') # Separate hyphenated words
    review.replace('.', ' ') # Fix phrases like "salt and pepper..and of course"
    words = review.split() # Split review into words for further cleaning
    
    new_words = []
    for word in words:
        word = re.sub('[^a-z]', '', word) # Remove non-alphabetical characters
        if embeddings.get(word) is not None: # Remove words not in GloVe vocab
            new_words.append(word)
        
    review = ' '.join(new_words) # Put words back together to form clean review
    
    return review

In [10]:
# Clean every review
clean_reviews = [clean_review(review) for review in reviews]

In [11]:
# Compare an original to a cleaned review
index = np.random.randint(len(reviews))

print('Original:\n', reviews[index])
print('\nCleaned:\n', clean_review(reviews[index]))

Original:
 If there were zero stars I would give it zero stars.

Cleaned:
 if there were zero stars i would give it zero stars


In [13]:
# Tokenize and integer encode the reviews
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(clean_reviews)

sequences = tokenizer.texts_to_sequences(clean_reviews)

In [14]:
# Display a review in its new representation
print(sequences[index])

[49, 46, 27, 327, 114, 3, 51, 185, 9, 327, 114]


In [15]:
# Choose length of sequence (input)
maxlen = len(max(sequences, key=len))
print('Max length:', maxlen) 

Max length: 32


In [16]:
# Vocab size
vocab_size = len(tokenizer.word_index) + 1 # including 0th index
print('Vocab size:', vocab_size)

Vocab size: 1958


In [17]:
# Pad sequences to max length
from keras.preprocessing.sequence import pad_sequences

sequences = pad_sequences(sequences, maxlen=maxlen, padding='pre')

In [18]:
# Create embedding matrix for our vocab using GloVe embeddings
embedding_matrix = np.zeros((vocab_size, 50))
for word, i in tokenizer.word_index.items():
    embedding_matrix[i] = embeddings.get(word)

In [19]:
"""
The embedding matrix contains the embedding for every word in our 
vocabulary and the zero vector. Each embedding is a vector with 50 
values. We can think of it as a matrix where each row represents a 
word with the exception of the first row which is reserved for the 
zero vector. It is common to see people use vocabulary sizes much
larger but the small dataset used here didn't have that many unique 
words.
"""

print('(vocab size, embedding length) ->', embedding_matrix.shape)

(vocab size, embedding length) -> (1958, 50)


In [20]:
# Get the inputs and outputs ready for training
X = sequences
y = np.array(sentiments)

print('X shape:', X.shape)
print('y shape:', y.shape)

X shape: (1000, 32)
y shape: (1000,)


In [21]:
# Build an RNN
from keras.models import Sequential
from keras.layers import  Embedding, LSTM, Dense

model = Sequential([
    Embedding(vocab_size, 50, weights=[embedding_matrix], input_length=maxlen, trainable=False),
    LSTM(50, dropout=0.2),
    Dense(1, activation='sigmoid')
])

print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 32, 50)            97900     
_________________________________________________________________
lstm_1 (LSTM)                (None, 50)                20200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 51        
Total params: 118,151
Trainable params: 20,251
Non-trainable params: 97,900
_________________________________________________________________
None


In [22]:
# Compile and fit the model to X and y

# Compile
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Fit
model.fit(X, y, epochs=20, validation_split=0.2)

Train on 800 samples, validate on 200 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x2121193feb8>