# Data Science Society Term 2 Workshop 3 - Word Embedding
Welcome to the third workshop in our deep learning series. In this workshop we begin to look at natural language processing, often abbreviated to NLP.

In [None]:
!pip install keras

In [None]:
!pip install theano

In [None]:
import os
os.environ['KERAS_BACKEND'] = "theano"
import keras
from keras.models import Sequential
from keras.preprocessing import sequence
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Flatten, Embedding
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## The Data
As the name implies, natural language processing revolves around processing text data. This task comes with a number of difficulties. For instance, unlike the examples we looked at before, the position of words is often of great importance in language, so this has to be factored in somehow. Context also matters, and teaching a machine to recognise context within a text is no easy task. Today we will look at simple examples.

The data we will consider is text taken from online movie reviews. We shall try to train a model that picks up on whether the text expresses a positive or negative sentiment about the movie it refers to.

We begin by defining a list of strings which will serve as our text data. This is defined below as ````text````. We also define a list called ````labels````, which indicates whether each sentiment is positive or negative. In this case ````1```` indicates a positive sentiment, whereas ````0```` indicates a negative one.

In [None]:
# Text
text = ["Enjoyable",
        "Trash",
        "Absolute banger!",
        "Good",
        "Thrilling",
        "A complete bore",
        "Sleep inducing",
        "A must watch",
        "Avoid at all costs",
        "Exhausting",
        "Never been better!",
        "Too slow",
        "Very interesting watch",
        "Glowing performance",
        "Frustrating",
        "Dull"]

# Sentiment labels
labels = [1,0,1,1,1,0,0,1,0,0,1,0,1,1,0,0]

## Converting the Text
Our text data needs to be converted into a form that a machine can interpret. Most of the data we've dealt with in the past has been numerical, so it makes sense to look for a way to encode the text in numerical form. Fortunately for us, Keras has the functionality we need to do this.

We define a size for our vocabulary. We choose this value so that it is an overestimate of the number of different words in our data; in this case ````50```` is chosen for ````vocab_size````. Next we define an empty list ````encoded_text```` to contain our encoded text. We loop over all of the entries in the ````text```` list and pass each one, along with ````vocab_size````, into the Keras ````one_hot```` function. Despite its name, this function does not produce one-hot vectors: it converts each word in a given string into an integer by hashing, one integer for each word. Because hashing is used, two different words can occasionally be assigned the same integer, which is why ````vocab_size```` is deliberately an overestimate. We enclose the function call in ````[]```` so that each output is added to ````encoded_text```` as a single list. The result is a list of lists of varying lengths, where the length of each inner list is the number of words in the corresponding entry. The results are then printed so we can see what has become of our text.
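
To make the word-to-integer step concrete, here is a minimal pure-Python sketch of hash-based encoding. It is an illustration only and not Keras's own implementation: the helper name ````hash_encode```` and the use of ````hashlib.md5```` are assumptions, chosen so that the output is the same on every run. The sketch also assumes that index ````0```` is kept free for padding.

```python
import hashlib
import string

def hash_encode(phrase, n):
    """Map each word of `phrase` to an integer in 1..n-1 (hypothetical helper)."""
    # Lower-case and strip punctuation so "banger!" and "banger" match
    cleaned = phrase.lower().translate(str.maketrans("", "", string.punctuation))
    codes = []
    for word in cleaned.split():
        # md5 gives a hash value that is stable across runs
        digest = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        # Shift by one so that 0 is never produced (assumed reserved for padding)
        codes.append(digest % (n - 1) + 1)
    return codes

sample = ["Absolute banger!", "A must watch", "Very interesting watch"]
encoded_sample = [hash_encode(p, 50) for p in sample]
print(encoded_sample)
```

The word "watch" receives the same integer in both phrases in which it appears, which is what lets a model treat it as a single vocabulary item. With a small range, unrelated words may also collide, so a generous range makes collisions less likely.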

In [None]:
# Estimated vocab size
vocab_size = 50

# Empty list for the encoded text
encoded_text = []

# Loop over every entry in list
for i in range(len(text)):
    # Converts phrase to set of integers
    encoded_text += [one_hot(text[i], vocab_size)]
print(encoded_text)

Next we ensure all our data is of the same length. We define a ````max_length```` of ````4````, as this is the maximum number of words in any one entry. We call the ````pad_sequences```` function to take ````encoded_text```` and make sure each entry is of length ````max_length````; by default, shorter entries are padded with zeros at the front.
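
The padding step can be sketched in plain NumPy. This is a simplified stand-in that mimics the default behaviour of ````pad_sequences```` (zeros added at the front, over-long sequences cut from the front); the helper name ````pad_pre```` and the toy integer codes are made up for illustration.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) each sequence to exactly `maxlen` entries."""
    out = np.full((len(sequences), maxlen), value, dtype=int)
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # an empty phrase stays as all padding
        trimmed = seq[-maxlen:]  # keep only the last `maxlen` codes
        out[row, -len(trimmed):] = trimmed
    return out

toy = [[7], [3, 12], [5, 9, 21, 4]]
print(pad_pre(toy, 4))
```

Here the one-word phrase ````[7]```` becomes ````[0, 0, 0, 7]````, while the four-word phrase is left unchanged.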

In [None]:
# Maximum length
max_length = 4

# Ensures sequences are all of the same length
padded_text = pad_sequences(encoded_text, maxlen=max_length)
print(padded_text)

## Constructing the Neural Network
As usual, we begin defining the neural network architecture by specifying that our model is of the ````Sequential```` type. The first layer we add is an ````Embedding```` layer. This layer maps each word (that is, each integer code) to a vector in a vector space, and the vectors are learned during training. The first argument is the vocabulary size, which we defined as ````vocab_size````. The second argument specifies the number of dimensions of the vector space in which we embed our words; here we use ````8````. The third argument, ````input_length````, is set to ````max_length````, as this is the length of each row of ````padded_text````.
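
Conceptually, an embedding layer is a lookup table with one row per integer code. The NumPy sketch below illustrates that idea under simplifying assumptions: the random matrix stands in for weights that Keras would learn during training, and the integer codes in ````batch```` are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim, max_length = 50, 8, 4

# One row of `embed_dim` numbers for each possible integer code
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

# Two padded phrases, each a row of integer codes (made-up values)
batch = np.array([[0, 0, 0, 7],
                  [0, 0, 3, 12]])

# The lookup is plain row indexing: every code is replaced by its vector
embedded = embedding_matrix[batch]
print(embedded.shape)  # (2, 4, 8)
```

Each phrase of ````4```` codes becomes a ````4 x 8```` block of numbers, and identical codes (such as the padding value ````0````) always map to the same vector.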

Next, we ````Flatten```` the data as before. Here the only dense layer we include is the output layer, which contains ````1```` neuron and uses the ````"sigmoid"```` activation function. Note: the number of neurons in the output layer only needs to match the number of classes when a ````softmax```` activation function is used; for binary classification a single sigmoid neuron, whose output is read as the probability of the positive class, is enough.
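
As a rough illustration of what the flatten and output layers compute, the following NumPy sketch runs a forward pass on stand-in data. The random arrays are placeholders for the embedding output and for weights that would be learned in practice; none of the values come from the trained model.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

# Stand-in for the embedding output: 2 phrases, 4 words, 8 dimensions
embedded = rng.normal(size=(2, 4, 8))

# Flatten: each phrase becomes one vector of 4 * 8 = 32 numbers
flat = embedded.reshape(len(embedded), -1)

# Dense(1): one weight per input plus a bias (random here, learned in practice)
weights = rng.normal(size=(flat.shape[1], 1))
bias = np.zeros(1)

# One probability of "positive" per phrase
probabilities = sigmoid(flat @ weights + bias)
print(flat.shape, probabilities.shape)
```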

For our training parameters we use the ````"adam"```` optimizer as before. For the loss we use ````"binary_crossentropy"````, which is a suitable loss function for a binary classification problem with only two outcomes. As usual we track ````"accuracy"```` through the ````metrics```` argument. As this is a small dataset, we can afford to train for ````epochs=200````.
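
To see what the loss measures, here is a small NumPy sketch of binary cross-entropy, the mean of ````-(y*log(p) + (1-y)*log(1-p))```` over the samples. The function below is a hand-written illustration rather than the Keras implementation; the clipping constant ````eps```` is an assumption added to avoid taking the logarithm of zero, and the probabilities are made up.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Keep probabilities away from exactly 0 or 1 so the logs stay finite
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    per_sample = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(per_sample))

y_true = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]  # made-up predictions
confident_wrong = [0.1, 0.9, 0.2, 0.8]  # made-up predictions

print(binary_crossentropy(y_true, confident_right))
print(binary_crossentropy(y_true, confident_wrong))
```

The loss is small when the predicted probabilities agree with the labels and grows quickly when the model is confidently wrong, which is the behaviour the optimizer exploits during training.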

In [None]:
# Choose a sequential model
model = Sequential()

# Add an embedding layer
model.add(Embedding(vocab_size, 8, input_length=max_length))

# Flatten data
model.add(Flatten())

# Output layer
model.add(Dense(1,activation="sigmoid"))

# Training parameters
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Model training (labels are converted to a NumPy array, as Keras expects)
model.fit(padded_text, np.array(labels), epochs=200)

## Evaluation
Here we create a test dataset to determine how well our model performs. As before, we encode and pad the phrases in the same way as we did for the training set above.

In [None]:
# Test samples
test_text = ["Fell asleep",
             "Boring",
             "A must see",
             "Avoid this movie",
             "Incredibly dull",
             "Good performance",
             "A bit slow",
             "Really dull"]
# Test labels
test_labels = [0,0,1,0,0,1,0,0]

# List for encoded text data
encoded_test = []

# Loops over test data
for i in range(len(test_text)):
    # Converts phrase to set of integers
    encoded_test += [one_hot(test_text[i],vocab_size)]

# Ensures the sequences are of the same length
padded_test = pad_sequences(encoded_test, maxlen=max_length)
print(padded_test)

Now to test the accuracy of our model!

In [None]:
# Evaluates model performance (labels converted to a NumPy array)
loss, accuracy = model.evaluate(padded_test, np.array(test_labels))
print(accuracy)
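
For a single sigmoid output, accuracy amounts to thresholding each predicted probability at ````0.5```` and counting how often the result matches the label. The sketch below illustrates this with made-up probabilities; the helper name ````accuracy_from_probabilities```` is hypothetical and the numbers are not outputs of the model trained above.

```python
import numpy as np

def accuracy_from_probabilities(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the label."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(y_pred == y_true))

example_labels = [0, 0, 1, 0, 0, 1, 0, 0]
made_up_probs = [0.2, 0.6, 0.9, 0.3, 0.1, 0.4, 0.2, 0.45]

print(accuracy_from_probabilities(example_labels, made_up_probs))  # 0.75
```

With so few training phrases, and with test words the model never saw during training, the accuracy reported for the real model may vary widely between runs.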