# LSTM Using TensorFlow and Keras

This is an optional notebook that shows you how to do sentiment analysis and introduces the TensorFlow framework. There will be no video for this notebook, as it is not the main notebook for this week. As always, feel free to post any questions to the discussion board.

#### Sentiment analysis on movie reviews

In this lab, we will solve a sequence classification problem by building an RNN with LSTM using Keras and TensorFlow.

For this example, we will use the IMDB dataset in Keras which contains a set of movie reviews with their associated sentiment. Sentiments could be either positive or negative. Thus, the problem we are solving is a binary classification problem.

The IMDB dataset consists of 25000 movie reviews for training and 25000 for testing. The problem is to predict whether a review has a positive or negative sentiment.

Before we build our model, we need to import the necessary libraries to perform our classification.

In [15]:
# !pip install tensorflow==2.0.0-beta1

In [16]:
# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras

# Import NumPy
import numpy as np

Keras contains a version of the IMDB dataset, which we will use.

In [17]:
data = tf.keras.datasets.imdb

Load the training and testing splits, keeping only the 10,000 most frequent words.

In [18]:
num_of_words=10000
(x_train, y_train), (x_test, y_test) = data.load_data(num_words=num_of_words)

Check the data

In [19]:
print(x_train[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


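Weird review, huh? Each review is encoded as a sequence of word indexes, where words are ranked by overall frequency in the dataset: the lower the index, the more frequent the word. The lowest indexes are reserved for special tokens such as padding, start-of-sequence, and unknown words, so the actual word indexes are offset accordingly.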

In [20]:
# A dictionary mapping words to an integer index
word_index = tf.keras.datasets.imdb.get_word_index()

# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

We now decode the training data using the code above.

In [21]:
decode_review(x_train[0])

"<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for wh
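To make the index offset concrete, here is a minimal sketch of the same encode/decode scheme on a toy vocabulary. The vocabulary `toy_word_index` and the helper `encode` are hypothetical and only for illustration; the real mapping comes from `get_word_index()`.

```python
# Toy frequency-ranked vocabulary (hypothetical; rank 1 = most frequent word).
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}

# Shift every rank by 3 to make room for the reserved tokens.
shifted = {word: rank + 3 for word, rank in toy_word_index.items()}
shifted.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})

# Invert the mapping so integers can be turned back into words.
reverse = {index: word for word, index in shifted.items()}

def encode(text):
    """Encode a whitespace-separated string; unknown words become <UNK>."""
    tokens = [shifted.get(word, shifted["<UNK>"]) for word in text.split()]
    return [shifted["<START>"]] + tokens

def decode(indices):
    """Decode a list of integers back into a string."""
    return " ".join(reverse.get(i, "?") for i in indices)

encoded = encode("the film was truly brilliant")
print(encoded)          # [1, 4, 5, 6, 2, 7]
print(decode(encoded))  # <START> the film was <UNK> brilliant
```

The word "truly" is not in the toy vocabulary, so it is mapped to `<UNK>`, which is what happens to words outside the top `num_of_words` in the real dataset.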

We will alter the input sequences so that they all have the same length.

In [22]:
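# Pad or truncate every review to 400 tokens. By default, pad_sequences pads and
# truncates at the start, so the last 400 tokens of a longer review are kept.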
max_review_length = 400
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=max_review_length)
x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=max_review_length)
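As a rough illustration of what the padding step does to a single review, the sketch below mimics the default `padding='pre'` and `truncating='pre'` behaviour of `pad_sequences` in plain Python. The helper `pad_or_truncate` is hypothetical and is not part of Keras; it only shows the idea.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Mimic 'pre' padding and 'pre' truncation for one list of integers."""
    if len(sequence) >= maxlen:
        # Drop tokens from the start, keeping the last `maxlen` tokens.
        return sequence[len(sequence) - maxlen:]
    # Add padding values at the start until the list has `maxlen` items.
    return [value] * (maxlen - len(sequence)) + sequence

print(pad_or_truncate([5, 6, 7], maxlen=5))              # [0, 0, 5, 6, 7]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

If you would rather keep the beginning of each long review, `pad_sequences` accepts `truncating='post'` (and `padding='post'` to pad at the end instead).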

Time to build our model. 

We will use a sequential model with an embedding layer, an LSTM layer, and a dense output layer. The embedding layer maps each input word to a vector of size 32. The second layer is an LSTM layer with 32 units, and the output layer has a single node with a sigmoid activation, since we have a binary classification problem.

In [23]:
# Construct our model
embedding_vector_length = 32
model = keras.models.Sequential()
model.add(keras.layers.Embedding(num_of_words, embedding_vector_length, input_length=max_review_length))
model.add(keras.layers.LSTM(32))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=3, batch_size=64)

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 400, 32)           320000    
_________________________________________________________________
lstm_3 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
Total params: 328,353
Trainable params: 328,353
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7fa2e8e4b710>
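The parameter counts in the summary above can be checked by hand. The following sketch reproduces them from the layer sizes, assuming the standard LSTM formulation with four weight blocks (input, forget, and output gates plus the cell candidate), each with input weights, recurrent weights, and a bias. The variable names are chosen for illustration.

```python
vocab_size = 10000    # num_of_words
embedding_dim = 32    # size of each word vector
lstm_units = 32       # units in the LSTM layer

# Embedding: one vector of size embedding_dim per word in the vocabulary.
embedding_params = vocab_size * embedding_dim

# LSTM: 4 blocks, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus one bias for the single output node.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 320000 8320 33 328353
```

Most of the parameters sit in the embedding layer, so the vocabulary size has the largest effect on model size here.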

Now that we have trained our model, let's look at the testing accuracy.

In [25]:
# Evaluate model
scores = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 87.05%


Not bad!
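In the evaluation cell, `scores[0]` is the loss and `scores[1]` is the accuracy, because `metrics=['accuracy']` was passed to `compile`. As a minimal sketch of how such an accuracy figure comes about, the example below turns raw scores into sigmoid probabilities, thresholds them at 0.5, and compares them with the labels. The scores and labels are made up for illustration and are not model outputs.

```python
import math

def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical raw scores and true labels (1 = positive, 0 = negative).
raw_scores = [2.0, -1.0, 0.3, -0.2]
y_true = [1, 0, 0, 0]

probabilities = [sigmoid(z) for z in raw_scores]

# Predict "positive" when the probability exceeds 0.5.
predictions = [1 if p > 0.5 else 0 for p in probabilities]

accuracy = sum(int(p == y) for p, y in zip(predictions, y_true)) / len(y_true)
print("Accuracy: %.2f%%" % (accuracy * 100))  # Accuracy: 75.00%
```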

In this lab, we learned how to load data from Keras datasets, pad and trim data to a certain size, build an LSTM model, and evaluate it on the test set.

If you are interested, there are also a number of packages that can be used to simplify the NLP process. Check out the last additional resource link if you want to see a code example using the NLTK package.

## Packages

Note: The following has been combined from several websites. It states things like 'our favourite package' despite the fact that I am not familiar with any of them. 

NLTK - natural language toolkit

Does classification, stemming, tagging, parsing, semantic reasoning, and tokenization in Python. Main tool for NLP. Good for beginners. Can be very slow however. Its modularized structure makes it excellent for learning and exploring NLP concepts, but it’s not meant for production.

TextBlob

A must for people's first encounter with NLTK - whatever that means. An easy interface to help learn most basic NLP tasks like sentiment analysis, pos-tagging, or noun phrase extraction. Great for beginners, still very slow. This is our favorite library for fast-prototyping or building applications that don’t require highly optimized performance. Beginners should start here.

CoreNLP
Really fast and works in production environments. Can be integrated with NLTK to boost performance. Written in Java with Python wrappers.

Gensim
Gensim is a Python library that specializes in identifying semantic similarity between two documents through vector space modeling and topic modeling. It can handle large text corpora with the help of efficient data streaming and incremental algorithms, which is more than we can say about other packages that only target batch and in-memory processing. What we love about it is its incredible memory usage optimization and processing speed. These were achieved with the help of another Python library, NumPy. The tool’s vector space modeling capabilities are also top notch.
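To give a feel for what "similarity through vector space modeling" means, here is a minimal pure-Python sketch that represents two documents as word-count vectors and compares them with cosine similarity. It does not use Gensim's API; the function `cosine_similarity` is a hypothetical helper written for illustration.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between the word-count vectors of two documents."""
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    # Dot product over the words of doc_a (missing words count as 0).
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

same = cosine_similarity("the film was brilliant", "the film was brilliant")
partial = cosine_similarity("the film was brilliant", "the film was sad")
none = cosine_similarity("the film was brilliant", "fly fishing")
print(round(same, 2), round(partial, 2), round(none, 2))  # 1.0 0.75 0.0
```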

spaCy
spaCy is a relatively young library that was designed for production usage. That’s why it’s so much more accessible than other Python NLP libraries like NLTK. spaCy offers the fastest syntactic parser available on the market today. Moreover, since the toolkit is written in Cython, it’s also really speedy and efficient. It’s not as widely adopted, but if you’re building a new application, you should give it a try.

However, no tool is perfect. In comparison to the libraries we covered so far, spaCy supports the smallest number of languages (seven). Still, the growing popularity of machine learning, NLP, and spaCy as a key library means that the tool might start supporting more languages soon.

polyglot
This slightly lesser-known library is one of our favorites because it offers a broad range of analysis and impressive language coverage. Thanks to NumPy, it also works really fast. Using polyglot is similar to spaCy – it’s very efficient, straightforward, and basically an excellent choice for projects involving a language spaCy doesn’t support. The library also stands out from the crowd because it requires the use of a dedicated command in the command line through its pipeline mechanisms. Definitely worth a try.

scikit-learn
This handy NLP library provides developers with a wide range of algorithms for building machine learning models. It offers many functions for using the bag-of-words method of creating features to tackle text classification problems. The strength of this library is its intuitive class methods. Also, scikit-learn has excellent documentation that helps developers make the most of its features.

However, the library doesn’t use neural networks for text preprocessing. So if you’d like to carry out more complex preprocessing tasks like POS tagging for your text corpora, it’s better to use other NLP libraries and then return to scikit-learn for building your models.
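The bag-of-words method mentioned above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not scikit-learn's implementation; in scikit-learn the `CountVectorizer` class fills this role. The documents and the helper `bag_of_words` are made up for the example.

```python
from collections import Counter

# Two hypothetical documents.
docs = [
    "the film was brilliant",
    "the film was sad and the ending was sad",
]

# The vocabulary is the sorted set of all words seen in the documents.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Return the count of each vocabulary word in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

print(vocabulary)
# ['and', 'brilliant', 'ending', 'film', 'sad', 'the', 'was']
for doc in docs:
    print(bag_of_words(doc))
# [0, 1, 0, 1, 0, 1, 1]
# [1, 0, 1, 1, 2, 2, 2]
```

Each document becomes a fixed-length vector of word counts, which can then be fed to a classifier. Word order is discarded, which is the main difference from the sequence model used earlier in this notebook.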

Pattern
Another gem among the NLP libraries Python developers use to handle natural languages. Pattern allows part-of-speech tagging, sentiment analysis, vector space modeling, SVM, clustering, n-gram search, and WordNet. You can take advantage of a DOM parser, a web crawler, as well as some useful APIs like Twitter or Facebook. Still, the tool is essentially a web miner and might not be enough for completing other natural language processing tasks.

CoreNLP is a Java library with Python wrappers. It’s in many existing production systems due to its speed.


Gensim is most commonly used for topic modeling and similarity detection. It’s not a general-purpose NLP library, but for the tasks it does handle, it does them well.

## Additional Resources

#### NLP

[8 best NLP Libraries](https://sunscrapers.com/blog/8-best-python-natural-language-processing-nlp-libraries/)

[Basic NLP explanations](https://towardsdatascience.com/gentle-start-to-natural-language-processing-using-python-6e46c07addf3)

[Definitions of NLP terms](https://www.tutorialspoint.com/natural_language_processing/natural_language_processing_python.htm)

[Great Short Datacamp Tutorial on Sentiment Analysis and Text Classification - NLTK package](https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk)
