# LSTMs for Sentiment Analysis and Text Generation
This part of the lab is based on Antoine Tixier's notes [Introduction to CNNs and LSTMs for NLP](https://arxiv.org/pdf/1808.09772.pdf). You are strongly encouraged to have a look at these notes for a quick theoretical intro.

## Part 1: Sentiment Classification using LSTMs
In the first part of the lab, we will implement a long short-term memory (LSTM) network to perform binary movie review classification (positive/negative) using the [Keras](https://keras.io) library.

For our experiments, we will use the [sentence polarity dataset](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz). The dataset was collected by Pang and Lee and consists of 5,331 positive and 5,331 negative snippets taken from Rotten Tomatoes. Snippets were labeled automatically using the labels provided by Rotten Tomatoes. The positive and negative reviews are stored in the `rt-polarity.pos` and `rt-polarity.neg` files, respectively. Let's first read the data.

In [1]:
import numpy as np

def load_documents(filename):
    docs = []

    with open(filename, encoding='utf8', errors='ignore') as f:
        for line in f:
            # Drop only the trailing newline, so a final line without one is not truncated
            docs.append(line.rstrip('\n'))

    return docs

docs = list()
labels = list()

docs_pos = load_documents('data/rt-polarity.pos')
docs.extend(docs_pos)
labels.extend([1]*len(docs_pos))

docs_neg = load_documents('data/rt-polarity.neg')
docs.extend(docs_neg)
labels.extend([0]*len(docs_neg))

y = np.array(labels)

print("A positive review:", docs_pos[0])
print('\n')
print("A negative review:", docs_neg[0])

A positive review: the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . 


A negative review: simplistic , silly and tedious . 


The documents in the dataset have already been preprocessed. We will therefore only replace every character other than letters, digits, and a few punctuation marks with a space, and split contractions and punctuation marks into separate tokens. We will then represent each document as a list of tokens.

In [2]:
import re

def clean_str(string):
    # Replace every character outside the whitelist with a space
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    # Split contractions into separate tokens
    string = re.sub(r"\'s", " 's", string)
    string = re.sub(r"\'ve", " 've", string)
    string = re.sub(r"n\'t", " n't", string)
    string = re.sub(r"\'re", " 're", string)
    string = re.sub(r"\'d", " 'd", string)
    string = re.sub(r"\'ll", " 'll", string)
    # Surround punctuation marks with spaces (no backslashes in the replacements,
    # otherwise the tokens would contain a literal backslash)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " ( ", string)
    string = re.sub(r"\)", " ) ", string)
    string = re.sub(r"\?", " ? ", string)
    # Collapse runs of whitespace
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().split()

    
def preprocessing(docs):
    preprocessed_docs = []

    for doc in docs:
        preprocessed_docs.append(clean_str(doc))

    return preprocessed_docs

processed_docs = preprocessing(docs)

print("Preprocessed document:", processed_docs[0])

Preprocessed document: ['the', 'rock', 'is', 'destined', 'to', 'be', 'the', '21st', 'century', "'s", 'new', 'conan', 'and', 'that', 'he', "'s", 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'arnold', 'schwarzenegger', ',', 'jean', 'claud', 'van', 'damme', 'or', 'steven', 'segal']


Subsequently, we will extract the vocabulary of the dataset. We will store the vocabulary in a dictionary where keys are terms and values correspond to indices. Hence, each term will be assigned a unique index. The minimum index will be equal to 1, while the maximum index will be equal to the size of the vocabulary.

In [3]:
def get_vocab(processed_docs):
    vocab = dict()

    for doc in processed_docs:
        for word in doc:
            if word not in vocab:
                vocab[word] = len(vocab) + 1

    return vocab

vocab = get_vocab(processed_docs)
print("Size of the vocabulary:", len(vocab))
print("Index of term 'good':", vocab["good"])

Size of the vocabulary: 18777
Index of term 'good': 72


Next, we will load a set of 300-dimensional word embeddings learned with word2vec on the GoogleNews dataset. The embeddings can be downloaded from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing. Using `gensim`, we can extract only the vectors of the words found in our vocabulary. Terms not present in the set of pre-trained words are initialized randomly (uniformly in [−0.25, 0.25]). Before executing the code, set the path for the file that contains the word embeddings.

In [5]:
import numpy as np
from gensim.models.keyedvectors import KeyedVectors

def load_embeddings(fname, vocab):
    embeddings = np.zeros((len(vocab)+1, 300))
    
    model = KeyedVectors.load_word2vec_format(fname, binary=True)
    for word in vocab:
        if word in model:
            embeddings[vocab[word]] = model[word]
        else:
            embeddings[vocab[word]] = np.random.uniform(-0.25, 0.25, 300)
    return embeddings

path_to_embeddings = '/Users/christophenoblanc/Documents/ProjetsPython/DSSP/Day_17_et_18/GoogleNews-vectors-negative300.bin.gz'
embeddings = load_embeddings(path_to_embeddings, vocab)

We will now compute the length of the longest document and create a matrix whose rows correspond to documents. Each row contains the indices of the terms appearing in the document, in the order in which they appear. That is, the first component of a row contains the index of the first term of the corresponding document, the second component contains the index of the second term, and so on. Documents shorter than the longest document are padded with zeros.

In [None]:
#your code here

# not done
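
As a minimal sketch of the padding step described above (not the lab's reference solution), the helper below builds the index matrix with NumPy. The function name `pad_documents` and the toy documents and vocabulary are hypothetical, and padding at the end of each row is an assumption, since the text only says that shorter documents are padded with zeros.

```python
import numpy as np

def pad_documents(docs, vocab):
    """Map tokenized documents to a zero-padded matrix of term indices."""
    # Length of the longest document fixes the number of columns
    max_len = max(len(doc) for doc in docs)
    # Index 0 is never assigned to a term, so it can serve as padding
    X = np.zeros((len(docs), max_len), dtype=np.int64)
    for i, doc in enumerate(docs):
        indices = [vocab[word] for word in doc]
        X[i, :len(indices)] = indices  # assumption: zeros go at the end
    return X, max_len

# Toy data standing in for `processed_docs` and `vocab`
toy_docs = [["a", "good", "film"], ["bad"]]
toy_vocab = {"a": 1, "good": 2, "film": 3, "bad": 4}
X_toy, max_len = pad_documents(toy_docs, toy_vocab)
print(X_toy)
```

In the lab, the same call would presumably be `pad_documents(processed_docs, vocab)`, giving one row per review.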

We will then use the [`train_test_split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function of scikit-learn to split our dataset randomly into a training and a test set. Set the size of the test set to 0.1.

In [None]:
from sklearn.model_selection import train_test_split

#your code here

# not done
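
To illustrate what a random 90/10 split does, here is a NumPy-only sketch. It stands in for `train_test_split` and is not its implementation; the function name `random_split`, the seed, and the toy arrays are hypothetical.

```python
import numpy as np

def random_split(X, y, test_size=0.1, seed=0):
    """Shuffle the row indices and hold out a fraction of them for testing."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    n_test = int(round(test_size * len(X)))
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 20 "documents" of length 5 with alternating labels
X_toy = np.arange(100).reshape(20, 5)
y_toy = np.array([0, 1] * 10)
X_train, X_test, y_train, y_test = random_split(X_toy, y_toy)
print(X_train.shape, X_test.shape)
```

In the lab itself, `train_test_split(X, y, test_size=0.1)` from scikit-learn returns the four arrays in the same order.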

### Defining the LSTM

To build the neural network, we will use the Sequential model. We will first add an [Embedding layer](https://keras.io/layers/embeddings/). The Embedding layer requires integer-encoded input, in which each word is represented by a unique integer. The layer can be initialized either with random weights, in which case it learns an embedding for every word in the training set, or with pre-trained word embeddings. In our case, it will be initialized with the 300-dimensional word embeddings that we have already loaded. The Embedding layer takes three arguments: (1) `input_dim`: the number of rows of the embedding matrix, i.e., the size of the vocabulary plus one, since index 0 is reserved for padding, (2) `output_dim`: the size of the vector space in which the words have been embedded (i.e., 300 in our case), and (3) `input_length`: the maximum length of the input documents. When we initialize the layer with pre-trained embeddings, we must provide another argument (`weights`), a list containing a matrix whose i-th row is the embedding of the term with index i.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

#your code here

# not done

We will then add a [Long Short-Term Memory layer](https://keras.io/layers/recurrent/#lstm). The Long Short-Term Memory layer takes as input the number of hidden units (i.e., dimensionality of the output). Set the number of units to 100. To create a Bidirectional LSTM, we will use the [Bidirectional layer wrapper](https://keras.io/layers/wrappers/#bidirectional). This wrapper takes a recurrent layer as an argument (the Long Short-Term Memory layer in our case).

In [None]:
from tensorflow.keras.layers import LSTM, Bidirectional

#your code here

# not done

Finally, we will apply [dropout](https://keras.io/layers/core/#dropout) to the output of the LSTM (with rate 0.5) and add a fully connected layer with a single output unit and a sigmoid activation. Its value corresponds to the probability that the review is positive.

In [None]:
from tensorflow.keras.layers import Dense, Dropout

#your code here

# not done

Next, we will compile the model. Since this is a binary classification task, the loss function is the binary crossentropy. To train the network, we will use the Adam optimizer.

In [None]:
#your code here

# not done
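
The architecture described in the last few steps can be sketched as follows. This is one possible way to assemble it, not the lab's reference solution. The toy sizes and the random `toy_embeddings` matrix are stand-ins for `len(vocab) + 1`, 300, the length of the longest document, and the loaded `embeddings`. As an assumption of this sketch, the matrix is passed through `embeddings_initializer=Constant(...)` and the input length through an `Input` layer, instead of the `weights` and `input_length` arguments mentioned above.

```python
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Bidirectional, Dense, Dropout, Embedding, LSTM
from tensorflow.keras.models import Sequential

# Toy stand-ins (assumptions, see the paragraph above)
vocab_size, embedding_dim, max_len = 50, 8, 12
toy_embeddings = np.random.uniform(-0.25, 0.25, (vocab_size, embedding_dim))

model = Sequential([
    Input(shape=(max_len,)),
    # One row per index, including row 0 for padding
    Embedding(input_dim=vocab_size, output_dim=embedding_dim,
              embeddings_initializer=Constant(toy_embeddings)),
    # Forward and backward LSTMs with 100 units each
    Bidirectional(LSTM(100)),
    Dropout(0.5),
    # Single sigmoid unit: probability that the review is positive
    Dense(1, activation="sigmoid"),
])

# Tracking accuracy makes `val_accuracy` available to the callbacks used later
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```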

Finally, we print a summary of the model.

In [None]:
model.summary()

We train the model on CPU; using a GPU gives a significant speedup. We add a callback that saves to disk the model with the highest test accuracy, and a second callback that stops training after 2 epochs without improvement in test accuracy (early stopping). Both callbacks monitor `val_accuracy`, so the model must be compiled with `metrics=['accuracy']` and the test set must be passed to `fit` as `validation_data`. Use the [fit](https://keras.io/models/model/#methods) function of Keras to train the model. Set the number of epochs to 5 and the batch size to 64.

In [None]:
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

early_stopping = EarlyStopping(monitor='val_accuracy', # go through epochs as long as accuracy on validation set increases
                               patience=2,
                               mode='max')

# make sure that the model corresponding to the best epoch is saved
checkpointer = ModelCheckpoint(filepath='bi_lstm.hdf5',
                               monitor='val_accuracy',
                               save_best_only=True,
                               verbose=0)


#your code here

# not done

## Part 2: Text Generation using LSTMs
In the second part of the lab, we will implement an LSTM network to learn sequences of characters, and we will use the model to generate new sequences of characters. Recurrent neural networks such as LSTMs can serve as predictive models, but also as generative models. They can identify the patterns in the data and based on these patterns they can then generate novel data.


We will train the LSTM network that we will implement on a technical report of a demo application that was developed by our research team. The textual content of the technical report is stored in the `demo_report.txt` file. Use the code given below to read the file and extract the textual content. How long is the text?

In [6]:
with open('data/demo_report.txt', encoding='utf-8') as f:
    text = f.read().lower()

#your code here
print(text[:100])

the role of hashtags in guiding users to content of interest is of critical importance. however, onl


In [7]:
print(len(text))

10397


Next, we will extract the vocabulary of the text (i.e., all the unique characters). We will also create two dictionaries: one that maps each character to a unique integer, and its inverse, which maps each integer back to the corresponding character.

In [13]:
vocab = sorted(list(set(text)))
#print('Vocabulary:', vocab)

char_to_idx = dict()
idx_to_char = dict()

#your code here
for c in vocab:
    char_to_idx[c]=len(char_to_idx)
    idx_to_char[char_to_idx[c]]=c
    
print(char_to_idx)
print("")
print(idx_to_char)

{'\n': 0, ' ': 1, '!': 2, '"': 3, '#': 4, "'": 5, '(': 6, ')': 7, ',': 8, '-': 9, '.': 10, ':': 11, 'a': 12, 'b': 13, 'c': 14, 'd': 15, 'e': 16, 'f': 17, 'g': 18, 'h': 19, 'i': 20, 'j': 21, 'k': 22, 'l': 23, 'm': 24, 'n': 25, 'o': 26, 'p': 27, 'q': 28, 'r': 29, 's': 30, 't': 31, 'u': 32, 'v': 33, 'w': 34, 'x': 35, 'y': 36, 'z': 37, '’': 38}

{0: '\n', 1: ' ', 2: '!', 3: '"', 4: '#', 5: "'", 6: '(', 7: ')', 8: ',', 9: '-', 10: '.', 11: ':', 12: 'a', 13: 'b', 14: 'c', 15: 'd', 16: 'e', 17: 'f', 18: 'g', 19: 'h', 20: 'i', 21: 'j', 22: 'k', 23: 'l', 24: 'm', 25: 'n', 26: 'o', 27: 'p', 28: 'q', 29: 'r', 30: 's', 31: 't', 32: 'u', 33: 'v', 34: 'w', 35: 'x', 36: 'y', 37: 'z', 38: '’'}


Next, we will generate our training set. Specifically, we will split the text into subsequences of 40 characters each. The subsequence length is a hyperparameter, so smaller or larger values are also possible. To generate the subsequences, we slide a window along the text one character at a time. The class label of a subsequence is the character that follows its last character in the text. For instance, the class label of the subsequence “for recent tweets that contain all the term” would be the character “s”. Run the following code to generate the training set.

In [24]:
import numpy as np
length = 40
sentences = list()
next_chars = list()
for i in range(0, len(text) - length):
    sentences.append(text[i: i + length])
    next_chars.append(text[i + length])

# Use the built-in `bool`: the `np.bool` alias has been removed from recent NumPy releases
X = np.zeros((len(sentences), length, len(vocab)), dtype=bool)
y = np.zeros((len(sentences), len(vocab)), dtype=bool)
for i, sentence in enumerate(sentences):
    for j, char in enumerate(sentence):
        X[i, j, char_to_idx[char]] = 1
    y[i, char_to_idx[next_chars[i]]] = 1
    
print("Size of training matrix:", X.shape) 
# Axes of X: 10357 samples, 40 characters per subsequence,
# 39 = vocabulary size (each character is a one-hot vector, True at the character's index)
print(sentences[0])
print("next is:",next_chars[0])
print(sentences[1])
print("next is:",next_chars[1])
print(sentences[2])
print("next is:",next_chars[2])
print(sentences[3])
print("next is:",next_chars[3])
print(X[0,12,:])

Size of training matrix: (10357, 40, 39)
the role of hashtags in guiding users to
next is:  
he role of hashtags in guiding users to 
next is: c
e role of hashtags in guiding users to c
next is: o
 role of hashtags in guiding users to co
next is: n
[False False False False False False False False False False False False
 False False False False False False False  True False False False False
 False False False False False False False False False False False False
 False False False]
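
In the row printed above, the single `True` sits at position 19, which is the index of `h` in `char_to_idx`: `h` is the character at position 12 of the first subsequence. As a small illustration of this one-hot encoding, the sketch below encodes a string and decodes it again. The helper names `one_hot` and `decode` and the toy text are hypothetical.

```python
import numpy as np

toy_text = "the role of hashtags"
toy_vocab = sorted(set(toy_text))
toy_char_to_idx = {c: i for i, c in enumerate(toy_vocab)}
toy_idx_to_char = {i: c for c, i in toy_char_to_idx.items()}

def one_hot(seq):
    """Encode a string as a (len(seq), vocabulary size) boolean matrix."""
    out = np.zeros((len(seq), len(toy_vocab)), dtype=bool)
    out[np.arange(len(seq)), [toy_char_to_idx[c] for c in seq]] = True
    return out

def decode(matrix):
    """Recover the string by locating the True entry of each row."""
    return "".join(toy_idx_to_char[int(i)] for i in matrix.argmax(axis=1))

encoded = one_hot("hashtags")
print(encoded.shape, decode(encoded))
```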


We will now define the LSTM architecture. We will again make use of the Sequential model. We will first add a [Long Short-Term Memory layer](https://keras.io/layers/recurrent/#lstm). The Long Short-Term Memory layer takes as input the number of hidden units (i.e., dimensionality of the output) as well as the size of the input. Set the number of units of the LSTM layer to 64. Moreover, since we will add a second LSTM layer right next to the first one, we need to set the `return_sequences` parameter to True so that the layer returns the full sequence and not just the last output in the sequence. Then, we will add another LSTM layer. Set the number of hidden units to 64. The second LSTM layer will be followed by a fully connected layer ([Dense](https://keras.io/layers/core/#dense)) with as many units as the size of our vocabulary. Since this is a multiclass classification task (i.e., each character corresponds to a class), we will make use of the [softmax](https://keras.io/activations/#softmax) activation function.

In [26]:
#your code here
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

model = Sequential()
model.add(LSTM(units=64, return_sequences=True, input_shape=(length, len(vocab))))
# return_sequences defaults to False: only the output at the last time step is returned
model.add(LSTM(units=64))
model.add(Dense(units=len(vocab), activation='softmax'))

Next, we will compile the model. Since this is a multiclass classification task, the loss function is the categorical crossentropy. To train the network, we will use the Adam optimizer.

In [27]:
from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=0.01)  # `lr` is a deprecated alias of `learning_rate`
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

Next, we print a summary of the model.

In [28]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_2 (LSTM)                (None, 40, 64)            26624     
_________________________________________________________________
lstm_3 (LSTM)                (None, 64)                33024     
_________________________________________________________________
dense_1 (Dense)              (None, 39)                2535      
Total params: 62,183
Trainable params: 62,183
Non-trainable params: 0
_________________________________________________________________
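
The parameter counts in the summary can be checked by hand. An LSTM layer has four weight blocks (the input, forget, and output gates plus the candidate cell state), each with input weights, recurrent weights, and a bias vector. The short sketch below reproduces the numbers; the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    # Four blocks, each with (input_dim + units) * units weights and units biases
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output unit
    return input_dim * units + units

vocab_size, units = 39, 64
counts = [
    lstm_params(vocab_size, units),   # first LSTM: one-hot characters as input
    lstm_params(units, units),        # second LSTM: outputs of the first as input
    dense_params(units, vocab_size),  # softmax layer over the vocabulary
]
print(counts, sum(counts))  # [26624, 33024, 2535] 62183
```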


We can now train the model. Use the [fit](https://keras.io/models/model/#methods) function of Keras to train the model. Set the number of epochs to 50 and the batch size to 128. We also add a callback which at the end of each epoch uses the model in order to generate a sequence of characters. Specifically, we randomly sample a subsequence of 40 characters from the text and feed it to the model to generate the next character. We then update the subsequence by removing its first character and adding the predicted character to it. We repeat this process for 100 iterations (i.e., we generate 100 characters).

In [None]:
from random import randint
from tensorflow.keras.callbacks import LambdaCallback

def generate_text(epoch, _):
    # Prints generated text
    start_idx = randint(0, len(text) - length - 1)
    generated_text = ''
    sample_text = text[start_idx: start_idx + length]
    generated_text += sample_text
    print('Generating text with seed: "' + sample_text + '"')

    for i in range(100):
        X_test = np.zeros((1, length, len(vocab)))
        for j, char in enumerate(sample_text):
            X_test[0, j, char_to_idx[char]] = 1.

        y_pred = model.predict(X_test, verbose=0)[0]
        next_idx = np.argmax(y_pred)
        next_char = idx_to_char[next_idx]

        generated_text += next_char
        sample_text = sample_text[1:] + next_char
        
    print(generated_text)
        
test_callback = LambdaCallback(on_epoch_end=generate_text)

#your code here
model.fit(X,y,epochs=50,batch_size=128,callbacks=[test_callback])

Train on 10357 samples
Epoch 1/50
ake subsequent searches for tweets diffite te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te t
Epoch 2/50
ghquality hashtags. furthermore, it is cont the the the the the the the the the the the the the the the the the the the the the the the the 
Epoch 3/50
e recently proposed word mover's distance the the the the the the the the the the the the the the the the the the the the the the the the th
Epoch 4/50
 set of tweets from which we extract the tweets the tweets the tweets the tweets the tweets the tweets the tweets the tweets the tweets the 
Epoch 5/50
 return the one the user created. as regrecommected the tweets the tweets the tweets the tweets the tweets the tweets the tweets the tweets 
Epoch 6/50
ort in order to find appropriate hashtags the tweets the tweets the tweets the tweets the tweets the tweets the tweets the tweets the tweets
Epoch 7/50
 same with other tweets. besides duplicalled the searc

 the system has returned ten hashtags as that the tweets containing the tweets containing the tweets containing the tweets containing the tw
Epoch 24/50
 proposed system by computing precision and the user.

to metween the tweets are proposed as suggestions of the tweets are proposed as sugge
Epoch 25/50

given a tweet entered by the"
stant#tag.

given a tweet entered by the user. the search api in order to create their tweets are not apploye the search api in order to cre
Epoch 26/50
