# CS 584 Assignment 3 -- Language Model

#### Name: (Zhiyao Wen)

## In this assignment, you are required to follow the steps below:
1. Review the lecture slides.
2. Implement N-gram language modeling.
3. Implement RNN language modeling.

*** Please read the code and comments very carefully and install these packages (NumPy, sklearn, and tqdm) before you start ***

In [None]:
# !pip install numpy scikit-learn tqdm matplotlib
# !pip install -U spacy
# !python -m spacy download en_core_web_sm

## 0. Data Process
Run the following cells to preprocess training data, validation data, and test data.

### Load Data

In [9]:
train_texts = []
with open('./data/train.txt', 'r') as fp:
    for line in fp:
        train_texts.append(line)
        
valid_texts = []
with open('./data/valid.txt', 'r') as fp:
    for line in fp:
        valid_texts.append(line)
        
test_texts = []
with open('./data/input.txt', 'r') as fp:
    for line in fp:
        test_texts.append(line)


### Preprocessing

In [10]:
import re
import string
from string import punctuation

class Preprocesser(object):
    def __init__(self, punctuation=True, url=True, number=True):
        self.punctuation = punctuation
        self.url = url
        self.number = number
    
    def apply(self, text):
        
        text = self._lowercase(text)
        text = text.replace('<unk>', '')
        
        if self.url:
            text = self._remove_url(text)
            
        if self.punctuation:
            text = self._remove_punctuation(text)
            
        if self.number:
            text = self._remove_number(text)
        
        
        text = re.sub(r'\s+', ' ', text)
            
        return text
    
        
    def _remove_punctuation(self, text):
        ''' Please fill this function to remove all the punctuations in the text
        '''
        ### Start your code
        
        text = ''.join(c for c in text if c not in punctuation)
        
        ### End
        
        return text
    
    def _remove_url(self, text):
        ''' Please fill this function to remove all the urls in the text
        '''
        ### Start your code
        
        text = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', text)
        
        ### End
        
        return text
    
    def _remove_number(self, text):
        ''' Please fill this function to remove all the numbers in the text
        '''
        
        ### Start your code
        text = ''.join([i for i in text if not i.isdigit()])
        
        ### End
        
        return text
    
    def _lowercase(self, text):
        ''' Please fill this function to lower the text
        '''
        
        ### Start your code
        
        text = text.lower()
        
        ### End
        
        return text
    
    
preprocesser = Preprocesser()

### Tokenization

In [11]:
import spacy
nlp = spacy.load('en_core_web_sm')

def tokenize(text):
    ''' Since it is a language model, we don't need to remove the stop words.
    '''
    doc = nlp(text)
    tokens = [token.text for token in doc]
    if ' ' in tokens:
        tokens.remove(' ')
    return tokens
    

## 1. N-gram (50 points)
In this section, you are required to implement an N-gram model for language modeling and two smoothing methods.
1. Implement N-gram (Bigram).
2. Implement Good Turing smoothing.
3. Implement Kneser-Ney smoothing.

### 1.1 Implement a bigram for language modeling (fill the code, 10 points)

In [30]:
from collections import defaultdict
from tqdm.notebook import tqdm
import numpy as np

class BiGram(object):
    
    def __init__(self):
        ''' Construction function of BiGram.
            Params:
                uni_count: a dictionary with default value 0
                bi_count: a dictionary that each value is a dictionary with default value 0
        '''

        self.uni_count = defaultdict(lambda: 0)
        self.bi_count = defaultdict(lambda: 0)
        
        
    def fit(self, texts):
        self._unigram_count(texts)
        #print(self.uni_count)
        self._bigram_count(texts)
        
    
    def _unigram_count(self, texts):
        ''' Count tokens, and store in self.uni_count
            Input
                texts: a list of text
        '''
        
        ### Start you code
        
        for text in texts:
            clean_text = preprocesser.apply(text)
            tokens = tokenize(clean_text)
            
            for word in tokens:
                self.uni_count[word] += 1
            
        ### End
            
    
    
    def _bigram_count(self, texts):
        ''' Count tokens in bigram way, and store in self.bi_count
            Input
                texts: a list of text
        '''
        
        ### Start you code
        for text in texts:
            clean_text = preprocesser.apply(text)
            tokens = tokenize(clean_text)
            
            for i in range(len(tokens) - 1):
                word = (tokens[i], tokens[i+1])# store the bigram in tuple
                self.bi_count[word] += 1

        
        ### End
    
    
    def probability(self, w1, w2):
        ''' Given two tokens, calculate the bigram probability
            Input
                w1: the first token of bigram
                w2: the second token of bigram
        '''
        prob = 0.
         
        ### Start you code
        count_unigram =sum(self.uni_count.values())
        
        if w2 == 0:
            prob = self.uni_count[w1] / count_unigram  #for unigram
            
        else:
            p_w1 = self.uni_count[w1]   # for bigram
            p_w2 = self.bi_count[(w1, w2)]
            
            if p_w1 != 0:
                prob = p_w2/p_w1
            else:
                prob = 0
            
        prob = min(prob,1) 
        
        ### End
        
        return prob
    
    
    def predict(self, w):
        ''' Given a word, find a word with the highest probability
            Input
                w: a word
                
            Hint: utilize self.probability(w, w2) to find which w2 has the highest probability
        '''
        
        w_next = None
        
        ### Start your code
        
        w2 = []
        
        p_w2 = []
        
        for i,j in self.bi_count.keys():
            
            if i == w:
                w2.append(j)
                p_w2.append(self.probability(i, j))
        
        w_next = w2[p_w2.index(map(p_w2))]  #predict the next word
        ### End
        
        return w_next

### 1.2 Implement Good Turing smoothing (fill the code, 15 points)

In [13]:
class GoodTuring(object):
    
    def __init__(self, bigram):
        ''' Construction function of Good Turing.
            Input
                bigram: Bigram model
            Params:
                uni_count: a dictionary with default value 0
                bi_count: a dictionary that each value is a dictionary with default value 0
                -----------------
                For bigram
                bi_nc: a dictionary with default value 0, the count of things we've seen c times.
                bi_c_star: (c+1)*N_c+1 / N_c, page 64 of slides (lecture 5).
                bi_N: \sum c*N_c, page 64 of slides (lecture 5).
                
                For unigram
                uni_nc: a dictionary with default value 0, the count of things we've seen c times.
                uni_c_star: (c+1)*N_c+1 / N_c, page 64 of slides (lecture 5).
                uni_N: \sum c*N_c, page 64 of slides (lecture 5).
            
        '''
        self.uni_count = bigram.uni_count
        self.bi_count = bigram.bi_count
        
        self.uni_nc = defaultdict(lambda: 0)
        self.bi_nc = defaultdict(lambda: 0)
        
        self.uni_c_star = defaultdict(lambda: 0)
        self.bi_c_star = defaultdict(lambda: 0)
        
        self.uni_N = 0
        self.bi_N = 0
        
        
    def fit(self, texts):
        self._calc_N_c()
        self._calc_c_star_and_N()
        
    
    def _calc_N_c(self):
        ''' Count the frequency of frequency c, and store to self.nc.
            Page 64 of slides (lecture 5)
            Hint: You could directly utililze self.bi_count and self.uni_count to calculate N_c
        '''
        
        ### Start you code
        
        # calculate N_c for ungram
        for i in self.uni_count.values():
            self.uni_nc[i] += 1
        
        
        # calculate N_c for bi gram
        for j in self.bi_count.values():
            self.bi_nc[j] += 1
        

        ### End
        
        
    def _calc_c_star_and_N(self):
        ''' Calculate c_star and N. (page 65 of slides (lecture 5))
        '''
        
        ### Start your code
        
        #calculate N for ungram
        for i in self.uni_count.keys():
            self.uni_N += self.uni_count[i]
            
        #calculate N for bigram
        for j in self.bi_count.keys():
            self.bi_N += self.bi_count[j]
        
        #calculate c_star for ungram, i is c
        self.uni_c_star[0] = self.uni_nc[1]
        for i in self.uni_nc.keys():
            if i+1 in self.uni_nc.keys() and i !=0:
                self.uni_c_star[i] = ((i+1)*self.uni_nc[i+1])/self.uni_nc[i]
        
        #calculate c_star for bigram, i is c
        self.bi_c_star[0] = self.bi_nc[1]
        for i in self.bi_nc.keys():
            if i+1 in self.bi_nc.keys() and i !=0:
                self.bi_c_star[i] = ((i+1) * self.bi_nc[i+1])/self.bi_nc[i]
        
        ### End
        
        
    def probability(self, w1, w2):
        ''' Given two words, calculate the GT probability
                p_GT = c_star / N, if c != 0
                p_GT = N_1 / N, if c = 0
                
                p = p_GT(w1, w2) / p_GT(w1)
                
            Input
                w1: the first word
                w2: the second word
                
        '''
        prob = 0.
        
        ### Start you code
        
        # caculate p_GT(w1, w2)
        if(w1,w2) in self.bi_count.keys():
            
            c = self.bi_count[(w1,w2)]
            
            if c in self.bi_c_star.keys():
                c_star = self.bi_c_star[c]
            else:
                c_star = 0
            
            p_GT_w1_w2 = c_star / self.bi_N
            
        else:
            p_GT_w1_w2 = self.bi_c_star[0] / self.bi_N
        
        # caculate p_GT(w1)
        if w1 in self.uni_count.keys():
            
            c = self.uni_count[w1]
            
            if c in self.uni_c_star.keys():
                
                c_star = self.uni_c_star[c]
            else:
                c_star = c 
                
            p_GT_w1 = c_star / self.uni_N
            
        if p_GT_w1 != 0:
            prob = p_GT_w1_w2 / p_GT_w1
        else:
            prob = 0     
        
        ### End
        
        return prob

    
    def predict(self, w):
        ''' Given a word, find a word with the highest probability
            Input
                w: a word
                
            Hint: utilize self.probability(w, w2) to find which w2 has the highest probability
        '''
        
        w_next = None
        
        ### Start your code
        
        # same as before
        w2 = []
        
        p_w2 = []
        
        for i,j in self.bi_count.keys():
            
            if i == w:
                w2.append(j)
                p_w2.append(self.probability(i, j))
        
        w_next = w2[p_w2.index(map(p_w2))]  #predict the next word

        
        ### End
        
        return w_next

### 1.3 Implement Kneser-Ney smoothing (fill the code, 15 points)

In [14]:
class KneserNey(object):
    
    def __init__(self, bigram, d=0.75):
        ''' Construction function of KneserNey.
            Params:
                uni_count: a dictionary with default value 0
                bi_count: a dictionary that each value is a dictionary with default value 0
                -----------------
                num_bigram_types: page 73 of slides (lecture 5)
                novel_continuation: \{ w_{i-1}: c(w_{i-1}, w) \}, page 73 of slides (lecture 5)
                p_continuation: page 73 of slides (lecture 5)
                novel_previous: \{ w: c(w_{i-1}, w) \}, page 75 of slides (lecture 5)
                lam: page 75 of slides (lecture 5)
                d: 0.75
            
        '''
        
        self.uni_count = bigram.uni_count
        self.bi_count = bigram.bi_count
        
        self.num_bigram_types = 0
        self.novel_continuation = defaultdict(lambda: 0)
        self.novel_previous = defaultdict(lambda: 0)
        self.p_continuation = defaultdict(lambda: 0)
        self.lam = defaultdict(lambda: 0)
        
        self.d = d
        
    
    def fit(self, texts):
        self._calc_num_bigram_types()
        self._calc_novel_continuation_and_novel_previous()
        self._calc_P_continuation()
        self._calc_lambda()
        
    
    def _calc_num_bigram_types(self):
        ''' Calculate the number of bigram types, and store in self.num_bigram_types
            page 73 of slides (lecture 5)
            
            Hint: you could utilize the bigram count (self.bi_count) which is obtained from Bigram model.
        '''
        
        ### Start your code
        
        self.num_bigram_types = len(self.bi_count.keys())
            
        ### End
      
    
    def _calc_novel_continuation_and_novel_previous(self):
        ''' Calculate novel continuation, and novel previous, 
            and store them in self.novel_continuation and self.novel_previous
            
            novel_continuation = \{ w_{i-1}: c(w_{i-1}, w) \}, page 73 of slides (lecture 5)
            novel_previous = \{ w: c(w_{i-1}, w) \}, page 75 of slides (lecture 5)
            
            Hint: you could utilize the bigram count (self.bi_count) which obtained from Bigram model.
        '''
        
        ### Start your code
        for i,j in self.bi_count.keys():
            self.novel_continuation[j] += 1
            self.novel_previous[i] += 1
    
    
        ### End
    
    
    def _calc_P_continuation(self):
        ''' Calculate p continuation, and store in self.p_continuation.
            page 73 of slides (lecture 5)
            
            Hint: you could utilize the novel continuation (self.novel_continuation).
        '''
        
        ### Start your code 
        
        
        for i in self.novel_continuation.keys():
            self.p_continuation[i] = self.novel_continuation[i] / len(self.bi_count.keys())

        
        ### End
    
    
    def _calc_lambda(self):
        ''' Calculate lambda, and store in self.lam.
            page 75 of slides (lecture 5)
            
            Hint: you could utilize the novel previous (self.novel_previous) and unigram (self.uni_count).
        '''
        
        ### Start your code
        
        for i in self.p_continuation.keys():
            
            if self.uni_count[i] != 0:
                
                self.lam[i] = self.d * self.novel_previous[i] / self.uni_count[i]
                
                
            else:
                self.lam[i] = 0
            
        ### End
        
        
    def probability(self, w1, w2):
        ''' Given two words, calculate the KN probability
            Page 74 of slides (lecture 5)
                
            Input
                w1: the first word
                w2: the second word
        '''
        
        prob = 0.
        
        # Start your code
        
        if self.uni_count[w1] != 0:
            prob = (max(self.bi_count[(w1,w2)] - self.d,0)) / (self.uni_count[w1] + self.lam[w1] * self.p_continuation[w2])
            
        else:
            prob = 0
        
        # End
            
        return prob
    
    
    def predict(self, w):
        ''' Given a word, find a word with the highest probability
            Input
                w: a word
                
            Hint: utilize self.probability(w, w2) to find which w2 has the highest probability
        '''
        
        pred = ''
        
        ### Start your code
        
        # same as before
        w2 = []
        
        p_w2 = []
        
        for i,j in self.bi_count.keys():
            
            if i == w:
                w2.append(j)
                p_w2.append(self.probability(i, j))
        
        w_next = w2[p_w2.index(map(p_w2))]  #predict the next word

                
        ### End
                
        return pred
                

### 1.4 Implement Perplexity (fill the code, 10 point)
**Hint:** Multiplication of probabilities may lead to an overflow problem. One trick is to move the computation to the logarithm space. Therefore, you could use summation instead of multiplication to calculate perplexity.

In [29]:
import math

def perplexity(model, texts):
    ''' Calculate the perplexity score.
        Inputs
            model: the model you want to evaluate (BiGram, GoodTuring, or KneserNey)
            texts: a list of validation text
        Output
            perp: the perplexity of the model on texts
    '''
    perp = 1.
    
    ### Start your code
    count = 0
    N = 0
    for text in texts:
        n = len(text)
        N += n
        
        for i in range(1, n):
            if model.probability(text[i-1], text[i]) != 0:
                perp  -=  np.log(model.probability(text[i-1],text[i]))
                
            else:
                count += 1
        
        if model.uni_count[text[0]] == 0:
            count += 1
        else:
            
            perp -=  np.log(model.probability(text[0],0))
    
    N = N - count
    
    prep = 1/N * perp
    
    ### End
    
    return perp


### 1.5 Calculate the perplexity of three models

Run the following cell to obtian the perplexity of BiGram, Good Turing, and Kneser-Ney.

**Note that, the perlexity should be less than 100.**

In [31]:
# Train Bigram
bigram = BiGram()
bigram.fit(train_texts)

# Perplexity
bigram_perplexity = perplexity(bigram, valid_texts)
print(f'The perplexity of Bigram is: {bigram_perplexity:.4f}')

The perplexity of Bigram is: 306279.2688


In [86]:
# Train Good Turing
gt = GoodTuring(bigram)
gt.fit(train_texts)

# Perplexity
gt_perplexity = perplexity(gt, valid_texts)
print(f'The perplexity of Good Turing is: {gt_perplexity:.4f}')

The perplexity of Good Turing is: -2183377.5215


In [87]:
# For Kneser-Ney
kn = KneserNey(bigram, d=0.75)
kn.fit(train_texts)

# Perplexity
kn_perplexity = perplexity(kn, valid_texts)
print(f'The perplexity of Kneser-Ney is: {kn_perplexity:.4f}')

The perplexity of Kneser-Ney is: 337007.0925


### 1.6 Use N-gram model make predictions

Run the following cells to see how your models work

In [None]:
'''
For the predict function
Maybe some problems with probaility function, so p_w2 do not have two arguments
'''

#### 1.6.1 Bigram

In [88]:
import random

sampled_texts = random.sample(test_texts, 30)
for i, text in enumerate(sampled_texts):
    clean_text = preprocesser.apply(text)
    tokens = tokenize(clean_text)
    pred = bigram.predict(tokens[-1])
    print(f'{i} ==> {text.strip()}, prediction: {pred}')

TypeError: map() must have at least two arguments.

#### 1.6.2 Good Turing

In [89]:
import random
sampled_texts = random.sample(test_texts, 30)
for i, text in enumerate(sampled_texts):
    clean_text = preprocesser.apply(text)
    tokens = tokenize(clean_text)
    pred = gt.predict(tokens[-1])
    print(f'{i} ==> {text.strip()}, prediction: {pred}')

TypeError: map() must have at least two arguments.

#### 1.6.3 Kneser-Ney

In [None]:
sampled_texts = random.sample(test_texts, 30)
for i, text in enumerate(sampled_texts):
    clean_text = preprocesser.apply(text)
    tokens = tokenize(clean_text)
    pred = kn.predict(tokens[-1])
    print(f'{i} ==> {text.strip()}, prediction: {pred}')

## 2. RNN (50 points)
In this section, you are required to implement an RNN-based language model. **Libraries are allowed in this section, such as PyTorch or TensorFlow**. And, of course, you could implement the model from scratch which will get extra credits. 

I divided the whole process into several steps.
1. Initialize parameters
2. Prepare Data
3. Implement the model
4. Train your model
5. Evaluate your model

Please note that you could change those steps by your needs. As long as you correctly implement the model and have reasonable results you will get full points.

### 2.1 Initialize parameters for the model

In [380]:
#######################################################
#                                                     #
#        Change the default values accordingly        #
#                                                     #
#######################################################

learning_rate = 1E-3
batch_size = 50
hidden_size = 100
embedding_size = 200
num_epochs = 30
window_size = 20

### 2.2 Data preparation (Fill the code: 5 points)

Here is what do you might need to do in this section:
1. Build a vocabulary.
2. Prepare the training data.
3. Prepare the validation data.

In [381]:
#preprossing
train_clean_texts = [preprocesser.apply(i) for i in train_texts]

valid_clean_texts = [preprocesser.apply(i) for i in valid_texts]

test_clean_texts = [preprocesser.apply(i) for i in test_texts]

#tokenize
train_tokenized_texts = [tokenize(i) for i in train_clean_texts]

valid_tokenized_texts = [tokenize(i) for i in valid_clean_texts]

test_tokenized_texts = [tokenize(i) for i in test_clean_texts]

In [None]:
### Start your code
import random

# 1. Build a train vocabulary.


train_tokens = [j for i in train_tokenized_texts  for j in i ]

train_vocabulary = list(set(train_tokens))


#valid vocabulary
valid_tokens = [j for i in valid_tokenized_texts  for j in i ]

valid_vocabulary = list(set(valid_tokens))


#test vocabulary
test_tokens = [j for i in test_tokenized_texts  for j in i ]

test_vocabulary = list(set(test_tokens))

# The number of combianing the three type of vocabulary is the same as train_vocabulary, so I use train_vocabulary later

# 2. Prepare the training data.

from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences

# Encode the sequences using Keras integer mapping
train_encoded = [one_hot(i, len(train_vocabulary)) for i in train_clean_texts]
valid_encoded = [one_hot(i, len(valid_vocabulary)) for i in valid_clean_texts]


# Create X_train and y_train 
X_train = [i[:-1] for i in train_encoded]

# as some parse row are all zero, we randromly choose the vacabulary index, vacabulary size 9891
y_train = [random.randint(1, 9891) if len(i[-1:]) == 0 else i[-1:][0] for i in train_encoded]  


# Padding X_train
X_train = pad_sequences(X_train, maxlen=window_size)

# 3. Prepare the validation data.

# Create X_valid and y_valid 
X_valid = [i[:-1] for i in valid_encoded]
# we add 0.1 for all element cause some element is null
y_valid = [0.1 if len(i[-1:]) == 0 else (i[-1:][0] + 0.1) for i in valid_encoded]

# Padding X_valid
X_valid = pad_sequences(X_valid, maxlen=window_size)

print(X_train.shape)
print(len(y_train))
print(X_valid.shape)
print(len(y_valid))
### End

### 2.3 Build your model (Fill the code: 10 points)


Here is what do you might need to do in this section:
1. Create a model.
2. Add an embedding layer as the first layer.
3. Add a RNN cell (GRU or LSTM) as the next layer.
4. Add a output layer.
5. Given a sequence words, for each word, predict the next word.

In [383]:
### Start your code
from keras.models import Sequential
from keras.layers import LSTMCell, Dense, Embedding
import numpy as np


rnn = Sequential()

# Add Embedding (input) layer
# input_dim = size of vocabulary
# output_dim = batch_size
# input length = window_size
rnn.add(Embedding(input_dim = len(train_vocabulary), output_dim = batch_size, input_length=window_size))

#Add RNNcell
rnn.add(LSTM(256))

# Add Dense layer 
rnn.add(Dense(len(train_vocabulary), activation='softmax'))

rnn.summary()


# Implement forward pass
LSTM_output = rnn(X_train)

# Define prediction function for next word
def predict_next_word(output, vocab):
    
    # Get index of maximum input
    ind = np.argmax(output, axis=1)
    
    return ind, [vocab[i] for i in ind]

# Predict next word for all of the sequences
train_ind_vector, train_word_vector = predict_next_word(LSTM_output, train_vocabulary)

### End


Model: "sequential_19"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_18 (Embedding)    (None, 20, 50)            494550    
                                                                 
 lstm_15 (LSTM)              (None, 256)               314368    
                                                                 
 dense_14 (Dense)            (None, 9891)              2541987   
                                                                 
Total params: 3,350,905
Trainable params: 3,350,905
Non-trainable params: 0
_________________________________________________________________


### 2.4 Setup the training step and train the model (Fill the code: 10 points)
Based on your implementation, setup your training process.

In [384]:
### Start your code
from keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import Adam

y_train_type_numpy = np.asarray(y_train).astype('float32')
y_vallid_type_numpy = np.asarray(y_valid).astype('float32')


# Setup the model
rnn.compile(optimizer=Adam(learning_rate=learning_rate), loss='sparse_categorical_crossentropy')

# Train RNN
history = rnn.fit(X_train, y_train_type_numpy, batch_size=batch_size, epochs=num_epochs, validation_data=(X_valid, y_vallid_type_numpy))
### End


Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


### 2.5 Evaluate the model (15 points)
Calculate the model's perplexity on the valid set.

#### 2.5.1 Deliverable (5 points)
Prove
<center>$perplexity = exp(\frac{total\ loss}{number\ of\ predictions})$
    
*You can either list the steps in the notebook or submit a pdf with all the steps in the submission.*

#### 2.5.2 Implement the algorithm to calculate the perplexity of the model. (10 points)

In [385]:
perp = 0.

### Start your code
 
#prepare test set
test_encoded = [one_hot(i, len(test_vocabulary)) for i in valid_clean_texts]

# Create X_test and y_test
X_test = [i[:-1] for i in test_encoded]

# we add 0.1 for all element cause some element is null
y_test = [float(random.randint(1, 9891)) if len(i[-1:]) == 0 else i[-1:][0] for i in test_encoded] 


# Padding X_test
X_test = pad_sequences(X_test, maxlen=window_size)


# Get test prediction
test_output = rnn(X_test) 


# Convert test_output to array
test_output = np.array(test_output)

total_loss = 0.

for i in range(len(test_output)):
    
    correct_prob = test_output[0][int(y_test[i])]
    
    total_loss += -np.log(correct_prob)


perp = np.exp(total_loss/len(test_output))


# #compare with caculation of perplexity with model prediction 

# #convert test set to numpy
# y_test_type_numpy = np.asarray(y_test).astype('float32')


# loss, _ = rnn.evaluate(X_test, y_test_type_numpy)

# perplexity = np.exp(loss)
# print(perplexity)

### End

print(f'The perplexity of of RNN based model is: {perp:.4f}')


The perplexity of of RNN based model is: 3309384858.1683


### 2.6 Use RNN language modeling make predictions (10 points)
Print the predictions of next words using the RNN model for the same 30 lines of input.txt as in section 1.6

In [386]:
### Start your code

input_texts_30 = test_clean_texts[:30]

# Encode the sequences using Keras integer mapping
input_30_encoded = [one_hot(i, len(train_vocabulary)) for i in input_texts_30]

# Pad encoded sequences
X_input_30 = pad_sequences(input_30_encoded, maxlen = window_size)

# Get probability predictions for next word
input_preds_30 = rnn(X_input_30)

# Get test prediction index vector
input_ind_30_vector, input_word_30_vector = predict_next_word(input_preds_30, train_vocabulary)


for i in range(30):
    print(str([input_texts_30[i]]) + '\t ' + "Predicted next word is " +  str([input_word_30_vector[i]]))
### End


['but while the new york stock exchange did nt fall ']	 Predicted next word is ['inefficient']
['some circuit breakers installed after the october n crash failed ']	 Predicted next word is ['berkeley']
['the n stock specialist firms on the big board floor ']	 Predicted next word is ['sometimes']
['big investment banks refused to step up to the plate ']	 Predicted next word is ['jay']
['heavy selling of standard poor s stock index futures ']	 Predicted next word is ['banking']
['seven big board stocks ual amr bankamerica walt disney capital ']	 Predicted next word is ['textile']
['once again the specialists were not able to handle the ']	 Predicted next word is ['khmer']
[' james chairman of specialists henderson brothers inc it ']	 Predicted next word is ['spokeswoman']
['when the dollar is in a even central banks ']	 Predicted next word is ['hahn']
['speculators are calling for a degree of liquidity that is ']	 Predicted next word is ['hedge']
['many money managers and some traders ha