<a href="https://colab.research.google.com/github/tanyadixit21/Deep-Learning-RNN-tasks/blob/master/RNN_introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we train part-of-speech taggers with different RNN architectures and look at several optimization techniques as well.

For data, the nltk.corpus package offers many options:
https://www.nltk.org/book/ch02.html

We will use the Brown corpus together with its part-of-speech tags:

https://www.nltk.org/book/ch05.html


In [14]:
!pip install scikit-learn



In [17]:
import nltk
import sys
import numpy as np
nltk.download('brown')
nltk.download('universal_tagset')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


True

In [0]:
data = nltk.corpus.brown.tagged_sents(tagset='universal') #tagged sentences; nltk.corpus.brown.tagged_words(tagset='universal') gives a flat list of (word, tag) pairs instead

In [0]:
all_tags = ['#EOS#','#UNK#','ADV', 'NOUN', 'ADP', 'PRON', 'DET', '.', 'PRT', 'VERB', 'X', 'NUM', 'CONJ', 'ADJ']

In [0]:
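#sentences have different lengths, so store them in a 1-D object array (recent NumPy versions reject ragged input without dtype=object)
data = np.asarray([[(word.lower(), tag) for word, tag in sentence] for sentence in data], dtype=object)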
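
To make the data layout concrete, here is a small sketch with a hand-written sentence (not taken from the corpus) in the same `(word, tag)` format. It shows the lowercasing step and how `zip(*sentence)`, used throughout the notebook, splits the pairs into parallel tuples.

```python
# A hand-written sentence in (word, tag) format; not taken from the corpus.
sentence = [("The", "DET"), ("dog", "NOUN"), ("barks", "VERB"), (".", ".")]

# Lowercase the words, as the notebook does for the whole corpus.
sentence = [(word.lower(), tag) for word, tag in sentence]

# zip(*pairs) transposes a list of pairs into two parallel tuples.
words, tags = zip(*sentence)
print(words)  # ('the', 'dog', 'barks', '.')
print(tags)   # ('DET', 'NOUN', 'VERB', '.')
```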

In [0]:
from sklearn.model_selection import train_test_split
train_data,test_data = train_test_split(data,test_size=0.25,random_state=42) #split into train and test dataset

In [0]:
from collections import Counter

In [0]:
word_counter = Counter()

for sentence in data: #data is an array of sentences
  words, tags = zip(*sentence) #split the (word, tag) pairs into words and tags
  word_counter.update(words)

In [35]:
all_words = ['#EOS#','#UNK#'] + [word for word, count in word_counter.most_common(20000)] #special tokens first, then the 20000 most frequent words

#let's measure what fraction of data words are in the dictionary
print("Coverage = %.5f"%(float(sum(word_counter[w] for w in all_words)) / sum(word_counter.values())))

Coverage = 0.96707
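
The coverage figure is simply the share of all tokens whose word made it into the vocabulary. A minimal sketch of the same computation on a made-up token stream, keeping only the 2 most frequent words instead of 20000:

```python
from collections import Counter

# Toy token stream; the notebook counts every word in the Brown corpus.
tokens = "the cat sat on the mat and the dog sat too".split()
counts = Counter(tokens)

# Keep only the 2 most frequent words (the notebook keeps 20000).
vocab = [word for word, _ in counts.most_common(2)]

# Coverage = tokens whose word is in the vocabulary / all tokens.
covered = sum(counts[w] for w in vocab)
coverage = covered / sum(counts.values())
print(vocab, "%.5f" % coverage)
```

Here 5 of the 11 tokens are covered ("the" three times, "sat" twice), so the coverage is about 0.45.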


In [0]:
from collections import defaultdict
word_to_id = defaultdict(lambda:1 , {word:i for i,word in enumerate(all_words)}) #words outside the dictionary map to id 1, the index of '#UNK#'
tag_to_id = {tag:i for i,tag in enumerate(all_tags)}
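
A short sketch of how the unknown-word fallback behaves, using a toy vocabulary (the names `toy_vocab` and `toy_word_to_id` are illustrative only). One detail worth knowing: indexing a `defaultdict` with a missing key also stores that key, whereas `.get()` does not.

```python
from collections import defaultdict

toy_vocab = ['#EOS#', '#UNK#', 'the', 'cat']  # toy vocabulary, not the real one
toy_word_to_id = defaultdict(lambda: 1, {w: i for i, w in enumerate(toy_vocab)})

# 'zebra' is not in the vocabulary, so it falls back to id 1 ('#UNK#').
ids = [toy_word_to_id[w] for w in ['the', 'zebra', 'cat']]
print(ids)  # [2, 1, 3]

# Side effect: the failed lookup above inserted 'zebra' into the dict.
print('zebra' in toy_word_to_id)  # True

# .get() with an explicit default looks up without inserting.
print(toy_word_to_id.get('yak', 1), 'yak' in toy_word_to_id)  # 1 False
```

The side effect is harmless here because every inserted word still maps to id 1, but the dictionary grows as unseen words are looked up.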

At this point we have our words, our tags and their ids. Next we need to turn batches of sentences into padded integer matrices for the model's input and output.

In [0]:
def to_matrix(lines,token_to_id,max_len=None,pad=0,dtype='int32',time_major=False):
    """Converts a list of token sequences into an RNN-digestible matrix of ids, padded after the end of each sequence"""

    #pad to the longest sequence unless an explicit max_len is given
    max_len = max_len or max(map(len,lines))
    matrix = np.empty([len(lines),max_len],dtype)
    matrix.fill(pad)

    for i in range(len(lines)):
        #map tokens to ids and truncate to max_len
        line_ix = list(map(token_to_id.__getitem__,lines[i]))[:max_len]
        matrix[i,:len(line_ix)] = line_ix

    #time-major layout is [time, batch]; the default is [batch, time]
    return matrix.T if time_major else matrix
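
To illustrate the padding and the `time_major` switch without the vocabulary lookup, here is a reduced sketch that works directly on lists of ids. `pad_batch` is a hypothetical helper written for this example, not part of the notebook.

```python
import numpy as np

def pad_batch(id_lists, pad=0, time_major=False):
    """Right-pad lists of ids to the longest list (hypothetical helper)."""
    max_len = max(map(len, id_lists))
    matrix = np.full((len(id_lists), max_len), pad, dtype='int32')
    for row, ids in enumerate(id_lists):
        matrix[row, :len(ids)] = ids
    return matrix.T if time_major else matrix

batch = [[5, 7, 2], [9]]
print(pad_batch(batch))                         # rows: [5 7 2] and [9 0 0]
print(pad_batch(batch, time_major=True).shape)  # (3, 2)
```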

In [45]:
batch_words,batch_tags = zip(*[zip(*sentence) for sentence in data[-3:]]) #3 sentences for testing

print("Word ids:")
print(to_matrix(batch_words,word_to_id))
print(to_matrix(batch_words,word_to_id).shape)
print("Tag ids:")
print(to_matrix(batch_tags,tag_to_id))
print(to_matrix(batch_tags,tag_to_id).shape)

Word ids:
[[    2  3057     5     2  2238  1334  4238  2454     3     6    19    26
   1070    69     8  2088     6     3     1     3   266    65   342     2
  11533     3     2   315     1     9    87   216  3322    69  1558     4
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0]
 [   45    12     8   511  8419     6    60  3246    39     2     1 10137
      3     2   845 14205     3     1     3    10  9910     2     1  3470
      9    43 11939     1     3     6     2  1046   385    73  4562     3
      9     2 19492 18192  3250     3    12    10     2   861  5240    12
      8  8936   121 19416     4]
 [   33    64    26    12   445     7  7346     9     8  3337     3 13074
   2811     3     2   463   572     2     1     1  1649    12     1     4
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0]]
(3

In [46]:
import keras
import keras.layers as L

model = keras.models.Sequential()
model.add(L.InputLayer([None],dtype='int32'))
model.add(L.Embedding(len(all_words),50))
model.add(L.SimpleRNN(64,return_sequences=True))

#add top layer that predicts tag probabilities
stepwise_dense = L.Dense(len(all_tags),activation='softmax')
stepwise_dense = L.TimeDistributed(stepwise_dense)
model.add(stepwise_dense)

Using TensorFlow backend.


Instructions for updating:
Colocations handled automatically by placer.
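
As a sanity check on the model size, the number of trainable parameters can be worked out by hand. The sketch below assumes `len(all_words)` is 20002 (the 20000 most common words plus the two special tokens) and uses the layer sizes from the cell above; `model.summary()` should report the actual numbers.

```python
vocab_size = 20002  # assumed: 20000 most common words + '#EOS#' + '#UNK#'
emb_dim, rnn_units, n_tags = 50, 64, 14

# Embedding: one emb_dim vector per word id.
embedding = vocab_size * emb_dim
# SimpleRNN: input weights + recurrent weights + bias.
simple_rnn = rnn_units * (emb_dim + rnn_units) + rnn_units
# Dense softmax applied at every time step: weights + bias.
dense = rnn_units * n_tags + n_tags

print(embedding, simple_rnn, dense, embedding + simple_rnn + dense)
```

Under these assumptions almost all of the parameters sit in the embedding table, while the recurrent layer and the tag classifier are comparatively tiny.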


# Batch Training

Training: in this case we don't want to prepare the whole training dataset as one matrix in advance. The main reason is that the padded length of every batch depends on the longest sentence within that batch, so padding everything to the longest sentence in the corpus would waste memory and computation. This leaves us two options: write custom training code or use generators.
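
A rough sketch of why per-batch padding matters, using made-up sentence lengths: a single long sentence forces every row to its length under global padding, but only affects its own batch otherwise.

```python
# Hypothetical sentence lengths for 8 sentences.
lengths = [4, 6, 5, 40, 7, 3, 6, 5]
batch_size = 4

# Padding everything to the global maximum length.
global_cells = len(lengths) * max(lengths)

# Padding each batch only to its own longest sentence.
batches = [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]
per_batch_cells = sum(len(b) * max(b) for b in batches)

print(global_cells, per_batch_cells)  # 320 188
```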

Keras models have a model.fit_generator method that accepts a Python generator yielding one batch at a time. But first we need to implement such a generator:

In [0]:
from keras.utils import to_categorical
BATCH_SIZE=32
def generate_batches(sentences,batch_size=BATCH_SIZE,max_len=None,pad=0):
    assert isinstance(sentences,np.ndarray),"Make sure sentences is a numpy array"
    
    while True:
        indices = np.random.permutation(np.arange(len(sentences)))
        for start in range(0,len(indices),batch_size):
            batch_indices = indices[start:start+batch_size]
            batch_words,batch_tags = [],[]
            for sent in sentences[batch_indices]:
                words,tags = zip(*sent)
                batch_words.append(words)
                batch_tags.append(tags)

            batch_words = to_matrix(batch_words,word_to_id,max_len,pad)
            batch_tags = to_matrix(batch_tags,tag_to_id,max_len,pad)

            batch_tags_1hot = to_categorical(batch_tags,len(all_tags)).reshape(batch_tags.shape+(-1,))
            yield batch_words,batch_tags_1hot
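
The one-hot step at the end of the generator turns a `[batch, time]` matrix of tag ids into a `[batch, time, n_tags]` tensor. A minimal NumPy-only sketch of the same transformation, with a toy tag count and indexing into an identity matrix instead of calling `to_categorical`:

```python
import numpy as np

n_tags = 4  # toy tag count; the notebook uses len(all_tags)
batch_tags = np.array([[2, 3, 1],
                       [1, 0, 0]])  # shape [batch, time]; 0 is the padding id

# Indexing an identity matrix with integer ids yields one-hot rows.
one_hot = np.eye(n_tags, dtype='float32')[batch_tags]

print(one_hot.shape)  # (2, 3, 4)
print(one_hot[0, 1])  # [0. 0. 0. 1.]
```

Note that padded positions also receive a one-hot target (class 0), so they contribute to the loss unless they are masked out.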
        

https://adventuresinmachinelearning.com/keras-lstm-tutorial/

In [0]:
def compute_test_accuracy(model):
    test_words,test_tags = zip(*[zip(*sentence) for sentence in test_data])
    test_words,test_tags = to_matrix(test_words,word_to_id),to_matrix(test_tags,tag_to_id)

    #predict tag probabilities of shape [batch,time,n_tags]
    predicted_tag_probabilities = model.predict(test_words,verbose=1)
    predicted_tags = predicted_tag_probabilities.argmax(axis=-1)

    #compute accuracy excluding padding
    numerator = np.sum(np.logical_and((predicted_tags == test_tags),(test_words != 0)))
    denominator = np.sum(test_words != 0)
    return float(numerator)/denominator


class EvaluateAccuracy(keras.callbacks.Callback):
    def on_epoch_end(self,epoch,logs=None):
        sys.stdout.flush()
        print("\nMeasuring validation accuracy...")
        acc = compute_test_accuracy(self.model)
        print("\nValidation accuracy: %.5f\n"%acc)
        sys.stdout.flush()
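
To see what excluding padding changes, here is the same masked accuracy on small hand-made arrays (all values are invented for illustration). Padded positions are trivially "correct" when both the target and the prediction are 0, which inflates an unmasked average.

```python
import numpy as np

toy_words = np.array([[5, 8, 3, 0, 0],
                      [7, 2, 9, 4, 6]])  # 0 marks padding
toy_tags = np.array([[3, 9, 7, 0, 0],
                     [6, 3, 9, 4, 7]])
toy_pred = np.array([[3, 9, 2, 0, 0],
                     [6, 3, 9, 4, 1]])

# Count correct predictions only where there is a real word.
mask = toy_words != 0
masked_acc = np.sum((toy_pred == toy_tags) & mask) / np.sum(mask)

# Averaging over every cell also counts the padded positions.
naive_acc = np.mean(toy_pred == toy_tags)

print(masked_acc, naive_acc)  # 0.75 0.8
```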

In [49]:
model.compile('adam','categorical_crossentropy')

#steps_per_epoch must be an integer number of batches
model.fit_generator(generate_batches(train_data),len(train_data)//BATCH_SIZE,
                    callbacks=[EvaluateAccuracy()], epochs=5)

Instructions for updating:
Use tf.cast instead.
Epoch 1/5

Measuring validation accuracy...

Validation accuracy: 0.95008

Epoch 2/5

Measuring validation accuracy...

Validation accuracy: 0.95560

Epoch 3/5

Measuring validation accuracy...

Validation accuracy: 0.95692

Epoch 4/5

Measuring validation accuracy...

Validation accuracy: 0.95593

Epoch 5/5

Measuring validation accuracy...

Validation accuracy: 0.95473



<keras.callbacks.History at 0x7f8f5470bf60>

In [50]:
acc = compute_test_accuracy(model)
print("Final accuracy: %.5f"%acc)

Final accuracy: 0.95473


In [0]:

birnn = keras.models.Sequential()
birnn.add(L.InputLayer([None],dtype='int32'))
birnn.add(L.Embedding(len(all_words),50))
birnn.add(L.Bidirectional(L.SimpleRNN(64,return_sequences=True), merge_mode='concat', weights=None))

#add top layer that predicts tag probabilities
stepwise_dense = L.Dense(len(all_tags),activation='softmax')
stepwise_dense = L.TimeDistributed(stepwise_dense)
birnn.add(stepwise_dense)
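
The idea behind the bidirectional wrapper with `merge_mode='concat'` can be sketched in plain NumPy: one RNN reads the sequence left to right, a second one with its own weights reads it right to left, and their per-step states are concatenated. This is a simplified illustration (no biases, random weights, a single sequence), not the Keras implementation.

```python
import numpy as np

def simple_rnn(x, w_in, w_rec):
    """Run a tanh RNN over x of shape [time, features]; return all hidden states."""
    h = np.zeros(w_rec.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(x_t @ w_in + h @ w_rec)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
time_steps, emb_dim, units = 5, 3, 4
x = rng.normal(size=(time_steps, emb_dim))

# Each direction has its own weights.
fwd = simple_rnn(x, rng.normal(size=(emb_dim, units)), rng.normal(size=(units, units)))
# Backward direction: read the reversed sequence, then flip the states back into time order.
bwd = simple_rnn(x[::-1], rng.normal(size=(emb_dim, units)), rng.normal(size=(units, units)))[::-1]

# 'concat' merge: every time step now has 2 * units features.
merged = np.concatenate([fwd, bwd], axis=-1)
print(merged.shape)  # (5, 8)
```

With 64 units per direction, as in the cell above, the dense layer on top therefore sees 128 features per time step, summarising both the left and the right context of each word.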

In [52]:
birnn.compile('adam','categorical_crossentropy')

#steps_per_epoch must be an integer number of batches
birnn.fit_generator(generate_batches(train_data),len(train_data)//BATCH_SIZE,
                    callbacks=[EvaluateAccuracy()], epochs=5)

Epoch 1/5

Measuring validation accuracy...

Validation accuracy: 0.96479

Epoch 2/5

Measuring validation accuracy...

Validation accuracy: 0.96930

Epoch 3/5

Measuring validation accuracy...

Validation accuracy: 0.96977

Epoch 4/5

Measuring validation accuracy...

Validation accuracy: 0.97004

Epoch 5/5

Measuring validation accuracy...

Validation accuracy: 0.96920



<keras.callbacks.History at 0x7f8f5470bb00>

In [53]:
acc = compute_test_accuracy(birnn)
print("Final accuracy: %.5f"%acc)

Final accuracy: 0.96920


In [60]:
!pip install theano




In [61]:

bi_lstm = keras.models.Sequential()
bi_lstm.add(L.InputLayer([None],dtype='int32'))
bi_lstm.add(L.Embedding(len(all_words),50))
#return_sequences=True keeps one output per time step, which TimeDistributed requires;
#Bidirectional wraps the LSTM so that each step sees both left and right context
bi_lstm.add(L.Bidirectional(L.LSTM(64, return_sequences=True, recurrent_initializer='orthogonal')))

#add top layer that predicts tag probabilities
stepwise_dense = L.Dense(len(all_tags),activation='softmax')
stepwise_dense = L.TimeDistributed(stepwise_dense)
bi_lstm.add(stepwise_dense)

In [54]:
bi_lstm.compile('adam','categorical_crossentropy')

#steps_per_epoch must be an integer number of batches
bi_lstm.fit_generator(generate_batches(train_data),len(train_data)//BATCH_SIZE,
                    callbacks=[EvaluateAccuracy()], epochs=5)
