# RNN Tagger

This example trains an RNN to tag words from a corpus.

The training data comes from a Wikipedia download, which is then artificially annotated with parts of speech by the NLTK PoS tagger written by Matthew Honnibal.


In [1]:
import tensorflow as tf
import tensorflow.contrib.keras as keras
import numpy as np

import os

SENTENCE_LENGTH_MAX = 32

## Basic Text and Parsing Tools

In [2]:
import nltk
from nltk.tokenize import TreebankWordTokenizer
sentence_splitter = nltk.data.load('tokenizers/punkt/english.pickle')
tokenizer = TreebankWordTokenizer()
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## Use a Wikipedia Corpus

From the corpus download page : http://wortschatz.uni-leipzig.de/en/download/

Here's the paper that explains how the corpus was constructed : 

*  D. Goldhahn, T. Eckart & U. Quasthoff: Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages.
    *  In: Proceedings of the 8th International Language Resources and Evaluation (LREC'12), 2012


In [3]:
corpus_text_file = './data/en.wikipedia.2010.100K.txt'

In [4]:
if not os.path.isfile(corpus_text_file):
    raise RuntimeError("You need to download the corpus file : "+
                       "http://pcai056.informatik.uni-leipzig.de/downloads/corpora/eng_wikipedia_2010_100K.tar.gz")
else:
    print("Corpus available locally")

Corpus available locally


In [5]:
def corpus_sentence_tokens(corpus_text_file):
    while True:
        with open(corpus_text_file, encoding='utf-8') as f:
            for line in f:
                n,l = line.split('\t')   # Strip off the initial numbers
                for s in sentence_splitter.tokenize(l):  # Split the lines into sentences (~1 each)
                    tokens = tokenizer.tokenize(s)
                    if len(tokens) < SENTENCE_LENGTH_MAX:
                        yield tokens
        print("Corpus : Looping")
corpus_sentence_tokens_gen = corpus_sentence_tokens(corpus_text_file)

In [6]:
' | '.join(next(corpus_sentence_tokens_gen))

'Showing | that | even | in | the | modern | warfare | of | the | 1930s | and | 1940s | , | the | dilapidated | fortifications | still | had | defensive | usefulness | .'

## Reference Tagger

In [7]:
from nltk.tag.perceptron import PerceptronTagger
pos_tagger = PerceptronTagger(load=True)
' | '.join(list(pos_tagger.classes))

"RP | ( | RB | VBG | PDT | VBP | RBR | IN | . | ) | VBD | NNS | EX | RBS | NNP | DT | POS | $ | , | `` | UH | : | WP | WRB | # | VB | VBN | NN | WDT | TO | CC | NNPS | WP$ | CD | PRP | '' | JJ | JJS | FW | PRP$ | VBZ | MD | SYM | LS | JJR"

In [8]:
s = "Let 's see what part of speech analysis on Jeff 's sample text looks like .".split(' ')
#s = next(corpus_sentence_tokens_gen)
pos_tagger.tag(s)

[('Let', 'VB'),
 ("'s", 'POS'),
 ('see', 'VB'),
 ('what', 'WP'),
 ('part', 'NN'),
 ('of', 'IN'),
 ('speech', 'NN'),
 ('analysis', 'NN'),
 ('on', 'IN'),
 ('Jeff', 'NNP'),
 ("'s", 'POS'),
 ('sample', 'NN'),
 ('text', 'JJ'),
 ('looks', 'VBZ'),
 ('like', 'IN'),
 ('.', '.')]

### Twist : Not interested in all classes...

To simplify (dramatically), our RNN will be trained to just tell the difference between 'is ordinary word' and 'is entity name'.

In [9]:
tag_list = ['O', 'E']
pos_tagger_entity_tags = {'NNP'}
pos_tagger_to_idx = {t: int(t in pos_tagger_entity_tags)
                     for t in pos_tagger.classes}
TAG_SET_SIZE = len(tag_list)

pos_tagger_to_idx['NNP'], pos_tagger_to_idx['VBP']

(1, 0)
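
As a quick illustration of what this mapping does to a whole sentence, the sketch below collapses the reference tagger's output from the earlier cell into `O`/`E` labels. The `tagged` list is copied from that output; `entity_tags` and `simple_tags` are names introduced here for illustration only.

```python
# Tagged sentence copied from the reference tagger output above
tagged = [('Let', 'VB'), ("'s", 'POS'), ('see', 'VB'), ('what', 'WP'),
          ('part', 'NN'), ('of', 'IN'), ('speech', 'NN'), ('analysis', 'NN'),
          ('on', 'IN'), ('Jeff', 'NNP'), ("'s", 'POS'), ('sample', 'NN'),
          ('text', 'JJ'), ('looks', 'VBZ'), ('like', 'IN'), ('.', '.')]

tag_list = ['O', 'E']      # 'O' = ordinary word, 'E' = entity name
entity_tags = {'NNP'}      # PoS tags that count as entities

# int(True) == 1 selects 'E'; int(False) == 0 selects 'O'
simple_tags = [tag_list[int(tag in entity_tags)] for _, tag in tagged]

print(' '.join('%s/%s' % (word, t) for (word, _), t in zip(tagged, simple_tags)))
```

Only `Jeff` ends up labelled `E`; every other token, including both possessive `'s` tokens, becomes `O`.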

## GloVe Word Embeddings

In [12]:
glove_100k_50d_path = './data/glove.6B/glove.first-100k.6B.50d.txt'

if not os.path.isfile(glove_100k_50d_path):
    raise RuntimeError("You need to download GloVe Embeddings "+
                       "from http://nlp.stanford.edu/data/")
else:
    print("GloVe available locally")

GloVe available locally


Due to size constraints, this notebook uses only the first 100k vectors (i.e. those for the 100k most frequently used words).

In [15]:
import glove
word_embedding = glove.Glove.load_stanford(glove_100k_50d_path)
word_embedding.word_vectors.shape

(100000, 50)
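
The `glove.first-100k.6B.50d.txt` file is a truncated version of a larger GloVe file. One way such a file might be produced, assuming the full file lists one word per line in descending frequency order, is to copy its first 100,000 lines. The function name and the paths in the usage comment are illustrative assumptions, not part of the original notebook.

```python
from itertools import islice

def keep_first_lines(src_path, dst_path, n_lines):
    """Copy the first n_lines lines of src_path into dst_path; return the count written."""
    written = 0
    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in islice(src, n_lines):
            dst.write(line)
            written += 1
    return written

# Hypothetical usage (the source file name is an assumption):
# keep_first_lines('./data/glove.6B/glove.6B.50d.txt',
#                  './data/glove.6B/glove.first-100k.6B.50d.txt', 100000)
```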

## An RNN Part-of-Speech Tagger

### RNN Main Parameters

In [16]:
BATCH_SIZE = 64

#### Make the Embedding Keras-Compatible

In [17]:
word_embedding.word_vectors.shape

(100000, 50)

In [18]:
EMBEDDING_DIM = word_embedding.word_vectors.shape[1]
word_embeddings = np.vstack([ 
        np.zeros((1, EMBEDDING_DIM,), dtype='float32'),   # This is the 'zero' value (used as a mask in Keras)
        np.zeros((1, EMBEDDING_DIM,), dtype='float32'),   # This is for 'UNK'  (word == 1)
        word_embedding.word_vectors,
    ])
word_embeddings.shape

(100002, 50)
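
Because two rows are prepended, the word at position `i` in the GloVe vocabulary ends up at row `i + 2` of the stacked matrix, with row 0 reserved for padding and row 1 for unknown words. A toy sketch of that lookup, using a made-up three-word `dictionary` in place of `word_embedding.dictionary`:

```python
# Toy stand-in for word_embedding.dictionary (word -> row in the GloVe matrix)
dictionary = {'the': 0, 'cat': 1, 'sat': 2}

PAD_IDX, UNK_IDX = 0, 1   # the two reserved rows at the top of the stacked matrix

def word_to_idx(word):
    # Unknown words get -1, which the +2 offset turns into UNK_IDX
    return dictionary.get(word.lower(), -1) + 2

sentence = ['The', 'cat', 'sat', 'quietly']
indices = [word_to_idx(w) for w in sentence]
print(indices)   # [2, 3, 4, 1]
```

Note that the lookup lowercases each word, so `The` and `the` share one row and capitalisation never reaches the network.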

### Synthesising a 'correct answer' for the Tagger

Normally, this would be the (manual) annotations from the corpus itself.  However, we don't have an annotated corpus.  Instead, we're going to use the annotations produced by the NLTK tagger, simplified to only identify 'NNP = entities'.

In [19]:
def word_to_idx_rnn(word):
    idx = word_embedding.dictionary.get(word.lower(), -1)  # since UNK=1 = (-1+2)
    return idx+2  # skip ahead 2 places

from keras.utils import to_categorical

def sentences_for_network(sentences, include_targets=False, one_hot_targets=False):
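    """
    sentences: list of sentences, each a list of word tokens
    include_targets: if True, also build target tags using the NLTK tagger
    one_hot_targets: if True, targets are one-hot vectors of size TAG_SET_SIZE
                     rather than integer class ids
    Returns (input_values, target_values); target_values is None
    when include_targets is False.
    """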
    len_of_list = len(sentences)
    #print("sentences_for_network.sentences.length = %d" % (len_of_list,))
    
    input_values = np.zeros((len_of_list, SENTENCE_LENGTH_MAX), dtype='int32')
    for i, sent in enumerate(sentences):
        for j, word in enumerate(sent):
            input_values[i,j] = word_to_idx_rnn(word)
    
    if not include_targets: 
        return (input_values, None)

    if one_hot_targets:
        # Add extra dimension here to suit Keras' TimeDistributed(Dense(softmax))
        #   as discussed : https://github.com/fchollet/keras/issues/6363
        target_values  = np.zeros((len_of_list, SENTENCE_LENGTH_MAX, TAG_SET_SIZE), dtype='int32')
    else:
        target_values  = np.zeros((len_of_list, SENTENCE_LENGTH_MAX), dtype='int32')
        
    for i, sent in enumerate(sentences):
        sentence_tags = pos_tagger.tag(sent)
        for j, word_tag in enumerate(sentence_tags):
            tag = word_tag[1] # tags are returned as tuples (word, tag)
            pos_class = pos_tagger_to_idx[tag]  # These are the class #s
            if one_hot_targets:
                target_values[i,j] = to_categorical(pos_class, num_classes=TAG_SET_SIZE)
            else:
                target_values[i,j] = pos_class

    return (input_values, target_values)

def batch_for_network_generator():
    while True:
        batch_of_sentences = [ next(corpus_sentence_tokens_gen) for i in range(BATCH_SIZE) ]    
        yield sentences_for_network(batch_of_sentences, include_targets=True, one_hot_targets=True)

Using TensorFlow backend.


#### Test the batchifier

This just finds the next values to be produced - it isn't needed below.

In [20]:
single_batch_input, single_batch_targets = next(batch_for_network_generator())
single_batch_input.shape, single_batch_targets.shape
#single_batch_input[0]
#single_batch_targets[0]

((64, 32), (64, 32, 2))
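
To make those shapes concrete, here is a dependency-free sketch of how one sentence's tags become a fixed-length block of one-hot rows. It mirrors `sentences_for_network` in leaving padded positions as all-zero rows. `MAX_LEN` is shortened to 6 purely for readability, and the helper name `one_hot_padded` is hypothetical.

```python
MAX_LEN = 6        # stands in for SENTENCE_LENGTH_MAX (32 in the notebook)
NUM_CLASSES = 2    # stands in for TAG_SET_SIZE

def one_hot_padded(class_ids, max_len=MAX_LEN, num_classes=NUM_CLASSES):
    """Return max_len rows of length num_classes; rows past the sentence end stay all-zero."""
    rows = [[0] * num_classes for _ in range(max_len)]
    for j, c in enumerate(class_ids):
        rows[j][c] = 1
    return rows

# Class ids for a five-token sentence such as 'New York is big .'
targets = one_hot_padded([1, 1, 0, 0, 0])
for row in targets:
    print(row)
```

Stacking 64 such blocks of 32 rows each gives the `(64, 32, 2)` target shape shown above.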

### Define the RNN Symbolically

#### Good blog post series
*  http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/

#### Keras Examples
*  https://github.com/fchollet/keras/issues/5022

#### Keras RNN
*  https://keras.io/layers/recurrent/


In [23]:
from keras.layers import Input, Embedding, GRU, Dense #, Activation
from keras.layers import Bidirectional, TimeDistributed
from keras.models import Model, Sequential

### Build the layers

In [24]:
RNN_HIDDEN_SIZE = word_embeddings.shape[1]

model = Sequential()
model.add(Embedding(word_embeddings.shape[0],
                    word_embeddings.shape[1],
                    weights=[word_embeddings],
                    input_length=SENTENCE_LENGTH_MAX,
                    trainable=False, 
                    mask_zero=True,
                    name="SentencesEmbedded"))

# Gated Recurrent Unit
model.add(Bidirectional(GRU(RNN_HIDDEN_SIZE, return_sequences=True),
                        merge_mode='concat'))

# Fully connected layer, applied independently at every word position
# RNN_HIDDEN_SIZE*2 because the forward and backward GRU outputs are concatenated
model.add(TimeDistributed(Dense(TAG_SET_SIZE, activation='softmax'), 
                          input_shape=(BATCH_SIZE, SENTENCE_LENGTH_MAX, RNN_HIDDEN_SIZE*2),
                          name='POS-class'))

Show the model

In [25]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
SentencesEmbedded (Embedding (None, 32, 50)            5000100   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 32, 100)           30300     
_________________________________________________________________
POS-class (TimeDistributed)  (None, 32, 2)             202       
Total params: 5,030,602
Trainable params: 30,502
Non-trainable params: 5,000,100
_________________________________________________________________
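
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the classic GRU layout, in which each of the three weight groups (update gate, reset gate, candidate state) has an input kernel, a recurrent kernel and a single bias vector; that assumption matches the 30,300 figure reported above.

```python
vocab_rows, embed_dim = 100002, 50   # shape of word_embeddings
hidden = 50                          # RNN_HIDDEN_SIZE
n_tags = 2                           # TAG_SET_SIZE

embedding_params = vocab_rows * embed_dim
# One GRU direction: 3 weight groups x (input kernel + recurrent kernel + bias)
gru_params = 3 * (embed_dim * hidden + hidden * hidden + hidden)
bidirectional_params = 2 * gru_params
# The Dense layer sees the concatenated forward and backward states
dense_params = (2 * hidden) * n_tags + n_tags

print(embedding_params, bidirectional_params, dense_params)
print(embedding_params + bidirectional_params + dense_params)
```

Only the GRU and Dense weights are trainable (30,300 + 202 = 30,502), since the embedding layer was created with `trainable=False`.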


### Loss Function for Training

In [26]:
model.compile(loss='categorical_crossentropy', optimizer="adam")  # , metrics=['accuracy']
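
For reference, `categorical_crossentropy` compares a one-hot target with the softmax output at each word position. A minimal hand computation for a single position, with made-up probabilities (the function here is a plain-Python stand-in, not the Keras implementation):

```python
import math

def categorical_crossentropy(target_one_hot, predicted_probs):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(target_one_hot, predicted_probs))

target = [0, 1]             # the word is an entity ('E')
confident = [0.1, 0.9]      # softmax output that agrees with the target
unsure = [0.6, 0.4]         # softmax output that leans the wrong way

print(round(categorical_crossentropy(target, confident), 4))
print(round(categorical_crossentropy(target, unsure), 4))
```

The loss is small when the network puts most of its probability on the correct class and grows as that probability shrinks.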

### Training phase for the RNN

This will actually **train** the RNN - which can take 3-5 minutes (depending on your CPU).

In [27]:
#model.fit(x, y_one_hot)
# 500 batches of BATCH_SIZE (64) sentences each
model.fit_generator(batch_for_network_generator(), steps_per_epoch=500, epochs=1, verbose=1)

Epoch 1/1


<keras.callbacks.History at 0x7fa513f91c88>

### Check that the Tagger Network 'works'

In [28]:
def tag_results_for(test_sentences):
    #sentences_for_network(sentences, include_targets=False, one_hot_targets=False)
    input_values, target_values_int = sentences_for_network(test_sentences, include_targets=True)

    rnn_output = model.predict_on_batch(input_values)

    # rnn_output here is a softmax-vector at every word location
    for i,sent in enumerate(test_sentences): # [0:5]):
        annotated = [ 
                "%s-%d-%d" % (word, target_values_int[i,j], np.argmax(rnn_output[i,j]), )    
                for j,word in enumerate(sent) 
            ]
        print(' '.join(annotated))

In [29]:
sentences = [
    "Dr. Andrews works at Red Cat Labs .",
    "Let 's see what part of speech analysis looks like .",
    "When are you off to New York , Chaitanya ?",
]

# Uncomment this for 8 sentences from the corpus
#sentences = [ ' '.join(next(corpus_sentence_tokens_gen)) for i in range(8) ]

test_sentences_mixed = [ s.split(' ') for s in sentences ]
test_sentences_title = [ s.title().split(' ') for s in sentences ]
test_sentences_single = [ s.lower().split(' ') for s in sentences ]
#test_sentences_single = [ s.upper().split(' ') for s in sentences ]

print("Format: WORD-NLTK-RNN\n")

tag_results_for(test_sentences_mixed)
print()
tag_results_for(test_sentences_title)
print()
tag_results_for(test_sentences_single)

Format: WORD-NLTK-RNN

Dr.-1-1 Andrews-1-1 works-0-0 at-0-0 Red-1-1 Cat-1-1 Labs-1-1 .-0-0
Let-0-0 's-0-0 see-0-0 what-0-0 part-0-0 of-0-0 speech-0-0 analysis-0-0 looks-0-0 like-0-0 .-0-0
When-0-0 are-0-0 you-0-0 off-0-0 to-0-0 New-1-0 York-1-1 ,-0-0 Chaitanya-1-1 ?-0-0

Dr.-1-1 Andrews-1-1 Works-1-0 At-0-0 Red-1-1 Cat-1-1 Labs-1-1 .-0-0
Let-0-0 'S-0-0 See-0-0 What-0-0 Part-1-0 Of-0-0 Speech-1-0 Analysis-1-0 Looks-1-0 Like-0-0 .-0-0
When-0-0 Are-1-0 You-0-0 Off-1-0 To-0-0 New-1-0 York-1-1 ,-0-0 Chaitanya-1-1 ?-0-0

dr.-0-1 andrews-0-1 works-0-0 at-0-0 red-0-1 cat-0-1 labs-0-1 .-0-0
let-0-0 's-0-0 see-0-0 what-0-0 part-0-0 of-0-0 speech-0-0 analysis-0-0 looks-0-0 like-0-0 .-0-0
when-0-0 are-0-0 you-0-0 off-0-0 to-0-0 new-0-0 york-0-1 ,-0-0 chaitanya-0-1 ?-0-0


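### And let's look at the Statistics

... actually, looking at the samples above, the NLTK PoS tagger is HOPELESS once the text is converted to a single case or to title case: lowercased, it finds no entities at all, and title-cased, it over-tags ordinary words. The RNN's predictions are identical across all three versions, because `word_to_idx_rnn` lowercases every word before the embedding lookup. QED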
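
For some rough numbers, one option is to parse the `WORD-NLTK-RNN` strings printed above and count how often the two taggers agree. This is a sketch over the three sample sentences only (copied from that output), so it says nothing about corpus-level accuracy; the function name `agreement` is hypothetical.

```python
def agreement(lines):
    """Return (agreeing tokens, total tokens) for WORD-NLTK-RNN formatted lines."""
    agree = total = 0
    for line in lines:
        for item in line.split(' '):
            # rsplit keeps words that themselves contain '-' intact
            _word, nltk_tag, rnn_tag = item.rsplit('-', 2)
            agree += int(nltk_tag == rnn_tag)
            total += 1
    return agree, total

samples = {
    'mixed': [
        "Dr.-1-1 Andrews-1-1 works-0-0 at-0-0 Red-1-1 Cat-1-1 Labs-1-1 .-0-0",
        "Let-0-0 's-0-0 see-0-0 what-0-0 part-0-0 of-0-0 speech-0-0 analysis-0-0 looks-0-0 like-0-0 .-0-0",
        "When-0-0 are-0-0 you-0-0 off-0-0 to-0-0 New-1-0 York-1-1 ,-0-0 Chaitanya-1-1 ?-0-0",
    ],
    'title': [
        "Dr.-1-1 Andrews-1-1 Works-1-0 At-0-0 Red-1-1 Cat-1-1 Labs-1-1 .-0-0",
        "Let-0-0 'S-0-0 See-0-0 What-0-0 Part-1-0 Of-0-0 Speech-1-0 Analysis-1-0 Looks-1-0 Like-0-0 .-0-0",
        "When-0-0 Are-1-0 You-0-0 Off-1-0 To-0-0 New-1-0 York-1-1 ,-0-0 Chaitanya-1-1 ?-0-0",
    ],
    'lower': [
        "dr.-0-1 andrews-0-1 works-0-0 at-0-0 red-0-1 cat-0-1 labs-0-1 .-0-0",
        "let-0-0 's-0-0 see-0-0 what-0-0 part-0-0 of-0-0 speech-0-0 analysis-0-0 looks-0-0 like-0-0 .-0-0",
        "when-0-0 are-0-0 you-0-0 off-0-0 to-0-0 new-0-0 york-0-1 ,-0-0 chaitanya-0-1 ?-0-0",
    ],
}

results = {name: agreement(lines) for name, lines in samples.items()}
for name, (agree, total) in results.items():
    print('%-5s %d/%d tokens agree' % (name, agree, total))
```

On these samples the two taggers agree on 28 of 29 tokens in mixed case, 21 of 29 in title case and 22 of 29 in lower case. Since the NLTK output is the reference here, the drop reflects the reference changing with case rather than the RNN getting worse.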

### Exercises

1.  Make the tagger identify different PoS (say : 'verbs')

2.  Make the tagger return several different tags instead

3.  See whether more advanced 'LSTM' nodes would improve the scores

4.  Add a special 'is_uppercase' element to the embedding vector (or, more simply, just replace one of the elements with an indicator).  Does this help the NNP accuracy?
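
As a possible starting point for the first exercise, the target mapping could be switched from proper nouns to verbs by selecting every Penn Treebank tag that starts with `VB`. The class list is copied from the reference tagger output earlier in the notebook; `verb_tags` and `verb_to_idx` are hypothetical names, and whether modals (`MD`) should also count as verbs is left open.

```python
# Tag classes copied from the PerceptronTagger output shown earlier
classes = ("RP | ( | RB | VBG | PDT | VBP | RBR | IN | . | ) | VBD | NNS | EX | "
           "RBS | NNP | DT | POS | $ | , | `` | UH | : | WP | WRB | # | VB | VBN | "
           "NN | WDT | TO | CC | NNPS | WP$ | CD | PRP | '' | JJ | JJS | FW | PRP$ | "
           "VBZ | MD | SYM | LS | JJR").split(' | ')

# Every verb form in this tag set starts with 'VB'
verb_tags = {t for t in classes if t.startswith('VB')}

# Same shape as pos_tagger_to_idx: 1 for the target class, 0 for everything else
verb_to_idx = {t: int(t in verb_tags) for t in classes}

print(sorted(verb_tags))
print(verb_to_idx['VBZ'], verb_to_idx['NNP'])
```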