## Tagging as classification

Matthew Stone   
CS 533    
Initial version, Spring 2017.  Updated Spring 2018.

This notebook shows how to build a part-of-speech (POS) tagger, or more generally any sequence classifier, by classifying each token from features of the window of words around it.  It assumes that the working directory contains the `vocabulary.py` definitions of the vocabulary class, the `squtils` definitions for working conveniently with tagged sequence data, and the `GloVe` 6B 50-dimensional word embeddings.

In [6]:
from __future__ import print_function
import os
import nltk
import re
import itertools
import numpy as np
import scipy
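import sklearn
# Import the submodules used below explicitly; a bare `import sklearn`
# is not guaranteed to load them.
import sklearn.linear_model
import sklearn.metrics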
import vocabulary
import squtils

Here are the definitions for the machine learning problem that we'll be working with.

Note that accessing the data is slightly more complicated here.  We need a generator that yields one sequence per sentence, because we will have to add START and END tags to model the context at sentence boundaries, and we expect each item in a sequence to be a (word, tag) pair.  Since a generator is basically a "running function" that can be consumed only once, we define our access to the data via a function call that returns a fresh generator each time we want to run through the sequence.

In [9]:
vocab_file, vocab_file_type = "brown-vocab.pkl", "pickle"

# Adjust embedding_file to point at your local copy of the GloVe 6B vectors.
embedding_file = "D:/Rutgers/4th-Semester/Natural_Language_Processing/glove.6B/glove.6B.50d.txt"
embedding_dimensions = 50
embedding_cache = "brown-embedding.npy"

def mk_data_generator(subset='all') :
    if subset == 'dev' :
        start = 0
        stop = 123
    elif subset == 'test' :
        start = 123
        stop = 623
    elif subset == 'train' :
        start = 623
        stop = 4623
    else :
        start = 0
        stop = 4623
    return itertools.islice(nltk.corpus.brown.tagged_sents(categories='news',tagset='universal'),
                            start, stop)
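
The reason for wrapping data access in a function can be seen in a small, self-contained sketch.  An iterator such as the one `itertools.islice` returns can be consumed only once, so every pass over the data needs a fresh one.  The toy `sentences` list and the name `mk_toy_generator` are illustrative stand-ins, not part of the notebook's data or of `squtils`.

```python
import itertools

# Toy stand-in for a corpus of tagged sentences (illustrative data only).
sentences = [
    [("the", "DET"), ("dog", "NOUN")],
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("they", "PRON"), ("run", "VERB")],
]

def mk_toy_generator(start=0, stop=None):
    """Return a fresh one-pass iterator over sentences[start:stop]."""
    return itertools.islice(sentences, start, stop)

one_pass = mk_toy_generator(0, 2)
first_pass = list(one_pass)    # consumes the iterator
second_pass = list(one_pass)   # nothing left: the iterator is exhausted

# Calling the factory again yields a new iterator over the same data.
fresh_pass = list(mk_toy_generator(0, 2))

print(len(first_pass), len(second_pass), len(fresh_pass))
```

The second pass comes back empty, which is why the notebook calls `mk_data_generator` anew wherever it needs the data.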


The pattern below is familiar from last time: we want a set of word embeddings to use as features, so that we can generalize from sparse data across similar items.  The code below loads the subset of the 50-dimensional GloVe embeddings that we need for our vocabulary, and maintains an index between rows in the resulting matrix and specific English words.

The code also reports how many words are missing from GloVe (these are likely to be proper names and the like).

In [10]:
made_vocabulary = False
if made_vocabulary :
    words = vocabulary.Vocabulary.load(vocab_file, file_type=vocab_file_type)
else: 
    tokens = (w.lower() for (w,t) in itertools.chain.from_iterable(mk_data_generator()))
    words = vocabulary.Vocabulary.from_iterable(tokens, file_type=vocab_file_type)
    words.save(vocab_file)
words.stop_growth()
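
The notebook does not show `vocabulary.py`, so the following is only a guess at the behavior the code above relies on: `add` maps an item to an integer index, and after `stop_growth()` unknown items no longer receive new indices.  The class name `ToyVocabulary` and the choice to return `None` for unknown items are assumptions made for illustration.

```python
class ToyVocabulary:
    """Hypothetical stand-in for the course's Vocabulary class."""

    def __init__(self):
        self._index = {}
        self._frozen = False

    def add(self, item):
        """Return the index for item, or None if it is unknown and growth has stopped."""
        if item in self._index:
            return self._index[item]
        if self._frozen:
            return None
        self._index[item] = len(self._index)   # next unused index
        return self._index[item]

    def stop_growth(self):
        """Freeze the vocabulary so that no new items are added."""
        self._frozen = True

    def __len__(self):
        return len(self._index)

vocab = ToyVocabulary()
ids = [vocab.add(w) for w in ["the", "dog", "the"]]
vocab.stop_growth()
unknown = vocab.add("aardvark")   # unseen after freezing

print(ids, unknown, len(vocab))
```

Freezing matters because the indices double as row or column numbers: once a matrix has been built, its dimensions must not change.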

In [11]:
made_embedding = False
if made_embedding :
    e = squtils.load_dense_array(embedding_cache)
else: 
    e = squtils.build_dense_embedding(words, embedding_file, embedding_dimensions)
    squtils.save_dense_array(embedding_cache, e)

1282 words were not in glove
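
As a rough picture of what building the embedding matrix involves, here is a minimal sketch.  It relies only on the layout of the GloVe text files, where each line holds a word followed by its space-separated vector components.  The function `build_embedding_sketch` is hypothetical and is not the `squtils.build_dense_embedding` implementation, which this notebook does not show; it also uses plain lists where the notebook uses NumPy arrays.

```python
import io

def build_embedding_sketch(vocab_index, lines, dim):
    """Fill one row per vocabulary word from GloVe-style text lines.

    vocab_index maps word -> row number.  Words that never appear in
    `lines` keep an all-zero row.  Returns (matrix, number_missing).
    """
    matrix = [[0.0] * dim for _ in range(len(vocab_index))]
    found = set()
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in vocab_index and len(values) == dim:
            matrix[vocab_index[word]] = [float(v) for v in values]
            found.add(word)
    return matrix, len(vocab_index) - len(found)

# Tiny 3-dimensional example in the same text layout as the GloVe files.
toy_glove = io.StringIO("the 0.1 0.2 0.3\ndog 0.4 0.5 0.6\ncat 0.7 0.8 0.9\n")
toy_vocab = {"the": 0, "dog": 1, "rutgers": 2}

matrix, missing = build_embedding_sketch(toy_vocab, toy_glove, dim=3)
print("{} words were not in glove".format(missing))
```

Only the rows for words in the vocabulary are kept, so the matrix stays small even though the full GloVe file is large.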


Then we need to be able to represent each window using appropriate features.  For each of the five positions in the window, the code below uses the 50-dimensional word embedding, the prefixes and suffixes of one to three characters, and the word identity.

In [12]:
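def mkf(features, name, fl) :
    # Look up the column index for a feature name (creating it while the
    # feature vocabulary is still growing) and record it if one comes back.
    # Caveat: `if r` treats an index of 0 the same as a missing feature.
    r = features.add(name)
    if r :
        fl.append(r)

def word_feature_columns(features, code, item) :
    # Column indices for one window position; `code` names the position.
    f = []
    # one column per embedding dimension
    for i in range(0,50) :
        mkf(features, "{}:e{}".format(code, i), f)
    # word identity
    mkf(features, "{}:w_{}".format(code, item), f)
    # suffixes of length 1 to 3
    for i in range(1,4) :
        mkf(features, "{}:{}_{}".format(code, i, item[-i:]), f)
    # prefixes of length 1 to 3
    for i in range(1,4) :
        mkf(features, "{}:{}{}_".format(code, i, item[0:i]), f)
    return f

def word_feature_values(embeddings, vocab, item, f) :
    # Values matching the columns in f: the embedding (zeros if the word
    # is not found in the vocabulary), then 1 for each indicator feature.
    values = np.zeros(len(f))
    r = vocab.add(item) 
    if r: 
        values[:50] = embeddings[r]
    values[50:] = np.ones(len(f)-50)
    return values

def fivegram_features(features, embeddings, vocab, cxt) :
    # cxt is a window of five words: two before the target, the target
    # itself, and two after it.  Each position gets its own feature prefix.
    (n2b, n1b, t, n1a, n2a) = cxt
    f2b = word_feature_columns(features, "w_t", n2b)
    v2b = word_feature_values(embeddings, vocab, n2b, f2b)
    f1b = word_feature_columns(features, "wt", n1b)
    v1b = word_feature_values(embeddings, vocab, n1b, f1b)
    ft = word_feature_columns(features, "t", t)
    vt = word_feature_values(embeddings, vocab, t, ft)
    f1a = word_feature_columns(features, "tw", n1a)
    v1a = word_feature_values(embeddings, vocab, n1a, f1a)
    f2a = word_feature_columns(features, "t_w", n2a)
    v2a = word_feature_values(embeddings, vocab, n2a, f2a)
    # parallel arrays: column indices and the value stored in each column
    return np.concatenate([f2b, f1b, ft, f1a, f2a]), np.concatenate([v2b, v1b, vt, v1a, v2a])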
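
To make the feature naming concrete, the sketch below lists the sparse feature names that the format strings in `word_feature_columns` produce for one window position.  The helper `toy_feature_names` is introduced here for illustration only; it leaves out the 50 embedding columns (`t:e0` through `t:e49`) and does not touch the feature vocabulary.

```python
def toy_feature_names(code, item):
    """List the indicator feature names generated for one window position."""
    names = ["{}:w_{}".format(code, item)]                    # word identity
    for i in range(1, 4):
        names.append("{}:{}_{}".format(code, i, item[-i:]))   # suffixes
    for i in range(1, 4):
        names.append("{}:{}{}_".format(code, i, item[0:i]))   # prefixes
    return names

names = toy_feature_names("t", "refuse")
print(names)
# ['t:w_refuse', 't:1_e', 't:2_se', 't:3_use', 't:1r_', 't:2re_', 't:3ref_']
```

Because the position code is part of every name, the same word contributes different columns depending on where it sits in the window.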

In [13]:
features = vocabulary.Vocabulary()

def mk_encoder(features, embeddings, vocab) :
    def encode(cxt) :
        return fivegram_features(features, embeddings, vocab, cxt)

    return encode

feature_encoder = mk_encoder(features, e, words)
def size_f() : return len(features)

Build the training data, then freeze the set of features we're going to be using so that later data cannot add new columns.

In [14]:
Xtr, ytr = squtils.mk_tagging_data(feature_encoder, size_f, 
                                   mk_data_generator(subset='train'))
features.stop_growth()
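
The encoder expects a five-word context for every token, so each sentence has to be padded at both ends before it is cut into windows.  The following is a minimal sketch of that step.  The padding tokens `<START>` and `<END>` and the function `toy_windows` are assumptions for illustration; `squtils.mk_tagging_data` is not shown in this notebook and may differ.

```python
def toy_windows(tagged_sentence, pad_left="<START>", pad_right="<END>"):
    """Yield (five-word window, tag) pairs, one per token in the sentence."""
    words = [w for (w, _) in tagged_sentence]
    padded = [pad_left] * 2 + words + [pad_right] * 2
    for i, (_, tag) in enumerate(tagged_sentence):
        # padded[i + 2] is the token being tagged; take two words either side
        yield tuple(padded[i:i + 5]), tag

sentence = [("they", "PRON"), ("refuse", "VERB"), ("permits", "NOUN")]
pairs = list(toy_windows(sentence))
for window, tag in pairs:
    print(window, tag)
```

Each window would then go through the feature encoder to give one row of the training matrix, with the tag as that row's label.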

In [40]:
Xtr

<86156x88853 sparse matrix of type '<class 'numpy.float64'>'
	with 24468304 stored elements in Compressed Sparse Row format>
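
The matrix is stored in Compressed Sparse Row (CSR) format because only a few hundred of the 88853 columns are active in any one row.  As an illustration of the format, the sketch below packs per-row (column indices, values) pairs, like those the encoder returns, into the three CSR arrays using plain Python lists.  The helper names are hypothetical, and this is not how `squtils` necessarily builds the matrix.

```python
def rows_to_csr(rows):
    """Pack rows of (column indices, values) into the three CSR arrays."""
    data, indices, indptr = [], [], [0]
    for cols, vals in rows:
        indices.extend(cols)
        data.extend(vals)
        indptr.append(len(indices))   # offset where the next row starts
    return data, indices, indptr

def csr_row(data, indices, indptr, i):
    """Recover row i as a {column: value} dictionary."""
    lo, hi = indptr[i], indptr[i + 1]
    return dict(zip(indices[lo:hi], data[lo:hi]))

# Two toy rows: each lists its active columns and their values.
rows = [([0, 3], [0.5, 1.0]), ([1, 3, 4], [1.0, 1.0, 1.0])]
data, indices, indptr = rows_to_csr(rows)

print(indptr)
print(csr_row(data, indices, indptr, 1))
```

The same three arrays can be passed to `scipy.sparse.csr_matrix((data, indices, indptr), shape=...)` to obtain a matrix object like the one displayed above.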

We build a classifier as always: here, logistic regression trained by stochastic gradient descent with an elastic-net penalty.

In [41]:
# loss="log" selects logistic regression.  scikit-learn 1.1 and later
# name this loss "log_loss".
classifier = \
sklearn.linear_model.SGDClassifier(loss="log",
                                   penalty="elasticnet",
                                   max_iter=5)

classifier.fit(Xtr, ytr)

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='log', max_iter=5, n_iter=None,
       n_jobs=1, penalty='elasticnet', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False)
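
For reference, the printed parameters correspond to the objective below for each one-vs-rest binary problem, following the formulation in the scikit-learn documentation for SGD.  The symbols are notation introduced here rather than names from the notebook: $x_i$ is a feature row, $y_i \in \{-1, +1\}$ says whether token $i$ carries the tag in question, $\alpha = 0.0001$ is the `alpha` parameter, and $\lambda = 0.15$ is `l1_ratio`.

```latex
E(w, b) = \frac{1}{n} \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \,(w^\top x_i + b)\right)\right)
          + \alpha \left( \frac{1 - \lambda}{2} \, \lVert w \rVert_2^2 + \lambda \, \lVert w \rVert_1 \right)
```

The first term is the logistic loss selected by `loss="log"`; the second is the elastic-net penalty, which mixes an L2 term with a smaller L1 term that pushes some weights toward zero.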

Now if we want to see what the classifier is doing, we need some help from `squtils`.

First, get the development data and measure the classifier's accuracy on it.

In [42]:
Xd, yd = squtils.mk_tagging_data(feature_encoder, size_f, 
                                 mk_data_generator(subset='dev'))
pd = classifier.predict(Xd)
print(sklearn.metrics.accuracy_score(yd, pd))

0.9643721895537876
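
A single accuracy figure does not show which tags get confused with which.  One simple way to look further is to count the (gold, predicted) pairs among the errors, as in the sketch below.  The toy label lists are made up for illustration; in the notebook the corresponding sequences would be `yd` and `pd`, and `confusion_counts` is a helper name introduced here.

```python
from collections import Counter

def confusion_counts(gold, predicted):
    """Count (gold tag, predicted tag) pairs for the tokens tagged wrongly."""
    return Counter((g, p) for g, p in zip(gold, predicted) if g != p)

# Toy labels for illustration only.
gold = ["NOUN", "VERB", "NOUN", "DET", "VERB"]
predicted = ["NOUN", "NOUN", "NOUN", "DET", "NOUN"]

errors = confusion_counts(gold, predicted)
accuracy = 1 - sum(errors.values()) / len(gold)

print(accuracy)
print(errors.most_common(1))
```

Sorting the counts with `most_common` puts the most frequent confusion first, which is often the quickest pointer to a missing feature.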


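Finally, play with the tagger on a sentence in which "refuse" and "permit" each occur once as a verb and once as a noun.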

In [43]:
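# test_tagger returns a lazy zip object under Python 3, which displays only
# as <zip at 0x...>; wrap it in list() to see its contents.
list(squtils.test_tagger(feature_encoder, size_f, classifier,
                         "they refuse to permit us to obtain the refuse permit"))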