# Word-level entailment with neural networks

In [3]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2016"

## Overview

__Problem__: For two words $w_{1}$ and $w_{2}$, predict $w_{1} \subset w_{2}$ or $w_{1} \supset w_{2}$. This is a basic, word-level version of the task of __Natural Language Inference__ (NLI).

__Approach__: Shallow feed-forward neural networks. Here's a broad overview of the model structure and task:

![fig/wordentail.png](fig/wordentail.png)

## Set-up

0. Make sure your environment includes all the requirements for [the cs224u repository](https://github.com/cgpotts/cs224u).
0. Download [the Wikipedia 2014 + Gigaword 5 distribution](http://nlp.stanford.edu/data/glove.6B.zip) of the pretrained GloVe vectors, unzip it, and put the resulting folder in the same directory as this notebook. (If you want to put it somewhere else, change `glove_home` below.)
0. Make sure `wordentail_data_filename` below is pointing to the full path for `wordentail_data.pickle`, which is included in the cs224u repository.

In [4]:
wordentail_data_filename = 'wordentail_data.pickle'
glove_home = "glove.6B"

## Contents

0. [Data](#Data)
0. [Neural network architecture](#Neural-network-architecture)



In [None]:
import os
import sys
import copy
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3
import random
from collections import defaultdict
import numpy as np
from numpy import dot, outer
from sklearn.metrics import classification_report
import tensorflow as tf
import utils

## Data

As suggested by the task description, the dataset consists of word pairs with a label indicating whether the first entails the second or the second entails the first. The pickled data distribution is a pair in which the first member is the vocabulary for the entire dataset and the second is a dictionary establishing train/test splits:

In [None]:
with open(wordentail_data_filename, 'rb') as f:
    wordentail_data = pickle.load(f)
vocab, splits = wordentail_data

The structure of `splits` provides a single training set and two different test sets that pose quite different tasks in the context of our neural architecture:

In [None]:
splits.keys()

* All three sets are disjoint. 

* The `test` vocab is a subset of the `train` vocab. So every word seen at test time was seen in training. 

* The `disjoint_vocab_test` set has a vocabulary that is totally disjoint from `train`. So none of its words are seen in training.

* All the words are in the GloVe vocabulary.
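
The properties in this list can be checked directly. The following is a minimal sketch on a toy, hypothetical `toy_splits` dictionary. It assumes the layout that `build_dataset` relies on later in this notebook, in which each split maps a class label to a list of word pairs, and it uses the key name `disjoint_vocab_test` that appears in the `experiment` code. The helpers `split_pairs` and `split_vocab` are illustrative names and are not part of the cs224u repository.

```python
# Toy stand-in for `splits` (hypothetical data, same assumed layout):
# {split_name: {class_label: [(w1, w2), ...]}}
toy_splits = {
    'train': {1.0: [('hippo', 'mammal'), ('puppy', 'dog')],
              -1.0: [('animal', 'dog')]},
    'test': {1.0: [('puppy', 'mammal')],
             -1.0: [('animal', 'hippo')]},
    'disjoint_vocab_test': {1.0: [('oak', 'tree')],
                            -1.0: [('plant', 'oak')]}}

def split_pairs(split):
    """Set of all word pairs in a split, ignoring labels."""
    return {pair for pairs in split.values() for pair in pairs}

def split_vocab(split):
    """Set of all words that occur in any pair of a split."""
    return {w for pair in split_pairs(split) for w in pair}

train_vocab = split_vocab(toy_splits['train'])
test_vocab = split_vocab(toy_splits['test'])
disjoint_vocab = split_vocab(toy_splits['disjoint_vocab_test'])

# No word pair is shared between train and test:
pairs_disjoint = split_pairs(toy_splits['train']).isdisjoint(
    split_pairs(toy_splits['test']))
# Every test word was seen in training:
test_seen = test_vocab <= train_vocab
# No disjoint_vocab_test word was seen in training:
disjoint_unseen = disjoint_vocab.isdisjoint(train_vocab)
```

Running the same three checks on the real `splits` object should give `True` in each case if the description above holds.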

The class labels are `1.0` if the first element entails the second and `-1.0` if the second entails the first. These values are chosen to suit the neural models we'll be using, specifically the `tanh` activation functions they use by default, whose outputs lie between $-1$ and $1$. It's also worth noting that we'll treat these labels using a one-dimensional output space, since the two classes are completely complementary.

In [None]:
SUBSET = 1.0    # Left word entails right, as in (hippo, mammal)
SUPERSET = -1.0 # Right word entails left, as in (mammal, hippo)
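
To make the connection between the labels and `tanh` concrete, here is a small sketch using only the standard library. The function `categorize` is a hypothetical helper that mirrors the thresholding used in the evaluation code later in this notebook, and the `scores` values are arbitrary.

```python
import math

SUBSET = 1.0
SUPERSET = -1.0

def categorize(raw_output):
    """Map a tanh output in (-1, 1) to a class label by its sign,
    sending 0.0 to SUPERSET as the evaluation code does."""
    return SUPERSET if raw_output <= 0.0 else SUBSET

# tanh squashes any real-valued score into the open interval (-1, 1),
# so the two labels sit at the limits of the output range:
scores = [-3.0, -0.2, 0.0, 0.4, 5.0]
outputs = [math.tanh(s) for s in scores]
labels = [categorize(o) for o in outputs]
```

Because the output never reaches exactly `1.0` or `-1.0`, the squared error against these targets can get very small but does not reach zero for finite weights.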

## Neural network architecture

For this notebook, we'll use a simple shallow neural network parameterized as follows:

* A weight matrix $W^{1}$ of dimension $m \times n$, where $m$ is the dimensionality of the input vector representations and $n$ is the dimensionality of the hidden layer.
* A bias term $b^{1}$ of dimension $1 \times n$.
* A weight matrix $W^{2}$ of dimension $n \times p$, where $p$ is the dimensionality of the output vector.
* A bias term $b^{2}$ of dimension $1 \times p$.

The network is then defined as follows, with $x$ the input layer, $h$ the hidden layer of dimension $n$, and $y$ the output of dimension $1 \times p$:

$$h = \tanh\left(xW^{1} + b^{1}\right)$$

$$y = \tanh\left(hW^{2} + b^{2}\right)$$
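
As a rough illustration of these two equations, the following NumPy sketch runs a single forward pass and tracks the shape of each term. The dimensions, the random seed, and the zero-valued biases are assumptions made for the example. With $x$ a $1 \times m$ row vector, each bias has to match the shape of the product it is added to.

```python
import numpy as np

m, n, p = 4, 3, 1                 # input, hidden, and output dimensions
rng = np.random.RandomState(0)    # fixed seed so the sketch is repeatable

x = rng.uniform(-0.5, 0.5, size=(1, m))   # input layer, 1 x m
W1 = rng.uniform(-0.5, 0.5, size=(m, n))  # first weight matrix, m x n
b1 = np.zeros((1, n))                     # first bias, 1 x n
W2 = rng.uniform(-0.5, 0.5, size=(n, p))  # second weight matrix, n x p
b2 = np.zeros((1, p))                     # second bias, 1 x p

h = np.tanh(x.dot(W1) + b1)   # hidden layer, 1 x n
y = np.tanh(h.dot(W2) + b2)   # output layer, 1 x p
```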

We'll first implement this from scratch and then reimplement it in TensorFlow. Our hope is that this will provide a firm foundation for your own exploration of neural models for the NLI task.

## From scratch

TensorFlow is a powerful library for building deep learning models. In essence, you define the model architecture and the details of optimization. In addition, it is very high-performance, so it will scale to large datasets and complicated model designs. So, we'll want to start using it shortly. However, before making that move, it's worth building up our simple shallow architecture from scratch, as a way to explore the concepts and avoid the dangers of black-box optimization.

In [None]:
def d_tanh(z):
    """The derivative of the hyperbolic tangent function, expressed
    in terms of its output: z should be tanh(a), as a float or
    np-array, and the return value is the derivative of tanh at a."""
    return 1.0 - z**2

def progress_bar(iteration, error):
    """Simple over-writing progress bar for tracking the speed
    and trajectory of training."""
    sys.stderr.write('\r')
    sys.stderr.write('completed iteration %s; error is %s' % ((iteration+1), error))
    sys.stderr.flush()

class ShallowNeuralNetwork:
    """Fit a model f(f(xW1 + b1)W2 + b2)"""
    def __init__(self, 
            input_dim=0, 
            hidden_dim=0, 
            output_dim=0, 
            afunc=np.tanh, 
            d_afunc=d_tanh,
            maxiter=100,
            eta=0.05,
            epsilon=1.5e-8,
            display_progress=True):
        """All the parameters are set as attributes.
        
        Parameters
        ----------
        input_dim, hidden_dim, output_dim : int, int, int
            The basic dimension of the network. input_dim
            and output_dim must match the dimensions of the
            training data. hidden_dim is free.
            
        afunc : vectorized activation function (default: np.tanh)
            The non-linear activation function used by the 
            network for the hidden and output layers.
            
        d_afunc :  vectorized activation function derivative (default: `d_tanh`)
            The derivative of `afunc`. It is not ensured that this 
            matches `afunc`, and craziness will result from mismatches.

        maxiter : int (default: 100)
            Maximum number of training epochs
            
        eta : float (default: 0.05)
            Learning rate.
            
        epsilon : float (default: 1.5e-8)
            Training terminates if the error reaches this
            point (or `maxiter` is met).
                    
        display_progress : bool (default: True)
           Whether to use the simple over-writing `progress_bar`
           to show progress.                    
        
        """
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim
        self.afunc = afunc 
        self.d_afunc = d_afunc 
        self.maxiter = maxiter
        self.eta = eta        
        self.epsilon = epsilon
        self.display_progress = display_progress
                
    def forward_propagation(self, ex):
        """Computes the forward pass. ex should be a vector
        of the same dimensionality as self.input_dim.
        No value is returned, but the output layer self.y
        is updated, as are self.x and self.h
        
        """        
        self.x[ : -1] = ex # ignore the bias
        self.h[ : -1] = self.afunc(dot(self.x, self.W1)) # ignore the bias
        self.y = self.afunc(dot(self.h, self.W2))        
        
    def backward_propagation(self, y_):
        """Send the error signal back through the network.
        y_ is the ground-truth label we compare against."""
        y_ = np.array(y_)       
        self.y_err = (y_ - self.y) * self.d_afunc(self.y)
        h_err = dot(self.y_err, self.W2.T) * self.d_afunc(self.h)
        self.W2 += self.eta * outer(self.h, self.y_err)
        self.W1 += self.eta * outer(self.x, h_err[:-1]) # ignore the bias
        return np.sum(0.5 * (y_ - self.y)**2)

    def fit(self, training_data): 
        """The training algorithm. 
        
        Parameters
        ----------
        training_data : list
            A list of (example, label) pairs, where `example`
            and `label` are both np.array instances.
        
        Attributes
        ----------
        self.x : the input layer
        self.h : the hidden layer
        self.y : the output layer
        self.W1 : dense weight connection from self.x to self.h
        self.W2 : dense weight connection from self.h to self.y
        
        Both self.W1 and self.W2 include the bias weights as their
        final row.
        
        The following attributes are created here for efficiency
        but used only in `backward_propagation`:
        
        self.y_err : vector of output errors
        self.x_err : vector of input errors 
        """
        # Parameter initialization:
        self.x = np.ones(self.input_dim+1)  # +1 for the bias                                         
        self.h = np.ones(self.hidden_dim+1) # +1 for the bias        
        self.y = np.ones(self.output_dim)        
        self.W1 = utils.randmatrix(self.input_dim+1, self.hidden_dim)
        self.W2 = utils.randmatrix(self.hidden_dim+1, self.output_dim)        
        self.y_err = np.zeros(self.output_dim)
        self.x_err = np.zeros(self.input_dim+1)
        # SGD:
        iteration = 0
        error = sys.float_info.max
        while error > self.epsilon and iteration < self.maxiter:            
            error = 0.0
            random.shuffle(training_data)
            for ex, labels in training_data:
                self.forward_propagation(ex)
                error += self.backward_propagation(labels)           
            if self.display_progress:
                progress_bar(iteration, error)
            iteration += 1
                    
    def predict(self, ex):
        """Prediction for `ex`, which must be featurized as the
        training data were. Simply runs `forward_propagation` and
        returns a copy of self.y."""
        self.forward_propagation(ex)
        return copy.deepcopy(self.y)
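
Two details in the class above are easy to miss, and the sketch below checks each one numerically. First, `d_tanh` is applied to activation values such as `self.y` and `self.h`, so it computes the derivative from the output of `tanh` rather than from its input. Second, the biases are folded into the weight matrices by appending a constant 1 to the layer and an extra row to the matrix. The test values, the step size, and the seed are assumptions chosen for the example.

```python
import numpy as np

def d_tanh(z):
    """Derivative of tanh, expressed in terms of z = tanh(a)."""
    return 1.0 - z**2

# 1. d_tanh expects the activation value, not the pre-activation input.
#    Compare it with a central finite-difference estimate of the slope.
a = np.array([-1.0, 0.0, 0.5])
step = 1e-6
numeric = (np.tanh(a + step) - np.tanh(a - step)) / (2 * step)
analytic = d_tanh(np.tanh(a))

# 2. Appending a constant 1 to x and a bias row to W gives xW + b.
rng = np.random.RandomState(0)
x = rng.uniform(-0.5, 0.5, size=4)
W = rng.uniform(-0.5, 0.5, size=(4, 3))
b = rng.uniform(-0.5, 0.5, size=3)
x_ext = np.append(x, 1.0)        # input plus the bias unit, length 5
W_ext = np.vstack([W, b])        # weights with b as the final row, 5 x 3
folded = np.dot(x_ext, W_ext)
explicit = np.dot(x, W) + b
```

Both comparisons should agree up to floating-point error, which can be confirmed with `np.allclose(numeric, analytic)` and `np.allclose(folded, explicit)`.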

## Input feature representation

Even in deep learning, feature representation is the most important thing and requires care!

For our task, feature representation has two parts: representing the individual words and combining those representations into a single network input.

### Representing the inputs

Our baseline word representation will be random vectors. This works well for the `test` task but is of course hopeless for the `disjoint_vocab_test` one.

In [None]:
def randvec(w, n=40, lower=-0.5, upper=0.5):
    """Returns a random vector of length n. w is ignored."""
    return np.array([random.uniform(lower, upper) for i in range(n)])

Whereas random inputs are hopeless for `disjoint_vocab_test`, GloVe vectors might not be ...

In [None]:
glove_src = os.path.join(glove_home, 'glove.6B.50d.txt')
GLOVE_MAT, GLOVE_VOCAB, _ = utils.build_glove(glove_src)

def glvvec(w):
    """Return the GloVe vector for w."""
    i = GLOVE_VOCAB.index(w)
    return GLOVE_MAT[i]
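
One practical note on the lookup above: `list.index` scans the list from the start, so each call takes time proportional to the vocabulary size. A possible alternative, sketched here on toy stand-ins, is to build a word-to-row dictionary once. This assumes that `GLOVE_VOCAB` is a list of words aligned with the rows of `GLOVE_MAT`, as the use of `.index` suggests. The names `toy_vocab`, `toy_mat`, `toy_index`, and `toy_glvvec` are hypothetical.

```python
import numpy as np

# Toy stand-ins for GLOVE_VOCAB and GLOVE_MAT (hypothetical values):
toy_vocab = ['hippo', 'mammal', 'dog']
toy_mat = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

# Build the word-to-row mapping once, up front:
toy_index = {w: i for i, w in enumerate(toy_vocab)}

def toy_glvvec(w):
    """Return the row of toy_mat for w. Unknown words raise KeyError."""
    return toy_mat[toy_index[w]]
```

The result is the same row that the `.index`-based version returns, but each lookup is a constant-time dictionary access.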

### Combining the inputs

Here we decide how to combine the two word vectors into a single representation. In more detail, where $x_{l}$ is a vector representation of the left word and $x_{r}$ is a representation of the right word, we need a combination function $\textbf{combine}$ such that $\textbf{combine}(x_{l}, x_{r})$ returns a new input vector $x$ of dimension $1 \times m$. $\textbf{combine}$ could be concatenation, vector average, vector difference, etc. (even combinations of those); there's lots of space for experimentation here.

In [None]:
def vec_concatenate(u, v):
    return np.concatenate((u, v))
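
A few of the other combination functions mentioned above could look like the following. These are illustrative sketches rather than functions from the course code, and the names `vec_diff`, `vec_average`, and `vec_concat_diff` are assumptions. Each takes two equal-length NumPy vectors and can be passed wherever `vec_concatenate` is used.

```python
import numpy as np

def vec_diff(u, v):
    """Element-wise difference. The output has the same length as u."""
    return u - v

def vec_average(u, v):
    """Element-wise mean. Note that it is symmetric in u and v."""
    return (u + v) / 2.0

def vec_concat_diff(u, v):
    """Both vectors followed by their difference, of length 3 * len(u)."""
    return np.concatenate((u, v, u - v))

u = np.array([1.0, 2.0])
v = np.array([0.5, 4.0])
```

One design point to keep in mind: the task is directional, so a symmetric function such as `vec_average` produces the same input for (hippo, mammal) and (mammal, hippo) and cannot separate them on its own. Difference and concatenation both preserve the order of the pair.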

## Building datasets for experiments

In [None]:
def build_dataset(wordentail_data, vector_func=randvec, vector_combo_func=vec_concatenate): 
    # Load in the dataset:
    vocab, splits = wordentail_data
    # Make vectors a mapping from words (as strings) to their vector
    # representations, as determined by vector_func.
    vectors = {w: vector_func(w) for w in vocab}
    # Create a dataset in the format required by the neural network:
    # {'train': [(vec, [cls]), (vec, [cls]), ...],
    #  'test':  [(vec, [cls]), (vec, [cls]), ...] }
    dataset = defaultdict(list)
    for split, data in splits.items():
        for clsname, word_pairs in data.items():
            for w1, w2 in word_pairs:
                # Use vector_combo_func to combine the word vectors for
                # w1 and w2, as given by the vectors dictionary above,
                # and pair it with the singleton array containing clsname.
                item = [vector_combo_func(vectors[w1], vectors[w2]), np.array([clsname])]
                dataset[split].append(item)
    return dataset

## Running experiments

In [None]:
def experiment(dataset, network): 
    # Get the train and test sets from the dataset:
    train = dataset['train']
    test = dataset['test']
    disjoint_vocab_test = dataset['disjoint_vocab_test']    
    # Set these dimensions based on the data:
    network.input_dim = len(train[0][0])
    network.output_dim = len(train[0][1])    
    # Train the network, with the number of iterations set by you
    # (make it a keyword argument to this function). You might want
    # to use display_progress=True to track errors and speed.
    network.fit(train)
    # The following is evaluation code. You won't have to alter it
    # unless you did something unexpected like transform the output
    # variables before training.
    for typ, data in (('train', train), ('test', test), ('disjoint_vocab_test', disjoint_vocab_test)):
        predictions = []
        cats = []
        for ex, cat in data:            
            # The raw prediction is a singleton list containing a float in (-1,1).
            # We want only its contents:
            prediction = network.predict(ex)[0]
            # Categorize the prediction for accuracy comparison:
            prediction = SUPERSET if prediction <= 0.0 else SUBSET            
            predictions.append(prediction)
            # Store the gold label for the classification report:
            cats.append(cat[0])
        # Report (single-argument print calls work in Python 2 and 3):
        print("======================================================================")
        print(typ)
        print(classification_report(cats, predictions, target_names=['SUPERSET', 'SUBSET']))

In [None]:
dataset = build_dataset(wordentail_data, vector_func=randvec, vector_combo_func=vec_concatenate)

network = ShallowNeuralNetwork(hidden_dim=40, maxiter=500, eta=0.05, display_progress=True)

experiment(dataset, network)

## Shallow neural network in TensorFlow

In [None]:
class TensorFlowShallowNeuralNetwork:
    def __init__(self, 
            input_dim=0, 
            hidden_dim=0, 
            output_dim=0,             
            maxiter=100,
            eta=0.05):
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim
        self.maxiter = maxiter
        self.eta = eta            
                
    def fit(self, training_data):
        self.sess = tf.InteractiveSession()
        # Network initialization:
        self.x = tf.placeholder(tf.float32, [None, self.input_dim])
        self.W1 = tf.Variable(tf.random_normal([self.input_dim, self.hidden_dim]))
        self.b1 = tf.Variable(tf.random_normal([self.hidden_dim]))
        self.W2 = tf.Variable(tf.random_normal([self.hidden_dim, self.output_dim]))
        self.b2 = tf.Variable(tf.random_normal([self.output_dim]))
        # Network structure:
        self.h = tf.nn.tanh(tf.matmul(self.x, self.W1) + self.b1)
        self.y = tf.nn.tanh(tf.matmul(self.h, self.W2) + self.b2)
        self.y_ = tf.placeholder(tf.float32, [None, self.output_dim])
        # Optimization:
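        # Despite the name, this is the summed (not mean) squared error,
        # halved, matching the error used by the from-scratch network:
        mse = tf.reduce_sum(0.5 * (self.y_-self.y)**2)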
        self.optimizer = tf.train.GradientDescentOptimizer(self.eta).minimize(mse)
        # Train:
        init = tf.initialize_all_variables()
        self.sess.run(init)        
        x, y_ = zip(*training_data)
        for iteration in range(self.maxiter):            
            self.optimizer.run(feed_dict={self.x: x, self.y_: y_})                       

    def predict(self, ex):
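        # Return the first (only) row so that the output has the same
        # shape as the output of ShallowNeuralNetwork.predict:
        return self.sess.run(self.y, feed_dict={self.x: [ex]})[0]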
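
One difference from the from-scratch class is worth spelling out: the TensorFlow version feeds the entire training set on every iteration, so each step is a full-batch gradient update rather than a per-example SGD update. The NumPy sketch below shows roughly what such a loop computes, with the gradients written out by hand. It is not a translation of what TensorFlow does internally. The toy data, the seed, the uniform initialization (the class above uses `tf.random_normal`), and the zero initial biases are all assumptions made for the example.

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed so the sketch is repeatable

# Toy data (hypothetical): the label is the sign of the first feature.
X = np.array([[1.0, 0.5], [0.8, -0.3], [-1.0, 0.2], [-0.7, -0.6]])
Y = np.array([[1.0], [1.0], [-1.0], [-1.0]])

input_dim, hidden_dim, output_dim = 2, 3, 1
eta = 0.05
W1 = rng.uniform(-0.5, 0.5, size=(input_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.uniform(-0.5, 0.5, size=(hidden_dim, output_dim))
b2 = np.zeros(output_dim)

def forward(X):
    """Return hidden and output activations for all rows of X at once."""
    H = np.tanh(X.dot(W1) + b1)
    return H, np.tanh(H.dot(W2) + b2)

def loss(X, Y):
    """Summed squared error, halved, as in the objective above."""
    return np.sum(0.5 * (Y - forward(X)[1])**2)

initial_loss = loss(X, Y)
for iteration in range(200):
    H, P = forward(X)
    # Gradients of the loss with respect to the pre-activations:
    d_out = (P - Y) * (1.0 - P**2)
    d_hid = d_out.dot(W2.T) * (1.0 - H**2)
    # One full-batch gradient step on every parameter:
    W2 -= eta * H.T.dot(d_out)
    b2 -= eta * d_out.sum(axis=0)
    W1 -= eta * X.T.dot(d_hid)
    b1 -= eta * d_hid.sum(axis=0)
final_loss = loss(X, Y)
```

Comparing `final_loss` with `initial_loss` gives a quick sanity check that the updates move in the right direction. Because each step sums gradients over all examples, the effective step size grows with the size of the training set, which is one reason a learning rate that suits per-example SGD may need adjusting here.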

In [None]:
dataset = build_dataset(wordentail_data, vector_func=randvec, vector_combo_func=vec_concatenate)
tfnet = TensorFlowShallowNeuralNetwork(hidden_dim=20, maxiter=1000)
experiment(dataset, tfnet)

## Bake-off

### Deep neural network in TensorFlow