# Natural language inference

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2016 term"

## Contents

0. [Overview](#Overview)
0. [Set-up](#Set-up)
0. [Working with SNLI](#Working-with-SNLI)
   0. [Trees](#Trees)
   0. [Readers](#Readers)
0. [MaxEnt classifier approach](#MaxEnt-classifier-approach)
   0. [Baseline classifier features](#Baseline-classifier-features)
   0. [Building datasets for experiments](#Building-datasets-for-experiments)
   0. [Training](#Training)
   0. [Running experiments](#Running-experiments)
0. [Recurrent neural network approach](#Recurrent-neural-network-approach)
0. [Exercises](#Exercises)
    0. [Feature selection](#A.-Feature-selection)
    0. [WordNet-based entailment features](#B.-WordNet-based-entailment-features)
    0. [A feed-forward neural baseline](#C.-A-neural-baseline)

## Overview

In the context of NLP/NLU, Natural Language Inference (NLI) is the task of predicting the logical relationships between words, phrases, sentences, and larger units such as paragraphs and documents. Such relationships are crucial for all kinds of reasoning in natural language: arguing, debating, problem solving, summarization, extrapolation, and so forth.

NLI is a great task for this course. It requires serious linguistic analysis to do well, there are good publicly available datasets, and there are some natural baselines that help with getting a model up and running, and with understanding the performance of more sophisticated approaches.  NLI was also the topic of [Bill's thesis](http://nlp.stanford.edu/~wcmac/papers/nli-diss.pdf) (he popularized the name "NLI"), so you can forever endear yourself to him by working on it!

We looked at NLI briefly in our word-level entailment bake-off (the `wordentail.ipynb` notebook). The purpose of this codebook is to introduce the problem of NLI more fully in the context of the [Stanford Natural Language Inference](http://nlp.stanford.edu/projects/snli/) corpus (SNLI). We'll explore two general approaches:

* Standard classifiers
* Recurrent neural networks

This should be a good starting point for exploring richer models of NLI.

In [2]:
import os
import re
import sys
import copy
import pickle
import codecs
import numpy as np
import itertools
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import tensorflow as tf
import utils

## Set-up

0. Make sure your environment includes all the requirements for [the cs224u repository](https://github.com/cgpotts/cs224u), especially TensorFlow, which isn't included in the standard Anaconda distribution (but is [easily installed](https://anaconda.org/jjhelmus/tensorflow)).
0. Make sure `snli_sample_src` is pointing to your copy of `snli_1.0_cs224u_sample.pickle`, which should be included in the repository in the `nli-data` subfolder. (Because SNLI is very large, we'll work with a small sample from the training set in class.)
0. For exercise B in section 6, make sure you've run `nltk.download()` to get the NLTK data.

In [3]:
snli_sample_src = os.path.join('nli-data', 'snli_1.0_cs224u_sample.pickle')

# Open in binary mode and close the file handle once the sample is loaded:
with open(snli_sample_src, 'rb') as f:
    snli_sample = pickle.load(f)

snli_sample.keys()

['train', 'dev']

## Working with SNLI

### Trees

In [4]:
WORD_RE = re.compile(r"([^ \(\)]+)", re.UNICODE)

def str2tree(s):
    """Turns labeled bracketing s into a tree structure (tuple of tuples)"""
    s = WORD_RE.sub(r'"\1",', s)
    s = s.replace("\\", "\\\\")
    s = s.replace(")", "),").strip(",")
    s = s.strip(",")
    return eval(s)

Here is `str2tree` applied to a simple example. For baseline models, we often want just the words, also called terminal nodes or _leaves_; the `leaves` function defined after the example returns them as a list:

In [5]:
str2tree("( ( A child ) ( is ( playing ( in ( a yard ) ) ) ) )")

(('A', 'child'), ('is', ('playing', ('in', ('a', 'yard')))))

In [6]:
def leaves(t):
    """Returns all of the words (terminal nodes) in tree t"""
    words = []
    for x in t:
        if isinstance(x, basestring):
            words.append(x)
        else:
            words += leaves(x)
    return words

In [7]:
leaves(str2tree("( ( A child ) ( is ( playing ( in ( a yard ) ) ) ) )"))

['A', 'child', 'is', 'playing', 'in', 'a', 'yard']
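
Since `str2tree` builds a Python expression and passes it to `eval`, it will evaluate whatever the input string turns into. As a rough sketch of an alternative (not part of the course code), the hypothetical function `parse_brackets` below reads the same labeled bracketings with an explicit stack, assuming that tokens are separated by whitespace and that parentheses are used only for structure:

```python
def parse_brackets(s):
    """Parse a bracketing like "( ( A child ) ( is playing ) )" into
    nested tuples without calling eval. Hypothetical helper."""
    stack = [[]]  # stack[0] collects the finished top-level tree
    for tok in s.replace("(", " ( ").replace(")", " ) ").split():
        if tok == "(":
            # Open a new constituent.
            stack.append([])
        elif tok == ")":
            if len(stack) < 2:
                raise ValueError("unbalanced ')' in: %r" % s)
            # Close the current constituent and attach it to its parent.
            subtree = tuple(stack.pop())
            stack[-1].append(subtree)
        else:
            # A word (terminal node).
            stack[-1].append(tok)
    if len(stack) != 1 or len(stack[0]) != 1:
        raise ValueError("expected exactly one complete tree in: %r" % s)
    return stack[0][0]

tree = parse_brackets("( ( A child ) ( is ( playing ( in ( a yard ) ) ) ) )")
print(tree)
```

On the example above this should produce the same nested tuples as `str2tree`; unbalanced input raises a `ValueError` rather than a syntax error from `eval`.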

### Readers

In [8]:
LABELS = ['contradiction', 'entailment', 'neutral']

In [9]:
def snli_reader(sample):
    for d in sample:
        yield (str2tree(d['sentence1_binary_parse']), 
               str2tree(d['sentence2_binary_parse']),
               d['gold_label'])
        
def train_reader():
    return snli_reader(snli_sample['train'])

def dev_reader():
    return snli_reader(snli_sample['dev'])
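
Before training, it can be useful to check how the labels are distributed in whatever a reader yields. The sketch below is illustrative only: `label_distribution` and `toy_reader` are hypothetical names, and the three toy examples stand in for the real sample so that the snippet runs on its own. With the notebook loaded, `train_reader` or `dev_reader` could be passed in instead.

```python
from collections import Counter

def label_distribution(reader):
    """Count gold labels over the (tree1, tree2, label) triples from `reader`."""
    return Counter(label for _, _, label in reader())

def toy_reader():
    # Hypothetical stand-in for `train_reader`/`dev_reader`.
    yield (('A', 'dog'), 'runs'), (('An', 'animal'), 'moves'), 'entailment'
    yield (('A', 'dog'), 'runs'), (('A', 'dog'), 'sleeps'), 'contradiction'
    yield (('A', 'dog'), 'runs'), (('A', 'dog'), 'barks'), 'neutral'

dist = label_distribution(toy_reader)
for label in sorted(dist):
    print("%s: %d" % (label, dist[label]))
```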

## MaxEnt classifier approach

### Baseline classifier features

The first baseline we define is the _word overlap_ baseline. It simply uses as
features the words that appear in both sentences.

In [10]:
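def word_overlap_phi(t1, t2):
    """Counts the words of t1 that also occur in t2."""
    # Compute the leaves of t2 once, rather than once per word of t1:
    t2_words = set(leaves(t2))
    overlap = [w1 for w1 in leaves(t1) if w1 in t2_words]
    return Counter(overlap)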

Another popular baseline is to use as features the full cross-product of
words from both sentences:    

In [11]:
def word_cross_product_phi(t1, t2):
    return Counter(itertools.product(leaves(t1), leaves(t2)))

Both of these feature functions return count dictionaries mapping feature names to the number of times they occur in the sentence pair. This is the representation we'll work with throughout; `sklearn` will handle the further processing needed to build linear classifiers.

Naturally, you can do better than these feature functions! Both of these feature classes might be useful even in a more advanced model, though.
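
To make the two representations concrete, here is a small worked example on a hand-built sentence pair. The functions are restated so that the snippet runs on its own; this `leaves` recurses on tuples rather than testing for strings, which is a simplification for the toy trees used here.

```python
import itertools
from collections import Counter

def leaves(t):
    """Return the words of a tree given as nested tuples."""
    words = []
    for x in t:
        if isinstance(x, tuple):
            words += leaves(x)
        else:
            words.append(x)
    return words

def word_overlap_phi(t1, t2):
    t2_words = set(leaves(t2))
    return Counter(w for w in leaves(t1) if w in t2_words)

def word_cross_product_phi(t1, t2):
    return Counter(itertools.product(leaves(t1), leaves(t2)))

premise = (('A', 'dog'), 'runs')
hypothesis = (('A', 'puppy'), 'runs')

overlap = word_overlap_phi(premise, hypothesis)
cross = word_cross_product_phi(premise, hypothesis)

print(sorted(overlap.items()))   # [('A', 1), ('runs', 1)]
print(len(cross))                # 9: every premise word paired with every hypothesis word
print(cross[('dog', 'puppy')])   # 1
```

The overlap features ignore _dog_/_puppy_ entirely, whereas the cross-product keeps that pair as its own feature, at the cost of a much larger feature space.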

### Building datasets for experiments

The first step in training a classifier is using a feature function like the ones above to turn the data into a list of _training instances_: feature representations and their associated labels:

In [12]:
def build_dataset(reader, phi=word_overlap_phi, vectorizer=None):
    feat_dicts = []
    labels = []
    raw_examples = []
    for t1, t2, label in reader():
        d = phi(t1, t2)
        feat_dicts.append(d)
        labels.append(label)   
        raw_examples.append((t1, t2))
    if vectorizer is None:
        vectorizer = DictVectorizer(sparse=True)
        feat_matrix = vectorizer.fit_transform(feat_dicts)
    else:
        feat_matrix = vectorizer.transform(feat_dicts)
    return {'X': feat_matrix, 
            'y': labels, 
            'vectorizer': vectorizer, 
            'raw_examples': raw_examples}
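
The reason `build_dataset` accepts an existing `vectorizer` is that assessment data must be mapped into the feature space learned from the training data. The toy sketch below, with made-up feature dictionaries, illustrates what `DictVectorizer` does in that case: features that never occurred during fitting are silently dropped by `transform`.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up feature dictionaries, standing in for the output of a feature function.
train_feats = [{'dog': 1, 'runs': 1}, {'cat': 2}]
dev_feats = [{'dog': 1, 'zebra': 3}]  # 'zebra' never occurs in training

vectorizer = DictVectorizer(sparse=True)
X_train = vectorizer.fit_transform(train_feats)  # learns the feature space
X_dev = vectorizer.transform(dev_feats)          # reuses it; 'zebra' is ignored

print(vectorizer.feature_names_)  # ['cat', 'dog', 'runs']
print(X_train.shape)              # (2, 3)
print(X_dev.toarray())
```

Both matrices have one column per training feature, so a model fit on `X_train` can be applied directly to `X_dev`.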

### Training

In [13]:
def fit_maxent_classifier(X, y):
    mod = LogisticRegression(fit_intercept=True, C=1.0, penalty='l2')
    mod.fit(X, y)
    return mod    

### Running experiments

In [14]:
def experiment(
        train_reader=train_reader, 
        assess_reader=dev_reader, 
        phi=word_overlap_phi,
        train_func=fit_maxent_classifier):    
    train = build_dataset(train_reader, phi)    
    assess = build_dataset(assess_reader, phi, vectorizer=train['vectorizer'])
    mod = train_func(train['X'], train['y'])
    predictions = mod.predict(assess['X'])
    return classification_report(assess['y'], predictions)

In [15]:
print experiment()

             precision    recall  f1-score   support

contradiction       0.38      0.54      0.45      1000
 entailment       0.44      0.35      0.39      1000
    neutral       0.37      0.29      0.32      1000

avg / total       0.40      0.39      0.39      3000



### A few ideas for better classifier features

* Cross product of synsets compatible with each word, as given by WordNet. (Here is [a codebook on using WordNet from NLTK to do things like this](http://compprag.christopherpotts.net/wordnet.html).)

* More fine-grained WordNet features &mdash; e.g., spotting pairs like _puppy_/_dog_ across the two sentences.

* Use of other WordNet relations (see Table 1 and Table 2 in [this codelab](http://compprag.christopherpotts.net/wordnet.html) for relations and their coverage).

* Using the tree structure to define features that are sensitive to how negation scopes over constituents.

* Features that are sensitive to differences in negation between the two sentences.

* Sentiment features seeking to identify contrasting polarity.

## Recurrent neural network approach

## Exercises

### A. Feature selection

Create a modification of `fit_maxent_classifier` called `fit_maxent_classifier_with_feature_selection` that does feature selection prior to fitting the model using [sklearn.feature_selection.SelectPercentile](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html#sklearn.feature_selection.SelectPercentile) with [sklearn.feature_selection.chi2](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2) as `score_func`. You can leave the default `percentile=10`. __Submit__:

* Your `fit_maxent_classifier_with_feature_selection`.
* Your use of `experiment` with `fit_maxent_classifier_with_feature_selection`.
* Your output from the `experiment` function call.

### B. WordNet-based entailment features

[Python NLTK](http://www.nltk.org) has an excellent WordNet interface. As noted above, WordNet is a natural choice for defining useful features in the context of NLI.

__Your task__: write and submit a feature function, for use with `build_dataset`, that is just like `word_cross_product_phi` except that, given a sentence pair $(S_{1}, S_{2})$, it counts only pairs $(w_{1}, w_{2})$ such that $w_{1}$ entails $w_{2}$, for $w_{1} \in S_{1}$ and $w_{2} \in S_{2}$. For example, the sentence pair (_the cat runs_, _the animal moves_) would create the dictionary `{(cat, animal): 1.0, (runs, moves): 1.0}`.

There are many ways to do this. For the purposes of the question, we can limit attention to the WordNet hypernym relation. The following illustrates reasonable ways to go from a string $s$ to the set of all hypernyms of Synsets consistent with $s$:

In [16]:
from nltk.corpus import wordnet as wn
    
puppies = wn.synsets('puppy')
print [h for ss in puppies for h in ss.hypernyms()]

# A more conservative approach uses just the first-listed 
# Synset, which should be the most frequent sense:
print wn.synsets('puppy')[0].hypernyms()

[Synset('dog.n.01'), Synset('pup.n.01'), Synset('young_person.n.01')]
[Synset('dog.n.01'), Synset('pup.n.01')]


### C. A neural baseline

This question asks you to define and evaluate a GloVe-based vector-average neural baseline for SNLI.

__Submit__: 

* A feature function `glove_average_concatenate_phi` that averages the GloVe representations of the words in the premise, averages the GloVe representations of the words in the hypothesis, and concatenates those two vectors.

* An experiment function `shallow_snli_experiment` that uses `glove_average_concatenate_phi` to train and assess a `TfShallowNeuralNetwork`.

Note: in this case, your network should have an output layer of dimension 3, one for each SNLI class. Your prediction can be the argmax of this predicted output layer: `LABELS[np.argmax(prediction)]`, where `prediction` is the output of `TfShallowNeuralNetwork.predict`. 

The code below is intended to provide guidance as to how to structure this answer. Please feel free to use a different design if you prefer.

In [18]:
glove_home = 'glove.6B'
GLOVE = utils.glove2dict(os.path.join(glove_home, 'glove.6B.50d.txt'))

In [27]:
import random

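def randvec(w, n=40, lower=-0.5, upper=0.5):
    """Return a random vector of length n; the argument w is not used."""
    return np.array([random.uniform(lower, upper) for i in range(n)])

def glove_features(t, dim=50):
    """Return the mean GloVe vector of the leaves in tree t."""
    vecs = [GLOVE[w] for w in leaves(t) if w in GLOVE]
    if vecs:
        return np.mean(vecs, axis=0)
    else:
        # No leaf has a GloVe vector, so fall back to a random
        # vector with the same dimensionality:
        return randvec('x', n=dim)
    
def vec_concatenate(u, v):
    return np.concatenate((u, v)) 

def glove_featurizer(t1, t2, dim=50):
    """Concatenate the mean GloVe vectors of t1 and t2. The value
    of dim should match the GloVe file loaded above."""
    return vec_concatenate(glove_features(t1, dim), glove_features(t2, dim))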
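
To see what the averaging and concatenation do, here is a toy illustration with a hand-made two-dimensional lookup in place of GloVe. The names `TOY_VECS` and `mean_vector` are hypothetical, and the fallback for sentences with no known words is a zero vector here (an assumption made to keep the example deterministic), not the random vector used above.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings standing in for GloVe.
TOY_VECS = {
    'dog': np.array([1.0, 0.0]),
    'runs': np.array([0.0, 1.0]),
    'cat': np.array([1.0, 1.0]),
}

def mean_vector(words, lookup, dim):
    """Average the vectors of the words found in `lookup`."""
    vecs = [lookup[w] for w in words if w in lookup]
    if vecs:
        return np.mean(vecs, axis=0)
    # Assumed fallback for this sketch: a zero vector.
    return np.zeros(dim)

premise_vec = mean_vector(['dog', 'runs'], TOY_VECS, dim=2)       # [0.5, 0.5]
hypothesis_vec = mean_vector(['cat', 'unknown'], TOY_VECS, dim=2)  # [1.0, 1.0]
features = np.concatenate((premise_vec, hypothesis_vec))

print(features)        # [0.5 0.5 1.  1. ]
print(features.shape)  # (4,)
```

The resulting input has twice the embedding dimension, so with 50-dimensional GloVe vectors the network sees 100 input features per sentence pair. Word order and tree structure are discarded by the averaging.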

In [20]:
def labelvec(label):
    """Return output vectors like [1,-1,-1], where the unique 1 is the true label."""
    vec = np.repeat(-1.0, 3)
    vec[LABELS.index(label)] = 1.0
    return vec
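
As a quick check on this encoding, the sketch below restates `LABELS` and `labelvec` and adds a hypothetical `decode` helper that maps an output vector back to a label via argmax, the decoding rule suggested in the note above.

```python
import numpy as np

LABELS = ['contradiction', 'entailment', 'neutral']

def labelvec(label):
    """Return output vectors like [1,-1,-1], where the unique 1 is the true label."""
    vec = np.repeat(-1.0, 3)
    vec[LABELS.index(label)] = 1.0
    return vec

def decode(vec):
    """Map an output vector to the label with the highest score (hypothetical helper)."""
    return LABELS[int(np.argmax(vec))]

print(labelvec('entailment'))                # [-1.  1. -1.]
print(decode(labelvec('neutral')))           # neutral
print(decode(np.array([-0.2, 0.7, 0.1])))    # entailment
```

Encoding followed by decoding returns the original label, and a real-valued network output is decoded by the same rule.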

In [35]:
def data_prep(reader, phi):         
    dataset = []
    for t1, t2, label in reader():     
        dataset.append([phi(t1, t2), labelvec(label)])
    return dataset

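def train_and_evaluate_network(network, phi, train_reader=dev_reader, assess_reader=dev_reader):
    # Note: `train_reader` defaults to `dev_reader`, so unless it is
    # overridden, the network is trained and assessed on the same data.
    # Pass `train_reader=train_reader` to train on the training sample.
    # Use `data_prep` to prepare the train and assess data with 
    # their readers and `phi`.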
    train = data_prep(train_reader, phi)
    assess = data_prep(assess_reader, phi)
    # Train `network` using its `fit` method:
    network.fit(train)
    # Store predictions and gold labels:
    predictions = []
    gold = []
    # Iterate through the assessment data:
    for ex, cat in assess:            
        # Use `network.predict` to get the prediction for `ex`:
        prediction = network.predict(ex)
        # Argmax dimension for the prediction:
        prediction = LABELS[np.argmax(prediction)]
        predictions.append(prediction)
        # Store the gold label for the classification report:
        gold.append(LABELS[np.argmax(cat)])        
    # Report:
    return classification_report(gold, predictions, target_names=LABELS)

In [36]:
import shallow_neural_networks

print train_and_evaluate_network(
    shallow_neural_networks.ShallowNeuralNetwork(hidden_dim=5, maxiter=100), 
    glove_featurizer)

completed iteration 100; error is 3781.86294107

               precision    recall  f1-score   support

contradiction       0.44      0.58      0.50      1000
   entailment       0.56      0.43      0.48      1000
      neutral       0.49      0.46      0.47      1000

  avg / total       0.50      0.49      0.49      3000






In [40]:
import shallow_neural_networks

print train_and_evaluate_network(
    shallow_neural_networks.TfShallowNeuralNetwork(), 
    glove_featurizer)

               precision    recall  f1-score   support

contradiction       0.33      1.00      0.50      1000
   entailment       0.00      0.00      0.00      1000
      neutral       0.00      0.00      0.00      1000

  avg / total       0.11      0.33      0.17      3000

