## Imports

In [3]:
import torch, numpy as np, pandas as pd
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import (GPT2Model,GPT2LMHeadModel, 
                          GPT2Config, GPT2Tokenizer,
                         BertConfig, BertTokenizer,
                         BertModel)

## Some Configuration

In [4]:
# Dictionary encoding pretrained (config, tokenizer, model) options we might consider:
language_model_dict = {'bert-base-uncased': [BertConfig, BertTokenizer, BertModel],
             'bert-base-multilingual-cased': [BertConfig, BertTokenizer, BertModel],
              'gpt2': [GPT2Config, GPT2Tokenizer, GPT2LMHeadModel],
              'gpt2-xl': [GPT2Config, GPT2Tokenizer, GPT2Model]
             }

# Dictionary encoding pretrained (config, tokenizer, model) options we might consider:
model_dict = {'bert-base-uncased': [BertConfig, BertTokenizer, BertModel],
             'bert-base-multilingual-cased': [BertConfig, BertTokenizer, BertModel],
              'gpt2': [GPT2Config, GPT2Tokenizer, GPT2LMHeadModel],
              'gpt2-xl': [GPT2Config, GPT2Tokenizer, GPT2LMHeadModel]
             }

# Choose a huggingface pretrained model from the list above, and maybe other options moving forward
config = {"model_name":'gpt2'}

# Choose a pretrained model
pretrained_model = config["model_name"]

# Set device
device = 'cuda' if torch.cuda.is_available() else 'cpu'

## Load a pretrained gpt2 model

In [5]:
# Load tokenizer
tokenizer = model_dict[pretrained_model][1].from_pretrained(pretrained_model)

# Load configuration for the chosen model, and set all hidden states to be output
config = model_dict[pretrained_model][0].from_pretrained(
    pretrained_model, output_hidden_states=True, use_cache=False,
    pad_token_id=tokenizer.eos_token_id)

# Load the pretrained model with the desired config
model = model_dict[pretrained_model][2].from_pretrained(pretrained_model, config=config)
model.eval()
model = model.to(device)

## Example Language Modeling usage

### 1) Use model.generate wrapper

In [6]:
# Tokenize some input
tokens = tokenizer.encode("Can you guess what I am going to say", return_tensors='pt')
tokens = tokens.to(device)

In [7]:
# Use the model to generate text beginning with previous text as context, 
# by using top-k decoding within the model.generate wrapper
tokenizer.decode(model.generate(tokens, do_sample=True, 
    max_length=30, top_k = 30)[0],skip_special_tokens=True)

"Can you guess what I am going to say? That's pretty bad.\n\nWhat's the best word for this kind of thing?\n\n"

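For intuition, a single step of top-k decoding roughly amounts to keeping the k highest-scoring vocab items, renormalizing, and sampling among them. The following is a minimal sketch on a made-up logits vector, not the library's implementation; `sample_top_k` and `toy_logits` are hypothetical names introduced here.

```python
import torch

def sample_top_k(logits, k, generator=None):
    """Sample one token id from the k highest-scoring entries of a 1-D logits vector."""
    # Keep only the k largest logits and remember their vocabulary indices
    top_values, top_indices = torch.topk(logits, k)
    # Renormalize the kept scores into a probability distribution
    probs = torch.softmax(top_values, dim=-1)
    # Draw one position among the k candidates, then map it back to a vocab id
    choice = torch.multinomial(probs, num_samples=1, generator=generator)
    return top_indices[choice].item()

# Toy vocabulary of 6 items; with k=2 only ids 1 and 4 can ever be sampled
toy_logits = torch.tensor([0.1, 3.0, -1.0, 0.5, 2.5, 0.0])
sampled = sample_top_k(toy_logits, k=2)
```

Larger values of k admit more unlikely continuations, while k=1 reduces to greedy decoding.
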
### 2) Don't use the wrapper

In [8]:
# Alternative: use the model to generate text beginning with previous text as context,
# by using greedy decoding directly from the LMHead model output:

# Extract final output layer from the LM
with torch.no_grad():
    # All outputs from the Language model
    outputs = model(tokens)
    # The logits output for each of the input tokens:
    predictions = outputs[0]

This ``predictions`` output gives a score for each element of the vocab, for each of the input tokens. The scores at position i are the model's prediction for the token that follows input token i, which is why only the last position matters when generating the next word. This is documented in the Returns section of transformers.GPT2LMHeadModel, as described here: https://huggingface.co/transformers/model_doc/gpt2.html#gpt2lmheadmodel . Our ``predictions`` is the returned ``logits`` in the documentation.

In [9]:
# Check the shape to be sure:
# Shape = (batch_size, num_tokens, vocab_size)
predictions.shape

torch.Size([1, 9, 50257])

In [10]:
# Greedily generate the next word by finding the highest scoring vocab
# item for the last token in the input.
predicted_index = torch.argmax(predictions[0, -1, :]).reshape(1,1)
predicted_text = tokenizer.decode(torch.cat((tokens,predicted_index), dim =1).reshape(-1))
predicted_text

'Can you guess what I am going to say?'
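The single greedy step above can be repeated to generate several tokens. Here is a minimal sketch, assuming a callable `logits_fn` (a hypothetical name) that maps token ids of shape (1, n) to logits of shape (1, n, vocab_size). A toy stand-in replaces the language model so that the block runs without downloading anything.

```python
import torch

def greedy_decode(logits_fn, tokens, num_new_tokens):
    """Append num_new_tokens greedily chosen token ids to tokens of shape (1, n)."""
    for _ in range(num_new_tokens):
        with torch.no_grad():
            logits = logits_fn(tokens)  # (1, n, vocab_size)
        # Highest-scoring vocab item at the last position
        next_id = torch.argmax(logits[0, -1, :]).reshape(1, 1)
        tokens = torch.cat((tokens, next_id), dim=1)
    return tokens

# Toy stand-in for a language model over a 5-item vocabulary:
# it always scores (last token id + 1) mod 5 highest.
def toy_logits_fn(tokens):
    vocab_size = 5
    next_ids = (tokens + 1) % vocab_size  # (1, n)
    return torch.nn.functional.one_hot(next_ids, vocab_size).float()

generated = greedy_decode(toy_logits_fn, torch.tensor([[3]]), num_new_tokens=4)
```

With the model loaded in this notebook, the call would look something like `greedy_decode(lambda t: model(t)[0], tokens, 10)`, followed by `tokenizer.decode` on the first row. Since `use_cache` is disabled here, each step re-runs the model over the whole sequence.
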

## Extracting Token Embedding Representations from LM

Let's now chop off the head of this model and extract the last hidden state as a sort of high-level embedding. Note that by virtue of the pretrained model itself, we'll get an embedding per input token in doing this, and we will need to come up with our own way of associating a single embedding with a sentence. We'll separately instantiate the cut-off head as its own decoder model to read these embeddings. Finally, we'll investigate the effect of negating sentences in two ways:
 
1.  Studying the embeddings themselves, e.g. by clustering, studying norm distributions, angle distributions, etc.
2.  Considering the differences of the word embeddings associated to (clause, negation) pairs, and then looking at the distribution of sentiments/clauses/phrases occurring when those differences are decoded by the LM head.
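
A minimal sketch of separating the body from the head, assuming the Hugging Face `GPT2LMHeadModel` layout in which `model.transformer` is the body and `model.lm_head` is the linear head. To stay self-contained it uses a tiny, randomly initialized model rather than the pretrained one; `tiny_config`, `tiny_model`, the config sizes, and the token ids are all arbitrary choices made for illustration.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny randomly initialized model (arbitrary sizes) so nothing is downloaded
torch.manual_seed(0)
tiny_config = GPT2Config(vocab_size=100, n_positions=32, n_embd=16,
                         n_layer=2, n_head=2, use_cache=False)
tiny_model = GPT2LMHeadModel(tiny_config).eval()

# The body produces hidden states; the head maps a hidden state to vocab scores
body, head = tiny_model.transformer, tiny_model.lm_head

tokens_a = torch.tensor([[5, 17, 42, 8]])
tokens_b = torch.tensor([[5, 17, 42, 9]])
with torch.no_grad():
    hidden_a = body(tokens_a)[0]            # (1, 4, 16) last hidden state
    hidden_b = body(tokens_b)[0]
    logits_from_head = head(hidden_a)       # head applied to the body's output
    logits_full = tiny_model(tokens_a)[0]   # logits from the full model

    # Approach 2 in miniature: decode a difference of two embeddings
    diff_scores = head(hidden_a[0, -1] - hidden_b[0, -1])  # (vocab_size,)
    top_ids = torch.topk(diff_scores, k=5).indices
```

Applying the head to the body's output reproduces the full model's logits, so the head can be reused on its own as a decoder. With a pretrained model, `top_ids` could then be passed to `tokenizer.decode` to read off the highest-scoring tokens for the difference vector.
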

### Comparing Positive and Negative Embeddings

In [11]:
# A statement and one possible negation.
statement = 'This book is bad.'
negation = 'This book is good.'

In [12]:
def extract_hidden_output(text, model=model, tokenizer=tokenizer, layer_num=-1,
                          aggregation='average'):
    """Return the hidden state of layer `layer_num` for `text`.

    aggregation='average' returns a single vector (the mean over tokens);
    aggregation=None returns one vector per token, shape (num_tokens, hidden_size).
    """
    # Tokenize the input
    tokens = tokenizer.encode(text, return_tensors='pt')
    tokens = tokens.to(device)

    with torch.no_grad():
        # All outputs from the language model
        outputs = model(tokens)
        # All hidden states (the second output, since use_cache=False)
        hidden_states = outputs[1]
        # Hidden state from layer layer_num, with the batch dimension dropped
        layer = hidden_states[layer_num][0]

    # Average the embeddings over tokens, and return a single vector
    if aggregation == 'average':
        return torch.mean(layer, dim=0)

    # Return all token embeddings, so each can be studied individually. For any
    # batch processing, this may require sentences to have the same number of
    # tokens, so we may want to set a max length for sentences and pad.
    if aggregation is None:
        return layer

    raise ValueError(f"Unknown aggregation: {aggregation!r}")

Let's compute the cosine of the angle between the last layer embeddings associated with the statement and its negation above:

In [13]:
statement_vec = extract_hidden_output(statement).reshape(-1)
negation_vec = extract_hidden_output(negation)
cosine = torch.dot(statement_vec, negation_vec)/(torch.norm(negation_vec)*torch.norm(statement_vec))
cosine

tensor(0.9998, device='cuda:0')

This small angle for a negation confused me at first. I suppose it is not so surprising, though, and reflects that angle has little to do with semantic similarity in these sorts of transformer models. That is interesting in itself, given how pervasive it is for people to use cosine similarity on hidden states as if it measured similarity. Maybe a better interpretation here is that 'good' and 'bad' have similar angles because they are very likely to occur in similar contexts, and GPT-2 was trained on next-token prediction from context. For example, ('turtle', 'bad') have a larger angle between them than either ('good', 'bad') or ('good', 'lovely'), which makes sense under the latter interpretation above.


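Or perhaps averaging is simply not the right way to construct the sentence vector. GPT-2 attends only to earlier tokens, so the hidden states for the shared prefix 'This book is' are identical in both sentences, and much of the average is therefore shared. A different construction might result in a larger angle, in which case this example says little about negation!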
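
One way to probe this is to compare mean pooling with last-token pooling on the per-token hidden states. The sketch below uses synthetic vectors rather than real GPT-2 states; `pool` is a hypothetical helper, and the toy inputs share their first three rows to mimic a shared prefix under causal attention.

```python
import torch
import torch.nn.functional as F

def pool(token_embeddings, how="mean"):
    """Reduce (num_tokens, hidden_size) token embeddings to a single vector."""
    if how == "mean":
        return token_embeddings.mean(dim=0)
    if how == "last":
        return token_embeddings[-1]
    raise ValueError(f"Unknown pooling: {how!r}")

# Synthetic stand-ins for two sentences that share a three-token prefix
torch.manual_seed(0)
shared_prefix = torch.randn(3, 8)
emb_a = torch.cat((shared_prefix, torch.randn(2, 8)))
emb_b = torch.cat((shared_prefix, torch.randn(2, 8)))

cos_mean = F.cosine_similarity(pool(emb_a, "mean"), pool(emb_b, "mean"), dim=0)
cos_last = F.cosine_similarity(pool(emb_a, "last"), pool(emb_b, "last"), dim=0)
```

On real data, the per-token output of `extract_hidden_output(text, aggregation=None)` would take the place of `emb_a` and `emb_b`, assuming that branch returns a (num_tokens, hidden_size) tensor.
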

### Load Stanford real-life contradictions dataset

In [16]:
import xml.etree.ElementTree as ET
tree = ET.parse('data/real_contradiction.xml')
root = tree.getroot()

In [54]:
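pairs = []
for child in root:
    pair = []
    attrib = child.attrib
    # Keep only contradictions that are lexical or negation-based
    if attrib['contradiction'] == 'YES' and attrib['type'] in ('lexical', 'negation'):
        # Use a fresh loop name so the `statement` string defined earlier is not overwritten
        for sentence in child:
            pair.append(sentence.text)
    pairs.append(pair)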

In [56]:
# Remove empty rows
pairs = [pair for pair in pairs if len(pair)>0]

In [70]:
import pandas as pd
pairs = pd.DataFrame(pairs, columns = ["statement","negation"])
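
The filtering steps above can be checked on a small inline sample. This is a sketch only: the XML below is a hypothetical sample that mimics the `contradiction` and `type` attributes used above, and the real file's tag names and layout may differ. `extract_pairs` is a hypothetical helper that combines the filter and the empty-row removal.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the element names <pair>, <t>, <h> are assumptions
sample_xml = """
<entailment-corpus>
  <pair id="1" contradiction="YES" type="negation">
    <t>The museum is open on Mondays.</t>
    <h>The museum is not open on Mondays.</h>
  </pair>
  <pair id="2" contradiction="NO" type="lexical">
    <t>The river flooded the town.</t>
    <h>The town was flooded.</h>
  </pair>
  <pair id="3" contradiction="YES" type="numeric">
    <t>Ten people attended.</t>
    <h>Twenty people attended.</h>
  </pair>
</entailment-corpus>
"""

def extract_pairs(root, kinds=("lexical", "negation")):
    """Collect (statement, negation) text pairs for contradictions of the given types."""
    pairs = []
    for child in root:
        attrib = child.attrib
        if attrib.get("contradiction") == "YES" and attrib.get("type") in kinds:
            texts = [sentence.text for sentence in child]
            # Skip malformed entries that do not contain exactly two sentences
            if len(texts) == 2:
                pairs.append(tuple(texts))
    return pairs

sample_pairs = extract_pairs(ET.fromstring(sample_xml))
```

Using `attrib.get` instead of indexing avoids a `KeyError` on entries that lack a `type` attribute, and the length check keeps every row compatible with a two-column DataFrame.
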

## To do:
1.  Create some negation data or find some, and then study the distributions of some aspects of the embeddings above for (sentence, negation) pairs.
2. Extract the LM head and instantiate as a separate model, to use as a decoder for approach 2 above.
3.  Determine how to construct sentence embeddings from token embeddings.
4.  Reproduce this notebook for BERT, which should hopefully be straightforward aside from the generate wrapper, which (judging from the errors) either does not exist or has different API defaults.
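
As a starting point for item 1, here is a minimal sketch of per-pair statistics over a batch of sentence embeddings. `pair_statistics` is a hypothetical helper, and the toy tensors stand in for stacked outputs of `extract_hidden_output`.

```python
import torch
import torch.nn.functional as F

def pair_statistics(statement_vecs, negation_vecs):
    """Per-pair cosine similarity and norm ratio for two (num_pairs, hidden_size) tensors."""
    cosines = F.cosine_similarity(statement_vecs, negation_vecs, dim=1)
    norm_ratios = negation_vecs.norm(dim=1) / statement_vecs.norm(dim=1)
    return cosines, norm_ratios

# Toy embeddings for two (statement, negation) pairs
toy_statements = torch.tensor([[1.0, 0.0], [0.0, 2.0]])
toy_negations = torch.tensor([[2.0, 0.0], [3.0, 0.0]])
cosines, norm_ratios = pair_statistics(toy_statements, toy_negations)
```

For the real data, the inputs could be built with something like `torch.stack([extract_hidden_output(s) for s in pairs["statement"]])`, and likewise for the `negation` column. The resulting `cosines` and `norm_ratios` can then be plotted as histograms.
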