In [37]:
import torch, numpy as np
from transformers import GPT2Model,GPT2LMHeadModel, GPT2Config, GPT2Tokenizer
from pytorch_model_summary import summary

torch.set_grad_enabled(False) #no training in this NB
norm=np.linalg.norm

## Load in the gpt2 XL model
Using `GPT2LMHeadModel` lets us decode our vectors back into language. Further down we'll load the bare `GPT2Model` and just get the vectors.

In [36]:
#configuration = GPT2Config()
language_model = GPT2LMHeadModel.from_pretrained('gpt2-xl') #use 'gpt2' if you don't want to download 6GB of weights and use 20 GB of ram
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl") #use 'gpt2' if you don't want to download 6GB of weights and use 20 GB of ram

In [37]:
tokens = tokenizer.encode("An interesting unexplored thing about language models is that they actually have a rich group structure, for instance", return_tensors='pt')

tokenizer.decode(language_model.generate(tokens, do_sample=True, 
    max_length=100, 
    top_k=50)[0],skip_special_tokens=True)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


'An interesting unexplored thing about language models is that they actually have a rich group structure, for instance each component (e.g. "this is a word", "this is not a word") is a part of a whole. In our case with Word2Vec, we use embeddings to transform this word-based group structure into the word-free vocabulary structure that we seek (because you can\'t represent the word-free space as an array of vectors, you have to "'

In [34]:
#1.5 billion params
sum(p.numel() for p in language_model.parameters())

1557611200

We can probably get the representation at any particular layer using a PyTorch forward hook, but for now let's just instantiate another gpt2 model without the LM head and see what kind of vectors that gives us.
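As a rough sketch of the hook idea mentioned above: `register_forward_hook` is the standard PyTorch mechanism, and `model.h` is the list of transformer blocks in `GPT2Model`. To avoid downloading anything, this uses a tiny randomly initialised model; the config sizes, the token ids and the names `tiny_model`, `captured` and `save_block_output` are all made up for illustration.

```python
import torch
from transformers import GPT2Config, GPT2Model

# Tiny, randomly initialised stand-in so nothing is downloaded (sizes are assumptions)
config = GPT2Config(vocab_size=100, n_positions=16, n_embd=32, n_layer=2, n_head=4)
tiny_model = GPT2Model(config)
tiny_model.eval()

captured = {}

def save_block_output(module, inputs, output):
    # Depending on the transformers version, a block returns a tuple or a plain tensor
    hidden = output[0] if isinstance(output, tuple) else output
    captured["block_0"] = hidden.detach()

# Attach the hook to the first transformer block
handle = tiny_model.h[0].register_forward_hook(save_block_output)

tokens = torch.tensor([[1, 2, 3]])  # three arbitrary token ids
with torch.no_grad():
    tiny_model(tokens)
handle.remove()  # remove the hook so later forward passes are unaffected

print(captured["block_0"].shape)
```

The same pattern should carry over to the pretrained model by hooking `model.h[i]` for whichever layer `i` is of interest.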

In [2]:
model = GPT2Model.from_pretrained('gpt2-xl') #use 'gpt2' if you don't want to download 6GB of weights and use 20 GB of ram
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl") #use 'gpt2' if you don't want to download 6GB of weights and use 20 GB of ram

Some weights of GPT2Model were not initialized from the model checkpoint at gpt2-xl and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias', 'h.12.attn.masked_bias', 'h.13.attn.masked_bias', 'h.14.attn.masked_bias', 'h.15.attn.masked_bias', 'h.16.attn.masked_bias', 'h.17.attn.masked_bias', 'h.18.attn.masked_bias', 'h.19.attn.masked_bias', 'h.20.attn.masked_bias', 'h.21.attn.masked_bias', 'h.22.attn.masked_bias', 'h.23.attn.masked_bias', 'h.24.attn.masked_bias', 'h.25.attn.masked_bias', 'h.26.attn.masked_bias', 'h.27.attn.masked_bias', 'h.28.attn.masked_bias', 'h.29.attn.masked_bias', 'h.30.attn.masked_bias', 'h.31.attn.masked_bias', 'h.32.attn.masked_bias', 'h.33.attn.masked_bias', 'h.34.attn.masked_bias', 'h.35.attn.masked_bias'

In [6]:
tokens = tokenizer.encode("I am great", return_tensors='pt')

In [7]:
tokens

tensor([[  40,  716, 1049]])

In [15]:
model.train(False)
representation = model(tokens)

In [30]:
print(representation[0].shape)
print(len(representation[1]))
print(representation[1][0].shape)
print(representation[1][1].shape)

torch.Size([1, 3, 1600])
48
torch.Size([2, 1, 25, 3, 64])
torch.Size([2, 1, 25, 3, 64])


It's not obvious what's being returned. From https://huggingface.co/transformers/pretrained_models.html, the hidden size is 1600 and this model has 48 layers. Judging by the shapes, the tuple is

element 0: the final hidden state, with one 1600-dimensional vector per token (batch, tokens, hidden size)
element 1: one entry per layer (48 of them), which looks like the cached attention keys and values for that layer rather than a hidden state; the leading 2 would be key and value, and 25 × 64 = 1600.

Looking at the tensors in element 1, the 3rd dimension runs over the attention heads, of which there are 25 for this model. Below we check that the same holds for the smaller gpt2 model. It matches: 12 heads, and the final hidden size is 768.
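One way to read the shapes printed above, assuming each entry of element 1 stacks a key tensor and a value tensor for one layer and that the per-head size is the hidden size divided by the number of heads. `cache_shape` is a hypothetical helper, not part of `transformers`.

```python
def cache_shape(n_embd, n_head, batch, seq_len):
    """Expected shape of one per-layer entry: (key/value, batch, heads, tokens, head size)."""
    head_dim = n_embd // n_head  # the hidden size is split evenly across heads
    return (2, batch, n_head, seq_len, head_dim)

# gpt2-xl: hidden size 1600, 25 heads, three tokens ("I am great")
print(cache_shape(1600, 25, 1, 3))
# gpt2: hidden size 768, 12 heads, same three tokens
print(cache_shape(768, 12, 1, 3))
```

Both models end up with a head size of 64, which agrees with the last dimension in the printed shapes.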

Compare some cosine similarities

In [23]:
model2 = GPT2Model.from_pretrained('gpt2') 
tokenizer2 = GPT2Tokenizer.from_pretrained("gpt2")

Some weights of GPT2Model were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [44]:
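tokens2 = tokenizer2.encode("I am a great man", return_tensors='pt')
model2.train(False)
# NOTE: this passes `tokens` ("I am great", 3 tokens) rather than `tokens2`,
# which is why the sequence dimension printed below is 3
representation2 = model2(tokens)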
print(representation2[0].shape)
print(len(representation2[1]))
print(representation2[1][0].shape)
print(representation2[1][1].shape)

torch.Size([1, 3, 768])
12
torch.Size([2, 1, 12, 3, 64])
torch.Size([2, 1, 12, 3, 64])


In [39]:
tok_pos = tokenizer2.encode("I am a great man", return_tensors='pt')
rep_pos = model2(tok_pos)[0].numpy().flatten()
tok_neg = tokenizer2.encode("I am a terrible man", return_tensors='pt')
rep_neg = model2(tok_neg)[0].numpy().flatten()

print(rep_neg.dot(rep_pos)/(norm(rep_neg)*norm(rep_pos)))

0.99936974


In [43]:
tok_pos = tokenizer2.encode("I ate dinner at 5 pm, that's a bit strange", return_tensors='pt')
tok_neg = tokenizer2.encode("I am the walrus", return_tensors='pt')

# Output shapes for the longer sentence
rep_pos = model2(tok_pos)
print(rep_pos[0].shape)
print(len(rep_pos[1]))
print(rep_pos[1][0].shape)
print(rep_pos[1][1].shape)

# Output shapes for the shorter sentence
rep_neg = model2(tok_neg)
print(rep_neg[0].shape)
print(len(rep_neg[1]))
print(rep_neg[1][0].shape)
print(rep_neg[1][1].shape)
# The flattened hidden states have different lengths here (12*768 vs 5*768),
# so the cosine from the previous cell can't be computed directly
#print(rep_neg.dot(rep_pos)/(norm(rep_neg)*norm(rep_pos)))

torch.Size([1, 12, 768])
12
torch.Size([2, 1, 12, 12, 64])
torch.Size([2, 1, 12, 12, 64])
torch.Size([1, 5, 768])
12
torch.Size([2, 1, 12, 5, 64])
torch.Size([2, 1, 12, 5, 64])


Okay, this is a bit more work than I thought: the `hidden state` holds one vector per token, so its second dimension is the number of tokens (12 and 5 here). The length 3 in the earlier gpt2 check came from passing `tokens` ("I am great") to the model rather than `tokens2`, not from `am a` being ignored. If we want "sentence representations" then we will need to decide how to combine the per-token vectors into a single one, since GPT-2 has no separate "decoder" stage that does this for us.
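A minimal sketch of one possible way to get a fixed-size vector per sentence: average the per-token vectors (mean pooling), then compare with cosine similarity. Mean pooling is an assumption here, not something the model prescribes, and the random arrays stand in for real hidden states of 12 and 5 tokens. `mean_pool` and `cosine` are illustrative helpers.

```python
import numpy as np

def mean_pool(token_vectors):
    """Average over the token axis: (tokens, hidden) -> (hidden,)."""
    return token_vectors.mean(axis=0)

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins for per-token hidden states (not real model output)
rng = np.random.default_rng(0)
long_sentence = rng.normal(size=(12, 768))   # 12 tokens
short_sentence = rng.normal(size=(5, 768))   # 5 tokens

pooled_long = mean_pool(long_sentence)
pooled_short = mean_pool(short_sentence)

# Both pooled vectors have length 768, so the cosine is well defined
similarity = cosine(pooled_long, pooled_short)
print(pooled_long.shape, pooled_short.shape)
```

With real outputs, `model2(tok)[0][0].numpy()` would take the place of the random arrays. Taking only the last token's vector is another option worth trying.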

## ToDo
1. The load warnings keep saying we should train the model before running any inference. That's pretty weird; I wonder if it's actually necessary. Need to look into that.
2. If I want to decode the hidden state of this sentence, I wonder if I can just instantiate one of those `LMHead`s and then try to decode the difference vectors.
3. What's the natural way to compose the per-token representations into one for the whole sentence?
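For item 2, a hedged sketch of what decoding a hidden state could look like. `GPT2LMHeadModel` exposes the bare model as `.transformer` and the output projection as `.lm_head`, so a hidden state can be pushed through the head by hand. This uses a tiny randomly initialised model (made-up config sizes and token ids) to avoid a download, so the predicted token is meaningless. Whether difference vectors decode to anything sensible is still an open question.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny, randomly initialised stand-in (sizes are assumptions, nothing is downloaded)
config = GPT2Config(vocab_size=100, n_positions=16, n_embd=32, n_layer=2, n_head=4)
tiny_lm = GPT2LMHeadModel(config)
tiny_lm.eval()

tokens = torch.tensor([[1, 2, 3]])  # three arbitrary token ids
with torch.no_grad():
    hidden = tiny_lm.transformer(tokens)[0]      # per-token hidden states
    logits_from_head = tiny_lm.lm_head(hidden)   # project onto the vocabulary
    logits_full = tiny_lm(tokens)[0]             # the model's own logits, for comparison

# Most likely next token according to the last position's hidden state
next_token_id = int(logits_from_head[0, -1].argmax())
print(logits_from_head.shape)
```

With the pretrained weights, `tokenizer.decode([next_token_id])` would turn the id back into text.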