In [3]:
import torch, numpy as np
from transformers import GPT2Model,GPT2LMHeadModel, GPT2Config, GPT2Tokenizer

torch.set_grad_enabled(False) #no training in this NB
norm=np.linalg.norm


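## Load in the GPT-2 model
The cells below load the small `gpt2` checkpoint; swap in `gpt2-xl` for the XL model (about 6 GB of weights and roughly 20 GB of RAM). Using `GPT2LMHeadModel` allows us to actually decode our vectors back into language. Further below we'll just get the vectors.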

In [4]:
#configuration = GPT2Config()
# 'gpt2' is the small checkpoint; 'gpt2-xl' downloads ~6 GB of weights and uses ~20 GB of RAM
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
language_model = GPT2LMHeadModel.from_pretrained('gpt2', pad_token_id=tokenizer.eos_token_id)

In [5]:
tokens = tokenizer.encode("An interesting unexplored thing about language models is that they actually have a rich group structure, for instance", return_tensors='pt')

# top-k decoding
tokenizer.decode(language_model.generate(tokens, do_sample=True, 
    max_length=100, 
    top_k=50)[0],skip_special_tokens=True)

'An interesting unexplored thing about language models is that they actually have a rich group structure, for instance, some of which we can think of as a set of words. In such a statement, you might say: "It is the word of God, but that is not enough. So it should be a list of these words, the whole of which must be different from any others." But even for this simple statement, it must be really long. If those words mean a common word, then'

In [6]:
# ~124 million params
sum(p.numel() for p in language_model.parameters())

124439808
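
The count above can be reproduced by hand as a sanity check. The sketch below assumes the published GPT-2 small hyperparameters (vocabulary 50257, 1024 positions, width 768, 12 layers) and that the LM head shares its weight with the token embedding, so `parameters()` counts it only once. The variable names are illustrative and not taken from the library.

```python
# Hand count of GPT-2 small parameters (assumed hyperparameters, see the note above).
vocab_size, n_positions, n_embd, n_layer = 50257, 1024, 768, 12

# Token embedding table plus position embedding table
embeddings = vocab_size * n_embd + n_positions * n_embd

# One layer norm has a scale and a shift vector
layer_norm = 2 * n_embd

# Attention: fused query/key/value projection plus output projection (weights + biases)
attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)

# MLP: expand to 4 * n_embd, then project back (weights + biases)
mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)

# Each block has two layer norms, one attention module, and one MLP
per_block = 2 * layer_norm + attention + mlp

# All blocks plus the final layer norm; the tied LM head adds nothing extra
total = embeddings + n_layer * per_block + layer_norm
print(total)  # 124439808
```

Roughly 85 million of the parameters sit in the 12 transformer blocks and about 39 million in the embedding tables.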

We can probably get the representation at any particular layer using a `pytorch` callback (a forward hook), but for now let's just instantiate another GPT-2 model without the "LMHead" and see what kind of vectors that gives us. *Actually, it seems we can get intermediate representations right from the `forward` method.*
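
A minimal sketch of getting intermediate representations from `forward`, assuming a reasonably recent `transformers` release where the call accepts `output_hidden_states` and `return_dict`. To avoid any download it builds a tiny, randomly initialised GPT-2; the config sizes and the name `tiny_model` are arbitrary choices for illustration, not part of this notebook.

```python
import torch
from transformers import GPT2Config, GPT2Model

# Tiny random GPT-2 so nothing is downloaded; the sizes are arbitrary.
config = GPT2Config(vocab_size=100, n_positions=32, n_embd=16, n_layer=2, n_head=2)
tiny_model = GPT2Model(config).eval()

input_ids = torch.tensor([[1, 2, 3]])  # one sequence of three token ids
with torch.no_grad():
    outputs = tiny_model(input_ids, output_hidden_states=True, return_dict=True)

# Tuple with the embedding output followed by one entry per block
hidden_states = outputs.hidden_states
print(len(hidden_states))            # n_layer + 1
print(hidden_states[0].shape)        # (batch, tokens, n_embd)
print(outputs.last_hidden_state.shape)
```

With the pretrained model the same keyword arguments should apply, for example `model(tokens, output_hidden_states=True)`, though older library versions may expose this through the config instead.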

In [7]:
# 'gpt2' is the small checkpoint; 'gpt2-xl' downloads ~6 GB of weights and uses ~20 GB of RAM
model = GPT2Model.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

Some weights of GPT2Model were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


>The warning above seems to be related to this known issue:
    https://github.com/huggingface/transformers/pull/5922
Perhaps it is resolved by the PR referenced in that thread. Try pulling the master branch to see whether the warning goes away?

In [8]:
tokens = tokenizer.encode("I am great", return_tensors='pt')

In [9]:
tokens

tensor([[  40,  716, 1049]])

In [10]:
representation = model(tokens)

model

In [11]:
print(representation[0].shape)
print(len(representation[1]))
print(representation[1][0].shape)
print(representation[1][1].shape)

torch.Size([1, 3, 768])
12
torch.Size([2, 1, 12, 3, 64])
torch.Size([2, 1, 12, 3, 64])


"Not really sure what's being returned. From https://huggingface.co/transformers/pretrained_models.html, it seems the hidden state size for the sentence (i.e. it's embedding) is 1600, and there are 48 layers for this model. So I guess the tuple is 
- element 0: hidden state
- element 1: hidden states at each prior attention block.
Looking at the vectors in element 1, it seems the 3rd dimension runs over the "heads" as there are 25 for this model. Below just check to make sure the same is true for the smaller gpt2 model. It seems to match, 12 heads, final hidden state is 768."

>I think the above is not quite right. See the return values documented for transformers.GPT2Model here: https://huggingface.co/transformers/model_doc/gpt2.html#gpt2model. It's hard to tell exactly what is being returned without looking at the config (we should probably instantiate configs explicitly to avoid this ambiguity), but judging from the shapes we must be viewing an output tuple consisting of (``last_hidden_state``, ``past_key_values``); see the link for definitions of both. This explains, for example, the 2 appearing in the first dimension of each entry of ``element 1``: it indexes the cached key and value. If element 1 were meant to contain hidden states, we would expect the hidden size (768 in this case) to appear as a dimension, but it does not.
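
To make the shape reading concrete, here is a small NumPy sketch of how a `(batch, tokens, 768)` tensor splits into 12 heads of size 64, and how stacking a key and a value tensor produces the leading 2. The arrays are random stand-ins, and the stacked layout is assumed only because it matches the shapes printed above; other library versions may return a different container.

```python
import numpy as np

batch, n_tokens, n_embd, n_head = 1, 3, 768, 12
head_dim = n_embd // n_head  # 768 / 12 = 64

# Random stand-ins for the key and value projections of one layer
rng = np.random.default_rng(0)
keys = rng.standard_normal((batch, n_tokens, n_embd))
values = rng.standard_normal((batch, n_tokens, n_embd))

def split_heads(x):
    # (batch, tokens, n_embd) -> (batch, heads, tokens, head_dim)
    return x.reshape(batch, n_tokens, n_head, head_dim).transpose(0, 2, 1, 3)

# Stack key and value along a new leading axis
layer_past = np.stack([split_heads(keys), split_heads(values)])
print(layer_past.shape)  # (2, 1, 12, 3, 64)
```

The hidden size 768 never appears as a single dimension because it is factored into 12 heads times 64 features per head.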

### Compare some cosines between sentences

In [12]:
model2 = GPT2Model.from_pretrained('gpt2') 
tokenizer2 = GPT2Tokenizer.from_pretrained("gpt2")

Some weights of GPT2Model were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
tokens2 = tokenizer2.encode("I am a great man", return_tensors='pt')
model2.train(False)
representation2 = model2(tokens2)
print(representation2[0].shape)
print(len(representation2[1]))
print(representation2[1][0].shape)
print(representation2[1][1].shape)

torch.Size([1, 5, 768])
12
torch.Size([2, 1, 12, 5, 64])
torch.Size([2, 1, 12, 5, 64])


In [21]:
tok_pos = tokenizer2.encode("good", return_tensors='pt')
rep_pos = model2(tok_pos)[0].numpy().flatten()
tok_neg = tokenizer2.encode("bad", return_tensors='pt')
rep_neg = model2(tok_neg)[0].numpy().flatten()

print(rep_neg.dot(rep_pos)/(norm(rep_neg)*norm(rep_pos)))

0.99922514
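
A cosine of 0.999 between "good" and "bad" does not by itself mean the two representations carry the same information. One possible explanation, not verified for GPT-2 here, is that both vectors share a large common component that dominates the dot product. The toy vectors below are made up to illustrate that effect; they are not model outputs.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two 1-D vectors
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# A large component shared by both vectors (hypothetical values)
offset = np.full(8, 10.0)
# A small component that points in opposite directions for the two vectors
delta = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])

a = offset + delta
b = offset - delta

raw = cosine(a, b)                        # close to 1 despite opposite deltas
mean = (a + b) / 2
centered = cosine(a - mean, b - mean)     # close to -1 once the shared part is removed
print(raw, centered)
```

If something similar happens with the real vectors, subtracting a mean representation estimated over many inputs before taking cosines might give more informative numbers.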


In [25]:
tok_pos = tokenizer2.encode("I ate dinner at 5 pm, that's a bit strange", return_tensors='pt')
rep_pos = model2(tok_pos)[0].numpy().flatten()
tok_neg = tokenizer2.encode("I am the walrus", return_tensors='pt')
rep_neg = model2(tok_neg)[0].numpy().flatten()

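# Inspect the output shapes for each sentence (separate names so the flattened vectors above are not overwritten)
out_pos = model2(tok_pos)
print(out_pos[0].shape)
print(len(out_pos[1]))
print(out_pos[1][0].shape)
print(out_pos[1][1].shape)

out_neg = model2(tok_neg)
print(out_neg[0].shape)
print(len(out_neg[1]))
print(out_neg[1][0].shape)
print(out_neg[1][1].shape)
# The flattened vectors have different lengths (12*768 vs. 5*768), so this dot product cannot be computed as is
#print(rep_neg.dot(rep_pos)/(norm(rep_neg)*norm(rep_pos)))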

torch.Size([1, 12, 768])
12
torch.Size([2, 1, 12, 12, 64])
torch.Size([2, 1, 12, 12, 64])
torch.Size([1, 5, 768])
12
torch.Size([2, 1, 12, 5, 64])
torch.Size([2, 1, 12, 5, 64])


Okay, this is a bit more work than I thought. The `hidden state` is the attention representation for each token, one 768-dimensional vector per token (I guess in the previous sentence `am a` must be ignored, which is why the second dimension is length 3). If we want "sentence representations", then we will need to figure out how these per-token vectors are combined in the "decoder", I guess.

> Is there some documentation that makes you think the second dimension is 3 because 'am' and 'a' are dropped? While that doesn't sound unreasonable, I don't know where to find which words might get dropped, and it seems odd given that the documentation doesn't suggest any such list of words exists (as far as I can see). I think instead you accidentally wrote ``representation2 = model2(tokens)`` when you meant to write ``representation2 = model2(tokens2)``. I changed it above, and I no longer see any dimension counts that indicate dropped words (in my case there is a 5 where you had 3, which matches the number of words, and perhaps tokens, in the sentence).
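
On the question of sentence representations raised above: the per-token outputs for two sentences have different lengths (12 and 5 tokens here), so their flattened vectors cannot be compared directly. Mean pooling over the token axis is one common convention for getting a fixed-size vector; it is sketched below with random stand-in arrays and is not necessarily the best choice for GPT-2.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_a = rng.standard_normal((12, 768))  # stand-in for a 12-token sentence
hidden_b = rng.standard_normal((5, 768))   # stand-in for a 5-token sentence

def mean_pool(token_states):
    # Average over the token axis -> one vector of size 768
    return token_states.mean(axis=0)

def cosine(u, v):
    # Cosine similarity between two 1-D vectors
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sent_a = mean_pool(hidden_a)
sent_b = mean_pool(hidden_b)
print(sent_a.shape, sent_b.shape)  # both (768,)
print(cosine(sent_a, sent_b))
```

With real outputs, something like `model2(tok_pos)[0][0].numpy()` should supply the `(tokens, 768)` array. Since GPT-2 attends only to earlier tokens, taking the last token's vector is another option worth trying.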

## ToDo
1. The warnings keep saying we should train a model before running any inference. That's pretty weird; I wonder if that's actually necessary. Need to look into that.
>The warning referred to above seems to be related to this known issue:
    https://github.com/huggingface/transformers/pull/5922
Perhaps the issue is resolved by the PR referenced in that thread. Maybe we should try pulling the master branch and installing the most recent build to see whether it is resolved? I haven't tried that yet, but plan to. My basic understanding is that this API call pulls weights for a model with a slightly different head than the LM model, so the warning is letting you know that the weights aren't necessarily all pretrained as intended (though in this case, I think they are).

2. If I want to decode the hidden state of this sentence, I wonder if I can just instantiate one of those `LMHead`s and then try to decode the difference vectors
3. What's the natural way to compose the attention reps for each word?
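
For item 2, a rough sketch of what decoding a vector with an LM head amounts to. In `GPT2LMHeadModel` the head is a bias-free linear map whose weight is tied to the token embedding matrix, so the logits are the hidden vector multiplied by the transpose of that matrix. The toy vocabulary and 3-d embeddings below are made up for illustration; whether difference vectors decode to anything meaningful in the real model is untested here.

```python
import numpy as np

# Hypothetical toy vocabulary with made-up 3-d embeddings, one row per token
vocab = ["good", "bad", "walrus", "dinner"]
embedding = np.array([
    [ 1.0,  0.3, 0.0],
    [-1.0,  0.2, 0.0],
    [ 0.0,  1.0, 0.5],
    [ 0.0, -0.5, 1.0],
])

def decode(vector, top_k=2):
    # Tied LM head: logits are dot products with every token embedding
    logits = embedding @ vector
    best = np.argsort(logits)[::-1][:top_k]
    return [vocab[i] for i in best]

# Difference vector "good" minus "bad"
difference = embedding[0] - embedding[1]
print(decode(difference))
```

In the real model the vector passed to the head would be a 768-dimensional hidden state and the matrix would be the 50257-row token embedding table.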