## Analyzing GPT2 embeddings

In [12]:
# from transformers import GPT2LMHeadModel
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

In [35]:
# Load the GPT-2 fast tokenizer and the GPT-2 large model (~774M parameters); loading took > 30 min locally but ~10 s on Kaggle
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2-large", use_fast=True)
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2-large")

# Get the embedding matrix and save it locally to speed up future runs
word_embeddings = model.transformer.wte.weight      # shape: (vocab size ~50K) x (d_model = 1280)
torch.save(word_embeddings, 'word_embeddings.pt')   # fp32: ~256 MB = 4 bytes x (50K x 1280)



In [34]:
tokenizer_small = tokenizer
model_small = model

In [36]:
# Load the saved embedding matrix onto the GPU
word_embeddings = torch.load('word_embeddings.pt', map_location=torch.device('cuda'), weights_only=True)
print(f"shape of the embedding matrix: {word_embeddings.shape}\n")
print(f"Chunk of the embedding of the word 'hello':\n {word_embeddings[tokenizer('hello')['input_ids'][0]][:10]}")

shape of the embedding matrix: torch.Size([50257, 1280])

Chunk of the embedding of the word 'hello':
 tensor([-0.2132,  0.0170, -0.0069,  0.1358, -0.0563,  0.0098, -0.0777,  0.0074,
        -0.0020,  0.0942], device='cuda:0', grad_fn=<SliceBackward0>)
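
As a quick check of the size quoted in the save step, the shape printed above gives the number of bytes held by the fp32 matrix. This is only a small arithmetic sketch; the file on disk may be slightly larger because of serialization overhead (an assumption, not measured here).

```python
# Shape printed above: 50257 tokens x 1280 dimensions, stored as 4-byte floats (fp32).
vocab_size, d_model, bytes_per_value = 50257, 1280, 4

n_bytes = vocab_size * d_model * bytes_per_value
print(f"{n_bytes:,} bytes = {n_bytes / 1e6:.1f} MB")
```

The result is roughly 257 MB, in line with the ~256 MB estimate obtained by rounding the vocabulary size to 50K.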


#### Useful function

In [41]:
def is_in_vocab(token, tokenizer):
    """Print and return whether `token` is a single entry of the tokenizer's vocabulary."""
    existence = token in tokenizer.get_vocab()
    if existence:
        print(f"'{token}' exists in the vocabulary !\n")
    else:
        print(f"'{token}' does NOT exist in the vocabulary...\n")
    return existence

#### Example: the token `vector`

In [42]:
# Encoding for token 'vector'
if is_in_vocab("vector", tokenizer):
    vector_token = tokenizer('vector')['input_ids'][0]
    vector_embedding = word_embeddings[vector_token]

    # Most cosine-similar tokens to 'vector' in the vocabulary
    similarity = torch.cosine_similarity(word_embeddings, vector_embedding.unsqueeze(0), dim=1)
    top_similarities, top_indices = torch.topk(similarity, 20)
    top_words = tokenizer.convert_ids_to_tokens(top_indices)
    top_words = '; '.join(top_words)
    print(f"Top 20 most similar words to 'vector': \n{top_words}")

    # Closest tokens to 'vector' in the vocabulary by Euclidean distance
    distances = torch.cdist(word_embeddings, vector_embedding.unsqueeze(0))
    top_similarities, top_indices = torch.topk(-distances[:, 0], 20)  # negate: topk then returns the smallest distances
    top_words = tokenizer.convert_ids_to_tokens(top_indices)
    top_words = '; '.join(top_words)
    print(f"\nTop 20 most similar words to 'vector' using Euclidean distance: \n{top_words}")

'vector' exists in the vocabulary !

Top 20 most similar words to 'vector': 
vector; Ġvector; Vector; Ġvectors; ĠVector; vec; string; array; Ġvec; template; sequence; sector; pointer; dimensional; iterator; Orderable; isSpecial; factor; map; aditional

Top 20 most similar words to 'vector' using Euclidean distance: 
vector; Ġvector; Vector; Ġvectors; ĠVector; the; what; What; string; It; which; from; for; this; that; Although; For; Ġvec; We; This
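
The two rankings above agree on the first five tokens and then diverge. One possible reason (an assumption here, not something this notebook verifies) is that Euclidean distance depends on vector norms, while cosine similarity only depends on direction. The sketch below uses made-up 2-D vectors, not GPT-2 embeddings, to show how the two measures can pick different nearest neighbours.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (ignores vector length)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean_distance(a, b):
    """Straight-line distance between a and b (sensitive to vector length)."""
    return math.dist(a, b)

# Made-up 2-D vectors, purely illustrative.
query = [1.0, 0.0]
toy_vectors = {
    "same_direction_long": [3.0, 0.3],    # nearly parallel to the query, large norm
    "other_direction_short": [0.3, 0.6],  # different direction, small norm
}

by_cosine = max(toy_vectors, key=lambda k: cosine_similarity(query, toy_vectors[k]))
by_euclid = min(toy_vectors, key=lambda k: euclidean_distance(query, toy_vectors[k]))
print(f"nearest by cosine similarity:  {by_cosine}")
print(f"nearest by Euclidean distance: {by_euclid}")
```

Here the long, nearly parallel vector wins under cosine similarity, while the short vector pointing elsewhere wins under Euclidean distance. Checking the norms of the rows returned in each ranking would be one way to test whether the same effect is at play in the real embedding matrix.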


In [43]:
eng_math_list = ['square', 'triangle', 'ball', 'cercle', 'function', 'converge', 'diverge', 'positive', 'matrix', 'sequence', 'integer', ]
fre_math_list = ['carré', 'triangle', 'boule', 'cercle', 'fonction', 'converge', 'diverge', 'positif', 'matrice', 'suite', 'entier', ]
for word in eng_math_list:
    is_in_vocab(word, tokenizer)
print()
for word in fre_math_list:
    is_in_vocab(word, tokenizer)

'square' exists in the vocabulary !

'triangle' does NOT exist in the vocabulary...

'ball' exists in the vocabulary !

'cercle' does NOT exist in the vocabulary...

'function' exists in the vocabulary !

'converge' does NOT exist in the vocabulary...

'diverge' does NOT exist in the vocabulary...

'positive' exists in the vocabulary !

'matrix' does NOT exist in the vocabulary...

'sequence' exists in the vocabulary !

'integer' exists in the vocabulary !


'carré' does NOT exist in the vocabulary...

'triangle' does NOT exist in the vocabulary...

'boule' does NOT exist in the vocabulary...

'cercle' does NOT exist in the vocabulary...

'fonction' does NOT exist in the vocabulary...

'converge' does NOT exist in the vocabulary...

'diverge' does NOT exist in the vocabulary...

'positif' does NOT exist in the vocabulary...

'matrice' does NOT exist in the vocabulary...

'suite' does NOT exist in the vocabulary...

'entier' does NOT exist in the vocabulary...
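
The lookups above test the bare string only. In GPT-2's vocabulary a word that follows a space is stored with a leading `Ġ` (as in `Ġvector` in the earlier output), so a word reported as missing may still be present in its space-prefixed form. The helper below is a hypothetical sketch, named `vocab_forms` for illustration, that checks both forms; it is shown on a toy dictionary so that it runs without downloading the model.

```python
def vocab_forms(word, vocab):
    """Return the forms of `word` (bare and space-prefixed) that are single vocabulary entries."""
    candidates = [word, "Ġ" + word]  # "Ġ" marks a leading space in GPT-2's BPE vocabulary
    return [form for form in candidates if form in vocab]

# Toy vocabulary standing in for tokenizer.get_vocab(); the ids are made up.
toy_vocab = {"vector": 0, "Ġvector": 1, "Ġexample": 2}

print(vocab_forms("vector", toy_vocab))   # ['vector', 'Ġvector']
print(vocab_forms("example", toy_vocab))  # ['Ġexample']
print(vocab_forms("missing", toy_vocab))  # []
```

With the real tokenizer, the same function could be called as `vocab_forms(word, tokenizer.get_vocab())` for each word in the two lists.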



* GPT-2 was not a multilingual model: the meaningful tokens in its vocabulary are mostly English words.
* Because the tokenizer uses byte-pair encoding (BPE), many words are split into several sub-word tokens or stored only in a merged form (for example with the leading-space marker `Ġ`), so the bare word has no vocabulary entry of its own.

* **Let's choose a small set of English words from the vocabulary suggested in the paper's data, make sure each one is represented by a single token, and analyze their pairwise similarity.**
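
As a rough illustration of the pairwise analysis described above, the sketch below builds a cosine-similarity matrix for a handful of vectors. The words and 3-D vectors are made up and only stand in for selected rows of `word_embeddings`; the function name `pairwise_cosine` is likewise hypothetical.

```python
import math

def pairwise_cosine(vectors):
    """Return the matrix of cosine similarities between all pairs of vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [sum(x * y for x, y in zip(u, v)) / (nu * nv) for v, nv in zip(vectors, norms)]
        for u, nu in zip(vectors, norms)
    ]

# Made-up 3-D vectors standing in for embedding rows of single-token words.
words = ["a", "b", "c"]
toy_embeddings = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]

sim = pairwise_cosine(toy_embeddings)
for word, row in zip(words, sim):
    print(word, [round(s, 3) for s in row])
```

The matrix is symmetric with ones on the diagonal. On the notebook's tensors, the same quantity could be obtained by normalizing the selected rows to unit length and multiplying the result by its transpose.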