# Large Language Model

A language model is a probability distribution over sequences of items drawn from a vocabulary.

Let V be a vocabulary whose items can be characters, tokens, or words. A language model assigns a probability to each sequence of items from V.

Let $(w_1, w_2, \ldots, w_{n-1})$ be a sequence of words from the vocabulary V, and let LM be a language model over V with probability P. Then:

$$ P(w_n \mid w_1, \ldots, w_{n-1}) = \mathrm{NeuralNetwork}(w_1, \ldots, w_{n-1}) $$

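The model predicts the next word from the history, that is, from the words that precede it. Multiplying these conditional probabilities over all positions gives the probability of the whole sequence (the chain rule).

We would like the model to assign higher probabilities to sentences that are real and syntactically correct than to implausible or ungrammatical ones.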
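
As a minimal sketch of this idea, the toy example below scores two word orders by multiplying conditional probabilities. The values in `COND_PROBS` are invented for illustration and are not GPT-2 output, the helper name `sentence_probability` is hypothetical, and the history is truncated to the previous word only to keep the table small.

```python
import math

# Hypothetical table of P(next_word | previous_word).
# "<s>" marks the start of the sentence. All values are invented.
COND_PROBS = {
    ("<s>", "Alice"): 0.2,
    ("Alice", "is"): 0.5,
    ("is", "happy"): 0.1,
    ("<s>", "happy"): 0.01,
    ("happy", "is"): 0.02,
    ("is", "Alice"): 0.01,
}

UNSEEN = 1e-6  # small fallback probability for pairs missing from the table


def sentence_probability(words):
    """Multiply P(w_i | w_{i-1}) over all positions, accumulating in log space."""
    log_prob = 0.0
    previous = "<s>"
    for word in words:
        log_prob += math.log(COND_PROBS.get((previous, word), UNSEEN))
        previous = word
    return math.exp(log_prob)


p_real = sentence_probability(["Alice", "is", "happy"])
p_scrambled = sentence_probability(["happy", "is", "Alice"])
print(p_real, p_scrambled)
```

With this table the natural word order receives a much higher probability than the scrambled one, which is the behaviour we want from a real language model.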

In [1]:
%load_ext autoreload
%autoreload 2

In [41]:
########## Load the large language model GPT-2 from transformers #############
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", return_dict_in_generate=True)

In [48]:
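text = "Alice is a"
input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)  # Batch size 1
with torch.no_grad():
    outputs = model(input_ids, labels=input_ids)
loss, logits = outputs.loss, outputs.logits
# With labels given, the loss is the mean cross-entropy (negative log-likelihood)
# per predicted token, in nats. It is not a probability.
mean_nll = loss.item()

In [49]:
mean_nll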

3.280515432357788
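
The value above is larger than 1, so it cannot be a probability: it is an average negative log-likelihood in nats. The sketch below shows how such a value converts to a perplexity and to a sequence probability conditioned on the first token. It assumes the loss is averaged over the predicted tokens (every token except the first) and that the input has 3 tokens, so 2 are predicted; the helper names are hypothetical.

```python
import math


def perplexity(mean_nll):
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(mean_nll)


def sequence_probability(mean_nll, num_predicted_tokens):
    """Undo the averaging: total NLL = mean NLL * number of predicted tokens."""
    return math.exp(-mean_nll * num_predicted_tokens)


mean_nll = 3.280515432357788  # value displayed in the cell above
num_predicted_tokens = 2      # assumption: 3 input tokens, the first has no prediction

ppl = perplexity(mean_nll)
prob = sequence_probability(mean_nll, num_predicted_tokens)
print(f"perplexity = {ppl:.2f}, conditional sequence probability = {prob:.6f}")
```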

In [None]:
encoded_input = tokenizer(text, return_tensors='pt')
input_ids = encoded_input.input_ids

In [6]:
with torch.no_grad():
    # Returns a transformers.modeling_outputs.CausalLMOutputWithCrossAttentions
    # (or a tuple of torch.FloatTensor if return_dict=False is passed or config.return_dict=False).
    outputs = model(input_ids=input_ids, output_hidden_states=True, return_dict=True, labels=input_ids)
loss, logits = outputs.loss, outputs.logits
# The loss is the mean negative log-likelihood per predicted token, not a probability.
mean_nll = loss.item()

In [24]:
# logits: torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size),
# the prediction scores of the language modeling head (one score per vocabulary token, before softmax).
from torch.nn import Softmax
softmax = Softmax(dim=-1)  # normalize over the vocabulary axis
y = softmax(logits[0, -1])  # next-token distribution at the last position
y

tensor([2.0421e-03, 3.0297e-03, 1.1638e-04,  ..., 8.3673e-07, 4.0477e-06,
        1.9094e-03])
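
Softmax is meant to be taken over the vocabulary axis, giving one distribution per position. The sketch below uses random stand-in logits (not GPT-2 output; the sizes are arbitrary assumptions) to show why flattening all positions into one vector before the softmax is only equivalent when the sequence length is 1.

```python
import torch

torch.manual_seed(0)

# Stand-in logits with shape (batch_size, sequence_length, vocab_size).
batch_size, sequence_length, vocab_size = 1, 3, 5
fake_logits = torch.randn(batch_size, sequence_length, vocab_size)

# One distribution per position: normalize over the last (vocabulary) axis.
per_position = torch.softmax(fake_logits, dim=-1)
print(per_position.sum(dim=-1))  # every position sums to 1

# Flattening first yields a single distribution over all positions together.
flattened = torch.softmax(fake_logits.flatten(), dim=0)
print(flattened.sum())  # the whole vector sums to 1, not each position

# Next-token distribution: the last position only.
next_token_probs = per_position[0, -1]
```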

In [33]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

t = GPT2Tokenizer.from_pretrained("gpt2")
m = GPT2LMHeadModel.from_pretrained("gpt2")

encoded_text = t(text, return_tensors="pt")

# Step 1: run the model to get the logits of the next token
with torch.inference_mode():
    outputs = m(**encoded_text)


In [38]:
print("1. step to get the logits of the next token")
loss, logits = outputs.loss, outputs.logits
print(type(loss), type(logits))
print(logits.shape)
next_token_logits = logits[0, -1, :]  # logits shape: (batch_size, sequence_length, config.vocab_size)
print(next_token_logits.shape)
print(next_token_logits)

print("2. step to convert the logits to probabilities")
next_token_probs = torch.softmax(next_token_logits, -1)
print(next_token_probs.shape)
print(next_token_probs)

print("3. step to get the top 10 and put all together")
topk_next_tokens = torch.topk(next_token_probs, 10)
topk_next_token_list = [(t.decode(idx), prob) for idx, prob in zip(topk_next_tokens.indices, topk_next_tokens.values)]
print(topk_next_token_list)

# Greedy choice: take the most probable token (alternatively, sample from the top 10)
next_token, next_prob = topk_next_token_list[0]

print(next_token, next_prob)

1. step to get the logits of the next token
<class 'NoneType'> <class 'torch.Tensor'>
torch.Size([1, 1, 50257])
torch.Size([50257])
tensor([-34.4592, -34.0647, -37.3240,  ..., -42.2592, -40.6827, -34.5264])
2. step to convert the logits to probabilities
torch.Size([50257])
tensor([2.0421e-03, 3.0297e-03, 1.1638e-04,  ..., 8.3673e-07, 4.0477e-06,
        1.9094e-03])
3. step to get the top 10 and put all together
[(',', tensor(0.0643)), ('.', tensor(0.0447)), (' and', tensor(0.0286)), ('\n', tensor(0.0277)), ("'s", tensor(0.0254)), (' to', tensor(0.0176)), ('-', tensor(0.0164)), (' is', tensor(0.0158)), (' in', tensor(0.0128)), (' of', tensor(0.0119))]
, tensor(0.0643)
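
Taking the first entry of the list is greedy decoding. As a hedged sketch of the sampling alternative mentioned in the cell, the snippet below renormalizes the top-k probabilities and draws one token with `torch.multinomial`. The probability vector is a made-up stand-in for `next_token_probs`, and `sample_top_k` is a hypothetical helper, not part of transformers.

```python
import torch


def sample_top_k(probs, k, generator=None):
    """Sample one index from the k most probable entries of a 1-D distribution."""
    top = torch.topk(probs, k)
    # Renormalize so the k kept probabilities sum to 1.
    renormalized = top.values / top.values.sum()
    choice = torch.multinomial(renormalized, num_samples=1, generator=generator)
    # Map the position within the top-k list back to the vocabulary index.
    return top.indices[choice].item()


# Made-up distribution over a 6-token vocabulary (stand-in for next_token_probs).
toy_probs = torch.tensor([0.05, 0.40, 0.10, 0.30, 0.05, 0.10])

generator = torch.Generator().manual_seed(0)
samples = [sample_top_k(toy_probs, k=2, generator=generator) for _ in range(20)]
print(samples)  # only indices 1 and 3 can appear, since they are the top 2
```

Sampling trades determinism for variety: greedy decoding always returns the same continuation, while sampling can return any of the k candidates in proportion to its probability.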


In [None]:
########## Backup #########

In [60]:
########## Let's have a look at the vocabulary of GPT-2 ##########
vocab = tokenizer.get_vocab()
print(type(vocab))
print(len(vocab))

<class 'dict'>
50257


In [9]:
s = [tokenizer.encode("Alice")]
print(s)
torch.tensor(s)

[[44484]]


tensor([[44484]])

In [40]:
encoded = tokenizer("Alice", return_tensors="pt")
input_ids = encoded.input_ids

# Passing attention_mask and pad_token_id avoids the generation warnings;
# GPT-2 has no pad token, so the EOS token id is used for padding.
generated_outputs = model.generate(
    input_ids,
    attention_mask=encoded.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    num_return_sequences=3,
    output_scores=True,
    return_dict_in_generate=True,
    max_new_tokens=10,
)

# only use ids that were generated
# gen_sequences has shape [3, 10] (num_return_sequences, max_new_tokens)
gen_sequences = generated_outputs.sequences[:, input_ids.shape[-1]:]

# stack the processed scores of each step into one tensor and convert them to probs
probs = torch.stack(generated_outputs.scores, dim=1).softmax(-1)  # -> shape [3, 10, vocab_size]

# collect the probability of each generated token;
# gather needs a trailing dummy dim on the index tensor
gen_probs = torch.gather(probs, 2, gen_sequences[:, :, None]).squeeze(-1)

# now we can do all kinds of things with the probs

# 1) the probability that exactly those sequences are generated again
# (normally very small)
unique_prob_per_sequence = gen_probs.prod(-1)

# 2) normalize the probs over the three sequences
normed_gen_probs = gen_probs / gen_probs.sum(0)
# compare floats with a tolerance instead of ==
assert torch.allclose(normed_gen_probs[:, 0].sum(), torch.tensor(1.0)), "probs should be normalized"

# 3) compare normalized probs to each other like in 1)
unique_normed_prob_per_sequence = normed_gen_probs.prod(-1)
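
Products of many per-token probabilities underflow quickly for longer generations, so sequence scores are usually accumulated in log space. The sketch below illustrates this with random stand-in scores; the tensor sizes and variable names are assumptions, not values from the run above.

```python
import torch

torch.manual_seed(0)

num_sequences, num_steps, vocab_size = 3, 10, 50

# Stand-ins for the stacked per-step scores and the sampled token ids.
fake_scores = torch.randn(num_sequences, num_steps, vocab_size)
fake_tokens = torch.randint(vocab_size, (num_sequences, num_steps))

# log_softmax is numerically safer than softmax followed by log.
log_probs = fake_scores.log_softmax(dim=-1)

# Pick the log-probability of each sampled token, then sum over the steps.
token_log_probs = torch.gather(log_probs, 2, fake_tokens[:, :, None]).squeeze(-1)
sequence_log_probs = token_log_probs.sum(dim=-1)

# Same quantity as the product of probabilities, computed in log space.
product_of_probs = torch.gather(
    fake_scores.softmax(dim=-1), 2, fake_tokens[:, :, None]
).squeeze(-1).prod(dim=-1)
print(sequence_log_probs, product_of_probs.log())
```

For 10 steps both routes agree up to rounding; the log-space sum stays usable for sequences long enough that the direct product would round to zero.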
