#**What is LLM surprisal and how is it computed?**

* Surprisal measures how unexpected a word is given its preceding context. Higher surprisal is commonly used as an indicator of greater processing difficulty.
* LMs provide a probability distribution over possible next words.
* The input text is divided into tokens, and each token is converted into a vector representation that captures its meaning in the context of the surrounding words.
* The outputs of the neural network (logits) are the raw scores for the next token given the preceding context. These are passed through the softmax function, which converts them into probabilities.
* The probability is then used to compute the surprisal value (the negative base-2 logarithm of the probability) for that word.
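
The steps above (logits, softmax, negative log) can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's implementation: the three-word vocabulary and the logit values are made up, and `softmax` and `surprisal_bits` are helper names introduced here.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def surprisal_bits(p):
    """Surprisal of an outcome with probability p, in bits."""
    return -math.log2(p)

# Hypothetical logits over a toy three-word vocabulary
vocab = ["pizza", "book", "salad"]
logits = [2.0, 0.0, 1.0]

probs = softmax(logits)
for word, p in zip(vocab, probs):
    print(f"{word}: p={p:.3f}, surprisal={surprisal_bits(p):.3f} bits")
```

The word with the largest logit gets the highest probability and therefore the lowest surprisal.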

###Understanding LLMs using the resources

####1. Psychformers.py:
* It handles both causal and masked language models.
* For each model, it loads the corresponding tokenizer and model architecture.
* In causal models, the model computes the probability of each target word given the preceding context.
* In masked models, the model computes the probability of each target word given the surrounding context, which includes both preceding and following words.
* In causal masked models, the model computes the probability of each target word given the surrounding context, similar to masked model processing but with considerations for causal modeling.
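
The difference between the causal and masked settings comes down to which tokens the model may condition on when scoring a target word. The sketch below only illustrates that idea; `visible_context` is a hypothetical helper, not a function from Psychformers.py.

```python
def visible_context(tokens, i, mode):
    """Return the tokens a model may condition on when predicting tokens[i]."""
    if mode == "causal":
        # Only the preceding words are visible.
        return tokens[:i]
    if mode == "masked":
        # Both sides are visible; the target itself is hidden behind [MASK].
        return tokens[:i] + ["[MASK]"] + tokens[i + 1:]
    raise ValueError(f"unknown mode: {mode}")

tokens = ["I", "ate", "a", "pizza", "yesterday"]
causal_ctx = visible_context(tokens, 3, "causal")
masked_ctx = visible_context(tokens, 3, "masked")
print(causal_ctx)
print(masked_ctx)
```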

####2. Surprisal.py:
* Models experimented with: GPT-2, GPT-Neo, GPT-3, and KenLM (an n-gram-based language model)
* Two main classes: HuggingFaceSurprisal and NgramSurprisal
* HuggingFaceSurprisal:
  * It takes a list of tokens and an array of surprisal values
  * Returns the total surprisal for a span of text when indexed

* NgramSurprisal:
  * Computes total surprisal over whole words, not individual characters

* A helper function handles two kinds of slices:
  * Character-based slice: You select a range of characters from the original text.
  * Word-based slice: You select tokens by their word positions.
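
The two slice types can be illustrated with a toy example. This is a sketch under stated assumptions and does not reproduce the actual classes in Surprisal.py: the record layout, the surprisal values, and the function names `total_by_chars` and `total_by_words` are all hypothetical, and each token is treated as one word.

```python
# Hypothetical per-token records: (token text, start char, end char, surprisal).
text = "I ate a pizza"
records = [
    ("I", 0, 1, 4.0),
    ("ate", 2, 5, 6.5),
    ("a", 6, 7, 2.0),
    ("pizza", 8, 13, 7.5),
]

def total_by_chars(records, start, stop):
    """Sum surprisal over tokens that overlap the character range [start, stop)."""
    return sum(s for _, b, e, s in records if b < stop and e > start)

def total_by_words(records, start, stop):
    """Sum surprisal over tokens at word positions [start, stop)."""
    return sum(s for _, _, _, s in records[start:stop])

chars_total = total_by_chars(records, 2, 7)  # characters 2..6 cover "ate a"
words_total = total_by_words(records, 1, 3)  # word positions 1 and 2: "ate", "a"
print(chars_total, words_total)
```

Both slices select the same two tokens here, so the totals agree.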

###Computing surprisal using GPT-2 (causal model)

In [21]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForMaskedLM
import csv, math
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Set the model name
causal_model_name = "openai-community/gpt2-large"

# Load the model and tokenizer
causal_tokenizer = AutoTokenizer.from_pretrained(causal_model_name)
causal_model = AutoModelForCausalLM.from_pretrained(causal_model_name)

# Set the model to evaluation mode
causal_model.eval()

def calculate_surprisal_causal(sentence):
    inputs = causal_tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = causal_model(**inputs, labels=inputs["input_ids"])

    # Log probabilities over the vocabulary at every position.
    # log_softmax uses the natural log, so surprisal below is in nats
    # (divide by math.log(2) to convert to bits).
    log_probs = outputs.logits.log_softmax(dim=-1)
    input_ids = inputs["input_ids"][0]

    causal_surprisals = []
    for i in range(len(input_ids)):
        token_id = input_ids[i]
        if i == 0:
            # The first token has no left context (no BOS token is prepended),
            # so this value is read from position 0's next-token distribution
            # and is not a true conditional surprisal.
            log_prob = log_probs[0, i, token_id].item()
            surprisal = -log_prob
        else:
            # Logits at position i - 1 predict the token at position i.
            prev_log_prob = log_probs[0, i - 1, token_id].item()
            surprisal = -prev_log_prob
        causal_surprisals.append((causal_tokenizer.decode([token_id]), surprisal))

    surprisal_causal_df = pd.DataFrame(causal_surprisals, columns=["token", "surprisal"])
    return surprisal_causal_df
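
`log_softmax` returns natural-log probabilities, so the values produced by `calculate_surprisal_causal` are in nats, while the definition given earlier uses base 2 (bits). A small conversion helper, sketched here with the hypothetical name `nats_to_bits`, bridges the two:

```python
import math

def nats_to_bits(surprisal_nats):
    """Convert surprisal from nats (natural log) to bits (log base 2)."""
    return surprisal_nats / math.log(2)

p = 0.25                 # example probability of a token
nats = -math.log(p)      # surprisal in nats
bits = nats_to_bits(nats)
print(f"{nats:.3f} nats = {bits:.3f} bits")
```

The ranking of tokens by surprisal is unchanged by the conversion; only the scale differs.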

###Computing surprisal for BERT (masked model)

#### How does the BERT model work? Resource: [Research Paper](https://arxiv.org/pdf/1810.04805)

* There are two strategies for applying pre-trained representations to downstream tasks: fine-tuning and feature-based.
* The fine-tuning approach adds a simple classification layer to the pre-trained model, and all parameters are jointly fine-tuned on a downstream task.
* The fine-tuning approach, such as the Generative Pre-trained Transformer (GPT), introduces minimal task-specific parameters and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters.
* For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has a separate fine-tuned model, even though all are initialized with the same pre-trained parameters.
* The feature-based approach lets you leverage the knowledge embedded in a pre-trained model by using it to create fixed representations of the input data without modifying the model itself.
* Once you have these embeddings, you feed them to a simple, often lighter model, such as a logistic regression classifier or a basic neural network, that is designed for your task (e.g., sentiment analysis or named entity recognition).
* This new model (sometimes called a classification layer) does not change the parameters of the pre-trained BERT model; it only learns from the fixed representations produced by BERT.
* The feature-based approach, such as ELMo, uses task-specific architectures that include the pre-trained representations as additional features.
* During pre-training, BERT learns not only to fill in masked words but also to model relationships between pairs of sentences, through the two tasks below.

* Task #1: Masked Language Modeling
    * We simply mask some percentage of the input tokens at random, and then predict those masked tokens. We refer to this procedure as a “masked LM” (MLM), also known as a cloze task.
    * Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, we do not always replace “masked” words with the actual [MASK] token. The training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, we replace the i-th token with (1) the [MASK] token 80% of the time, (2) a random token 10% of the time, or (3) the unchanged i-th token 10% of the time. The final hidden vector for that position is then used to predict the original token with cross-entropy loss.

* Task #2: Next Sentence Prediction (NSP):
  * When choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext).
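
The 80/10/10 corruption rule from Task #1 can be sketched as follows. This is an illustrative approximation rather than BERT's actual data generator: `mask_tokens` is a hypothetical function, the vocabulary is a toy one, and each position is selected independently with probability 0.15 instead of sampling exactly 15% of the positions.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Apply BERT-style corruption; return (corrupted tokens, target positions)."""
    corrupted = list(tokens)
    targets = []
    for i in range(len(tokens)):
        if rng.random() >= select_prob:
            continue  # position not chosen for prediction
        targets.append(i)
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, targets

rng = random.Random(0)
tokens = "the horse raced past the barn and fell".split()
vocab = sorted(set(tokens))
corrupted, targets = mask_tokens(tokens, vocab, rng)
print(corrupted, targets)
```

Only the positions listed in `targets` contribute to the loss; the model must predict the original token there whether it sees `[MASK]`, a random token, or the unchanged token.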


###Computing surprisal using a masked model

In [22]:
masked_model_name ="bert-base-uncased"

masked_tokenizer = AutoTokenizer.from_pretrained(masked_model_name)
masked_model = AutoModelForMaskedLM.from_pretrained(masked_model_name)

masked_model.eval()

def calculate_surprisal_masked(sentence):
    inputs = masked_tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = masked_model(**inputs, labels=inputs["input_ids"])

    # Log probabilities (natural log) over the vocabulary at every position
    log_probs = outputs.logits.log_softmax(dim=-1)
    input_ids = inputs["input_ids"][0]

    masked_surprisals = []

    for i in range(len(input_ids)):
        # Calculate surprisal for each token as -log(p)
        token_id = input_ids[i]
        if i == 0:
            # For the first token ([CLS]) there is no position i - 1,
            # so we use its own position's log probability.
            log_prob = log_probs[0, i, token_id].item()
            surprisal = -log_prob
        else:
            # Caveat: this reuses the i - 1 shift from the causal code. BERT's
            # logits at position i score the token at position i, and no token
            # is masked here, so these values are not masked-LM surprisals.
            prev_log_prob = log_probs[0, i - 1, token_id].item()
            surprisal = -prev_log_prob
        masked_surprisals.append((masked_tokenizer.decode([token_id]), surprisal))

    surprisal_masked_df = pd.DataFrame(masked_surprisals, columns=["token", "surprisal"])
    return surprisal_masked_df


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
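
A common alternative for masked models is to mask one position at a time and read the probability of the original token at that same position, since BERT's logits at position i refer to token i rather than to the next token. The sketch below shows only the control flow and is not the notebook's method: `masked_surprisals` and `toy_predict` are hypothetical names, and `toy_predict` is a stand-in that would be replaced by a wrapper around a real masked LM.

```python
import math

def masked_surprisals(tokens, predict_fn, mask_token="[MASK]"):
    """Mask each position in turn and score the original token there.

    predict_fn(masked_tokens, position) is a hypothetical callable that returns
    a dict mapping candidate tokens to probabilities for the masked position.
    """
    results = []
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        probs = predict_fn(masked, i)
        results.append((target, -math.log(probs[target])))  # surprisal in nats
    return results

def toy_predict(masked_tokens, position):
    """Stand-in for a masked LM: a fixed distribution that ignores the context."""
    return {"i": 0.4, "ate": 0.3, "pizza": 0.2, "book": 0.1}

scores = masked_surprisals(["i", "ate", "pizza"], toy_predict)
for token, s in scores:
    print(f"{token}: {s:.3f} nats")
```

This requires one forward pass per token, so it is slower than a single pass, but each score is conditioned on a context in which the target is actually hidden.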


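###Computing surprisal using a permutation-based model: XLNet

#### How does XLNet work?
* XLNet is trained with a permutation language modeling objective: it maximizes the expected likelihood of a sequence over permutations of the factorization order, while the tokens keep their original positions.
* This captures bidirectional context without the [MASK] corruption used by BERT.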
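
To see how permuting the prediction order exposes both left and right context, the toy sketch below enumerates every order for a three-token sentence and lists what each token is predicted from. It is an illustration only: `factorization_contexts` is a hypothetical helper, and a real XLNet samples orders during training rather than enumerating all of them.

```python
from itertools import permutations

def factorization_contexts(tokens, order):
    """For a factorization order, list (target, context) pairs.

    Each token is predicted from the tokens that come earlier in `order`;
    the context is reported in original sentence order.
    """
    pairs = []
    for step, idx in enumerate(order):
        seen = sorted(order[:step])  # positions already predicted
        pairs.append((tokens[idx], [tokens[j] for j in seen]))
    return pairs

tokens = ["the", "horse", "fell"]
for order in permutations(range(len(tokens))):
    print(order, factorization_contexts(tokens, order))
```

In the order `(2, 0, 1)`, for example, "horse" is predicted from both "the" (left) and "fell" (right), which is how bidirectional context arises without a mask token.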

In [23]:
from transformers import XLNetTokenizer, XLNetLMHeadModel

permuation_model_name = "xlnet-base-cased"
permuation_tokenizer = XLNetTokenizer.from_pretrained(permuation_model_name)
permuation_model = XLNetLMHeadModel.from_pretrained(permuation_model_name)
permuation_model.eval()

def calculate_surprisal_permutation(sentence):
    # Encode the sentence into input IDs
    input_ids = permuation_tokenizer.encode(sentence, return_tensors='pt')

    with torch.no_grad():
        # Forward pass. Logits are returned for the entire sequence regardless;
        # labels only add a loss, which is unused here.
        outputs = permuation_model(input_ids, labels=input_ids)
    logits = outputs.logits

    # Shift the logits and labels to align each prediction with its corresponding token
    shifted_logits = logits[:, :-1, :]
    shifted_labels = input_ids[:, 1:]

    # Convert logits to log probabilities over the vocabulary
    log_probs = torch.log_softmax(shifted_logits, dim=-1)

    # For each position, extract the log probability assigned to the true token
    token_log_probs = log_probs.gather(2, shifted_labels.unsqueeze(-1)).squeeze(-1)

    # Compute surprisal as the negative log probability (apply unary minus to tensor before converting to list)
    computed_surprisals = (-token_log_probs.squeeze()).tolist()

    # Convert the full input_ids (without slicing) to tokens.
    tokens = permuation_tokenizer.convert_ids_to_tokens(input_ids[0])

    # Prepend a placeholder (None) for the first token since its surprisal isn't computed (no left context)
    full_surprisals = [None] + computed_surprisals

    return tokens, full_surprisals

###1. Analyzing sentences

In [51]:
sentences = {
    # 'sentence_1': "I ate a pizza yesterday.",  # correct sentence
    # 'sentence_2': "I ate a book yesterday.",     # semantic violation
    # 'sentence_3': "I eat a pizza yesterday.",      # morphological error
    # 'sentence_4': "I ate a piza yesterday.",       # lexical error
    'sentence_5': "The horse raced past the barn fell.",  # garden-path sentence
    'sentence_6': "the horse which was raced past the barn fell"  # unambiguous control: the relative clause is explicit
}

### 1.1 Causal Model:

In [52]:
surprisal_causal_results = []

for sentence_type, sentence in sentences.items():
    df = calculate_surprisal_causal(sentence)
    df['sentence'] = sentence
    print(df)
    print("*" * 50)
    surprisal_causal_results.append(df)

surprisal_causal_df = pd.concat(surprisal_causal_results, ignore_index=True)

    token  surprisal                             sentence
0     The  12.220047  The horse raced past the barn fell.
1   horse   8.996132  The horse raced past the barn fell.
2   raced   7.437644  The horse raced past the barn fell.
3    past   3.689770  The horse raced past the barn fell.
4     the   1.376319  The horse raced past the barn fell.
5    barn   5.620616  The horse raced past the barn fell.
6    fell  12.037618  The horse raced past the barn fell.
7       .   4.344603  The horse raced past the barn fell.
**************************************************
    token  surprisal                                      sentence
0     the   9.726631  the horse which was raced past the barn fell
1   horse   9.300182  the horse which was raced past the barn fell
2   which   6.401368  the horse which was raced past the barn fell
3     was   2.283421  the horse which was raced past the barn fell
4   raced   9.934964  the horse which was raced past the barn fell
5    past   5.856984  the

**Observations:**

* Misspellings can force the tokenizer to split a word into unexpected sub-word units, which raises surprisal because those sub-tokens are less likely in that context.

* For the garden-path sentence, surprisal captures the increased processing difficulty where the structure is ambiguous or requires reanalysis: in "The horse raced past the barn fell.", "fell" has a surprisal of 12.04, far higher than the preceding "barn" (5.62).
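
One simple way to locate the point of reanalysis is to look for the token with the highest surprisal. The sketch below uses the values printed above for the garden-path sentence; skipping the first token is an assumption made here because that token has no left context.

```python
# (token, surprisal) pairs copied from the causal-model output above
rows = [
    ("The", 12.220047), ("horse", 8.996132), ("raced", 7.437644),
    ("past", 3.689770), ("the", 1.376319), ("barn", 5.620616),
    ("fell", 12.037618), (".", 4.344603),
]

# Skip index 0: the first token has no left context, so its value is not a
# conditional surprisal in the same sense as the others.
peak_token, peak_value = max(rows[1:], key=lambda r: r[1])
print(peak_token, peak_value)
```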

### 1.2 Masked Model:

In [53]:
surprisal_masked_results = []

for sentence_type, sentence in sentences.items():
    df = calculate_surprisal_masked(sentence)
    df['sentence'] = sentence
    print(df)
    print("*" * 50)
    surprisal_masked_results.append(df)

surprisal_masked_df = pd.concat(surprisal_masked_results, ignore_index=True)

   token  surprisal                             sentence
0  [CLS]  13.900969  The horse raced past the barn fell.
1    the   4.236541  The horse raced past the barn fell.
2  horse  13.359109  The horse raced past the barn fell.
3  raced  18.895639  The horse raced past the barn fell.
4   past  11.975198  The horse raced past the barn fell.
5    the  10.862679  The horse raced past the barn fell.
6   barn  17.551842  The horse raced past the barn fell.
7   fell  18.181185  The horse raced past the barn fell.
8      .   4.902893  The horse raced past the barn fell.
9  [SEP]  30.795389  The horse raced past the barn fell.
**************************************************
    token  surprisal                                      sentence
0   [CLS]  14.039317  the horse which was raced past the barn fell
1     the   4.346106  the horse which was raced past the barn fell
2   horse  17.891470  the horse which was raced past the barn fell
3   which  13.686981  the horse which was raced past t

**Observations:**

1. [CLS] has high surprisal. It is a special token with no preceding context, so it receives a low probability estimate.
2. [SEP] has the highest surprisal. It is rare and appears only in specific, limited contexts, so the model predicts it less well than common words. Both values also depend on the i - 1 indexing in calculate_surprisal_masked, which is borrowed from the causal setup, so they should be interpreted with caution.
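
Because the special tokens carry the largest values, it can help to exclude them before summarizing a sentence. A minimal sketch, using the values from the first masked-model table above (the name `SPECIAL_TOKENS` is introduced here for illustration):

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}

# (token, surprisal) pairs copied from the first masked-model table above
rows = [
    ("[CLS]", 13.900969), ("the", 4.236541), ("horse", 13.359109),
    ("raced", 18.895639), ("past", 11.975198), ("the", 10.862679),
    ("barn", 17.551842), ("fell", 18.181185), (".", 4.902893),
    ("[SEP]", 30.795389),
]

content_rows = [(t, s) for t, s in rows if t not in SPECIAL_TOKENS]
mean_all = sum(s for _, s in rows) / len(rows)
mean_content = sum(s for _, s in content_rows) / len(content_rows)
print(f"mean with special tokens:    {mean_all:.3f}")
print(f"mean without special tokens: {mean_content:.3f}")
```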

### 1.3 Permutation Model

In [54]:
surprisal_permutation_results = []

for sentence_type, sentence in sentences.items():
    tokens, surprisals = calculate_surprisal_permutation(sentence)
    df = pd.DataFrame({'token': tokens, 'surprisal': surprisals})
    df['sentence'] = sentence
    print(df)
    print("*" * 50)
    surprisal_permutation_results.append(df)

surprisal_permutation_df = pd.concat(surprisal_permutation_results, ignore_index=True)


    token  surprisal                             sentence
0    ▁The        NaN  The horse raced past the barn fell.
1  ▁horse   9.267667  The horse raced past the barn fell.
2  ▁raced  10.447077  The horse raced past the barn fell.
3   ▁past   5.547226  The horse raced past the barn fell.
4    ▁the   9.646189  The horse raced past the barn fell.
5   ▁barn   8.861709  The horse raced past the barn fell.
6   ▁fell   8.939671  The horse raced past the barn fell.
7       .   6.742737  The horse raced past the barn fell.
8   <sep>  13.904520  The horse raced past the barn fell.
9   <cls>   9.846058  The horse raced past the barn fell.
**************************************************
     token  surprisal                                      sentence
0     ▁the        NaN  the horse which was raced past the barn fell
1   ▁horse   7.157487  the horse which was raced past the barn fell
2   ▁which   4.895946  the horse which was raced past the barn fell
3     ▁was   1.850880  the horse which 

In [28]:
print(permuation_tokenizer.convert_tokens_to_ids("▁ate"))  # 0 is the <unk> id, so "▁ate" is not in the vocabulary

0


**Observations:**

1. The first token (e.g. "▁I") has surprisal NaN: no probability is computed for it because it has no left context.
2. For the morphological-error sentence (3), the probability of "yesterday" does not rely solely on the left context "I eat a pizza"; the model can also draw on any available right context and on the overall likelihood of "yesterday" occurring in similar contexts.
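
Since "▁ate" is not a single vocabulary item, "ate" is split into the pieces "▁" and "ate". To get word-level values, the piece surprisals can be summed, because the log probabilities of the pieces add up to the log probability of the word. The sketch below is one way to do this, assuming the SentencePiece convention that "▁" marks the start of a word; `merge_pieces` is a hypothetical helper, and the values are rounded figures from the "I ate a pizza yesterday." output in section 4.3.

```python
def merge_pieces(tokens, surprisals, specials=("<sep>", "<cls>")):
    """Merge SentencePiece pieces into words and sum their surprisals.

    A piece starting with "▁" opens a new word; other pieces continue the
    current word. A word whose pieces all lack a surprisal keeps None.
    """
    words = []  # each entry: [word text, summed surprisal or None]
    for tok, s in zip(tokens, surprisals):
        if tok in specials:
            continue
        if tok.startswith("▁") or not words:
            words.append([tok.lstrip("▁"), None])
        else:
            words[-1][0] += tok
        if s is not None:
            words[-1][1] = s if words[-1][1] is None else words[-1][1] + s
    return [(w, total) for w, total in words]

tokens = ["▁I", "▁", "ate", "▁a", "▁pizza", "▁yesterday", ".", "<sep>", "<cls>"]
surprisals = [None, 6.196, 8.890, 8.804, 11.336, 4.622, 7.837, 15.849, 14.513]
words = merge_pieces(tokens, surprisals)
for word, total in words:
    print(word, total)
```

Note that punctuation attaches to the preceding word under this convention, so "yesterday." is scored as one unit.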

###2. Masking the target token

In [29]:
def analyze_masked_sentence(masked_text):
    inputs = masked_tokenizer(masked_text, return_tensors="pt")
    mask_token_index = torch.where(inputs['input_ids'][0] == masked_tokenizer.mask_token_id)[0]

    with torch.no_grad():
        outputs = masked_model(**inputs)
        predictions = outputs.logits[0, mask_token_index]

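    # Get top 5 predictions
    top_5 = torch.topk(predictions, 5, dim=1)
    # Note: softmax is applied to the top-5 logits only, so these probabilities
    # are renormalized over five candidates, not the full vocabulary.
    probs = torch.softmax(top_5.values[0], dim=0)

    predicted_tokens = masked_tokenizer.convert_ids_to_tokens(top_5.indices[0])
    # Surprisal in nats, relative to the renormalized top-5 distribution
    surprisals = -torch.log(probs).numpy()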

    return pd.DataFrame({
        'predicted_token': predicted_tokens,
        'surprisal': surprisals
    })

# Test masked prediction
masked_sentence = "I ate a [MASK] yesterday."
mask_results = analyze_masked_sentence(masked_sentence)

In [30]:
print("\nTop 5 predictions for masked token:")
for _, row in mask_results.iterrows():
    print(f"Token: {row['predicted_token']}, Surprisal: {row['surprisal']:.3f}")


Top 5 predictions for masked token:
Token: lot, Surprisal: 0.391
Token: little, Surprisal: 1.756
Token: sandwich, Surprisal: 2.324
Token: fish, Surprisal: 3.627
Token: few, Surprisal: 3.628
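
Note that `analyze_masked_sentence` applies softmax to the top-5 logits only, so the surprisals above are relative to those five candidates. The toy comparison below shows how this differs from normalizing over the whole vocabulary; the six-word vocabulary and its logits are invented for illustration.

```python
import math

def softmax(xs):
    """Convert a list of logits into probabilities."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical logits over a toy six-word vocabulary
logits = {"lot": 5.0, "little": 3.6, "sandwich": 3.1, "fish": 1.8, "few": 1.8, "pizza": 1.5}
k = 3

vocab = list(logits)
full = dict(zip(vocab, softmax([logits[w] for w in vocab])))
top = sorted(vocab, key=lambda w: logits[w], reverse=True)[:k]
topk_only = dict(zip(top, softmax([logits[w] for w in top])))

for w in top:
    print(f"{w}: full={-math.log(full[w]):.3f}  top-{k} only={-math.log(topk_only[w]):.3f}")
```

Normalizing over fewer candidates inflates each probability, so the top-k surprisals are lower than the full-vocabulary ones. Applying softmax to all logits first and then taking the top k gives values comparable to the per-token surprisals computed elsewhere.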


### 3. Punctuation

In [57]:
punct_sentences = [
    # "I ate a pizza yesterday.",
    # "I ate a pizza, yesterday.",
    # "I ate a pizza yesterday; I enjoyed every bite.",
    # "I ate an apple? No way!",
    "The man hunted the deer ran into the woods.",
    "The man hunted, the deer ran into the woods."
]

####3.1 Causal Model

In [64]:
all_data = []
punct_results = []
for sentence in punct_sentences:
    df = calculate_surprisal_causal(sentence)
    df['sentence'] = sentence
    print(df)
    print("*" * 50)
    punct_results.append(df)

    # Extract data and append to all_data list
    for _, row in df.iterrows():
        all_data.append([row['token'], row['surprisal'], sentence])


punct_df = pd.DataFrame(all_data, columns=['Token', 'Surprisal', 'Sentence'])

     token  surprisal                                     sentence
0      The  12.220047  The man hunted the deer ran into the woods.
1      man   6.014305  The man hunted the deer ran into the woods.
2   hunted  11.088654  The man hunted the deer ran into the woods.
3      the   3.862102  The man hunted the deer ran into the woods.
4     deer   6.818150  The man hunted the deer ran into the woods.
5      ran  11.307253  The man hunted the deer ran into the woods.
6     into   2.734016  The man hunted the deer ran into the woods.
7      the   0.662579  The man hunted the deer ran into the woods.
8    woods   0.593332  The man hunted the deer ran into the woods.
9        .   2.190197  The man hunted the deer ran into the woods.
**************************************************
      token  surprisal                                      sentence
0       The  12.220047  The man hunted, the deer ran into the woods.
1       man   6.014305  The man hunted, the deer ran into the woods.
2    

####3.2 Masked Model

In [65]:
all_data = []
punct_results = []
for sentence in punct_sentences:
    df = calculate_surprisal_masked(sentence)
    df['sentence'] = sentence
    print(df)
    print("*" * 50)
    punct_results.append(df)

    # Extract data and append to all_data list
    for _, row in df.iterrows():
        all_data.append([row['token'], row['surprisal'], sentence])


punct_df = pd.DataFrame(all_data, columns=['Token', 'Surprisal', 'Sentence'])

     token  surprisal                                     sentence
0    [CLS]  14.793583  The man hunted the deer ran into the woods.
1      the   3.755283  The man hunted the deer ran into the woods.
2      man  20.501747  The man hunted the deer ran into the woods.
3   hunted  20.922817  The man hunted the deer ran into the woods.
4      the   8.661399  The man hunted the deer ran into the woods.
5     deer  18.019070  The man hunted the deer ran into the woods.
6      ran  16.659252  The man hunted the deer ran into the woods.
7     into  15.046646  The man hunted the deer ran into the woods.
8      the  12.250912  The man hunted the deer ran into the woods.
9    woods  13.686221  The man hunted the deer ran into the woods.
10       .  14.886926  The man hunted the deer ran into the woods.
11   [SEP]  31.439661  The man hunted the deer ran into the woods.
**************************************************
     token  surprisal                                      sentence
0    [CLS]

####3.3 Permutation Model

In [67]:
punct_results = []

for sentence in punct_sentences:
    tokens, surprisals = calculate_surprisal_permutation(sentence)
    df = pd.DataFrame({'token': tokens, 'surprisal': surprisals})
    df['sentence'] = sentence
    print(df)
    print("*" * 50)
    punct_results.append(df)

punct_df = pd.concat(punct_results, ignore_index=True)

      token  surprisal                                     sentence
0      ▁The        NaN  The man hunted the deer ran into the woods.
1      ▁man   9.656571  The man hunted the deer ran into the woods.
2   ▁hunted  15.746633  The man hunted the deer ran into the woods.
3      ▁the   5.620775  The man hunted the deer ran into the woods.
4     ▁deer  10.229841  The man hunted the deer ran into the woods.
5      ▁ran   7.447130  The man hunted the deer ran into the woods.
6     ▁into   6.161885  The man hunted the deer ran into the woods.
7      ▁the   6.603426  The man hunted the deer ran into the woods.
8    ▁woods  10.695705  The man hunted the deer ran into the woods.
9         .   3.174332  The man hunted the deer ran into the woods.
10    <sep>  15.432770  The man hunted the deer ran into the woods.
11    <cls>   9.393350  The man hunted the deer ran into the woods.
**************************************************
      token  surprisal                                      sente

### 4. First token

In [35]:
first_token_sentences = [
    "The cat sat on the mat.",
    "A dog ran in the park.",
    "My friend likes ice cream.",
    "Some birds fly south in winter."
]

####4.1 Causal Model

In [68]:
first_token_results = []
for sentence in first_token_sentences:
    df = calculate_surprisal_causal(sentence)
    # Index 1 is the second token; index 0 (the sentence-initial token) has no left context
    first_token = df.iloc[1]
    first_token_results.append({
        'sentence': sentence,
        'first_token': first_token['token'],
        'surprisal': first_token['surprisal']
    })

first_token_df = pd.DataFrame(first_token_results)

print(first_token_df)


                          sentence first_token  surprisal
0          The cat sat on the mat.         cat   9.115954
1           A dog ran in the park.         dog   8.534173
2       My friend likes ice cream.      friend   4.784204
3  Some birds fly south in winter.       birds   8.471863


####4.2 Masked Model

In [72]:
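first_token_results = []
# Note: this cell scores punct_sentences with calculate_surprisal_permutation (XLNet),
# so the table below shows XLNet values for every token rather than BERT first-token values.
for sentence in punct_sentences:
    tokens, surprisals = calculate_surprisal_permutation(sentence)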
    df = pd.DataFrame({'token': tokens, 'surprisal': surprisals})
    df['sentence'] = sentence
    first_token_results.append(df)

first_token_df = pd.concat(first_token_results, ignore_index=True)


print(first_token_df)

      token  surprisal                                      sentence
0      ▁The        NaN   The man hunted the deer ran into the woods.
1      ▁man   9.656571   The man hunted the deer ran into the woods.
2   ▁hunted  15.746633   The man hunted the deer ran into the woods.
3      ▁the   5.620775   The man hunted the deer ran into the woods.
4     ▁deer  10.229841   The man hunted the deer ran into the woods.
5      ▁ran   7.447130   The man hunted the deer ran into the woods.
6     ▁into   6.161885   The man hunted the deer ran into the woods.
7      ▁the   6.603426   The man hunted the deer ran into the woods.
8    ▁woods  10.695705   The man hunted the deer ran into the woods.
9         .   3.174332   The man hunted the deer ran into the woods.
10    <sep>  15.432770   The man hunted the deer ran into the woods.
11    <cls>   9.393350   The man hunted the deer ran into the woods.
12     ▁The        NaN  The man hunted, the deer ran into the woods.
13     ▁man  15.173221  The man hu

####4.3 Permutation Model

In [38]:
punct_results = []

for sentence in punct_sentences:
    tokens, surprisals = calculate_surprisal_permutation(sentence)
    df = pd.DataFrame({'token': tokens, 'surprisal': surprisals})
    df['sentence'] = sentence
    punct_results.append(df)

punct_df = pd.concat(punct_results, ignore_index=True)


print("\nPunctuation analysis results:")
for sentence in punct_sentences:
    print(f"\nSentence: {sentence}")
    sentence_data = punct_df[punct_df['sentence'] == sentence]
    for _, row in sentence_data.iterrows():
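        # pandas stores the None placeholder as NaN, so this check does not match and
        # the first token prints as "nan"; pd.isna(row['surprisal']) would detect it.
        if row['surprisal'] is None: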
            print(f"Token: {row['token']}, Surprisal: N/A")
        else:
            print(f"Token: {row['token']}, Surprisal: {row['surprisal']:.3f}")



Punctuation analysis results:

Sentence: I ate a pizza yesterday.
Token: ▁I, Surprisal: nan
Token: ▁, Surprisal: 6.196
Token: ate, Surprisal: 8.890
Token: ▁a, Surprisal: 8.804
Token: ▁pizza, Surprisal: 11.336
Token: ▁yesterday, Surprisal: 4.622
Token: ., Surprisal: 7.837
Token: <sep>, Surprisal: 15.849
Token: <cls>, Surprisal: 14.513

Sentence: I ate a pizza, yesterday.
Token: ▁I, Surprisal: nan
Token: ▁, Surprisal: 7.341
Token: ate, Surprisal: 12.163
Token: ▁a, Surprisal: 13.233
Token: ▁pizza, Surprisal: 10.901
Token: ,, Surprisal: 5.119
Token: ▁yesterday, Surprisal: 3.473
Token: ., Surprisal: 10.349
Token: <sep>, Surprisal: 16.040
Token: <cls>, Surprisal: 14.460

Sentence: I ate a pizza yesterday; I enjoyed every bite.
Token: ▁I, Surprisal: nan
Token: ▁, Surprisal: 7.920
Token: ate, Surprisal: 7.398
Token: ▁a, Surprisal: 8.778
Token: ▁pizza, Surprisal: 8.896
Token: ▁yesterday, Surprisal: 5.045
Token: ;, Surprisal: 12.790
Token: ▁I, Surprisal: 5.336
Token: ▁enjoyed, Surprisal: 11.743

### 5. Different Languages

In [39]:
model_name = "facebook/xglm-564M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def compute_surprisal_multilingual(sentence):
    # Tokenize the sentence and get input IDs
    input_ids = tokenizer.encode(sentence, return_tensors="pt")

    with torch.no_grad():
        # Run the model. Logits are returned regardless; labels only add a loss, which is unused here.
        outputs = model(input_ids, labels=input_ids)
    logits = outputs.logits

    # Shift logits and labels so that each prediction is aligned with the correct token.
    shifted_logits = logits[:, :-1, :]
    shifted_labels = input_ids[:, 1:]

    # Convert logits to log probabilities.
    log_probs = torch.log_softmax(shifted_logits, dim=-1)

    # For each token position (except the first), get the log probability of the correct token.
    token_log_probs = log_probs.gather(2, shifted_labels.unsqueeze(-1)).squeeze(-1)

    # Compute surprisal as the negative log probability.
    computed_surprisals = (-token_log_probs.squeeze()).tolist()

    # Convert the full input_ids to tokens.
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0])

    # Prepend a placeholder (None) for the first token, which has no computed surprisal.
    full_surprisals = [None] + computed_surprisals

    return tokens, full_surprisals


sentences = {
    'spanish_1': "Comí una pizza ayer.",                     # I ate a pizza yesterday.
    'spanish_2': "Me comí una pizza deliciosa ayer.",         # I ate a delicious pizza yesterday.
    'french_1': "J'ai mangé une pizza hier.",                 # I ate a pizza yesterday.
    'french_2': "J'ai savouré une pizza délicieuse hier."      # I savored a delicious pizza yesterday.
}

# Process each sentence and build a DataFrame with the token surprisal values.
results = []
for key, sentence in sentences.items():
    tokens, surprisals = compute_surprisal_multilingual(sentence)
    df = pd.DataFrame({'token': tokens, 'surprisal': surprisals})
    df['sentence'] = sentence
    # Use the language (spanish or french) based on the key.
    df['language'] = key.split('_')[0]
    results.append(df)

# Concatenate all individual DataFrames into one final DataFrame.
final_df = pd.concat(results, ignore_index=True)

# Optionally, print token-by-token surprisal for each sentence.
print("\nToken-by-token surprisal values per sentence:")
for key, sentence in sentences.items():
    print(f"\nSentence: {sentence}")
    sentence_data = final_df[final_df['sentence'] == sentence]
    for _, row in sentence_data.iterrows():
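        # pandas stores the None placeholder as NaN, so "N/A" is not printed;
        # pd.isna(row['surprisal']) would detect the missing value.
        surprisal_str = "N/A" if row['surprisal'] is None else f"{row['surprisal']:.3f}"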
        print(f"Token: {row['token']}, Surprisal: {surprisal_str}")



Token-by-token surprisal values per sentence:

Sentence: Comí una pizza ayer.
Token: </s>, Surprisal: nan
Token: ▁Com, Surprisal: 7.666
Token: í, Surprisal: 7.407
Token: ▁una, Surprisal: 3.117
Token: ▁pizza, Surprisal: 4.189
Token: ▁ayer, Surprisal: 11.494
Token: ., Surprisal: 2.453

Sentence: Me comí una pizza deliciosa ayer.
Token: </s>, Surprisal: nan
Token: ▁Me, Surprisal: 7.303
Token: ▁com, Surprisal: 8.683
Token: í, Surprisal: 6.469
Token: ▁una, Surprisal: 9.090
Token: ▁pizza, Surprisal: 3.698
Token: ▁delicios, Surprisal: 3.996
Token: a, Surprisal: 0.005
Token: ▁ayer, Surprisal: 9.851
Token: ., Surprisal: 1.803

Sentence: J'ai mangé une pizza hier.
Token: </s>, Surprisal: nan
Token: ▁J, Surprisal: 6.897
Token: ', Surprisal: 2.009
Token: ai, Surprisal: 0.697
Token: ▁mang, Surprisal: 7.612
Token: é, Surprisal: 0.239
Token: ▁une, Surprisal: 2.680
Token: ▁pizza, Surprisal: 4.618
Token: ▁hier, Surprisal: 7.304
Token: ., Surprisal: 3.227

Sentence: J'ai savouré une pizza délicieuse hi