# Problem 4: Contrastive Decoding

## Part 1: Conceptual Understanding

1. Contrastive Decoding (CD) uses two models: an expert (a larger, more capable LLM) and an amateur (a smaller LLM). At each decoding step, CD favors tokens that the expert scores as likely but the amateur does not. The intuition is that the amateur tends to produce generic, repetitive text, whereas the expert has the knowledge to favor specific, informative tokens. CD therefore explicitly penalizes tokens that the amateur also considers likely, so the output more reliably reflects specific, in-depth knowledge that is not available to the amateur.

Other decoding methods such as greedy search and beam search use the output distribution of a single model rather than two, so they have no contrastive component and are prone to the generic, repetitive text that LLMs tend to produce. CD addresses this problem by contrasting the expert with an amateur model that exhibits these unwanted qualities more strongly.
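
To make the contrast concrete, here is a minimal sketch with a made-up three-token vocabulary. The probabilities and token names are hypothetical and chosen only for illustration; they do not come from GPT-2.

```python
import math

# Hypothetical next-token probabilities for a tiny 3-token vocabulary.
expert = {"the": 0.50, "Hawaii": 0.40, "zebra": 0.10}
amateur = {"the": 0.70, "Hawaii": 0.05, "zebra": 0.25}

def contrastive_score(token):
    # log p_EXP(token) - log p_AMA(token) for one candidate token
    return math.log(expert[token]) - math.log(amateur[token])

# Greedy decoding looks only at the expert.
greedy_choice = max(expert, key=expert.get)
# Contrastive decoding prefers tokens the expert likes and the amateur does not.
cd_choice = max(expert, key=contrastive_score)

print("greedy:", greedy_choice)
print("contrastive:", cd_choice)
```

In this toy setup greedy decoding picks the generic token that both models like, while the contrastive score picks the token that only the expert rates highly.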


2. The objective from the paper:

$$
L_{\mathrm{CD}}(x_{\mathrm{cont}},\, x_{\mathrm{pre}})
= \log p_{\mathrm{EXP}}\!\left(x_{\mathrm{cont}} \mid x_{\mathrm{pre}}\right)
- \log p_{\mathrm{AMA}}\!\left(x_{\mathrm{cont}} \mid x_{\mathrm{pre}}\right)
$$


where x_pre is the preceding context (the prompt), x_cont is the continuation being scored (at a single decoding step, the candidate next token), p_EXP is the probability distribution of the expert, p_AMA is the probability distribution of the amateur, and L_CD is the resulting contrastive score.
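
Because both models are autoregressive, the chain rule lets this sequence-level objective be written as a sum of per-token terms, which is what a token-by-token decoder scores. A short sketch, where the notation $x_{\mathrm{cont}} = x_1, \dots, x_m$ and $x_{<i}$ (meaning $x_{\mathrm{pre}}$ followed by $x_1, \dots, x_{i-1}$) is introduced here for illustration:

$$
L_{\mathrm{CD}}(x_{\mathrm{cont}},\, x_{\mathrm{pre}})
= \sum_{i=1}^{m} \left[ \log p_{\mathrm{EXP}}\!\left(x_i \mid x_{<i}\right)
- \log p_{\mathrm{AMA}}\!\left(x_i \mid x_{<i}\right) \right]
$$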

Unconstrained maximization of this objective can lead to two kinds of errors. It can produce false positives, where a bad token is chosen over a good one simply because the amateur assigns the bad token a much lower probability than the expert does (so the log-ratio is larger for the bad token). It can also produce false negatives, where a good token is discarded because the amateur also assigns it a high probability. Hence, plain maximization has its downsides.
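
Both failure modes can be reproduced with a small sketch. The probabilities below are hypothetical: "zxq" stands for a token that both models consider implausible, and "Paris" for a correct token that both models like.

```python
import math

# Hypothetical next-token probabilities; each distribution sums to 1.
expert = {"Paris": 0.90, "London": 0.0999, "zxq": 1e-4}
amateur = {"Paris": 0.85, "London": 0.1499999, "zxq": 1e-7}

# Unconstrained contrastive score: log p_EXP - log p_AMA for every token.
scores = {tok: math.log(expert[tok]) - math.log(amateur[tok]) for tok in expert}
choice = max(scores, key=scores.get)

for tok, s in scores.items():
    print(f"{tok:>6}: expert={expert[tok]:.4f} score={s:+.3f}")
print("unconstrained choice:", choice)
```

Here the implausible token wins because the amateur rates it a thousand times lower than the expert does (a false positive), while the correct token scores close to zero because both models agree on it (a false negative).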


3. The plausibility constraint is the guardrail that prevents the above errors: it only considers next tokens that the expert model itself deems plausible, and only then uses the amateur penalty to drop the generic or bad ones. In particular, it prevents the ratio discrepancy described above, where a token that the expert considers unlikely is chosen just because the amateur assigns it an even tinier probability.

alpha sets the plausibility threshold: a token is kept only if its expert probability is at least alpha times the probability of the expert's most likely token. A low alpha admits many tokens, so the amateur penalty has more candidates to choose from (more specific and diverse text, but a higher risk of the false positives described above). A high alpha keeps only the expert's top tokens, and at alpha = 1 CD reduces to greedy decoding with the expert (safer, but more generic text).
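
The effect of the threshold can be sketched in a few lines of plain Python. The helper names `plausible_tokens` and `constrained_cd` and the probabilities are hypothetical, used only to illustrate the rule "keep a token if its expert probability is at least alpha times the expert's maximum probability".

```python
import math

def plausible_tokens(expert_probs, alpha):
    """Tokens whose expert probability is >= alpha * (max expert probability)."""
    cutoff = alpha * max(expert_probs.values())
    return {tok for tok, p in expert_probs.items() if p >= cutoff}

def constrained_cd(expert_probs, amateur_probs, alpha):
    """Pick the plausible token with the highest log p_EXP - log p_AMA."""
    candidates = plausible_tokens(expert_probs, alpha)
    return max(
        candidates,
        key=lambda t: math.log(expert_probs[t]) - math.log(amateur_probs[t]),
    )

# Hypothetical next-token probabilities; "zxq" is implausible under both models.
expert = {"Paris": 0.90, "London": 0.0999, "zxq": 1e-4}
amateur = {"Paris": 0.85, "London": 0.1499999, "zxq": 1e-7}

for alpha in (1e-5, 0.1, 1.0):
    kept = sorted(plausible_tokens(expert, alpha))
    print(f"alpha={alpha}: kept={kept}, choice={constrained_cd(expert, amateur, alpha)}")
```

With a very small alpha the implausible token survives the filter and wins on the log-ratio; with alpha = 0.1 it is filtered out before the amateur penalty is applied.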



## Part 2: Implementation and Empirical Evaluation

The GitHub repo was not cloned because it had dependency issues. Instead, I adapted and implemented my own decoder based on the paper. Please read the comments in the cells for details on functionality, the models used, and the seeds.

In [None]:
# Install required packages
#!pip install datasets transformers torch mauve-text evaluate nltk

In [None]:
import numpy as np
import pandas as pd
from tqdm import tqdm
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import nltk
from collections import Counter
import warnings
import mauve

# download nltk data for tokenization
nltk.download('punkt')
nltk.download('punkt_tab')

# random seeds for reproducibility
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

warnings.filterwarnings('ignore')

# check and print device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [None]:
# load the expert and amateur models

# expert: GPT-2 XL (1.5B params)
expert_model = AutoModelForCausalLM.from_pretrained("gpt2-xl").to(device)
expert_tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")

# amateur model: GPT-2 Small (117M params)
amateur_model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
amateur_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# set padding token
expert_tokenizer.pad_token = expert_tokenizer.eos_token
amateur_tokenizer.pad_token = amateur_tokenizer.eos_token

# eval mode (disables dropout) for generation
expert_model.eval()
amateur_model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [None]:
# load dataset and use test set
print("\nLoading WikiText-103 dataset...")
dataset = load_dataset("wikitext", "wikitext-103-v1")
test_data = dataset['test']

# filter out empty texts and select prompts for generation (first 30 tokens as the prompt)
def prepare_prompts(dataset, num_samples=50, min_length=50):
    prompts = []
    refs = [] # to store full original text

    for item in dataset:
        text = item['text'].strip()
        # skip headers and short texts
        if len(text) > min_length and not text.startswith('='):
            toks = expert_tokenizer.encode(text)
            if len(toks) > 50:  # sufficient length to have prompt and continuation
                # first 30 tokens as prompt
                prompt_toks = toks[:30]
                prompt_text = expert_tokenizer.decode(prompt_toks)
                prompts.append(prompt_text)
                refs.append(text)

                if len(prompts) >= num_samples:
                    break

    return prompts, refs

prompts, references = prepare_prompts(test_data, num_samples=50)
print(f"Prepared {len(prompts)} prompts for generation")
print(f"Example prompt: {prompts[0][:150]}...")


Loading WikiText-103 dataset...


Prepared 50 prompts for generation
Example prompt: Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in...


In [None]:
# implements contrastive decoding as described in Li et al.
# beta scales the amateur penalty; beta = 1 gives the plain log-ratio objective from Part 1
def contrastive_decoding(prompt, expert_model, amateur_model, expert_tokenizer, amateur_tokenizer, max_new_tokens=100, alpha=0.1, beta=0.5, amateur_temp=1.0, amateur_context_window=None, device='cuda'):
    # set to eval and encode prompt
    expert_model.eval()
    amateur_model.eval()
    input_ids = expert_tokenizer.encode(prompt, return_tensors='pt').to(device)


    # tokenized prompt/input
    generated = input_ids.clone()

    with torch.no_grad():
        for _ in range(max_new_tokens):
            # get expert logits (full context)
            expert_outputs = expert_model(generated)
            expert_logits = expert_outputs.logits[:, -1, :]  # extract last position - [1, vocab_size]
            expert_probs = torch.softmax(expert_logits, dim=-1) # convert to probs

            # amateur logits with modified context window
            if amateur_context_window is None:
                # use full context
                amateur_input = generated
            else:
                # restrict
                amateur_input = generated[:, -amateur_context_window:]

            # find logits, apply temp, and convert to probs
            amateur_outputs = amateur_model(amateur_input)
            amateur_logits = amateur_outputs.logits[:, -1, :] # extract last position - [1, vocab_size]
            amateur_logits = amateur_logits / amateur_temp
            amateur_probs = torch.softmax(amateur_logits, dim=-1)

            # adaptive plausibility constraint using the following V_head(context) = {x : P_expert(x|context) >= alpha * max P_expert(x'|context)}
            max_prob_expert = expert_probs.max()
            mask = (expert_probs >= alpha * max_prob_expert).float()

            # compute contrastive scores for tokens in the plausibility set and choose the highest
            contrastive_scores = torch.log(expert_probs + 1e-10) - beta * torch.log(amateur_probs + 1e-10) # 1e-10 added for numerical stability
            contrastive_scores = contrastive_scores * mask + (1 - mask) * (-1e10)
            next_tok = torch.argmax(contrastive_scores, dim=-1, keepdim=True)

            # append to generated sequence
            generated = torch.cat([generated, next_tok], dim=-1)

            # stop if EOS is generated
            if next_tok.item() == expert_tokenizer.eos_token_id:
                break

    generated_text = expert_tokenizer.decode(generated[0], skip_special_tokens=True)
    return generated_text

# quick test
test_prompt = prompts[0]
print(f"Testing contrastive decoding with prompt ---{test_prompt}\n")
generated = contrastive_decoding(test_prompt, expert_model, amateur_model, expert_tokenizer, amateur_tokenizer, max_new_tokens=50, device=device)
print(f"Generated output: {generated}")

Testing contrastive decoding with prompt ---Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in

Generated output: Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in the UK. He also had a ...


In [None]:
# Compute MAUVE score between generated and reference texts.
def compute_mauve_score(generated_texts, reference_texts, device='cuda'):

    # requires at least some samples
    if len(generated_texts) < 2 or len(reference_texts) < 2:
        return 0.0

    try:
        # compare on str(device) so that both 'cuda' and torch.device('cuda') select the GPU
        output = mauve.compute_mauve(p_text=reference_texts, q_text=generated_texts, device_id=0 if str(device).startswith('cuda') else -1, max_text_length=512, verbose=False, featurize_model_name="gpt2")
        return output.mauve
    except Exception as e:
        print(f"Error computing: {e}")
        return 0.0

In [None]:
# compute distinct-n metric for diversity evaluation.
def compute_distinct_n(texts, n=2):
    all_ngrams = []

    for text in texts:
        tokens = nltk.word_tokenize(text.lower())
        ngrams = list(zip(*[tokens[i:] for i in range(n)]))
        all_ngrams.extend(ngrams)

    if len(all_ngrams) == 0:
        return 0.0

    unique = len(set(all_ngrams)) # unique ngrams
    total = len(all_ngrams) # total ngrams

    return unique / total

In [None]:
# compute multiple diversity metrics for different ngrams
def compute_multiple_diversity_metrics(texts):
    return {
        'distinct-1': compute_distinct_n(texts, n=1),
        'distinct-2': compute_distinct_n(texts, n=2),
        'distinct-3': compute_distinct_n(texts, n=3),
    }

In [None]:
# compute perplexity as a coherence metric using the expert model.
def compute_perplexity(texts, model, tokenizer, device='cuda'):
    model.eval()
    total_loss = 0
    total_tokens = 0

    with torch.no_grad():
        for text in texts:
            # encode
            encodings = tokenizer(text, return_tensors='pt', truncation=True, max_length=512).to(device)
            input_ids = encodings.input_ids

            # Compute loss
            outputs = model(input_ids, labels=input_ids)
            loss = outputs.loss

            total_loss += loss.item() * input_ids.size(1)
            total_tokens += input_ids.size(1)

    avg_loss = total_loss / total_tokens
    perplexity = np.exp(avg_loss)

    return perplexity

In [None]:
# experimental configs
temperatures = [0.5, 1.0, 1.5]
max_context = 1024  # GPT-2's maximum context length
context_windows = [max_context, max_context // 2, 1]  # full, half, single token

# store them
configs = []
for temp in temperatures:
    for c in context_windows:
        c_name = "full" if c == max_context else f"{c}"
        configs.append({
            'temperature': temp,
            'context_window': c,
            'name': f"temp={temp}_ctx={c_name}"
        })

print(f"Total configurations: {len(configs)}")
#print(configs)
#list configs
for i, config in enumerate(configs):
    print(f"{i+1}. {config['name']}")

Total configurations: 9
1. temp=0.5_ctx=full
2. temp=0.5_ctx=512
3. temp=0.5_ctx=1
4. temp=1.0_ctx=full
5. temp=1.0_ctx=512
6. temp=1.0_ctx=1
7. temp=1.5_ctx=full
8. temp=1.5_ctx=512
9. temp=1.5_ctx=1


In [None]:
# run all experiments
results = []
all_generated_texts = {}

# use smaller subset for faster experimentation (increase for full evaluation)
num_samples = 30  # 30 samples per configuration (not specified how many to take in problem)
generation_len = 80  # generate 80 tokens per sample

print("\n" + "="*50)
print("RUNNING EXPERIMENTS")
print("="*50)

for config in configs:
    print(f"\n{'='*80}")
    print(f"{config['name']} with Temperature: {config['temperature']} and Context Window: {config['context_window']}")
    print(f"{'='*80}")

    generated_texts = []

    # generate
    print(f"Generating {num_samples} samples...")
    for i, prompt in enumerate(tqdm(prompts[:num_samples])):
        try:
            # alpha = 0.1 for the plausibility threshold; beta = 0.5 scales the amateur penalty
            generated = contrastive_decoding(prompt, expert_model, amateur_model, expert_tokenizer, amateur_tokenizer, max_new_tokens=generation_len, alpha=0.1, beta=0.5, amateur_temp=config['temperature'], amateur_context_window=config['context_window'], device=device)
            generated_texts.append(generated)
        except Exception as e:
            print(f"Error generating sample {i}: {e}")
            generated_texts.append(prompt)  # fallback to prompt

    # store all generated texts and compute metrics
    all_generated_texts[config['name']] = generated_texts

    print("Computing metrics...")

    diversity = compute_multiple_diversity_metrics(generated_texts)
    mauve_score = compute_mauve_score(generated_texts[:num_samples], references[:num_samples], device=device)
    perplexity = compute_perplexity(generated_texts, expert_model, expert_tokenizer, device=device)

    # store results
    result = {
        'Configuration': config['name'],
        'Temperature': config['temperature'],
        'Context Window': config['context_window'],
        'MAUVE': mauve_score,
        'Perplexity': perplexity,
        'Distinct-1': diversity['distinct-1'],
        'Distinct-2': diversity['distinct-2'],
        'Distinct-3': diversity['distinct-3'],
    }
    results.append(result)

    # live results
    print(f"Results: Distinct-2={diversity['distinct-2']:.4f}, MAUVE={mauve_score:.4f}, Perplexity={perplexity:.2f}")

print("\n" + "="*50)
print("ALL EXPERIMENTS DONE")
print("="*50)


RUNNING EXPERIMENTS

temp=0.5_ctx=full with Temperature: 0.5 and Context Window: 1024
Generating 30 samples...
100%|██████████| 30/30 [01:47<00:00,  3.57s/it]
Computing metrics...
`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.
Results: Distinct-2=0.7904, MAUVE=0.4290, Perplexity=13.05

temp=0.5_ctx=512 with Temperature: 0.5 and Context Window: 512
Generating 30 samples...
100%|██████████| 30/30 [01:47<00:00,  3.58s/it]
Computing metrics...
Results: Distinct-2=0.7904, MAUVE=0.4290, Perplexity=13.05

temp=0.5_ctx=1 with Temperature: 0.5 and Context Window: 1
Generating 30 samples...
100%|██████████| 30/30 [01:51<00:00,  3.71s/it]
Computing metrics...
Results: Distinct-2=0.4843, MAUVE=0.3647, Perplexity=6.98

temp=1.0_ctx=full with Temperature: 1.0 and Context Window: 1024
Generating 30 samples...
100%|██████████| 30/30 [01:49<00:00,  3.65s/it]
Computing metrics...
Results: Distinct-2=0.6442, MAUVE=0.6326, Perplexity=8.10

temp=1.0_ctx=512 with Temperature: 1.0 and Context Window: 512
Generating 30 samples...
100%|██████████| 30/30 [01:50<00:00,  3.67s/it]
Computing metrics...
Results: Distinct-2=0.6442, MAUVE=0.6326, Perplexity=8.10

temp=1.0_ctx=1 with Temperature: 1.0 and Context Window: 1
Generating 30 samples...
100%|██████████| 30/30 [01:52<00:00,  3.73s/it]
Computing metrics...
Results: Distinct-2=0.4834, MAUVE=0.2399, Perplexity=6.50

temp=1.5_ctx=full with Temperature: 1.5 and Context Window: 1024
Generating 30 samples...
100%|██████████| 30/30 [01:49<00:00,  3.64s/it]
Computing metrics...
Results: Distinct-2=0.5659, MAUVE=0.4642, Perplexity=7.14

temp=1.5_ctx=512 with Temperature: 1.5 and Context Window: 512
Generating 30 samples...
100%|██████████| 30/30 [01:50<00:00,  3.67s/it]
Computing metrics...
Results: Distinct-2=0.5659, MAUVE=0.4642, Perplexity=7.14

temp=1.5_ctx=1 with Temperature: 1.5 and Context Window: 1
Generating 30 samples...
100%|██████████| 30/30 [01:54<00:00,  3.83s/it]
Computing metrics...
Results: Distinct-2=0.5060, MAUVE=0.4642, Perplexity=6.55

ALL EXPERIMENTS DONE


In [None]:
# final results displayed
results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))

    Configuration  Temperature  Context Window    MAUVE  Perplexity  Distinct-1  Distinct-2  Distinct-3
temp=0.5_ctx=full          0.5            1024 0.429044   13.049411    0.337901    0.790376    0.937339
 temp=0.5_ctx=512          0.5             512 0.429044   13.049411    0.337901    0.790376    0.937339
   temp=0.5_ctx=1          0.5               1 0.364712    6.978471    0.217611    0.484322    0.601240
temp=1.0_ctx=full          1.0            1024 0.632609    8.097668    0.275421    0.644231    0.805909
 temp=1.0_ctx=512          1.0             512 0.632609    8.097668    0.275421    0.644231    0.805909
   temp=1.0_ctx=1          1.0               1 0.239914    6.498304    0.215473    0.483373    0.593349
temp=1.5_ctx=full          1.5            1024 0.464198    7.144456    0.237052    0.565859    0.732339
 temp=1.5_ctx=512          1.5             512 0.464198    7.144456    0.237052    0.565859    0.732339
   temp=1.5_ctx=1          1.5               1 0.464198    6.548

## Best Configs and Analysis:

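The results for the different configs are shown above, and the best-performing config for each metric is shown below. Note that the full (1024) and 512-token context windows give identical results at every temperature: each sequence is at most 110 tokens long (a 30-token prompt plus 80 generated tokens), so a 512-token window never truncates the amateur's input. The only effective comparison is therefore full context versus a single token.

Diversity (distinct-2) is generally higher with a lower temperature and a larger context window. This makes sense: a lower temperature and more context make the amateur more confident in its generic tokens, so the penalty removes those tokens more decisively and the expert's output is more varied. Perplexity follows the same numeric pattern (although a lower perplexity is better). This is not surprising, since perplexity goes down with more repetition and lower diversity (the expert model naturally assigns a higher probability to repeated text). This is the diversity-coherence tradeoff. MAUVE, which measures how closely the generated text matches human text, is lowest for the most repetitive setting (0.240 at temp=1.0, ctx=1), which is expected because humans generally do not repeat text. It does not simply increase with diversity, however: it peaks at temp=1.0 with full context (0.633) rather than at the most diverse setting (0.429 at temp=0.5), and with only 30 samples per config these MAUVE estimates should be read as rough. Limiting what the amateur can see (reducing the context window to a single token) reduces diversity, because the amateur is less certain about what to predict; this has the same effect as a high temperature (an unsure amateur's probability distribution is more spread out).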
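
The temperature argument above can be illustrated with a small sketch: dividing logits by a temperature below 1 sharpens the amateur's distribution, and a temperature above 1 flattens it. The logits and the helper names `softmax_with_temperature` and `entropy` are hypothetical, and the sketch uses plain Python rather than the notebook's `torch.softmax`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, shifted by the max for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a more spread-out distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

amateur_logits = [3.0, 1.0, 0.0, -1.0]  # hypothetical amateur logits

for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(amateur_logits, t)
    print(f"T={t}: top prob={max(probs):.3f}, entropy={entropy(probs):.3f}")
```

A sharper amateur concentrates its penalty on a few tokens, while a flatter amateur spreads a weaker penalty across the vocabulary, which is one way to read the drop in distinct-2 as the temperature rises.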

In [None]:
# best configs for each metric
print("\n" + "="*50)
print("BEST CONFIGURATIONS PER METRIC")
print("="*50)

# diversity (Distinct-2)
best_diversity = results_df.loc[results_df['Distinct-2'].idxmax()]
print(f"\nBest for Diversity (Distinct-2):")
print(f"  Configuration: {best_diversity['Configuration']}")
print(f"  Distinct-2: {best_diversity['Distinct-2']:.4f}")

# mauve
best_mauve = results_df.loc[results_df['MAUVE'].idxmax()]
print(f"\nBest for MAUVE:")
print(f"  Configuration: {best_mauve['Configuration']}")
print(f"  MAUVE: {best_mauve['MAUVE']:.4f}")

# coherence (lowest perplexity)
best_coherence = results_df.loc[results_df['Perplexity'].idxmin()]
print(f"\nBest for Coherence (Lowest Perplexity):")
print(f"  Configuration: {best_coherence['Configuration']}")
print(f"  Perplexity: {best_coherence['Perplexity']:.2f}")

# worst diversity
worst_diversity = results_df.loc[results_df['Distinct-2'].idxmin()]
print(f"\nWorst for Diversity (Distinct-2):")
print(f"  Configuration: {worst_diversity['Configuration']}")
print(f"  Distinct-2: {worst_diversity['Distinct-2']:.4f}")


BEST CONFIGURATIONS PER METRIC

Best for Diversity (Distinct-2):
  Configuration: temp=0.5_ctx=full
  Distinct-2: 0.7904

Best for MAUVE:
  Configuration: temp=1.0_ctx=full
  MAUVE: 0.6326

Best for Coherence (Lowest Perplexity):
  Configuration: temp=1.0_ctx=1
  Perplexity: 6.50

Worst for Diversity (Distinct-2):
  Configuration: temp=1.0_ctx=1
  Distinct-2: 0.4834


## Qualitative Examples and Analysis:

Shown below are three examples each for the best and worst configs in terms of diversity. The worst-config examples clearly show much more repetition and less diversity.
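
To show how repetition lowers distinct-2, here is a minimal sketch on two made-up sentences. It mirrors the n-gram logic of `compute_distinct_n` but, as a simplifying assumption, splits on whitespace instead of calling `nltk.word_tokenize`, so that it runs without extra downloads. The example sentences are invented, not model outputs.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across all texts."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()  # assumption: whitespace split, not nltk
        ngrams.extend(zip(*[tokens[i:] for i in range(n)]))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Invented examples: one sentence loops, the other does not.
repetitive = ["he was a guest on the show and he was a guest on the show"]
varied = ["he later joined a touring theatre company in northern england"]

print(f"repetitive distinct-2: {distinct_n(repetitive):.3f}")
print(f"varied distinct-2:     {distinct_n(varied):.3f}")
```

The looping sentence has 14 bigrams but only 8 unique ones, whereas every bigram in the varied sentence is unique.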

In [None]:
print("\n" + "="*50)
print("QUALITATIVE EXAMPLES")
print("="*50)

# Get best and worst configurations based on Distinct-2
best_config_name = best_diversity['Configuration']
worst_config_name = worst_diversity['Configuration']

best_texts = all_generated_texts[best_config_name]
worst_texts = all_generated_texts[worst_config_name]
num_examples = 3

print(f"\n{'='*50}")
print(f"Best Configuration: {best_config_name}")
print(f"Distinct-2: {best_diversity['Distinct-2']:.4f}")
print(f"{'='*50}")

for i in range(num_examples):
    print(f"\n--- Example {i+1} ---")
    print(f"Prompt: {prompts[i][:80]}...")
    print(f"\nGenerated Text:")
    # show only the generated part (remove prompt)
    generated_only = best_texts[i][len(prompts[i]):]
    print(generated_only[:300] + "..." if len(generated_only) > 300 else generated_only)

print(f"\n{'='*50}")
print(f"Worst Configuration: {worst_config_name}")
print(f"Distinct-2: {worst_diversity['Distinct-2']:.4f}")
print(f"{'='*50}")

for i in range(num_examples):
    print(f"\n--- Example {i+1} ---")
    print(f"Prompt: {prompts[i][:80]}...")
    print(f"\nGenerated Text:")
    # show only the generated part (remove prompt)
    generated_only = worst_texts[i][len(prompts[i]):]
    print(generated_only[:300] + "..." if len(generated_only) > 300 else generated_only)




QUALITATIVE EXAMPLES

Best Configuration: temp=0.5_ctx=full
Distinct-2: 0.7904

--- Example 1 ---
Prompt: Robert Boulter is an English film , television and theatre actor . He had a gues...

Generated Text:
 the episode "A Date for Bill". He played the role of Captain Michael Forge on Star Trek: Enterprise from September 2009 until its end in May 2013. In April 2012 , he portrayed a Starfleet Captain in an episode of the science fiction comedy web-series Redshirt . Boulter later guest-starred in an epi...

--- Example 2 ---
Prompt: In 2006 , Boulter starred alongside Whishaw in the play Citizenship written by M...

Generated Text:
 comedy The Office entitled "The Meeting", playing the character of Simon Boulter's best friend.


Whishaw recently starred opposite Jude Law and Helena Bonham Carter in The Hundred-Foot Journey, a biographical drama film written and directed by Whishaw and released in the UK on September 21st. Whis...

--- Example 3 ---
Prompt: In 2000 Boulter had a guest @