Andrew Marasco \
*Automated Dialogue Summarization for Messaging Platform* \
Flatiron School Capstone Project #2 \
January, 2026

NOTE: This notebook was developed, and the model trained, on a Google Colab GPU runtime so that the run can be reproduced.




Environment: Google Colab (GPU) \
Core model: BERT encoder + GPT-2 decoder (EncoderDecoderModel) \
Dataset subset size: train ~1,000–2,000 examples; validation ~200; test ~200 \
Max lengths: dialogue 256–512 tokens; summary ~64 tokens

### Status 1/29/26
Step 1 complete \
Step 2 complete \
Step 3.5 in progress

In [None]:
import torch
torch.cuda.is_available()

True

## **Step 1: Dataset Exploration and Preparation**

1.1: Loading Dataset + Inspecting Structure

In [1]:
!pip -q install -U transformers datasets evaluate accelerate rouge_score sentencepiece



In [2]:
import datasets, huggingface_hub
print("datasets:", datasets.__version__)
print("huggingface_hub:", huggingface_hub.__version__)


datasets: 4.5.0
huggingface_hub: 1.3.5


In [3]:
import random
import numpy as np
import pandas as pd
import torch

from datasets import load_dataset
import evaluate

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [4]:
dataset = load_dataset("knkarthick/samsum")
print(dataset)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14731
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
})


In [5]:
print({k: len(v) for k, v in dataset.items()})
print("Columns:", dataset["train"].column_names)

{'train': 14731, 'validation': 818, 'test': 819}
Columns: ['id', 'dialogue', 'summary']


1.2: Inspecting a few examples

In [6]:
def show_example(split="train", idx=None):
    import random
    if idx is None:
        idx = random.randint(0, len(dataset[split]) - 1)
    ex = dataset[split][idx]
    print(f"Split: {split} | Index: {idx}")
    print("\n--- DIALOGUE ---")
    print(ex["dialogue"])
    print("\n--- SUMMARY (target) ---")
    print(ex["summary"])
    return ex

_ = show_example("train")
_ = show_example("train")
_ = show_example("validation")


Split: train | Index: 10476

--- DIALOGUE ---
David: The new movie of Jonhy English has come out, have you seen it?
Patricia: No but I have been meaning to go tough. I heard it's hilarious.
David: Rowan Atkison is just awesome, love that guy! In Mr. Bean I would just laugh so hard ahaha
Patricia: Me too 😂 I couldn't watch some scenes sometimes cause they would make me nervous from all the constant crap he did ahhaha
David: ahahaa xD  Anyway.. wanna go to the 21:40 session today? I ain't got much going on so..
Patricia: Sure! Where are you having dinner?
David: Was thinking of just ordering a pizza, you have any ideas?
Patricia: There's a new Mexican place and they do take out's, want me to grab something and meet you at your place?
David: Oh that's what I'm talking about! Bring me 2 chicken burritos and nachoooos with guacamole.
Patricia: Anything else for the little boy? ahaha xD
David: While you're at it a coke would do 😂
Patricia: Jesus.. x) Leaving my place now, cya in a bit.

--- 

1.3: Analyzing Characteristics (distribution of length)

In [7]:
def length_stats(split="train", n=2000):
  n = min(n, len(dataset[split]))
  sample = dataset[split].select(range(n))
  df = pd.DataFrame({
      "dialogue_words": [len(x.split()) for x in sample["dialogue"]],
      "summary_words": [len(x.split()) for x in sample["summary"]],
  })
  return df.describe(percentiles=[.5, .8, .9, .95, .99])

stats = length_stats("train", n=2000)
stats

Unnamed: 0,dialogue_words,summary_words
count,2000.0,2000.0
mean,95.3685,20.521
std,73.369355,11.365524
min,7.0,1.0
50%,75.0,18.0
80%,142.0,30.0
90%,194.0,37.0
95%,248.05,44.0
99%,344.06,53.0
max,471.0,60.0
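
The table above counts whitespace-separated words, while `MAX_INPUT_LEN` and `MAX_TARGET_LEN` are token budgets, and subword tokenizers usually produce more tokens than words. A minimal sketch of how one might check what share of examples fits under a given budget; the helper `share_within_budget` and the sample lengths are hypothetical and only illustrate the calculation:

```python
def share_within_budget(lengths, budget):
    """Return the fraction of examples whose length is at most `budget`."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return sum(1 for n in lengths if n <= budget) / len(lengths)

# Hypothetical lengths for illustration only (not measured on SAMSum).
example_lengths = [7, 40, 75, 142, 194, 248, 344, 471]

for budget in (128, 256, 512):
    print(budget, share_within_budget(example_lengths, budget))
```

Running the same check on tokenized lengths (for example `len(enc_tok(text)["input_ids"])`) rather than word counts would give a more direct picture of how much truncation a budget causes.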


1.4: Creating Training/Validation Splits for Project \
Note: SAMSum already ships with train/validation/test splits, so this step only draws shuffled subsets from them; it is kept because the project instructions call for it.

In [8]:
SEED = 42
TRAIN_SIZE = 2000
VAL_SIZE = 300

train_ds = dataset["train"].shuffle(seed=SEED).select(range(TRAIN_SIZE))
val_ds   = dataset["validation"].shuffle(seed=SEED).select(range(VAL_SIZE))

print(len(train_ds), len(val_ds))

2000 300


1.5: Tokenization Setup using BERT encoder and GPT-2 Decoder

In [9]:
from transformers import AutoTokenizer

encoder_name = "bert-base-uncased"
decoder_name = "gpt2"

enc_tok = AutoTokenizer.from_pretrained(encoder_name)
dec_tok = AutoTokenizer.from_pretrained(decoder_name)

# GPT-2 has no dedicated pad token, so reuse EOS for padding
dec_tok.pad_token = dec_tok.eos_token

MAX_INPUT_LEN = 512
MAX_TARGET_LEN = 64


In [10]:
def preprocess_batch(batch):
  # encode dialogue (encoder input)
  enc = enc_tok(
      batch["dialogue"],
      truncation=True,
      padding="max_length",
      max_length=MAX_INPUT_LEN,
  )

  # encoding summary (decoder labels)
  dec = dec_tok(
      batch["summary"],
      truncation=True,
      padding="max_length",
      max_length=MAX_TARGET_LEN,
  )

  enc["labels"] = dec["input_ids"]
  return enc
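
`preprocess_batch` stores the padded summary IDs directly as labels, so padded positions also contribute to the loss. Because the GPT-2 pad token is the EOS token here, padding cannot be identified by token ID alone; one common refinement is to use the attention mask and set padded label positions to `-100`, the value PyTorch's cross-entropy loss ignores by default. The sketch below is an optional variation, not what the notebook runs; `mask_padding` and the token IDs are hypothetical:

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Replace label IDs at padded positions (mask == 0) with `ignore_index`."""
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]

# Hypothetical batch of one summary: three real tokens followed by two pads.
ids = [[318, 257, 1332, 50256, 50256]]
mask = [[1, 1, 1, 0, 0]]

print(mask_padding(ids, mask))
```

Inside `preprocess_batch` this would correspond to `enc["labels"] = mask_padding(dec["input_ids"], dec["attention_mask"])`. Note that it also masks the padding EOS tokens, so an explicit EOS would have to be appended to each summary if the model is expected to learn when to stop.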

In [11]:
train_tok = train_ds.map(preprocess_batch, batched=True, remove_columns=train_ds.column_names)
val_tok = val_ds.map(preprocess_batch, batched=True, remove_columns=val_ds.column_names)

train_tok.set_format(type="torch")
val_tok.set_format(type="torch")

train_tok[0]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

{'input_ids': tensor([  101,  4205,  1024,  1026,  5371,  1035,  2678,  1028,  4205,  1024,
          2054,  2079,  2017,  2228, 10590,  1024,  2507,  2033,  1037, 10819,
         10590,  1024,  7929,  3666,  4205,  1024,  2292,  2033,  2113, 10590,
          1024,  2064,  1005,  1056,  2428,  2963,  1037,  2843,  2045,  1025,
          1013,  4205,  1024,  3398,  1025,  1013,  4205,  1024,  1045,  2228,
          1045,  2342,  2000,  2501,  2009,  5064,  2842,  4205,  1024,  2672,
          2083,  1996,  8278,  1998,  4007, 10590,  1024,  2008,  5791,  2003,
          1037,  2307,  2801,   999, 10590,  1024,  1045,  3984,  2008,  1005,
          1055,  2339,  1045,  2435,  2017,  1996,  8278,  1998,  5361,  2009,
          1024,  1040,  4205,  1024,  3398,  1060,  2094,  4205,  1024,  7929,
          1045,  1005,  2222,  3046,  2000,  3275,  2009,  2041,  2101, 10590,
          1024,  7929, 10590,  1024,  1045,  1005,  2222,  2022,  3403,  1024,
          1052,   102,     0,     0,   

1.6: Building DataLoaders for efficient model training

In [12]:
from torch.utils.data import DataLoader

BATCH_SIZE = 8

train_loader = DataLoader(train_tok, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_tok, batch_size=BATCH_SIZE, shuffle=False)

batch = next(iter(train_loader))
{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([8, 512]),
 'token_type_ids': torch.Size([8, 512]),
 'attention_mask': torch.Size([8, 512]),
 'labels': torch.Size([8, 64])}

## **Step 2: Model Architecture Implementation**



2.1: Creating the Encoder-Decoder Model (BERT -> GPT2)

In [13]:
from transformers import EncoderDecoderModel

encoder_name = "bert-base-uncased"
decoder_name = "gpt2"

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    encoder_name,
    decoder_name
)

BertModel LOAD REPORT from: bert-base-uncased
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
cls.predictions.bias                       | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED |  | 
cls.predictions.transform.dense.bias       | UNEXPECTED |  | 
cls.seq_relationship.bias                  | UNEXPECTED |  | 
cls.predictions.transform.dense.weight     | UNEXPECTED |  | 
cls.seq_relationship.weight                | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


GPT2LMHeadModel LOAD REPORT from: gpt2
Key                                                 | Status     | 
----------------------------------------------------+------------+-
h.{0...11}.attn.bias                                | UNEXPECTED | 
transformer.h.{0...11}.ln_cross_attn.bias           | MISSING    | 
transformer.h.{0...11}.crossattention.c_proj.weight | MISSING    | 
transformer.h.{0...11}.crossattention.c_proj.bias   | MISSING    | 
transformer.h.{0...11}.crossattention.c_attn.weight | MISSING    | 
transformer.h.{0...11}.crossattention.q_attn.bias   | MISSING    | 
transformer.h.{0...11}.ln_cross_attn.weight         | MISSING    | 
transformer.h.{0...11}.crossattention.c_attn.bias   | MISSING    | 
transformer.h.{0...11}.crossattention.q_attn.weight | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider


2.2: Configuring Special Tokens (EOS/PAD/start tokens)

In [14]:
# decoder (GPT2) special token IDs
model.config.eos_token_id = dec_tok.eos_token_id
model.config.pad_token_id = dec_tok.pad_token_id

# Start token for decoding:
model.config.decoder_start_token_id = dec_tok.eos_token_id
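
The start and pad token IDs matter during training because, when only `labels` are passed, the decoder inputs are derived from them by shifting one position to the right. A minimal pure-Python sketch of that idea, assuming the usual shift-right convention (prepend the start token, drop the last label, replace any `-100` with the pad ID); the function below is an illustration, not the library's code:

```python
def shift_tokens_right(labels, pad_token_id, decoder_start_token_id):
    """Build decoder inputs from labels by shifting one position to the right."""
    shifted = []
    for row in labels:
        # Prepend the start token and drop the final label.
        new_row = [decoder_start_token_id] + list(row[:-1])
        # Ignored label positions (-100) are not valid input IDs; use the pad ID.
        new_row = [pad_token_id if tok == -100 else tok for tok in new_row]
        shifted.append(new_row)
    return shifted

# Hypothetical label rows; 50256 is the GPT-2 EOS ID used above for start and pad.
labels = [[10, 11, 12, 13], [10, -100, -100, -100]]
print(shift_tokens_right(labels, pad_token_id=50256, decoder_start_token_id=50256))
```

At each position the decoder sees the previous target token and is trained to predict the current one, which is why generation must begin from the same start token.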

In [15]:
# Force generation_config to match model.config for generation-critical tokens

model.generation_config.decoder_start_token_id = model.config.decoder_start_token_id
model.generation_config.eos_token_id = model.config.eos_token_id
model.generation_config.pad_token_id = model.config.pad_token_id

print("config decoder_start_token_id:", model.config.decoder_start_token_id)
print("gen_config decoder_start_token_id:", model.generation_config.decoder_start_token_id)

config decoder_start_token_id: 50256
gen_config decoder_start_token_id: 50256


In [16]:
# Re-apply the start/EOS/PAD IDs to generation_config (repeats the sync from the previous cell)
gen_cfg = model.generation_config

gen_cfg.decoder_start_token_id = model.config.decoder_start_token_id
gen_cfg.eos_token_id = model.config.eos_token_id
gen_cfg.pad_token_id = model.config.pad_token_id

model.generation_config = gen_cfg

2.3: Setting generation defaults (beam search, length, etc.)

In [17]:
from transformers import GenerationConfig

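# This replaces the existing generation_config, so the start token must be set again here
model.generation_config = GenerationConfig(
    max_new_tokens=64,
    num_beams=4,
    early_stopping=True,
    length_penalty=1.0,
    decoder_start_token_id=model.config.decoder_start_token_id,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.pad_token_id,
)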
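
To make the role of `length_penalty` concrete: in beam search, finished hypotheses are commonly ranked by their summed log-probability divided by the length raised to `length_penalty`. The sketch below assumes that scoring rule; `beam_score` and the numbers are hypothetical and chosen only to show the effect:

```python
def beam_score(sum_logprobs, length, length_penalty=1.0):
    """Length-normalized score: summed log-probability / length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)

# Two hypothetical finished beams: a short one and a longer one.
short = {"sum_logprobs": -4.0, "length": 4}
long_ = {"sum_logprobs": -6.0, "length": 8}

for penalty in (0.0, 1.0):
    s = beam_score(short["sum_logprobs"], short["length"], penalty)
    l = beam_score(long_["sum_logprobs"], long_["length"], penalty)
    print(penalty, "short" if s > l else "long")
```

With a penalty of 0.0 the raw sums are compared and the shorter beam wins; with 1.0 the scores become per-token averages and the longer beam wins, which is why larger values tend to favor longer summaries.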

2.4: Move Model to GPU

In [18]:
model = model.to(device)
device

'cuda'

2.5: Forward Pass Check

In [19]:
batch = next(iter(train_loader))
batch = {k: v.to(device) for k, v in batch.items()}

out = model(**batch)
# Detach before converting to a Python float; converting a tensor that requires grad raises a warning
print("Loss:", out.loss.detach().item())
print("Logits shape:", tuple(out.logits.shape))


Loss: 6.997280120849609
Logits shape: (8, 64, 50257)
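
One way to read the initial loss is to compare it with the cross-entropy of a uniform guess over the decoder vocabulary, which is the natural log of the vocabulary size. A small sketch of that reference value, assuming the vocabulary size of 50257 taken from the logits shape:

```python
import math

VOCAB_SIZE = 50257  # last dimension of the logits shape printed above

# Cross-entropy (in nats) of predicting every token with equal probability.
uniform_loss = math.log(VOCAB_SIZE)
print(round(uniform_loss, 2))
```

The observed loss of about 7.0 sits below this reference of roughly 10.82, which is what one might expect from a pretrained GPT-2 decoder whose cross-attention weights are still randomly initialized.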


2.6: Building Simple Inference Function (Prototype 1)

In [58]:
def generate_summary(dialogue: str, num_beams=4, max_new_tokens=64):
  inputs = enc_tok(
      dialogue,
      return_tensors="pt",
      truncation=True,
      padding="max_length",
      max_length=MAX_INPUT_LEN
  ).to(device)

  summary_ids = model.generate(
      input_ids=inputs["input_ids"],
      attention_mask=inputs["attention_mask"],
      num_beams=num_beams,
      max_new_tokens=max_new_tokens,
      no_repeat_ngram_size=3,
      repetition_penalty=1.2,
      pad_token_id=model.config.pad_token_id,
      eos_token_id=model.config.eos_token_id,
      decoder_start_token_id=model.config.decoder_start_token_id
  )

  return dec_tok.decode(summary_ids[0], skip_special_tokens=True)
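
The `no_repeat_ngram_size=3` argument prevents any 3-gram from appearing twice in the output. A minimal sketch of the idea, assuming the constraint works by banning every token that would complete an n-gram already present in the generated sequence; `banned_next_tokens` is a hypothetical helper, not the library's implementation:

```python
def banned_next_tokens(generated, n):
    """Return tokens that would repeat an n-gram already present in `generated`."""
    if n <= 0 or len(generated) < n - 1:
        return set()
    # The last n-1 tokens form the prefix of the n-gram being completed.
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned

# Hypothetical token IDs: the sequence ends in (1, 2), and (1, 2, 3) was already seen.
print(banned_next_tokens([1, 2, 3, 1, 2], n=3))
```

Here token 3 is banned at the next step because emitting it would repeat the trigram (1, 2, 3).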

Testing on a real sample

In [59]:
print("decoder_start_token_id:", model.config.decoder_start_token_id)
print("eos_token_id:", model.config.eos_token_id)
print("pad_token_id:", model.config.pad_token_id)


decoder_start_token_id: 50256
eos_token_id: 50256
pad_token_id: 50256


In [60]:
# Explicitly sync generation_config with model.config
model.generation_config.decoder_start_token_id = model.config.decoder_start_token_id
model.generation_config.eos_token_id = model.config.eos_token_id
model.generation_config.pad_token_id = model.config.pad_token_id

# Sanity check: both configs should report the same decoder start token
print("config decoder_start_token_id:", model.config.decoder_start_token_id)
print("gen_config decoder_start_token_id:", model.generation_config.decoder_start_token_id)



config decoder_start_token_id: 50256
gen_config decoder_start_token_id: 50256


In [61]:
ex = dataset["validation"][0]
print(ex["dialogue"])
print("\n---model output (untrained) ---")
print(generate_summary(ex["dialogue"]))
print("\n--- target summary ---")
print(ex["summary"])

A: Hi Tom, are you busy tomorrow’s afternoon?
B: I’m pretty sure I am. What’s up?
A: Can you go with me to the animal shelter?.
B: What do you want to do?
A: I want to get a puppy for my son.
B: That will make him so happy.
A: Yeah, we’ve discussed it many times. I think he’s ready now.
B: That’s good. Raising a dog is a tough issue. Like having a baby ;-) 
A: I'll get him one of those little dogs.
B: One that won't grow up too big;-)
A: And eat too much;-))
B: Do you know which one he would like?
A: Oh, yes, I took him there last Monday. He showed me one that he really liked.
B: I bet you had to drag him away.
A: He wanted to take it home right away ;-).
B: I wonder what he'll name it.
A: He said he’d name it after his dead hamster – Lemmy  - he's  a great Motorhead fan :-)))

---model output (untrained) ---
Sophie is going to buy a new dress for her birthday. She will wear it on her birthday party. Sophie wants to go shopping with her friends this weekend. 

--- target summary ---
A 

## **Step 3: Training and Optimization**

3.1: Defining Evaluation Metric (ROUGE)

In [63]:
import evaluate
rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
  preds, labels = eval_pred

  # Some configurations return a tuple; the first element holds the predictions
  if isinstance(preds, tuple):
    preds = preds[0]

  # Convert to NumPy arrays
  preds = np.array(preds)
  labels = np.array(labels)

  # If preds are logits (batch, seq_len, vocab), which is rare with
  # predict_with_generate, convert them to token IDs
  if preds.ndim == 3:
    preds = preds.argmax(axis=-1)

  # Ensure integer type
  preds = preds.astype(np.int64)
  labels = labels.astype(np.int64)

  # Replace -100 (ignored label positions) with the pad token ID so labels can be decoded
  labels[labels == -100] = dec_tok.pad_token_id

  # Clip to the valid token range to avoid decode overflow
  vocab_size = dec_tok.vocab_size
  preds = np.clip(preds, 0, vocab_size - 1)
  labels = np.clip(labels, 0, vocab_size - 1)

  decoded_preds = dec_tok.batch_decode(preds, skip_special_tokens=True)
  decoded_labels = dec_tok.batch_decode(labels, skip_special_tokens=True)

  result = rouge.compute(
      predictions=decoded_preds,
      references=decoded_labels
  )

  # Return all ROUGE scores, rounded to 4 decimals
  # (ROUGE-L is the most interpretable for business discussion)
  return {k: round(v, 4) for k, v in result.items()}
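
For intuition about what the scores measure, ROUGE-1 is based on unigram overlap and ROUGE-L on the longest common subsequence between prediction and reference. The sketch below is a simplified pure-Python version that assumes lowercased whitespace tokenization; the `rouge` metric loaded above applies its own tokenization, so its numbers can differ. The helper names and example sentences are hypothetical:

```python
from collections import Counter

def _f1(overlap, pred_len, ref_len):
    """F1 from an overlap count and the two sequence lengths."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / pred_len, overlap / ref_len
    return 2 * precision * recall / (precision + recall)

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 (simplified ROUGE-1)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    return _f1(overlap, len(pred), len(ref))

def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rougeL_f1(prediction, reference):
    """LCS-based F1 (simplified ROUGE-L)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    return _f1(lcs_length(pred, ref), len(pred), len(ref))

pred = "tom will get a puppy"
ref = "tom wants a puppy for his son"
print(rouge1_f1(pred, ref), rougeL_f1(pred, ref))
```

In this example three words ("tom", "a", "puppy") are shared and appear in the same order, so both scores come out at about 0.5.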

3.2: Data Collator (batch handling)

In [64]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(
    tokenizer=dec_tok,
    model=model,
)

3.3: Training Configuration

In [66]:
import torch
model.gradient_checkpointing_enable()
model.config.use_cache = False
torch.cuda.empty_cache()

In [67]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    eval_strategy="steps",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    num_train_epochs=2,
    fp16=True,
    logging_steps=100,
    eval_steps=1000,
    save_steps=1000,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="rougeL",
    greater_is_better=True,
    predict_with_generate=True,
    report_to="none",
)
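
These settings determine how many optimizer steps the run takes, which in turn decides whether `eval_steps=1000` and `save_steps=1000` are ever reached. A rough sketch of the arithmetic, assuming a single GPU and the 2,000-example training subset; the helper `total_optimizer_steps` is hypothetical, and the handling of a partial final accumulation step may differ between library versions:

```python
import math

def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Estimate the number of optimizer updates for a training run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    # One optimizer update happens every `grad_accum` batches.
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs

steps = total_optimizer_steps(num_examples=2000, per_device_batch=2, grad_accum=8, epochs=2)
print(steps)          # effective batch size is 2 * 8 = 16
print(steps >= 1000)  # whether an eval/save step at 1000 would be reached
```

Under these assumptions the run has 250 optimizer steps, matching the `global_step=250` reported by `trainer.train()`. Since that is below 1000, no in-training evaluation or checkpoint step would be reached, so lowering `eval_steps` and `save_steps` (for example to 50) may be worth considering if `load_best_model_at_end` is meant to take effect.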

3.4: Trainer Setup - bringing together model, data, optimizer, loss, metrics, checkpointing

In [68]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_tok,
    eval_dataset=val_tok,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [69]:
model.generation_config.num_beams = 4
model.generation_config.early_stopping = True
model.generation_config.max_new_tokens = 64


3.5: Training the Model

In [70]:
trainer.train()



Step,Training Loss,Validation Loss



TrainOutput(global_step=250, training_loss=8.778486938476563, metrics={'train_runtime': 469.7623, 'train_samples_per_second': 8.515, 'train_steps_per_second': 0.532, 'total_flos': 2446165278720000.0, 'train_loss': 8.778486938476563, 'epoch': 2.0})

In [71]:
model = trainer.model
model.eval()

EncoderDecoderModel(
  (encoder): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elemen

In [72]:
print(model is trainer.model)

True


3.6a: Free Memory

In [73]:
import torch, gc
gc.collect()
torch.cuda.empty_cache()

3.6b: Evaluation

In [74]:
trainer.evaluate()



{'eval_loss': 1.5168776512145996,
 'eval_rouge1': 0.16,
 'eval_rouge2': 0.0245,
 'eval_rougeL': 0.1323,
 'eval_rougeLsum': 0.1322,
 'eval_runtime': 111.7404,
 'eval_samples_per_second': 2.685,
 'eval_steps_per_second': 1.342,
 'epoch': 2.0}

3.7: Showing qualitative before/after examples

In [75]:
def demo_samples(n=5, split="validation"):
  idxs = random.sample(range(len(dataset[split])), n)
  for i, idx in enumerate(idxs, 1):
    ex = dataset[split][idx]
    pred = generate_summary(ex["dialogue"], num_beams=1, max_new_tokens=64)

    print("="*80)
    print(f"Example {i} (idx={idx})")
    print("\nDIALOGUE:\n", ex["dialogue"])
    print("\nMODEL SUMMARY:\n", pred)
    print("\nREFERENCE SUMMARY:\n", ex["summary"])

demo_samples(n=5)


The following generation flags are not valid and may be ignored: ['early_stopping']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Example 1 (idx=654)

DIALOGUE:
 George: Fun fact time XD
George: IQ decreases by 20% after a 2-week holiday
Pete: lol 
Matt: haha wonder what happens after a gap year xD
Pete: <file_gif>

MODEL SUMMARY:
 Mason and Mason are going to the gym tomorrow. 

REFERENCE SUMMARY:
 IQ decreases by 20% after a 2-week holiday.
Example 2 (idx=114)

DIALOGUE:
 Kyle: Who wants to go out for a drink?
Megan: No, sorry, I'm cleaning the house today.
Roseanne: You've always loved cleaning, haven't you? I remember how angry you used to get with your brother for leaving a mess in the kitchen.
Vince: Yeah, she'd always yell at me, even though I was the one in charge when our parents were away.
Kyle: I don't get why it matters so much whether I clean my flat once a week or once a month. No one died from a bit of dust.
Megan: Remind me to never stay at your place :P
Roseanne: I'm somewhere in between. My house is always a mess, but I hate it when it's dirty.
Vince: What's the difference?
Roseanne: I can't sta

3.8: ROUGE sanity check

In [77]:
from evaluate import load
rouge_eval = load("rouge")

preds, refs = [], []

for i in range(20):
    ex = dataset["validation"][i]
    preds.append(
        generate_summary(
            ex["dialogue"],
            num_beams=1,
            max_new_tokens=64
        )
    )
    refs.append(ex["summary"])

rouge_eval.compute(predictions=preds, references=refs)


{'rouge1': np.float64(0.17924219233810146),
 'rouge2': np.float64(0.030814014273481648),
 'rougeL': np.float64(0.1269713248636462),
 'rougeLsum': np.float64(0.12701810342057013)}