Andrew Marasco \
*Automated Dialogue Summarization for Messaging Platform* \
BART MVP \
Flatiron School Capstone Project #2 \
January, 2026

## Step 1.0
Installing & Restarting Runtime

In [10]:
!pip -q install -U transformers datasets evaluate accelerate sentencepiece rouge_score


## Step 1.1

Imports + Global Settings for Model

In [11]:
import numpy as np
from datasets import load_dataset
import evaluate

from transformers import(
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

MODEL_NAME = "facebook/bart-base"
MAX_SOURCE_LEN = 512
MAX_TARGET_LEN = 64
SEED = 42

In [12]:
import transformers
print(transformers.__version__)

5.1.0


In [13]:
!pip -q install -U transformers accelerate

## Step 1.2

Loading SAMSum Dataset, Confirming Splits

In [14]:
ds = load_dataset("knkarthick/samsum")
print(ds)
print("Train:", len(ds["train"]), "Val:", len(ds["validation"]), "Test:", len(ds["test"]))

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14731
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
})
Train: 14731 Val: 818 Test: 819


## Step 1.3

Checkpoint: Token-Length Distribution (500-dialogue sample)

In [15]:
from transformers import AutoTokenizer
tok_tmp = AutoTokenizer.from_pretrained(MODEL_NAME)

def token_len(text):
    return len(tok_tmp(text, truncation=False, add_special_tokens=True)["input_ids"])

sample = ds["train"].shuffle(seed=SEED).select(range(500))
dialog_lens = [token_len(x["dialogue"]) for x in sample]
sum_lens = [token_len(x["summary"]) for x in sample]

print("Dialogue token lengths (500 sample):",
      "median =", int(np.median(dialog_lens)),
      "p95 =", int(np.percentile(dialog_lens, 95)),
      "max =", int(np.max(dialog_lens)))

print("Summary token lengths (500 sample):",
      "median =", int(np.median(sum_lens)),
      "p95 =", int(np.percentile(sum_lens, 95)),
      "max =", int(np.max(sum_lens)))


Dialogue token lengths (500 sample): median = 120 p95 = 367 max = 809
Summary token lengths (500 sample): median = 24 p95 = 57 max = 78
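
The percentiles above suggest that `MAX_SOURCE_LEN = 512` and `MAX_TARGET_LEN = 64` cover most of the sample but not all of it (the longest dialogue is 809 tokens and the longest summary is 78). Below is a minimal sketch of how the truncated share could be measured. `share_over_limit` is a hypothetical helper, and the toy lengths are made up for illustration rather than taken from SAMSum:

```python
def share_over_limit(lengths, limit):
    """Return the fraction of sequences longer than `limit` tokens."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n > limit) / len(lengths)

# Toy token lengths (illustrative only); in the notebook these would be
# dialog_lens and sum_lens from the cell above.
toy_dialog_lens = [120, 95, 367, 809, 40, 530, 210, 64, 150, 300]
toy_sum_lens = [24, 57, 78, 12, 30, 20, 65, 18, 22, 40]

print("dialogues over 512:", share_over_limit(toy_dialog_lens, 512))
print("summaries over 64:", share_over_limit(toy_sum_lens, 64))
```

Running the same helper on the real `dialog_lens` and `sum_lens` would show how many examples the chosen limits actually cut.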


## Step 2.0

Loading Tokenizer and Model

In [16]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

## Step 2.1

Preprocessing, Tokenizing (static padding + '-100' label masking)

In [17]:
def preprocess(batch):
  inputs = batch["dialogue"]
  targets = batch["summary"]

  model_inputs = tokenizer(
      inputs,
      max_length=MAX_SOURCE_LEN,
      truncation=True,
      padding="max_length",
  )

  labels = tokenizer(
      text_target=targets,
      max_length=MAX_TARGET_LEN,
      truncation=True,
      padding="max_length",
  )["input_ids"]

  labels = [
      [(tok if tok != tokenizer.pad_token_id else -100) for tok in seq]
      for seq in labels
  ]
  model_inputs["labels"] = labels
  return model_inputs

tokenized = ds.map(preprocess, batched=True, remove_columns=ds["train"].column_names)
tokenized

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 14731
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 818
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 819
    })
})
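
With `padding="max_length"`, every dialogue is stored as 512 tokens regardless of its real length. The arithmetic below is a rough sketch of that overhead, using the median and p95 lengths reported in Step 1.3. `pad_fraction` is a hypothetical helper. Whether dynamic padding (dropping `padding="max_length"` here and letting `DataCollatorForSeq2Seq` pad each batch) would speed up training in this setup is an assumption that was not tested.

```python
MAX_SOURCE_LEN = 512  # same value as in Step 1.1

def pad_fraction(real_len, max_len=MAX_SOURCE_LEN):
    """Fraction of a max_length-padded sequence that is padding."""
    real_len = min(real_len, max_len)  # longer inputs are truncated, not padded
    return 1 - real_len / max_len

print(f"median dialogue (120 tokens): {pad_fraction(120):.1%} padding")
print(f"p95 dialogue (367 tokens):    {pad_fraction(367):.1%} padding")
```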

## Step 2.2

Confirming -100 Masking

In [18]:
i = 0
print("Example dialogue:\n", ds["train"][i]["dialogue"][:300], "...\n")
print("Example summary:\n", ds["train"][i]["summary"], "\n")

print("Label ids (first 30):", tokenized["train"][i]["labels"][:30])
print("Count of -100 in labels:", sum(x == -100 for x in tokenized["train"][i]["labels"]))


Example dialogue:
 Amanda: I baked  cookies. Do you want some?
Jerry: Sure!
Amanda: I'll bring you tomorrow :-) ...

Example summary:
 Amanda baked cookies and will bring Jerry some tomorrow. 

Label ids (first 30): [0, 10127, 5219, 17241, 15269, 8, 40, 836, 6509, 103, 3859, 4, 2, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]
Count of -100 in labels: 51
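
The count checks out: the summary tokenizes to 13 ids (including the start and end tokens `0` and `2`), and the remaining 64 - 13 = 51 padded positions are masked. The sketch below shows, in plain Python, why the mask matters: positions labelled -100 are skipped when the token-level loss is averaged. `masked_mean_nll` is a hypothetical stand-in for a cross-entropy loss that ignores index -100, and the per-token losses are made up.

```python
IGNORE_INDEX = -100  # label value skipped by the loss

def mask_padding(label_ids, pad_token_id):
    """Replace pad ids with IGNORE_INDEX, as in preprocess()."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in label_ids]

def masked_mean_nll(token_losses, labels):
    """Average per-token losses over positions whose label is not IGNORE_INDEX."""
    kept = [loss for loss, lab in zip(token_losses, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)

# Toy example: BART's pad id is 1; three real tokens followed by two pads.
labels = mask_padding([0, 8, 2, 1, 1], pad_token_id=1)
print(labels)  # [0, 8, 2, -100, -100]

# Made-up per-token losses; the last two (padding) positions are ignored.
print(masked_mean_nll([2.0, 1.0, 3.0, 9.0, 9.0], labels))  # 2.0
```

Without the mask, the two padding positions would pull the average up to 4.8 in this toy case.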


## Step 3.0

Creating ROUGE Metric + compute_metrics

In [19]:
rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
  preds, labels = eval_pred
  labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

  pred_str = tokenizer.batch_decode(preds, skip_special_tokens=True)
  label_str = tokenizer.batch_decode(labels, skip_special_tokens=True)

  result = rouge.compute(
      predictions=pred_str,
      references=label_str,
      use_stemmer=False,
  )
  return {k: round(v, 4) for k, v in result.items()}

Downloading builder script: 0.00B [00:00, ?B/s]
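
To make the metric concrete, here is a simplified ROUGE-1 F1 computed by hand. It is a sketch, not the scorer that `evaluate.load("rouge")` wraps: it lowercases and splits on whitespace only, whereas the real scorer applies its own tokenization, so scores can differ. `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two strings (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

pred = "amanda will bring cookies tomorrow"
ref = "amanda baked cookies and will bring jerry some tomorrow"
print(round(rouge1_f1(pred, ref), 4))  # 0.7143
```

Here all 5 predicted words occur in the 9-word reference, so precision is 1.0, recall is 5/9, and F1 is 10/14.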

## Step 3.1

Creating Training Arguments for Model

In [20]:
args = Seq2SeqTrainingArguments(
    output_dir="./bart_samsum_mvp",
    seed=SEED,

    eval_strategy="steps",
    eval_steps=500,
    logging_steps=100,
    save_steps=500,
    save_total_limit=2,

    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,
    fp16=True,

    learning_rate=5e-5,
    num_train_epochs=1,
    predict_with_generate=True,

    # Generation settings for evaluation
    generation_max_length=64,
    generation_num_beams=4,
)
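
These settings imply an effective batch size of 2 × 8 = 16 dialogues per optimizer step. The sketch below works through the arithmetic, assuming a single device (an assumption; the notebook does not state the hardware). `optimizer_steps_per_epoch` is a hypothetical helper, not a `transformers` function.

```python
import math

def optimizer_steps_per_epoch(num_examples, per_device_batch, grad_accum, num_devices=1):
    """Number of optimizer updates in one pass over the training set."""
    batches = math.ceil(num_examples / (per_device_batch * num_devices))
    return math.ceil(batches / grad_accum)

effective_batch = 2 * 8  # per_device_train_batch_size * gradient_accumulation_steps
steps = optimizer_steps_per_epoch(14731, per_device_batch=2, grad_accum=8)
print(effective_batch, steps)  # 16 921
```

The result matches the `global_step=921` reported by `trainer.train()` in Step 3.4. With `eval_steps=500`, it also means only one in-training evaluation (at step 500) happens during the single epoch.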

## Step 3.1b

Setting Generation Defaults on the Model

In [21]:
model.generation_config.no_repeat_ngram_size = 3
model.generation_config.repetition_penalty = 1.2
model.generation_config.max_length = 64
model.generation_config.num_beams = 4
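
`no_repeat_ngram_size = 3` prevents any 3-token sequence from appearing twice in a generated summary. The following is a plain-Python sketch of the idea, not the `transformers` logits processor itself. `banned_next_tokens` is a hypothetical helper that operates on a list of token ids.

```python
def banned_next_tokens(generated, ngram_size=3):
    """Return tokens that would complete an n-gram already present in `generated`."""
    if len(generated) < ngram_size:
        return set()
    # The last (n-1) tokens form the prefix of the n-gram being built.
    prefix = tuple(generated[-(ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

# Ids 5, 6, 7 already occurred in that order; after "... 5, 6" the id 7 is blocked.
print(banned_next_tokens([5, 6, 7, 9, 5, 6]))  # {7}
print(banned_next_tokens([5, 6, 7, 9]))        # set()
```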

## Step 3.2

Creating Data Collator and Trainer

In [22]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

## Step 3.3

Baseline Evaluation before Training

In [23]:
trainer.evaluate()

{'eval_loss': 4.1244401931762695,
 'eval_model_preparation_time': 0.003,
 'eval_rouge1': 0.2959,
 'eval_rouge2': 0.0935,
 'eval_rougeL': 0.226,
 'eval_rougeLsum': 0.2258,
 'eval_runtime': 368.125,
 'eval_samples_per_second': 2.222,
 'eval_steps_per_second': 1.111}

## Step 3.4

Training the Model

In [24]:
trainer.train()

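| Step | Training Loss | Validation Loss | Model Preparation Time | Rouge1 | Rouge2 | Rougel | Rougelsum |
|------|---------------|-----------------|------------------------|--------|--------|--------|-----------|
| 500 | 14.475944 | 1.559928 | 0.003 | 0.4717 | 0.2372 | 0.395 | 0.3944 |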


TrainOutput(global_step=921, training_loss=14.735736030967953, metrics={'train_runtime': 793.5814, 'train_samples_per_second': 18.563, 'train_steps_per_second': 1.161, 'total_flos': 4491013883166720.0, 'train_loss': 14.735736030967953, 'epoch': 1.0})

## Step 3.4a - Evaluating after Training

In [25]:
trainer.evaluate()

{'eval_loss': 1.5051378011703491,
 'eval_model_preparation_time': 0.003,
 'eval_rouge1': 0.4867,
 'eval_rouge2': 0.2493,
 'eval_rougeL': 0.4096,
 'eval_rougeLsum': 0.4091,
 'eval_runtime': 185.3168,
 'eval_samples_per_second': 4.414,
 'eval_steps_per_second': 2.207,
 'epoch': 1.0}
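
One epoch of fine-tuning moves every ROUGE score well above the untrained baseline from Step 3.3. The sketch below simply computes the absolute gains from the two result dictionaries (values copied from the notebook output above).

```python
baseline = {"rouge1": 0.2959, "rouge2": 0.0935, "rougeL": 0.226, "rougeLsum": 0.2258}
finetuned = {"rouge1": 0.4867, "rouge2": 0.2493, "rougeL": 0.4096, "rougeLsum": 0.4091}

# Absolute improvement per metric, rounded to 4 decimals like the reported scores.
gains = {name: round(finetuned[name] - baseline[name], 4) for name in baseline}
for name, gain in gains.items():
    print(f"{name}: +{gain:.4f}")
```

This gives roughly +0.19 ROUGE-1, +0.16 ROUGE-2, and +0.18 ROUGE-L on the validation split.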

## Step 3.4b - Checking for Label Bugs

In [26]:
i = 0
labels = tokenized["train"][i]["labels"]
print("First 40 label ids:", labels[:40])
print("Num -100:", sum(x == -100 for x in labels))
print("Num non--100:", sum(x != -100 for x in labels))


First 40 label ids: [0, 10127, 5219, 17241, 15269, 8, 40, 836, 6509, 103, 3859, 4, 2, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]
Num -100: 51
Num non--100: 13


## Step 3.4c - Confirming Decoder Labels Decode to the Actual Summary

In [27]:
i = 0
label_ids = [x if x != -100 else tokenizer.pad_token_id for x in tokenized["train"][i]["labels"]]
decoded = tokenizer.decode(label_ids, skip_special_tokens=True)

print("REFERENCE SUMMARY:\n", ds["train"][i]["summary"])
print("\nDECODED LABELS:\n", decoded)


REFERENCE SUMMARY:
 Amanda baked cookies and will bring Jerry some tomorrow.

DECODED LABELS:
 Amanda baked cookies and will bring Jerry some tomorrow.


## Step 3.5

Evaluation Using ROUGE

In [28]:
metrics = trainer.evaluate()
metrics

{'eval_loss': 1.5051378011703491,
 'eval_model_preparation_time': 0.003,
 'eval_rouge1': 0.4867,
 'eval_rouge2': 0.2493,
 'eval_rougeL': 0.4096,
 'eval_rougeLsum': 0.4091,
 'eval_runtime': 189.5798,
 'eval_samples_per_second': 4.315,
 'eval_steps_per_second': 2.157,
 'epoch': 1.0}

## Step 3.6

Saving Model (for faster reloads for demos)

In [29]:
trainer.save_model("./bart_samsum_mvp_model")
tokenizer.save_pretrained("./bart_samsum_mvp_model")

('./bart_samsum_mvp_model/tokenizer_config.json',
 './bart_samsum_mvp_model/tokenizer.json')

## Step 4.0

Evaluating on the test split

In [30]:
test_metrics = trainer.evaluate(eval_dataset=tokenized["test"])
test_metrics

{'eval_loss': 1.5486162900924683,
 'eval_model_preparation_time': 0.003,
 'eval_rouge1': 0.4712,
 'eval_rouge2': 0.232,
 'eval_rougeL': 0.3948,
 'eval_rougeLsum': 0.3953,
 'eval_runtime': 180.1241,
 'eval_samples_per_second': 4.547,
 'eval_steps_per_second': 2.276,
 'epoch': 1.0}

## Step 4.1

Building small demo function

In [31]:
import torch

def summarize_dialogue(dialogue, num_beams=4):
  inputs = tokenizer(
      dialogue,
      return_tensors="pt",
      truncation=True,
      max_length=MAX_SOURCE_LEN,
      padding=True,
  )
  inputs = {k: v.to(model.device) for k, v in inputs.items()}

  with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=64,
        num_beams=num_beams,
        no_repeat_ngram_size=3,
        repetition_penalty=1.2,
        early_stopping=True,
    )

  return tokenizer.decode(output_ids[0], skip_special_tokens=True)
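
`repetition_penalty=1.2` discourages tokens that have already been generated. As a rough sketch of the rule (written in plain Python on a toy score list, following the behavior I understand `transformers` to document for this setting): scores of previously seen tokens are divided by the penalty when positive and multiplied by it when negative. `apply_repetition_penalty` is a hypothetical helper.

```python
def apply_repetition_penalty(scores, seen_token_ids, penalty=1.2):
    """Lower the scores of already-seen tokens; other scores are unchanged."""
    adjusted = list(scores)
    for tok in set(seen_token_ids):
        if adjusted[tok] > 0:
            adjusted[tok] = adjusted[tok] / penalty
        else:
            adjusted[tok] = adjusted[tok] * penalty
    return adjusted

# Toy scores for a 4-token vocabulary; tokens 0 and 2 were already generated.
result = apply_repetition_penalty([2.4, 1.0, -1.0, 0.5], seen_token_ids=[0, 2, 0])
print(result)
# token 0: 2.4 / 1.2, token 2: -1.0 * 1.2, tokens 1 and 3 untouched
```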

## Step 4.2

Generating 10 Demo Examples

In [32]:
demo_idxs = [0, 3, 7, 25, 50, 89, 122, 333, 555, 667]

for i in demo_idxs:
  dialogue = ds["test"][i]["dialogue"]
  ref = ds["test"][i]["summary"]
  pred = summarize_dialogue(dialogue)

  print("="*100)
  print(f"TEST IDX: {i}\n")
  print("DIALOGUE: \n", dialogue)
  print("\nREFERENCE SUMMARY:\n", ref)
  print("\nMODEL SUMMARY:\n", pred)

TEST IDX: 0

DIALOGUE: 
 Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye

REFERENCE SUMMARY:
 Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.

MODEL SUMMARY:
 Hannah can't find Betty's number. She called Larry last time they were at the park together.
TEST IDX: 3

DIALOGUE: 
 Will: hey babe, what do you want for dinner tonight?
Emma:  gah, don't even worry about it tonight
Will: what do you mean? everything ok?
Emma: not really, but it's ok, don't worry about cooking though, I'm not hungry
Will: Well what time will you be home?
Emma: soon, hopefully
Will: you sure? Maybe you want me to pick you up

## Step 4.3

Qualitative Review of Demo Outputs (by test index)

Great - 7, 89, 333 \
Decent - 555 \
Failure - 25

## Step 4.4

Improving Coherence for Demos

In [34]:
def summarize_dialogue(dialogue, num_beams=4):
  inputs = tokenizer(
      dialogue,
      return_tensors="pt",
      truncation=True,
      max_length=MAX_SOURCE_LEN,
      padding=True,
  )
  inputs = {k: v.to(model.device) for k, v in inputs.items()}

  with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=64,
        num_beams=num_beams,
        length_penalty=1.0,
        min_new_tokens=10,
        no_repeat_ngram_size=3,
        repetition_penalty=1.1,
        early_stopping=True,
    )

  return tokenizer.decode(output_ids[0], skip_special_tokens=True)
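
The revised function adds `length_penalty` and `min_new_tokens` and lowers `repetition_penalty` to 1.1. To my understanding, beam search in `transformers` ranks finished hypotheses by their summed log-probability divided by `length ** length_penalty`, so `length_penalty=1.0` compares beams by average log-probability per token. A small sketch with made-up log-probabilities (`beam_score` is a hypothetical helper):

```python
def beam_score(sum_logprobs, length, length_penalty=1.0):
    """Length-normalized beam score: higher is better."""
    return sum_logprobs / (length ** length_penalty)

short = {"sum_logprobs": -4.0, "length": 5}   # made-up values
long = {"sum_logprobs": -6.0, "length": 12}   # made-up values

for penalty in (0.0, 1.0, 2.0):
    s = beam_score(short["sum_logprobs"], short["length"], penalty)
    l = beam_score(long["sum_logprobs"], long["length"], penalty)
    winner = "short" if s > l else "long"
    print(f"length_penalty={penalty}: short={s:.3f} long={l:.3f} -> {winner}")
```

With no normalization (0.0) the shorter beam wins in this toy case. At 1.0 and above the longer beam is preferred. `min_new_tokens=10` additionally keeps the model from ending a summary before ten tokens have been generated.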

## Step 4.5: Error Analysis

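Inputs are truncated at 512 tokens (`MAX_SOURCE_LEN`). In the 500-dialogue sample from Step 1.3, the 95th percentile was 367 tokens and the maximum was 809, so a small share of long dialogues lose their final turns, and summaries of those dialogues can miss key details.

Some dialogues depend on implied context or sarcasm. For these the model usually produces a plausible summary, but the phrasing is sometimes awkward. In the failure case below (IDX 25), the summary goes further and states something the dialogue never says: that Ollie and Kelly are going to Nagoro village.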

====== 3 GOOD DEMO SUMMARY OUTPUTS: ========================================= \

TEST IDX: 7 \

DIALOGUE: \
 Rita: I'm so bloody tired. Falling asleep at work. :-( \
Tina: I know what you mean. \
Tina: I keep on nodding off at my keyboard hoping that the boss doesn't notice \
Rita: The time just keeps on dragging on and on and on.... \
Rita: I keep on looking at the clock and there's still 4 hours of this \
drudgery to go. \
Tina: Times like these I really hate my work. \
Rita: I'm really not cut out for this level of boredom. \
Tina: Neither am I. \

REFERENCE SUMMARY: \
 Rita and Tina are bored at work and have still 4 hours left. \

MODEL SUMMARY: \
 Rita is tired at work. She keeps on looking at the clock and there's \
 still 4 hours of drudgery to go. \

Demo #1 (IDX 7) Assessment: \
What Went Well: The model captured the core context of the situation \
(work fatigue, boredom at work) and preserved a key detail \
("still 4 hours left"). \
Area for Improvement: The summary leaves out Tina, whom the reference names \
alongside Rita, but it is still a good summary.
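
A check for omitted speakers can be automated during error analysis. The sketch below extracts speaker names from the `Name: text` lines of a dialogue and reports which ones the summary never mentions. `missing_speakers` is a hypothetical helper, and simple substring matching is an assumption that would miss pronouns and nicknames. The dialogue is abridged from test example 7.

```python
def missing_speakers(dialogue, summary):
    """Return speaker names from the dialogue that do not occur in the summary."""
    speakers = []
    for line in dialogue.splitlines():
        name, sep, _ = line.partition(":")
        name = name.strip()
        # Keep only lines that contain a colon, and record each speaker once.
        if sep and name and name not in speakers:
            speakers.append(name)
    return [name for name in speakers if name not in summary]

dialogue = (
    "Rita: I'm so bloody tired. Falling asleep at work. :-(\n"
    "Tina: I know what you mean.\n"
    "Rita: I'm really not cut out for this level of boredom.\n"
    "Tina: Neither am I."
)
model_summary = ("Rita is tired at work. She keeps on looking at the clock "
                 "and there's still 4 hours of drudgery to go.")
print(missing_speakers(dialogue, model_summary))  # ['Tina']
```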


TEST IDX: 89 \

DIALOGUE:  \
 Tom: Ben. We've decided. 2pm in the Oval Room. \
Ben: Ok, I'll be there \
Tom: Take all your papers, it's going to be a fight! And remember: take \
no  prisoners, shoot to kill! \
Ben: hahaha, we have to win this battle. \
Tom: We will, the justice is on our side. \

REFERENCE SUMMARY: \
 Tom will meet Ben in the Oval Room at 2pm and tells him to bring the papers. \

MODEL SUMMARY: \
 Ben and Tom are meeting at 2 pm in the Oval Room. \

Demo #2 (IDX 89) Assessment: \
What Went Well: Correctly extracts meeting time and the meeting location \
(Oval Room, 2pm). \
Area for Improvement: The summary omits Tom's instruction to bring all the \
papers and his warning that "it's going to be a fight!", both of which \
could be considered important context. \

TEST IDX: 333 \

DIALOGUE: \
 Carmen: how are you feeling, Viola? it is so so close... \
Alfred: My dearest Viola <3 \
Viola: I think as one's feeling before the wedding - a little bit light \
in the stomach! ive got some things to organize still! \
Carmen: i will be on friday night, i could give you a helping hand :)) \
Viola: Thanks darling, i will let you know x \
Carmen: (Y) my number just in case +00123456789 \
Viola: (Y) <3 \

REFERENCE SUMMARY: \
 Viola is having her wedding soon and still has some things to organize. Carmen comes on Friday and is willing to help Viola.

MODEL SUMMARY: \
 Carmen will give Viola a helping hand on Friday night.


======= 1 "DECENT" DEMO SUMMARY OUTPUT: ===================================== \

TEST IDX: 555 \

DIALOGUE: \
 Lydia: <file_photo> \
Camila: Say whaaaaat? \
Lydia: Would you believe it? \
Camila: But how? \
Lydia: I added him on Facebook. He accepted my invitation. I viewed his \
profile aaaand... "engaged"! \
Camila: What a bastard... But I don't get it... He knew you would see it. \
Lydia: I don't know... Maybe it's a sort of open relationship? \
Camila: Does he still write to you? \
Lydia: Yes, but I'm trying to ignore him. \
Camila: Gosh... If this girl knew... \
Lydia: Yeah... I'm embarrassed now. \
Camila: You shoulnd't, you didn't know. It's not your fault. \
And nothing happened. You two just texted. \
Lydia: But how... It was so dirty. \
Camila: C'mon, how would you know. I can't believe he is still texting \
you, now... When you know everything... \
Lydia: That's weird, I know. Do you think I should talk to him? \
Camila: I wouldn't pull any punches. \
Lydia: But we work in the same company. I don't want it to be awkward. \
Camila: You're kidding? You should talk to him. It's not fair what \
he's doing and it cannot be like that anymore! Think about this girl! \
Lydia: Yeah, you're right. This scumbug will regret that he met me. \

REFERENCE SUMMARY: \
 Lydia has exchanged sexual messages with him. Lydia does not feel \
 like pursuing the affair because he is engaged. Lydia will have  \
 a word with him because what he is doing is unfair. \

MODEL SUMMARY: \
 Lydia added him on Facebook. He accepted her invitation. \
 Lydia is embarrassed now. Lydia doesn't want to talk to him anymore \

===== 1 "FAILURE" DEMO SUMMARY OUTPUT: ====================================== \

TEST IDX: 25 \

DIALOGUE:  \
 Ollie: Okay, Kelly! Ur up nxt! \
Kelly: Me? I don't wanna. \
Mickey: C'mon! \
Jessica: Yeah! What's yours? \
Kelly: Fine. It's a sculpture garden in Finnland. \
Ollie: What's scary about sculptures? Wait! Do they resemble  \
vampires and stuff? \
Mickey: Nah, I'm sure they look rly nice. \
Kelly: It's not the sculptures, it's the amount of them \
and their faces! \
Jessica: Faces? What faces? \
Kelly: Well, they resemble ppl in different activities like  \
hugging, training, doing sport and so on. But the faces are  \
just morbid and there's like a hundred of them. \
All staring you! \
Ollie: Another one? \
Mickey: Certainly! \
Jessica: Well, Ollie, ur turn! \
Ollie: Nagoro village in Japan! \
Mickey: Y? \
Ollie: Well, maybe it's not scary, but it similar to Kelly's \
place. It's just creepy as hell. \
Jessica: Bt y? \
Ollie: Imagine a village with ppl living in it. And in the \
same village u have these human-sized figures. And there's  \
more of them than the ppl that actually live there! \
Kelly: Creepy AH! \
Mickey: WTF?! Y would ppl even do that? \
Jessica: Idk. Idc. Never. Going. There. \
Ollie: See! Mine was the worst! \
Jessica: Bt not the scariest! \
Ollie: Point taken. \
Mickey: Listen, guys, fun talking to u, bt gotta go.  \
Kelly: Yeah, me too. Bye! \
Jessica: Bye! \
Ollie: Cu! \

REFERENCE SUMMARY: \
 Kelly is scared of sculpture garden figures in Finnland, \
 she finds figure's faces morbid. For Ollie it's Nagoro  \
 village in Japan, it's creepy. \

MODEL SUMMARY: \
 Ollie and Kelly are going to Nagoro village in Japan. \