Andrew Marasco \
*Automated Dialogue Summarization for Messaging Platform* \
Flatiron School Capstone Project #2 \
January, 2026

NOTE: This notebook was developed, and the model was trained, on a Google Colab GPU runtime so that the results are easier to reproduce.




Environment: Google Colab (GPU) \
Core model: BERT encoder + GPT-2 decoder (EncoderDecoderModel) \
Dataset subset size: train ~1,000–2,000 examples; validation ~200; test ~200 \
Max lengths: dialogue 256–512 tokens; summary ~64 tokens

## Step 1: Dataset Exploration and Preparation

1.1: Loading Dataset + Inspecting Structure

In [34]:
!pip -q install -U transformers datasets evaluate accelerate rouge_score sentencepiece

In [35]:
import datasets, huggingface_hub
print("datasets:", datasets.__version__)
print("huggingface_hub:", huggingface_hub.__version__)


datasets: 4.5.0
huggingface_hub: 1.3.4


In [36]:
import random
import numpy as np
import pandas as pd
import torch

from datasets import load_dataset
import evaluate

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [37]:
dataset = load_dataset("knkarthick/samsum")
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14731
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
})


In [38]:
print({k: len(v) for k, v in dataset.items()})
print("Columns:", dataset["train"].column_names)

{'train': 14731, 'validation': 818, 'test': 819}
Columns: ['id', 'dialogue', 'summary']


1.2: Inspecting a few examples

In [39]:
def show_example(split="train", idx=None):
    """Print one dialogue/summary pair; pick a random index if none is given."""
    # `random` is already imported (and seeded) at the top of the notebook
    if idx is None:
        idx = random.randint(0, len(dataset[split]) - 1)
    ex = dataset[split][idx]
    print(f"Split: {split} | Index: {idx}")
    print("\n--- DIALOGUE ---")
    print(ex["dialogue"])
    print("\n--- SUMMARY (target) ---")
    print(ex["summary"])
    return ex

_ = show_example("train")
_ = show_example("train")
_ = show_example("validation")


Split: train | Index: 10476

--- DIALOGUE ---
David: The new movie of Jonhy English has come out, have you seen it?
Patricia: No but I have been meaning to go tough. I heard it's hilarious.
David: Rowan Atkison is just awesome, love that guy! In Mr. Bean I would just laugh so hard ahaha
Patricia: Me too 😂 I couldn't watch some scenes sometimes cause they would make me nervous from all the constant crap he did ahhaha
David: ahahaa xD  Anyway.. wanna go to the 21:40 session today? I ain't got much going on so..
Patricia: Sure! Where are you having dinner?
David: Was thinking of just ordering a pizza, you have any ideas?
Patricia: There's a new Mexican place and they do take out's, want me to grab something and meet you at your place?
David: Oh that's what I'm talking about! Bring me 2 chicken burritos and nachoooos with guacamole.
Patricia: Anything else for the little boy? ahaha xD
David: While you're at it a coke would do 😂
Patricia: Jesus.. x) Leaving my place now, cya in a bit.

1.3: Analyzing Characteristics (distribution of length)

In [40]:
def length_stats(split="train", n=2000):
    """Summarize word counts for the first n examples of a split."""
    n = min(n, len(dataset[split]))
    sample = dataset[split].select(range(n))
    # Lengths are whitespace-separated word counts, not tokenizer tokens
    df = pd.DataFrame({
        "dialogue_words": [len(x.split()) for x in sample["dialogue"]],
        "summary_words": [len(x.split()) for x in sample["summary"]],
    })
    return df.describe(percentiles=[.5, .8, .9, .95, .99])

stats = length_stats("train", n=2000)
stats

       dialogue_words  summary_words
count     2000.000000    2000.000000
mean        95.368500      20.521000
std         73.369355      11.365524
min          7.000000       1.000000
50%         75.000000      18.000000
80%        142.000000      30.000000
90%        194.000000      37.000000
95%        248.050000      44.000000
99%        344.060000      53.000000
max        471.000000      60.000000
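
The statistics above are measured in whitespace-separated words, while the length limits used later are measured in tokens. As a rough aid for choosing a cap, the sketch below computes what share of examples fits under a given length. The helper name `coverage_at` and the sample lengths are made up for illustration. Subword tokenizers usually produce more tokens than words, so real token-level coverage would likely be somewhat lower than a word-level estimate.

```python
def coverage_at(lengths, cap):
    """Return the fraction of items whose length is <= cap."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return sum(1 for n in lengths if n <= cap) / len(lengths)


# Hypothetical word counts for illustration, not taken from the dataset.
toy_lengths = [7, 40, 75, 142, 194, 248, 344, 471]

for cap in (128, 256, 512):
    print(cap, coverage_at(toy_lengths, cap))
```

In the notebook, the same function could be applied to the per-example word counts computed inside `length_stats`, or to token counts once the tokenizers are loaded.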


1.4: Creating Training/Validation Splits for Project
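Note: The dataset is already split into train, validation, and test sets. This step is included because the project instructions call for it; here it draws smaller shuffled subsets of the existing train and validation splits.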

In [41]:
SEED = 42
TRAIN_SIZE = 2000
VAL_SIZE = 300

train_ds = dataset["train"].shuffle(seed=SEED).select(range(TRAIN_SIZE))
val_ds   = dataset["validation"].shuffle(seed=SEED).select(range(VAL_SIZE))

print(len(train_ds), len(val_ds))

2000 300


1.5: Tokenization Setup Using BERT Encoder and GPT-2 Decoder

In [42]:
from transformers import AutoTokenizer

encoder_name = "bert-base-uncased"
decoder_name = "gpt2"

enc_tok = AutoTokenizer.from_pretrained(encoder_name)
dec_tok = AutoTokenizer.from_pretrained(decoder_name)

# GPT-2 has no pad token by default, so reuse its EOS token for padding
dec_tok.pad_token = dec_tok.eos_token

MAX_INPUT_LEN = 512
MAX_TARGET_LEN = 64

In [43]:
def preprocess_batch(batch):
  # encode dialogue (encoder input)
  enc = enc_tok(
      batch["dialogue"],
      truncation=True,
      padding="max_length",
      max_length=MAX_INPUT_LEN,
  )

  # encoding summary (decoder labels)
  dec = dec_tok(
      batch["summary"],
      truncation=True,
      padding="max_length",
      max_length=MAX_TARGET_LEN,
  )

  enc["labels"] = dec["input_ids"]
  return enc
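
One common refinement, not applied in the cell above, is to replace padded label positions with -100, the index that PyTorch's cross-entropy loss ignores by default, so that padding does not contribute to the training loss. Because the GPT-2 pad token here is the EOS token, masking by token id would also hide any genuine EOS, so the sketch below masks by attention mask instead. It works on plain Python lists; `mask_padded_labels` is a hypothetical helper and the token ids are toy values.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padded_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Replace label ids at padded positions (mask == 0) with ignore_index."""
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Toy batch: 50256 is GPT-2's EOS id, which this notebook reuses as the pad id.
toy_ids = [[11, 22, 50256, 50256, 50256], [33, 44, 55, 50256, 50256]]
toy_mask = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 0]]

masked = mask_padded_labels(toy_ids, toy_mask)
print(masked)
```

Inside `preprocess_batch`, this would correspond to passing `dec["input_ids"]` and `dec["attention_mask"]` through the helper before assigning `enc["labels"]`. Note that the GPT-2 tokenizer does not append an EOS token on its own, so with padding masked out the summaries would likely also need an explicit EOS added for the model to see an end-of-sequence target.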

In [44]:
train_tok = train_ds.map(preprocess_batch, batched=True, remove_columns=train_ds.column_names)
val_tok = val_ds.map(preprocess_batch, batched=True, remove_columns=val_ds.column_names)

train_tok.set_format(type="torch")
val_tok.set_format(type="torch")

train_tok[0]

{'input_ids': tensor([  101,  4205,  1024,  1026,  5371,  1035,  2678,  1028,  4205,  1024,
          2054,  2079,  2017,  2228, 10590,  1024,  2507,  2033,  1037, 10819,
         10590,  1024,  7929,  3666,  4205,  1024,  2292,  2033,  2113, 10590,
          1024,  2064,  1005,  1056,  2428,  2963,  1037,  2843,  2045,  1025,
          1013,  4205,  1024,  3398,  1025,  1013,  4205,  1024,  1045,  2228,
          1045,  2342,  2000,  2501,  2009,  5064,  2842,  4205,  1024,  2672,
          2083,  1996,  8278,  1998,  4007, 10590,  1024,  2008,  5791,  2003,
          1037,  2307,  2801,   999, 10590,  1024,  1045,  3984,  2008,  1005,
          1055,  2339,  1045,  2435,  2017,  1996,  8278,  1998,  5361,  2009,
          1024,  1040,  4205,  1024,  3398,  1060,  2094,  4205,  1024,  7929,
          1045,  1005,  2222,  3046,  2000,  3275,  2009,  2041,  2101, 10590,
          1024,  7929, 10590,  1024,  1045,  1005,  2222,  2022,  3403,  1024,
          1052,   102,     0,     0,   

1.6: Building DataLoaders for efficient model training

In [45]:
from torch.utils.data import DataLoader

BATCH_SIZE = 8

train_loader = DataLoader(train_tok, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_tok, batch_size=BATCH_SIZE, shuffle=False)

batch = next(iter(train_loader))
{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([8, 512]),
 'token_type_ids': torch.Size([8, 512]),
 'attention_mask': torch.Size([8, 512]),
 'labels': torch.Size([8, 64])}
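
Every batch above has 512 input positions because the dialogues were padded to `MAX_INPUT_LEN`, even though the median dialogue in the sample is 75 words. An alternative, not used in this notebook, is to pad each batch only to its longest sequence. The sketch below shows the idea on plain lists; `pad_batch` is a hypothetical helper, and the default pad id of 0 is assumed to match the BERT tokenizer's padding id seen in the tokenized output earlier.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest length in the batch.

    Returns (padded_ids, attention_mask) as lists of lists.
    """
    longest = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


# Toy sequences; 101 and 102 are BERT's [CLS] and [SEP] ids.
toy_batch = [[101, 7, 8, 102], [101, 9, 102]]
ids, mask = pad_batch(toy_batch)
print(ids)
print(mask)
```

With a `DataLoader`, a function like this would be wired in through the `collate_fn` argument after tokenizing without `padding="max_length"`. Hugging Face's `DataCollatorWithPadding` implements the same idea for tokenizer outputs.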

## Step 2: Model Architecture Implementation

2.1: Creating the Encoder-Decoder Model (BERT -> GPT2)

In [46]:
from transformers import EncoderDecoderModel

encoder_name = "bert-base-uncased"
decoder_name = "gpt2"

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    encoder_name,
    decoder_name
)

BertModel LOAD REPORT from: bert-base-uncased
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
cls.predictions.transform.LayerNorm.weight | UNEXPECTED |  | 
cls.predictions.bias                       | UNEXPECTED |  | 
cls.predictions.transform.dense.weight     | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED |  | 
cls.seq_relationship.weight                | UNEXPECTED |  | 
cls.predictions.transform.dense.bias       | UNEXPECTED |  | 
cls.seq_relationship.bias                  | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


GPT2LMHeadModel LOAD REPORT from: gpt2
Key                                                 | Status     | 
----------------------------------------------------+------------+-
h.{0...11}.attn.bias                                | UNEXPECTED | 
transformer.h.{0...11}.crossattention.c_proj.bias   | MISSING    | 
transformer.h.{0...11}.crossattention.c_attn.bias   | MISSING    | 
transformer.h.{0...11}.ln_cross_attn.bias           | MISSING    | 
transformer.h.{0...11}.crossattention.c_proj.weight | MISSING    | 
transformer.h.{0...11}.crossattention.c_attn.weight | MISSING    | 
transformer.h.{0...11}.crossattention.q_attn.bias   | MISSING    | 
transformer.h.{0...11}.crossattention.q_attn.weight | MISSING    | 
transformer.h.{0...11}.ln_cross_attn.weight         | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider

2.2: Configuring Special Tokens (EOS/PAD/start tokens)

In [None]:
# decoder (GPT2) special token IDs
model.config.eos_token_id = dec_tok.eos_token_id
model.config.pad_token_id = dec_tok.pad_token_id

# Start token for decoding:
model.config.decoder_start_token_id = dec_tok.bos_token_id or dec_tok.eos_token_id
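
The start token matters because, when only `labels` are passed during training, `EncoderDecoderModel` builds the decoder inputs by shifting the labels one position to the right and placing `decoder_start_token_id` in front. The sketch below illustrates that idea on plain lists. `shift_right` is a hypothetical stand-in written for this example, not the library function, and the label ids are toy values.

```python
def shift_right(labels, start_id, pad_id, ignore_index=-100):
    """Build decoder inputs: prepend start_id, drop the last label,
    and swap any ignore_index placeholders for pad_id."""
    shifted = []
    for row in labels:
        new_row = [start_id] + list(row[:-1])
        shifted.append([pad_id if t == ignore_index else t for t in new_row])
    return shifted


GPT2_EOS = 50256  # GPT-2 uses the same id for its BOS and EOS tokens
toy_labels = [[11, 22, 33, -100]]
decoder_inputs = shift_right(toy_labels, start_id=GPT2_EOS, pad_id=GPT2_EOS)
print(decoder_inputs)
```

Since GPT-2's BOS and EOS tokens share one id, the `or` fallback in the cell above resolves to the same value either way.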

In [33]:
!ls

sample_data
