In [None]:
# ER (Experience Replay): UNSW-NB15 Continual Learning Baseline

Goal: Implement Experience Replay (ER) for a class-incremental IDS scenario on UNSW-NB15.
Training proceeds over a sequence of tasks in which attack classes appear progressively, while Normal traffic is always present.


**1. Problem Setting**

This notebook simulates a class-incremental learning scenario for intrusion detection (IDS) using the UNSW-NB15 dataset. The goal is to study catastrophic forgetting when new attack classes are introduced over time.

Example:

Task 1 → Normal + Generic

Task 2 → Normal + Generic + Exploits

Normal traffic is always present, which matches how a real IDS evolves.

Goal: evaluate whether Experience Replay reduces forgetting when new attack
classes are introduced.

In [42]:
# ===== Imports =====
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F

from torch.utils.data import TensorDataset, DataLoader
from sklearn.preprocessing import OneHotEncoder, RobustScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, f1_score

import random

# ===== Device =====
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)


Using device: cpu


## 2. Dataset Loading
The dataset contains 45 columns, including:
- Flow-based numerical features (e.g., duration, packets, bytes)
- Categorical protocol-related features (proto, service, state)
- Two label columns:
    - `attack_cat` (multi-class attack category)
    - `label` (binary normal vs attack)

For our class-incremental learning setup, we use `attack_cat`
as the target label.

In [2]:
import pandas as pd


train = pd.read_csv("UNSW_NB15_training-set.csv")
test  = pd.read_csv("UNSW_NB15_testing-set.csv")

print("Train shape:", train.shape)
print("Test shape:", test.shape)

train.head()


Train shape: (175341, 45)
Test shape: (82332, 45)


Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,...,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,attack_cat,label
0,1,0.121478,tcp,-,FIN,6,4,258,172,74.08749,...,1,1,0,0,0,1,1,0,Normal,0
1,2,0.649902,tcp,-,FIN,14,38,734,42014,78.473372,...,1,2,0,0,0,1,6,0,Normal,0
2,3,1.623129,tcp,-,FIN,8,16,364,13186,14.170161,...,1,3,0,0,0,2,6,0,Normal,0
3,4,1.681642,tcp,ftp,FIN,12,12,628,770,13.677108,...,1,3,1,1,0,2,1,0,Normal,0
4,5,0.449454,tcp,-,FIN,10,6,534,268,33.373826,...,1,40,0,0,0,2,39,0,Normal,0


## 3. Class-Incremental Task Scenario Definition

To simulate a realistic evolving threat environment, we define
a class-incremental learning (CIL) scenario.

- Task 1: The model is trained on Normal + Generic traffic.
- Task 2: A new attack class (Exploits) is introduced,
  while previously seen classes remain present.

This setup mimics real-world IDS deployment, where new attack
types appear over time and the model must adapt without
forgetting previously learned patterns.
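
The two-task setup above can be generalised to longer task sequences. The following is a minimal sketch, assuming each new task adds exactly one attack class on top of everything seen so far; the helper name `make_cumulative_scenario` is hypothetical and not part of the notebook.

```python
def make_cumulative_scenario(normal, attack_order, n_tasks):
    """Return one class list per task: Normal plus the first k attack classes."""
    if not 1 <= n_tasks <= len(attack_order):
        raise ValueError("n_tasks must be between 1 and len(attack_order)")
    return [[normal] + attack_order[:k] for k in range(1, n_tasks + 1)]


# Illustrative attack order (a shortened version of the list used in this notebook).
attack_order = ["Generic", "Exploits", "Fuzzers"]
scenario = make_cumulative_scenario("Normal", attack_order, n_tasks=2)
print(scenario)
```

With `n_tasks=2` this reproduces the Task 1 and Task 2 class lists used in the rest of the notebook.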


In [3]:
# Task scenario (using the example described above)
NORMAL = "Normal"
attack_order = ["Generic", "Exploits", "Fuzzers", "DoS", "Reconnaissance", "Analysis", "Backdoor", "Shellcode", "Worms"]

# Example scenario: Task 1 -> Task 2
scenario_1plus1 = [
    [NORMAL, "Generic"],                 # Task 1 classes
    [NORMAL, "Generic", "Exploits"],     # Task 2 classes
]

# build_task_df filters the dataset to the classes available at a given task.

def build_task_df(df, classes):
    """Filter dataframe to only the given attack_cat classes."""
    return df[df["attack_cat"].isin(classes)].copy()

# Build task datasets
task1_train = build_task_df(train, scenario_1plus1[0])
task1_test  = build_task_df(test,  scenario_1plus1[0])

task2_train = build_task_df(train, scenario_1plus1[1])
task2_test  = build_task_df(test,  scenario_1plus1[1])

print("Task 1 train shape:", task1_train.shape)
print("Task 1 test shape: ", task1_test.shape)
print("Task 2 train shape:", task2_train.shape)
print("Task 2 test shape: ", task2_test.shape)

print("\nTask 1 class counts:\n", task1_train["attack_cat"].value_counts())
print("\nTask 2 class counts:\n", task2_train["attack_cat"].value_counts())


Task 1 train shape: (96000, 45)
Task 1 test shape:  (55871, 45)
Task 2 train shape: (129393, 45)
Task 2 test shape:  (67003, 45)

Task 1 class counts:
 attack_cat
Normal     56000
Generic    40000
Name: count, dtype: int64

Task 2 class counts:
 attack_cat
Normal      56000
Generic     40000
Exploits    33393
Name: count, dtype: int64


**4. Feature Selection**

Before training, we separate the input features from the labels and split the features into categorical and numerical groups.

We remove three columns:
- `id` → Identifier (not informative for learning)
- `attack_cat` → Multi-class label (used as the target)
- `label` → Binary label (not used in this multi-class CIL setup)

The model should only see feature values, not labels or identifiers. Removing these three columns from the original 45 leaves 42 raw features, which form the feature space used by the model.

Why drop the binary label?
Because we are solving a multi-class incremental problem using `attack_cat`, not binary classification.

In [4]:
# Removing three columns from the original 45 leaves 42 feature columns before encoding.
# Columns to drop
drop_cols = ["id", "attack_cat", "label"]

feature_cols = [c for c in train.columns if c not in drop_cols]

print("Number of feature columns:", len(feature_cols))
print(feature_cols[:10])


Number of feature columns: 42
['dur', 'proto', 'service', 'state', 'spkts', 'dpkts', 'sbytes', 'dbytes', 'rate', 'sttl']


In [5]:
print("Categorical columns present:",
      all(col in feature_cols for col in ["proto","service","state"]))


Categorical columns present: True


## 5. Model Architecture

We use a two-hidden-layer MLP for classification. The input dimension is 186 after preprocessing (one-hot encoding plus scaling). The output layer size corresponds to the number of classes at each task. ReLU activations are used in the hidden layers, and raw logits are passed to CrossEntropyLoss.

In [11]:
import torch.nn as nn
import torch.nn.functional as F

class MLPIDS(nn.Module):  # Intrusion detection model: a simple multi-layer perceptron (MLP)
    def __init__(self, input_dim: int, num_classes: int):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 256)
        self.fc2 = nn.Linear(256, 128)
        self.out = nn.Linear(128, num_classes)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.out(x)

# Note: input_dim (Section 7) and le1 (Section 8) must already be defined when this cell runs.
model = MLPIDS(input_dim=input_dim, num_classes=len(le1.classes_)).to(device)
print(model)


MLPIDS(
  (fc1): Linear(in_features=186, out_features=256, bias=True)
  (fc2): Linear(in_features=256, out_features=128, bias=True)
  (out): Linear(in_features=128, out_features=2, bias=True)
)


In [15]:
from sklearn.preprocessing import LabelEncoder
import numpy as np

le1 = LabelEncoder()
le1.fit(task1_train["attack_cat"])

y1_train = le1.transform(task1_train["attack_cat"])
y1_test  = le1.transform(task1_test["attack_cat"])

print("Task1 classes:", list(le1.classes_))
print("Unique y1 labels:", np.unique(y1_train))


Task1 classes: ['Generic', 'Normal']
Unique y1 labels: [0 1]


## 6. Feature Encoding and Scaling
We apply a preprocessing pipeline consisting of:

- OneHotEncoder for categorical protocol-related features:
  - `proto`
  - `service`
  - `state`

- RobustScaler for numerical features.

RobustScaler is chosen because network traffic features are often
heavy-tailed and contain extreme values. Because it centers on the median
and scales by the interquartile range, it is more stable than
StandardScaler in this context.

To preserve class-incremental learning realism, the preprocessing
pipeline is fitted only on Task 1 training data.
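
To illustrate why median/IQR scaling behaves differently on heavy-tailed features, here is a small NumPy sketch. It re-implements the default RobustScaler formula by hand for illustration only; the byte counts are made-up values, not taken from UNSW-NB15.

```python
import numpy as np


def robust_scale(x):
    """(x - median) / IQR, the default centering and scaling used by RobustScaler."""
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return (x - med) / (q3 - q1)


def standard_scale(x):
    """(x - mean) / std, the formula used by StandardScaler."""
    return (x - x.mean()) / x.std()


# Hypothetical byte counts: five ordinary flows plus one extreme outlier.
x = np.array([100.0, 120.0, 130.0, 150.0, 160.0, 1_000_000.0])
r = robust_scale(x)
s = standard_scale(x)

print("robust:  ", np.round(r, 2))
print("standard:", np.round(s, 4))
```

Under standard scaling the outlier inflates the mean and standard deviation, so the five ordinary flows collapse to almost the same value. Robust scaling keeps them spread out. Note that the outlier itself is not clipped: it still maps to a very large value after robust scaling.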


In [21]:
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.compose import ColumnTransformer

categorical_cols = ["proto", "service", "state"]
drop_cols = ["id", "attack_cat", "label"]
feature_cols = [c for c in train.columns if c not in drop_cols]
numeric_cols = [c for c in feature_cols if c not in categorical_cols]

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", RobustScaler(), numeric_cols),
    ]
)

preprocessor.fit(task1_train[feature_cols])


## 7. Transform Task Data into Model Inputs

We apply the preprocessing pipeline to convert each task dataset into
numerical model inputs.

- The transformer output may be sparse (due to one-hot encoding),
  so we convert it into a dense NumPy array for PyTorch.
- We record the final input dimension (`input_dim`) which defines the
  neural network input layer size.


In [22]:
import numpy as np

def to_dense_array(X):
    return X.toarray() if hasattr(X, "toarray") else np.asarray(X)

X1_train_np = to_dense_array(preprocessor.transform(task1_train[feature_cols]))
X1_test_np  = to_dense_array(preprocessor.transform(task1_test[feature_cols]))

input_dim = X1_train_np.shape[1]
print("Input dim:", input_dim)


Input dim: 186


## 8. Label Encoding for Task 1

For Task 1, we encode attack categories into integer labels
using LabelEncoder.
Since Task 1 contains only:
- Generic
- Normal

the model output layer initially has 2 classes.

This mapping ensures that labels are in the range [0, C-1],
which is required for CrossEntropyLoss in PyTorch.

Why not use the same encoder for all tasks?

Because at this stage the model has only seen Task 1 classes. Later, when Task 2 introduces a new class, we expand the classifier and update the mapping.


In [23]:
from sklearn.preprocessing import LabelEncoder
import numpy as np

le1 = LabelEncoder()
le1.fit(task1_train["attack_cat"])

y1_train = le1.transform(task1_train["attack_cat"])
y1_test  = le1.transform(task1_test["attack_cat"])

print("Task1 classes:", list(le1.classes_))
print("Counts:", np.bincount(y1_train))


Task1 classes: ['Generic', 'Normal']
Counts: [40000 56000]


## 9. PyTorch Dataset and DataLoader (Task 1)
We convert the preprocessed NumPy arrays into PyTorch tensors.
- Features are stored as float32 tensors.
- Labels are stored as long tensors (required for CrossEntropyLoss).

We then create:
- A training DataLoader (with shuffling)
- A test DataLoader (without shuffling)

Batch size is set to 256.


In [24]:
import torch
from torch.utils.data import TensorDataset, DataLoader

batch_size = 256

train_ds1 = TensorDataset(
    torch.tensor(X1_train_np, dtype=torch.float32),
    torch.tensor(y1_train, dtype=torch.long)
)
test_ds1 = TensorDataset(
    torch.tensor(X1_test_np, dtype=torch.float32),
    torch.tensor(y1_test, dtype=torch.long)
)

train_loader1 = DataLoader(train_ds1, batch_size=batch_size, shuffle=True)
test_loader1  = DataLoader(test_ds1, batch_size=batch_size, shuffle=False)



## 10. Training Function (Task 1)

We define a generic training function for a single task.

Key components:
- Optimizer: Adam
- Loss function: CrossEntropyLoss
- Class weighting to mitigate class imbalance
- Evaluation using Accuracy and Macro-F1

Class weights are computed from label frequencies to reduce bias
towards dominant classes (e.g., Normal traffic).

Why Macro-F1?

Macro-F1 gives equal importance to all classes and is more appropriate for imbalanced multi-class problems.
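
As a worked example of the class-weighting step, the sketch below applies the inverse-frequency formula w_c = N / (C * n_c) to the Task 1 class counts (40,000 Generic, 56,000 Normal). The helper name `balanced_class_weights` is introduced here for illustration only.

```python
import numpy as np


def balanced_class_weights(y):
    """Inverse-frequency weights: total samples / (number of classes * class count)."""
    counts = np.bincount(y)
    return counts.sum() / (len(counts) * counts)


# Task 1 label counts: index 0 = Generic (40,000), index 1 = Normal (56,000).
y = np.array([0] * 40_000 + [1] * 56_000)
w = balanced_class_weights(y)
print(w)  # 1.2 for Generic, about 0.857 for Normal
```

With these weights each class contributes the same total weight (48,000) to the loss. One caveat: a class index with zero samples would produce a division by zero, so every index in `0..C-1` needs at least one sample.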


In [25]:
import torch.nn as nn
import numpy as np

def train_one_task(model, train_loader, test_loader, device, y_train_for_weights, epochs=3, lr=1e-4):
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    # Class weights to reduce imbalance bias
    counts = np.bincount(y_train_for_weights)
    weights = (counts.sum() / (len(counts) * counts)).astype(np.float32)
    weights = torch.tensor(weights, dtype=torch.float32).to(device)

    loss_fn = nn.CrossEntropyLoss(weight=weights)

    for epoch in range(1, epochs + 1):
        model.train()
        total_loss = 0.0

        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            logits = model(xb)
            loss = loss_fn(logits, yb)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        avg_loss = total_loss / len(train_loader)
        acc, f1 = evaluate(model, test_loader, device)
        print(f"Epoch {epoch}/{epochs} | Loss: {avg_loss:.4f} | Test Acc: {acc:.4f} | Test Macro-F1: {f1:.4f}")

    return model
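
The training function calls an `evaluate` helper whose definition does not appear in this notebook export. The following is a minimal sketch of what such a helper could look like, assuming it returns `(accuracy, macro_f1)` computed with scikit-learn; the original implementation may differ.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.metrics import accuracy_score, f1_score
from torch.utils.data import DataLoader, TensorDataset


@torch.no_grad()
def evaluate(model, loader, device):
    """Assumed helper: return (accuracy, macro-F1) over all batches in `loader`."""
    model.eval()
    preds, targets = [], []
    for xb, yb in loader:
        logits = model(xb.to(device))
        preds.append(logits.argmax(dim=1).cpu().numpy())
        targets.append(yb.numpy())
    y_pred = np.concatenate(preds)
    y_true = np.concatenate(targets)
    return accuracy_score(y_true, y_pred), f1_score(y_true, y_pred, average="macro")


# Toy check: an identity-weight linear layer classifies one-hot inputs perfectly.
X = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y = torch.tensor([0, 1, 0, 1])
toy = nn.Linear(2, 2)
with torch.no_grad():
    toy.weight.copy_(torch.eye(2))
    toy.bias.zero_()
acc, f1 = evaluate(toy, DataLoader(TensorDataset(X, y), batch_size=2), "cpu")
print(acc, f1)
```

By default, `f1_score(average="macro")` averages over every label that appears in either the targets or the predictions, which matters once the output layer has more units than the test set has classes.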


## 11. Task 1 Training Results

We initialize the neural network with 2 output classes
(Generic and Normal) and train on Task 1.

After training, we evaluate performance on Task 1 test data
using Accuracy and Macro-F1.


In [26]:
model = MLPIDS(input_dim=input_dim, num_classes=len(le1.classes_)).to(device)

model = train_one_task(
    model=model,
    train_loader=train_loader1,
    test_loader=test_loader1,
    device=device,
    y_train_for_weights=y1_train,
    epochs=3,
    lr=1e-4
)


Epoch 1/3 | Loss: 1.8928 | Test Acc: 0.9508 | Test Macro-F1: 0.9460
Epoch 2/3 | Loss: 2.2059 | Test Acc: 0.9759 | Test Macro-F1: 0.9731
Epoch 3/3 | Loss: 1.9823 | Test Acc: 0.9841 | Test Macro-F1: 0.9821


**Loss**: CrossEntropyLoss with class weights. It measures prediction error; lower is better.

**Test Accuracy**: the percentage of correctly classified samples. It reached ~98.4%, so the model distinguishes Generic from Normal very well.

**Macro-F1**: the unweighted average of the per-class F1 scores. Since Task 1 has Generic (40k) and Normal (56k) training samples, Macro-F1 avoids bias toward the larger class. It reached 0.9821, which means both classes are classified well and the model is not heavily biased toward one class.
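
To make the difference between accuracy and Macro-F1 concrete, here is a small from-scratch sketch on made-up labels (not notebook data). It computes per-class F1 as 2TP / (2TP + FP + FN) and averages the classes with equal weight.

```python
import numpy as np


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in targets or predictions."""
    labels = np.union1d(y_true, y_pred)
    scores = []
    for c in labels:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))


# Hypothetical imbalanced case: 90 samples of class 0, 10 of class 1,
# and a degenerate classifier that always predicts class 0.
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)

accuracy = float(np.mean(y_true == y_pred))
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro_f1={score:.3f}")
```

Accuracy stays at 0.9 because the majority class dominates, while Macro-F1 falls below 0.5 because the ignored minority class contributes an F1 of zero.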

Task 2 WITHOUT ER (to observe forgetting)

## 12. Task 2 Data Transformation

Task 2 introduces a new attack class (Exploits) while keeping previously
seen classes (Normal, Generic).

We transform Task 2 using the SAME preprocessing pipeline fitted on Task 1.
This avoids leaking information from future tasks into preprocessing and
matches the continual learning setting.

The output feature dimension remains 186 (same input space for all tasks).
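
One consequence of fitting the encoder on Task 1 only is worth seeing in code: with `handle_unknown="ignore"`, a category that was not present at fit time is encoded as an all-zero block instead of raising an error. The sketch below uses made-up protocol values to show this; whether Task 2 actually contains unseen categories is not checked in this notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on the categories available in the first task (hypothetical values).
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(np.array([["tcp"], ["udp"]], dtype=object))

seen = enc.transform(np.array([["tcp"]], dtype=object)).toarray()
unseen = enc.transform(np.array([["sctp"]], dtype=object)).toarray()

print("seen:  ", seen)
print("unseen:", unseen)
```

The output width is fixed at fit time, which is why the feature dimension stays constant across tasks. The trade-off is that an unseen category carries no signal in the one-hot block.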


In [27]:
# Transform Task2 (same preprocessor fitted on Task1)
X2_train_np = to_dense_array(preprocessor.transform(task2_train[feature_cols]))
X2_test_np  = to_dense_array(preprocessor.transform(task2_test[feature_cols]))

print("Task2 X train shape:", X2_train_np.shape)

# Task 2 has more samples because it includes an additional class (Exploits).


Task2 X train shape: (129393, 186)


## 13. Label Encoding for Task 2 (Expanded Class Set)

Task 2 introduces a new class: Exploits.

We create a new LabelEncoder that maps:
- Exploits
- Generic
- Normal

to integer labels in the range [0, 2].

This reflects the expanded classification space for Task 2.


In [28]:
from sklearn.preprocessing import LabelEncoder
import numpy as np

#
le2 = LabelEncoder()
le2.fit(task2_train["attack_cat"])

y2_train = le2.transform(task2_train["attack_cat"])
y2_test  = le2.transform(task2_test["attack_cat"])

print("Task2 classes:", list(le2.classes_))
print("Unique y2 labels:", np.unique(y2_train))


Task2 classes: ['Exploits', 'Generic', 'Normal']
Unique y2 labels: [0 1 2]


## 14. Expanding the Classifier for Incremental Learning

When Task 2 introduces a new class, the model's output layer must expand
from 2 classes to 3 classes.

We define a function that:

1. Creates a new output layer with the updated number of classes.
2. Copies the learned weights and biases from the old classes.
3. Initializes the new class weights randomly.

This allows the model to retain knowledge of previous classes
while enabling learning of the new class.


In [29]:
import torch.nn as nn
import torch

def expand_output_layer(model, new_num_classes):
    old_out = model.out
    old_num_classes = old_out.out_features

    if new_num_classes <= old_num_classes:
        return model  # nothing to do

    # New layer
    new_out = nn.Linear(old_out.in_features, new_num_classes)

    # Copy old weights/bias into first part
    with torch.no_grad():
        new_out.weight[:old_num_classes] = old_out.weight
        new_out.bias[:old_num_classes] = old_out.bias

    model.out = new_out.to(next(model.parameters()).device)
    return model


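Before Task 2:

- Output layer size = 2
- Classes = [Generic, Normal]

After Task 2:

- Output layer size = 3
- Classes = [Exploits, Generic, Normal]

Note that LabelEncoder sorts class names alphabetically, so Exploits takes index 0 and shifts Generic and Normal to indices 1 and 2. The rows copied by `expand_output_layer` stay at indices 0 and 1, so they no longer line up with the classes they were trained on.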
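
Because the Task 2 encoder places Exploits at index 0, copying old rows into the first positions of the new layer attaches the Generic and Normal weights to different class indices. One possible alternative is to align rows by class name. The sketch below assumes this approach; the function `expand_head_by_name` is hypothetical and is not the function used in this notebook.

```python
import torch
import torch.nn as nn


def expand_head_by_name(old_out, old_classes, new_classes):
    """Build a larger nn.Linear whose rows follow `new_classes`,
    copying each old row to the index its class name has in the new ordering."""
    new_out = nn.Linear(old_out.in_features, len(new_classes))
    with torch.no_grad():
        for old_idx, name in enumerate(old_classes):
            new_idx = new_classes.index(name)
            new_out.weight[new_idx] = old_out.weight[old_idx]
            new_out.bias[new_idx] = old_out.bias[old_idx]
    return new_out.to(old_out.weight.device)


old_head = nn.Linear(4, 2)  # rows: Generic, Normal
new_head = expand_head_by_name(
    old_head,
    old_classes=["Generic", "Normal"],
    new_classes=["Exploits", "Generic", "Normal"],
)
print(new_head)
```

Rows for classes that did not exist before (here Exploits) keep their random initialization.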

In [30]:
model = expand_output_layer(model, new_num_classes=len(le2.classes_))
print("New output classes:", model.out.out_features)


New output classes: 3


## 15. PyTorch Dataset and DataLoader (Task 2)

We convert Task 2 features and labels into PyTorch tensors and create
DataLoaders for training and evaluation.

This enables batch-based training on Task 2 while maintaining the
same preprocessing and input feature dimension as Task 1.


In [31]:
from torch.utils.data import TensorDataset, DataLoader
import torch

train_ds2 = TensorDataset(
    torch.tensor(X2_train_np, dtype=torch.float32),
    torch.tensor(y2_train, dtype=torch.long)
)
test_ds2 = TensorDataset(
    torch.tensor(X2_test_np, dtype=torch.float32),
    torch.tensor(y2_test, dtype=torch.long)
)

train_loader2 = DataLoader(train_ds2, batch_size=256, shuffle=True)
test_loader2  = DataLoader(test_ds2, batch_size=256, shuffle=False)


## 16. Baseline Performance Before Learning Task 2

Before training on Task 2, we evaluate the current model on Task 1 test data.

This provides the reference performance used to compute forgetting after
Task 2 training.


In [32]:
acc_t1_before, f1_t1_before = evaluate(model, test_loader1, device)
print("Before Task2 training | Task1 Test Acc:", acc_t1_before, "| Macro-F1:", f1_t1_before)


Before Task2 training | Task1 Test Acc: 0.9820121350969196 | Macro-F1: 0.6542721802050965


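We first build the Task 2 DataLoaders. Then, before training on Task 2, we evaluate the model on Task 1 to record baseline performance. After Task 2 training, we evaluate again on Task 1; the difference is the forgetting metric.

Accuracy remains high (~0.982), but Macro-F1 falls to ~0.654 even though no training has happened yet. This is consistent with the newly added, randomly initialized output unit being predicted for a few samples: if `evaluate` averages F1 over every label that appears in the targets or predictions, a third class with F1 = 0 enters the mean, and 0.98 × 2/3 ≈ 0.65. In addition, the label mapping changes when the class set expands (Generic moves from index 0 to 1, Normal from 1 to 2). In the final integrated codebase we will use a consistent global label mapping to keep metric comparisons strictly aligned across tasks.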
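
A consistent global label mapping could look like the following sketch. It assumes indices are assigned once, in order of first appearance (Normal first, then the attack order), so existing classes never change index when a new task arrives. The names `CLASS_TO_IDX` and `encode_labels` are illustrative, not part of the notebook.

```python
import numpy as np

NORMAL = "Normal"
attack_order = ["Generic", "Exploits", "Fuzzers", "DoS", "Reconnaissance",
                "Analysis", "Backdoor", "Shellcode", "Worms"]

# One fixed index per class, assigned in order of first appearance.
CLASS_TO_IDX = {name: i for i, name in enumerate([NORMAL] + attack_order)}


def encode_labels(names):
    """Map class names to their global integer indices."""
    return np.array([CLASS_TO_IDX[n] for n in names], dtype=np.int64)


task1_labels = encode_labels(["Normal", "Generic", "Normal"])
task2_labels = encode_labels(["Exploits", "Generic", "Normal"])
print(task1_labels, task2_labels)
```

With this ordering a new class always receives the next free index, so copying the old output rows into the first rows of an expanded layer keeps them aligned with their classes.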

## 17. Task 2 Training WITHOUT Experience Replay (Baseline)

We now train the expanded model on Task 2 data
without using any replay mechanism.

This allows us to observe the extent of catastrophic forgetting
on Task 1.


In [33]:
model = train_one_task(
    model=model,
    train_loader=train_loader2,
    test_loader=test_loader2,
    device=device,
    y_train_for_weights=y2_train,
    epochs=3,
    lr=1e-4
)


Epoch 1/3 | Loss: 5.1485 | Test Acc: 0.8005 | Test Macro-F1: 0.7853
Epoch 2/3 | Loss: 3.9804 | Test Acc: 0.8145 | Test Macro-F1: 0.8042
Epoch 3/3 | Loss: 3.1529 | Test Acc: 0.8146 | Test Macro-F1: 0.8061


## 18. Measuring Catastrophic Forgetting

After training on Task 2, we evaluate the model again
on Task 1 test data.

Forgetting is computed as:

Forgetting = (Task 1 Macro-F1 before Task 2) - (Task 1 Macro-F1 after Task 2)

A large drop indicates severe catastrophic forgetting.


In [34]:
acc_t1_after, f1_t1_after = evaluate(model, test_loader1, device)
print("After Task2 training | Task1 Test Acc:", acc_t1_after, "| Macro-F1:", f1_t1_after)

print("\nForgetting (Macro-F1 drop):", f1_t1_before - f1_t1_after)


After Task2 training | Task1 Test Acc: 0.03071360813302071 | Macro-F1: 0.026749372848815862

Forgetting (Macro-F1 drop): 0.6275228073562806


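After training on Task 2 without replay, Task 1 Macro-F1 drops dramatically from ~0.65 to ~0.03. The forgetting metric is approximately 0.63, indicating severe catastrophic forgetting in the standard fine-tuning setup. Note that `test_loader1` still carries Task 1 label indices (Generic = 0, Normal = 1) while Task 2 training uses the expanded mapping (Generic = 1, Normal = 2), so part of this drop reflects the index shift rather than forgetting alone.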

## 19. Experience Replay (ER): Replay Buffer Implementation

To mitigate catastrophic forgetting, we implement a replay buffer.

The buffer:
- Stores samples from previously learned tasks.
- Has a fixed memory capacity.
- Uses random replacement when full (a simplified, reservoir-style policy: each new sample overwrites a randomly chosen slot).
- Allows sampling of past examples during future task training.

This mechanism enables the model to rehearse previous knowledge
while learning new classes.


In [35]:
import random
import numpy as np
import torch

class ReplayBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.x = []
        self.y = []

    def add_dataset(self, X, y):
        """Add samples (e.g. from Task 1) to memory.
        Once full, each new sample overwrites a random slot, keeping memory bounded."""
        for xi, yi in zip(X, y):
            if len(self.y) < self.capacity:
                self.x.append(xi)
                self.y.append(yi)
            else:
                idx = random.randint(0, self.capacity - 1)
                self.x[idx] = xi
                self.y[idx] = yi

    def sample(self, batch_size):
        """Randomly sample stored examples and return them as tensors,
        ready to be mixed with new-task data."""
        if len(self.y) == 0:
            return None, None
        idx = random.sample(range(len(self.y)), min(batch_size, len(self.y)))
        # Stack into one array first; building a tensor from a list of arrays is slow.
        x = torch.tensor(np.stack([self.x[i] for i in idx]), dtype=torch.float32)
        y = torch.tensor([self.y[i] for i in idx], dtype=torch.long)
        return x, y
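
The buffer above overwrites a random slot for every incoming sample once it is full, which favours samples that arrive late in the stream. Classic reservoir sampling (Algorithm R) instead replaces a slot only with probability capacity / n_seen, so every sample seen so far is equally likely to be retained. The class below is a minimal sketch of that variant; `ReservoirBuffer` is a hypothetical name and is not used elsewhere in this notebook.

```python
import random


class ReservoirBuffer:
    """Fixed-size buffer that keeps a uniform random subset of everything added."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.n_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Draw j uniformly from [0, n_seen); replace only if it lands in a slot.
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.items[j] = item


buf = ReservoirBuffer(capacity=100, seed=0)
for i in range(10_000):
    buf.add(i)

print(len(buf.items), buf.n_seen)
```

In this notebook the items would be `(features, label)` pairs; integers are used here only to keep the example short.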


## 20. Initializing Replay Memory

We initialize a replay buffer with a fixed capacity (20,000 samples).

The buffer is populated with Task 1 training data before learning Task 2.
This ensures that past knowledge is available during incremental training.


In [39]:
memory_size = 20000  # tunable hyperparameter balancing memory cost and performance
buffer = ReplayBuffer(capacity=memory_size)

buffer.add_dataset(X1_train_np, y1_train)

print("Replay memory size:", len(buffer.y))


Replay memory size: 20000


## 21. Task 2 Training WITH Experience Replay

We define a modified training loop for Task 2 that:

1. Samples a batch from Task 2.
2. Samples a replay batch from the memory buffer.
3. Concatenates both batches.
4. Performs a single optimization step on the combined data.

This allows the model to rehearse previous task samples while
learning the new class.


In [40]:
def train_task2_with_er(model, train_loader, test_loader, device, buffer, y_train_for_weights, epochs=3, lr=1e-4, replay_batch_size=256):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    counts = np.bincount(y_train_for_weights)
    weights = (counts.sum() / (len(counts) * counts)).astype(np.float32)
    weights = torch.tensor(weights, dtype=torch.float32).to(device)

    loss_fn = torch.nn.CrossEntropyLoss(weight=weights)

    for epoch in range(1, epochs + 1):
        model.train()
        total_loss = 0.0

        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)

            # Sample replay
            xr, yr = buffer.sample(replay_batch_size)
            if xr is not None:
                xr, yr = xr.to(device), yr.to(device)
                xb = torch.cat([xb, xr], dim=0)
                yb = torch.cat([yb, yr], dim=0)

            optimizer.zero_grad()
            logits = model(xb)
            loss = loss_fn(logits, yb)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        avg_loss = total_loss / len(train_loader)
        acc, f1 = evaluate(model, test_loader, device)
        print(f"[ER] Epoch {epoch}/{epochs} | Loss: {avg_loss:.4f} | Task2 Test F1: {f1:.4f}")

    return model


## 22. Full ER Experiment: Re-train + Expand + Replay

To ensure fair comparison:

1. Re-train Task 1 from scratch.
2. Expand the classifier to include the new class.
3. Record Task 1 performance before Task 2 learning.
4. Train Task 2 using Experience Replay.
5. Measure Task 1 performance again to compute forgetting.

This allows direct comparison between:
- Standard fine-tuning (without ER)
- Experience Replay


In [41]:
# Re-train Task 1 from scratch
model = MLPIDS(input_dim=input_dim, num_classes=len(le1.classes_)).to(device)
model = train_one_task(model, train_loader1, test_loader1, device, y1_train, epochs=3, lr=1e-4)

# Expand
model = expand_output_layer(model, new_num_classes=len(le2.classes_))

# Measure before
acc_t1_before, f1_t1_before = evaluate(model, test_loader1, device)
print("Before ER Task2 | Task1 F1:", f1_t1_before)

# Train Task2 WITH ER
model = train_task2_with_er(
    model=model,
    train_loader=train_loader2,
    test_loader=test_loader2,
    device=device,
    buffer=buffer,
    y_train_for_weights=y2_train,
    epochs=3,
    lr=1e-4
)

# Measure forgetting
acc_t1_after, f1_t1_after = evaluate(model, test_loader1, device)
print("After ER Task2 | Task1 F1:", f1_t1_after)
print("Forgetting with ER:", f1_t1_before - f1_t1_after)


Epoch 1/3 | Loss: 1.7547 | Test Acc: 0.9315 | Test Macro-F1: 0.9258
Epoch 2/3 | Loss: 2.0364 | Test Acc: 0.9538 | Test Macro-F1: 0.9493
Epoch 3/3 | Loss: 3.7698 | Test Acc: 0.9834 | Test Macro-F1: 0.9813
Before ER Task2 | Task1 F1: 0.647409589758091
[ER] Epoch 1/3 | Loss: 2.1896 | Task2 Test F1: 0.3545
[ER] Epoch 2/3 | Loss: 2.3881 | Task2 Test F1: 0.4036
[ER] Epoch 3/3 | Loss: 2.0842 | Task2 Test F1: 0.3785
After ER Task2 | Task1 F1: 0.30631000156039284
Forgetting with ER: 0.3410995881976982


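That is a reduction in forgetting of roughly 46% (from ≈ 0.63 to ≈ 0.34).

Without replay, the measured drop in Task 1 Macro-F1 is severe (≈ 0.63). With Experience Replay, the drop is reduced to ≈ 0.34. Although performance is not fully preserved, ER mitigates the measured forgetting by rehearsing past samples during incremental training. Two caveats apply to this comparison: the replay buffer stores labels under the Task 1 mapping (Generic = 0, Normal = 1), which differs from the Task 2 mapping used for the new data, and Task 2 Macro-F1 is much lower with ER (≈ 0.38) than without it (≈ 0.81).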
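
The comparison can be reproduced as simple arithmetic from the Macro-F1 values printed above. This is a small sketch; the `forgetting` helper is introduced here for illustration.

```python
def forgetting(f1_before, f1_after):
    """Drop in Task 1 Macro-F1 caused by training on Task 2."""
    return f1_before - f1_after


# Task 1 Macro-F1 values copied from the two experiment outputs above.
no_replay = forgetting(0.6542721802050965, 0.026749372848815862)
with_er = forgetting(0.647409589758091, 0.30631000156039284)

relative_reduction = (no_replay - with_er) / no_replay
print(f"no replay: {no_replay:.4f} | with ER: {with_er:.4f} | reduction: {relative_reduction:.1%}")
```

The relative reduction comes out at about 46%.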