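### MIND Dataset + BERT Experiment Notebook

- This notebook loads the `news.tsv` files of the Microsoft MIND dataset from local disk and trains a news category classifier.
- It fine-tunes a pretrained BERT model (`bert-base-uncased`) on each article's title, concatenated with its abstract in Cell 3.
- Data loading, label mapping, and preprocessing are documented in detail in `mind_bert/README.md`.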


In [1]:
# ┌── Cell 1: Install the required libraries
%pip install --upgrade transformers datasets scikit-learn pandas

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
# ┌── Cell 2: Imports, seeding, and device check
import os, random
import numpy as np
import pandas as pd
import torch

from datasets import Dataset, DatasetDict
from transformers import (
    BertTokenizerFast,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
)
from sklearn.metrics import accuracy_score

# Fix the random seeds for reproducibility
def set_seed(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed()

# Select the device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"▶ PyTorch: {torch.__version__}, device: {device}")

▶ PyTorch: 2.7.1+cu126, device: cuda


In [3]:
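# ┌── Cell 3: Load MIND news.tsv + rich-context preprocessing (title + abstract)
def load_news(path):
    # MIND's news.tsv has no header row, so read it with header=None;
    # header=0 would silently consume the first article as a header.
    # Note: the outputs recorded in this notebook come from a run that used
    # header=0, so each split is one row short of what this code loads.
    df = pd.read_csv(
        path,
        sep="\t",
        header=None,
        names=["newsID","category","subcategory","title","abstract"],
        quoting=3,            # csv.QUOTE_NONE
        encoding="utf-8",
        usecols=[0,1,2,3,4],  # newsID,category,subcategory,title,abstract
    )
    # Concatenate title and abstract into a single input text
    df["text"] = df["title"].fillna("") + " " + df["abstract"].fillna("")
    return df[["text","category"]]

train_df = load_news("/home/elicer/MINDlarge_train/news.tsv")
val_df   = load_news("/home/elicer/MINDlarge_dev/news.tsv")
test_df  = load_news("/home/elicer/MINDlarge_test/news.tsv")

print(f"▶ train/dev/test sample counts: {len(train_df)}/{len(val_df)}/{len(test_df)}")

▶ train/dev/test sample counts: 101526/72022/120960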
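
MIND's `news.tsv` ships without a header line, so the `header` argument of `pd.read_csv` decides whether the first article survives loading. The following is a minimal sketch on an in-memory TSV; the three rows are made up for illustration and only mimic the column order used in Cell 3.

```python
import io
import pandas as pd

# Three made-up rows in MIND's column order; note there is no header line.
raw = (
    "N1\tsports\tfootball\tTitle one\tAbstract one\n"
    "N2\tnews\tpolitics\tTitle two\tAbstract two\n"
    "N3\tfinance\tmarkets\tTitle three\tAbstract three\n"
)
cols = ["newsID", "category", "subcategory", "title", "abstract"]

def read(header):
    # Same parsing options as Cell 3, with only `header` varied.
    return pd.read_csv(io.StringIO(raw), sep="\t", header=header, names=cols, quoting=3)

with_header0 = read(0)       # first line is treated as a header and replaced by `names`
without_header = read(None)  # every line is treated as data

print(len(with_header0), len(without_header))
```

With `header=0` the row for `N1` is dropped, which is easy to miss because no warning is raised.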


In [4]:
# ┌── Cell 4: Label mapping & conversion to an HF DatasetDict
# Drop rows with a missing category
train_df = train_df.dropna(subset=["category"])
val_df   = val_df.dropna(subset=["category"])
test_df  = test_df.dropna(subset=["category"])

# Category → integer ID (built from the training split only)
labels   = sorted(train_df["category"].unique())
label2id = {c:i for i,c in enumerate(labels)}
id2label = {i:c for c,i in label2id.items()}
num_labels = len(labels)
print("▶ Number of classes:", num_labels, "Example labels:", labels[:5])

# Add integer labels (int64) to each DataFrame
for df in (train_df, val_df, test_df):
    df["label"] = df["category"].map(label2id).astype("int64")

# Build the HF DatasetDict
dataset_dict = DatasetDict({
    "train":      Dataset.from_pandas(train_df[["text","label"]],      preserve_index=False),
    "validation": Dataset.from_pandas(val_df[["text","label"]],        preserve_index=False),
    "test":       Dataset.from_pandas(test_df[["text","label"]],       preserve_index=False),
})

▶ Number of classes: 18 Example labels: ['autos', 'entertainment', 'finance', 'foodanddrink', 'games']
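
Because the label map is built from the training split only, a category that appears only in dev or test would map to `NaN`, and the `astype("int64")` cast would then raise. A small pure-Python sketch of a pre-check follows; the helper names `build_label_maps` and `unseen_categories` and the toy category lists are hypothetical, not part of the notebook.

```python
def build_label_maps(train_categories):
    # Same construction as Cell 4: sorted unique categories -> contiguous ids.
    labels = sorted(set(train_categories))
    label2id = {c: i for i, c in enumerate(labels)}
    id2label = {i: c for c, i in label2id.items()}
    return label2id, id2label

def unseen_categories(categories, label2id):
    # Categories present in a split but absent from the training label map.
    return sorted(set(categories) - set(label2id))

label2id, id2label = build_label_maps(["sports", "news", "finance", "news"])
missing = unseen_categories(["news", "games", "sports"], label2id)
print(label2id, missing)
```

Running such a check on `val_df["category"]` and `test_df["category"]` before the cast would turn a cryptic casting error into an explicit list of offending categories.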


In [9]:
# ┌── Cell 4.1: Filter out rare classes (minimum sample count)
# Note: dataset_dict, label2id, and num_labels were built in Cell 4 before this
# filter, so they still cover all 18 classes unless Cell 4 is re-run on the
# filtered DataFrames.
min_count = 100

# Count samples per class in the training split
counts = train_df["category"].value_counts()
# Classes to keep
keep_categories = counts[counts >= min_count].index.tolist()

# Remove the rare classes from train/val/test
train_df = train_df[train_df["category"].isin(keep_categories)]
val_df   = val_df[val_df["category"].isin(keep_categories)]
test_df  = test_df[test_df["category"].isin(keep_categories)]

print(f"▶ Number of classes after filtering: {len(keep_categories)}")
print("▶ Classes kept:", keep_categories)
print(f"▶ train/dev/test sample counts: {len(train_df)}/{len(val_df)}/{len(test_df)}")

▶ Number of classes after filtering: 15
▶ Classes kept: ['sports', 'news', 'finance', 'travel', 'video', 'lifestyle', 'foodanddrink', 'weather', 'autos', 'health', 'tv', 'music', 'entertainment', 'movies', 'kids']
▶ train/dev/test sample counts: 101522/72019/120956
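
If the filtered classes are meant to reach the model, the label ids need to be rebuilt after filtering so that they stay contiguous from 0. The sketch below shows the idea in pure Python on a toy list; `filter_and_remap` is a hypothetical helper, and the notebook itself does not rebuild the mapping.

```python
from collections import Counter

def filter_and_remap(categories, min_count):
    # Keep classes with at least `min_count` samples, then rebuild contiguous ids.
    counts = Counter(categories)
    keep = sorted(c for c, n in counts.items() if n >= min_count)
    label2id = {c: i for i, c in enumerate(keep)}
    kept = [c for c in categories if c in label2id]
    return kept, label2id

toy = ["sports"] * 5 + ["news"] * 4 + ["games"] * 1  # made-up class counts
kept, label2id = filter_and_remap(toy, min_count=2)
print(len(kept), label2id)
```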


In [10]:
# ┌── Cell 4.2: Check the per-class sample distribution after filtering
from collections import Counter

new_counts = Counter(train_df["category"])
print("▶ Samples per class after filtering:")
for cat, cnt in new_counts.most_common():
    print(f"  {cat:15s}: {cnt}")
print("▶ Max/min sample count ratio: "
      f"{max(new_counts.values())}:{min(new_counts.values())} "
      f"≈ {max(new_counts.values())/min(new_counts.values()):.1f}:1")

▶ Samples per class after filtering:
  sports         : 32020
  news           : 30478
  finance        : 5916
  travel         : 4955
  lifestyle      : 4569
  video          : 4569
  foodanddrink   : 4418
  weather        : 4255
  autos          : 3071
  health         : 2929
  tv             : 1323
  music          : 1263
  entertainment  : 837
  movies         : 815
  kids           : 104
▶ Max/min sample count ratio: 32020:104 ≈ 307.9:1
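
With a ratio of roughly 308:1, one common mitigation is to weight the loss by inverse class frequency. The sketch below applies the "balanced" heuristic, `total / (num_classes * count)`, to the counts printed above. The notebook does not use class weights; feeding them into training (for example through a weighted cross-entropy loss) is an assumption about a possible next step, not something this run did.

```python
# Per-class training counts copied from the Cell 4.2 output.
counts = {
    "sports": 32020, "news": 30478, "finance": 5916, "travel": 4955,
    "lifestyle": 4569, "video": 4569, "foodanddrink": 4418, "weather": 4255,
    "autos": 3071, "health": 2929, "tv": 1323, "music": 1263,
    "entertainment": 837, "movies": 815, "kids": 104,
}

total = sum(counts.values())
num_classes = len(counts)

# "Balanced" inverse-frequency weights: rare classes get proportionally larger weights.
weights = {c: total / (num_classes * n) for c, n in counts.items()}

for cat in ("sports", "kids"):
    print(f"{cat:8s} weight = {weights[cat]:.3f}")
```

By construction the weighted counts sum back to the dataset size, so the overall loss scale stays comparable to the unweighted case.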


In [5]:
# ┌── Cell 5: Load the tokenizer
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
MAX_LEN = 256

In [6]:
# ┌── Cell 6: Tokenization (including labels)
def preprocess_fn(batch):
    out = tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=MAX_LEN,
    )
    out["labels"] = batch["label"]
    return out

tokenized = dataset_dict.map(
    preprocess_fn,
    batched=True,
    remove_columns=["text","label"],
)

tokenized.set_format("torch", columns=["input_ids","attention_mask","labels"])
train_ds = tokenized["train"]
eval_ds  = tokenized["validation"]
test_ds  = tokenized["test"]

# Sanity check
print("▶ train_ds columns:", train_ds.column_names)
print("▶ sample labels:", train_ds[0]["labels"])

Map: 100%|██████████| 101526/101526 [00:21<00:00, 4791.67 examples/s]
Map: 100%|██████████| 72022/72022 [00:15<00:00, 4782.57 examples/s]
Map: 100%|██████████| 120960/120960 [00:24<00:00, 4840.14 examples/s]

▶ train_ds columns: ['input_ids', 'token_type_ids', 'attention_mask', 'labels']
▶ sample labels: tensor(11)
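
`padding="max_length"` pads every example to 256 tokens regardless of its real length. The pure-Python sketch below estimates how many padding tokens per-batch (dynamic) padding would save; the token lengths are made up, and `padding_tokens` is a hypothetical helper. In Transformers, per-batch padding is usually done with `DataCollatorWithPadding`, which this notebook does not use.

```python
MAX_LEN = 256

def padding_tokens(lengths, batch_size, dynamic):
    # Count pad tokens when padding to MAX_LEN (fixed) or to the longest item per batch (dynamic).
    total = 0
    for start in range(0, len(lengths), batch_size):
        # Truncate to MAX_LEN first, as truncation=True does in Cell 6.
        batch = [min(n, MAX_LEN) for n in lengths[start:start + batch_size]]
        target = max(batch) if dynamic else MAX_LEN
        total += sum(target - n for n in batch)
    return total

lengths = [40, 55, 62, 300, 48, 51, 70, 45]  # made-up token counts
fixed = padding_tokens(lengths, batch_size=4, dynamic=False)
dynamic = padding_tokens(lengths, batch_size=4, dynamic=True)
print(fixed, dynamic)
```

The saving depends on how token lengths are distributed, so it is worth measuring on the actual data before changing the pipeline.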





In [7]:
# ┌── Cell 7: Load the model and configure the Trainer (using sklearn.metrics)
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score
import numpy as np

# 1) Load the pretrained BERT model (passing the label count and label mappings)
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
)
model.to(device)

# 2) Define compute_metrics (using sklearn's accuracy_score)
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds)}

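# 3) Configure TrainingArguments
training_args = TrainingArguments(
    output_dir="./baseline_bert_mind",
    do_train=True,
    do_eval=True,
    # eval_steps only takes effect with eval_strategy="steps"; the strategy is left
    # at its default ("no") here, so no evaluation runs during training.
    eval_steps=500,
    save_steps=500,                  # save a checkpoint every 500 steps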
    logging_steps=50,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    fp16=True,
)

# 4) Instantiate the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    compute_metrics=compute_metrics,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
# ┌── Cell 8: Training and evaluation
trainer.train()

Step,Training Loss
50,2.1855
100,1.4254
150,1.2944
200,1.216
250,1.1941
300,1.0725
350,0.9686
400,1.0067
450,0.9063
500,0.92


TrainOutput(global_step=19038, training_loss=0.51278889946908, metrics={'train_runtime': 2262.5378, 'train_samples_per_second': 134.618, 'train_steps_per_second': 8.414, 'total_flos': 4.007467571228467e+16, 'train_loss': 0.51278889946908, 'epoch': 3.0})
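
The `global_step` and throughput figures in `TrainOutput` can be cross-checked from the settings in Cell 7. The sketch assumes a single device and no gradient accumulation, neither of which the notebook overrides.

```python
import math

train_examples = 101526   # size of the tokenized training split (Cell 6 output)
batch_size = 16           # per_device_train_batch_size
epochs = 3                # num_train_epochs
runtime_s = 2262.5378     # train_runtime reported by TrainOutput

steps_per_epoch = math.ceil(train_examples / batch_size)  # the last partial batch counts as a step
total_steps = steps_per_epoch * epochs
samples_per_s = train_examples * epochs / runtime_s
steps_per_s = total_steps / runtime_s

print(steps_per_epoch, total_steps, round(samples_per_s, 3), round(steps_per_s, 3))
```

The result agrees with `global_step=19038` and with the reported `train_samples_per_second` and `train_steps_per_second`.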

In [15]:
print("\n▶ Validation metrics:", trainer.evaluate(eval_ds))
print("▶ Test metrics:      ", trainer.evaluate(test_ds))


▶ Validation metrics: {'eval_loss': 0.22817420959472656, 'eval_accuracy': 0.928952264585821, 'eval_runtime': 119.5214, 'eval_samples_per_second': 602.587, 'eval_steps_per_second': 18.833, 'epoch': 3.0}
▶ Test metrics:       {'eval_loss': 0.3253803551197052, 'eval_accuracy': 0.8999917328042328, 'eval_runtime': 203.8844, 'eval_samples_per_second': 593.277, 'eval_steps_per_second': 18.54, 'epoch': 3.0}
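
Given the class imbalance, accuracy alone can hide poor performance on rare classes. The pure-Python sketch below computes per-class and macro-averaged F1 on made-up predictions to show the gap; `per_class_f1` is a hypothetical helper, and the numbers are unrelated to this run. In practice, scikit-learn's `f1_score(..., average="macro")` could be added to `compute_metrics` for the same purpose.

```python
from collections import defaultdict

def per_class_f1(y_true, y_pred):
    # F1 per class from true-positive, false-positive, and false-negative counts.
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    f1 = {}
    for c in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1[c] = 2 * tp[c] / denom if denom else 0.0
    return f1

# Toy example: a classifier that always predicts the majority class 0.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = per_class_f1(y_true, y_pred)
macro_f1 = sum(f1.values()) / len(f1)
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f}")
```

Here accuracy looks respectable while macro-F1 is much lower, because the minority class is never predicted.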
