[KoBERT emotion classification](https://bbarry-lee.github.io/ai-tech/KoBERT%EB%A5%BC-%ED%99%9C%EC%9A%A9%ED%95%9C-%EA%B0%90%EC%A0%95%EB%B6%84%EB%A5%98-%EB%AA%A8%EB%8D%B8-%EA%B5%AC%ED%98%84.html)



# Environment Setup

In [None]:
!pip install transformers

In [None]:
import os
data_path = "drive/MyDrive/2025/KW/Data/감정 분류를 위한 대화 음성 데이터셋/"
os.listdir(data_path)

['5차년도.csv', '5차년도_2차.csv', '4차년도.csv']

In [None]:
# Fix random seeds for reproducibility
import random
import numpy as np
import torch
import os

def set_seed(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # multi-gpu

    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    os.environ["PYTHONHASHSEED"] = str(seed)

set_seed(42)
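
A quick way to sanity-check seeding is to confirm that re-seeding reproduces the same draws. The sketch below uses only the standard-library `random` module as a stand-in (`draw` is a hypothetical helper, not part of the notebook); the same check could be repeated for the NumPy and PyTorch generators.

```python
import random

def draw(seed, n=5):
    """Seed the stdlib generator, then draw n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draw(42)
second = draw(42)
other = draw(43)

print(first == second)  # same seed, so the sequences match
print(first == other)   # a different seed gives a different sequence
```

Note that `PYTHONHASHSEED` only affects string hashing when it is set before the interpreter starts, so setting it inside a running notebook mainly matters for subprocesses.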

# Data

## Load

In [None]:
import pandas as pd
df = pd.read_csv(data_path + "5차년도_2차.csv", encoding="cp949")
df

Unnamed: 0,wav_id,발화문,상황,1번 감정,1번 감정세기,2번 감정,2번 감정세기,3번 감정,3번 감정세기,4번 감정,4번감정세기,5번 감정,5번 감정세기,나이,성별
0,5f4141e29dd513131eacee2f,헐! 나 이벤트에 당첨 됐어.,happiness,angry,2,surprise,2,happiness,2,happiness,2,happiness,2,48,female
1,5f4141f59dd513131eacee30,내가 좋아하는 인플루언서가 이벤트를 하더라고. 그래서 그냥 신청 한번 해봤지.,happiness,neutral,0,happiness,2,happiness,2,happiness,2,happiness,2,48,female
2,5f4142119dd513131eacee31,"한 명 뽑는 거였는데, 그게 바로 내가 된 거야.",happiness,angry,2,happiness,2,happiness,2,happiness,2,happiness,2,48,female
3,5f4142279dd513131eacee32,"당연히 마음에 드는 선물이니깐, 이벤트에 내가 신청 한번 해본 거지. 비싼 거야. ...",happiness,angry,2,happiness,2,happiness,2,happiness,2,happiness,1,48,female
4,5f3c9ed98a3c1005aa97c4bd,에피타이저 정말 좋아해. 그 것도 괜찮은 생각인 것 같애.,neutral,happiness,2,happiness,1,happiness,2,happiness,1,happiness,1,48,female
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19369,5fbe313c44697678c497c05a,나 엘리베이터에 갇혔어.,fear,happiness,1,sadness,1,sadness,2,sadness,1,sadness,1,23,female
19370,5fbe251044697678c497bfb8,하지만 기분이 나쁜 걸 어떡해?,angry,sadness,1,fear,1,sadness,2,sadness,1,neutral,0,23,female
19371,5fbe31584c55eb78bd7cee7f,자취방 엘리베이턴데 정전인가봐.,fear,sadness,1,neutral,0,sadness,2,fear,1,sadness,1,23,female
19372,5fbe2f8544697678c497c047,나 드디어 프로젝트 끝났어!,happiness,disgust,1,sadness,1,neutral,0,happiness,1,sadness,1,23,female


## Preprocessing

X: the input sentence (the `발화문` column)

y: the emotion the input sentence expresses (the `상황` column)

In [None]:
df[['발화문', '상황']]

Unnamed: 0,발화문,상황
0,헐! 나 이벤트에 당첨 됐어.,happiness
1,내가 좋아하는 인플루언서가 이벤트를 하더라고. 그래서 그냥 신청 한번 해봤지.,happiness
2,"한 명 뽑는 거였는데, 그게 바로 내가 된 거야.",happiness
3,"당연히 마음에 드는 선물이니깐, 이벤트에 내가 신청 한번 해본 거지. 비싼 거야. ...",happiness
4,에피타이저 정말 좋아해. 그 것도 괜찮은 생각인 것 같애.,neutral
...,...,...
19369,나 엘리베이터에 갇혔어.,fear
19370,하지만 기분이 나쁜 걸 어떡해?,angry
19371,자취방 엘리베이턴데 정전인가봐.,fear
19372,나 드디어 프로젝트 끝났어!,happiness


Label encoding

In [None]:
df['상황'].value_counts()

상황,count
happiness,4548
angry,3263
neutral,3253
sadness,2848
disgust,2321
surprise,1755
fear,1386
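
The classes are imbalanced: `happiness` has about 3.3 times as many examples as `fear`. The notebook trains with an unweighted loss; one common alternative, sketched here purely as an assumption, is to compute inverse-frequency ("balanced") class weights from the counts above. Such weights could then be passed to `nn.CrossEntropyLoss(weight=...)` as a tensor ordered by label index.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "happiness": 4548, "angry": 3263, "neutral": 3253, "sadness": 2848,
    "disgust": 2321, "surprise": 1755, "fear": 1386,
}

total = sum(counts.values())
num_labels = len(counts)

# "Balanced" heuristic: weight = total / (num_labels * count),
# so rare classes get weights above 1 and frequent ones below 1.
class_weights = {label: total / (num_labels * n) for label, n in counts.items()}

for label, w in sorted(class_weights.items(), key=lambda kv: kv[1]):
    print(f"{label:<10} count={counts[label]:>5}  weight={w:.3f}")
```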


In [None]:
# Label Encoding
label_map = {
    "fear"     : 0,
    "surprise" : 1,
    "angry"    : 2,
    "sadness"  : 3,
    "neutral"  : 4,
    "happiness": 5,
    "disgust"  : 6
}

df["y"] = df["상황"].map(label_map)
df['y'].unique()

array([5, 4, 3, 2, 1, 6, 0])
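
To turn predicted class indices back into emotion names later, the mapping can be inverted. A minimal sketch; `id2label` and `predicted_ids` are illustrative names that do not appear in the notebook.

```python
label_map = {
    "fear": 0, "surprise": 1, "angry": 2, "sadness": 3,
    "neutral": 4, "happiness": 5, "disgust": 6,
}

# Invert the mapping: class index -> emotion name.
id2label = {idx: name for name, idx in label_map.items()}

predicted_ids = [5, 4, 0]  # hypothetical model outputs
print([id2label[i] for i in predicted_ids])  # ['happiness', 'neutral', 'fear']
```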

In [None]:
num_classes = len(df['y'].unique())
num_classes

7

In [None]:
x_col = '발화문'
y_col = 'y'
input_data = df[[x_col] + [y_col]]
input_data

Unnamed: 0,발화문,y
0,헐! 나 이벤트에 당첨 됐어.,5
1,내가 좋아하는 인플루언서가 이벤트를 하더라고. 그래서 그냥 신청 한번 해봤지.,5
2,"한 명 뽑는 거였는데, 그게 바로 내가 된 거야.",5
3,"당연히 마음에 드는 선물이니깐, 이벤트에 내가 신청 한번 해본 거지. 비싼 거야. ...",5
4,에피타이저 정말 좋아해. 그 것도 괜찮은 생각인 것 같애.,4
...,...,...
19369,나 엘리베이터에 갇혔어.,0
19370,하지만 기분이 나쁜 걸 어떡해?,2
19371,자취방 엘리베이턴데 정전인가봐.,0
19372,나 드디어 프로젝트 끝났어!,5


Train/Valid/Test Split

In [None]:
from sklearn.model_selection import train_test_split

# Hold out 5% of the data as the test set, stratified by label.
trval_X, test_X, trval_y, test_y = train_test_split(
    input_data[x_col].tolist(), input_data[y_col].tolist(),
    test_size=0.05, stratify=input_data[y_col], random_state=42)

In [None]:
# Split 5% of the remaining data off as the validation set, again stratified by label.
train_X, valid_X, train_y, valid_y = train_test_split(
    trval_X, trval_y, test_size=0.05,
    stratify=trval_y, random_state=42)

In [None]:
print(f"            x      y")
print(f"train size: {len(train_X):<5}  {len(train_y):<5}")
print(f"valid size: {len(valid_X):<5}  {len(valid_y):<5}")
print(f"test size : {len(test_X):<5}  {len(test_y):<5}")

            x      y
train size: 17484  17484
valid size: 921    921  
test size : 969    969  
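
Because the validation set is 5% of the data that remains after the test split, the effective proportions are roughly 90.25% / 4.75% / 5% rather than 90 / 5 / 5. The arithmetic can be reproduced as follows, assuming the held-out portion is rounded up at each step (an assumption that matches the sizes printed above).

```python
import math

n_total = 19374      # 17484 + 921 + 969, from the printout above
test_frac = 0.05
valid_frac = 0.05

# Assumption: each held-out portion is rounded up to a whole number of rows.
n_test = math.ceil(n_total * test_frac)
n_trval = n_total - n_test
n_valid = math.ceil(n_trval * valid_frac)
n_train = n_trval - n_valid

print(n_train, n_valid, n_test)
for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n / n_total:.2%}")
```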


In [None]:
train_X[:5]

['그렇지. 경찰분들 진짜 고생 많으신 것 같애.',
 '무서워서 잠도 못 잤어.',
 '에이 그 정도는 아니야.',
 '아, 진짜? 기대된다. 뭐 찾아올지.',
 '날은 아주 좋았어.']

In [None]:
train_y[:5]

[4, 1, 5, 3, 4]

# 모델

In [None]:
model_path = "drive/MyDrive/2025/KW/Model/"
model_id = "monologg/kobert"

## Input

In [None]:
# from transformers import BertTokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=model_path, trust_remote_code=True)



In [None]:
!ls -ahl {model_path}

total 8.0K
drwx------ 2 root root 4.0K May  7 01:13 .locks
drwx------ 6 root root 4.0K May  7 01:13 models--monologg--kobert


In [None]:
text = "나는 학생입니다."

encoded = tokenizer(
    text,
    return_tensors='pt',
    padding='max_length',
    truncation=True,
    max_length=20
)

print("Input IDs:", encoded['input_ids'])
print("Token Type IDs:", encoded['token_type_ids'])
print("Attention Mask:", encoded['attention_mask'])

Input IDs: tensor([[   2, 1375, 4952, 7139,   54,    3,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1]])
Token Type IDs: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Attention Mask: tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
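
The relationship between the three tensors can be reproduced by hand. The sketch below is a simplified stand-in for what `padding='max_length'` does, not the tokenizer's actual implementation; `pad_to_max_length` is a hypothetical helper, and the token IDs and pad ID 1 are read off the output above.

```python
def pad_to_max_length(token_ids, max_length, pad_id):
    """Truncate or right-pad token_ids; return (input_ids, attention_mask).

    Simplification: plain slicing is used for truncation, whereas a real
    tokenizer keeps the closing [SEP] token when it truncates.
    """
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)              # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

# [CLS] ▁나는 ▁학생 입니다 . [SEP]
ids, mask = pad_to_max_length([2, 1375, 4952, 7139, 54, 3], max_length=20, pad_id=1)
print(ids)
print(mask)
```

The attention mask is what lets the model ignore the 14 padding positions, so the padded and unpadded sentence are encoded from the same real tokens.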


In [None]:
decoded_text = tokenizer.decode(
    encoded['input_ids'][0],
    skip_special_tokens=True
)

# Optional: view the individual tokens
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])

print("Encoded Input IDs:", encoded['input_ids'][0])
print("Tokens:", tokens)
print("Decoded Text:", decoded_text)

Encoded Input IDs: tensor([   2, 1375, 4952, 7139,   54,    3,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1])
Tokens: ['[CLS]', '▁나는', '▁학생', '입니다', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
Decoded Text: 나는 학생입니다.


In [None]:
from torch.utils.data import Dataset, DataLoader

class KoBERTDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=64):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        # Tokenize one sentence, padding or truncating it to exactly max_len tokens.
        inputs = self.tokenizer(
            text,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
            max_length=self.max_len
        )
        # return_tensors="pt" adds a leading batch dimension of size 1; drop only that one.
        return {
            'input_ids': inputs['input_ids'].squeeze(0),
            # 'token_type_ids': inputs['token_type_ids'].squeeze(0),
            'attention_mask': inputs['attention_mask'].squeeze(0),
            'label': torch.tensor(label, dtype=torch.long)
        }

In [None]:
train_dataset = KoBERTDataset(train_X, train_y, tokenizer)
valid_dataset = KoBERTDataset(valid_X, valid_y, tokenizer)
test_dataset = KoBERTDataset(test_X, test_y, tokenizer)

In [None]:
from torch.utils.data import Dataset, DataLoader
batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)
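
With the default `drop_last=False`, each loader keeps its final partial batch, so the number of batches per epoch is the split size divided by the batch size, rounded up. A small sketch of that arithmetic using the split sizes printed earlier:

```python
import math

batch_size = 64
split_sizes = {"train": 17484, "valid": 921, "test": 969}

batches = {}
for name, n in split_sizes.items():
    n_batches = math.ceil(n / batch_size)
    last_batch = n - (n_batches - 1) * batch_size  # size of the final, partial batch
    batches[name] = n_batches
    print(f"{name}: {n_batches} batches, last batch has {last_batch} examples")
```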

## Model

In [None]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
from transformers import BertModel
bert = BertModel.from_pretrained(model_id, cache_dir=model_path, trust_remote_code=True)

In [None]:
import torch.nn as nn

class KoBERTClassifier(nn.Module):
    """KoBERT encoder followed by dropout and a linear classification head."""

    def __init__(self, bert, num_classes, hidden_size=768, dropout=0.2):
        super().__init__()
        self.bert = bert
        self.dropout = nn.Dropout(dropout)
        # Map the pooled [CLS] representation to one logit per emotion class.
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs.pooler_output
        dropped = self.dropout(pooled)
        return self.classifier(dropped)
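
What the classification head computes can be illustrated without PyTorch. This is a toy sketch with a 4-dimensional vector and 3 classes standing in for the 768-dimensional `pooler_output` and 7 classes; `linear_head` and the weights are made up for illustration. Dropout is omitted because it is the identity in evaluation mode.

```python
def linear_head(pooled, weight, bias):
    """Compute logits[c] = sum_j weight[c][j] * pooled[j] + bias[c]."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weight, bias)]

pooled = [0.5, -1.0, 2.0, 0.0]        # toy stand-in for pooler_output
weight = [[1.0, 0.0, 0.0, 0.0],       # one row of weights per class
          [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0]]
bias = [0.0, 0.0, 0.0]

logits = linear_head(pooled, weight, bias)
pred = max(range(len(logits)), key=logits.__getitem__)  # argmax over classes
print(logits, pred)  # [0.5, -1.0, 2.0] 2

# Parameter count of the real head: a 768 x 7 weight matrix plus 7 biases.
n_params = 768 * 7 + 7
print(n_params)  # 5383
```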

In [None]:
model = KoBERTClassifier(bert, num_classes=num_classes).to(device)

## Train


In [None]:
from torch.optim import AdamW
from tqdm.notebook import tqdm
from sklearn.metrics import accuracy_score

# Optimizer and loss
optimizer = AdamW(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()

epochs = 3

for epoch in range(epochs):
    # Training
    model.train()
    train_loss = 0
    train_preds = []
    train_labels = []

    for batch in tqdm(train_loader, desc=f"Epoch {epoch+1} - Training"):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        preds = torch.argmax(outputs, dim=1)
        train_preds.extend(preds.cpu().numpy())
        train_labels.extend(labels.cpu().numpy())

    train_loss = train_loss / len(train_loader)
    train_acc = accuracy_score(train_labels, train_preds)

    # Validation
    model.eval()
    valid_loss = 0
    valid_preds = []
    valid_labels = []

    with torch.no_grad():
        for batch in valid_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)

            outputs = model(input_ids, attention_mask)
            loss = criterion(outputs, labels)

            valid_loss += loss.item()
            preds = torch.argmax(outputs, dim=1)
            valid_preds.extend(preds.cpu().numpy())
            valid_labels.extend(labels.cpu().numpy())

    valid_loss = valid_loss / len(valid_loader)
    valid_acc = accuracy_score(valid_labels, valid_preds)

    print(f"Epoch {epoch+1} - loss: {train_loss:.4f}  acc: {train_acc:.4f} | "
          f"val loss: {valid_loss:.4f}  val acc: {valid_acc:.4f}")

Epoch 1 - loss: 0.8497  acc: 0.7469 | val loss: 0.3369  val acc: 0.8903


Epoch 2 - loss: 0.2766  acc: 0.9150 | val loss: 0.2818  val acc: 0.9110


Epoch 3 - loss: 0.1858  acc: 0.9413 | val loss: 0.2729  val acc: 0.9131
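
A useful reference point for these numbers is the loss of a model that knows nothing: with 7 classes and uniform predictions, cross-entropy equals ln 7 ≈ 1.946, so the epoch-1 training loss of 0.8497 is already well below chance. The sketch below computes the per-example loss in plain Python; `cross_entropy` is an illustrative helper, whereas `nn.CrossEntropyLoss` additionally averages over the batch by default.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

num_classes = 7

# Uniform logits: every class gets probability 1/7.
uniform_loss = cross_entropy([0.0] * num_classes, target=3)
print(f"{uniform_loss:.4f}")  # 1.9459, i.e. ln(7)

# A confident, correct prediction yields a much smaller loss.
confident_loss = cross_entropy([0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0], target=3)
print(confident_loss < uniform_loss)
```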



## Evaluation

In [None]:
# Evaluation
model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        outputs = model(input_ids, attention_mask)
        preds = torch.argmax(outputs, dim=1)

        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

acc = accuracy_score(all_labels, all_preds)
print(f"Test Accuracy: {acc:.4f}")

Test Accuracy: 0.9309
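
Since the classes are imbalanced, overall accuracy can hide weak performance on rare classes such as `fear`. A confusion matrix and per-class recall give a finer view. The sketch below is a plain-Python version run on made-up toy data; the helper names are hypothetical, and in the notebook `all_labels`, `all_preds`, and `num_classes` would be passed in instead.

```python
def confusion_matrix(labels, preds, n_classes):
    """matrix[t][p] counts examples with true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(labels, preds):
        matrix[t][p] += 1
    return matrix

def per_class_recall(matrix):
    """Recall for class i = correct predictions for i / true examples of i."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]

# Toy data for illustration only (not results from the model).
toy_labels = [0, 0, 1, 1, 1, 2]
toy_preds = [0, 1, 1, 1, 1, 2]

cm = confusion_matrix(toy_labels, toy_preds, n_classes=3)
recall = per_class_recall(cm)
accuracy = sum(cm[i][i] for i in range(3)) / len(toy_labels)

print(cm)        # [[1, 1, 0], [0, 3, 0], [0, 0, 1]]
print(recall)    # [0.5, 1.0, 1.0]
print(accuracy)  # 5 of 6 correct
```

Here accuracy is about 0.83 even though class 0 is recognized only half the time, which is the kind of gap a single accuracy figure does not show.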

