<a href="https://colab.research.google.com/github/deerking0923/Sentiment-Classifier/blob/main/sentiment-analysis/notebooks/finetune_kobert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install transformers datasets evaluate torch streamlit

Collecting datasets
  Downloading datasets-3.4.1-py3-none-any.whl.metadata (19 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting streamlit
  Downloading streamlit-1.44.0-py3-none-any.whl.metadata (8.9 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading

In [6]:
# Kaggle API setup: the CLI looks for kaggle.json in this directory
import os
os.environ["KAGGLE_CONFIG_DIR"] = "/content"

# Download and unzip the dataset
!kaggle datasets download -d soohyun/naver-movie-review-dataset -p /content/data
!unzip -o /content/data/naver-movie-review-dataset.zip -d /content/data


Dataset URL: https://www.kaggle.com/datasets/soohyun/naver-movie-review-dataset
License(s): unknown
Archive:  /content/data/naver-movie-review-dataset.zip
  inflating: /content/data/ratings_test.txt  
  inflating: /content/data/ratings_train.txt  
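
The download cell above only succeeds if the Kaggle CLI can find an API token. A minimal sketch of a pre-flight check, assuming the token file uses the CLI's default name `kaggle.json` and sits in the directory named by `KAGGLE_CONFIG_DIR`. The helper name `kaggle_token_present` is hypothetical and not part of the notebook.

```python
import os

def kaggle_token_present(config_dir=None):
    """Return True if a kaggle.json token file exists in the config directory."""
    # Fall back to KAGGLE_CONFIG_DIR, then to the CLI's default location (~/.kaggle)
    if config_dir is None:
        config_dir = os.environ.get("KAGGLE_CONFIG_DIR", os.path.expanduser("~/.kaggle"))
    return os.path.isfile(os.path.join(config_dir, "kaggle.json"))

if not kaggle_token_present():
    print("kaggle.json not found; upload it before running the download cell")
```

Running this before the `!kaggle datasets download` line gives a clearer message than the CLI's own authentication error.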


In [7]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

# Load the data (tab-separated files with id, document, and label columns)
dataset = load_dataset(
    "csv",
    data_files={"train": "/content/data/ratings_train.txt", "test": "/content/data/ratings_test.txt"},
    sep="\t"
)

# Shrink the data by keeping the first N rows (e.g., 10,000 train, 2,000 test)
dataset["train"] = dataset["train"].select(range(10000))
dataset["test"] = dataset["test"].select(range(2000))

print("Train:", len(dataset["train"]), "Test:", len(dataset["test"]))
print("Sample:", dataset["train"][0])


Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Train: 10000 Test: 2000
Sample: {'id': 9976970, 'document': '아 더빙.. 진짜 짜증나네요 목소리', 'label': 0}
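
Keeping the first 10,000 rows with `select(range(...))` does not guarantee balanced classes, and some `document` fields may be empty. A rough sketch of a sanity check on plain Python rows follows. The helper `summarize_rows` and the sample rows are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def summarize_rows(rows):
    """Count labels and rows whose document is missing or blank."""
    label_counts = Counter(row["label"] for row in rows)
    missing = sum(1 for row in rows if not str(row["document"] or "").strip())
    return label_counts, missing

# Hypothetical rows in the same shape as the sample printed above
rows = [
    {"id": 1, "document": "아 더빙.. 진짜 짜증나네요 목소리", "label": 0},
    {"id": 2, "document": None, "label": 1},
    {"id": 3, "document": "재미있어요", "label": 1},
]
label_counts, missing = summarize_rows(rows)
print(dict(label_counts), "missing documents:", missing)
```

Because iterating over a `datasets.Dataset` yields one dict per row, the same helper could presumably be called as `summarize_rows(dataset["train"])`.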


In [12]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import evaluate

# Tokenizer & Model
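# The repo ships custom code, so pass trust_remote_code=True here too (avoids the interactive prompt)
tokenizer = AutoTokenizer.from_pretrained("monologg/kobert", trust_remote_code=True)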
model = AutoModelForSequenceClassification.from_pretrained(
    "monologg/kobert",
    num_labels=2,
    trust_remote_code=True
)

# Preprocessing
def preprocess(batch):
    texts = [str(x) for x in batch["document"]]
    return tokenizer(texts, truncation=True, padding="max_length", max_length=128)

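dataset = dataset.map(preprocess, batched=True)
# Trainer expects the target column to be named "labels"; the raw files call it "label"
dataset = dataset.rename_column("label", "labels")
dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])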

# Metric
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return accuracy.compute(predictions=preds, references=labels)

# TrainingArguments
training_args = TrainingArguments(
    output_dir="/content/outputs",
    eval_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    save_total_limit=1,
    report_to="none"
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics
)

# Train & Evaluate
trainer.train()
results = trainer.evaluate()

print(f"Final eval accuracy: {results['eval_accuracy']:.4f}")
print("Sample train row:", dataset["train"][0])


The repository for monologg/kobert contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/monologg/kobert.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at monologg/kobert and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

Epoch,Training Loss,Validation Loss,Accuracy
1,0.4477,0.338717,0.8595
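
The Accuracy column is what `compute_metrics` returns: the share of examples whose highest-scoring logit matches the label. A small NumPy sketch of the same arithmetic, using made-up logits rather than real model output (the function name `accuracy_from_logits` is hypothetical):

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Fraction of rows where the argmax over classes equals the label."""
    preds = np.asarray(logits).argmax(axis=-1)
    return float((preds == np.asarray(labels)).mean())

# Made-up logits for four reviews, two classes (0 = negative, 1 = positive)
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]]
labels = [0, 1, 0, 0]
print(accuracy_from_logits(logits, labels))  # 0.75
```

Three of the four predictions match here, hence 0.75.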


Final eval accuracy: 0.8595
Sample train row: {'labels': tensor(0), 'input_ids': tensor([   2, 3093, 1698, 6456,   54,   54, 4368, 4396, 7316, 5655, 5703, 2073,
           3,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,