<a href="https://colab.research.google.com/github/ash-hun/Sentiment_Analysis/blob/main/BERT_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Requirements**

1. Train a binary classification model that performs sentiment analysis on the public IMDb movie review dataset, and evaluate its performance.

- Load the IMDb dataset from the HuggingFace Hub with the 'datasets' module (dataset = load_dataset("imdb"))
  [See the IMDb description on the HuggingFace Hub] https://huggingface.co/datasets/imdb
- Use the PyTorch and HuggingFace Transformers deep learning frameworks
- Fine-tune a pre-trained Transformer-based model from the BERT family
  (download the pre-trained model from HuggingFace and state the reason for choosing it)
- Present classification evaluation metrics and evaluate the model on each one
- Use Google Colab (Jupyter Notebook, GPU)

2. Briefly present ideas for improving model performance, along with future work.

## **Installing Required Packages**

In [1]:
!pip install transformers[torch]
!pip install datasets
!pip install scikit-learn
!pip install accelerate -U
!pip install torch -U

Collecting accelerate>=0.21.0 (from transformers[torch])
  Downloading accelerate-0.28.0-py3-none-any.whl (290 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->transformers[torch])
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch->transformers[torch])
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch->transformers[torch])
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)

In [2]:
# Load the dataset and check for a GPU
import torch
from datasets import load_dataset

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"My Device is {device}")

imdb = load_dataset('imdb')
imdb  # Inspect the dataset splits

My Device is cuda


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [3]:
from transformers import AutoTokenizer

MODEL_NAME = 'distilbert-base-uncased'  # DistilBERT is chosen because it is lighter and faster to train than standard BERT
DIR_NAME = "IMDB_Result"  # Output directory name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(target):
    return tokenizer(target['text'], padding='max_length', truncation=True)

tokenized_imdb = imdb.map(tokenize, batched=True)

In [4]:
import numpy as np
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc, 'f1': f1, 'precision': precision, 'recall': recall}

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

training_args = TrainingArguments(
    output_dir=DIR_NAME,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
    no_cuda=not torch.cuda.is_available(),
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
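
The tokenize step pads every review to the model's maximum length (`padding='max_length'`), so the `DataCollatorWithPadding` created in this cell has nothing left to pad. The difference can be sketched in plain Python. The token counts below are made up for illustration and are not measurements from this notebook; 512 is the maximum input length of `distilbert-base-uncased`, and `padded_tokens` is a hypothetical helper.

```python
# Hypothetical token counts for one batch of four reviews (illustrative only).
lengths = [120, 87, 301, 45]
MAX_LENGTH = 512  # maximum input length of distilbert-base-uncased


def padded_tokens(lengths, target):
    """Total tokens in the batch once every sequence is brought to `target` tokens."""
    return sum(target for _ in lengths)


# padding='max_length': every sequence becomes MAX_LENGTH tokens long.
static_total = padded_tokens(lengths, MAX_LENGTH)

# Dynamic padding: pad only up to the longest sequence in the batch.
dynamic_total = padded_tokens(lengths, min(max(lengths), MAX_LENGTH))

print(static_total, dynamic_total)
```

Dropping `padding='max_length'` from `tokenize` while keeping `truncation=True` would let the collator pad each batch dynamically, which usually reduces the number of tokens processed per step.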


In [5]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
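
`TrainingArguments` leaves the learning-rate schedule at its default, which in Transformers is a linear decay with no warmup. The following is a rough sketch of the learning rate this implies, assuming a single GPU so that 25,000 training reviews at batch size 16 give 1,563 optimizer steps per epoch. `linear_lr` is a hypothetical helper written for this illustration, not a library function.

```python
import math

BASE_LR = 2e-5       # learning_rate from TrainingArguments
TRAIN_SIZE = 25_000  # rows in the IMDb train split
BATCH_SIZE = 16      # per_device_train_batch_size (single device assumed)
EPOCHS = 2           # num_train_epochs

steps_per_epoch = math.ceil(TRAIN_SIZE / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def linear_lr(step, base_lr=BASE_LR, total=total_steps):
    """Learning rate after `step` updates under linear decay with zero warmup."""
    return base_lr * max(0.0, (total - step) / total)


# Learning rate at the start, after one epoch, and at the end of training.
for step in (0, steps_per_epoch, total_steps):
    print(step, linear_lr(step))
```

Under these assumptions the rate has halved by the end of the first epoch and reaches zero at the last step.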


## **Training**

In [6]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.2251 | 0.210986 | 0.91984 | 0.916888 | 0.951946 | 0.88432 |
| 2 | 0.1454 | 0.233229 | 0.93168 | 0.931762 | 0.930646 | 0.93288 |


TrainOutput(global_step=3126, training_loss=0.20687381319715972, metrics={'train_runtime': 3229.8916, 'train_samples_per_second': 15.48, 'train_steps_per_second': 0.968, 'total_flos': 6623369932800000.0, 'train_loss': 0.20687381319715972, 'epoch': 2.0})
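
With `load_best_model_at_end=True` and no `metric_for_best_model`, `Trainer` ranks checkpoints by validation loss, so the checkpoint it reloads can differ from the one with the best F1. The sketch below replays that choice on the two epochs logged above. The `history` rows are copied from the training log, and the checkpoint names assume 1,563 steps per epoch (consistent with `global_step=3126`).

```python
# Per-epoch validation metrics copied from the training log above.
history = [
    {"epoch": 1, "eval_loss": 0.210986, "eval_f1": 0.916888},
    {"epoch": 2, "eval_loss": 0.233229, "eval_f1": 0.931762},
]
STEPS_PER_EPOCH = 1563  # assumption: ceil(25000 / 16) on a single device

# Lower loss is better; higher F1 is better.
best_by_loss = min(history, key=lambda row: row["eval_loss"])
best_by_f1 = max(history, key=lambda row: row["eval_f1"])


def checkpoint_name(row):
    """Trainer names checkpoint folders after the global step they were saved at."""
    return f"checkpoint-{row['epoch'] * STEPS_PER_EPOCH}"


print("lowest loss:", checkpoint_name(best_by_loss))
print("highest F1: ", checkpoint_name(best_by_f1))
```

This is consistent with `trainer.evaluate()` in the next section returning the epoch-1 numbers. Setting `metric_for_best_model="f1"` in `TrainingArguments` would make the reload follow F1 instead of loss.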

## **Evaluation**

In [7]:
trainer.evaluate()

{'eval_loss': 0.2109861671924591,
 'eval_accuracy': 0.91984,
 'eval_f1': 0.9168878566688786,
 'eval_precision': 0.9519462624870824,
 'eval_recall': 0.88432,
 'eval_runtime': 428.8858,
 'eval_samples_per_second': 58.291,
 'eval_steps_per_second': 3.644,
 'epoch': 2.0}
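
The reported precision and recall are enough to recover the underlying confusion matrix, which makes the trade-off between the two concrete. This is a back-of-the-envelope sketch that assumes the IMDb test split holds 12,500 positive and 12,500 negative reviews; the counts are derived here, not printed by the notebook.

```python
# Assumption: the 25,000-review test split is evenly balanced.
n_pos = n_neg = 12_500

# Values reported by trainer.evaluate().
recall = 0.88432
precision = 0.9519462624870824

# recall = TP / (TP + FN), and TP + FN is the number of positive reviews.
tp = round(recall * n_pos)
fn = n_pos - tp

# precision = TP / (TP + FP), so TP + FP = TP / precision.
fp = round(tp / precision) - tp
tn = n_neg - fp

# Recompute accuracy and F1 from the recovered counts as a consistency check.
accuracy = (tp + tn) / (n_pos + n_neg)
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.5f} f1={f1:.6f}")
```

The recomputed accuracy and F1 agree with `eval_accuracy` and `eval_f1`, which supports the balanced-split assumption. Under it, the evaluated checkpoint misses more positive reviews than it mislabels negative ones, so its errors lean toward predicting Negative.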

## **Inference**

In [8]:
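def inference(model_path, text):
    # Truncate so that long reviews still fit the model's maximum input length
    input_text = tokenizer(text, return_tensors="pt", truncation=True)
    # Load the fine-tuned checkpoint (reloaded on every call; cache it for repeated use)
    model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2)

    with torch.no_grad():
        logits = model(**input_text).logits
        label = logits.argmax(dim=-1).item()

    # IMDb labels: 0 = negative, 1 = positive
    return "Positive" if label == 1 else "Negative"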

In [9]:
inference('./IMDB_Result/checkpoint-3126/', "That's so Funny!!")

'Positive'

In [10]:
inference('./IMDB_Result/checkpoint-3126/', "It's so dirty..")

'Negative'
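
`inference` returns only the arg-max label, which hides how confident the model is. Below is a minimal sketch of turning a pair of logits into probabilities with softmax, written in plain Python. The logit values are invented for illustration and are not outputs of the fine-tuned model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


LABELS = ["Negative", "Positive"]  # IMDb convention: 0 = negative, 1 = positive

example_logits = [-1.3, 2.1]  # hypothetical values, not real model output
probs = softmax(example_logits)

# Index of the most probable class, plus its probability as a confidence score.
best = max(range(len(probs)), key=probs.__getitem__)
print(LABELS[best], round(probs[best], 3))
```

With PyTorch tensors the same step is `torch.softmax(logits, dim=-1)`, so the function could return the label together with its probability.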

## **Future Work**

The following approaches could improve the model's performance:

- Obtain a higher-quality dataset (more data, balanced Pos/Neg labels)
- Use a stronger pre-trained model
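
Label balance, mentioned in the first item, is cheap to check before collecting more data. This is a small sketch using `collections.Counter`. The `labels` list is a toy stand-in assumed for illustration; in this notebook it would be something like `imdb['train']['label']`.

```python
from collections import Counter

# Toy stand-in for a label column (0 = negative, 1 = positive).
labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]

counts = Counter(labels)
total = sum(counts.values())

# Share of each class in the dataset.
shares = {label: count / total for label, count in sorted(counts.items())}

# Ratio of the rarest class to the most common one; 1.0 means perfectly balanced.
balance_ratio = min(counts.values()) / max(counts.values())

print(shares, round(balance_ratio, 2))
```

A ratio well below 1.0 would suggest rebalancing or class weighting before training.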