<a href="https://colab.research.google.com/github/fukagai-takuya/gifu-ai/blob/main/DistilBERT_SST_2_Fine_Tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 1. Mount Google Drive to save the fine-tuned DistilBERT model and related files.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### 2. Disable WandB.
- Skip this cell if you want to use WandB. In that case, you will need to set your WandB API key when running the fine-tuning cell later.

In [None]:
import os
os.environ["WANDB_MODE"] = "disabled"
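
If you switch between runs with and without WandB, the choice could be driven by whether an API key is present. This is a minimal sketch, assuming the key is supplied through the WANDB_API_KEY environment variable; configure_wandb is a hypothetical helper and not part of this notebook.

```python
import os
from typing import MutableMapping


def configure_wandb(env: MutableMapping[str, str]) -> str:
    """Disable WandB unless an API key is present in `env` (hypothetical helper)."""
    if env.get("WANDB_API_KEY"):
        # A key is available, so remove any earlier "disabled" setting.
        env.pop("WANDB_MODE", None)
        return "enabled"
    # No key: keep WandB from prompting for one during training.
    env["WANDB_MODE"] = "disabled"
    return "disabled"


# In the notebook this would be called as: configure_wandb(os.environ)
example_env = {}
print(configure_wandb(example_env), example_env)
```

Passing the mapping in explicitly keeps the helper easy to try on a plain dict before applying it to os.environ.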

### 3. Load the SST-2 dataset.
- SST-2 is a dataset of sentences from movie reviews, each labeled as positive or negative.
- The train and validation splits are loaded with pandas.
- Every label in the test split is -1, so this notebook divides the train split into training data and test data instead.

In [None]:
import pandas as pd

splits = {'train': 'data/train-00000-of-00001.parquet', 'validation': 'data/validation-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
df_train = pd.read_parquet("hf://datasets/stanfordnlp/sst2/" + splits["train"])
df_validation = pd.read_parquet("hf://datasets/stanfordnlp/sst2/" + splits["validation"])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
print(df_train)

         idx                                           sentence  label
0          0       hide new secretions from the parental units       0
1          1               contains no wit , only labored gags       0
2          2  that loves its characters and communicates som...      1
3          3  remains utterly satisfied to remain the same t...      0
4          4  on the worst revenge-of-the-nerds clichés the ...      0
...      ...                                                ...    ...
67344  67344                               a delightful comedy       1
67345  67345                   anguish , anger and frustration       0
67346  67346  at achieving the modest , crowd-pleasing goals...      1
67347  67347                                  a patient viewer       1
67348  67348  this new jangle of noise , mayhem and stupidit...      0

[67349 rows x 3 columns]


In [None]:
print(df_validation)

     idx                                           sentence  label
0      0    it 's a charming and often affecting journey .       1
1      1                 unflinchingly bleak and desperate       0
2      2  allows us to hope that nolan is poised to emba...      1
3      3  the acting , costumes , music , cinematography...      1
4      4                  it 's slow -- very , very slow .       0
..   ...                                                ...    ...
867  867              has all the depth of a wading pool .       0
868  868              a movie with a real anarchic flair .       1
869  869  a subject like this should inspire reaction in...      0
870  870  ... is an arthritic attempt at directing by ca...      0
871  871  looking aristocratic , luminous yet careworn i...      1

[872 rows x 3 columns]


### 4. Split the train split into training data and test data.
- In the code cell below, the test data is 1% of the original train split (test_size=0.01), which is roughly 1/100 the size of the training data.

In [None]:
from sklearn.model_selection import train_test_split

df_splitted_train, df_splitted_test = train_test_split(df_train, test_size=0.01)

In [None]:
print(df_splitted_train)

         idx                                           sentence  label
13894  13894                         certainly clever in spots       1
41804  41804                          none of the happily-ever       0
11667  11667  is a monumental achievement in practically eve...      0
36365  36365                                 the funniest film       1
43855  43855  serves as a painful elegy and sobering caution...      1
...      ...                                                ...    ...
12362  12362                                         terrorist       0
44105  44105                                          astounds       1
27403  27403           succeed merrily at their noble endeavor       1
36686  36686                          with a zippy jazzy score       1
51194  51194  a sweet , laugh-a-minute crowd pleaser that li...      1

[66675 rows x 3 columns]


In [None]:
print(df_splitted_test)

         idx                                           sentence  label
44247  44247  is that i truly enjoyed most of mostly martha ...      1
57135  57135  rife with nutty cliches and far too much dialo...      0
20836  20836  satisfy the boom-bam crowd without a huge sacr...      1
50268  50268                           devoid of wit and humor       0
24155  24155  , to the adrenaline jolt of a sudden lunch rus...      1
...      ...                                                ...    ...
26892  26892                     dialogue , 30 seconds of plot       0
56759  56759  an engrossing portrait of uncompromising artis...      1
55062  55062                not a trace of humanity or empathy       0
11289  11289  a sense of light-heartedness , that makes it a...      1
40090  40090  we 've seen the hippie-turned-yuppie plot befo...      0

[674 rows x 3 columns]
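
The row counts printed above can be reproduced by hand. This is a small sketch, assuming the test size is rounded up to a whole number of rows and the remainder goes to training, which matches the sizes shown in this notebook; split_sizes is a hypothetical helper.

```python
import math
from typing import Tuple


def split_sizes(n_samples: int, test_size: float) -> Tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size (hypothetical helper)."""
    # Assumption: the test count is rounded up, the rest is used for training.
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 67349 is the number of rows in the original train split.
n_train, n_test = split_sizes(67349, 0.01)
print(n_train, n_test)  # 66675 674
```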


### 5. Download the pretrained DistilBERT model and fine-tune it to classify SST-2 sentences.
- The training data obtained by splitting the train split is used for training.
- The validation split is used to evaluate whether classification improves on sentences that were not used for training.
- SST-2 sentences appeared to be no longer than 128 tokens, so max_length is set to 128 (a later check showed that the longest sentence has 66 tokens).
- Sentences shorter than 128 tokens are padded.
- The training learning rate is set to 1e-6.
  - I also tried values an order of magnitude larger, such as 2e-5 and 1e-5, but 1e-6 gave a higher accuracy on the test data.
- Besides the learning rate, I experimented with other TrainingArguments and Trainer parameters in the code below.
  - The accuracy on the test data reached 0.8947; further tuning might raise it a little.
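
To see how the learning rate of 1e-6 interacts with warmup_steps=500, the schedule can be worked out by hand. The sketch below assumes the Trainer's default linear schedule with warmup, a single device, and no gradient accumulation. linear_schedule_lr is a hypothetical helper that mirrors that schedule, not a library function.

```python
import math


def linear_schedule_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0 (hypothetical helper)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Values from this notebook: 66675 training rows, batch size 16, 3 epochs.
steps_per_epoch = math.ceil(66675 / 16)
total_steps = 3 * steps_per_epoch
print(steps_per_epoch, total_steps)  # 4168 12504

for step in (0, 250, 500, total_steps):
    print(step, linear_schedule_lr(step, 1e-6, 500, total_steps))
```

Under these assumptions the warmup covers about 4% of the 12504 optimizer steps, so the peak rate of 1e-6 is reached early in the first epoch.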

In [None]:
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast, Trainer, TrainingArguments
from datasets import Dataset

# Load tokenizer and model
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Convert from pandas to Dataset
train_dataset = Dataset.from_pandas(df_splitted_train)
validation_dataset = Dataset.from_pandas(df_validation)

# Tokenize
def tokenize(batch):
    return tokenizer(batch["sentence"], padding="max_length", truncation=True, max_length=128)

# Convert datasets to tokenized format
tokenized_train_dataset = train_dataset.map(tokenize, batched=True)
tokenized_validation_dataset = validation_dataset.map(tokenize, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./distilbert-sst2",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=1e-6,
    weight_decay=0.01,
    eval_strategy="epoch",
    logging_strategy="steps",
    logging_steps=100,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    warmup_steps=500,
    gradient_accumulation_steps=1,
    fp16=True,  # mixed precision; requires a CUDA GPU (set to False when running on CPU)
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_validation_dataset,
    processing_class=tokenizer,
)

# Train
trainer.train()

# Save the model and tokenizer
trainer.save_model('./fine_tuned_model_lr_1e_minus6_batch16')

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Epoch,Training Loss,Validation Loss
1,0.2935,0.30615
2,0.2128,0.284342
3,0.2305,0.284407
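
With load_best_model_at_end=True and metric_for_best_model="eval_loss", the checkpoint with the lowest validation loss is the one kept at the end of training. The sketch below applies that rule to the losses in the table above. It is only the selection arithmetic, not the Trainer's internal code.

```python
# Validation loss per epoch, copied from the training output above.
eval_losses = {1: 0.30615, 2: 0.284342, 3: 0.284407}

# For a loss metric, lower is better, so the best epoch minimizes the loss.
best_epoch = min(eval_losses, key=eval_losses.get)
print(best_epoch, eval_losses[best_epoch])  # 2 0.284342
```

Epoch 3 is marginally worse than epoch 2 on the validation split, so under this rule the saved model would correspond to the epoch 2 checkpoint.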


### 6. Save the fine-tuned SST-2 classification model to Google Drive.

In [None]:
%%bash
mkdir -p /content/drive/MyDrive/Gifu-AI-2025-06-22/2025-05-26-v3
cp -r /content/fine_tuned_model_lr_1e_minus6_batch16/ /content/drive/MyDrive/Gifu-AI-2025-06-22/2025-05-26-v3

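### 7. Classify the test data with the model fine-tuned in step 5 and check the accuracy.
- Because the split and the training involve randomness, re-running the steps up to step 5 will likely give a different accuracy; in my run it was 0.8947.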
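
The cell below classifies one sentence per forward pass. If that becomes slow, the sentences could be grouped into batches first. This is a minimal sketch of the grouping step only: batched is a hypothetical helper, and the batch size of 64 is borrowed from per_device_eval_batch_size as an assumption.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# 674 placeholder sentences stand in for the test data.
sentences = [f"sentence {i}" for i in range(674)]
batches = list(batched(sentences, 64))
print(len(batches), len(batches[-1]))  # 11 34
```

Each batch could then be passed to the tokenizer as a list of sentences, with argmax taken over the last dimension of the logits to get one prediction per sentence.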

In [None]:
from datasets import Dataset
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch
from sklearn.metrics import accuracy_score

# Load the fine-tuned model and the tokenizer saved with it
fine_tuned_tokenizer = DistilBertTokenizer.from_pretrained("./fine_tuned_model_lr_1e_minus6_batch16")
fine_tuned_model = DistilBertForSequenceClassification.from_pretrained("./fine_tuned_model_lr_1e_minus6_batch16")
fine_tuned_model.eval()  # switch off dropout for inference

# Convert from pandas to Dataset
test_dataset = Dataset.from_pandas(df_splitted_test)

# Predict one sentence at a time
preds, labels = [], []
for example in test_dataset:
    # Tokenize with the fine-tuned model's own tokenizer
    inputs = fine_tuned_tokenizer(example["sentence"], return_tensors="pt", padding="max_length", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = fine_tuned_model(**inputs)
    logits = outputs.logits
    # The predicted class is the index of the largest logit
    prediction = torch.argmax(logits, dim=-1).item()
    preds.append(prediction)
    labels.append(example["label"])

# Compute accuracy on the held-out test data
acc = accuracy_score(labels, preds)
print(f"Test Accuracy: {acc:.4f}")


Test Accuracy: 0.8947
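
The reported accuracy is the fraction of test sentences whose predicted label equals the true label. The sketch below computes it by hand; accuracy is a hypothetical helper, and the count of 603 correct predictions is inferred from the reported 0.8947 and the 674 test rows rather than taken from the notebook output.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Fraction of positions where the prediction equals the label."""
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    if not labels:
        raise ValueError("labels must not be empty")
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return correct / len(labels)


# Toy example: three of four predictions match.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75

# Inferred: 603 correct out of 674 rounds to the reported value.
print(f"{603 / 674:.4f}")  # 0.8947
```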


### 8. I wanted to confirm that no sentence exceeds 128 tokens, so I checked with the code below.
- The maximum token count was 66, which is within 128.

In [None]:
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
from datasets import Dataset

# Load tokenizer and model
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Convert from pandas to Dataset
train_check_dataset = Dataset.from_pandas(df_train)
validation_check_dataset = Dataset.from_pandas(df_validation)

# Tokenize
def tokenize_without_truncation(batch):
    return tokenizer(batch["sentence"], truncation=False)

# Convert datasets to tokenized format
tokenized_train_dataset_without_truncation = train_check_dataset.map(tokenize_without_truncation, batched=True)
tokenized_validation_dataset_without_truncation = validation_check_dataset.map(tokenize_without_truncation, batched=True)

max_len = max(len(example["input_ids"]) for example in tokenized_train_dataset_without_truncation)
print(f"Maximum token length (train): {max_len}")

max_len = max(len(example["input_ids"]) for example in tokenized_validation_dataset_without_truncation)
print(f"Maximum token length (validation): {max_len}")



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Maximum token length (train): 66
Maximum token length (validation): 55
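
With a maximum of 66 tokens, padding every sentence to max_length=128 means that at least 62 positions per sentence are padding. The sketch below shows what padding to a fixed length does to the token ids and the attention mask. pad_to_max_length is a hypothetical helper, the token ids are placeholders, pad_token_id=0 is an assumption, and the truncation is simplified compared with a real tokenizer.

```python
from typing import List, Sequence, Tuple


def pad_to_max_length(
    input_ids: Sequence[int], max_length: int = 128, pad_token_id: int = 0
) -> Tuple[List[int], List[int]]:
    """Pad (or cut) `input_ids` to `max_length` and build the attention mask."""
    ids = list(input_ids[:max_length])  # simplified truncation
    attention_mask = [1] * len(ids)     # 1 marks a real token
    n_pad = max_length - len(ids)
    # 0 in the mask marks padding positions the model should ignore.
    return ids + [pad_token_id] * n_pad, attention_mask + [0] * n_pad


# 66 placeholder ids stand in for the longest training sentence.
longest = list(range(1, 67))
ids, mask = pad_to_max_length(longest)
print(len(ids), sum(mask), mask.count(0))  # 128 66 62
```

Since no sentence is longer than 66 tokens, a smaller max_length or padding each batch only to its longest sentence might reduce wasted computation, though this notebook did not test either option.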


In [None]:
print(test_dataset)

Dataset({
    features: ['idx', 'sentence', 'label', '__index_level_0__'],
    num_rows: 674
})


In [None]:
print(validation_dataset)

Dataset({
    features: ['idx', 'sentence', 'label'],
    num_rows: 872
})


In [None]:
print(train_dataset)

Dataset({
    features: ['idx', 'sentence', 'label', '__index_level_0__'],
    num_rows: 66675
})


In [None]:
print(preds)

[1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 

In [None]:
print(labels)

[1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 