<h1>Chapter 10: Creating Text Embedding Models</h1>
<i>Exploring how to train and fine-tune embedding models</i>

<a href="https://github.com/rickiepark/handson-llm"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rickiepark/handson-llm/blob/main/chapter10.ipynb)

---

This notebook contains the code for Chapter 10 of the book [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961).

---

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961">
<img src="https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png" width="350"/></a>


### [Optional] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>


If you are running this notebook on Google Colab, run the following code cell to install the packages it needs.

---

💡 **NOTE**: A GPU is recommended for running the code in this notebook. In Google Colab, go to **Runtime > Change runtime type > Hardware accelerator > T4 GPU**.

---

In [None]:
%%capture
!pip install datasets mteb

## Creating an Embedding Model

### Data

In [None]:
from datasets import load_dataset

# Load the MNLI dataset from GLUE.
# 0 = entailment, 1 = neutral, 2 = contradiction
train_dataset = load_dataset("glue", "mnli", split="train").select(range(50_000))
train_dataset = train_dataset.remove_columns("idx")

In [None]:
train_dataset[2]

{'premise': 'One of our number will carry out your instructions minutely.',
 'hypothesis': 'A member of my team will execute your orders with immense precision.',
 'label': 0}

### Model

In [None]:
from sentence_transformers import SentenceTransformer

# Use a BERT base model.
embedding_model = SentenceTransformer('bert-base-uncased')



### Loss Function

In [None]:
from sentence_transformers import losses

# Define the loss function. Softmax loss requires the number of labels to be specified explicitly.
train_loss = losses.SoftmaxLoss(
    model=embedding_model,
    sentence_embedding_dimension=embedding_model.get_sentence_embedding_dimension(),
    num_labels=3
)
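
The cell above relies on `losses.SoftmaxLoss`. As a rough illustration of the idea behind it (the classification objective described in the SBERT paper), the two sentence embeddings `u` and `v` are concatenated with their element-wise absolute difference and passed through a linear layer with one output per label, which is trained with cross-entropy. The sketch below is plain Python with made-up toy vectors and weights; `softmax_loss_sketch` and every number in it are hypothetical and are not the library's implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_loss_sketch(u, v, weights, label):
    """Cross-entropy over a linear layer applied to (u, v, |u - v|).

    weights holds one row per label, each of length 3 * len(u) (toy values).
    """
    features = u + v + [abs(a - b) for a, b in zip(u, v)]
    logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
    probs = softmax(logits)
    return -math.log(probs[label]), probs

# Toy 2-dimensional "embeddings" and a 3 x 6 weight matrix (illustrative only).
u = [0.5, -1.0]
v = [0.4, -0.8]
weights = [
    [0.1, 0.0, 0.1, 0.0, -1.0, -1.0],  # row for label 0 (entailment)
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],    # row for label 1 (neutral)
    [0.0, 0.1, 0.0, 0.1, 1.0, 1.0],    # row for label 2 (contradiction)
]
loss, probs = softmax_loss_sketch(u, v, weights, label=0)
print(loss, probs)
```

Because the classifier sits on top of the embeddings, it is only needed during training; at inference time the embeddings themselves are compared.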

### Evaluation

In [None]:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Create an embedding similarity evaluator for STSB.
val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score/5 for score in val_sts["label"]],
    main_similarity="cosine",
    similarity_fn_names=["cosine", "euclidean", "manhattan", "dot"]
)
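
The evaluator reports Pearson and Spearman correlations between the similarity of each sentence pair's embeddings and the gold score (keys such as `pearson_cosine` and `spearman_cosine` in the results). A minimal sketch of that computation for the cosine case, using only the standard library and hypothetical toy embeddings, might look like this; the helper names are illustrative and not part of `sentence_transformers`.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical embeddings for three sentence pairs and gold scores scaled to [0, 1].
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
]
gold = [0.9, 0.5, 0.1]
predicted = [cosine(a, b) for a, b in pairs]
print(pearson(predicted, gold), spearman(predicted, gold))
```

Spearman only looks at the ordering of the pairs, which is why it is often preferred when the absolute scale of the similarities is not meaningful.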

### Training

In [None]:
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

# Define the training arguments.
args = SentenceTransformerTrainingArguments(
    output_dir="base_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
    report_to=[]
)

In [None]:
from sentence_transformers.trainer import SentenceTransformerTrainer

# Train the embedding model.
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator
)
trainer.train()

Step,Training Loss
100,1.0831
200,0.9522
300,0.8949
400,0.8458
500,0.815
600,0.8324
700,0.8094
800,0.7955
900,0.7727
1000,0.7659


TrainOutput(global_step=1563, training_loss=0.8154379612195972, metrics={'train_runtime': 334.6179, 'train_samples_per_second': 149.424, 'train_steps_per_second': 4.671, 'total_flos': 0.0, 'train_loss': 0.8154379612195972, 'epoch': 1.0})

In [None]:
# Evaluate the trained model.
evaluator(embedding_model)

{'pearson_cosine': 0.5145110035729868,
 'spearman_cosine': 0.5820019817169536,
 'pearson_euclidean': 0.5478269569036163,
 'spearman_euclidean': 0.5769576970970489,
 'pearson_manhattan': 0.5592396718260839,
 'spearman_manhattan': 0.5824084143817787,
 'pearson_dot': 0.47808957940087127,
 'spearman_dot': 0.5126266313491807,
 'pearson_max': 0.5592396718260839,
 'spearman_max': 0.5824084143817787}

### MTEB

In [None]:
from mteb import MTEB

# Select the evaluation task.
evaluation = MTEB(tasks=["Banking77Classification"])

# Compute the results.
results = evaluation.run(embedding_model)
results



[TaskResult(task_name=Banking77Classification, scores=...)]

⚠️ **Clear VRAM** - Use the following code to clear VRAM (GPU RAM). If it is not released, you need to restart the notebook. In Colab, you can check whether VRAM usage has dropped in the Resources tab on the right, or run `!nvidia-smi` to see the current usage.

In [None]:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

### Loss Functions

#### Cosine Similarity Loss

In [None]:
from datasets import Dataset, load_dataset

# Load the MNLI dataset from GLUE.
# 0 = entailment, 1 = neutral, 2 = contradiction
train_dataset = load_dataset("glue", "mnli", split="train").select(range(50_000))
train_dataset = train_dataset.remove_columns("idx")

# (neutral/contradiction) = 0, (entailment) = 1
mapping = {2: 0, 1: 0, 0: 1}
train_dataset = Dataset.from_dict({
    "sentence1": train_dataset["premise"],
    "sentence2": train_dataset["hypothesis"],
    "label": [float(mapping[label]) for label in train_dataset["label"]]
})
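
The float labels produced above are the targets for `losses.CosineSimilarityLoss`, which is used in the training cell of this section. As a hedged sketch of the idea (assuming the mean-squared-error formulation), the loss compares the cosine similarity of the two sentence embeddings with the target label. The function name and the toy vectors below are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cosine_similarity_loss_sketch(pairs, labels):
    """Mean squared error between cosine(u, v) and the target label."""
    errors = [(cosine(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)

# Hypothetical 2-d embeddings: the first pair is identical, the second is orthogonal.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]

# Labels follow the mapping above: 1.0 = entailment, 0.0 = neutral/contradiction.
perfect = cosine_similarity_loss_sketch(pairs, [1.0, 0.0])
wrong = cosine_similarity_loss_sketch(pairs, [0.0, 1.0])
print(perfect, wrong)
```

The loss is zero when similar pairs point in the same direction and dissimilar pairs are orthogonal, and it grows as the similarities drift away from the labels.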

In [None]:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Create an embedding similarity evaluator for STSB.
val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score/5 for score in val_sts["label"]],
    main_similarity="cosine",
    similarity_fn_names=["cosine", "euclidean", "manhattan", "dot"]
)

In [None]:
from sentence_transformers import losses, SentenceTransformer
from sentence_transformers.trainer import SentenceTransformerTrainer
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

# Model
embedding_model = SentenceTransformer('bert-base-uncased')

# Loss function
train_loss = losses.CosineSimilarityLoss(model=embedding_model)

# Training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="cosineloss_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
    report_to=[]
)

# Train the model
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator
)
trainer.train()



Step,Training Loss
100,0.2325
200,0.1706
300,0.1722
400,0.16
500,0.1522
600,0.1582
700,0.1509
800,0.156
900,0.1477
1000,0.1461


TrainOutput(global_step=1563, training_loss=0.15731354211281295, metrics={'train_runtime': 340.1148, 'train_samples_per_second': 147.009, 'train_steps_per_second': 4.596, 'total_flos': 0.0, 'train_loss': 0.15731354211281295, 'epoch': 1.0})

In [None]:
# Evaluate the trained model.
evaluator(embedding_model)

{'pearson_cosine': 0.7275967886448113,
 'spearman_cosine': 0.7302487370485328,
 'pearson_euclidean': 0.7374555372418545,
 'spearman_euclidean': 0.7353715715447097,
 'pearson_manhattan': 0.7369253702119677,
 'spearman_manhattan': 0.7348856726328731,
 'pearson_dot': 0.6619838708371203,
 'spearman_dot': 0.6628641033570658,
 'pearson_max': 0.7374555372418545,
 'spearman_max': 0.7353715715447097}

⚠️ **Clear VRAM**

In [None]:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

#### Multiple Negatives Ranking (MNR) Loss

In [None]:
import random
from tqdm import tqdm
from datasets import Dataset, load_dataset

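# Load the MNLI dataset from GLUE and keep only the entailment pairs (label 0).
mnli = load_dataset("glue", "mnli", split="train").select(range(50_000))
mnli = mnli.remove_columns("idx")
mnli = mnli.filter(lambda x: x["label"] == 0)

# Prepare the data: each anchor/positive pair gets a randomly chosen hypothesis as a soft negative.
train_dataset = {"anchor": [], "positive": [], "negative": []}
soft_negatives = list(mnli["hypothesis"])  # plain list copy so it can be shuffled in place
random.shuffle(soft_negatives)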
for row, soft_negative in tqdm(zip(mnli, soft_negatives)):
    train_dataset["anchor"].append(row["premise"])
    train_dataset["positive"].append(row["hypothesis"])
    train_dataset["negative"].append(soft_negative)
train_dataset = Dataset.from_dict(train_dataset)
len(train_dataset)

16875it [00:01, 13136.32it/s]


16875
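
The triplets built above feed `losses.MultipleNegativesRankingLoss`. Roughly speaking, for every anchor the loss treats its own positive as the correct answer and every other positive and negative in the batch as a distractor, then applies cross-entropy to the scaled similarities. The sketch below illustrates that idea in plain Python; the function name, the toy vectors, and the scale of 20 are assumptions for illustration, not a copy of the library code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss_sketch(anchors, positives, negatives, scale=20.0):
    """Cross-entropy where anchor i must pick positive i among all candidates."""
    candidates = positives + negatives
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, c) for c in candidates]
        m = max(scores)
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_sum_exp - scores[i]  # negative log-probability of the true positive
    return total / len(anchors)

# Hypothetical 2-d embeddings for a batch of two triplets.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]

good = mnr_loss_sketch(anchors, positives, negatives)
bad = mnr_loss_sketch(anchors, positives[::-1], negatives)  # positives swapped
print(good, bad)
```

Since every other item in the batch acts as a negative, larger batch sizes generally make the task harder and the training signal stronger.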

In [None]:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Create an embedding similarity evaluator for STSB.
val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score/5 for score in val_sts["label"]],
    main_similarity="cosine",
    similarity_fn_names=["cosine", "euclidean", "manhattan", "dot"]
)

In [None]:
from sentence_transformers import losses, SentenceTransformer
from sentence_transformers.trainer import SentenceTransformerTrainer
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

# Model
embedding_model = SentenceTransformer('bert-base-uncased')

# Loss function
train_loss = losses.MultipleNegativesRankingLoss(model=embedding_model)

# Training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="mnrloss_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
    report_to=[]
)

# Train the model
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator
)
trainer.train()



Step,Training Loss
100,0.3293
200,0.1065
300,0.0766
400,0.0645
500,0.0688


TrainOutput(global_step=528, training_loss=0.1252814301035621, metrics={'train_runtime': 148.4509, 'train_samples_per_second': 113.674, 'train_steps_per_second': 3.557, 'total_flos': 0.0, 'train_loss': 0.1252814301035621, 'epoch': 1.0})

In [None]:
# Evaluate the trained model.
evaluator(embedding_model)

{'pearson_cosine': 0.8100485746036094,
 'spearman_cosine': 0.8117771634189315,
 'pearson_euclidean': 0.8239129125476379,
 'spearman_euclidean': 0.8181278119296956,
 'pearson_manhattan': 0.8237578162778533,
 'spearman_manhattan': 0.8181237905746074,
 'pearson_dot': 0.7422897545627833,
 'spearman_dot': 0.7290839266361437,
 'pearson_max': 0.8239129125476379,
 'spearman_max': 0.8181278119296956}

## Fine-Tuning

⚠️ **Clear VRAM**

In [None]:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

### Supervised

In [None]:
from datasets import load_dataset
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Load the MNLI dataset from GLUE.
# 0 = entailment, 1 = neutral, 2 = contradiction
train_dataset = load_dataset("glue", "mnli", split="train").select(range(50_000))
train_dataset = train_dataset.remove_columns("idx")

# Create an embedding similarity evaluator for STSB.
val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score/5 for score in val_sts["label"]],
    main_similarity="cosine",
    similarity_fn_names=["cosine", "euclidean", "manhattan", "dot"]
)

In [None]:
from sentence_transformers import losses, SentenceTransformer
from sentence_transformers.trainer import SentenceTransformerTrainer
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

# Model
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Loss function
train_loss = losses.MultipleNegativesRankingLoss(model=embedding_model)

# Training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="finetuned_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
    report_to=[]
)

# Train the model
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator
)
trainer.train()


Step,Training Loss
100,0.1573
200,0.1105
300,0.1199
400,0.1188
500,0.1083
600,0.1011
700,0.1196
800,0.0986
900,0.1041
1000,0.1052


TrainOutput(global_step=1563, training_loss=0.10938199757766967, metrics={'train_runtime': 105.4988, 'train_samples_per_second': 473.939, 'train_steps_per_second': 14.815, 'total_flos': 0.0, 'train_loss': 0.10938199757766967, 'epoch': 1.0})

In [None]:
# Evaluate the trained model.
evaluator(embedding_model)

{'pearson_cosine': 0.8495184624326722,
 'spearman_cosine': 0.8489051232050339,
 'pearson_euclidean': 0.8525644886383396,
 'spearman_euclidean': 0.8489051232050339,
 'pearson_manhattan': 0.8516683274910766,
 'spearman_manhattan': 0.8481842472627098,
 'pearson_dot': 0.8495184636437516,
 'spearman_dot': 0.8489051232050339,
 'pearson_max': 0.8525644886383396,
 'spearman_max': 0.8489051232050339}

In [None]:
# Evaluate the pretrained model.
original_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
evaluator(original_model)

{'pearson_cosine': 0.8696194518832261,
 'spearman_cosine': 0.8671631197908374,
 'pearson_euclidean': 0.8678715924178552,
 'spearman_euclidean': 0.8671631197908374,
 'pearson_manhattan': 0.8670399003909525,
 'spearman_manhattan': 0.8663946139224048,
 'pearson_dot': 0.8696194534675574,
 'spearman_dot': 0.8671631197908374,
 'pearson_max': 0.8696194534675574,
 'spearman_max': 0.8671631197908374}

⚠️ **Clear VRAM**

In [None]:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

### Augmented SBERT

**Step 1:** Fine-tune a cross-encoder.

In [None]:
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset, Dataset
from sentence_transformers import InputExample
from sentence_transformers.datasets import NoDuplicatesDataLoader

# Prepare a dataset of 10,000 sentence pairs for the cross-encoder.
dataset = load_dataset("glue", "mnli", split="train").select(range(10_000))
mapping = {2: 0, 1: 0, 0: 1}

# Data loader
gold_examples = [
    InputExample(texts=[row["premise"], row["hypothesis"]], label=mapping[row["label"]])
    for row in tqdm(dataset)
]
gold_dataloader = NoDuplicatesDataLoader(gold_examples, batch_size=32)

# Create a pandas DataFrame to make the data easier to handle.
gold = pd.DataFrame(
    {
    'sentence1': dataset['premise'],
    'sentence2': dataset['hypothesis'],
    'label': [mapping[label] for label in dataset['label']]
    }
)

100%|██████████| 10000/10000 [00:00<00:00, 25390.39it/s]


In [None]:
from sentence_transformers.cross_encoder import CrossEncoder

# Train the cross-encoder on the gold dataset.
cross_encoder = CrossEncoder('bert-base-uncased', num_labels=2)
cross_encoder.fit(
    train_dataloader=gold_dataloader,
    epochs=1,
    show_progress_bar=True,
    warmup_steps=100,
    use_amp=False
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Step 2:** Create new sentence pairs.

In [None]:
# Prepare the silver dataset, whose labels the cross-encoder will predict.
silver = load_dataset("glue", "mnli", split="train").select(range(10_000, 50_000))
pairs = list(zip(silver['premise'], silver['hypothesis']))

**Step 3:** Label the new sentence pairs (the silver dataset) with the fine-tuned cross-encoder.

In [None]:
import numpy as np

# Label the sentence pairs using the fine-tuned cross-encoder.
output = cross_encoder.predict(pairs, apply_softmax=True,
                               show_progress_bar=True)
silver = pd.DataFrame(
    {
        "sentence1": silver["premise"],
        "sentence2": silver["hypothesis"],
        "label": np.argmax(output, axis=1)
    }
)
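
In the cell above, `apply_softmax=True` turns the cross-encoder's two raw scores per pair into probabilities, and `np.argmax(output, axis=1)` keeps the more likely class as the silver label. The same two steps can be sketched without NumPy as follows; the logits are hypothetical values chosen for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three sentence pairs.
# Class 0 = neutral/contradiction, class 1 = entailment (same mapping as the gold data).
logits = [[2.0, -1.0], [0.2, 0.3], [-0.5, 1.5]]

probabilities = [softmax(row) for row in logits]
silver_labels = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
print(silver_labels)  # [0, 1, 1]
```

Note that the second pair is labeled 1 even though the two probabilities are nearly equal; filtering silver pairs by confidence is one possible refinement, though the notebook does not do this.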

**Step 4:** Train a bi-encoder (SBERT) on the extended dataset (gold + silver).

In [None]:
# Combine the gold and silver datasets.
data = pd.concat([gold, silver], ignore_index=True, axis=0)
data = data.drop_duplicates(subset=['sentence1', 'sentence2'], keep="first")
train_dataset = Dataset.from_pandas(data, preserve_index=False)
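
Because the gold rows come first in the concatenation and `drop_duplicates` is called with `keep="first"`, a sentence pair that appears in both sets keeps its gold (human) label rather than the predicted silver one. A small standard-library sketch of that behavior, with hypothetical rows and a hypothetical helper name, could look like this:

```python
def merge_keep_first(gold_rows, silver_rows):
    """Concatenate gold and silver rows, keeping the first row seen for each sentence pair."""
    seen = set()
    merged = []
    for row in gold_rows + silver_rows:
        key = (row["sentence1"], row["sentence2"])
        if key not in seen:
            seen.add(key)
            merged.append(row)
    return merged

# Hypothetical rows: the pair ("A", "B") appears in both sets with different labels.
gold_rows = [{"sentence1": "A", "sentence2": "B", "label": 1}]
silver_rows = [
    {"sentence1": "A", "sentence2": "B", "label": 0},
    {"sentence1": "C", "sentence2": "D", "label": 0},
]

merged = merge_keep_first(gold_rows, silver_rows)
print(merged)
```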

In [None]:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Create an embedding similarity evaluator for STSB.
val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score/5 for score in val_sts["label"]],
    main_similarity="cosine",
    similarity_fn_names=["cosine", "euclidean", "manhattan", "dot"]
)

In [None]:
from sentence_transformers import losses, SentenceTransformer
from sentence_transformers.trainer import SentenceTransformerTrainer
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

# Model
embedding_model = SentenceTransformer('bert-base-uncased')

# Loss function
train_loss = losses.CosineSimilarityLoss(model=embedding_model)

# Training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="augmented_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
    report_to=[]
)

# Train the model
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator
)
trainer.train()



Step,Training Loss
100,0.2145
200,0.1562
300,0.1402
400,0.1393
500,0.1394
600,0.1318
700,0.1316
800,0.1323
900,0.1305
1000,0.1284


TrainOutput(global_step=1563, training_loss=0.13815005071179956, metrics={'train_runtime': 348.9955, 'train_samples_per_second': 143.263, 'train_steps_per_second': 4.479, 'total_flos': 0.0, 'train_loss': 0.13815005071179956, 'epoch': 1.0})

In [None]:
# Evaluate the trained model.
evaluator(embedding_model)

{'pearson_cosine': 0.6990833135054513,
 'spearman_cosine': 0.70907615660081,
 'pearson_euclidean': 0.7230757772585417,
 'spearman_euclidean': 0.7199173544738955,
 'pearson_manhattan': 0.7230121993047374,
 'spearman_manhattan': 0.7195879503319793,
 'pearson_dot': 0.6543758257803958,
 'spearman_dot': 0.6543595796944088,
 'pearson_max': 0.7230757772585417,
 'spearman_max': 0.7199173544738955}

In [None]:
trainer.accelerator.clear()

[]

**Step 5:** Evaluate without the silver dataset.

In [None]:
# Use only the gold dataset.
data = pd.concat([gold], ignore_index=True, axis=0)
data = data.drop_duplicates(subset=['sentence1', 'sentence2'], keep="first")
train_dataset = Dataset.from_pandas(data, preserve_index=False)

# Model
embedding_model = SentenceTransformer('bert-base-uncased')

# Loss function
train_loss = losses.CosineSimilarityLoss(model=embedding_model)

# Training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="gold_only_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
    report_to=[]
)

# Train the model
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator
)
trainer.train()



Step,Training Loss
100,0.2268
200,0.1714
300,0.16


TrainOutput(global_step=313, training_loss=0.18524520465741143, metrics={'train_runtime': 68.8646, 'train_samples_per_second': 145.212, 'train_steps_per_second': 4.545, 'total_flos': 0.0, 'train_loss': 0.18524520465741143, 'epoch': 1.0})

In [None]:
# Evaluate the trained model.
evaluator(embedding_model)

{'pearson_cosine': 0.620910162485254,
 'spearman_cosine': 0.6476035145337555,
 'pearson_euclidean': 0.6507245363626306,
 'spearman_euclidean': 0.6601656406407354,
 'pearson_manhattan': 0.6525128583104902,
 'spearman_manhattan': 0.6615903177805038,
 'pearson_dot': 0.548425287456532,
 'spearman_dot': 0.5462672292980125,
 'pearson_max': 0.6525128583104902,
 'spearman_max': 0.6615903177805038}

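Compared with training on both the silver and gold datasets, training on the gold dataset alone lowers performance: `pearson_cosine` on the STSB validation set drops from about 0.70 to about 0.62.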

⚠️ **Clear VRAM**

In [None]:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

## Unsupervised Learning

### TSDAE

In [None]:
# Download the additional tokenizers.
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
from tqdm import tqdm
from datasets import Dataset, load_dataset
from sentence_transformers.datasets import DenoisingAutoEncoderDataset

# Build one flat list of sentences from the premises and hypotheses.
mnli = load_dataset("glue", "mnli", split="train").select(range(25_000))
flat_sentences = list(mnli["premise"]) + list(mnli["hypothesis"])

# Add noise to the input data.
damaged_data = DenoisingAutoEncoderDataset(list(set(flat_sentences)))

# Create the dataset.
train_dataset = {"damaged_sentence": [], "original_sentence": []}
for data in tqdm(damaged_data):
    train_dataset["damaged_sentence"].append(data.texts[0])
    train_dataset["original_sentence"].append(data.texts[1])
train_dataset = Dataset.from_dict(train_dataset)

100%|██████████| 48353/48353 [00:16<00:00, 2945.72it/s]


In [None]:
train_dataset[0]

{'damaged_sentence': 'Abbey is in that area',
 'original_sentence': 'The Abbey is the only religious site in that area.'}
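
The example above shows what the noise looks like: words are deleted from the original sentence, and the model later has to reconstruct the full sentence from the damaged one. A rough standard-library sketch of such deletion noise is shown below. It is an approximation under stated assumptions: it splits on whitespace instead of using the NLTK tokenizer, the deletion ratio of 0.6 is assumed, and `delete_words` is a hypothetical helper rather than the library's own function.

```python
import random

def delete_words(text, del_ratio=0.6, rng=None):
    """Randomly drop roughly del_ratio of the whitespace-separated words."""
    rng = rng or random.Random()
    words = text.split()
    if not words:
        return text
    kept = [w for w in words if rng.random() > del_ratio]
    if not kept:
        # Keep at least one word so the damaged sentence is never empty.
        kept = [rng.choice(words)]
    return " ".join(kept)

original = "The Abbey is the only religious site in that area."
damaged = delete_words(original, rng=random.Random(0))
print({"damaged_sentence": damaged, "original_sentence": original})
```

The surviving words keep their original order, so the damaged sentence remains a subsequence of the original.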

In [None]:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Create an embedding similarity evaluator for STSB.
val_sts = load_dataset('glue', 'stsb', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score/5 for score in val_sts["label"]],
    main_similarity="cosine",
    similarity_fn_names=["cosine", "euclidean", "manhattan", "dot"]
)

In [None]:
from sentence_transformers import models, SentenceTransformer

# Create the embedding model.
word_embedding_model = models.Transformer('bert-base-uncased')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), 'cls')
embedding_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
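
The `'cls'` argument passed to `models.Pooling` means the sentence embedding is taken from the first token's output (the `[CLS]` position) instead of averaging all token embeddings. The difference can be sketched in plain Python with hypothetical token vectors; the two helper functions are illustrative only.

```python
def cls_pooling(token_embeddings):
    """Use the first token's vector (the [CLS] position) as the sentence embedding."""
    return list(token_embeddings[0])

def mean_pooling(token_embeddings, attention_mask):
    """Average the vectors of real tokens, skipping padding (mask value 0)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Hypothetical 2-d token embeddings: [CLS], two word tokens, and one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]

print(cls_pooling(tokens))         # [1.0, 2.0]
print(mean_pooling(tokens, mask))  # [3.0, 4.0]
```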

In [None]:
from sentence_transformers import losses

# Denoising auto-encoder loss
train_loss = losses.DenoisingAutoEncoderLoss(
    embedding_model, tie_encoder_decoder=True
)
train_loss.decoder = train_loss.decoder.to("cuda")

Some weights of BertLMHeadModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert.encoder.layer.0.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.0.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.0.crossattention.output.dense.bias', 'bert.encoder.layer.0.crossattention.output.dense.weight', 'bert.encoder.layer.0.crossattention.self.key.bias', 'bert.encoder.layer.0.crossattention.self.key.weight', 'bert.encoder.layer.0.crossattention.self.query.bias', 'bert.encoder.layer.0.crossattention.self.query.weight', 'bert.encoder.layer.0.crossattention.self.value.bias', 'bert.encoder.layer.0.crossattention.self.value.weight', 'bert.encoder.layer.1.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.1.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.1.crossattention.output.dense.bias', 'bert.encoder.layer.1.crossattention.output.dense.weight', 'bert.encoder.layer.1.crossattention.self.key.bias', 'bert.e

In [None]:
from sentence_transformers.trainer import SentenceTransformerTrainer
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

# Training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="tsdae_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
    report_to=[]
)

# Train the model
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator
)
trainer.train()

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


Step,Training Loss
100,7.0921
200,4.9492
300,4.6317
400,4.5004
500,4.3804
600,4.282
700,4.1987
800,4.1827
900,4.0584
1000,4.0558


TrainOutput(global_step=3023, training_loss=4.040039362830119, metrics={'train_runtime': 944.9583, 'train_samples_per_second': 51.169, 'train_steps_per_second': 3.199, 'total_flos': 0.0, 'train_loss': 4.040039362830119, 'epoch': 1.0})

In [None]:
# Evaluate the trained model.
evaluator(embedding_model)

{'pearson_cosine': 0.7385436893075473,
 'spearman_cosine': 0.7458753581529155,
 'pearson_euclidean': 0.7377358822487955,
 'spearman_euclidean': 0.7409439490557472,
 'pearson_manhattan': 0.7380002368859765,
 'spearman_manhattan': 0.7413490525381604,
 'pearson_dot': 0.6471544260499531,
 'spearman_dot': 0.6445149946087406,
 'pearson_max': 0.7385436893075473,
 'spearman_max': 0.7458753581529155}