<a href="https://colab.research.google.com/github/ttogle918/NLU_3-/blob/main/%EC%9D%B4%EC%88%98%EC%B2%A0_sts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP

## KLUE-STS - Semantic Textual Similarity

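- The goal is to express the semantic equivalence of two input sentences as a numeric score.
- The similarity of each sentence pair is scored from 0 to 5. For evaluation, each pair is labeled as similar or not similar, using a score of 3.0 as the threshold.
- Evaluation metric: F1 score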
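
The thresholding and scoring described above can be sketched in plain Python. This is a minimal illustration, not the official KLUE evaluation script: the helper names `binarize` and `f1_score` are hypothetical, the scores are made-up toy values, and treating a score of exactly 3.0 as "similar" is an assumption.

```python
def binarize(scores, threshold=3.0):
    """Map 0-5 similarity scores to 1 (similar) or 0 (not similar).

    Assumption: a score equal to the threshold counts as similar.
    """
    return [1 if s >= threshold else 0 for s in scores]


def f1_score(y_true, y_pred):
    """F1 of the positive (similar) class from binary label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy gold scores and toy model predictions (hypothetical values)
gold = binarize([3.7, 1.2, 4.5, 2.9])  # [1, 0, 1, 0]
pred = binarize([3.1, 3.4, 4.0, 0.5])  # [1, 1, 1, 0]
print(round(f1_score(gold, pred), 3))  # 0.8
```

Here precision is 2/3 and recall is 1, which gives F1 = 0.8.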

An example of 'train' from the `sts` configuration looks as follows.

```
{'guid': 'klue-sts-v1_train_00000',
 'labels': {'label': 3.7, 'real-label': 3.714285714285714, 'binary-label': 1},
 'sentence1': '숙소 위치는 찾기 쉽고 일반적인 한국의 반지하 숙소입니다.',
 'sentence2': '숙박시설의 위치는 쉽게 찾을 수 있고 한국의 대표적인 반지하 숙박시설입니다.',
 'source': 'airbnb-rtt'}

```



In [None]:
!pip install transformers
!pip install datasets

In [2]:
import os
import sys
import pandas as pd
import numpy as np 
import torch
import random

import logging
from datetime import datetime
from datasets import load_dataset

In [None]:
# seed (Python, NumPy, and PyTorch RNGs)
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# device type
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"# available GPUs : {torch.cuda.device_count()}")
    print(f"GPU name : {torch.cuda.get_device_name()}")
else:
    device = torch.device("cpu")
print(device)

In [5]:
dataset = load_dataset('klue', 'sts')

Downloading and preparing dataset klue/sts (download: 1.29 MiB, generated: 2.82 MiB, post-processed: Unknown size, total: 4.11 MiB) to /root/.cache/huggingface/datasets/klue/sts/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e...


Dataset klue downloaded and prepared to /root/.cache/huggingface/datasets/klue/sts/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e. Subsequent calls will reuse this data.



In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['guid', 'source', 'sentence1', 'sentence2', 'labels'],
        num_rows: 11668
    })
    validation: Dataset({
        features: ['guid', 'source', 'sentence1', 'sentence2', 'labels'],
        num_rows: 519
    })
})

In [7]:
dataset['train'][0]

{'guid': 'klue-sts-v1_train_00000',
 'labels': {'binary-label': 1, 'label': 3.7, 'real-label': 3.714285714285714},
 'sentence1': '숙소 위치는 찾기 쉽고 일반적인 한국의 반지하 숙소입니다.',
 'sentence2': '숙박시설의 위치는 쉽게 찾을 수 있고 한국의 대표적인 반지하 숙박시설입니다.',
 'source': 'airbnb-rtt'}
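
Because the three label variants are nested under `labels`, it can be convenient to flatten a record before further processing. The sketch below works on a copy of the record shown above; `flatten_labels` is a hypothetical helper, not part of the `datasets` library. It takes one example dict and returns one dict, which is the shape of function that `Dataset.map` operates on.

```python
# The first training record, copied from the output above
record = {
    'guid': 'klue-sts-v1_train_00000',
    'labels': {'binary-label': 1, 'label': 3.7, 'real-label': 3.714285714285714},
    'sentence1': '숙소 위치는 찾기 쉽고 일반적인 한국의 반지하 숙소입니다.',
    'sentence2': '숙박시설의 위치는 쉽게 찾을 수 있고 한국의 대표적인 반지하 숙박시설입니다.',
    'source': 'airbnb-rtt',
}


def flatten_labels(example):
    """Return a copy of the example with the nested label fields lifted to the top level."""
    flat = {key: value for key, value in example.items() if key != 'labels'}
    flat.update(example['labels'])
    return flat


flat = flatten_labels(record)
print(flat['label'], flat['binary-label'])  # 3.7 1
```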

In [8]:
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("klue/bert-base")
tokenizer = AutoTokenizer.from_pretrained('klue/bert-base')

Some weights of the model checkpoint at klue/bert-base were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
# Use the 'sentence1' and 'sentence2' columns, which hold the sentence pair to score, as model input

print(f"Sentence1: {dataset['train'][0]['sentence1']}\nSentence2: {dataset['train'][0]['sentence2']}")

Sentence1: 숙소 위치는 찾기 쉽고 일반적인 한국의 반지하 숙소입니다.
Sentence2: 숙박시설의 위치는 쉽게 찾을 수 있고 한국의 대표적인 반지하 숙박시설입니다.


In [None]:
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler, random_split

In [1]:
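# Custom Dataset

class CustomDataset(Dataset):

    def __init__(self, dataset, tokenizer, max_length=128):
        # keep both sentences and the 0-5 similarity score of every pair
        self.sentence1 = list(dataset['sentence1'])
        self.sentence2 = list(dataset['sentence2'])
        self.labels = [label['label'] for label in dataset['labels']]
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        # encode the pair as [CLS] sentence1 [SEP] sentence2 [SEP]
        token_ids = self.tokenizer.encode(
            text=self.sentence1[index],
            text_pair=self.sentence2[index],
            truncation=True,
            max_length=self.max_length,
        )

        # tensorize
        return torch.tensor(token_ids), torch.tensor([self.labels[index]])

In [None]:
# wrap the KLUE train split; a new name avoids shadowing the DatasetDict
train_data = CustomDataset(dataset['train'], tokenizer)

In [None]:
n_sample = len(train_data)
n_train = int(n_sample*0.9)
n_valid = n_sample - n_train  # remainder, so the two sizes always sum to n_sample

In [None]:
train_dataset, valid_dataset = random_split(train_data, [n_train, n_valid])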
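
Since each item holds a token id tensor of a different length, a `DataLoader` over `train_dataset` needs a collate function that pads every batch. The following is a minimal sketch of one way to do that with `pad_sequence`. The function name `collate_fn`, the default `pad_token_id=0`, and the toy token ids are assumptions for illustration; in practice the pad id would come from `tokenizer.pad_token_id`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def collate_fn(batch, pad_token_id=0):
    """Pad the token id tensors in a batch to the longest sequence.

    Assumption: pad_token_id=0; pass tokenizer.pad_token_id in real use.
    """
    token_ids, labels = zip(*batch)
    input_ids = pad_sequence(list(token_ids), batch_first=True, padding_value=pad_token_id)
    # 1 for real tokens, 0 for padding positions
    attention_mask = pad_sequence(
        [torch.ones_like(ids) for ids in token_ids], batch_first=True, padding_value=0
    )
    return input_ids, attention_mask, torch.cat(labels)


# Toy batch in the (token_ids, label) format returned by the dataset's __getitem__
batch = [
    (torch.tensor([2, 10, 11, 3]), torch.tensor([3.7])),
    (torch.tensor([2, 12, 3]), torch.tensor([1.2])),
]
input_ids, attention_mask, labels = collate_fn(batch)
print(input_ids.tolist())       # [[2, 10, 11, 3], [2, 12, 3, 0]]
print(attention_mask.tolist())  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

It could then be passed as `DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)`, where the batch size is an arbitrary choice.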

In [None]:
# Check for duplicates
#.duplicated(['sentence1', 'sentence2']).sum()
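
The commented-out check above looks like a pandas call, so one way to run it is on a DataFrame of sentence pairs. The sketch below uses a small hypothetical DataFrame in place of the real data. For the KLUE split, a frame could presumably be obtained from the Hugging Face dataset with its `to_pandas()` method, which is an assumption about the intended workflow.

```python
import pandas as pd

# Hypothetical toy data standing in for the KLUE-STS train split
toy = pd.DataFrame({
    'sentence1': ['The room was clean.', 'Check-in was easy.', 'The room was clean.'],
    'sentence2': ['The room was tidy.', 'The host was late.', 'The room was tidy.'],
    'label': [4.2, 1.0, 4.2],
})

# Count rows whose (sentence1, sentence2) pair already appeared earlier
n_duplicates = toy.duplicated(['sentence1', 'sentence2']).sum()
print(n_duplicates)  # 1

# Keep the first occurrence of each pair
deduplicated = toy.drop_duplicates(['sentence1', 'sentence2']).reset_index(drop=True)
print(len(deduplicated))  # 2
```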