<a href="https://colab.research.google.com/github/byeongdon/hanghae99/blob/main/3%EC%A3%BC%EC%B0%A8_%EC%8B%AC%ED%99%94%EA%B3%BC%EC%A0%9C.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# README


## Q1) Which task did you choose?
> MNLI


## Q2) How did you design the model? What are the model's input and output shapes?
> Describe the model's input and output format or shape precisely.


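## Q3) What does the loss curve look like when you fine-tune the pre-trained model? How does it differ from training a Transformer that was not pre-trained?
> For the comparison, use metrics such as the loss curve, accuracy, or generalization performance on test data.
> +) You are also free to use other metrics, such as BLEU, which is used for problems like machine translation.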
-
-  
-  
- To attach an image: ![image description](path) / example: ![poster](./image.png)

### Upload a report summarizing the results of implementing the items above as a README.md
### Share the code and execution results as a Jupyter notebook in the same public GitHub repository. The cell outputs must be preserved.


In [6]:
!pip install tqdm boto3 requests regex sentencepiece sacremoses datasets transformers kagglehub --upgrade



In [7]:
import kagglehub

# Download the dataset
path = kagglehub.dataset_download("thedevastator/unlocking-language-understanding-with-the-multin")
print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/unlocking-language-understanding-with-the-multin


In [8]:
import random
import pandas as pd

def load_data(path, nrows=None):
    df = pd.read_csv(path, nrows=nrows, keep_default_na=False)
    data = []
    for _, row in df.iterrows():
        if len(row['premise']) * len(row['hypothesis']) != 0:
            data.append({'premise': row['premise'], 'hypothesis': row['hypothesis'], 'label': row['label']})
    return data

# Load the data (first 1000 rows of each split)
train_data = load_data(path + '/train.csv', nrows=1000)
test_data = load_data(path + '/validation_matched.csv', nrows=1000)
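`load_data` passes `keep_default_na=False` so that empty CSV fields are read as empty strings rather than `NaN`, which is what makes the `len(...)` check safe. A minimal sketch of that behavior, with a vectorized version of the same filter, is shown here. The two-row CSV text and the names `csv_text`, `mask`, and `records` are hypothetical and only serve the illustration.

```python
import io
import pandas as pd

# Hypothetical two-row CSV; the second row has an empty premise.
csv_text = "premise,hypothesis,label\nA cat sits.,An animal sits.,0\n,Empty premise.,1\n"

# keep_default_na=False keeps empty fields as '' instead of NaN,
# so string operations such as len() work on every row.
df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# Vectorized equivalent of the len(premise) * len(hypothesis) != 0 check.
mask = (df["premise"].str.len() > 0) & (df["hypothesis"].str.len() > 0)
records = df.loc[mask, ["premise", "hypothesis", "label"]].to_dict("records")
print(records)
```

Only the first row survives the filter, and the result has the same list-of-dicts layout that `load_data` returns.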

In [9]:
import torch
from torch.utils.data import Dataset, DataLoader

class MNLIDataset(Dataset):
    def __init__(self, data, tokenizer, max_len=400):
        self.data = data
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        premise = self.data[idx]['premise']
        hypothesis = self.data[idx]['hypothesis']
        label = self.data[idx]['label']

        tokens = self.tokenizer(premise, hypothesis, padding='max_length', truncation=True, max_length=self.max_len)

        input_ids = torch.LongTensor(tokens.input_ids)
        attention_mask = torch.LongTensor(tokens.attention_mask)
        label = torch.LongTensor([label])  # wrap the label in a LongTensor of shape (1,)

        return input_ids, attention_mask, label
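The tensor shapes produced by this dataset can be checked without downloading a tokenizer. The sketch below uses a hypothetical stand-in class, `ToyPairDataset`, that returns tensors with the same shapes as `MNLIDataset` (sequence length 400, label of shape `(1,)`), and shows why the training loop later needs `labels.squeeze(1)`.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyPairDataset(Dataset):
    """Hypothetical stand-in that mimics the tensors MNLIDataset returns."""
    def __init__(self, n_items=8, max_len=400):
        self.n_items = n_items
        self.max_len = max_len

    def __len__(self):
        return self.n_items

    def __getitem__(self, idx):
        input_ids = torch.zeros(self.max_len, dtype=torch.long)
        attention_mask = torch.ones(self.max_len, dtype=torch.long)
        label = torch.LongTensor([idx % 3])  # shape (1,), as in MNLIDataset
        return input_ids, attention_mask, label

loader = DataLoader(ToyPairDataset(), batch_size=4)
input_ids, attention_mask, labels = next(iter(loader))

# The default collate function stacks items along a new batch dimension.
print(input_ids.shape, attention_mask.shape, labels.shape)

# CrossEntropyLoss expects class indices of shape (batch,), hence squeeze(1).
squeezed = labels.squeeze(1)
print(squeezed.shape)
```

With a batch size of 4, the inputs have shape `(4, 400)` and the labels have shape `(4, 1)` before squeezing. Returning a scalar label from `__getitem__` would be an alternative design that avoids the squeeze.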



In [10]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch import nn

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)


# Create the datasets and data loaders
train_dataset = MNLIDataset(train_data, tokenizer)
test_dataset = MNLIDataset(test_data, tokenizer)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)




Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
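For the Q3 comparison against a Transformer that is not pre-trained, one possible approach is to build the same model class from a config instead of calling `from_pretrained`, which leaves all weights randomly initialized. The sketch below is not the notebook's method. It uses deliberately tiny, hypothetical sizes (`n_layers=2`, `dim=64`, and so on) so that it runs quickly, and it also confirms that the classifier outputs logits of shape `(batch_size, 3)`.

```python
import torch
from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Tiny, hypothetical sizes chosen only to keep the sketch fast.
# Omitting them falls back to the DistilBertConfig defaults.
config = DistilBertConfig(n_layers=2, n_heads=2, dim=64, hidden_dim=128, num_labels=3)

# Constructing from a config (not from_pretrained) gives random weights.
scratch_model = DistilBertForSequenceClassification(config)
scratch_model.eval()

# Dummy batch: 2 sequences of 16 random token ids.
input_ids = torch.randint(0, config.vocab_size, (2, 16))
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    logits = scratch_model(input_ids, attention_mask=attention_mask).logits

print(logits.shape)  # one logit per MNLI class for each sequence
```

Training `scratch_model` with the same loop and data as the pre-trained model would give the two loss curves that Q3 asks to compare.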


In [11]:
from torch.optim import Adam
import numpy as np
import matplotlib.pyplot as plt

# Move the model to the GPU if one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Training settings
lr = 0.001
loss_fn = nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=lr)
n_epochs = 5

# Training loop
for epoch in range(n_epochs):
    total_loss = 0.
    model.train()  # switch to training mode

    for input_ids, attention_mask, labels in train_loader:
        optimizer.zero_grad()  # clear gradients from the previous step

        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)  # move the batch to the device

        preds = model(input_ids, attention_mask=attention_mask).logits  # logits of shape (batch_size, 3)

        loss = loss_fn(preds, labels.squeeze(1))  # labels: (batch_size, 1) -> (batch_size,)
        loss.backward()  # backpropagation
        optimizer.step()  # parameter update

        total_loss += loss.item()  # accumulate the per-batch loss

    # Note: total_loss is the sum of the batch losses, not their mean
    print(f"Epoch {epoch:3d} | Train Loss: {total_loss}")

Epoch   0 | Train Loss: 19.122448563575745
Epoch   1 | Train Loss: 17.752817034721375
Epoch   2 | Train Loss: 17.66719913482666
Epoch   3 | Train Loss: 17.569453358650208
Epoch   4 | Train Loss: 17.633403420448303
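The printed values are sums over batches, so they are easier to read once divided by the number of batches and compared with the loss of a uniform guess over three classes, ln 3 ≈ 1.099. The sketch below does that arithmetic. It assumes that none of the 1000 loaded rows were dropped by the empty-string filter, so that each epoch has ceil(1000 / 64) = 16 batches.

```python
import math

# Summed training losses printed by the loop above, one per epoch.
epoch_totals = [19.122448563575745, 17.752817034721375, 17.66719913482666,
                17.569453358650208, 17.633403420448303]

# Assumption: all 1000 rows survived filtering, so batch_size=64
# gives ceil(1000 / 64) = 16 batches per epoch.
n_batches = math.ceil(1000 / 64)

mean_losses = [total / n_batches for total in epoch_totals]
chance_loss = math.log(3)  # cross-entropy of a uniform guess over 3 classes

for epoch, mean_loss in enumerate(mean_losses):
    print(f"Epoch {epoch} | mean batch loss {mean_loss:.3f} | chance {chance_loss:.3f}")
```

Under that assumption the mean batch loss settles at about 1.10, essentially the chance level, which suggests the model is predicting a near-uniform distribution rather than learning the task. One plausible cause is the learning rate: `lr = 0.001` is large for fine-tuning a pre-trained Transformer, where values around 2e-5 to 5e-5 are more commonly used.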


In [12]:
# Accuracy function
def accuracy(model, dataloader):
    cnt = 0      # total number of samples
    acc = 0      # running count of correct predictions

    with torch.no_grad():  # disable gradient tracking
        model.eval()  # switch to evaluation mode (disables dropout)
        for input_ids, attention_mask, labels in dataloader:
            input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)  # move the batch to the device

            preds = model(input_ids, attention_mask=attention_mask).logits
            preds = torch.argmax(preds, dim=-1)  # predicted class index per sample

            cnt += labels.shape[0]  # accumulate the sample count
            acc += (labels.squeeze(1) == preds).sum().item()  # accumulate the number of correct predictions

    return acc / cnt  # fraction of correct predictions

# Evaluate
train_acc = accuracy(model, train_loader)
test_acc = accuracy(model, test_loader)

print(f"=========> Train acc: {train_acc:.3f} | Test acc: {test_acc:.3f}")
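To judge whether the reported accuracy reflects real learning, it helps to compare it with the accuracy of always predicting the most frequent label. The helper below is a minimal sketch of that baseline. `majority_baseline` and `toy_data` are hypothetical names, and the toy list only mimics the dict format returned by `load_data`.

```python
from collections import Counter

def majority_baseline(data):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(item["label"] for item in data)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(data)

# Hypothetical toy data in the same dict format that load_data returns.
toy_data = [{"label": 0}, {"label": 1}, {"label": 1}, {"label": 2}]
print(f"Majority baseline: {majority_baseline(toy_data):.3f}")
```

Calling `majority_baseline(test_data)` on the real split would give the number to compare against `test_acc`. For a roughly balanced three-class task this baseline sits near 1/3.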

