# [Fine-tune a pretrained model](https://huggingface.co/docs/transformers/training)

1. [Prepare a dataset](#prepare-a-dataset)
2. [Train](#train)

    2.1. [with PyTorch Trainer](#1-train-with-pytorch-trainer)

    2.2. [in native PyTorch](#2-train-in-native-pytorch)

## Prepare a dataset

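Begin by loading the Yelp Reviews dataset. You then need a tokenizer to process the text, with a padding and truncation strategy to handle variable sequence lengths. Use the map method to apply the preprocessing function to the entire dataset in one step.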


In [1]:
from datasets import load_dataset

dataset = load_dataset('yelp_review_full')
#dataset['train']

In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('google-bert/bert-base-cased')

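def tokenize_function(examples):
    # Pad every example to the model's maximum length and cut longer ones
    return tokenizer(examples['text'], padding='max_length', truncation=True)

# batched=True passes many examples to tokenize_function per call
tokenized_datasets = dataset.map(tokenize_function, batched=True)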

In [3]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})
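
To make the effect of padding='max_length' and truncation=True concrete, here is a minimal pure-Python sketch. The helper pad_and_truncate and the toy token IDs are hypothetical and only illustrate the idea; the real tokenizer also takes care of special tokens such as [CLS] and [SEP], which this sketch ignores.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Hypothetical helper: cut or pad a list of token IDs to max_length."""
    ids = token_ids[:max_length]        # truncation: drop anything past max_length
    mask = [1] * len(ids)               # 1 marks a real token
    n_pad = max_length - len(ids)       # number of padding positions to append
    ids = ids + [pad_id] * n_pad
    mask = mask + [0] * n_pad           # 0 marks padding the model should ignore
    return {"input_ids": ids, "attention_mask": mask}


short = pad_and_truncate([101, 1109, 102], max_length=5)
long = pad_and_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short)
print(long)
```

Because every example ends up with the same length, each input_ids and attention_mask tensor printed later in this notebook has shape [512].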

To reduce the time fine-tuning takes, you can create a smaller subset of the full dataset to train on.

In [4]:
small_train_dataset = tokenized_datasets['train'].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets['test'].shuffle(seed=42).select(range(1000))

small_train_dataset

Dataset({
    features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1000
})
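
The shuffle(seed=42).select(range(1000)) pattern amounts to permuting the row indices with a fixed seed and keeping the first 1000. Below is a rough plain-Python analogue. It assumes a hypothetical list of rows and uses random.Random instead of the generator that datasets uses internally, so the rows it picks will not match the notebook's; it only shows that a fixed seed makes the subset reproducible.

```python
import random


def seeded_subset(rows, n, seed=42):
    """Hypothetical helper: shuffle row indices with a fixed seed, keep the first n."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)   # same seed gives the same permutation
    return [rows[i] for i in indices[:n]]


rows = [f"review {i}" for i in range(10_000)]   # stand-in for the real dataset
subset_a = seeded_subset(rows, 1000)
subset_b = seeded_subset(rows, 1000)
print(len(subset_a), subset_a == subset_b)
```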

## Train

### 1. Train with PyTorch Trainer

Hugging Face Transformers provides a Trainer class optimized for training Transformers models. Start by loading your model and specifying the number of expected labels. From the Yelp Review dataset card, you know there are five labels.


In [5]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Training hyperparameters

Create a TrainingArguments object, which holds all the hyperparameters you can tune as well as flags for activating different training options. To monitor your evaluation metrics during fine-tuning, set the evaluation_strategy parameter to "epoch" so that the metric is reported at the end of each epoch. Newer Transformers releases rename this parameter to eval_strategy, so use that name if evaluation_strategy is rejected.

In [6]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir='test_trainer', evaluation_strategy="epoch")

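#### Evaluate

To report accuracy during evaluation, pass Trainer a function that converts the model's logits to predictions and computes the metric.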

In [7]:
import numpy as np
import evaluate

metric = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
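
compute_metrics receives raw logits, one row per example with one score per class, and the predicted class is the index of the largest score. The following plain-Python sketch mirrors that logic on made-up logits, without NumPy or evaluate. The function name accuracy_from_logits and the numbers are hypothetical.

```python
def accuracy_from_logits(logits, labels):
    """Hypothetical helper: argmax over each row, then the fraction of correct predictions."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


logits = [
    [0.1, 2.3, -1.0, 0.0, 0.5],   # largest score at index 1
    [1.5, 0.2, 0.1, 0.0, -0.3],   # largest score at index 0
    [0.0, 0.1, 0.2, 3.0, 0.4],    # largest score at index 3
    [0.3, 0.2, 0.9, 0.1, 0.0],    # largest score at index 2
]
labels = [1, 0, 4, 2]             # the third example is misclassified
result = accuracy_from_logits(logits, labels)
print(result)
```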

In [8]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

(I tried to run this step with the Transformers Trainer in my local environment but could not resolve a kernel error. Since I mainly need to use PyTorch directly, I decided to move on.)

### 2. Train in native PyTorch

Before training, you may need to execute the following code to free some memory.

In [9]:
import torch
del model
del trainer

torch.cuda.empty_cache()

Next, manually postprocess tokenized_datasets to prepare it for training.

1. Remove the text column because the model does not accept raw text as an input.

In [10]:
tokenized_datasets = tokenized_datasets.remove_columns(["text"])

tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})

2. Rename the label column to labels because the model expects the argument to be named labels.

In [11]:
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

3. Set the format of the dataset to return PyTorch tensors instead of lists.

In [13]:
tokenized_datasets.set_format("torch")
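
The reason for removing text and renaming label becomes clearer once you notice that the training loop later calls model(**batch), which turns every column into a keyword argument. The sketch below uses a hypothetical stand-in function, forward, with a simplified signature. The real forward method has more parameters, and how it reacts to unexpected keys may differ between library versions.

```python
def forward(input_ids, attention_mask, token_type_ids=None, labels=None):
    """Hypothetical stand-in for a model's forward signature."""
    return "loss computed" if labels is not None else "logits only"


row = {
    "label": 2,
    "text": "Great food",
    "input_ids": [101, 102],
    "token_type_ids": [0, 0],
    "attention_mask": [1, 1],
}

# Unpacking the raw row fails: forward has no 'text' or 'label' parameter
try:
    forward(**row)
    accepted_raw = True
except TypeError:
    accepted_raw = False

# Step 1: drop the text column. Step 2: rename label to labels
clean = {k: v for k, v in row.items() if k != "text"}
clean["labels"] = clean.pop("label")
result = forward(**clean)
print(accepted_raw, result)
```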

Then create a smaller subset of the dataset, as shown previously, to speed up fine-tuning.

In [14]:
small_train_dataset = tokenized_datasets['train'].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets['test'].shuffle(seed=42).select(range(1000))

small_train_dataset

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1000
})

In [32]:
temp = small_train_dataset[0]

print(temp.keys())
for k, v in temp.items():
    print(f'{k}:')
    print(f'{type(v)}, {v.shape}')

dict_keys(['labels', 'input_ids', 'token_type_ids', 'attention_mask'])
labels:
<class 'torch.Tensor'>, torch.Size([])
input_ids:
<class 'torch.Tensor'>, torch.Size([512])
token_type_ids:
<class 'torch.Tensor'>, torch.Size([512])
attention_mask:
<class 'torch.Tensor'>, torch.Size([512])


#### DataLoader

Create a DataLoader for your training and test datasets so you can iterate over batches of data.

In [33]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)
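
As a rough illustration of what batching does to the 1000-example subset, the plain-Python sketch below slices a list into chunks of 8 and counts them. The helper iter_batches is hypothetical, and it leaves out the shuffling and the collation into tensors that DataLoader performs.

```python
import math


def iter_batches(items, batch_size):
    """Hypothetical helper: yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


dataset_size, batch_size = 1000, 8
batches = list(iter_batches(list(range(dataset_size)), batch_size))
steps_per_epoch = len(batches)   # plays the role of len(train_dataloader)
print(steps_per_epoch, len(batches[0]), len(batches[-1]))
```

The number of batches per epoch is the quantity that len(train_dataloader) supplies when the total number of training steps is computed.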

Load your model with the number of expected labels.

In [34]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Optimizer and learning rate scheduler

Create an optimizer and learning rate scheduler to fine-tune the model.

In [35]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

Create the default learning rate scheduler used by Trainer: a linear decay with no warmup steps.

In [37]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
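
To show what a linear schedule with zero warmup steps does to the learning rate, here is a small plain-Python sketch. The function linear_lr is a hypothetical reimplementation written for illustration, not the library's code: it ramps up linearly during warmup (none here) and then decays linearly to zero over the remaining steps.

```python
def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Hypothetical sketch of a linear warmup followed by a linear decay to zero."""
    if step < num_warmup_steps:
        factor = step / max(1, num_warmup_steps)
    else:
        remaining = num_training_steps - step
        factor = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor


num_training_steps = 3 * 125   # 3 epochs of 125 batches each
schedule = [linear_lr(s, 5e-5, 0, num_training_steps) for s in range(num_training_steps + 1)]
print(schedule[0], schedule[num_training_steps])
```

The learning rate starts at the optimizer's value of 5e-5 and reaches zero at the final step, which is why num_training_steps has to be known before the scheduler is created.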

Lastly, specify device to use a GPU if you have access to one.

In [38]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [39]:
for batch in train_dataloader:
    print(type(batch))
    for k, v in batch.items():
        print(f'{k}:{v}')
        print(f'Shape {v.shape} / type {type(v)}')
        print(type(v[0]))
    break

<class 'dict'>
labels:tensor([2, 1, 2, 0, 1, 0, 3, 1])
Shape torch.Size([8]) / type <class 'torch.Tensor'>
<class 'torch.Tensor'>
input_ids:tensor([[ 101, 4960, 1111,  ..., 1115, 1134,  102],
        [ 101, 1109, 1211,  ...,    0,    0,    0],
        [ 101,  146, 3983,  ...,    0,    0,    0],
        ...,
        [ 101,  146, 1274,  ...,    0,    0,    0],
        [ 101, 3006, 1715,  ...,    0,    0,    0],
        [ 101, 5091, 1254,  ...,    0,    0,    0]])
Shape torch.Size([8, 512]) / type <class 'torch.Tensor'>
<class 'torch.Tensor'>
token_type_ids:tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]])
Shape torch.Size([8, 512]) / type <class 'torch.Tensor'>
<class 'torch.Tensor'>
attention_mask:tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1,

#### Training loop

In [40]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

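model.train()  # put the model in training mode (enables dropout)
for epoch in range(num_epochs):
    for batch in train_dataloader:
        # Move every tensor in the batch to the same device as the model
        batch = {k: v.to(device) for k, v in batch.items()}
        # The batch includes labels, so the model output carries the loss
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()       # update the weights
        lr_scheduler.step()    # advance the learning rate schedule
        optimizer.zero_grad()  # clear gradients before the next batch
        progress_bar.update(1)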

  0%|          | 0/375 [00:00<?, ?it/s]
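
The loop calls optimizer.zero_grad() after every update because PyTorch adds new gradients to the ones already stored instead of replacing them. The toy below imitates that accumulation in plain Python with a single weight. ToyParam and train are hypothetical names, and this is only an analogy for the behavior, not PyTorch code.

```python
class ToyParam:
    """Hypothetical one-weight model whose gradient accumulates across backward calls."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0

    def backward(self, x, y):
        # Gradient of the squared error (w * x - y) ** 2, added to the stored gradient
        self.grad += 2 * (self.value * x - y) * x

    def step(self, lr):
        self.value -= lr * self.grad

    def zero_grad(self):
        self.grad = 0.0


def train(data, lr, reset_grad):
    w = ToyParam(0.0)
    for x, y in data:
        w.backward(x, y)
        w.step(lr)
        if reset_grad:
            w.zero_grad()
    return w.value


data = [(1.0, 2.0)] * 50   # the ideal weight is 2.0
with_reset = train(data, lr=0.1, reset_grad=True)
without_reset = train(data, lr=0.1, reset_grad=False)
print(with_reset, without_reset)
```

With the reset, the weight settles very close to 2.0. Without it, stale gradients keep pushing the weight past the target, so it never settles.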
