<a href="https://colab.research.google.com/github/Yayawak/15React-projects/blob/main/HW10_BERT_finetuing_student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## HW10: BERT fine-tuning

In this exercise, you will learn how to fine-tune a transformer-based model. First, we provide a tutorial on fine-tuning DistilBERT (https://arxiv.org/abs/1910.01108) on the Large Movie Review Dataset (IMDB dataset). After that, you have to complete the exercise by fine-tuning on the TRUE call-center dataset (HW6). This homework is based on the Hugging Face tutorial (https://huggingface.co/transformers/custom_datasets.html).

### 1. Install the transformers library from Hugging Face

In [None]:
!pip install transformers
!pip install pythainlp
!pip install sentencepiece

Collecting pythainlp
  Downloading pythainlp-5.1.1-py3-none-any.whl.metadata (8.0 kB)
Downloading pythainlp-5.1.1-py3-none-any.whl (19.3 MB)
Installing collected packages: pythainlp
Successfully installed pythainlp-5.1.1


### 2. Download Large Movie Review Dataset

In [None]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

--2025-03-31 07:26:18--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2025-03-31 07:26:25 (12.3 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



### 3. Preprocess the dataset
The Large Movie Review Dataset is a dataset for binary sentiment classification. Each input is a movie review, and its sentiment (positive or negative) is the ground-truth label.

In [None]:
from pathlib import Path
from sklearn.model_selection import train_test_split
import numpy as np

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            # Compare strings with ==; `is` tests object identity and raises a SyntaxWarning here.
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)



In [None]:
print("Unique label is {}, nb. of train data = {}, test_data = {}".format(np.unique(train_labels), len(train_texts), len(test_texts)))
for i in range(5):
  print("Data = {}".format(train_texts[i]))
  print("Label = {}".format(train_labels[i]))

Unique label is [0 1], nb. of train data = 20000, test_data = 25000
Data = Think of this pilot as "Hawaii Five-O Lite". It's set in Hawaii, it's an action/adventure crime drama, lots of scenes feature boats and palm trees and polyester fabrics and garish shirts...it even stars the character actor "Zulu" in a supporting role. Oh, there are some minor differences - Roy Thinnes is supposed to be some front-line undercover agent, and the supporting cast is much smaller (and less interesting), but basically the atmosphere is still the same. Problem is, "Hawaii Five-O" (another QM product) already existed at the time and had run for years. It filled the market demand for Hawaii-based crime dramas quite adequately. Code Name: Diamond Head may have been intended as the hier to H50 as the older series eventually dwindled away...but it comes across as a superfluous, 2nd rate copy. It doesn't suck, but it's completely derivative and doesn't do anything as well as the original.<br /><br />There is

After the dataset is loaded, we tokenize each input text. The tokenizer adds a start token '[CLS]' (id 101) at the beginning and a separator token '[SEP]' (id 102) at the end of each sequence. A word that cannot be broken into known word pieces is mapped to the unknown token '[UNK]' (id 100). The tokenized output has the following format:

```python
{
  'input_ids': List[List[int]],       # token ids of each tokenized input text
  'attention_mask': List[List[int]],  # 1 for real tokens, 0 for padding; see the examples below
}
```
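
To make the relationship between `input_ids` and `attention_mask` concrete, here is a minimal pure-Python sketch of padding a batch to a common length. The helper `pad_batch` is hypothetical and only mimics what the tokenizer does when `padding=True`; the token ids are copied from the tokenizer output shown in this notebook, and a padding id of 0 is assumed from that same output.

```python
from typing import Dict, List

PAD_ID = 0  # padding id, as it appears in the tokenizer outputs in this notebook


def pad_batch(sequences: List[List[int]]) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 7222, 6207, 6207, 7279, 100, 100, 102],  # longer sequence
    [101, 1037, 1038, 102],                        # shorter sequence
])
print(batch["attention_mask"][1])  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The shorter sequence receives four padding ids, and its mask is zero exactly at those positions.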

In [None]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
tokenizer([ '[CLS] a' ], truncation=True, padding=True)

{'input_ids': [[101, 101, 1037, 102]], 'attention_mask': [[1, 1, 1, 1]]}

In [None]:
tokenizer( ['Pine apple apple pen  หมา ไก่', 'a b'], truncation=True, padding=True)

{'input_ids': [[101, 7222, 6207, 6207, 7279, 100, 100, 102], [101, 1037, 1038, 102, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0, 0]]}

In [None]:
a = tokenizer(train_texts[:2], truncation=True, padding=True)
print(a)

{'input_ids': [[101, 1037, 2307, 12883, 16291, 9476, 2013, 1996, 3041, 2086, 2038, 12883, 2004, 1037, 9256, 1999, 2019, 3332, 4653, 2012, 1037, 2334, 2533, 3573, 1012, 2044, 2002, 1005, 1055, 2589, 2005, 1996, 2154, 1996, 3208, 3310, 1999, 2000, 2425, 2032, 2008, 2002, 1005, 2222, 2022, 14391, 2574, 1012, 12883, 2003, 3407, 2000, 27885, 3669, 3351, 2046, 2002, 4481, 2041, 2008, 1996, 2047, 3105, 2003, 1999, 10095, 4063, 8029, 1012, 1012, 1012, 1998, 2008, 10095, 4063, 8029, 2038, 2000, 2079, 2007, 28652, 4176, 1012, 4176, 2066, 2360, 1010, 1037, 3056, 10442, 1012, 2023, 5320, 1037, 2645, 1997, 25433, 2090, 1996, 20710, 9289, 2135, 10442, 1998, 2010, 2085, 2280, 11194, 1012, 1045, 2179, 2023, 2460, 2000, 2022, 26380, 1998, 5791, 2028, 1997, 1996, 2488, 3924, 1997, 1996, 2220, 3878, 1005, 1055, 1012, 2009, 2145, 3464, 2004, 6057, 3053, 3438, 1009, 2086, 2101, 1012, 2023, 6579, 2460, 2064, 2022, 2464, 2006, 5860, 1015, 1997, 1996, 8840, 17791, 13281, 3585, 3074, 3872, 1016, 1012, 1026, 79

In [None]:
# Pad or truncate every review to exactly 512 tokens.
# `padding="max_length"` replaces the deprecated `pad_to_max_length=True`.
train_encodings = tokenizer(train_texts, add_special_tokens=True,
            max_length=512,
            padding="max_length",
            return_token_type_ids=True,truncation=True
        )
val_encodings = tokenizer(val_texts, add_special_tokens=True,
            max_length=512,
            padding="max_length",
            return_token_type_ids=True,truncation=True
        )
test_encodings = tokenizer(test_texts, add_special_tokens=True,
            max_length=512,
            padding="max_length",
            return_token_type_ids=True,truncation=True
        )



Convert the dataset into the training format. The input format expected by DistilBERT is documented at https://huggingface.co/transformers/model_doc/distilbert.html.

In [None]:
train_data = [np.array(train_encodings['input_ids']), np.array(train_encodings['attention_mask'])]
val_data = [np.array(val_encodings['input_ids']), np.array(val_encodings['attention_mask'])]
test_data = [np.array(test_encodings['input_ids']), np.array(test_encodings['attention_mask'])]

In [None]:
train_data

[array([[  101,  1037,  2307, ...,     0,     0,     0],
        [  101,  1999,  1037, ...,     0,     0,     0],
        [  101,  9686, 21138, ...,     0,     0,     0],
        ...,
        [  101,  1996,  3819, ...,     0,     0,     0],
        [  101,  2045,  2024, ...,  2006,  2008,   102],
        [  101,  1996,  3145, ...,  2420, 13012,   102]]),
 array([[1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        ...,
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 1, 1, 1],
        [1, 1, 1, ..., 1, 1, 1]])]

### 4. Model fine-tuning
The model we fine-tune is DistilBERT (https://arxiv.org/abs/1910.01108), a smaller model distilled from the original BERT. Knowledge distillation is a well-known technique for improving a small (student) model: instead of training only on hard labels, the student is trained to match the output probability distribution of a larger (teacher) model, which carries the teacher's uncertainty across classes. If you want to know more about knowledge distillation, read https://arxiv.org/abs/1503.02531.
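
As an illustration of the soft-label idea, the sketch below computes only the soft-target term of a distillation loss in pure Python: the cross-entropy between the teacher's temperature-softened distribution and the student's. The function names, the logits, and the temperature of 2.0 are illustrative assumptions, and this is not the full DistilBERT training objective, which combines several loss terms.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(student_logits: List[float],
                     teacher_logits: List[float],
                     temperature: float = 2.0) -> float:
    """Cross-entropy between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


# Made-up logits for a 3-class problem.
teacher = [4.0, 1.0, 0.5]
good_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
bad_student = [0.5, 1.0, 4.0]   # prefers the wrong class

print(soft_target_loss(good_student, teacher) < soft_target_loss(bad_student, teacher))  # True
```

A student whose logits resemble the teacher's gets a lower loss, even though no hard label appears in the computation.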

#### Model Initialization

In [None]:
from transformers import DistilBertForSequenceClassification
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels= 2).to(device)

LEARNING_RATE =  1e-5
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Set up training generator

In the previous lab you passed the whole dataset to model.fit. A more common way to feed data is to use a generator, which is more memory-efficient because each batch is only fetched when the iterator is advanced. For example, a generator can load images from a folder when called instead of keeping all of them in RAM. The example below is a simple generator that aggregates data points into batches. Both PyTorch and TensorFlow also have utility modules for building such pipelines (torch.utils.data.DataLoader for PyTorch and tf.data.Dataset for TensorFlow).

In [None]:
def batch_data_generator(data, label, bs = 8, training = True):
  while(True):
    X1= []
    X2 = []
    Y = []
    from sklearn.utils import shuffle
    ids, masks = data[0], data[1]
    if(training):
      ids, masks, label = shuffle(ids, masks, label, random_state = 42)
    for a, b, c in zip(ids, masks, label):
      X1.append(a)
      X2.append(b)
      Y.append(c)
      if(len(X1) == bs):
        yield [np.array(X1), np.array(X2)], np.array(Y)
        X1= []
        X2 = []
        Y = []
    if(len(X1) > 0):
      yield [np.array(X1), np.array(X2)], np.array(Y)
    if(not training):
      yield None
      break
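
The generator above batches arrays that are already in memory. To illustrate the memory argument made earlier, here is a minimal sketch of a generator that reads files from disk only when a batch is requested. The helper `lazy_text_batches` and the temporary demo files are hypothetical and are not part of the homework pipeline.

```python
import tempfile
from pathlib import Path
from typing import Iterator, List


def lazy_text_batches(paths: List[Path], bs: int = 2) -> Iterator[List[str]]:
    """Yield batches of file contents, reading each file only when it is needed."""
    batch = []
    for path in paths:
        batch.append(path.read_text(encoding="utf-8"))  # the disk read happens here
        if len(batch) == bs:
            yield batch
            batch = []
    if batch:  # final batch, possibly smaller than bs
        yield batch


# Demo on three throwaway files (made-up data, not the IMDB set).
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(3):
        p = Path(tmp) / f"review_{i}.txt"
        p.write_text(f"review {i}", encoding="utf-8")
        paths.append(p)
    batches = list(lazy_text_batches(paths, bs=2))

print([len(b) for b in batches])  # [2, 1]
```

Only the file paths are held up front; at most one batch of file contents is in memory at a time while iterating.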


In [None]:
# train_generator = batch_data_generator(train_data, np.array(train_labels, dtype = np.int), training = True)
train_generator = batch_data_generator(train_data, np.array(train_labels, dtype = int), training = True)

In [None]:
dummy_generator = batch_data_generator(train_data, np.array(train_labels, dtype = int), training = True)
X_dummy, Y_dummy = next(dummy_generator)
print(X_dummy[0].shape, X_dummy[1].shape, Y_dummy.shape)

(8, 512) (8, 512) (8,)


#### Start Fine-tuning

In [None]:
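from tqdm.notebook import tqdm  # `tqdm.tqdm_notebook` is deprecated
from sklearn.metrics import accuracy_score
from collections import deque

# Keep only the 100 most recent batch statistics for a running average.
train_acc_stat = deque(maxlen = 100)
train_loss_stat = deque(maxlen = 100)

model.train()  # make sure dropout is active during fine-tuning

for step in tqdm(range(1000)):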
    X, Y = next(train_generator)
    ids = torch.tensor(X[0], dtype = torch.long).to(device)
    mask = torch.tensor(X[1], dtype = torch.long).to(device)
    targets = torch.tensor(Y, dtype = torch.long).to(device)

    optimizer.zero_grad()
    outputs = model(ids, mask)
    loss = loss_fn(outputs['logits'], targets)

    loss.backward()
    optimizer.step()

    with torch.no_grad():
      train_acc = accuracy_score(Y, outputs['logits'].argmax(axis = 1).cpu().detach().numpy() )
      train_loss = loss.cpu().detach().numpy()
      train_acc_stat.append(train_acc)
      train_loss_stat.append(train_loss)

    if (step + 1) %100==0:
      print("iter = {} train_acc = {}".format(step, np.array(train_acc_stat).mean()))
      print("iter = {} train_loss = {}".format(step, np.array(train_loss_stat).mean()))


    if (step + 1) % 500 == 0:
      # Validation step: disable dropout and gradient tracking.
      model.eval()
      with torch.no_grad():
        # `np.int` was removed from NumPy; use the built-in `int` instead.
        val_generator = batch_data_generator(val_data, np.array(val_labels, dtype = int), training = False)
        y_true = []
        y_pred = []
        while(True):
          d = next(val_generator)
          if(d is None): break  # the generator yields None once the data is exhausted
          X, Y = d
          ids = torch.tensor(X[0], dtype = torch.long, device = device)
          mask = torch.tensor(X[1], dtype = torch.long, device = device)
          outputs_cls = model(ids, mask)['logits'].argmax(axis = 1).cpu().detach().numpy()
          y_true.append(Y)
          y_pred.append(outputs_cls)
        y_true = np.concatenate(y_true)
        y_pred = np.concatenate(y_pred)
        print("val acc", accuracy_score(y_true, y_pred))
      model.train()  # switch back to training mode


iter = 99 train_acc = 0.70625
iter = 99 train_loss = 0.5702405571937561
iter = 199 train_acc = 0.85125
iter = 199 train_loss = 0.3435094952583313


In [None]:
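# Example: using WangchanBERTa for sentiment analysis.
# Note: the classification head is newly initialized here (see the warning in the output),
# so the predicted label is not meaningful until the model is fine-tuned.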
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "airesearch/wangchanberta-base-att-spm-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels = 3)

text = "ไม่อร่อยเลย"  # "Not tasty at all"
tokens = tokenizer(text, return_tensors='pt')

with torch.no_grad():
  output = model(**tokens)
  sentiment = torch.argmax(output.logits, dim=1).item()

sentiment_labels = ["Negative", "Neutral", "Positive"]
print(f"Sentiment: {sentiment_labels[sentiment]} {sentiment}")


Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at airesearch/wangchanberta-base-att-spm-uncased and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sentiment: Neutral 1


## TODO
WangchanBERTa (https://arxiv.org/abs/2101.09635) is a RoBERTa (https://arxiv.org/abs/1907.11692) model trained on Thai text. RoBERTa is also supported in Hugging Face (https://huggingface.co/transformers/model_doc/roberta.html).

Example of using WangchanBERTa: see (https://colab.research.google.com/drive/1Kbk6sBspZLwcnOE61adAQo30xxqOQ9ko?usp=sharing&fbclid=IwAR23b8ZEoP6YxlUx7wWEu7dRCrVcyTFrZb3YSgI-nsxe_t4gy-bh8Rv5R9E#scrollTo=kAcpAdkddVQ8)

Task: build a Thai question-answering (QA) system with WangchanBERTa.
Example:

context : "ทรูมีแพ็กเกจอินเทอร์เน็ต 10GB ราคา 299 บาทต่อเดือน" (True has a 10GB internet package priced at 299 baht per month)

question : "แพ็กเกจอินเทอร์เน็ตมีกี่ GB?" (How many GB is the internet package?)

Expected answer: "10GB"
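
Extractive QA heads such as `AutoModelForQuestionAnswering` return one start logit and one end logit per token, and the answer is the token span between the chosen positions. Taking the two argmax values independently can yield an end position before the start. A minimal sketch of one common remedy, choosing the best valid pair jointly, is shown below; the helper `best_span`, the `max_answer_len` limit, and the logits are illustrative assumptions rather than part of any library.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return (start, end) maximizing start_logit + end_logit with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Made-up logits: independent argmax would give start=3, end=1 (an empty span).
start_logits = [0.1, 0.2, 0.3, 2.0, 0.1]
end_logits = [0.0, 2.5, 0.1, 0.4, 2.0]
print(best_span(start_logits, end_logits))  # (3, 4)
```

With real model outputs, the logits would first be converted to plain lists (for example via `.tolist()` on the tensors), and positions belonging to the question rather than the context would typically be excluded as well.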

In [None]:
# from transformers import AutoTokenizer, AutoModelForQuestionAnswering
# import torch
# # Load the tokenizer and model
# model_name = "pythainlp/wangchanglm-7.5B-sft-enth"
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# # Context and question
# context = "ทรูมีแพ็กเกจอินเทอร์เน็ต 10GB ราคา 299 บาทต่อเดือน"
# question = "แพ็กเกจอินเทอร์เน็ตมีกี่ GB?"

# # Encode the inputs for the model
# inputs = tokenizer(question, context, return_tensors='pt')

# # Run inference
# with torch.no_grad():
#     outputs = model(**inputs)
#     start_idx = torch.argmax(outputs.start_logits)
#     end_idx = torch.argmax(outputs.end_logits)

# # Decode the selected tokens back into the answer text
# answer_tokens = inputs['input_ids'][0][start_idx:end_idx + 1]
# answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)

# print(f"คำตอบ: {answer}")


In [None]:
!pip install -U bitsandbytes

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.4-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch<3,>=2.0->bitsandbytes)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch<3,>=2.0->bitsandbytes)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch<3,>=2.0->bitsandbytes)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch<3,>=2.0->bitsandbytes)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch<3,>=2.0->bitsandbytes)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-

In [None]:
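import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "pythainlp/wangchanglm-7.5B-sft-en"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    return_dict=True,
    # `load_in_8bit=True` is deprecated; pass a BitsAndBytesConfig instead.
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
    torch_dtype=torch.float16,
    offload_folder="./",
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generation settings; the values come from the inline comments of the original call.
max_gen_len = 512
top_p = 0.95
temperature = 0.9

text = "เล่นหุ้นยังไงให้รวย"  # "How do I get rich trading stocks?"
# Move the inputs to the same device as the model.
batch = tokenizer(text, return_tensors="pt").to(model.device)
with torch.autocast("cuda"):
  output_tokens = model.generate(
      input_ids=batch["input_ids"],
      max_new_tokens=max_gen_len,
      # begin_suppress_tokens=exclude_ids,  # `exclude_ids` is never defined in this notebook
      no_repeat_ngram_size=2,

      #oasst k50
      top_k=50,
      top_p=top_p,
      typical_p=1.,
      temperature=temperature,

      # #oasst typical3
      # typical_p = 0.3,
      # temperature = 0.8,
      # repetition_penalty = 1.2,
  )
tokenizer.decode(output_tokens[0], skip_special_tokens=True)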


loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--pythainlp--wangchanglm-7.5B-sft-en/snapshots/6dab5fdcedca7719a4663be67ff57e5e39194f71/config.json
Model config XGLMConfig {
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "architectures": [
    "XGLMForCausalLM"
  ],
  "attention_dropout": 0.1,
  "attention_heads": 32,
  "bos_token_id": 0,
  "d_model": 4096,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "eos_token_id": 2,
  "ffn_dim": 16384,
  "init_std": 0.02,
  "layerdrop": 0.0,
  "max_position_embeddings": 2048,
  "model_type": "xglm",
  "num_layers": 32,
  "pad_token_id": 1,
  "scale_embedding": true,
  "torch_dtype": "float16",
  "transformers_version": "4.50.0",
  "use_cache": true,
  "vocab_size": 256008
}

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


ImportError: Using `bitsandbytes` 8-bit quantization requires the latest version of bitsandbytes: `pip install -U bitsandbytes`

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("cstorm125/wangchanberta-base-att-spm-uncased-finetune-qa")
model = AutoModelForQuestionAnswering.from_pretrained("cstorm125/wangchanberta-base-att-spm-uncased-finetune-qa")


# Context and question
context = "ทรูมีแพ็กเกจอินเทอร์เน็ต 10GB ราคา 299 บาทต่อเดือน"
question = "แพ็กเกจอินเทอร์เน็ตมีกี่ GB?"

# Encode the inputs for the model
inputs = tokenizer(question, context, return_tensors='pt')

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    start_idx = torch.argmax(outputs.start_logits)
    end_idx = torch.argmax(outputs.end_logits)

# Decode the selected tokens back into the answer text
answer_tokens = inputs['input_ids'][0][start_idx:end_idx + 1]
answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)

print(f"คำตอบ: {answer}")  # "คำตอบ" means "Answer"



คำตอบ: 10


In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("phoner45/finetune-Question-Answer-thaiqa")
model = AutoModelForQuestionAnswering.from_pretrained("phoner45/finetune-Question-Answer-thaiqa")


# Context and question
context = "ทรูมีแพ็กเกจอินเทอร์เน็ต 10GB ราคา 299 บาทต่อเดือน"
question = "แพ็กเกจอินเทอร์เน็ตมีกี่ GB?"

# Encode the inputs for the model
inputs = tokenizer(question, context, return_tensors='pt')

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    start_idx = torch.argmax(outputs.start_logits)
    end_idx = torch.argmax(outputs.end_logits)

# Decode the selected tokens back into the answer text
answer_tokens = inputs['input_ids'][0][start_idx:end_idx + 1]
answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)

print(f"คำตอบ: {answer}")  # "คำตอบ" means "Answer"


loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--phoner45--finetune-Question-Answer-thaiqa/snapshots/881b671cbb4fd8412d30612f537ab16e3bed763d/config.json
Model config CamembertConfig {
  "architectures": [
    "CamembertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "camembert",
  "num_attention_head": 12,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.50.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 25005
}



model.safetensors:   0%|          | 0.00/419M [00:00<?, ?B/s]

loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--phoner45--finetune-Question-Answer-thaiqa/snapshots/881b671cbb4fd8412d30612f537ab16e3bed763d/model.safetensors
All model checkpoint weights were used when initializing CamembertForQuestionAnswering.

All the weights of CamembertForQuestionAnswering were initialized from the model checkpoint at phoner45/finetune-Question-Answer-thaiqa.
If your task is similar to the task the model of the checkpoint was trained on, you can already use CamembertForQuestionAnswering for predictions without further training.


คำตอบ: 10 ราคา 299 บาทต่อเดือน
