AI Programming - SW Lee

# Lab 06: GPT2 Model for Language Understanding
## Exercise: Building a Korean Chatbot
This exercise is taken from the GitHub repository for "What is Natural Language Processing?" by Wonjoon Yu.<br>
https://github.com/ukairia777/tensorflow-nlp-tutorial

In [None]:
RunningInCOLAB = 'google.colab' in str(get_ipython())  # check whether the notebook is running in Google Colab

if RunningInCOLAB:  # running in Google Colab
    from tqdm.notebook import tqdm  # use the notebook version of tqdm in Colab
else:
    from tqdm import tqdm  # otherwise use the standard tqdm

import os  # file and directory utilities
os.environ["KERAS_BACKEND"] = "tensorflow"  # set the Keras backend to TensorFlow

import tensorflow as tf  # TensorFlow
import keras  # Keras
from transformers import AutoTokenizer  # Hugging Face tokenizer loader
from transformers import TFGPT2LMHeadModel  # Hugging Face GPT-2 model for TensorFlow

`TFGPT2LMHeadModel` is the GPT-2 transformer for TensorFlow with a language modeling head on top (a linear layer whose weights are tied to the input embeddings).

TensorFlow models in `transformers` accept inputs in two formats: all inputs as keyword arguments, or all inputs gathered in the first positional argument. With the second option, there are three ways to gather the input tensors in the first positional argument:

a single tensor with `input_ids` only and nothing else: `model(input_ids)`

a list of varying length with one or several input tensors, in the order given in the docstring: `model([input_ids, attention_mask])` or `model([input_ids, attention_mask, token_type_ids])`

a dictionary with one or several input tensors keyed by the input names given in the docstring: `model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`

https://huggingface.co/transformers/v3.0.2/index.html

In [None]:
### START CODE HERE ###

# find & assign tokenizer and model; 'skt/kogpt2-base-v2'

# Load the tokenizer for 'skt/kogpt2-base-v2' and set its special tokens explicitly:
# bos_token='</s>', eos_token='</s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>'
tokenizer = AutoTokenizer.from_pretrained('skt/kogpt2-base-v2', bos_token='</s>', eos_token='</s>', unk_token='<unk>',
                                          pad_token='<pad>', mask_token='<mask>')

# Load 'skt/kogpt2-base-v2' as a TFGPT2LMHeadModel.
# from_pt=True converts the PyTorch checkpoint weights into the TensorFlow model.
model = TFGPT2LMHeadModel.from_pretrained('skt/kogpt2-base-v2', from_pt=True)

### END CODE HERE ###

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFGPT2LMHeadModel: ['transformer.h.8.attn.masked_bias', 'transformer.h.9.attn.masked_bias', 'transformer.h.5.attn.masked_bias', 'transformer.h.2.attn.masked_bias', 'transformer.h.4.attn.masked_bias', 'transformer.h.10.attn.masked_bias', 'transformer.h.6.attn.masked_bias', 'lm_head.weight', 'transformer.h.11.attn.masked_bias', 'transformer.h.7.attn.masked_bias', 'transformer.h.1.attn.masked_bias', 'transformer.h.3.attn.masked_bias', 'transformer.h.0.attn.masked_bias']
- This IS expected if you are ini

In [None]:
model.summary()  # print a summary of the model

Model: "tfgpt2lm_head_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLay  multiple                  125164032 
 er)                                                             
                                                                 
Total params: 125164032 (477.46 MB)
Trainable params: 125164032 (477.46 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [None]:
model.config  # show the model configuration

GPT2Config {
  "_name_or_path": "skt/kogpt2-base-v2",
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "author": "Heewon Jeon(madjakarta@gmail.com)",
  "bos_token_id": 0,
  "created_date": "2021-04-28",
  "embd_pdrop": 0.1,
  "eos_token_id": 1,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "license": "CC-BY-NC-SA 4.0",
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "pad_token_id": 3,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
  

In [None]:
print(tokenizer.bos_token_id)  # id of the tokenizer's bos_token (set to '</s>' above)
print(tokenizer.eos_token_id)  # id of the tokenizer's eos_token
print(tokenizer.pad_token_id)  # id of the tokenizer's pad_token
print(tokenizer.unk_token_id)  # id of the tokenizer's unk_token

print('-' * 10)  # separator line

for i in range(10):  # loop over ids 0 to 9
    print(i, tokenizer.decode(i))  # show what each id decodes to
print(tokenizer.decode(51200))  # show what id 51200 decodes to

1
1
3
5
----------
0 <s>
1 </s>
2 <usr>
3 <pad>
4 <sys>
5 <unk>
6 <mask>
7 <d>
8 </d>
9 <unused0>



In [None]:
import pandas as pd  # DataFrame library
import urllib.request  # URL handling, used to download the dataset

Import a Korean chatbot dataset made by songys: <br>
https://github.com/songys/Chatbot_data <br>
To find more Korean datasets, check this site: <br>
https://github.com/ko-nlp/Korpora

In [None]:
urllib.request.urlretrieve("https://raw.githubusercontent.com/songys/Chatbot_data/master/ChatbotData.csv",
                           filename="ChatBotData.csv")  # download the chatbot data
train_data = pd.read_csv('ChatBotData.csv')  # read the chatbot data into a DataFrame

In [None]:
display(train_data)  # show the chatbot data

Unnamed: 0,Q,A,label
0,12시 땡!,하루가 또 가네요.,0
1,1지망 학교 떨어졌어,위로해 드립니다.,0
2,3박4일 놀러가고 싶다,여행은 언제나 좋죠.,0
3,3박4일 정도 놀러가고 싶다,여행은 언제나 좋죠.,0
4,PPL 심하네,눈살이 찌푸려지죠.,0
...,...,...,...
11818,훔쳐보는 것도 눈치 보임.,티가 나니까 눈치가 보이는 거죠!,2
11819,훔쳐보는 것도 눈치 보임.,훔쳐보는 거 티나나봐요.,2
11820,흑기사 해주는 짝남.,설렜겠어요.,2
11821,힘든 연애 좋은 연애라는게 무슨 차이일까?,잘 헤어질 수 있는 사이 여부인 거 같아요.,2


In [None]:
def get_chat_data():  # encode the chatbot data with the tokenizer

    bos_token = tokenizer.bos_token_id          # beginning-of-sentence token id
    eos_token = tokenizer.eos_token_id          # end-of-sentence token id
    unk_token = tokenizer.unk_token_id          # unknown-word token id
    max_token_value = model.config.vocab_size   # vocabulary size; valid token ids are below this value

    conversations = []  # list that collects the encoded dialogs
    for question, answer in zip(train_data.Q.to_list(), train_data.A.to_list()):  # iterate over question/answer pairs

        ### START CODE HERE ###

        qna_line = tokenizer.encode('<usr>' + question + '<sys>' + answer)  # encode the Q&A dialog line

        dialog = [bos_token]        # replace out-of-range tokens with unk and enclose the line with bos and eos
        for token in qna_line:      # walk through the encoded tokens one by one
            if token < max_token_value:   # the token id is inside the model vocabulary
                dialog.append(token)      # keep the token
            else:
                dialog.append(unk_token)  # the id is >= vocab_size, so substitute unk_token
        dialog.append(eos_token)  # close the dialog with eos_token

        ### END CODE HERE ###

        conversations.append(dialog)  # add the dialog to the list
    return conversations  # return all encoded dialogs
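
The inner loop of `get_chat_data` can be read in isolation as the minimal sketch below. It runs without the tokenizer or the model: `build_dialog` is a hypothetical helper name, and the token ids and vocabulary size are made-up values chosen only for illustration.

```python
def build_dialog(token_ids, bos_id, eos_id, unk_id, vocab_size):
    """Wrap encoded tokens with BOS/EOS and replace out-of-range ids with UNK."""
    dialog = [bos_id]
    for token in token_ids:
        # Valid embedding indices are 0 .. vocab_size - 1.
        dialog.append(token if 0 <= token < vocab_size else unk_id)
    dialog.append(eos_id)
    return dialog


# Hypothetical ids: 1 = BOS and EOS, 5 = UNK, vocabulary of 10 entries.
example = build_dialog([2, 7, 12, 4, 9], bos_id=1, eos_id=1, unk_id=5, vocab_size=10)
print(example)  # [1, 2, 7, 5, 4, 9, 1]
```

The id 12 falls outside the toy vocabulary, so it is replaced by the UNK id before the sequence reaches the embedding layer.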


In [None]:
chat_data = keras.utils.pad_sequences(get_chat_data(), padding='post', value=tokenizer.pad_token_id)  # pad the encoded dialogs
# Padding makes every dialog the same length so they can be stacked into one array.
# padding='post' appends the padding after the tokens,
# and value=tokenizer.pad_token_id uses the tokenizer's pad token as the fill value.
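
As a rough illustration of what post-padding does when no maximum length is given, here is a pure-Python sketch. `pad_post` is a hypothetical helper written for this example, not part of Keras, and the token ids are made up.

```python
def pad_post(sequences, value):
    """Right-pad every sequence with `value` up to the length of the longest one."""
    max_len = max((len(seq) for seq in sequences), default=0)
    return [list(seq) + [value] * (max_len - len(seq)) for seq in sequences]


# Two toy dialogs of different lengths, padded with the id 3.
padded = pad_post([[1, 2, 9, 1], [1, 2, 4, 8, 6, 1]], value=3)
for row in padded:
    print(row)
# [1, 2, 9, 1, 3, 3]
# [1, 2, 4, 8, 6, 1]
```

The shorter dialog gains trailing pad ids, which matches the run of `3` values visible at the end of each row in the batches printed later.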

In [None]:
buffer = 500  # shuffle buffer size
batch_size = 32  # batch size

dataset = tf.data.Dataset.from_tensor_slices(chat_data)  # build a dataset from the padded chat_data array
dataset = dataset.shuffle(buffer).batch(batch_size, drop_remainder=True)  # shuffle, then group into batches (dropping the last partial batch)

In [None]:
for batch in dataset.take(1):  # take one batch from the dataset
    print(batch.shape)  # shape of the batch
    print(batch[0])  # first example in the batch

(32, 47)
tf.Tensor(
[    1     2 18875  9169 14278 10811     4 24721  6975  7098 25856 29104
  9782 16130 12026  7661 25856     1     3     3     3     3     3     3
     3     3     3     3     3     3     3     3     3     3     3     3
     3     3     3     3     3     3     3     3     3     3     3], shape=(47,), dtype=int32)


In [None]:
decoded_text = tokenizer.decode(batch[0])  # decode the first example (named decoded_text to avoid shadowing the built-in str)
print(decoded_text)  # show the decoded text

</s><usr> 결정적인 물증이 없어<sys> 안타깝네요. 증거를 지금이라도 모아봐요.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>


In [None]:
print(tokenizer.encode(decoded_text))  # re-encode the decoded text to check the round trip

[1, 2, 18875, 9169, 14278, 10811, 4, 24721, 6975, 7098, 25856, 29104, 9782, 16130, 12026, 7661, 25856, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]


In [None]:
adam = keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08)  # create the Adam optimizer

steps = len(train_data) // batch_size + 1  # steps per epoch; with drop_remainder=True the dataset yields one batch fewer
print(steps)  # show the step count

370
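
The relation between the dataset size, the batch size, and `drop_remainder` can be checked with a small sketch. `count_batches` is a hypothetical helper, and the sizes below are illustrative rather than the real dataset size.

```python
def count_batches(num_examples, batch_size, drop_remainder):
    """Number of batches produced when grouping num_examples into batch_size."""
    full, leftover = divmod(num_examples, batch_size)
    if drop_remainder or leftover == 0:
        return full  # the trailing partial batch is discarded, or there is none
    return full + 1  # the trailing partial batch is kept


print(count_batches(100, 32, drop_remainder=True))   # 3
print(count_batches(100, 32, drop_remainder=False))  # 4
```

Because the dataset above is built with `drop_remainder=True`, the progress bar total and the divisor used for the epoch loss are one step larger than the number of batches actually processed; the effect on the reported loss is small.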


In standard text-generation fine-tuning, the model predicts the next token given the text seen so far, so the labels are simply the encoded input shifted by one position. The Hugging Face GPT-2 language modeling head performs this shift inside the model when it computes the loss, while the look-ahead mask of the causal language model (CLM) keeps each position from seeing the tokens that follow it. Therefore, we can set `labels=input_ids`.
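
To make the shift concrete, the following pure-Python sketch computes a next-token cross-entropy by hand: the logits at position t are scored against the token at position t + 1. It is an illustration under simplified assumptions (one sequence, no padding mask), not the library's implementation, and `shifted_lm_loss` is a hypothetical name.

```python
import math


def shifted_lm_loss(logits, input_ids):
    """Mean next-token cross-entropy: position t is scored against token t + 1."""
    losses = []
    # Drop the last prediction and the first label so the two sequences line up.
    for step_logits, target in zip(logits[:-1], input_ids[1:]):
        # Numerically stable log of the softmax normalizer.
        peak = max(step_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in step_logits))
        losses.append(log_norm - step_logits[target])
    return sum(losses) / len(losses)


# Toy vocabulary of 4 tokens and a 3-token sequence with uniform logits.
input_ids = [0, 2, 3]
logits = [[0.0] * 4 for _ in input_ids]
loss = shifted_lm_loss(logits, input_ids)
print(loss)  # equals log(4), the loss of a uniform guess over 4 tokens
```

Passing the same ids as inputs and labels therefore still trains next-token prediction, because the alignment is handled at loss time.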

In [None]:
EPOCHS = 3  # number of epochs

for epoch in range(EPOCHS):  # loop over the epochs
    epoch_loss = 0  # reset the accumulated loss for this epoch

    for batch in tqdm(dataset, total=steps):  # iterate over the batches
        with tf.GradientTape() as tape:  # record operations for automatic differentiation

            ### START CODE HERE ###

            result = model(input_ids=batch, labels=batch, training=True)  # forward pass with the inputs also used as labels
            loss = result.loss  # loss returned by the model
            batch_loss = tf.reduce_mean(loss)  # average the loss over the batch

            ### END CODE HERE ###

        grads = tape.gradient(batch_loss, model.trainable_variables)  # compute the gradients
        adam.apply_gradients(zip(grads, model.trainable_variables))  # apply the gradients to update the model
        epoch_loss += batch_loss / steps  # accumulate the mean loss for the epoch

    print('[Epoch: {:>4}] cost = {:>.9}'.format(epoch + 1, epoch_loss))  # report the epoch number and loss

[Epoch:    1] cost = 1.25280488
[Epoch:    2] cost = 1.00508964
[Epoch:    3] cost = 0.888379931


In [None]:
text = '오늘도 좋은 하루!'  # user utterance
sent = '<usr>' + text + '<sys>'  # wrap the utterance with the user and system markers

In [None]:
input_ids = [tokenizer.bos_token_id] + tokenizer.encode(sent)  # prepend the bos token id to the encoded prompt
input_ids = tf.convert_to_tensor([input_ids])  # convert to a tensor with a batch dimension

In [None]:
# Generate text with the model, using input_ids as the prompt.
# max_length: maximum total length of the sequence (prompt plus generated tokens)
# do_sample: if True, sample the next token instead of decoding greedily
# eos_token_id: id of the token that ends generation (tokenizer.eos_token_id here)
output = model.generate(input_ids, max_length=50, do_sample=True, eos_token_id=tokenizer.eos_token_id)

In [None]:
# Decode the generated token ids.
# output is a tensor, so convert it with numpy() and tolist() before passing it to tokenizer.decode().
decoded_sentence = tokenizer.decode(output[0].numpy().tolist())

# The decoded text contains both the '<usr>' and '<sys>' parts, so keep only the text after '<sys> '.
# The reply ends with '</s>', so strip that token before displaying it.
decoded_sentence.split('<sys> ')[1].replace('</s>', '')

'하루하루를 마음껏 즐기세요.'

In [None]:
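# Generate again, this time with top-k sampling.
# max_length: maximum total length of the sequence
# do_sample: enable sampling
# top_k: sample only from the k most probable next tokens
output = model.generate(input_ids, max_length=50, do_sample=True, top_k=10)
# Decode the generated token ids and show the full text
tokenizer.decode(output[0].numpy().tolist())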

'</s><usr> 오늘도 좋은 하루!<sys> 오늘은 오늘보다 나은 하루를 보내보세요.</s>'
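
The idea behind the `top_k` argument can be sketched in a few lines of pure Python: keep the k highest-scoring candidates, renormalize them with a softmax, and draw one. This is a simplified illustration of top-k sampling for a single step, not the code `generate` runs; `sample_top_k` and the logits below are hypothetical.

```python
import math
import random


def sample_top_k(logits, k, rng=random):
    """Sample one token id from the k highest-scoring logits."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights restricted to those k candidates.
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
logits = [2.0, 0.5, 3.0, -1.0, 1.0]
draws = {sample_top_k(logits, k=2, rng=rng) for _ in range(200)}
print(draws)  # only ids 0 and 2 can appear, since they hold the two largest logits
```

A smaller k makes replies more conservative, while a larger k lets lower-probability tokens through, which is why repeated calls with sampling enabled can return different answers.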

In [None]:
# Return the chatbot's answer to a user utterance
def return_answer_by_chatbot(user_text):
    sent = '<usr>' + user_text + '<sys>'  # wrap the user text with the user and system markers
    input_ids = [tokenizer.bos_token_id] + tokenizer.encode(sent)  # prepend the bos token id to the encoded prompt
    input_ids = tf.convert_to_tensor([input_ids])  # convert to a tensor with a batch dimension
    output = model.generate(input_ids, max_length=50, do_sample=True, top_k=20)  # sample with top_k=20, up to 50 tokens in total
    sentence = tokenizer.decode(output[0].numpy().tolist())  # convert the tensor to a list of ids and decode it
    chatbot_response = sentence.split('<sys> ')[1].replace('</s>', '')  # keep only the system part and strip '</s>'
    return chatbot_response  # return the chatbot's answer
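
The expression `sentence.split('<sys> ')[1]` raises an `IndexError` when the decoded text does not contain `'<sys> '` with a trailing space. A slightly more defensive variant is sketched below; `extract_reply` is a hypothetical helper, and returning an empty string for a missing marker is an assumption of this sketch, not behavior from the original notebook.

```python
def extract_reply(decoded, sys_token='<sys>', eos_token='</s>'):
    """Return the text between the first <sys> marker and the next </s>."""
    _, found, tail = decoded.partition(sys_token)
    if not found:
        return ''  # the marker is missing, so there is no reply to extract
    reply, _, _ = tail.partition(eos_token)  # drop '</s>' and anything after it
    return reply.strip()


sample = '</s><usr> 오늘도 좋은 하루!<sys> 오늘은 오늘보다 나은 하루를 보내보세요.</s>'
print(extract_reply(sample))  # 오늘은 오늘보다 나은 하루를 보내보세요.
```

Cutting at the first `</s>` also discards any tokens that follow the end-of-sentence marker, such as trailing `<pad>` tokens.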

In [24]:
return_answer_by_chatbot('안녕! 반가워~')

'안녕! 반가워요!'

In [25]:
return_answer_by_chatbot('너는 누구야?')

'당신은 누구예요.'

In [26]:
return_answer_by_chatbot('나랑 영화보자')

'영화보세요.'

In [27]:
return_answer_by_chatbot('너무 심심한데 나랑 놀자')

'심심해서 그럴 수도 있으니까요.'

In [28]:
return_answer_by_chatbot('영화 해리포터 재밌어?')

'영화 재미있게 보세요.'

In [29]:
return_answer_by_chatbot('너 딥 러닝 잘해?')

'바쁘거나 감정의 변화가 생겼을 수도 있어요.'

In [30]:
return_answer_by_chatbot('커피 한 잔 할까?')

'저도 커피 좋아해요.'

(c) 2024 SW Lee