AI Programming - SW Lee

# Lab 06: GPT2 Model for Language Understanding
## Exercise: Building a Korean Chatbot
This exercise is taken from the GitHub repository for "What is Natural Language Processing?" by Wonjoon Yu.<br>
https://github.com/ukairia777/tensorflow-nlp-tutorial

In [1]:
# Check Colab Environment
RunningInCOLAB = 'google.colab' in str(get_ipython())

# Import the notebook version of tqdm when running in Colab
if RunningInCOLAB:
    from tqdm.notebook import tqdm
else:
    from tqdm import tqdm

# Set the Keras backend to TensorFlow
import os
os.environ["KERAS_BACKEND"] = "tensorflow"

# Importing TensorFlow and Keras
import tensorflow as tf
import keras
# Importing required modules from the Transformers library
from transformers import AutoTokenizer
from transformers import TFGPT2LMHeadModel

`TFGPT2LMHeadModel` is the TensorFlow GPT-2 transformer with a language modeling head on top (a linear layer with weights tied to the input embeddings).

TensorFlow models in the Transformers library accept inputs in two formats: all inputs as keyword arguments (as in PyTorch models), or all inputs gathered as a list, tuple, or dict in the first positional argument. With the second option, there are three ways to gather the input tensors in the first positional argument:

- a single tensor with `input_ids` only and nothing else: `model(input_ids)`
- a list of varying length with one or several input tensors, in the order given in the docstring: `model([input_ids, attention_mask])` or `model([input_ids, attention_mask, token_type_ids])`
- a dictionary with one or several input tensors associated with the input names given in the docstring: `model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`

https://huggingface.co/transformers/v3.0.2/index.html

In [3]:
### START CODE HERE ###

# find & assign tokenizer and model; 'skt/kogpt2-base-v2'

# Initialize tokenizer for KoGPT-2 model with special tokens
tokenizer = AutoTokenizer.from_pretrained('skt/kogpt2-base-v2',
                                          bos_token='<s>',
                                          eos_token='</s>',
                                          pad_token='<pad>',
                                          unk_token='<unk>')                    # define various tokens while loading tokenizer
model = TFGPT2LMHeadModel.from_pretrained('skt/kogpt2-base-v2', from_pt=True)   # Load the KoGPT-2 model

### END CODE HERE ###

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFGPT2LMHeadModel: ['transformer.h.3.attn.masked_bias', 'transformer.h.6.attn.masked_bias', 'transformer.h.2.attn.masked_bias', 'lm_head.weight', 'transformer.h.9.attn.masked_bias', 'transformer.h.0.attn.masked_bias', 'transformer.h.10.attn.masked_bias', 'transformer.h.4.attn.masked_bias', 'transformer.h.5.attn.masked_bias', 'transformer.h.8.attn.masked_bias', 'transformer.h.1.attn.masked_bias', 'transformer.h.7.attn.masked_bias', 'transformer.h.11.attn.masked_bias']
- This IS expected if you are initializing TFGPT2LMHeadModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFGPT2LMHeadModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassifica

In [4]:
# Display a summary of the model
model.summary()

Model: "tfgpt2lm_head_model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLay  multiple                  125164032 
 er)                                                             
                                                                 
Total params: 125164032 (477.46 MB)
Trainable params: 125164032 (477.46 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [5]:
# Access the configuration settings of the model
model.config

GPT2Config {
  "_name_or_path": "skt/kogpt2-base-v2",
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "author": "Heewon Jeon(madjakarta@gmail.com)",
  "bos_token_id": 0,
  "created_date": "2021-04-28",
  "embd_pdrop": 0.1,
  "eos_token_id": 1,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "license": "CC-BY-NC-SA 4.0",
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "pad_token_id": 3,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
  

In [6]:
# Print the token IDs for special tokens
print(tokenizer.bos_token_id)
print(tokenizer.eos_token_id)
print(tokenizer.pad_token_id)
print(tokenizer.unk_token_id)

print('-' * 10)

# Decode and print the tokens for the first 10 IDs
for i in range(10):
    print(i, tokenizer.decode(i))
print(tokenizer.decode(51200)) # Decode token ID 51200

0
1
3
5
----------
0 <s>
1 </s>
2 <usr>
3 <pad>
4 <sys>
5 <unk>
6 <mask>
7 <d>
8 </d>
9 <unused0>



In [7]:
# Import library
import pandas as pd
import urllib.request

Download a Korean chatbot dataset created by songys: <br>
https://github.com/songys/Chatbot_data <br>
For more Korean datasets, see: <br>
https://github.com/ko-nlp/Korpora

In [8]:
# Download the dataset from the provided URL and save it as ChatBotData.csv
urllib.request.urlretrieve("https://raw.githubusercontent.com/songys/Chatbot_data/master/ChatbotData.csv",
                           filename="ChatBotData.csv")
# Load the downloaded CSV file into a pandas DataFrame
train_data = pd.read_csv('ChatBotData.csv')

In [9]:
# Display the contents of the train_data DataFrame
display(train_data)

| | Q | A | label |
|---|---|---|---|
| 0 | 12시 땡! | 하루가 또 가네요. | 0 |
| 1 | 1지망 학교 떨어졌어 | 위로해 드립니다. | 0 |
| 2 | 3박4일 놀러가고 싶다 | 여행은 언제나 좋죠. | 0 |
| 3 | 3박4일 정도 놀러가고 싶다 | 여행은 언제나 좋죠. | 0 |
| 4 | PPL 심하네 | 눈살이 찌푸려지죠. | 0 |
| ... | ... | ... | ... |
| 11818 | 훔쳐보는 것도 눈치 보임. | 티가 나니까 눈치가 보이는 거죠! | 2 |
| 11819 | 훔쳐보는 것도 눈치 보임. | 훔쳐보는 거 티나나봐요. | 2 |
| 11820 | 흑기사 해주는 짝남. | 설렜겠어요. | 2 |
| 11821 | 힘든 연애 좋은 연애라는게 무슨 차이일까? | 잘 헤어질 수 있는 사이 여부인 거 같아요. | 2 |


In [10]:
def get_chat_data():

    # Get the token ID for the beginning of sentence (BOS) token
    bos_token = tokenizer.bos_token_id
    # Get the token ID for the end of sentence (EOS) token
    eos_token = tokenizer.eos_token_id
    # Get the token ID for the unknown (UNK) token
    unk_token = tokenizer.unk_token_id
    # Get the maximum token value based on the model's vocab size
    max_token_value = model.config.vocab_size

    conversations = [] # Initialize an empty list to store conversations

    # Iterate over each question and answer pair in training data
    for question, answer in zip(train_data.Q.to_list(), train_data.A.to_list()):

        ### START CODE HERE ###

        # Encode the question and answer as a tokenized sequence
        qna_line = tokenizer.encode('<usr>' + question + '<sys>' + answer)  # encode q & a dialog line

        # Initialize a list to store the dialog tokens with the BOS token
        dialog = [bos_token]

        # Loop through the encoded tokens
        for token in qna_line:
            if token < max_token_value:   # If the token is within the model's vocabulary size
                dialog.append(token)      # Append the token to the dialog list
            else:                         # If the token is out of vocabulary
                dialog.append(unk_token)  # Replace it with the unknown token

        dialog.append(eos_token)          # Append the EOS token to the dialog
        ### END CODE HERE ###

        conversations.append(dialog)      # Add the dialog to the conversations list
    return conversations                  # Return the list of conversations
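
The sequence layout built above, `<s> <usr> question <sys> answer </s>` with out-of-range IDs mapped to the unknown token, can be illustrated without the tokenizer. The following is a minimal sketch on toy token IDs: `build_dialog` is a hypothetical helper, the special-token IDs 0, 1 and 5 mirror the values printed earlier, and the vocabulary size of 10 is made up for the example.

```python
def build_dialog(token_ids, bos_id, eos_id, unk_id, vocab_size):
    """Wrap encoded tokens as [BOS] ... [EOS], mapping out-of-range IDs to UNK."""
    body = [tok if tok < vocab_size else unk_id for tok in token_ids]
    return [bos_id] + body + [eos_id]


# Toy input: the ID 12 is outside the assumed vocabulary of 10 tokens
dialog = build_dialog([2, 7, 12, 4, 9], bos_id=0, eos_id=1, unk_id=5, vocab_size=10)
print(dialog)  # [0, 2, 7, 5, 4, 9, 1]
```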


In [11]:
# Pad the sequences of chat data using the pad token ID to ensure uniform length
chat_data = keras.utils.pad_sequences(get_chat_data(), padding='post', value=tokenizer.pad_token_id)
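
To make the effect of `padding='post'` concrete, here is a small pure-Python sketch of post-padding to the longest sequence. `pad_post` is a hypothetical stand-in written for illustration; the real `keras.utils.pad_sequences` returns a NumPy array and supports further options such as `maxlen` and truncation.

```python
def pad_post(sequences, pad_id):
    """Right-pad every sequence with pad_id up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Two toy dialogs of different lengths, padded with ID 3 (the <pad> ID above)
padded = pad_post([[0, 2, 4, 1], [0, 2, 8, 9, 4, 7, 1]], pad_id=3)
for row in padded:
    print(row)
```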

In [12]:
buffer = 500     # Set buffer size for shuffling
batch_size = 32  # Set batch size for the dataset

# Create a TensorFlow dataset
dataset = tf.data.Dataset.from_tensor_slices(chat_data)
dataset = dataset.shuffle(buffer).batch(batch_size, drop_remainder=True)

In [13]:
# Take one batch from the dataset and print its shape and the first sequence
for batch in dataset.take(1):
    print(batch.shape)
    print(batch[0])

(32, 47)
tf.Tensor(
[    0     2 25883 14701  7991     4 10586  9802  9846  9122  8046  8084
   376     1     3     3     3     3     3     3     3     3     3     3
     3     3     3     3     3     3     3     3     3     3     3     3
     3     3     3     3     3     3     3     3     3     3     3], shape=(47,), dtype=int32)


In [14]:
# Decode the first sequence in the batch and print it
# (stored as `decoded` so that the built-in `str` is not shadowed)
decoded = tokenizer.decode(batch[0])
print(decoded)

<s><usr> 공시생이야<sys> 좋은 결과 있을 거예요!</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>


In [15]:
# Encode the string back into token IDs and print it
print(tokenizer.encode(decoded))

[0, 2, 25883, 14701, 7991, 4, 10586, 9802, 9846, 9122, 8046, 8084, 376, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]


In [16]:
# Initialize the Adam optimizer
adam = keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08)

In [17]:
# Calculate the number of training steps
# Note: because the dataset uses drop_remainder=True, it yields
# len(train_data) // batch_size batches, one fewer than this count
steps = len(train_data) // batch_size + 1
print(steps)

370
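
The relationship between the step count and the number of batches a batched dataset actually yields can be checked with plain arithmetic. This is a sketch under assumptions: `num_batches` is a hypothetical helper, and the example size of 100 is a toy value rather than the size of the chatbot dataset.

```python
def num_batches(num_examples, batch_size, drop_remainder):
    """Count the batches produced when splitting num_examples into batches."""
    full, leftover = divmod(num_examples, batch_size)
    if drop_remainder or leftover == 0:
        return full
    return full + 1  # the final, smaller batch is kept


print(num_batches(100, 32, drop_remainder=True))   # 3
print(num_batches(100, 32, drop_remainder=False))  # 4
```

With `drop_remainder=True`, the trailing partial batch is discarded, so a progress bar sized with `// batch_size + 1` stops one step short of its total.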


In standard text generation fine-tuning, the model predicts the next token given the text seen so far, so the labels are simply the input token IDs shifted by one position. The Hugging Face GPT-2 language-modeling head performs this shift internally when it computes the loss, while the causal (look-ahead) mask prevents each position from attending to later tokens. We can therefore pass `labels=input_ids`.
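
As an illustration of what shifted labels mean, the sketch below computes a causal language-modeling loss by hand on toy values. It is not the library's implementation: `causal_lm_loss` is a hypothetical helper, and the per-step probabilities are invented numbers standing in for the model's softmax output.

```python
import math


def causal_lm_loss(step_probs, input_ids):
    """Mean negative log-likelihood of each next token.

    step_probs[t] maps token IDs to the probability assigned to them
    after the model has read input_ids[: t + 1].
    """
    targets = input_ids[1:]        # labels are the inputs shifted left by one
    predictions = step_probs[:-1]  # the last position has no next token to predict
    nll = [-math.log(probs[target]) for probs, target in zip(predictions, targets)]
    return sum(nll) / len(nll)


ids = [0, 2, 7, 1]  # toy sequence: <s>, two tokens, </s>
probs = [{2: 0.5}, {7: 0.25}, {1: 0.5}, {3: 1.0}]  # made-up probabilities
loss = causal_lm_loss(probs, ids)
print(loss)
```

Position `t` is scored against token `t + 1`, which is why the same tensor can serve as both input and label.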

In [18]:
EPOCHS = 3 # Set the number of training epochs

# Loop over the number of epochs
for epoch in range(EPOCHS):
    epoch_loss = 0 # Initialize loss

    # Iterate through the dataset
    for batch in tqdm(dataset, total=steps):
        # Record operations
        with tf.GradientTape() as tape:

            ### START CODE HERE ###
            # Forward pass through the model, computing the loss
            result = model(batch, labels=batch)
            # Extract the loss value
            loss = result[0]
            # Calculate the average loss
            batch_loss = tf.reduce_mean(loss)
            ### END CODE HERE ###

        # Compute the gradients of the loss
        grads = tape.gradient(batch_loss, model.trainable_variables)
        # Apply the gradients to the model's weights
        adam.apply_gradients(zip(grads, model.trainable_variables))
        # Accumulate the average loss
        epoch_loss += batch_loss / steps

    # Print the average loss
    print('[Epoch: {:>4}] cost = {:>.9}'.format(epoch + 1, epoch_loss))

[Epoch:    1] cost = 1.19391048


[Epoch:    2] cost = 0.923893332


[Epoch:    3] cost = 0.746905565


In [19]:
# Add user and sys tags to the text
text = '오늘도 좋은 하루!'
sent = '<usr>' + text + '<sys>'

In [20]:
# Encode the sentence with the BOS token
input_ids = [tokenizer.bos_token_id] + tokenizer.encode(sent)
# Convert the result into a TensorFlow tensor
input_ids = tf.convert_to_tensor([input_ids])

In [21]:
# Generate text from the model
output = model.generate(input_ids, max_length=50, do_sample=True, eos_token_id=tokenizer.eos_token_id)

In [22]:
# Decode the token IDs to a string
decoded_sentence = tokenizer.decode(output[0].numpy().tolist())
# Get the part after <sys> and remove the end token </s>
decoded_sentence.split('<sys> ')[1].replace('</s>', '')

'그 날 하루 조금 더 참고 기다려봐요.'

In [23]:
# Generate text from the model with top-k sampling (k=10)
output = model.generate(input_ids, max_length=50, do_sample=True, top_k=10)
# Decode the generated token IDs
tokenizer.decode(output[0].numpy().tolist())

'<s><usr> 오늘도 좋은 하루!<sys> 좋은 하루!</s>'
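
The idea behind `top_k` can be sketched in a few lines: keep only the `k` highest-scoring tokens and sample among them in proportion to their softmax weights. This is a simplified illustration, not the code path used by `model.generate`; `sample_top_k` and the toy logits are assumptions made for the example.

```python
import math
import random


def sample_top_k(logits, k, rng):
    """Sample one token index from the k largest logits."""
    # Indices of the k highest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits (shifted by the max for stability)
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 0.5, 1.5, -1.0, 0.1]  # toy scores for a 5-token vocabulary
draws = {sample_top_k(logits, k=2, rng=rng) for _ in range(200)}
print(draws)  # only indices 0 and 2 can ever be drawn
```

A smaller `k` makes the output more conservative, and `k=1` reduces to greedy decoding.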

In [24]:
# Function defined to receive chatbot response based on user input
def return_answer_by_chatbot(user_text):
    # Format the input by adding usr and sys tags
    sent = '<usr>' + user_text + '<sys>'
    # Encode the input sentence and add the BOS token
    input_ids = [tokenizer.bos_token_id] + tokenizer.encode(sent)
    # Convert the input IDs to a TensorFlow tensor
    input_ids = tf.convert_to_tensor([input_ids])
    # Generate a response from the model
    output = model.generate(input_ids, max_length=50, do_sample=True, top_k=20)
    # Decode the output tokens
    sentence = tokenizer.decode(output[0].numpy().tolist())
    # Extract the chatbot's response
    chatbot_response = sentence.split('<sys> ')[1].replace('</s>', '')
    # Return the chatbot's response
    return chatbot_response
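
Extracting the reply with `split('<sys> ')[1]` raises an `IndexError` if the decoded text has no space after `<sys>`, and it leaves any trailing `<pad>` tokens in place. A slightly more defensive variant might look like the sketch below; `extract_response` is a hypothetical helper, not part of the original exercise.

```python
def extract_response(decoded, sys_tag='<sys>', stop_tags=('</s>', '<pad>')):
    """Return the text after sys_tag, cut at the first stop tag."""
    _, found, tail = decoded.partition(sys_tag)
    if not found:
        return ''  # no system tag in the decoded text
    for tag in stop_tags:
        tail = tail.split(tag)[0]
    return tail.strip()


print(extract_response('<s><usr> 오늘도 좋은 하루!<sys> 좋은 하루!</s>'))  # 좋은 하루!
```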

In [25]:
# Get the chatbot's response for the given user input
return_answer_by_chatbot('안녕! 반가워~')

'반갑습니다.'

In [26]:
# Get the chatbot's response for the given user input
return_answer_by_chatbot('너는 누구야?')

'누구야?'

In [27]:
# Get the chatbot's response for the given user input
return_answer_by_chatbot('나랑 영화보자')

'영화도 보고 뮤지컬도 볼 수 있는 곳이 좋겠어요.'

In [28]:
# Get the chatbot's response for the given user input
return_answer_by_chatbot('너무 심심한데 나랑 놀자')

'좋은 소식도 많을 거 같아요.'

In [29]:
# Get the chatbot's response for the given user input
return_answer_by_chatbot('영화 해리포터 재밌어?')

'좋아하는 영화 해봐도 될까요.'

In [30]:
# Get the chatbot's response for the given user input
return_answer_by_chatbot('너 딥 러닝 잘해?')

'딥 러닝은 스스로 더 잘할 거예요.'

In [31]:
# Get the chatbot's response for the given user input
return_answer_by_chatbot('커피 한 잔 할까?')

'깔끔한걸 좋아하면 할 수 있어요.'

(c) 2024 SW Lee