


Prompt engineering in practice.
Task: text summarization.




First, check the versions of the basic libraries.
Then import the needed packages.

In [140]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig
from transformers import TrainingArguments
from transformers import Trainer

import torch
import time
import evaluate
import pandas as pd
import numpy as np




Load the dataset with the `load_dataset` function from the `datasets` library. Reference: https://huggingface.co/docs/datasets/v1.11.0/loading_datasets.html 
I am going to use the `knkarthick/dialogsum` dataset for this project.

In [5]:
dataset = load_dataset("knkarthick/dialogsum")
dataset

Found cached dataset csv (/Users/sokim/.cache/huggingface/datasets/knkarthick___csv/knkarthick--dialogsum-c8fac5d84cd35861/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)
100%|██████████| 3/3 [00:00<00:00, 538.70it/s]


DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
})





This dataset has 12460 rows in train, 
                  1500 rows in test,
                   500 rows in validation.

In [12]:
dataset['train'][0]

{'id': 'test_0_1',
 'summary': 'Ms. Dawson helps #Person1# to write a memo to inform every employee that they have to change the communication method and should not use Instant Messaging anymore.',
 'topic': 'communication method'}

In [13]:
dataset['test'][10]

{'id': 'test_3_2',
 'dialogue': "#Person1#: Happy Birthday, this is for you, Brian.\n#Person2#: I'm so happy you remember, please come in and enjoy the party. Everyone's here, I'm sure you have a good time.\n#Person1#: Brian, may I have a pleasure to have a dance with you?\n#Person2#: Ok.\n#Person1#: This is really wonderful party.\n#Person2#: Yes, you are always popular with everyone. and you look very pretty today.\n#Person1#: Thanks, that's very kind of you to say. I hope my necklace goes with my dress, and they both make me look good I feel.\n#Person2#: You look great, you are absolutely glowing.\n#Person1#: Thanks, this is a fine party. We should have a drink together to celebrate your birthday",
 'summary': "#Person1# attends Brian's birthday party. Brian thinks #Person1# looks great and charming.",
 'topic': 'birthday party'}

In [14]:
dataset['validation'][20]

{'id': 'dev_20',
 'dialogue': "#Person1#: Did you know that drinking beer helps you sing better?\n#Person2#: Are you sure? How do you know?\n#Person1#: Well, usually people think I'm a terrible singer, but after we all have a few beers, they say I sound a lot better!\n#Person2#: Well, I heard that if you drink enough beer, you can speak foreign languages better. . .\n#Person1#: Then after a few beers, you'll be singing in Taiwanese?\n#Person2#: Maybe. . .",
 'summary': '#Person1# says drinking beer helps sing better, but #Person2# heard it helps speaking foreign languages.',
 'topic': 'drinking beer'}





So each datapoint has four fields: id, dialogue, summary and topic.

In [16]:
dataset['test'][100]

{'id': 'test_33_2',
 'dialogue': "#Person1#: OK, that's a cut! Let's start from the beginning, everyone.\n#Person2#: What was the problem that time?\n#Person1#: The feeling was all wrong, Mike. She is telling you that she doesn't want to see you any more, but I want to get more anger from you. You're acting hurt and sad, but that's not how your character would act in this situation.\n#Person2#: But Jason and Laura have been together for three years. Don't you think his reaction would be one of both anger and sadness?\n#Person1#: At this point, no. I think he would react the way most guys would, and then later on, we would see his real feelings.\n#Person2#: I'm not so sure about that.\n#Person1#: Let's try it my way, and you can see how you feel when you're saying your lines. After that, if it still doesn't feel right, we can try something else.",
 'summary': "#Person1# and Mike have a disagreement on how to act out a scene. #Person1# proposes that Mike can try to act in #Person1#'s way




The dialogue is a conversation between two people,
and the summary was written by a human annotator.
Now we can load the FLAN-T5 model and generate summaries with it.




Load the FLAN-T5 model. The model name is 'google/flan-t5-base'. Reference: https://huggingface.co/google/flan-t5-base 

In [18]:
model_name = 'google/flan-t5-base'
flant5_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)




What the model looks like:

In [146]:
flant5_model

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo):




Load the tokenizer.
It converts text into sequences of integer token IDs.

In [19]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

Downloading (…)okenizer_config.json: 100%|██████████| 2.54k/2.54k [00:00<00:00, 7.44MB/s]
Downloading spiece.model: 100%|██████████| 792k/792k [00:00<00:00, 19.4MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 2.42M/2.42M [00:00<00:00, 15.7MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 2.20k/2.20k [00:00<00:00, 6.16MB/s]






To test how the tokenizer works, encode an example sentence:

In [21]:
example = "Hi Soo, how are you?, hope great things will happen to you."

encoded = tokenizer(example, return_tensors='pt')
encoded

{'input_ids': tensor([[2018,  264,   32,    6,  149,   33,   25,   58,    6,  897,  248,  378,
           56, 1837,   12,   25,    5,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}





The encoded output has `input_ids` and `attention_mask`.

In [22]:
encoded['input_ids']

tensor([[2018,  264,   32,    6,  149,   33,   25,   58,    6,  897,  248,  378,
           56, 1837,   12,   25,    5,    1]])

In [23]:
encoded['input_ids'][0]

tensor([2018,  264,   32,    6,  149,   33,   25,   58,    6,  897,  248,  378,
          56, 1837,   12,   25,    5,    1])
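The attention mask above is all ones because only one sentence was encoded, so nothing had to be padded. When sequences of different lengths are batched together, the shorter ones are padded and the mask marks the padding positions with 0. Here is a toy sketch of that idea in plain Python. `pad_batch` is a hypothetical helper written for illustration, not the tokenizer's own code, and the ID lists are hand-written. I am assuming pad ID 0, which is what T5 uses.

```python
# Toy illustration only: hand-written ID lists, not real tokenizer output.
PAD_ID = 0  # assumption: T5's pad token id


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should not attend to
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[2018, 264, 32, 1], [149, 1]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The second sequence gets two pad IDs appended, and its mask has zeros in exactly those two positions.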




To get the encoded example back to the text example:

In [25]:
decoded = tokenizer.decode(encoded['input_ids'][0])
decoded

'Hi Soo, how are you?, hope great things will happen to you.</s>'



To drop the end-of-sequence token `</s>`, pass `skip_special_tokens=True`:

In [34]:
decoded = tokenizer.decode(encoded['input_ids'][0], skip_special_tokens = True)
decoded

'Hi Soo, how are you?, hope great things will happen to you.'





Now I will see how well the FLAN-T5 model performs at summarization. 

In [42]:
dialog = dataset['test'][10]['dialogue']
dialog

"#Person1#: Happy Birthday, this is for you, Brian.\n#Person2#: I'm so happy you remember, please come in and enjoy the party. Everyone's here, I'm sure you have a good time.\n#Person1#: Brian, may I have a pleasure to have a dance with you?\n#Person2#: Ok.\n#Person1#: This is really wonderful party.\n#Person2#: Yes, you are always popular with everyone. and you look very pretty today.\n#Person1#: Thanks, that's very kind of you to say. I hope my necklace goes with my dress, and they both make me look good I feel.\n#Person2#: You look great, you are absolutely glowing.\n#Person1#: Thanks, this is a fine party. We should have a drink together to celebrate your birthday"

In [43]:
human_summary = dataset['test'][10]['summary']
human_summary

"#Person1# attends Brian's birthday party. Brian thinks #Person1# looks great and charming."



To get the model summary,
tokenize the dialog and feed it into the model:

In [45]:
inputs = tokenizer(dialog, return_tensors='pt')
inputs["input_ids"]

tensor([[ 1713,   345, 13515,   536,  4663,    10,  5574, 13753,     6,    48,
            19,    21,    25,     6,  7798,     5,  1713,   345, 13515,   357,
          4663,    10,    27,    31,    51,    78,  1095,    25,  1423,     6,
           754,   369,    16,    11,   777,     8,  1088,     5,  6656,    31,
             7,   270,     6,    27,    31,    51,   417,    25,    43,     3,
             9,   207,    97,     5,  1713,   345, 13515,   536,  4663,    10,
          7798,     6,   164,    27,    43,     3,     9,  5565,    12,    43,
             3,     9,  2595,    28,    25,    58,  1713,   345, 13515,   357,
          4663,    10,  8872,     5,  1713,   345, 13515,   536,  4663,    10,
           100,    19,   310,  1627,  1088,     5,  1713,   345, 13515,   357,
          4663,    10,  2163,     6,    25,    33,   373,  1012,    28,   921,
             5,    11,    25,   320,   182,  1134,   469,     5,  1713,   345,
         13515,   536,  4663,    10,  1333,     6,  

In [49]:
output = flant5_model.generate(inputs["input_ids"])
output

tensor([[   0, 7798,    6, 2763,   25,   21, 1107,   12,   69, 1088,    5,    1]])

In [53]:
decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)
decoded_output

'Brian, thank you for coming to our party.'



This is the model's output for the raw dialog, with no instruction in the prompt. 
I think it is a pretty good summary. 
Now I will test zero-shot inference, where the prompt contains an instruction but no solved example. 

In [64]:
def zero_shot(index):
    '''
    index is the position of the dialog in the test split; the user can choose it. 
    prints the dialog, the human summary, and the model summary
    '''
    
    dialog = dataset['test'][index]['dialogue']
    human_summary = dataset['test'][index]['summary']
    
    prompt = 'Summarize this converation: {}'.format(dialog)
    
    encoded_input = tokenizer(prompt, return_tensors='pt')
    decoded_output = tokenizer.decode(
                              flant5_model.generate(
                              encoded_input['input_ids']
                              )[0], 
                              skip_special_tokens=True
        
    )
    
    print('dialog:')
    print(dialog)
    print('human summary: {}'.format(human_summary))
    print('model summary: {}'.format(decoded_output))

In [65]:
zero_shot(10)

dialog:
#Person1#: Happy Birthday, this is for you, Brian.
#Person2#: I'm so happy you remember, please come in and enjoy the party. Everyone's here, I'm sure you have a good time.
#Person1#: Brian, may I have a pleasure to have a dance with you?
#Person2#: Ok.
#Person1#: This is really wonderful party.
#Person2#: Yes, you are always popular with everyone. and you look very pretty today.
#Person1#: Thanks, that's very kind of you to say. I hope my necklace goes with my dress, and they both make me look good I feel.
#Person2#: You look great, you are absolutely glowing.
#Person1#: Thanks, this is a fine party. We should have a drink together to celebrate your birthday
human summary: #Person1# attends Brian's birthday party. Brian thinks #Person1# looks great and charming.
model summary: Brian's birthday is today.





Comparing the human summary with the model summary, 
I personally like the model summary better. 
Now I will try one-shot inference and see how the summarization changes. 

In [68]:
def one_shot(index):
    '''
    index is the position of the dialog in the test split; the user can choose it. 
    prints the dialog, the human summary, and the model summary for comparison
    '''
    
    dialog = dataset['test'][index]['dialogue']
    human_summary = dataset['test'][index]['summary']
    
    prompt = 'Summarize this converation: {}. Summary: {}'.format(dialog, human_summary)
    
    encoded_input = tokenizer(prompt, return_tensors='pt')
    decoded_output = tokenizer.decode(
                              flant5_model.generate(
                              encoded_input['input_ids']
                              )[0], 
                              skip_special_tokens=True
        
    )
    
    print('dialog:')
    print(dialog)
    print('human summary: {}'.format(human_summary))
    print('model summary: {}'.format(decoded_output))

In [69]:
one_shot(10)

dialog:
#Person1#: Happy Birthday, this is for you, Brian.
#Person2#: I'm so happy you remember, please come in and enjoy the party. Everyone's here, I'm sure you have a good time.
#Person1#: Brian, may I have a pleasure to have a dance with you?
#Person2#: Ok.
#Person1#: This is really wonderful party.
#Person2#: Yes, you are always popular with everyone. and you look very pretty today.
#Person1#: Thanks, that's very kind of you to say. I hope my necklace goes with my dress, and they both make me look good I feel.
#Person2#: You look great, you are absolutely glowing.
#Person1#: Thanks, this is a fine party. We should have a drink together to celebrate your birthday
human summary: #Person1# attends Brian's birthday party. Brian thinks #Person1# looks great and charming.
model summary: Brian thinks #Person1# looks great and charming.





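In the one_shot function, the prompt now includes one human-written summary, hence one-shot inference. 
With it, the model summary takes on the human style, such as the #Person1# tag. 
Note that the example here is the human summary of the same dialog, so the model can simply copy part of it. A proper one-shot prompt uses a different dialog as the example. 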




Create a prompt that includes one or a few human examples. 

In [209]:
def prompt_generator(indices, to_summarize):
    '''
    indices is a list of one or more indices in the test split; each one's dialog and human summary are given as a solved example in the prompt. 
    to_summarize is the index of the dialog (from the test split) to summarize.
    returns the prompt: the solved example(s) followed by the dialog to summarize. 
    '''
    
    #start with an empty prompt
    prompt = ''
    
    for i in indices: 
        dialog = dataset['test'][i]['dialogue']
        human_summary = dataset['test'][i]['summary']
    
        prompt += 'Conversation: {} Summarize: {}'.format(dialog, human_summary)
    
    
    dialog_to_sum = dataset['test'][to_summarize]['dialogue']
    
    prompt += 'Conversation: {} Summarize: '.format(dialog_to_sum)
    
    return prompt 
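prompt_generator joins everything into one long string, with nothing between one example's summary and the next "Conversation:". Here is a sketch of a variant that puts a blank line between examples and works on plain dicts instead of the dataset. `build_few_shot_prompt` and the sample records are made up for illustration. I have not tested whether this layout gives better summaries with FLAN-T5.

```python
def build_few_shot_prompt(examples, target_dialog):
    """Join solved examples and the unsolved target into one prompt string."""
    parts = []
    for ex in examples:
        # each solved example shows a dialog followed by its human summary
        parts.append("Conversation:\n{}\nSummary: {}".format(ex["dialogue"], ex["summary"]))
    # the target dialog ends with an open 'Summary:' for the model to complete
    parts.append("Conversation:\n{}\nSummary:".format(target_dialog))
    return "\n\n".join(parts)


# Made-up records that mimic the dataset's 'dialogue' and 'summary' fields.
examples = [
    {
        "dialogue": "#Person1#: Is the bus late?\n#Person2#: Yes, by ten minutes.",
        "summary": "#Person2# tells #Person1# the bus is ten minutes late.",
    },
]
target = "#Person1#: Happy Birthday, Brian.\n#Person2#: Thanks for coming."
prompt = build_few_shot_prompt(examples, target)
print(prompt)
```

The examples are passed in explicitly, so it is easy to check that the dialog being summarized is not also one of the solved examples.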




To create a one-shot inference prompt: 

In [210]:
prompt_generator([5], 10)

"Conversation: #Person1#: You're finally here! What took so long?\n#Person2#: I got stuck in traffic again. There was a terrible traffic jam near the Carrefour intersection.\n#Person1#: It's always rather congested down there during rush hour. Maybe you should try to find a different route to get home.\n#Person2#: I don't think it can be avoided, to be honest.\n#Person1#: perhaps it would be better if you started taking public transport system to work.\n#Person2#: I think it's something that I'll have to consider. The public transport system is pretty good.\n#Person1#: It would be better for the environment, too.\n#Person2#: I know. I feel bad about how much my car is adding to the pollution problem in this city.\n#Person1#: Taking the subway would be a lot less stressful than driving as well.\n#Person2#: The only problem is that I'm going to really miss having the freedom that you have with a car.\n#Person1#: Well, when it's nicer outside, you can start biking to work. That will give 




To create a few-shot inference prompt:

In [132]:
prompt_generator([5, 20, 30], 10)

"Conversation: #Person1#: You're finally here! What took so long?\n#Person2#: I got stuck in traffic again. There was a terrible traffic jam near the Carrefour intersection.\n#Person1#: It's always rather congested down there during rush hour. Maybe you should try to find a different route to get home.\n#Person2#: I don't think it can be avoided, to be honest.\n#Person1#: perhaps it would be better if you started taking public transport system to work.\n#Person2#: I think it's something that I'll have to consider. The public transport system is pretty good.\n#Person1#: It would be better for the environment, too.\n#Person2#: I know. I feel bad about how much my car is adding to the pollution problem in this city.\n#Person1#: Taking the subway would be a lot less stressful than driving as well.\n#Person2#: The only problem is that I'm going to really miss having the freedom that you have with a car.\n#Person1#: Well, when it's nicer outside, you can start biking to work. That will give 




Use a few-shot prompt and generate the model summary. 

In [133]:
prompt = prompt_generator([5, 20], 10)
prompt

"Conversation: #Person1#: You're finally here! What took so long?\n#Person2#: I got stuck in traffic again. There was a terrible traffic jam near the Carrefour intersection.\n#Person1#: It's always rather congested down there during rush hour. Maybe you should try to find a different route to get home.\n#Person2#: I don't think it can be avoided, to be honest.\n#Person1#: perhaps it would be better if you started taking public transport system to work.\n#Person2#: I think it's something that I'll have to consider. The public transport system is pretty good.\n#Person1#: It would be better for the environment, too.\n#Person2#: I know. I feel bad about how much my car is adding to the pollution problem in this city.\n#Person1#: Taking the subway would be a lot less stressful than driving as well.\n#Person2#: The only problem is that I'm going to really miss having the freedom that you have with a car.\n#Person1#: Well, when it's nicer outside, you can start biking to work. That will give 

In [134]:
inputs = tokenizer(prompt, return_tensors='pt')
inputs['input_ids'][0][0:10]

tensor([28941,    10,  1713,   345, 13515,   536,  4663,    10,   148,    31])

In [135]:
decoded_outputs = tokenizer.decode(
                  flant5_model.generate(
                        inputs['input_ids']
                  )[0], 
                  skip_special_tokens=True
)


decoded_outputs

"#Person1#: Happy birthday, Brian. #Person2#: I'"

In [136]:
dataset['test'][10]['summary']

"#Person1# attends Brian's birthday party. Brian thinks #Person1# looks great and charming."



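Interestingly, few-shot inference performed worse here than zero-shot or one-shot inference: the output echoes the start of the dialog and stops mid-sentence. The cutoff is probably the default generation length limit, since `generate` was called without a length argument. A prompt holding several full dialogs may also be longer than the model handles well. 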
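One thing that could be hurting the few-shot result is prompt length, since every extra example adds a whole dialog. Here is a rough sketch of keeping only as many examples as fit in a budget. Everything in it is an assumption for illustration: `rough_token_count` counts whitespace-separated words, which is only a crude stand-in for real tokens (a real tokenizer usually produces more), and the budget of 512 is a placeholder to tune, not a measured limit.

```python
def rough_token_count(text):
    """Crude proxy for token count: whitespace-separated words."""
    return len(text.split())


def select_examples(examples, target_dialog, budget=512):
    """Greedily keep examples, in order, while the estimated prompt stays within budget."""
    used = rough_token_count(target_dialog)
    kept = []
    for ex in examples:
        cost = rough_token_count(ex["dialogue"]) + rough_token_count(ex["summary"])
        if used + cost > budget:
            break  # stop at the first example that does not fit
        kept.append(ex)
        used += cost
    return kept


# Made-up examples: 200-word dialogs with 2-word summaries.
examples = [{"dialogue": "word " * 200, "summary": "short summary"} for _ in range(3)]
kept = select_examples(examples, "word " * 100, budget=512)
print(len(kept))
```

With a 100-word target, two of the three examples fit (100 + 202 + 202 = 504) and the third would push the estimate past the budget.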

In [150]:
flant5_model.named_parameters()

<generator object Module.named_parameters at 0x2abd74d60>

In [182]:
def find_trainable_param(model):
    # count all parameters, and separately those that will be updated during training
    trainable_param = 0
    total_param = 0
    for _, param in model.named_parameters():
        total_param += param.numel()
        if param.requires_grad:
            trainable_param += param.numel()
    return "trainable model parameter: {}\ntotal model param: {}".format(trainable_param, total_param)

In [180]:
find_trainable_param(flant5_model)

'trainable model parameter: 247577856\ntotal model param: 247577856'
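As a sanity check, the total can be rebuilt by hand from the layer shapes in the printed model. The hidden size, feed-forward size, vocabulary size and relative attention bias shape come from the printout above. The rest are my assumptions, because the printout is cut off: 12 encoder and 12 decoder blocks, a decoder block with self-attention, cross-attention and the same feed-forward layer, one relative attention bias per stack, and an output head that is not tied to the input embedding.

```python
d_model, d_ff, vocab = 768, 2048, 32128  # visible in the printed model
n_layers = 12                             # assumption: 12 encoder and 12 decoder blocks
rel_bias = 32 * 12                        # relative_attention_bias: Embedding(32, 12)

attention = 4 * d_model * d_model         # q, k, v, o projections, no biases
feed_forward = 3 * d_model * d_ff         # wi_0, wi_1, wo, no biases
layer_norm = d_model                      # T5LayerNorm holds one weight vector

encoder_block = attention + feed_forward + 2 * layer_norm
# assumption: decoder block = self-attention + cross-attention + feed-forward
decoder_block = 2 * attention + feed_forward + 3 * layer_norm

# each stack also has one relative attention bias and a final layer norm
encoder = n_layers * encoder_block + rel_bias + layer_norm
decoder = n_layers * decoder_block + rel_bias + layer_norm

embedding = vocab * d_model               # shared input embedding
lm_head = vocab * d_model                 # assumption: output head not tied to the embedding

total = embedding + lm_head + encoder + decoder
print(total)
```

Under these assumptions the sum comes to 247577856, the same number find_trainable_param reports.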

In [213]:
index = 100

dialog = dataset['test'][index]['dialogue']
human_summary = dataset['test'][index]['summary']

In [214]:
prompt = f'''
Summarize the conversation:
{dialog}

Summary:
'''

In [215]:
inputs = tokenizer(prompt, return_tensors='pt')
outputs = tokenizer.decode(
            flant5_model.generate(
            inputs["input_ids"]
            )[0], 
            skip_special_tokens=True
)

In [216]:
print(prompt)


Summarize the conversation:
#Person1#: OK, that's a cut! Let's start from the beginning, everyone.
#Person2#: What was the problem that time?
#Person1#: The feeling was all wrong, Mike. She is telling you that she doesn't want to see you any more, but I want to get more anger from you. You're acting hurt and sad, but that's not how your character would act in this situation.
#Person2#: But Jason and Laura have been together for three years. Don't you think his reaction would be one of both anger and sadness?
#Person1#: At this point, no. I think he would react the way most guys would, and then later on, we would see his real feelings.
#Person2#: I'm not so sure about that.
#Person1#: Let's try it my way, and you can see how you feel when you're saying your lines. After that, if it still doesn't feel right, we can try something else.

Summary:



In [217]:
human_summary

"#Person1# and Mike have a disagreement on how to act out a scene. #Person1# proposes that Mike can try to act in #Person1#'s way."

In [218]:
outputs

'The two of them are trying to figure out how to express their feelings.'




Prepare the data for fine-tuning. 

In [225]:
def tokenize_function(example):
    # called with batched=True, so example["dialogue"] and example["summary"] are lists
    start_prompt = 'Summarize this conversation:\n'
    end_prompt = '\nSummary: '
    # wrap every dialog in the instruction prompt
    prompt = [start_prompt + dialog + end_prompt for dialog in example["dialogue"]]
    example["input_ids"] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors='pt').input_ids
    example["labels"] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors='pt').input_ids
    
    return example

In [226]:
# batched=True passes lists of rows, which the list comprehension above expects
tokenized_datasets = dataset.map(tokenize_function, batched=True)
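The list comprehension in tokenize_function loops over example["dialogue"], so it matters whether that field is a list of dialogs or a single string. `Dataset.map` passes one row at a time by default, and lists of rows when called with `batched=True`. Here is a small sketch of the difference in plain Python, with no tokenizer involved. `build_prompts` and the sample inputs are made up for illustration.

```python
start_prompt = "Summarize this conversation:"
end_prompt = "Summary:"


def build_prompts(example):
    # same kind of comprehension as in tokenize_function, minus the tokenizer call
    return [start_prompt + dialog + end_prompt for dialog in example["dialogue"]]


# Batched call: each field holds a list of rows.
batched = {"dialogue": ["#Person1#: Hi.", "#Person1#: Bye."]}
# Unbatched call: each field holds a single row, here one string.
single = {"dialogue": "#Person1#: Hi."}

print(len(build_prompts(batched)))  # one prompt per dialog
print(len(build_prompts(single)))   # one prompt per character of the dialog
```

In the unbatched case, looping over a string yields its characters, so the function builds one prompt per character instead of one per dialog.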

                                                                  

In [227]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'labels'],
        num_rows: 12460
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'labels'],
        num_rows: 1500
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'labels'],
        num_rows: 500
    })
})

In [228]:
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 12460
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 1500
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 500
    })
})
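One thing to keep in mind about the labels: they were padded to the maximum length, so most positions in a short summary are pad tokens. A common practice, which this notebook does not apply, is to replace the pad IDs in the labels with -100 so the loss skips them. Here is a sketch in plain Python. `mask_padding` is a hypothetical helper and the IDs are made up. I am assuming pad ID 0 for T5, and -100 because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
PAD_ID = 0           # assumption: T5's pad token id
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding(label_ids, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Replace padding positions in a label sequence so the loss skips them."""
    return [ignore_index if tok == pad_id else tok for tok in label_ids]


labels = [7798, 31, 7, 3591, 5, 1, 0, 0, 0]  # made-up ids, padded with 0
print(mask_padding(labels))
```

Only the three trailing pad positions change. The end-of-sequence ID 1 stays, so the model still learns where a summary ends.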

In [229]:
tokenized_datasets = tokenized_datasets.filter(lambda example, index : index % 100 == 0, with_indices=True)
tokenized_datasets

                                                                    

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 125
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 15
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 5
    })
})
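The new row counts follow directly from the filter: keeping rows whose index is a multiple of 100 keeps indices 0, 100, 200 and so on. A quick check in plain Python, using the split sizes printed earlier:

```python
split_sizes = {"train": 12460, "test": 1500, "validation": 500}

# range(0, n, 100) lists exactly the indices that pass index % 100 == 0
kept = {name: len(range(0, n, 100)) for name, n in split_sizes.items()}
print(kept)
```

This gives 125, 15 and 5 rows, matching the filtered DatasetDict above.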

In [230]:
tokenized_datasets['train'].shape

(125, 2)

In [231]:
tokenized_datasets['validation'].shape

(5, 2)

In [235]:
tokenized_datasets['test'].shape

(15, 2)

In [239]:
output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1  # a positive max_steps overrides num_train_epochs, so this runs a single step
)

trainer = Trainer(
    model=flant5_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)