In [1]:
# !pip install --upgrade pip

In [2]:
# !pip install --disable-pip-version-check torch torchdata --quiet

In [3]:
# !pip install transformers datasets

In [4]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

#### Summarize Dialogue without Prompt Engineering

Here, we generate a summary of a dialogue with FLAN-T5, a pre-trained LLM from Hugging Face.
Let's load some simple dialogues from the `dialogsum` Hugging Face dataset. This dataset contains 10k+ dialogues with
corresponding manually labelled summaries and topics.

In [5]:
huggingface_dataset_name = 'knkarthick/dialogsum'
dataset = load_dataset(huggingface_dataset_name)

In [6]:
# print a couple of dialogues with their baseline summaries
example_indices = [40, 200]
dash_line = '-' * 99  # 99-character separator line
for i, index in enumerate(example_indices):
    print(dash_line)
    print('Example',i+1)
    print(dash_line)
    print('INPUT DIALOGUE:')
    print(dataset['test'][index]['dialogue'])
    print(dash_line)
    print('BASELINE HUMAN SUMMARY:')
    print(dataset['test'][index]['summary'])
    print(dash_line)
    print()
    

---------------------------------------------------------------------------------------------------
Example 1
---------------------------------------------------------------------------------------------------
INPUT DIALOGUE:
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there.
---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Exam

Now we will generate summaries of these dialogues with our model.

Load the `Flan-T5` model by creating an instance of the `AutoModelForSeq2SeqLM` class with the `.from_pretrained()` method.

In [7]:
model_name = 'google/flan-t5-base'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)




To perform encoding and decoding, you need the text in tokenized form. Tokenization is the process of splitting text
into smaller units (tokens), each mapped to an integer id that the LLM can process.
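
As a rough illustration of the idea, here is a toy tokenizer in plain Python. It is only a sketch: the vocabulary `TOY_VOCAB` and the helpers `toy_encode` and `toy_decode` are made up for this example, and the real `Flan-T5` tokenizer uses a learned subword vocabulary rather than a hand-written word list.

```python
import re

# Hypothetical, hand-written vocabulary (not the real Flan-T5 vocabulary).
TOY_VOCAB = ["what", "time", "is", "it", ",", "tom", "?"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(TOY_VOCAB)}


def toy_encode(text):
    """Split text into lowercase words and punctuation marks, then map each piece to its id."""
    pieces = re.findall(r"\w+|[^\w\s]", text.lower())
    return [TOKEN_TO_ID[piece] for piece in pieces]


def toy_decode(ids):
    """Map ids back to their tokens."""
    return [TOY_VOCAB[i] for i in ids]


toy_ids = toy_encode("What time is it, Tom?")
toy_tokens = toy_decode(toy_ids)
print(toy_ids)
print(toy_tokens)
```

The model never sees the characters themselves, only the list of ids; decoding maps the ids back to text pieces.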

Download the tokenizer for the `Flan-T5` model using the `AutoTokenizer.from_pretrained()` method. The `use_fast`
parameter switches on the fast tokenizer.

In [8]:
tokenizer = AutoTokenizer.from_pretrained(model_name,use_fast=True)


The model and tokenizer classes used above all come from the Hugging Face `transformers` library.

#### Test the tokenizer by encoding and decoding a sentence

In [9]:
sentence = 'What time is it, Tom?'
sentence_encoded = tokenizer(sentence, return_tensors='pt')

sentence_decoded = tokenizer.decode(sentence_encoded['input_ids'][0], skip_special_tokens=True)
print("Encoded Sentence: ")
print(sentence_encoded['input_ids'][0])
print('\n Decoded Sentence: ')
print(sentence_decoded)

Encoded Sentence: 
tensor([ 363,   97,   19,   34,    6, 3059,   58,    1])

 Decoded Sentence: 
What time is it, Tom?
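
Note that the encoded tensor holds eight ids although the sentence splits into seven pieces. In T5-family tokenizers, id `1` is the end-of-sequence token `</s>`, which is appended automatically, and `skip_special_tokens=True` drops it when decoding. The sketch below mimics that filtering in plain Python on the ids printed above. The names `SPECIAL_IDS` and `strip_special_ids` are illustrative and not part of `transformers`, and the set of special ids is simplified (the real tokenizer also treats other tokens, such as the unknown token, as special).

```python
# Ids printed by the cell above for 'What time is it, Tom?'
encoded_ids = [363, 97, 19, 34, 6, 3059, 58, 1]

# Simplifying assumption: only <pad> (id 0) and </s> (id 1) are treated as special here.
SPECIAL_IDS = {0, 1}


def strip_special_ids(ids, special_ids=SPECIAL_IDS):
    """Return the ids with special-token ids removed, preserving order."""
    return [i for i in ids if i not in special_ids]


content_ids = strip_special_ids(encoded_ids)
print(len(encoded_ids), len(content_ids))
print(content_ids)
```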


Now, it's time to explore how well the base LLM summarizes a dialogue without any prompt engineering. Prompt
engineering is the practice of a human revising the prompt to improve the model's response for a given task.
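
To make the distinction concrete, the sketch below contrasts a raw dialogue with the same dialogue wrapped in an instruction. The helper `build_prompt` and the instruction wording are hypothetical, chosen only for illustration; they are not a template prescribed by the model or the dataset.

```python
def build_prompt(dialogue, instruction="Summarize the following conversation."):
    """Wrap a dialogue in a plain-text instruction (hypothetical template)."""
    return f"{instruction}\n\n{dialogue}\n\nSummary:"


dialogue = (
    "#Person1#: What time is it, Tom?\n"
    "#Person2#: Just a minute. It's ten to nine by my watch."
)

# Without prompt engineering: the model input is the dialogue itself.
raw_input_text = dialogue

# With prompt engineering: the dialogue is framed by an explicit instruction.
engineered_input_text = build_prompt(dialogue)

print(engineered_input_text)
```

In this section only the raw form is passed to the model, so the output shows what the base model does when given no instruction at all.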

In [None]:
for i, index in enumerate(example_indices):
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']
    inputs = tokenizer(dialogue, return_tensors='pt')
    outputs = tokenizer.decod