
# Text Summarization
In this notebook, we explore prompt engineering methods for summarizing dialogues, including zero-shot, one-shot, and few-shot prompting.

### Step 0: Install the required packages

In [None]:
%pip install -q huggingface_hub transformers datasets torch torchdata

### Step 1: Import the required libraries   

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig
from datasets import load_dataset

### Step 2: Summarize a dialogue without prompt engineering

In this step, we generate summaries without any prompt engineering to establish a baseline. We use the DialogSum dataset, in which each record pairs a dialogue with a human-written summary.


In [None]:
huggingface_dataset_name = 'knkarthick/dialogsum'
dataset = load_dataset(huggingface_dataset_name)


#### Inspecting a few samples from the dataset:

In [None]:
indices = [40, 50]
seperator = "=" * 100  # visual divider printed between samples

# Print the dialogue and its human-written summary for each selected training sample.
for i, index in enumerate(indices):
  sample = dataset['train'][index]
  print("Data sample number {}: \n\nDialogue is: \n{} \n\n\nSummary is:\n{}\n".format(i+1, sample['dialogue'], sample['summary']))
  print(seperator)


Data sample number 1: 

Dialogue is: 
#Person1#: I just bought a new dress. What do you think of it?
#Person2#: You look really great in it. So are you going to a job interview or a party?
#Person1#: No, I was invited to give a talk in my school.
#Person2#: So how much did you pay for it?
#Person1#: I pay just $70 for it. I saved $30.
#Person2#: That's really a bargain.
#Person1#: You're right. Well, what did you do while I was out shopping?
#Person2#: I watched TV for a while and then I did some reading. It wasn't a very interesting book so I just read a few pages. Then I took a shower.
#Person1#: I thought you said you were going to see Mike.
#Person2#: I'll go and visit him at his home tomorrow. He'll return home tomorrow morning.
#Person1#: I'm glad he can finally returned home after that accident. 


Summary is:
While #Person1# made a bargain to buy a new dress, #Person2# watched TV, read a boring book, and took a shower at home.

Data sample number 2: 

Dialogue is: 
#Person1#: Yo

Set up the model:

In [None]:
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)




In [None]:
indices = [40, 50]
seperator = "=" * 100  # visual divider printed between samples

for i, index in enumerate(indices):
  sentence = dataset['test'][index]['dialogue']
  summary = dataset['test'][index]['summary']
  # Baseline: feed the raw dialogue to the model with no instruction around it.
  inputs = tokenizer(sentence, return_tensors="pt")
  print("Data sample number {}: \n\nDialogue is: \n{} \n\n\nHuman Summary is:\n{}\n".format(i+1, sentence, summary))
  outputs = model.generate(**inputs, max_new_tokens=50)
  print("Model generated summary:\n{}".format(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]))
  print(seperator)

Data sample number 1: 

Dialogue is: 
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there. 


Human Summary is:
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.

Model generated summary:
Person1: It's ten to nine.
Data sample number 2: 

Dialogue is: 
#Person1#: Yeah. Just pull on this strip. Then peel off the back.
#Person2#: You might make a few enemies this way.
#Person1#: If they don't think this is fun, they're not meant to be our friends.
#Person2#: You mean your friends. I think it's cruel.
#Person1#: Yeah. But it's fun. Look at those two ugly old ladies. . . or are they men?
#Person2#: Hurry! Get a shot!. . . Hand it over!
#Person1#: I knew y

### Step 3.1: Zero-shot inference
We now use a zero-shot approach: the prompt contains an instruction asking the model to summarize the dialogue, but no solved examples.

In [None]:
for i, index in enumerate(indices):
  dialogue = dataset['test'][index]['dialogue']
  summary = dataset['test'][index]['summary']

  # Zero-shot prompt: an instruction plus the dialogue, with no solved examples.
  prompt = f"""
  Summarize the following conversation.

  {dialogue}

  Summary:
  """

  inputs = tokenizer(prompt, return_tensors="pt")
  print("Data sample number {}: \n\nDialogue is: \n{} \n\n\nHuman Summary is:\n{}\n".format(i+1, dialogue, summary))
  outputs = model.generate(**inputs, max_new_tokens=50)
  print("Model generated summary (ZERO SHOT):\n{}".format(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]))
  print(seperator)

Data sample number 1: 

Dialogue is: 
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there. 


Human Summary is:
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.

Model generated summary (ZERO SHOT):
The train is about to leave.
Data sample number 2: 

Dialogue is: 
#Person1#: Yeah. Just pull on this strip. Then peel off the back.
#Person2#: You might make a few enemies this way.
#Person1#: If they don't think this is fun, they're not meant to be our friends.
#Person2#: You mean your friends. I think it's cruel.
#Person1#: Yeah. But it's fun. Look at those two ugly old ladies. . . or are they men?
#Person2#: Hurry! Get a shot!. . . Hand it over!
#Pers

Next, we try a different instruction while still using the zero-shot method:

In [None]:
for i, index in enumerate(indices):
  dialogue = dataset['test'][index]['dialogue']
  summary = dataset['test'][index]['summary']

  # Same zero-shot setup, but the instruction is phrased as a question after the dialogue.
  prompt = f"""
  Dialogue:

  {dialogue}

  What was going on?
  """

  inputs = tokenizer(prompt, return_tensors="pt")
  print("Data sample number {}: \n\nDialogue is: \n{} \n\n\nHuman Summary is:\n{}\n".format(i+1, dialogue, summary))
  outputs = model.generate(**inputs, max_new_tokens=50)
  print("Model generated summary (ZERO SHOT):\n{}".format(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]))
  print(seperator)

Data sample number 1: 

Dialogue is: 
#Person1#: What time is it, Tom?
#Person2#: Just a minute. It's ten to nine by my watch.
#Person1#: Is it? I had no idea it was so late. I must be off now.
#Person2#: What's the hurry?
#Person1#: I must catch the nine-thirty train.
#Person2#: You've plenty of time yet. The railway station is very close. It won't take more than twenty minutes to get there. 


Human Summary is:
#Person1# is in a hurry to catch a train. Tom tells #Person1# there is plenty of time.

Model generated summary (ZERO SHOT):
Tom is late for the train.
Data sample number 2: 

Dialogue is: 
#Person1#: Yeah. Just pull on this strip. Then peel off the back.
#Person2#: You might make a few enemies this way.
#Person1#: If they don't think this is fun, they're not meant to be our friends.
#Person2#: You mean your friends. I think it's cruel.
#Person1#: Yeah. But it's fun. Look at those two ugly old ladies. . . or are they men?
#Person2#: Hurry! Get a shot!. . . Hand it over!
#Person
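
The two zero-shot cells above differ only in the instruction text wrapped around the dialogue. As a rough sketch, that variation can be captured with plain string templates so that new phrasings are easy to try. The names `ZERO_SHOT_TEMPLATES` and `build_zero_shot_prompt` and the template keys are hypothetical; only the two instruction texts come from this notebook. Unlike the cells above, this version adds no leading indentation to the prompt, so the model's output may differ slightly.

```python
# Hypothetical template registry; only the instruction wording comes from the notebook.
ZERO_SHOT_TEMPLATES = {
    "summarize": "Summarize the following conversation.\n\n{dialogue}\n\nSummary:",
    "what_was_going_on": "Dialogue:\n\n{dialogue}\n\nWhat was going on?",
}


def build_zero_shot_prompt(dialogue: str, template_name: str) -> str:
    """Fill one instruction template with a dialogue (no solved examples)."""
    return ZERO_SHOT_TEMPLATES[template_name].format(dialogue=dialogue.strip())


sample_dialogue = (
    "#Person1#: What time is it, Tom?\n"
    "#Person2#: Just a minute. It's ten to nine by my watch."
)

# Print every prompt variant for the same dialogue.
for name in ZERO_SHOT_TEMPLATES:
    print(f"--- {name} ---")
    print(build_zero_shot_prompt(sample_dialogue, name))
```

Each resulting string can be passed to `tokenizer(prompt, return_tensors="pt")` exactly as in the cells above.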

### Step 3.2: Few-shot inference
We now use a few-shot approach: the prompt includes a few solved examples (dialogues with their summaries) before the dialogue we want summarized.

In [None]:
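def prepare_prompt(example_indices=(0, 10, 20), index_to_summarize=50):
  """Build a few-shot prompt: solved examples followed by the dialogue to summarize."""
  prompt = ""
  # Each example shows the model a dialogue together with its human-written summary.
  for index in example_indices:
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']
    prompt += f"""
    Dialogue:

    {dialogue}

    What was going on?
    {summary}
    """

  # The final dialogue has no summary; the model is expected to complete it.
  dialogue_to_summarize = dataset['test'][index_to_summarize]['dialogue']
  prompt += f"""
    Dialogue:

    {dialogue_to_summarize}

    What was going on?
    """

  return prompt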
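
The introduction also mentions one-shot prompting, which is the same construction with a single solved example. The sketch below illustrates this with a dataset-free variant of the function above. The name `build_few_shot_prompt` and the toy dialogues are invented for illustration; the notebook itself draws its examples from `dataset['test']`.

```python
from typing import Sequence, Tuple

QUESTION = "What was going on?"


def build_few_shot_prompt(examples: Sequence[Tuple[str, str]], dialogue: str) -> str:
    """Concatenate solved (dialogue, summary) pairs, then the unsolved dialogue."""
    blocks = [
        f"Dialogue:\n\n{ex_dialogue}\n\n{QUESTION}\n{ex_summary}"
        for ex_dialogue, ex_summary in examples
    ]
    # The last block ends with the question so the model completes the answer.
    blocks.append(f"Dialogue:\n\n{dialogue}\n\n{QUESTION}\n")
    return "\n\n".join(blocks)


# Toy data invented for this sketch (not taken from the dataset).
toy_examples = [
    ("#Person1#: Is the library open?\n#Person2#: Yes, until six.",
     "#Person2# tells #Person1# the library is open until six."),
    ("#Person1#: Can I borrow your pen?\n#Person2#: Sure, here you go.",
     "#Person2# lends #Person1# a pen."),
]
new_dialogue = "#Person1#: What time is it, Tom?\n#Person2#: It's ten to nine."

zero_shot = build_few_shot_prompt([], new_dialogue)             # no examples
one_shot = build_few_shot_prompt(toy_examples[:1], new_dialogue)  # one example
few_shot = build_few_shot_prompt(toy_examples, new_dialogue)      # several examples
print(one_shot)
```

The only difference between the three prompts is how many solved examples precede the final dialogue.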



Now it is time to make a prediction and check the performance of this few-shot approach.

In [None]:
index_to_summarize = 66
# Build a prompt with five solved examples followed by the target dialogue.
few_shot_prompt = prepare_prompt(example_indices=[0, 4, 10, 20, 25], index_to_summarize=index_to_summarize)
inputs = tokenizer(few_shot_prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=1040)
# Compare the human-written summary with the model's few-shot output.
print("Human baseline summary:\n{}\n\n\nModel generated summary (FEW SHOT):\n{}".format(dataset['test'][index_to_summarize]['summary'], tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]))

Human baseline summary:
#Person1# is not satisfied with the steak and #Person2# will change it.


Model generated summary (FEW SHOT):
Person1 wants to have medium rare steak.
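
Reading the two summaries side by side gives only a qualitative impression. As a rough, hedged sketch, a word-overlap score can put a number on the comparison. The function `unigram_f1` below is a hypothetical helper that computes a simplified ROUGE-1-style F1 score. It is not the official ROUGE implementation and ignores stemming and word order.

```python
import re
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Word-overlap F1 between a reference summary and a candidate summary."""
    ref_tokens = re.findall(r"\w+", reference.lower())
    cand_tokens = re.findall(r"\w+", candidate.lower())
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# The two summaries printed by the few-shot cell above.
human = "#Person1# is not satisfied with the steak and #Person2# will change it."
model_output = "Person1 wants to have medium rare steak."
print(f"Unigram F1: {unigram_f1(human, model_output):.2f}")
```

A score near 1 means the two summaries share most of their words, and a score near 0 means they share almost none. A low score does not by itself prove that a summary is wrong, since different wording can express the same content.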
