<a href="https://colab.research.google.com/github/elliemci/chatbots/blob/main/summarize_title.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarization as a Title

In [1]:
!pip install transformers



In [2]:
from transformers import pipeline

In [3]:
# text to summarize
text = "The US has passed the peak on new coronavirus cases, \
according to the director of the US Centers for Disease Control and Prevention. \
The US has over 635,000 confirmed Covid-19 infections and over 30,800 deaths, \
the highest for any country in the world. \
At the daily White House coronavirus briefing on Wednesday, \
Dr. Anthony Fauci said that \"the important issue is what happens next\", \
adding that there is \"vigorous\" work to mitigate the virus in the US. \
\"We have passed the peak,\" he said. \"We're starting to see the leveling off and coming down.\" \
Fauci showed a series of charts at the briefing to support his assessment, \
showing that deaths, hospitalizations and other key metrics were down in New York, \
which has been hit hardest by the virus."

Explore community models at https://huggingface.co/models.
Models suitable for summarization and headline generation:

* T5-base
* T5-large
* facebook/bart-large-cnn
* google/pegasus-large
* moussaKam/barthez-orangesum-title
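
One way to compare candidates such as those listed above on the same input is a small helper that maps each model name to its generated title. This is only a minimal sketch: `compare_titles` and its `summarizers` argument are hypothetical names, and it assumes each callable returns a list of dicts with a `summary_text` key, as the `summarization` pipeline does.

```python
from typing import Callable, Dict


def compare_titles(text: str, summarizers: Dict[str, Callable], **gen_kwargs) -> Dict[str, str]:
    """Run every summarizer on the same text and collect the titles by name."""
    titles = {}
    for name, summarize in summarizers.items():
        # Each callable is assumed to return [{"summary_text": ...}]
        titles[name] = summarize(text, **gen_kwargs)[0]["summary_text"]
    return titles


# Possible usage with real pipelines (downloads each checkpoint):
# from transformers import pipeline
# names = ["t5-base", "facebook/bart-large-cnn"]
# summarizers = {n: pipeline("summarization", model=n) for n in names}
# print(compare_titles(text, summarizers, min_length=5, max_length=40, do_sample=False))
```

Passing the pipelines in as plain callables keeps the helper independent of any particular model, so the same generation settings can be applied to every candidate.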


In [7]:
# Instantiate the pipeline for the "summarization" task ("text2text-generation" is an alternative task name); passing model= overrides the task's default checkpoint
title_t5hlg = pipeline("summarization",model="rg089/t5-headline-generation")

# Alternatively, load the tokenizer and model directly from the Hugging Face Hub (not used by the pipeline call below)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("rg089/t5-headline-generation")
model = AutoModelForSeq2SeqLM.from_pretrained("rg089/t5-headline-generation")

summary_title = title_t5hlg(text, min_length=5, max_length=40, do_sample=False)[0]['summary_text']
print(summary_title)

US passes peak on new Covid-19 cases, over 635,000 cases, highest in the world
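
The cell above also loads `tokenizer` and `model` directly, but the pipeline call never uses them. A rough sketch of how they could be used instead is shown here. `generate_title` is a hypothetical helper, and `prefix` is an assumption: some T5 checkpoints expect a task prefix such as `summarize: `, so the model card should be checked before setting it.

```python
def generate_title(model, tokenizer, text, prefix="", min_length=5, max_length=40):
    """Generate a title with a seq2seq model and its tokenizer, bypassing pipeline()."""
    # Tokenize the (optionally prefixed) text, truncating over-long inputs,
    # and request PyTorch tensors
    inputs = tokenizer(prefix + text, return_tensors="pt", truncation=True)
    # Deterministic decoding (no sampling), mirroring do_sample=False above
    output_ids = model.generate(
        **inputs, min_length=min_length, max_length=max_length, do_sample=False
    )
    # Decode the first (and only) sequence, dropping special tokens
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Possible usage with the objects loaded in the cell above:
# print(generate_title(model, tokenizer, text))
```

The result may differ from the pipeline output, because the pipeline can apply task-specific settings from the model configuration.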


In [8]:
title_hlgsm = pipeline("summarization",model="yair/HeadlineGeneration-sagemaker")

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("yair/HeadlineGeneration-sagemaker")
model = AutoModelForSeq2SeqLM.from_pretrained("yair/HeadlineGeneration-sagemaker")

summary_title = title_hlgsm(text, min_length=10, max_length=40, do_sample=False)[0]['summary_text']
print(summary_title)

U.S. coronavirus cases peak: CDC chief


In [9]:
title_bartlcnn = pipeline("summarization", model="facebook/bart-large-cnn")

# BART large-sized model, fine-tuned on CNN/Daily Mail
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

summary_title = title_bartlcnn(text, min_length=10, max_length=40, do_sample=False)[0]['summary_text']
print(summary_title)

The US has over 635,000 confirmed Covid-19 infections and over 30,800 deaths, the highest for any country in the world. Dr. Anthony Fauci said
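
The output above stops mid-sentence because `max_length` caps the number of generated tokens. One possible post-processing step is to drop the trailing fragment. This is a sketch under stated assumptions: `trim_to_complete_sentences` is a hypothetical helper, and the abbreviation list is a small illustrative set, not a complete one.

```python
import re

# Assumed, non-exhaustive set of abbreviations that end in a period
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "St."}
TERMINATORS = (".", "!", "?")


def _last_word(s: str) -> str:
    words = s.split()
    return words[-1] if words else ""


def trim_to_complete_sentences(summary: str) -> str:
    """Drop a trailing fragment that does not end in sentence punctuation."""
    summary = summary.strip()
    sentences = []
    for part in re.split(r"(?<=[.!?])\s+", summary):
        # Re-join pieces that were split right after an abbreviation
        if sentences and _last_word(sentences[-1]) in ABBREVIATIONS:
            sentences[-1] += " " + part
        else:
            sentences.append(part)
    # Remove incomplete fragments from the end
    while sentences and (
        not sentences[-1].endswith(TERMINATORS)
        or _last_word(sentences[-1]) in ABBREVIATIONS
    ):
        sentences.pop()
    # Fall back to the untouched summary if no complete sentence remains
    return " ".join(sentences) if sentences else summary
```

Applied to the BART output, this would keep only the first sentence; a summary with no complete sentence is returned unchanged.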


In [11]:
title_t5b = pipeline("summarization", model="t5-base")

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

summary_title = title_t5b(text, min_length=5, max_length=20, do_sample=False)[0]['summary_text']
print(summary_title)

the US has over 635,000 confirmed Covid-19 infections and over 30,800 deaths .
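
The T5 output above starts in lowercase and has a space before the final period. A light clean-up pass could make it read more like a headline. `tidy_title` and its `drop_final_period` flag are hypothetical names used only for illustration.

```python
import re


def tidy_title(raw: str, drop_final_period: bool = True) -> str:
    """Normalize spacing and capitalization of a generated title."""
    # Remove whitespace that precedes punctuation, e.g. "deaths ." -> "deaths."
    title = re.sub(r"\s+([.,!?;:])", r"\1", raw.strip())
    # Headlines usually carry no final period
    if drop_final_period and title.endswith("."):
        title = title[:-1].rstrip()
    # Uppercase only the first character so acronyms like "US" are untouched
    return title[:1].upper() + title[1:]
```

Only the first character is changed, so the rest of the generated wording is preserved as the model produced it.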


In [None]:
# NB: not enough RAM to run google/pegasus-xsum on the Google Colab free tier
title_pegasus = pipeline("summarization", model="google/pegasus-xsum")

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/pegasus-xsum")
model = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-xsum")

summary_title = title_pegasus(text, min_length=5, max_length=20, do_sample=False)[0]['summary_text']
print(summary_title)

config.json:   0%|          | 0.00/1.39k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

In [4]:
# BARThez (a French seq2seq model) fine-tuned on OrangeSum for title generation: moussaKam/barthez-orangesum-title
title_bot = pipeline("summarization", model="moussaKam/barthez-orangesum-title")

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("moussaKam/barthez-orangesum-title")
model = AutoModelForSeq2SeqLM.from_pretrained("moussaKam/barthez-orangesum-title")

summary_title = title_bot(text, min_length=5, max_length=20, do_sample=False)[0]['summary_text']
print(summary_title)

Coronavirus: "the important issue is what happens next", said
