# Search for a model for summarization

Recommended Free Language Models for Summarizing Text Extracted from HTML:

These models are generally well regarded for summarization, and their weights are publicly available on the Hugging Face Hub for use with the Transformers library:

## BART (e.g., facebook/bart-large-cnn):

Pros: Known for strong abstractive summarization performance, especially the large-cnn variant, which was fine-tuned on CNN/Daily Mail news articles. Relatively robust and widely used.
Cons: Can be computationally intensive for long inputs, and the input is limited to 1,024 tokens, but it is manageable for moderate-length text extracted from HTML.
Hugging Face Model Card: facebook/bart-large-cnn
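For inputs longer than the model can accept (facebook/bart-large-cnn takes at most 1,024 input tokens), one common workaround is to split the text into overlapping windows and summarize each window separately. The sketch below is only an illustration: `chunk_words` is a hypothetical helper, it counts whitespace-separated words as a rough stand-in for tokens, and the `max_words` and `overlap` defaults are arbitrary assumptions rather than tuned values.

```python
def chunk_words(text, max_words=400, overlap=40):
    """Split text into windows of at most max_words words.

    Consecutive windows share `overlap` words so that sentences cut at a
    boundary still appear whole in one of the two windows.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


# Synthetic 1000-word input, used only to show the window sizes.
long_text = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_words(long_text)
print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk could then be summarized on its own and the partial summaries joined or summarized again. Word counts only approximate token counts, so a margin below the model limit is advisable.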

## T5 (e.g., t5-small, t5-base, t5-large):

Pros: A powerful text-to-text transformer that can be used for various tasks, including summarization (by prepending "summarize: "). Different sizes are available, allowing you to choose based on your computational resources. t5-base and t5-large generally offer better quality than t5-small.
Cons: Can sometimes be more prone to generating shorter or less abstractive summaries compared to BART or PEGASUS, depending on the prompt and hyperparameters.
Hugging Face Model Cards:
t5-small
t5-base
t5-large




In [2]:
html_content = """
<!DOCTYPE html>
<html>
<head>
<title>Breaking News: Local Bakery Wins National Award</title>
</head>
<body>
  <h1>Local Bakery Crowned Best in the Nation</h1>
  <p>Sweet Surrender bakery in downtown Willow Creek has just been awarded the prestigious "Golden Whisk" award for the best bakery in the United States at the National Baking Competition held in Chicago this week.</p>
  <p>The bakery, owned and operated by local resident Sarah Miller, has been a community staple for over a decade, known for its delicious pastries, custom cakes, and friendly service.</p>
  <p>Miller expressed her excitement and gratitude, thanking her dedicated team and loyal customers for their support.</p>
  <p>The award is expected to bring increased attention and business to the small town bakery.</p>
</body>
</html>
"""


In [None]:
# About 1.63GB
# 406M params
# https://huggingface.co/facebook/bart-large-cnn

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")


print(summarizer(html_content, max_length=130, min_length=30, do_sample=False))


Device set to use mps:0


[{'summary_text': 'Sweet Surrender bakery in downtown Willow Creek has just been awarded the prestigious "Golden Whisk" award. The bakery, owned and operated by local resident Sarah Miller, has been a community staple for over a decade.'}]


In [None]:
html_content_with_title_only = """
<!DOCTYPE html>
<html>
<head>
<title>Breaking News: Local Bakery Wins National Award</title>
</head>
</html>
"""

In [5]:
print(summarizer(html_content_with_title_only, max_length=130, min_length=30, do_sample=False))


Your max_length is set to 130, but your input_length is only 45. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=22)


[{'summary_text': 'Local Bakery Wins National Award. Local Bakery wins National Award for Bakery of the Year. Bakery also wins Best Bakery in the State.'}]
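With `min_length=30` on a 45-token input, the model filled the summary with details that are not in the title (for example "Best Bakery in the State"). One possible mitigation, in line with the warning above, is to scale the length bounds to the input size. The sketch below assumes the token count is already known; `length_bounds` and its divisor defaults are hypothetical choices for illustration, not recommended settings.

```python
def length_bounds(input_tokens, max_divisor=2, min_divisor=10, hard_max=130):
    """Return (min_length, max_length) scaled to the input size.

    max_length is capped at hard_max and at input_tokens // max_divisor;
    min_length is input_tokens // min_divisor, never above max_length.
    """
    if input_tokens < 1:
        raise ValueError("input_tokens must be positive")
    max_length = max(1, min(hard_max, input_tokens // max_divisor))
    min_length = max(1, min(max_length, input_tokens // min_divisor))
    return min_length, max_length


print(length_bounds(45))    # (4, 22)
print(length_bounds(1000))  # (100, 130)
```

For the 45-token input this gives `max_length=22`, the same value the warning suggests. The two numbers can be passed as `min_length` and `max_length` in the `summarizer(...)` call.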


# Or if we want a smaller model

In [19]:
# https://huggingface.co/google-t5/t5-small
# Around 250 MB
# 60.5M params

from transformers import pipeline

summarizer = pipeline("summarization", model="google-t5/t5-small")



Device set to use mps:0


In [20]:
summary = summarizer(html_content, max_length=150, min_length=30, do_sample=False)[0]['summary_text']
print(summary)

p>Sweet Surrender bakery in downtown Willow Creek has been awarded the "Golden Whisk" award . the award is expected to bring increased attention and business to the small town bakery .
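The stray `p>` at the start of the summary above suggests that HTML markup is reaching the model, since `html_content` is passed in raw. A minimal sketch of stripping the tags first, using only Python's standard `html.parser` module, is shown below. `TextExtractor` and `html_to_text` are hypothetical helper names, and the sample document is a shortened stand-in for `html_content`.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Ignore whitespace-only nodes and anything inside a skipped element.
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample_html = """
<html>
<head><title>Local Bakery Wins National Award</title></head>
<body>
  <h1>Local Bakery Crowned Best in the Nation</h1>
  <p>Sweet Surrender bakery has been awarded the "Golden Whisk" award.</p>
</body>
</html>
"""

plain_text = html_to_text(sample_html)
print(plain_text)
```

The resulting string could be passed to the summarizer in place of the raw HTML. A dedicated HTML library may handle malformed pages more gracefully; the standard-library parser is used here only to keep the sketch dependency-free.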


In [21]:
# Let's try prompt engineering.
# Note: the summarizer loaded at this point is t5-small, which is not instruction-tuned,
# so the prompt is treated as part of the text to summarize.
prompted_text = "Summarize the following article in a detailed and descriptive manner suitable for embedding in a database for similarity search: " + html_content
summary = summarizer(
    prompted_text,
    max_length=250,
    min_length=50,
    do_sample=False
)[0]['summary_text']
print(f"Generated Prompted Summary (T5-small): {summary}")

Generated Prompted Summary (T5-small): local bakery in downtown Willow Creek has been awarded the prestigious "Golden Whisk" award for the best bakery in the united states . /p>Miller expressed her excitement and gratitude, thanking her dedicated team and loyal customers for their support .


# A model in between BART and T5-small

In [8]:
# https://huggingface.co/google-t5/t5-base
# Around 900 MB
# 223M params

from transformers import pipeline

summarizer = pipeline("summarization", model="google-t5/t5-base")



Device set to use mps:0


In [9]:
summary = summarizer(html_content, max_length=150, min_length=30, do_sample=False)[0]['summary_text']
print(f"Generated Summary (T5-base): {summary}")

Generated Summary (T5-base): "Golden Whisk" award is expected to bring increased attention and business to the small town bakery . sweet surrender bakery in downtown Willow Creek has been a community staple for over a decade .


# Or we use an API

In [16]:
from huggingface_hub import InferenceClient
from dotenv import load_dotenv
import os

load_dotenv()
api_key = os.environ.get("HUGGINGFACE_API_KEY")

client = InferenceClient(
    provider="hf-inference",
    api_key=api_key,
)
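Note that `os.environ.get` returns `None` when the variable is missing, so a missing key would only surface later as an authentication error. A small guard, sketched below under the assumption that failing early is preferred, makes the problem visible at setup time. `require_env` is a hypothetical helper, and the dictionary stands in for the real environment so the example runs without a key.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your environment or .env file"
        )
    return value


# Stand-in environment used only for demonstration.
fake_env = {"HUGGINGFACE_API_KEY": "hf_example_token"}
print(require_env("HUGGINGFACE_API_KEY", environ=fake_env))
```

In the notebook, `api_key = require_env("HUGGINGFACE_API_KEY")` after `load_dotenv()` would replace the plain `os.environ.get` call.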



In [17]:
result = client.summarization(
    html_content,
    model="facebook/bart-large-cnn",
)

print(result)

SummarizationOutput(summary_text='Sweet Surrender bakery in downtown Willow Creek has just been awarded the prestigious "Golden Whisk" award for the best bakery in the United States. The bakery, owned and operated by local resident Sarah Miller, has been a community staple for over a decade, known for its delicious pastries, custom cakes, and friendly service.')


In [23]:
# Let's try prompt engineering
prompted_text = "Summarize the following article in a detailed and descriptive manner suitable for embedding in a database for similarity search: " + html_content


result = client.summarization(prompted_text, model="facebook/bart-large-cnn")

In [None]:
print(result)  # Took 22 seconds; latency can vary widely and the call may raise an error

SummarizationOutput(summary_text='Sweet Surrender bakery in downtown Willow Creek has just been awarded the prestigious "Golden Whisk" award for the best bakery in the United States. The bakery, owned and operated by local resident Sarah Miller, has been a community staple for over a decade. The award is expected to bring increased attention and business to the small town bakery.')
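Since the hosted call can be slow or fail intermittently, wrapping it in a retry loop with exponential backoff is one way to make the notebook more tolerant. The sketch below is an assumption-laden illustration: `call_with_retries` is a hypothetical helper, it catches every `Exception` because the exact error types are not specified here, and a simulated flaky function stands in for the remote call.

```python
import time


def call_with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt and try again.

    The last exception is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Stand-in for a flaky remote call: fails twice, then succeeds.
calls = {"count": 0}


def flaky_summarize():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "summary text"


delays = []  # Record the waits instead of actually sleeping.
result_text = call_with_retries(flaky_summarize, sleep=delays.append)
print(result_text, delays)  # summary text [1.0, 2.0]
```

With the real client, the call could look like `call_with_retries(lambda: client.summarization(html_content, model="facebook/bart-large-cnn"))`.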
