
In [13]:
# Text Summarization Model using T5 Transformer Model (LLM)
# ---------------------------------------
# Step-by-step implementation using Hugging Face Transformers

**A text summarization model takes a long text and produces a shorter version that preserves its main points. It works either by extracting key sentences directly (extractive summarization) or by rephrasing the content in new words (abstractive summarization). T5, the model used here, generates its summary token by token and is therefore abstractive.**
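
*To make the extractive approach concrete, here is a minimal sketch that is not part of the T5 workflow in this notebook. It scores each sentence by the average frequency of its words and keeps the top ones. The function name `extractive_summary` and the sample text are assumptions made for illustration only.*

```python
import re
from collections import Counter

def extractive_summary(text: str, num_sentences: int = 1) -> str:
    """Toy extractive summarizer: keep the sentences whose words are most frequent."""
    # Split into sentences on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    # Count how often each lowercase word occurs in the whole text.
    word_counts = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favored automatically.
        return sum(word_counts[w] for w in words) / len(words) if words else 0.0

    # Pick the top-scoring sentences, then restore their original order.
    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return " ".join(s for s in sentences if s in top)

doc = (
    "Solar power is growing fast. "
    "Solar panels turn sunlight into power. "
    "Many people like coffee."
)
print(extractive_summary(doc, num_sentences=1))
```

*An extractive summary can only reuse sentences that already exist in the input, whereas an abstractive model is free to produce wording that never appears in the source.*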

**To build a text summarization model, first, we need to choose a pre-trained language model like T5. Then, we need to tokenize the input text, which converts it into a format the model can process. The next step will be to use the model to generate a summary by specifying parameters like maximum length and beam search for better results. The final step will be to decode the generated tokens back into readable text and adjust parameters to improve the summary quality.**

**Select a Suitable LLM**

*Choose a pre-trained model designed for text generation tasks. The T5 model (Text-to-Text Transfer Transformer) by Google is one such model that is effective for various text-based tasks like translation, question-answering, and summarization.*

In [14]:
# Install Required Libraries

*Install the transformers library by Hugging Face, which provides easy access to many pre-trained models and tokenizers. T5Tokenizer also depends on the sentencepiece package, and the model class used here needs PyTorch. If you are using Google Colab, transformers is already pre-installed. To install everything on your local machine, run the command below in your terminal or command prompt (without the leading "!", which is only needed inside a notebook cell):*

In [15]:
# !pip install transformers sentencepiece torch
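
*Before loading the model, it can help to confirm that the required packages are importable. The sketch below uses only the standard library. The helper name `missing_packages` and the list of package names are assumptions for illustration.*

```python
import importlib.util

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Packages this notebook relies on (T5Tokenizer needs sentencepiece).
required = ["transformers", "sentencepiece", "torch"]
missing = missing_packages(required)
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All required packages are available.")
```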

*Next, import the T5Tokenizer and T5ForConditionalGeneration classes from the transformers library and pick a checkpoint such as t5-small, t5-base, or t5-large, depending on your quality requirements and available compute:*

In [16]:
# Load a pre-trained T5 model and tokenizer

from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "t5-small"  # can be 't5-base' or 't5-large' based on resources
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

*To choose between t5-small, t5-base, and t5-large, weigh your computational resources against your accuracy needs. t5-small (about 60 million parameters) is the fastest and needs the least memory, which suits quick experiments or limited hardware. t5-base (about 220 million parameters) balances speed and quality and is a reasonable default. t5-large (about 770 million parameters) generally gives the best summaries of the three but needs considerably more memory and compute, so it fits cases where quality matters more than speed.*
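
*As a rough aid for this choice, the sketch below estimates how much memory the weights alone occupy in float32 (4 bytes per parameter) and picks the largest checkpoint that fits a given budget. The parameter counts are the approximate figures from the T5 model cards. The helper names are hypothetical, and the estimate ignores activations and beam search overhead, so treat it as a lower bound.*

```python
# Approximate parameter counts from the T5 model cards.
T5_PARAMS = {"t5-small": 60_000_000, "t5-base": 220_000_000, "t5-large": 770_000_000}

def weights_size_gb(model_name: str, bytes_per_param: int = 4) -> float:
    """Rough size of the weights alone (float32 by default), in gigabytes."""
    return T5_PARAMS[model_name] * bytes_per_param / 1e9

def pick_checkpoint(budget_gb: float) -> str:
    """Return the largest checkpoint whose weights fit in budget_gb (hypothetical helper)."""
    fitting = [name for name in T5_PARAMS if weights_size_gb(name) <= budget_gb]
    if not fitting:
        raise ValueError("No listed checkpoint fits in the given budget")
    return max(fitting, key=T5_PARAMS.get)

for name in T5_PARAMS:
    print(f"{name}: ~{weights_size_gb(name):.2f} GB of float32 weights")
print("Largest checkpoint under 1 GB:", pick_checkpoint(1.0))
```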

In [17]:
# Input text to summarize
# Next, we need to define the text that needs to be summarized. The text should be prefixed with the keyword “summarize:” for the T5 model to recognize the task properly:
text = """
The COVID-19 pandemic has brought unprecedented challenges to the global economy.
Governments worldwide have implemented strict lockdowns and social distancing measures
to contain the virus spread. Many industries have been severely affected, leading to
widespread job losses and financial instability across countries.
"""
# Prepare the text for the T5 model by adding the "summarize:" prefix
input_text = "summarize: " + text

*T5 is a “text-to-text” model: every task (summarization, translation, question answering, and so on) is posed as text in, text out, and the pre-trained checkpoints learned to tell tasks apart from a short prefix at the start of the input. The “summarize:” prefix therefore tells the model to generate a summary of the input text. Without it, the model may not produce the intended summarization output.*
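
*The prefixing step can be wrapped in a small helper. The sketch below is an illustration only: the name `build_t5_input` and the dictionary keys are hypothetical, while the two prefix strings are ones the original T5 checkpoints were trained with. It also collapses line breaks, since the triple-quoted text above starts and ends with a newline.*

```python
# Task prefixes used by the original T5 checkpoints (a small, non-exhaustive selection).
TASK_PREFIXES = {
    "summarize": "summarize: ",
    "translate_en_de": "translate English to German: ",
}

def build_t5_input(text: str, task: str = "summarize") -> str:
    """Collapse whitespace and prepend the task prefix (hypothetical helper)."""
    cleaned = " ".join(text.split())
    return TASK_PREFIXES[task] + cleaned

print(build_t5_input("\nThe COVID-19 pandemic has brought\nunprecedented challenges.\n"))
print(build_t5_input("Good morning.", task="translate_en_de"))
```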

In [18]:
# Tokenize the input text
# The next step is tokenization: converting the text into a sequence of integer token IDs, each of which indexes an entry in the model's vocabulary. The max_length parameter, together with truncation=True, caps the input at 512 tokens by cutting off anything beyond that:

input_ids = tokenizer.encode(input_text,
                             return_tensors="pt", # returns PyTorch tensors
                             max_length=512,     # limit input size
                             truncation=True)    # truncate long inputs
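
*To see what "text to integer IDs with truncation" means without loading a real tokenizer, here is a toy sketch. It splits on whitespace, whereas T5 uses a SentencePiece subword vocabulary, so the IDs are not comparable. The vocabulary and the function `toy_encode` are assumptions for illustration. The end-of-sequence ID 1 and unknown-token ID 2 mirror the IDs T5 uses.*

```python
def toy_encode(text, vocab, max_length, eos_id=1, unk_id=2):
    """Whitespace 'tokenizer': map words to IDs, truncate, and append an end marker."""
    ids = [vocab.get(word, unk_id) for word in text.lower().split()]
    # Reserve one slot for the end-of-sequence ID so the result never exceeds max_length.
    ids = ids[: max_length - 1]
    return ids + [eos_id]

# Hypothetical four-word vocabulary.
vocab = {"summarize:": 3, "the": 4, "economy": 5, "slowed": 6}
sentence = "summarize: the economy slowed sharply"

print(toy_encode(sentence, vocab, max_length=8))  # fits: unknown word maps to unk_id
print(toy_encode(sentence, vocab, max_length=4))  # truncated to 3 words plus eos_id
```

*Truncation silently drops the end of a long input, so anything beyond the limit cannot influence the summary.*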

*Now, use the generate method to produce the summary. Important parameters include:*


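*   max_length: The maximum number of tokens in the output.
*   num_beams: The number of beams for beam search (higher values often improve results but increase computation).
*   length_penalty: The exponent applied to the sequence length when beam candidates are scored. Values above 0 favor longer outputs and values below 0 favor shorter ones.
*   early_stopping: When True, beam search stops as soon as num_beams complete candidates have been found.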
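
*The effect of length_penalty is easiest to see with numbers. The sketch below assumes the scoring rule described in the Transformers documentation, where a candidate's summed log-probability is divided by its length raised to length_penalty. The two candidates and their log-probabilities are made up for illustration.*

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized score of a finished beam candidate (higher is better)."""
    return sum_logprobs / (length ** length_penalty)

# Made-up candidates: log-probabilities are negative, and longer texts accumulate more.
CANDIDATES = {
    "short": {"sum_logprobs": -4.0, "length": 5},
    "long": {"sum_logprobs": -6.0, "length": 10},
}

def preferred(length_penalty: float) -> str:
    """Name of the candidate with the higher score under the given penalty."""
    return max(
        CANDIDATES,
        key=lambda name: beam_score(
            CANDIDATES[name]["sum_logprobs"], CANDIDATES[name]["length"], length_penalty
        ),
    )

for lp in (0.0, 1.0, 2.0):
    print(f"length_penalty={lp}: prefers the {preferred(lp)} candidate")
```

*With no normalization (0.0) the short candidate wins because it has the higher raw log-probability. With 1.0 or 2.0 the long candidate wins, because dividing a negative number by a larger value moves it closer to zero.*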




In [19]:
# Generate the summary

summary_ids = model.generate(
    input_ids,
    max_length=50,   # maximum length of the summary
    num_beams=4,     # beam search for better results
    length_penalty=2.0,  # positive values favor longer summaries (max_length still caps them)
    early_stopping=True  # stop once num_beams complete candidates exist
)
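
*Why can several beams beat greedy decoding? The toy sketch below replaces the neural network with a hand-written table of next-token probabilities (entirely hypothetical) and runs a plain beam search over it. With num_beams=1 the search is greedy. It is a simplified illustration, not how model.generate is implemented.*

```python
import math

# Hypothetical toy "model": next-token probabilities given the previous token.
NEXT_TOKEN_PROBS = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"x": 0.9, "y": 0.1},
}

def beam_search(num_beams: int, steps: int = 2):
    """Return (tokens, probability) of the best sequence found with the given beam width."""
    beams = [(["<s>"], 0.0)]  # each beam: (tokens so far, summed log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, logp in beams:
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((tokens + [token], logp + math.log(prob)))
        # Keep only the num_beams highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    best_tokens, best_logp = beams[0]
    return best_tokens[1:], math.exp(best_logp)

for width in (1, 2):
    tokens, prob = beam_search(num_beams=width)
    print(f"num_beams={width}: {' '.join(tokens)} (probability {prob:.2f})")
```

*Greedy decoding commits to "a" because it is the likeliest first token and ends with a sequence of probability 0.33. Keeping two beams lets "b" survive the first step, and "b x" turns out to be more likely overall at 0.36.*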

In [20]:
# Decode and display the summary
# The final step is to decode the generated summary using the tokenizer to convert the tokens back to readable text:

summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:", summary)

Summary: governments have implemented strict lockdowns and social distancing measures. many industries have been severely affected, leading to job losses and financial instability across countries.
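
*The role of skip_special_tokens can be illustrated with a toy decoder. The vocabulary below is hypothetical. The IDs 0 and 1 mirror T5's padding and end-of-sequence IDs, and T5's generated sequences start with the padding ID because the decoder uses it as its start token.*

```python
# Hypothetical vocabulary; IDs 0 and 1 mirror T5's padding and end-of-sequence IDs.
ID_TO_TOKEN = {0: "<pad>", 1: "</s>", 3: "lockdowns", 4: "hit", 5: "industries"}
SPECIAL_IDS = {0, 1}

def toy_decode(ids, skip_special_tokens=True):
    """Map IDs back to text, optionally dropping special tokens."""
    tokens = [
        ID_TO_TOKEN[i]
        for i in ids
        if not (skip_special_tokens and i in SPECIAL_IDS)
    ]
    return " ".join(tokens)

generated = [0, 3, 4, 5, 1]  # start token, three words, end-of-sequence
print(toy_decode(generated))
print(toy_decode(generated, skip_special_tokens=False))
```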


***Summary:***

*To recap, building a text summarization model takes four steps. First, choose a pre-trained language model such as T5. Second, tokenize the input text to convert it into a format the model can process. Third, generate a summary with the model, specifying parameters such as the maximum length and the number of beams. Finally, decode the generated tokens back into readable text, and adjust the generation parameters if the summary quality needs improving.*
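
*The four steps can be collected into one reusable function. The sketch below is one possible arrangement, not the only one: the function name `summarize` and its default options are assumptions, and it expects the `tokenizer` and `model` objects loaded earlier in this notebook to be passed in, using the same encode, generate, and decode calls as the cells above.*

```python
def summarize(text, tokenizer, model, max_input_length=512, **generate_kwargs):
    """Run the prefix, tokenize, generate, and decode steps on one text (a sketch)."""
    # Illustrative defaults; callers can override any of them via keyword arguments.
    options = {"max_length": 50, "num_beams": 4, "early_stopping": True}
    options.update(generate_kwargs)

    # Steps 1 and 2: add the task prefix and convert the text to token IDs.
    input_ids = tokenizer.encode(
        "summarize: " + " ".join(text.split()),
        return_tensors="pt",
        max_length=max_input_length,
        truncation=True,
    )
    # Step 3: generate the summary token IDs.
    summary_ids = model.generate(input_ids, **options)
    # Step 4: convert the first (best) sequence back to text.
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

*With the objects from the earlier cells, a call such as `summarize(text, tokenizer, model, num_beams=2)` would then let you try different generation settings without repeating the boilerplate.*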