# ***Installing Dependencies***

In [None]:
pip install --upgrade transformers datasets evaluate rouge_score



In [None]:
!pip install textstat datasets transformers



# ***Setting up the Environment and Loading Data***

This cell loads a news dataset, cleans it, and converts it into a format suitable for fine-tuning a Transformer model. It imports pandas for tabular data handling, the Hugging Face datasets library for model-ready data structures, and re for regular-expression text cleaning. The clean_text() function normalizes each string by lowercasing it, stripping HTML tags and URLs, and collapsing runs of whitespace into single spaces; non-string values are replaced with an empty string. The script then reads news-article-categories.csv with UTF-8 encoding and, if the file is missing, prints an error message and skips the remaining steps instead of raising an exception.

Once the file is loaded, the script keeps only the body and title columns, renames them to text and summary, and drops rows with missing values. The cleaning function is then applied to both columns. The cleaned DataFrame is converted into a Hugging Face Dataset object, which supports batched tokenization and plugs directly into the Trainer API. Finally, the dataset is split 80/20 into training and test subsets and stored in a DatasetDict. Because dropna() leaves gaps in the pandas index, the conversion also carries the index along as an extra `__index_level_0__` column, which is visible in the printed structure.

In [None]:
import pandas as pd
from datasets import Dataset, DatasetDict
import re # Import the regular expression library

# --- (A) CREATE A CLEANING FUNCTION ---
def clean_text(text):
    if not isinstance(text, str): # Handle potential non-string data
        return ""
    text = text.lower()
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# --- 1. Load Your Custom Dataset ---
try:
    # Changed encoding to 'utf-8', which is standard for Kaggle datasets
    df = pd.read_csv('news-article-categories.csv', encoding='utf-8')
    print("Successfully loaded 'news-article-categories.csv'")

except FileNotFoundError:
    print("Error: 'news-article-categories.csv' not found.")
    df = None # Set df to None if file not found

if df is not None:
    # --- 2. Preprocess and Prepare the Dataset ---
    # Keep only the article body and headline, renamed to the names the rest
    # of the script expects ('text' and 'summary'), and drop incomplete rows.
    # Chaining avoids calling dropna(inplace=True) on a slice of the original frame.
    df = (
        df[['body', 'title']]
        .rename(columns={'body': 'text', 'title': 'summary'})
        .dropna()
    )

    # --- (B) APPLY THE CLEANING FUNCTION TO YOUR DATA ---
    print("\n--- Applying preprocessing to the dataset ---")
    df['text'] = df['text'].apply(clean_text)
    df['summary'] = df['summary'].apply(clean_text)
    print("Preprocessing complete. Example of cleaned article:")
    print(df.iloc[0]['text'])

    # --- 3. Convert to a Hugging Face Dataset ---
    hg_dataset = Dataset.from_pandas(df)

    # --- 4. Split into Training and Test Sets ---
    train_test_split = hg_dataset.train_test_split(test_size=0.2)
    dataset = DatasetDict({
        'train': train_test_split['train'],
        'test': train_test_split['test']
    })

    print("\nDataset structure:")
    print(dataset)

Successfully loaded 'news-article-categories.csv'

--- Applying preprocessing to the dataset ---
Preprocessing complete. Example of cleaned article:

Dataset structure:
DatasetDict({
    train: Dataset({
        features: ['text', 'summary', '__index_level_0__'],
        num_rows: 5497
    })
    test: Dataset({
        features: ['text', 'summary', '__index_level_0__'],
        num_rows: 1375
    })
})
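
To make the effect of clean_text() concrete, the sketch below repeats the function from the cell above and runs it on a made-up string. The sample text and URL are invented for illustration and are not taken from the dataset.

```python
import re

def clean_text(text):
    """Same cleaning steps as the cell above, repeated so this sketch runs on its own."""
    if not isinstance(text, str):  # non-string values (e.g. NaN) become empty strings
        return ""
    text = text.lower()
    text = re.sub(r'<.*?>', '', text)                   # drop HTML tags
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # drop URLs
    return re.sub(r'\s+', ' ', text).strip()            # collapse whitespace

# Hypothetical input, invented for illustration
sample = "  <p>Breaking NEWS:</p> visit https://example.com/a  now  "
cleaned = clean_text(sample)
print(cleaned)  # breaking news: visit now
```

Note that lowercasing is applied to the summaries as well as the articles, so the model is trained on lowercase headlines.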


# ***Tokenization***

This section covers tokenization and data preparation for fine-tuning the google-t5/t5-small model. It imports the AutoTokenizer class from the Hugging Face Transformers library and defines the model checkpoint. T5-small is a compact encoder-decoder model that frames every task as text-to-text, so summarization is requested by prepending the prefix "summarize: " to each input. The matching tokenizer is loaded with AutoTokenizer.from_pretrained(model_checkpoint), so the text is split into the same vocabulary the model saw during pre-training. This step converts raw text into the sequences of token IDs that the model consumes.

A custom preprocess_function() then tokenizes both the input articles and their reference summaries. Inputs are prefixed and truncated to at most 1024 tokens, while summaries are truncated to 128 tokens. Before tokenization, a filter keeps only articles shorter than 500 whitespace-separated words, which reduces compute and limits how much text truncation discards; in the run shown below it leaves 2,817 of 5,497 training examples and 710 of 1,375 test examples. The map() method applies the tokenization in batches and adds input_ids, attention_mask, and labels columns, leaving the dataset ready for fine-tuning T5's encoder-decoder architecture on headline generation.

In [None]:
from transformers import AutoTokenizer

# --- 4. Define the Model Checkpoint ---
# ## <-- KEY CHANGE: Switched to the smaller t5-small model ---
model_checkpoint = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# --- 5. Create a T5-Specific Preprocessing Function ---
prefix = "summarize: "

def preprocess_function(examples):
    # ## <-- KEY CHANGE: Add the prefix to all input articles ---
    inputs = [prefix + doc for doc in examples["text"]]

    # Tokenize the prefixed inputs
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    # Tokenize the target summaries (labels); text_target replaces the
    # deprecated tokenizer.as_target_tokenizer() context manager
    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# --- 6. Apply the Tokenization ---
dataset = dataset.filter(lambda x: len(x["text"].split()) < 500)
tokenized_datasets = dataset.map(preprocess_function, batched=True)
print("\nSample of tokenized data prepared for T5:")
print(tokenized_datasets['train'][0].keys())



Filter:   0%|          | 0/5497 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1375 [00:00<?, ? examples/s]

Map:   0%|          | 0/2817 [00:00<?, ? examples/s]



Map:   0%|          | 0/710 [00:00<?, ? examples/s]


Sample of tokenized data prepared for T5:
dict_keys(['text', 'summary', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'])
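
The filtering and prefixing steps can be illustrated without loading a tokenizer. The sketch below applies the same word-count rule and "summarize: " prefix to a few invented records. The records and the lowered threshold are assumptions made purely for the demonstration (the notebook uses 500).

```python
PREFIX = "summarize: "
MAX_WORDS = 5  # hypothetical threshold for the demo; the notebook uses 500

# Invented records standing in for rows of the dataset
records = [
    {"text": "council approves new budget", "summary": "budget approved"},
    {"text": "a much longer article body that exceeds the demo threshold", "summary": "long"},
    {"text": "one two three four five", "summary": "exactly at the limit"},
]

# Same rule as dataset.filter(...): keep rows with strictly fewer words than the limit
kept = [r for r in records if len(r["text"].split()) < MAX_WORDS]

# Same rule as preprocess_function: prepend the task prefix to every input
inputs = [PREFIX + r["text"] for r in kept]

print(len(kept))   # 1
print(inputs[0])   # summarize: council approves new budget
```

Because the comparison is strict, an article with exactly the threshold number of words is dropped along with the longer ones.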


# ***Model Training***

## ***Fine-Tuning the Model***

This section describes fine-tuning the google-t5/t5-small model for automatic news headline generation. After importing the required classes from the Hugging Face Transformers library, the cell defines nine adjustable hyperparameters: learning rate, training and evaluation batch sizes, number of epochs, weight decay, warmup steps, logging frequency, maximum generation length, and gradient accumulation steps. The checkpoint "google-t5/t5-small" is loaded with AutoModelForSeq2SeqLM.from_pretrained(), giving a lightweight encoder-decoder model suited to summarization. A compute_metrics() function reports two intrinsic, reference-free measures of the generated text: the average Flesch Reading Ease score (computed with TextStat) and the average length in words. These complement reference-based metrics such as ROUGE by describing how readable and how concise the outputs are.
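
For reference, the Flesch Reading Ease score is a fixed formula over word, sentence, and syllable counts. The sketch below implements the standard formula directly. The function signature and the example counts are illustrative, and TextStat's own counting of syllables and sentences may yield different values for real text.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Standard Flesch Reading Ease formula; higher scores mean easier text."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# Illustrative counts: 100 words, 5 sentences, 150 syllables
score = flesch_reading_ease(100, 5, 150)
print(round(score, 3))  # 59.635
```

Short headlines contain very few words and usually a single sentence, so their scores can swing widely from one output to the next; averaging over the evaluation set smooths this out.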

The training configuration is defined with Seq2SeqTrainingArguments and mirrors the earlier BART setup so that the two experiments stay comparable. It sets the logging frequency, checkpoint save strategy, gradient accumulation, mixed-precision (fp16) training, and generation-based prediction. The Seq2SeqTrainer ties together the model, datasets, tokenizer, and data collator for supervised fine-tuning, in which the model's weights are updated on the training split to produce more relevant and concise headlines. After training, the fine-tuned model is saved to a local directory for evaluation and for comparison with the BART model, which makes it possible to weigh model size against readability and summarization quality.

In [None]:
import transformers
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
import numpy as np
import textstat  # for readability metrics (pip install textstat)

print("Transformers library version:", transformers.__version__)

# --- 9 ADJUSTABLE HYPERPARAMETERS (Copied from Facebook Config) ---
learning_rate = 1e-6                         # 1. Learning rate
train_batch_size = 8                         # 2. Training batch size
eval_batch_size = 8                          # 3. Evaluation batch size
num_train_epochs = 2                         # 4. Number of epochs
weight_decay = 0.15                         # 5. Weight decay
warmup_steps = 500                           # 6. Warmup steps
logging_steps = 50                           # 7. Logging frequency
generation_max_length = 128                  # 8. Max length for generated text
gradient_accumulation_steps = 2              # 9. Gradient accumulation steps

# --- Model Checkpoint ---
model_checkpoint = "google-t5/t5-small"

# --- Compute Metrics (Intrinsic, same as before) ---
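def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Generated sequences can be padded with -100 when batches are gathered;
    # swap those for the pad token so batch_decode does not fail
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)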

    # Compute readability metrics (intrinsic quality)
    readability_scores = [textstat.flesch_reading_ease(pred) for pred in decoded_preds if pred]
    avg_readability = np.mean(readability_scores) if readability_scores else 0

    # Compute average length
    prediction_lens = [len(pred.split()) for pred in decoded_preds if pred]
    avg_length = np.mean(prediction_lens) if prediction_lens else 0

    return {
        "avg_readability": round(avg_readability, 2),
        "avg_length": round(avg_length, 2),
    }

# --- Load Pre-trained Model ---
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

# --- Prepare Data Collator ---
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# --- Define Training Arguments (Synced with Facebook Version) ---
training_args = Seq2SeqTrainingArguments(
    output_dir="./t5_small_finetuned_intrinsic",  # updated path
    do_eval=True,
    logging_strategy="steps",                     # ✅ changed to match Facebook setup
    logging_steps=logging_steps,
    save_strategy="epoch",
    learning_rate=learning_rate,
    per_device_train_batch_size=train_batch_size,
    per_device_eval_batch_size=eval_batch_size,
    weight_decay=weight_decay,
    warmup_steps=warmup_steps,
    save_total_limit=3,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    generation_max_length=generation_max_length,
    gradient_accumulation_steps=gradient_accumulation_steps,
    fp16=True,
    report_to="none",
)

# --- Initialize Trainer ---
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    processing_class=tokenizer,  # replaces the deprecated tokenizer= argument in recent transformers releases
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# --- Start Fine-Tuning ---
print("\nStarting model fine-tuning...")
trainer.train()

# --- Save the Fine-Tuned Model ---
model_save_path = "./my_finetuned_t_summarizer_no_ref"
trainer.save_model(model_save_path)
print(f"Model saved to {model_save_path}")

Transformers library version: 4.57.1





Starting model fine-tuning...




Step,Training Loss
50,2.8002
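
A quick calculation shows how these hyperparameters interact on the filtered training set of 2,817 examples. The sketch below assumes a single device and a linear warmup schedule (the Trainer default). The exact step count may differ by a step or two depending on how the library rounds the last partial accumulation.

```python
import math

train_examples = 2817            # training rows left after the 500-word filter
train_batch_size = 8
gradient_accumulation_steps = 2
num_train_epochs = 2
warmup_steps = 500
learning_rate = 1e-6

# Each optimizer update sees batch_size * accumulation_steps examples
effective_batch = train_batch_size * gradient_accumulation_steps
updates_per_epoch = math.ceil(train_examples / effective_batch)
total_updates = updates_per_epoch * num_train_epochs

# With linear warmup the rate grows from 0 to learning_rate over warmup_steps
peak_lr_reached = learning_rate * min(1.0, total_updates / warmup_steps)

print(effective_batch)               # 16
print(total_updates)                 # 354
print(total_updates < warmup_steps)  # True
```

Under these assumptions training ends after roughly 354 updates, before the 500-step warmup completes, so the learning rate would peak at about 70% of the configured 1e-6. A smaller warmup_steps value or more epochs may be worth testing.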


## ***Metrics of the Fine-Tuned Model***

This section covers the evaluation of the fine-tuned google-t5/t5-small model using the Hugging Face Transformers and Evaluate libraries. The script loads the saved model from disk and prepares the data collator so that evaluation batches are padded consistently. ROUGE, loaded through the evaluate library, is the primary quantitative metric: it measures n-gram and subsequence overlap between generated and reference summaries. The custom safe_decode() function clips token IDs into the valid vocabulary range, so padding values such as -100 cannot crash the tokenizer, and decodes each prediction into text with special tokens removed. Clean decoding matters because both ROUGE and the readability scores are computed on the decoded strings.
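
The two sanitizing steps applied before decoding can be sketched in plain Python. The token IDs below are invented, and the pad ID of 0 and vocabulary size of 32,100 are assumptions matching the T5 tokenizer; the notebook itself does the same thing with np.where and np.clip.

```python
PAD_TOKEN_ID = 0      # assumed pad ID (T5 uses 0)
VOCAB_SIZE = 32100    # assumed vocabulary size for the T5 tokenizer
IGNORE_INDEX = -100   # value used to mask padded label positions in the loss

# Invented label and prediction rows, padded to a common length
labels = [[71, 1782, 1, -100, -100]]
predictions = [[0, 71, 1782, 1, -100]]

# Labels: replace the ignore index with the pad ID (np.where in the notebook)
clean_labels = [[PAD_TOKEN_ID if t == IGNORE_INDEX else t for t in row] for row in labels]

# Predictions: clamp every ID into [0, VOCAB_SIZE - 1] (np.clip in the notebook)
clean_preds = [[min(max(t, 0), VOCAB_SIZE - 1) for t in row] for row in predictions]

print(clean_labels)  # [[71, 1782, 1, 0, 0]]
print(clean_preds)   # [[0, 71, 1782, 1, 0]]
```

Clipping maps -100 to 0, which coincides with the pad token under these assumptions. For a tokenizer whose pad ID is not 0, replacing -100 explicitly would be the safer choice.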

The compute_metrics() function calculates both ROUGE and intrinsic quality metrics. ROUGE-1 and ROUGE-2 measure unigram and bigram overlap with the reference, while ROUGE-L and ROUGE-Lsum are based on the longest common subsequence, computed over the whole summary and over newline-separated sentences respectively. Readability is measured with the Flesch Reading Ease score, and the average summary length in words indicates conciseness. Seq2SeqTrainer and Seq2SeqTrainingArguments drive the evaluation; with predict_with_generate=True, the trainer generates summaries for the test split and passes them to compute_metrics(). The printed metrics combine reference-based accuracy with reference-free readability and allow a direct comparison with the BART model's results.
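
To show what the unigram-overlap score measures, the sketch below computes a simplified ROUGE-1 F1 in plain Python. It is not the implementation behind evaluate.load("rouge"), which applies its own tokenization and optional stemming; the function name and example strings are illustrative.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Invented headline pair
score = rouge1_f1("police prepare security plan", "police prepare summit security plan")
print(round(score, 4))  # 0.8889
```

Here all four predicted words appear in the five-word reference, giving a precision of 1.0 and a recall of 0.8. The notebook multiplies the library's scores by 100, so the same pair would be reported on a 0 to 100 scale.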

In [None]:
import numpy as np
import torch
import textstat
import evaluate  # ✅ use this instead of datasets.load_metric
from transformers import (
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq,
)

# --- (1) Load Model ---
model_path = "./my_finetuned_t_summarizer_no_ref"
print(f"Loading fine-tuned model from: {model_path}")

model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
# tokenizer = AutoTokenizer.from_pretrained(model_path)  # Uncomment if needed

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# --- (2) Load ROUGE Metric ---
rouge = evaluate.load("rouge")  # ✅ updated import

# --- (3) Safe Decode ---
def safe_decode(predictions):
    decoded = []
    for pred in predictions:
        pred = np.clip(pred, 0, tokenizer.vocab_size - 1)  # ✅ ensure valid IDs
        text = tokenizer.decode(pred, skip_special_tokens=True)
        decoded.append(text)
    return decoded

# --- (4) Compute Metrics ---
def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    predictions = np.array(predictions)
    if predictions.ndim > 2:
        predictions = predictions[:, 0, :]  # handle nested arrays

    decoded_preds = safe_decode(predictions)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = [tokenizer.decode(l, skip_special_tokens=True) for l in labels]

    # --- ROUGE scores ---
    rouge_scores = rouge.compute(
        predictions=decoded_preds,
        references=decoded_labels,
        use_stemmer=True
    )
    rouge1 = rouge_scores["rouge1"] * 100
    rouge2 = rouge_scores["rouge2"] * 100
    rougeL = rouge_scores["rougeL"] * 100
    rougeLsum = rouge_scores["rougeLsum"] * 100

    # --- Readability ---
    readability_scores = [textstat.flesch_reading_ease(pred) for pred in decoded_preds]
    avg_readability = np.mean(readability_scores)

    # --- Average Length ---
    prediction_lens = [len(pred.split()) for pred in decoded_preds]
    avg_length = np.mean(prediction_lens)

    return {
        "rouge1": round(rouge1, 4),
        "rouge2": round(rouge2, 4),
        "rougeL": round(rougeL, 4),
        "rougeLsum": round(rougeLsum, 4),
        "avg_readability": round(avg_readability, 2),
        "avg_length": round(avg_length, 2),
    }

# --- (5) Evaluation Args ---
eval_args = Seq2SeqTrainingArguments(
    output_dir="./eval_results",
    per_device_eval_batch_size=4,
    predict_with_generate=True,
    report_to="none",
)

# --- (6) Trainer ---
trainer = Seq2SeqTrainer(
    model=model,
    args=eval_args,
    eval_dataset=tokenized_datasets["test"],
    processing_class=tokenizer,  # replaces the deprecated tokenizer= argument in recent transformers releases
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# --- (7) Evaluate ---
print("\n🔎 Evaluating fine-tuned model...")
metrics = trainer.evaluate()

print("\n✅ Evaluation Results:")
for k, v in metrics.items():
    print(f"• {k}: {v:.4f}" if isinstance(v, (int, float)) else f"• {k}: {v}")

# ***Using the Models***

## ***Using the Fine-Tuned Model***

In [None]:
!pip install textstat



This section demonstrates interactive inference with the fine-tuned google-t5/t5-small model for automatic news headline generation. The model and tokenizer are loaded from the saved directory through the Hugging Face pipeline API for the summarization task, which handles tokenization, generation, and decoding in one call. The script then runs an interactive loop in which the user pastes a news article and receives a summary. Following the T5 convention, each input is prefixed with "summarize: " so the model recognizes the task. The script also times each call to report generation speed.

After each summary, the script reports efficiency and text-quality metrics. Generation time, tokens per second (approximated here by whitespace-separated words per second), and compression ratio describe speed and how strongly the article is condensed, while the redundancy ratio (the share of repeated words) reflects lexical diversity. Average sentence length and four readability indices computed with TextStat (Flesch Reading Ease, Gunning Fog Index, SMOG Index, and Automated Readability Index, or ARI) describe fluency and accessibility. Together these metrics indicate how clear and concise the generated summaries are, which gives a first indication of the model's suitability for AI-assisted journalism and content automation.
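
The word-level metrics are simple ratios, and a small worked example makes their behavior easier to see. The sketch below reproduces the redundancy and sentence-length formulas from the cell that follows on an invented summary; the function name and the input word count are hypothetical.

```python
def summary_stats(summary_text, input_words):
    """Word-level metrics computed the same way as in the interactive loop."""
    words = summary_text.split()
    sentences = [s.strip() for s in summary_text.split('.') if s.strip()]
    return {
        "summary_words": len(words),
        "compression_ratio": len(words) / input_words if input_words else 0,
        "redundancy_ratio": 1 - len(set(words)) / len(words) if words else 0,
        "avg_sentence_length": (
            sum(len(s.split()) for s in sentences) / len(sentences) if sentences else 0
        ),
    }

# Invented summary with punctuation separated by spaces
stats = summary_stats("the city council approved the budget . the mayor praised the vote .", 130)
print(stats["summary_words"])                 # 13
print(round(stats["redundancy_ratio"], 4))    # 0.3077
print(stats["avg_sentence_length"])           # 5.5
print(stats["compression_ratio"])             # 0.1
```

Two caveats follow from the arithmetic: a standalone "." counts as a word in the word count and redundancy ratio, and the speed figure divides words, not model tokens, by the elapsed time.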

In [None]:
from transformers import pipeline
import time
import textstat  # Make sure: !pip install textstat

# --- 1. Load Your Fine-Tuned T5 Model ---
try:
    model_path = "./my_finetuned_t_summarizer_no_ref"

    fine_tuned_summarizer = pipeline(
        "summarization",
        model=model_path,
        tokenizer=model_path
    )
    print("\n✅ Fine-Tuned Summarization Model Loaded")
    print(f"Loaded from: {model_path}")

    # --- 2. Interactive Loop ---
    while True:
        article_text = input("\nEnter an article to summarize (or 'quit' to exit): ")
        if article_text.lower() == "quit":
            print("👋 Exiting fine-tuned summarizer.")
            break
        if not article_text.strip():
            continue

        prefixed_text = "summarize: " + article_text
        start_time = time.time()

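        # --- Generate summary ---
        # max_new_tokens is used because the pipeline's default max_new_tokens
        # otherwise takes precedence over max_length (see the warning in the recorded output)
        result = fine_tuned_summarizer(prefixed_text, max_new_tokens=70, min_length=20, do_sample=False)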
        end_time = time.time()

        summary_text = result[0]["summary_text"]

        # --- (A) Core Metrics ---
        generation_time = end_time - start_time
        input_words = len(article_text.split())
        summary_words = len(summary_text.split())
        compression_ratio = summary_words / input_words if input_words else 0
        tokens_per_second = summary_words / generation_time if generation_time else 0

        # --- (B) Redundancy ---
        words = summary_text.split()
        redundancy_ratio = 1 - len(set(words)) / len(words) if words else 0

        # --- (C) Sentence Structure ---
        sentences = [s.strip() for s in summary_text.split('.') if s.strip()]
        avg_sentence_length = sum(len(s.split()) for s in sentences) / len(sentences) if sentences else 0

        # --- (D) Readability Metrics ---
        flesch = textstat.flesch_reading_ease(summary_text)
        gunning_fog = textstat.gunning_fog(summary_text)
        smog = textstat.smog_index(summary_text)
        ari = textstat.automated_readability_index(summary_text)

        # --- (E) Output ---
        print("\n🧾 --- Summary from Fine-Tuned Model ---")
        print(summary_text)
        print("-" * 20)
        print("📊 --- METRICS ---")
        print(f"• Generation Time: {generation_time:.2f} s")
        print(f"• Tokens per Second: {tokens_per_second:.2f}")
        print(f"• Word Count: {summary_words} (from {input_words} original)")
        print(f"• Compression Ratio: {compression_ratio:.2%}")
        print(f"• Avg Sentence Length: {avg_sentence_length:.2f} words")
        print(f"• Redundancy Ratio: {redundancy_ratio:.2%}")
        print(f"• Readability (Flesch): {flesch:.2f}")
        print(f"• Gunning Fog Index: {gunning_fog:.2f}")
        print(f"• SMOG Index: {smog:.2f}")
        print(f"• ARI: {ari:.2f}")
        print("-" * 60)

except OSError:
    print(f"⚠️ Error: Could not find the fine-tuned model at '{model_path}'.")
    print("Make sure the model was successfully trained and saved at that location.")
except Exception as e:
    print(f"⚠️ An unexpected error occurred: {e}")

Device set to use cuda:0



✅ Fine-Tuned Summarization Model Loaded
Loaded from: ./my_finetuned_t_summarizer_no_ref



Token indices sequence length is longer than the specified maximum sequence length for this model (5623 > 512). Running this sequence through the model will result in indexing errors
Both `max_new_tokens` (=256) and `max_length`(=70) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



🧾 --- Summary from Fine-Tuned Model ---
ex-model says she was groped by a photographer who allegedly raped her when she was 16 . 'i'm sick to my stomach,' she says .
--------------------
📊 --- METRICS ---
• Generation Time: 2.02 s
• Tokens per Second: 12.36
• Word Count: 25 (from 3427 original)
• Compression Ratio: 0.73%
• Avg Sentence Length: 11.50 words
• Redundancy Ratio: 20.00%
• Readability (Flesch): 77.46
• Gunning Fog Index: 9.82
• SMOG Index: 10.13
• ARI: 3.35
------------------------------------------------------------

Enter an article to summarize (or 'quit' to exit): 
⚠️ An unexpected error occurred: 


## ***Using the Model without Fine-tuning***

This section measures the baseline behavior of the generic pre-trained google-t5/t5-small model, without fine-tuning. The model and tokenizer are loaded through the Hugging Face pipeline for the "summarization" task, so no additional preprocessing or configuration is needed. Following the T5 convention, each input article is prefixed with "summarize: " so the model recognizes the task. The program runs in an interactive loop in which the user pastes an article and receives a summary. It also measures generation time and computes the compression ratio and tokens per second (approximated by words per second), which describe processing speed and how strongly the text is condensed.

After each summary, the script evaluates the output with the same intrinsic metrics used for the fine-tuned model: redundancy ratio, average sentence length, and the Flesch Reading Ease, Gunning Fog Index, SMOG Index, and Automated Readability Index (ARI) scores, all computed with the TextStat library. These measurements describe how fluent and accessible the summaries are. The baseline serves as a reference point for judging what fine-tuning changes, particularly in readability, conciseness, and structural consistency.

In [None]:
from transformers import pipeline
import time
import textstat
from transformers.utils import logging
logging.set_verbosity_error()

try:
    summarizer = pipeline("summarization", model="google-t5/t5-small", tokenizer="google-t5/t5-small")
    print("\n✅ Generic Pre-trained Summarization Model Loaded (google-t5/t5-small)")

    while True:
        article_text = input("\nEnter an article to summarize (or 'quit' to exit): ")
        if article_text.lower() == "quit":
            print("👋 Exiting generic summarizer.")
            break
        if not article_text.strip():
            continue

        prefixed = "summarize: " + article_text
        start_time = time.time()

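        # max_new_tokens bounds the summary length directly; in recent transformers
        # releases a max_length value can be overridden by the pipeline's default max_new_tokens
        result = summarizer(prefixed, max_new_tokens=50, min_length=5, do_sample=False)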
        end_time = time.time()

        summary_text = result[0]["summary_text"]

        # --- Metrics ---
        generation_time = end_time - start_time
        input_words = len(article_text.split())
        summary_words = len(summary_text.split())
        compression_ratio = summary_words / input_words if input_words else 0
        tokens_per_second = summary_words / generation_time if generation_time else 0

        words = summary_text.split()
        redundancy_ratio = 1 - len(set(words)) / len(words) if words else 0

        sentences = [s.strip() for s in summary_text.split('.') if s.strip()]
        avg_sentence_length = sum(len(s.split()) for s in sentences) / len(sentences) if sentences else 0

        # --- Readability ---
        flesch = textstat.flesch_reading_ease(summary_text)
        gunning_fog = textstat.gunning_fog(summary_text)
        smog = textstat.smog_index(summary_text)
        ari = textstat.automated_readability_index(summary_text)

        # --- Output ---
        print("\n🧾 --- Generated Summary ---")
        print(summary_text)
        print("-" * 20)
        print("📊 --- METRICS ---")
        print(f"• Generation Time: {generation_time:.2f} s")
        print(f"• Tokens per Second: {tokens_per_second:.2f}")
        print(f"• Word Count: {summary_words} (from {input_words} original)")
        print(f"• Compression Ratio: {compression_ratio:.2%}")
        print(f"• Avg Sentence Length: {avg_sentence_length:.2f} words")
        print(f"• Redundancy Ratio: {redundancy_ratio:.2%}")
        print(f"• Readability (Flesch): {flesch:.2f}")
        print(f"• Gunning Fog Index: {gunning_fog:.2f}")
        print(f"• SMOG Index: {smog:.2f}")
        print(f"• ARI: {ari:.2f}")
        print("-" * 60)

except Exception as e:
    print(f"⚠️ An error occurred: {e}")


✅ Generic Pre-trained Summarization Model Loaded (google-t5/t5-small)

Enter an article to summarize (or 'quit' to exit): MANILA – The Philippine National Police (PNP) on Friday said it is preparing a comprehensive security plan to secure the country's hosting of the 2026 Association of Southeast Asian Nations (ASEAN) Summit and Related Summits.  PNP acting chief Lt. Gen. Jose Melencio Nartatez, Jr., said this was in line with the directive of President Ferdinand R. Marcos, Jr. to uphold the national commitment to hosting regional and global engagements by ensuring the highest level of safety and security for all delegates and participants.  “We are already preparing as early as now. The PNP will be on alert as meetings for our hosting have commenced. We are updating our security playbook to ensure it can address any kind of eventuality, from traffic management to VIP protection,” said Nartatez in a statement.  The country’s top police official emphasized that the core of th