# Kinda Fast Summarizer

- To build a summarizer with 10-200ms latency, you need a lightweight pipeline optimized for speed. This notebook implements two approaches: extractive (fastest) and abstractive (better quality). The extractive method scores sentences with simple word statistics to keep latency minimal, while the abstractive option uses a distilled transformer model to balance quality and speed.

# Solution Overview
- Extractive Summarization (10-50ms):
Uses a simplified, BM25-inspired word-frequency score to select the top sentences (no deep learning). Ideal for strict latency needs.

- Abstractive Summarization (50-200ms):
Uses a distilled transformer (sshleifer/distilbart-cnn-12-6) for concise, human-like summaries.
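
The extractive code later in this notebook scores sentences by summed word frequency rather than by full BM25. For comparison, a BM25-style sentence scorer might look like the sketch below. It treats each sentence as a document and every distinct term of the text as the query. The function name `bm25_top_sentences` and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration; they do not appear elsewhere in the notebook.

```python
import math
import re
import string
from collections import Counter


def _tokenize(sentence: str) -> list[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (w.strip(string.punctuation).lower() for w in sentence.split())
    return [t for t in tokens if t]


def bm25_top_sentences(text: str, num_sentences: int = 3,
                       k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    if len(sentences) <= num_sentences:
        return sentences

    docs = [_tokenize(s) for s in sentences]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0
    # Document frequency: in how many sentences does each term occur?
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term, freq in tf.items():
            # Smoothed IDF that stays positive even for very common terms
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with sentence-length normalization
            norm = freq + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * freq * (k1 + 1) / norm
        scores.append(score)

    top = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]
```

Unlike a plain frequency sum, this variant down-weights terms that occur in most sentences and normalizes for sentence length, so long sentences do not win simply by containing more words.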

# Summarization in microseconds

In [None]:
class NanoSummarizer:
    """Ultra-fast summarizer for <1ms latency"""
    def summarize(self, text: str) -> str:
        # Find the first sentence boundary
        for i, char in enumerate(text):
            if char in '.!?':
                return text[:i+1]
        # If no sentence boundary found, return first 100 characters
        return text[:100] + '...' if len(text) > 100 else text

# Usage:
your_text = """
    Artificial intelligence is transforming industries.
    Machine learning models can now diagnose diseases faster than doctors.
    However, ethical concerns about AI bias remain unresolved.
    Researchers are developing techniques to make AI more transparent.
    The future of AI depends on balancing innovation with regulation.
    """
summarizer = NanoSummarizer()
summary = summarizer.summarize(your_text)

In [None]:
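import time

# perf_counter is a monotonic, high-resolution clock intended for timing short intervals
t1 = time.perf_counter()
s = summarizer.summarize(your_text)
t2 = time.perf_counter()
print(f"Summary: {s}\n time: {t2 - t1}")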

Summary: 
    Artificial intelligence is transforming industries.
 time: 8.821487426757812e-05
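
A single timing measurement at this scale (tens of microseconds) is noisy. A slightly more robust approach, sketched below, repeats the call many times and reports the median. The helper name `median_latency_ms`, the repeat count, and the stand-in function `first_sentence` (which mirrors the logic of `NanoSummarizer.summarize`) are assumptions for illustration.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(fn: Callable[[], object], repeats: int = 1000) -> float:
    """Run fn `repeats` times and return the median wall-clock latency in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


def first_sentence(text: str) -> str:
    """Stand-in for NanoSummarizer.summarize: cut at the first '.', '!' or '?'."""
    for i, char in enumerate(text):
        if char in '.!?':
            return text[:i + 1]
    return text[:100] + '...' if len(text) > 100 else text


sample = "Artificial intelligence is transforming industries. More text follows."
print(f"median latency: {median_latency_ms(lambda: first_sentence(sample)):.4f} ms")
```

The median is less sensitive than the mean to occasional slow runs caused by garbage collection or scheduler interruptions.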


In [None]:
import re
import time
import string
from collections import Counter
from typing import Literal, Optional
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
import torch

class LowLatencySummarizer:
    def __init__(self,
                 mode: Literal["extractive", "abstractive"] = "extractive",
                 device: Optional[str] = None):
        """
        Args:
            mode: "extractive" (fast, 10-50ms) or "abstractive" (50-200ms).
            device: "cpu", "cuda", or None (auto-detect).
        """
        self.mode = mode
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")

        if mode == "abstractive":
            model_name = "sshleifer/distilbart-cnn-12-6"
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(self.device)
            # Warm-up call so the first real request does not pay one-time setup costs
            self._summarize_abstractive("Warm up summary.", max_input_tokens=300, max_output_tokens=20)

    def summarize(self, text: str,
                 max_input_tokens: int = 512,
                 max_output_tokens: int = 100) -> str:
        """Summarize text within 10-200ms latency."""
        start_time = time.time()

        if self.mode == "extractive":
            summary = self._summarize_extractive(text, num_sentences=3)
        else:
            summary = self._summarize_abstractive(
                text,
                max_input_tokens=max_input_tokens,
                max_output_tokens=max_output_tokens
            )

        latency_ms = (time.time() - start_time) * 1000
        print(f"Latency: {latency_ms:.1f}ms | Mode: {self.mode}")
        return summary

    def _summarize_extractive(self, text: str, num_sentences: int = 3) -> str:
        """Extractive summarization using word-frequency scoring (fastest)."""
        # Split into sentences
        sentences = re.split(r'(?<=[.!?])\s+', text)
        if len(sentences) <= num_sentences:
            return text

        # Preprocess words
        words = [word.strip(string.punctuation).lower()
                 for word in text.split() if word.strip(string.punctuation)]
        word_freq = Counter(words)

        # Score each sentence by summing the document-wide frequency of its words
        # (a simplified, BM25-inspired heuristic, not full BM25)
        sentence_scores = []
        for sentence in sentences:
            sentence_words = [word.strip(string.punctuation).lower()
                             for word in sentence.split() if word.strip(string.punctuation)]
            score = sum(word_freq.get(word, 0) for word in sentence_words)
            sentence_scores.append(score)

        # Select top sentences (preserving order)
        top_indices = sorted(
            range(len(sentence_scores)),
            key=lambda i: sentence_scores[i],
            reverse=True
        )[:num_sentences]
        top_indices.sort()

        return " ".join(sentences[i] for i in top_indices)

    def _summarize_abstractive(self, text: str,
                              max_input_tokens: int,
                              max_output_tokens: int) -> str:
        """Abstractive summarization using DistilBART (quality-speed tradeoff)."""
        # Tokenize and truncate
        inputs = self.tokenizer(
            text,
            max_length=max_input_tokens,
            truncation=True,
            return_tensors="pt"
        ).to(self.device)

        # Generate summary with greedy decoding (fastest)
        summary_ids = self.model.generate(
            inputs.input_ids,
            max_length=max_output_tokens,
            min_length=0,  # Override the model's default min_length so it cannot exceed max_length
            num_beams=1,   # Greedy for minimal latency
        )

        return self.tokenizer.decode(
            summary_ids[0],
            skip_special_tokens=True
        )



In [None]:
# Usage Example
if __name__ == "__main__":
    text = """
    Artificial intelligence is transforming industries.
    Machine learning models can now diagnose diseases faster than doctors.
    However, ethical concerns about AI bias remain unresolved.
    Researchers are developing techniques to make AI more transparent.
    The future of AI depends on balancing innovation with regulation.
    """

    # For <50ms latency (extractive)
    summarizer_fast = LowLatencySummarizer(mode="extractive")
    print("Extractive Summary:", summarizer_fast.summarize(text))

    # For 50-200ms latency (abstractive, better quality)
    summarizer_quality = LowLatencySummarizer(mode="abstractive", device="cpu")
    print("Abstractive Summary:", summarizer_quality.summarize(text))

Latency: 0.1ms | Mode: extractive
Extractive Summary: Machine learning models can now diagnose diseases faster than doctors. Researchers are developing techniques to make AI more transparent. The future of AI depends on balancing innovation with regulation.


The following generation flags are not valid and may be ignored: ['early_stopping', 'length_penalty']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['early_stopping', 'length_penalty']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Latency: 2023.4ms | Mode: abstractive
Abstractive Summary:  Artificial intelligence is transforming industries . Ethic concerns about AI bias remain unresolved . Future of AI depends on balancing innovation with regulation .


In [None]:
text= """In 2013 and 2014, end-to-end neural machine translation had their breakthrough with Kalchbrenner & Blunsom using a convolutional neural network (CNN) for encoding the source[18] and both Cho et al. and Sutskever et al. using a recurrent neural network (RNN) instead.[19][20] All three used an RNN conditioned on a fixed encoding of the source as their decoder to produce the translation. However, these models performed poorly on longer sentences.[21]: 107 [1]: 39 [2]: 7  This problem was addressed when Bahdanau et al. introduced attention to their encoder-decoder architecture: At each decoding step, the state of the decoder is used to calculate a source representation that focuses on different parts of the source and uses that representation in the calculation of the probabilities for the next token.[22] Based on these RNN-based architectures, Baidu launched the "first large-scale NMT system"[23]: 144  in 2015, followed by Google Neural Machine Translation in 2016.[23]: 144 [24] From that year on, neural models also became the prevailing choice in the main machine translation conference Workshop on Statistical Machine Translation.[25]

Gehring et al. combined a CNN encoder with an attention mechanism in 2017, which handled long-range dependencies in the source better than previous approaches and also increased translation speed because a CNN encoder is parallelizable, whereas an RNN encoder has to encode one token at a time due to its recurrent nature.[26]: 230  In the same year, “Microsoft Translator released AI-powered online neural machine translation (NMT).[27] DeepL Translator, which was at the time based on a CNN encoder, was also released in the same year and was judged by several news outlets to outperform its competitors.[28][29][30] It has also been seen that OpenAI's GPT-3 released in 2020 can function as a neural machine translation system. Some other machine translation systems, such as Microsoft translator and SYSTRAN can be also seen to have integrated neural networks into their operations.
Transformer
Main article: Transformer (deep learning architecture)
Another network architecture that lends itself to parallelization is the transformer, which was introduced by Vaswani et al. also in 2017.[31] Like previous models, the transformer still uses the attention mechanism for weighting encoder output for the decoding steps. However, the transformer's encoder and decoder networks themselves are also based on attention instead of recurrence or convolution: Each layer weighs and transforms the previous layer's output in a process called self-attention. Since the attention mechanism does not have any notion of token order, but the order of words in a sentence is obviously relevant, the token embeddings are combined with an explicit encoding of their position in the sentence.[2]: 15 [6]: 7  Since both the transformer's encoder and decoder are free from recurrent elements, they can both be parallelized during training. However, the original transformer's decoder is still auto-regressive, which means that decoding still has to be done one token at a time during inference.

The transformer model quickly became the dominant choice for machine translation systems[2]: 44  and was still by far the most-used architecture in the Workshop on Statistical Machine Translation in 2022 and 2023.[32]: 35–40 [33]: 28–31

Usually, NMT models’ weights are initialized randomly and then learned by training on parallel datasets. However, since using large language models (LLMs) such as BERT pre-trained on large amounts of monolingual data as a starting point for learning other tasks has proven very successful in wider NLP, this paradigm is also becoming more prevalent in NMT. This is especially useful for low-resource languages, where large parallel datasets do not exist.[4]: 689–690  An example of this is the mBART model, which first trains one transformer on a multilingual dataset to recover masked tokens in sentences, and then fine-tunes the resulting autoencoder on the translation task.[34]

Generative LLMs
Instead of fine-tuning a pre-trained language model on the translation task, sufficiently large generative models can also be directly prompted to translate a sentence into the desired language. This approach was first comprehensively tested and evaluated for GPT 3.5 in 2023 by Hendy et al. They found that "GPT systems can produce highly fluent and competitive translation outputs even in the zero-shot setting especially for the high-resource language translations".[35]: 22  The WMT23 evaluated the same approach (but using GPT-4) and found that it was on par with the state of the art when translating into English, but not quite when translating into lower-resource languages.[33]: 16–17  This is plausible considering that GPT models are trained mainly on English text.[36]

Comparison with statistical machine translation
NMT has overcome several challenges that were present in statistical machine translation (SMT):

NMT's full reliance on continuous representation of tokens overcame sparsity issues caused by rare words or phrases. Models were able to generalize more effectively.[18]: 1 [37]: 900–901
The limited n-gram length used in SMT's n-gram language models caused a loss of context. NMT systems overcome this by not having a hard cut-off after a fixed number of tokens and by using attention to choosing which tokens to focus on when generating the next token.[37]: 900–901
End-to-end training of a single model improved translation performance and also simplified the whole process.[citation needed]
The huge n-gram models (up to 7-gram) used in SMT required large amounts of memory,[38]: 88  whereas NMT requires less.
Training procedure """

In [None]:
print("Abstractive Summary:", summarizer_quality.summarize(text))

The following generation flags are not valid and may be ignored: ['early_stopping', 'length_penalty']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Latency: 7234.5ms | Mode: abstractive
Abstractive Summary:  In 2013 and 2014, end-to-end neural machine translation had their breakthrough with Kalchbrenner & Blunsom using a convolutional neural network (CNN) for encoding the source . These models performed poorly on longer sentences . Baidu launched the "first large-scale NMT system" in 2015, followed by Google Neural Machine Translation in 2016 .


# We will try something more
- Summarization quality is okay, but latency is not: the abstractive CPU runs above took about 2.0 s and 7.2 s, far above the 200ms target.

In [None]:
!pip install onnxruntime onnx

Collecting onnx
  Downloading onnx-1.18.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.9 kB)
Downloading onnx-1.18.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.6 MB)
Installing collected packages: onnx
Successfully installed onnx-1.18.0


In [None]:
import re
import time
import string
from collections import Counter
from typing import Literal, Optional
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

class UltraLowLatencySummarizer:
    def __init__(self,
                 mode: Literal["extractive", "abstractive", "hybrid"] = "extractive",
                 device: Optional[str] = None):
        """
        Args:
            mode: "extractive" (fastest), "abstractive" (best quality), or "hybrid"
            device: "cpu", "cuda", or None (auto-detect)
        """
        self.mode = mode
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")

        # Initialize abstractive components for abstractive or hybrid modes
        if mode in ("abstractive", "hybrid"):
            # Use a smaller distilled checkpoint than distilbart-cnn-12-6
            model_name = "sshleifer/distilbart-cnn-6-6"
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

            # Apply optimizations
            if self.device == "cuda":
                self.model = self.model.half().to(self.device)  # FP16 for GPU
            else:
                # Apply dynamic quantization for CPU
                self.model = torch.quantization.quantize_dynamic(
                    self.model, {torch.nn.Linear}, dtype=torch.qint8
                )

            # Warm-up the model
            self._summarize_abstractive("Warm up.", 16, 8)

    def summarize(self, text: str,
                 max_input_tokens: int = 128,
                 max_output_tokens: int = 64) -> str:
        """Summarize text with ultra-low latency"""
        start_time = time.time()

        if self.mode == "extractive":
            summary = self._summarize_extractive_fast(text)
        elif self.mode == "abstractive":
            summary = self._summarize_abstractive(
                text, max_input_tokens, max_output_tokens
            )
        else:  # hybrid
            extracted = self._summarize_extractive_fast(text)
            summary = self._summarize_abstractive(
                extracted, max_input_tokens, max_output_tokens
            )

        latency_ms = (time.time() - start_time) * 1000
        print(f"Latency: {latency_ms:.1f}ms | Mode: {self.mode} | Device: {self.device}")
        return summary

    def _summarize_extractive_fast(self, text: str) -> str:
        """Ultra-fast extractive summarization (0.1-1ms)"""
        # Simple approach: first few sentences
        sentences = re.split(r'(?<=[.!?])\s+', text)
        return " ".join(sentences[:2])

    def _summarize_abstractive(self, text: str,
                              max_input_tokens: int,
                              max_output_tokens: int) -> str:
        """Optimized abstractive summarization"""
        # Tokenize input
        inputs = self.tokenizer(
            text,
            max_length=max_input_tokens,
            truncation=True,
            return_tensors="pt",
            padding=True
        )

        # Move input tensors to the model's device
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        # Generate summary with greedy decoding
        summary_ids = self.model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=max_output_tokens,
            min_length=8,
            num_beams=1,      # Greedy search for minimal latency
            do_sample=False,  # Deterministic decoding
            # early_stopping and length_penalty only affect beam search, so they are omitted
        )

        return self.tokenizer.decode(
            summary_ids[0],
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True
        )

# Usage Example
if __name__ == "__main__":
    text = """
    Artificial intelligence is transforming industries.
    Machine learning models can now diagnose diseases faster than doctors.
    However, ethical concerns about AI bias remain unresolved.
    Researchers are developing techniques to make AI more transparent.
    The future of AI depends on balancing innovation with regulation.
    """ * 2  # Make text longer for testing

    # Ultra-fast extractive summarization (~0.1ms)
    summarizer_fast = UltraLowLatencySummarizer(mode="extractive")
    print("Extractive Summary:", summarizer_fast.summarize(text))

    # Optimized abstractive CPU summarization
    summarizer_cpu = UltraLowLatencySummarizer(mode="abstractive", device="cpu")
    print("Abstractive CPU Summary:", summarizer_cpu.summarize(text))

    # GPU-accelerated abstractive summarization
    if torch.cuda.is_available():
        summarizer_gpu = UltraLowLatencySummarizer(mode="abstractive", device="cuda")
        print("Abstractive GPU Summary:", summarizer_gpu.summarize(text))

    # Hybrid approach (extractive + abstractive)
    summarizer_hybrid = UltraLowLatencySummarizer(mode="hybrid")
    print("Hybrid Summary:", summarizer_hybrid.summarize(text))

Latency: 0.0ms | Mode: extractive | Device: cuda
Extractive Summary: 
    Artificial intelligence is transforming industries. Machine learning models can now diagnose diseases faster than doctors.


The following generation flags are not valid and may be ignored: ['early_stopping', 'length_penalty']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['early_stopping', 'length_penalty']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Latency: 1960.1ms | Mode: abstractive | Device: cpu
Abstractive CPU Summary:  The world of AI. The future of AI is the future of the world. The world's world of the AI. Researchers are developing techniques to make AI more transparent. The technology is a. The Future of AI and the world of technology is the most of the technology of the industry. The


The following generation flags are not valid and may be ignored: ['early_stopping', 'length_penalty']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['early_stopping', 'length_penalty']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Latency: 184.9ms | Mode: abstractive | Device: cuda
Abstractive GPU Summary:  The future of AI depends on balancing innovation with regulation. Researchers are developing techniques to make AI more transparent.


The following generation flags are not valid and may be ignored: ['early_stopping', 'length_penalty']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['early_stopping', 'length_penalty']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Latency: 198.2ms | Mode: hybrid | Device: cuda
Hybrid Summary:  Machine learning models can now diagnose diseases faster than doctors. Machine learning is transforming industries.


# Multilinguality
- The earlier models are English-only.
- Here we switch to a different model that supports 45+ languages.
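
Multilingual input often arrives with inconsistent newlines and indentation, which can waste part of a small token budget. A common preprocessing step is to collapse whitespace before tokenizing. The sketch below shows one way to do that; the helper name `normalize_whitespace` is hypothetical and is not used by the class in the next cell.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r'\s+', ' ', text.strip())


raw = """
    La inteligencia artificial está
    transformando industrias en todo el mundo.
"""
print(normalize_whitespace(raw))
```

The cleaned string could then be passed to the summarizer in place of the raw text.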

In [None]:
import time
from typing import Optional

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

class MultilingualLowLatencySummarizer:
    def __init__(self,
                 device: Optional[str] = None,
                 quantize_cpu: bool = True):
        """
        Args:
            device: "cpu", "cuda", or None (auto-detect)
            quantize_cpu: Use 8-bit quantization for CPU inference
        """
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")

        # Use multilingual model (supports 45+ languages)
        model_name = "csebuetnlp/mT5_multilingual_XLSum"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

        # Apply optimizations
        if self.device == "cuda":
            self.model = self.model.half().to(self.device)  # FP16 for GPU
        elif quantize_cpu:
            # Apply dynamic quantization for CPU
            self.model = torch.quantization.quantize_dynamic(
                self.model, {torch.nn.Linear}, dtype=torch.qint8
            )

        # Warm-up the model
        self.summarize("Initialisation.", max_input_tokens=16, max_output_tokens=8)

    def summarize(self, text: str,
                 max_input_tokens: int = 128,
                 max_output_tokens: int = 64) -> str:
        """Summarize text in any language with low latency"""
        start_time = time.time()

        # Tokenize input
        inputs = self.tokenizer(
            text,
            max_length=max_input_tokens,
            truncation=True,
            return_tensors="pt",
            padding=True
        ).to(self.device)

        # Generate summary with greedy decoding
        summary_ids = self.model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=max_output_tokens,
            min_length=8,
            num_beams=1,      # Greedy search for minimal latency
            do_sample=False   # Deterministic decoding
        )

        summary = self.tokenizer.decode(
            summary_ids[0],
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True
        )

        latency_ms = (time.time() - start_time) * 1000
        print(f"Latency: {latency_ms:.1f}ms | Device: {self.device}")
        return summary

# Usage Example
if __name__ == "__main__":
    # English text
    en_text = "Artificial intelligence is transforming industries worldwide."

    # Spanish text
    es_text = "La inteligencia artificial está transformando industrias en todo el mundo."

    # Chinese text
    zh_text = "人工智能正在改变全世界的产业。"

    summarizer = MultilingualLowLatencySummarizer()

    print("English Summary:", summarizer.summarize(en_text))
    print("Spanish Summary:", summarizer.summarize(es_text))
    print("Chinese Summary:", summarizer.summarize(zh_text))


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


The following generation flags are not valid and may be ignored: ['early_stopping', 'length_penalty']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['early_stopping', 'length_penalty']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Latency: 1108.6ms | Device: cuda


The following generation flags are not valid and may be ignored: ['early_stopping', 'length_penalty']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Latency: 619.2ms | Device: cuda
English Summary: Artificial intelligence is a growing technology that has changed the world.


The following generation flags are not valid and may be ignored: ['early_stopping', 'length_penalty']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Latency: 775.3ms | Device: cuda
Spanish Summary: La inteligencia artificial es una nueva herramienta para crear robots.
Latency: 548.1ms | Device: cuda
Chinese Summary: 人工智能在全球蔓延。


In [None]:
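# Note: `text` is still the long English passage defined earlier, so all three
# calls summarize the same English input regardless of the label printed.
print("English Summary:", summarizer.summarize(text))
print("Spanish Summary:", summarizer.summarize(text))
print("Chinese Summary:", summarizer.summarize(text))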

The following generation flags are not valid and may be ignored: ['early_stopping', 'length_penalty']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['early_stopping', 'length_penalty']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Latency: 883.9ms | Device: cuda
English Summary: These are the latest examples of a new wave of neural machine translation.


The following generation flags are not valid and may be ignored: ['early_stopping', 'length_penalty']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Latency: 767.5ms | Device: cuda
Spanish Summary: These are the latest examples of a new wave of neural machine translation.
Latency: 657.4ms | Device: cuda
Chinese Summary: These are the latest examples of a new wave of neural machine translation.
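
All three calls above summarize the same English `text`, and with the default `max_input_tokens=128` the tokenizer truncates that long passage, so the model only ever sees its opening. One way to cover the whole input is to split it into sentence-aligned chunks and summarize each chunk separately. The sketch below uses a whitespace word count as a rough stand-in for the token count. The helper name `chunk_by_sentences` and the default budget of 80 words are assumptions; a real implementation would measure length with the model's tokenizer.

```python
import re


def chunk_by_sentences(text: str, max_words: int = 80) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own (oversized) chunk.
    """
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = "One two three. Four five six seven. Eight nine. Ten."
for chunk in chunk_by_sentences(demo, max_words=5):
    print(chunk)
```

Each chunk could then be passed to `summarizer.summarize(chunk)` and the partial summaries joined, at the cost of one model call per chunk, which multiplies the total latency accordingly.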
