In [1]:
import nltk
nltk.download("punkt", quiet=True)

True

In [19]:
text = """
OpenAI released a new model for language tasks, improving accuracy and efficiency across benchmarks.
Analysts say the upgrade could reduce inference costs for enterprises and unlock new applications in education, healthcare, and customer support.
However, some researchers warn about risks around misinformation and bias, urging better evaluations and safety tooling.
Investors reacted positively, with several AI stocks gaining during afternoon trading.
The company also announced partnerships with universities to study the model's impact on learning outcomes.
"""


# 2) Quick Baseline: LEAD-k (news-style)

In [4]:
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to C:\Users\Bhanu
[nltk_data]     Bisht\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to C:\Users\Bhanu
[nltk_data]     Bisht\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [5]:
from nltk.tokenize import sent_tokenize

def lead_k_summary(text, k=3):
    sents = sent_tokenize(text)
    return " ".join(sents[:k])

print(lead_k_summary(text, k=2))



OpenAI released a new model for language tasks, improving accuracy and efficiency across benchmarks. Analysts say the upgrade could reduce inference costs for enterprises and unlock new applications in education, healthcare, and customer support.


# 3) Extractive: TextRank (TF-IDF + PageRank)

Idea:

1. Split the text into sentences.
2. Vectorize the sentences with TF-IDF.
3. Build a similarity graph from pairwise cosine similarity.
4. Run PageRank to score the sentences.
5. Select the top-N sentences and restore their original order.

In [7]:
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

def textrank_summarize(text, num_sentences=3, min_sent_len=20):
    sents = [s.strip() for s in sent_tokenize(text)]
    if len(sents) <= num_sentences:
        return text.strip()
    
    # Drop sentences shorter than min_sent_len characters (often noisy)
    kept = [s for s in sents if len(s) >= min_sent_len]
    if len(kept) < num_sentences:
        kept = sents  # fall back to all
    
    # TF-IDF sentence vectors
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(kept)
    
    # Similarity graph
    sim = cosine_similarity(X)
    # Remove self-similarity
    for i in range(sim.shape[0]):
        sim[i, i] = 0.0
    
    # PageRank
    graph = nx.from_numpy_array(sim)
    scores = nx.pagerank(graph)
    
    # Rank sentences by score
    ranked = sorted(((scores[i], i, sent) for i, sent in enumerate(kept)), reverse=True)
    top = sorted(ranked[:num_sentences], key=lambda x: x[1])  # keep original order
    
    return " ".join(s for _, _, s in top)

print(textrank_summarize(text, num_sentences=3))


OpenAI released a new model for language tasks, improving accuracy and efficiency across benchmarks. Analysts say the upgrade could reduce inference costs for enterprises and unlock new applications in education, healthcare, and customer support. The company also announced partnerships with universities to study the model's impact on learning outcomes.
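
To make the PageRank step in `textrank_summarize` less of a black box, here is a minimal, dependency-free sketch of weighted PageRank by power iteration. It is intended to mirror what `nx.pagerank` does with its default damping of 0.85, but it is an illustration rather than the library's implementation. The function name `pagerank_power_iteration` and the toy similarity values are made up for this example.

```python
def pagerank_power_iteration(sim, damping=0.85, tol=1e-8, max_iter=200):
    """Score the nodes of a weighted, undirected similarity graph.

    `sim` is a square list of lists with zeros on the diagonal.
    """
    n = len(sim)
    # Each node spreads its score in proportion to its edge weights.
    row_sums = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        # Score held by nodes without edges is spread uniformly.
        dangling = sum(scores[i] for i in range(n) if row_sums[i] == 0)
        new_scores = []
        for j in range(n):
            incoming = sum(
                scores[i] * sim[i][j] / row_sums[i]
                for i in range(n)
                if row_sums[i] > 0
            )
            new_scores.append(
                (1 - damping) / n + damping * (incoming + dangling / n)
            )
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores

# Hypothetical similarity matrix for four sentences (values are made up).
toy_sim = [
    [0.0, 0.5, 0.4, 0.0],
    [0.5, 0.0, 0.3, 0.0],
    [0.4, 0.3, 0.0, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]
toy_scores = pagerank_power_iteration(toy_sim)
toy_ranking = sorted(range(len(toy_scores)), key=lambda i: toy_scores[i], reverse=True)
print([round(s, 3) for s in toy_scores], toy_ranking)
```

Sentences that are similar to many other sentences collect the most score, which is why the weakly connected last sentence in the toy matrix ranks last.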


## Reuse TextRank on a list of texts

In [8]:
def summarize_batch_textrank(docs, n=3):
    return [textrank_summarize(d, num_sentences=n) for d in docs]


# 4) Abstractive: Transformer (DistilBART)

This step requires the `transformers` library plus a backend such as PyTorch (`torch`). DistilBART is a distilled BART checkpoint: smaller and faster than the full model, with good summary quality.

In [20]:
from transformers import pipeline

# Small, fast summarizer model
MODEL = "sshleifer/distilbart-cnn-12-6"
summarizer = pipeline("summarization", model=MODEL, framework="pt")  # framework="pt" selects PyTorch

def bart_summarize(text, max_words=120, min_words=40):
    # Despite the parameter names, both limits are passed to the model as token counts.
    # Words and subword tokens are only roughly 1:1, so tune the two values
    # until you like the output length.
    result = summarizer(
        text,
        max_length=max_words,   # upper bound (tokens)
        min_length=min_words,   # lower bound (tokens)
        do_sample=False,        # deterministic (greedy/beam)
        truncation=True
    )
    return result[0]["summary_text"]

print(bart_summarize(text, max_words=80, min_words=30))


Device set to use cpu


 Analysts say the upgrade could reduce inference costs for enterprises and unlock new applications in education, healthcare, and customer support . However, some researchers warn about risks around misinformation and bias .




# 5) Long documents: chunking + 2-pass summarization

Many encoder-decoder summarizers, including the BART family used here, accept only about 1,024 input tokens. For longer inputs:

1. Split the text into sentence-based chunks that fit a token budget.
2. Summarize each chunk.
3. Summarize the concatenation of the chunk summaries (“summary of summaries”).

In [21]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")  # same checkpoint as the summarizer

def chunk_by_tokens(text, max_tokens=900):
    sents = sent_tokenize(text)
    chunk, count = [], 0
    for s in sents:
        t = tokenizer.encode(s, add_special_tokens=False)
        if count + len(t) <= max_tokens:
            chunk.append(s)
            count += len(t)
        else:
            if chunk:  # skip the empty chunk produced when the first sentence is over budget
                yield " ".join(chunk)
            chunk, count = [s], len(t)
    if chunk:
        yield " ".join(chunk)

def summarize_long_text(text, per_chunk_max=140, per_chunk_min=60, final_max=120, final_min=40):
    chunks = list(chunk_by_tokens(text, max_tokens=900))
    if len(chunks) == 1:
        return bart_summarize(chunks[0], max_words=final_max, min_words=final_min)
    partials = [bart_summarize(c, max_words=per_chunk_max, min_words=per_chunk_min) for c in chunks]
    mega = " ".join(partials)
    return bart_summarize(mega, max_words=final_max, min_words=final_min)

# Example:
# long_summary = summarize_long_text(very_long_text)
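
The greedy packing behind `chunk_by_tokens` can be checked without downloading a model. The sketch below uses a whitespace word count as a stand-in for the tokenizer, which is an assumption made only so the example runs anywhere. `chunk_by_budget`, `word_count`, and the demo sentences are hypothetical names and data introduced for illustration.

```python
def chunk_by_budget(sentences, count_tokens, max_tokens):
    """Greedily pack sentences into chunks that stay within max_tokens."""
    chunk, count = [], 0
    for sent in sentences:
        n = count_tokens(sent)
        # Close the current chunk when adding this sentence would exceed the budget.
        if chunk and count + n > max_tokens:
            yield " ".join(chunk)
            chunk, count = [], 0
        chunk.append(sent)
        count += n
    if chunk:
        yield " ".join(chunk)

# Stand-in token counter (assumption): whitespace words, not model tokens.
def word_count(sentence):
    return len(sentence.split())

demo_sents = [
    "One two three.",
    "Four five.",
    "Six seven eight nine.",
    "Ten.",
]
demo_chunks = list(chunk_by_budget(demo_sents, word_count, max_tokens=5))
print(demo_chunks)
```

In this sketch a single sentence that exceeds the budget still becomes its own chunk, so the summarizer's `truncation=True` remains the safety net for that case.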


# 6) Compare methods quickly

In [22]:
print("LEAD-2:\n", lead_k_summary(text, 2), "\n")
print("TextRank (3):\n", textrank_summarize(text, 3), "\n")
print("Abstractive (BART):\n", bart_summarize(text, 80, 30))


LEAD-2:
 
OpenAI released a new model for language tasks, improving accuracy and efficiency across benchmarks. Analysts say the upgrade could reduce inference costs for enterprises and unlock new applications in education, healthcare, and customer support. 

TextRank (3):
 OpenAI released a new model for language tasks, improving accuracy and efficiency across benchmarks. Analysts say the upgrade could reduce inference costs for enterprises and unlock new applications in education, healthcare, and customer support. The company also announced partnerships with universities to study the model's impact on learning outcomes. 

Abstractive (BART):
  Analysts say the upgrade could reduce inference costs for enterprises and unlock new applications in education, healthcare, and customer support . However, some researchers warn about risks around misinformation and bias .


# 7) Evaluate with ROUGE (when you have reference summaries)

If you have gold/reference summaries, compute ROUGE:

In [25]:
from rouge_score import rouge_scorer

def rouge_all(reference, candidate):
    scorer = rouge_scorer.RougeScorer(["rouge1","rouge2","rougeL"], use_stemmer=True)
    scores = scorer.score(reference, candidate)
    # return F1s for readability
    return {k: round(v.fmeasure, 4) for k, v in scores.items()}

# Example:
ref = "OpenAI released a model; analysts expect lower costs and new applications. Some warn about risks; stocks rose; universities to study impact."
cand = bart_summarize(text, 80, 30)
print("ROUGE:", rouge_all(ref, cand))


ROUGE: {'rouge1': 0.3529, 'rouge2': 0.1224, 'rougeL': 0.3529}
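
To see what the `rouge1` number measures, here is a simplified hand computation of ROUGE-1 F1 as clipped unigram overlap. It is a sketch under stated assumptions: tokens are lowercased whitespace splits with no stemming, whereas `rouge_score` applies its own tokenizer and optional stemmer, so the values will not match the library exactly. The function name `rouge1_f1` and the example sentences are made up for illustration.

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """ROUGE-1 F1 from clipped unigram overlap (simplified tokenization)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per word (clipping).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

demo_f1 = rouge1_f1("the cat sat on the mat", "the cat lay on a mat")
print(round(demo_f1, 4))
```

Here four of the six candidate words also appear in the six-word reference, so precision and recall are both 4/6 and F1 is about 0.667.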


In [26]:
# If the rouge_score import fails, install the package first: pip install rouge-score