# Text Summarization System

This notebook demonstrates how to build a system that summarizes lengthy articles using extractive and abstractive methods.

**Dataset:** CNN/Daily Mail

**Methods Used:**
- Extractive Summarization using spaCy + TF-IDF
- Abstractive Summarization using BART from HuggingFace
- Evaluation using ROUGE

In [1]:
# Install dependencies (run only once)
!pip install transformers datasets spacy scikit-learn
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)

In [2]:
# Load Dataset
from datasets import load_dataset
dataset = load_dataset("cnn_dailymail", "3.0.0", verification_mode="no_checks")
train_data = dataset['train'].select(range(100))  # Using subset for speed

# Sample article and reference summary
article = train_data[0]['article']
reference = train_data[0]['highlights']
print("Sample Article:\n", article[:500])
print("\nReference Summary:\n", reference)

Sample Article:
 LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as s

Reference Summary:
 Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday . Young actor says he has no plans to fritter his cash away . Radcliffe's earnings from first five Potter films have been held in trust fund .


## Extractive Summarization using spaCy + TF-IDF

In [3]:
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

nlp = spacy.load("en_core_web_sm")

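def extractive_summary(article, top_n=3):
    """Return the top_n highest-scoring sentences, in their original order."""
    doc = nlp(article)
    # Skip very short fragments (20 characters or fewer)
    sentences = [sent.text.strip() for sent in doc.sents if len(sent.text.strip()) > 20]
    if not sentences:
        return ""  # TfidfVectorizer raises on empty input
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(sentences)
    # Score each sentence by the sum of its TF-IDF weights
    sentence_scores = np.asarray(X.sum(axis=1)).ravel()
    top_indices = sentence_scores.argsort()[-top_n:]
    # Restore document order so the summary reads naturally
    return ' '.join(sentences[i] for i in sorted(top_indices))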

print("Extractive Summary:\n", extractive_summary(article))

Extractive Summary:
 Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart.
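
The function above scores each sentence by summing its TF-IDF weights, which may favor sentences with many distinct terms. The following is a minimal pure-Python sketch of that scoring idea, with an optional length normalization added for comparison. The helper names (`tokenize`, `tfidf_sentence_scores`), the regex tokenizer, and the `normalize_by_length` flag are assumptions for illustration. The sketch uses the smoothed IDF formula and skips the L2 row normalization that `TfidfVectorizer` applies by default, so its scores will differ from the notebook's.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters (a simplification of real tokenizers)."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_sentence_scores(sentences, normalize_by_length=False):
    """Score each sentence by the sum (or per-token mean) of its TF-IDF weights."""
    tokenized = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Document frequency: number of sentences that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        total = sum(
            count * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        )
        if normalize_by_length and tokens:
            total /= len(tokens)  # hypothetical variant: mean weight per token
        scores.append(total)
    return scores


sentences = [
    "Cats purr",
    "Cats and dogs and birds and fish all make noise",
]
raw = tfidf_sentence_scores(sentences)
normalized = tfidf_sentence_scores(sentences, normalize_by_length=True)
print("Summed scores:    ", raw)
print("Per-token scores: ", normalized)
```

With summed scores the longer sentence scores far higher, while the per-token scores are much closer together. Whether normalization yields better summaries on CNN/Daily Mail is not something this sketch establishes.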


## Abstractive Summarization using BART

In [4]:
!pip install torch



In [5]:
%pip install --upgrade typing_extensions


Note: you may need to restart the kernel to use updated packages.


In [6]:
import torch
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())


2.7.0+cpu
CUDA available: False


In [7]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

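def abstractive_summary(text):
    # Deterministic decoding (no sampling); max_length and min_length are counted in tokens
    return summarizer(text, max_length=130, min_length=30, do_sample=False)[0]['summary_text']

# Note: article[:1024] keeps the first 1024 characters of the article, not 1024 tokens
print("Abstractive Summary:\n", abstractive_summary(article[:1024]))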

Abstractive Summary:
 Harry Potter star Daniel Radcliffe turns 18 on Monday. He gains access to a reported £20 million ($41.1 million) fortune. Radcliffe says he has no plans to fritter his cash away on fast cars, drink.
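
The cell above passes only the first 1024 characters of the article to BART, so anything later in the article never reaches the model. One possible workaround is to split the article into chunks and summarize each one. The sketch below is only an illustration: `chunk_text`, its `max_chars` budget, and the regex-based sentence split are hypothetical choices, not part of the notebook.

```python
import re


def chunk_text(text, max_chars=1024):
    """Split text into chunks of at most max_chars, breaking at sentence ends where possible."""
    # Naive split after ., ! or ? followed by whitespace (an assumption;
    # spaCy's doc.sents would be more robust)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence longer than the budget is hard-split
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_text("One. Two two. Three three three.", max_chars=15)
print(chunks)
```

Each chunk could then be passed to `abstractive_summary` and the partial summaries joined. This budget is still measured in characters rather than tokens, so it only approximates the model's input limit.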


In [9]:
!pip install rouge-score absl-py

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting absl-py
  Obtaining dependency information for absl-py from https://files.pythonhosted.org/packages/f6/d4/349f7f4bd5ea92dab34f5bb0fe31775ef6c311427a14d5a5b31ecb442341/absl_py-2.2.2-py3-none-any.whl.metadata
  Downloading absl_py-2.2.2-py3-none-any.whl.metadata (2.6 kB)
Downloading absl_py-2.2.2-py3-none-any.whl (135 kB)
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py): started
  Build

## Evaluation using ROUGE
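
ROUGE measures n-gram overlap between a generated summary and a reference summary, reported as precision, recall, and F-measure. To make those three numbers concrete, here is a minimal sketch of a simplified ROUGE-1 (unigram) calculation. The function name `rouge1_sketch` is hypothetical, and the sketch uses plain whitespace tokenization with no stemming, so its values are not expected to match the `rouge-score` package exactly.

```python
from collections import Counter


def rouge1_sketch(prediction, reference):
    """Unigram-overlap precision, recall, and F1 (a simplified ROUGE-1)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


p, r, f = rouge1_sketch("the cat sat", "the cat sat on the mat")
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this toy example every predicted word appears in the reference (precision 1.0), but the prediction covers only half of the reference's six words (recall 0.5).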

In [10]:
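# Note: load_metric is deprecated in recent releases of datasets and was later removed;
# the evaluate library's evaluate.load("rouge") is its replacement
from datasets import load_metric

rouge = load_metric("rouge")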
generated = abstractive_summary(article[:1024])
scores = rouge.compute(predictions=[generated], references=[reference])
print("ROUGE Scores:\n", scores)

ROUGE Scores:
 {'rouge1': AggregateScore(low=Score(precision=0.5833333333333334, recall=0.5384615384615384, fmeasure=0.5599999999999999), mid=Score(precision=0.5833333333333334, recall=0.5384615384615384, fmeasure=0.5599999999999999), high=Score(precision=0.5833333333333334, recall=0.5384615384615384, fmeasure=0.5599999999999999)), 'rouge2': AggregateScore(low=Score(precision=0.4, recall=0.3684210526315789, fmeasure=0.3835616438356164), mid=Score(precision=0.4, recall=0.3684210526315789, fmeasure=0.3835616438356164), high=Score(precision=0.4, recall=0.3684210526315789, fmeasure=0.3835616438356164)), 'rougeL': AggregateScore(low=Score(precision=0.5, recall=0.46153846153846156, fmeasure=0.48000000000000004), mid=Score(precision=0.5, recall=0.46153846153846156, fmeasure=0.48000000000000004), high=Score(precision=0.5, recall=0.46153846153846156, fmeasure=0.48000000000000004)), 'rougeLsum': AggregateScore(low=Score(precision=0.5, recall=0.46153846153846156, fmeasure=0.48000000000000004), mi