# Day 65b - Text Summarization with spaCy 

This notebook demonstrates a simple **extractive summarization** approach using **spaCy**.  
I measure word importance by frequency (ignoring stopwords and punctuation), score each sentence by summing the scores of its words, and select the top-scoring sentences as the summary.  
This is a lightweight, interpretable method suitable for short-to-medium documents.


## Steps
1. **Import libraries** — load spaCy and helpers.  
2. **Load model** — use `en_core_web_sm` or larger models for better parsing.  
3. **Load text** — provide the document to summarize.  
4. **Preprocess** — remove stopwords and punctuation.  
5. **Tokenize & Frequency** — compute normalized word frequencies.  
6. **Score sentences** — sum word scores per sentence.  
7. **Select top sentences** — choose top N sentences (percentage or fixed count).  
8. **Create summary** — join selected sentences in original order.

## Notes & Improvements
- This method is fast and interpretable, but it relies on surface word counts, so it can miss meaning, synonyms, and context.  
- For higher quality summaries, consider TF-IDF weighting, POS filtering, sentence-position heuristics, or transformer-based summarizers (BART/T5).

---

## 1. Import Libraries

In [1]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

## 2. Load spaCy Model and Prepare Text

In [2]:
# Load the English language model
nlp = spacy.load("en_core_web_sm")

# Example text
text = """There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.
An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. A related application is summarizing news articles. Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.
Image collection summarization is another application example of automatic summarization. It consists in selecting a representative set of images from a larger set of images.[4] A summary in this context is useful to show the most representative images of results in an image collection exploration system. Video summarization is a related domain, where the system automatically creates a trailer of a long video. This also has applications in consumer or personal videos, where one might want to skip the boring or repetitive actions. Similarly, in surveillance videos, one would want to extract important and suspicious activity, while ignoring all the boring and redundant frames captured """

# Process the text into a spaCy Doc object
doc = nlp(text)

In [3]:
doc

There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.
An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. A related application is summari

## 3. Stopwords & Punctuation Setup

In [4]:
stopwords = list(STOP_WORDS) 
len(stopwords)

326

In [5]:
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
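One detail worth keeping in mind: `punctuation` is a single string, so `x in punctuation` is a substring test rather than a per-character membership test. The sketch below uses only the standard library to show the difference and to build a set for exact lookups. The names `punct_chars` and `is_punct` are hypothetical and not part of the notebook.

```python
from string import punctuation

# `punctuation` is one string, so `in` checks for a substring.
print("()" in punctuation)    # True: "()" occurs inside the string
print("..." in punctuation)   # False: the string contains only a single "."

# A set of characters gives exact, per-character membership (hypothetical name).
punct_chars = set(punctuation)

def is_punct(token_text: str) -> bool:
    """Return True if the token is non-empty and made only of ASCII punctuation."""
    return bool(token_text) and all(ch in punct_chars for ch in token_text)

print(is_punct("..."))   # True
print(is_punct("etc."))  # False
```

spaCy tokens also expose `token.is_punct`, `token.is_stop`, and `token.is_space`, which could replace these string checks.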

## 4. Tokenization

In [6]:
tokens = [token.text for token in doc]
print(tokens) 

['There', 'are', 'broadly', 'two', 'types', 'of', 'extractive', 'summarization', 'tasks', 'depending', 'on', 'what', 'the', 'summarization', 'program', 'focuses', 'on', '.', 'The', 'first', 'is', 'generic', 'summarization', ',', 'which', 'focuses', 'on', 'obtaining', 'a', 'generic', 'summary', 'or', 'abstract', 'of', 'the', 'collection', '(', 'whether', 'documents', ',', 'or', 'sets', 'of', 'images', ',', 'or', 'videos', ',', 'news', 'stories', 'etc', '.', ')', '.', 'The', 'second', 'is', 'query', 'relevant', 'summarization', ',', 'sometimes', 'called', 'query', '-', 'based', 'summarization', ',', 'which', 'summarizes', 'objects', 'specific', 'to', 'a', 'query', '.', 'Summarization', 'systems', 'are', 'able', 'to', 'create', 'both', 'query', 'relevant', 'text', 'summaries', 'and', 'generic', 'machine', '-', 'generated', 'summaries', 'depending', 'on', 'what', 'the', 'user', 'needs', '.', '\n', 'An', 'example', 'of', 'a', 'summarization', 'problem', 'is', 'document', 'summarization', ',

In [7]:
tokens

['There',
 'are',
 'broadly',
 'two',
 'types',
 'of',
 'extractive',
 'summarization',
 'tasks',
 'depending',
 'on',
 'what',
 'the',
 'summarization',
 'program',
 'focuses',
 'on',
 '.',
 'The',
 'first',
 'is',
 'generic',
 'summarization',
 ',',
 'which',
 'focuses',
 'on',
 'obtaining',
 'a',
 'generic',
 'summary',
 'or',
 'abstract',
 'of',
 'the',
 'collection',
 '(',
 'whether',
 'documents',
 ',',
 'or',
 'sets',
 'of',
 'images',
 ',',
 'or',
 'videos',
 ',',
 'news',
 'stories',
 'etc',
 '.',
 ')',
 '.',
 'The',
 'second',
 'is',
 'query',
 'relevant',
 'summarization',
 ',',
 'sometimes',
 'called',
 'query',
 '-',
 'based',
 'summarization',
 ',',
 'which',
 'summarizes',
 'objects',
 'specific',
 'to',
 'a',
 'query',
 '.',
 'Summarization',
 'systems',
 'are',
 'able',
 'to',
 'create',
 'both',
 'query',
 'relevant',
 'text',
 'summaries',
 'and',
 'generic',
 'machine',
 '-',
 'generated',
 'summaries',
 'depending',
 'on',
 'what',
 'the',
 'user',
 'needs',
 '.',


In [8]:
len(tokens)

322

## 5. Word Frequency Calculation

- Count how often each non-stopword, non-punctuation token appears.  
- Normalize the counts by dividing by the maximum frequency so that scores fall between 0 and 1.  
- Use this normalized frequency as a simple proxy for word importance.

In [9]:
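# Count how many times each word occurs in the text,
# skipping stopwords and punctuation.
# Keys keep the token's original casing, so 'Summarization' and
# 'summarization' are stored as separate entries.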

word_frequencies = {}

for word in doc:
    if word.text.lower() not in stopwords:
        if word.text.lower() not in punctuation:
            if word.text not in word_frequencies.keys():
                word_frequencies[word.text] = 1
            else:
                word_frequencies[word.text] += 1
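
A more compact way to build the same kind of table is `collections.Counter`. The sketch below is a variant under stated assumptions, not the notebook's method: it works on a plain list of token strings (for example `[t.text for t in doc]`), lowercases every key so that `Summarization` and `summarization` share one count, and drops whitespace-only tokens such as `'\n'`. The function name `count_words` and the tiny stopword set are hypothetical.

```python
from collections import Counter
from string import punctuation

def count_words(tokens, stopwords):
    """Count lowercased tokens, skipping stopwords, punctuation, and whitespace."""
    punct_chars = set(punctuation)
    counts = Counter()
    for text in tokens:
        key = text.lower()
        if not key.strip():                       # whitespace-only token, e.g. "\n"
            continue
        if key in stopwords:
            continue
        if all(ch in punct_chars for ch in key):  # token made only of punctuation
            continue
        counts[key] += 1
    return counts

# Illustrative input; in the notebook the tokens would come from the spaCy Doc.
demo_tokens = ["Summarization", "systems", "create", "a", "summary", ".", "\n",
               "Generic", "summarization", "is", "one", "type", "."]
demo_stopwords = {"a", "is", "one"}   # hypothetical, much smaller than STOP_WORDS
print(count_words(demo_tokens, demo_stopwords).most_common(1))
# [('summarization', 2)]
```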

In [10]:
word_frequencies

{'broadly': 1,
 'types': 1,
 'extractive': 1,
 'summarization': 11,
 'tasks': 1,
 'depending': 2,
 'program': 1,
 'focuses': 2,
 'generic': 3,
 'obtaining': 1,
 'summary': 4,
 'abstract': 2,
 'collection': 3,
 'documents': 2,
 'sets': 1,
 'images': 3,
 'videos': 3,
 'news': 4,
 'stories': 1,
 'etc': 1,
 'second': 1,
 'query': 4,
 'relevant': 2,
 'called': 2,
 'based': 1,
 'summarizes': 1,
 'objects': 1,
 'specific': 1,
 'Summarization': 1,
 'systems': 1,
 'able': 1,
 'create': 1,
 'text': 1,
 'summaries': 2,
 'machine': 1,
 'generated': 1,
 'user': 1,
 'needs': 1,
 '\n': 2,
 'example': 3,
 'problem': 2,
 'document': 4,
 'attempts': 1,
 'automatically': 3,
 'produce': 1,
 'given': 2,
 'interested': 1,
 'generating': 1,
 'single': 1,
 'source': 2,
 'use': 1,
 'multiple': 1,
 'cluster': 1,
 'articles': 3,
 'topic': 2,
 'multi': 1,
 'related': 2,
 'application': 2,
 'summarizing': 1,
 'Imagine': 1,
 'system': 3,
 'pulls': 1,
 'web': 1,
 'concisely': 1,
 'represents': 1,
 'latest': 1,
 'Ima

In [11]:
len(word_frequencies)

103

In [12]:
max_frequency = max(word_frequencies.values())
max_frequency 

11

## 6. Normalized Frequencies

In [13]:
# Normalize: divide every count by the maximum frequency (11 here)
# so that the most frequent word gets a weight of 1.0
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word]/max_frequency
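
The same normalization can be written as a dict comprehension that returns a new dictionary instead of overwriting the counts in place. This is only a sketch: `normalize` is a hypothetical helper, and the sample counts are copied from the output above.

```python
def normalize(counts):
    """Divide every count by the largest count so the top word scores 1.0."""
    if not counts:
        return {}
    top = max(counts.values())
    return {word: n / top for word, n in counts.items()}

# Counts taken from the notebook output above.
sample = {"summarization": 11, "summary": 4, "generic": 3, "broadly": 1}
weights = normalize(sample)
print(weights["summarization"])        # 1.0
print(round(weights["summary"], 4))    # 0.3636
```

Keeping the raw counts separate also makes the cell safe to re-run, whereas running the in-place loop twice divides the values twice.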

In [14]:
word_frequencies

{'broadly': 0.09090909090909091,
 'types': 0.09090909090909091,
 'extractive': 0.09090909090909091,
 'summarization': 1.0,
 'tasks': 0.09090909090909091,
 'depending': 0.18181818181818182,
 'program': 0.09090909090909091,
 'focuses': 0.18181818181818182,
 'generic': 0.2727272727272727,
 'obtaining': 0.09090909090909091,
 'summary': 0.36363636363636365,
 'abstract': 0.18181818181818182,
 'collection': 0.2727272727272727,
 'documents': 0.18181818181818182,
 'sets': 0.09090909090909091,
 'images': 0.2727272727272727,
 'videos': 0.2727272727272727,
 'news': 0.36363636363636365,
 'stories': 0.09090909090909091,
 'etc': 0.09090909090909091,
 'second': 0.09090909090909091,
 'query': 0.36363636363636365,
 'relevant': 0.18181818181818182,
 'called': 0.18181818181818182,
 'based': 0.09090909090909091,
 'summarizes': 0.09090909090909091,
 'objects': 0.09090909090909091,
 'specific': 0.09090909090909091,
 'Summarization': 0.09090909090909091,
 'systems': 0.09090909090909091,
 'able': 0.09090909090

## 7. Sentence Tokenization & Scoring


- Split the document into sentences (`sentence_tokens`).  
- For each sentence, compute a score by summing the normalized frequencies of words that appear in the sentence.  
- This gives higher scores to sentences containing many high-frequency (important) words.

In [15]:
sentence_tokens = [sent for sent in doc.sents]
sentence_tokens

[There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on.,
 The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).,
 The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.,
 Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.,
 An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.,
 Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic).,
 This problem is called multi-document summarization.,
 A related applica

In [16]:
len(sentence_tokens)

15

## 8. Sentence Score

In [17]:
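# Score each sentence by summing the normalized frequencies of its words.
# The lookup uses the lowercased token, so a capitalized word is scored
# under its lowercase key if one exists and is skipped otherwise.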
sentence_scores = {}

for sent in sentence_tokens:
    for word in sent:
        if word.text.lower() in word_frequencies.keys():
            if sent not in sentence_scores.keys():
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent] += word_frequencies[word.text.lower()]
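
For comparison, here is a sketch of the same scoring idea over plain token lists. It also shows an optional length-normalized variant, because a plain sum tends to favor longer sentences. Both `score_sentence` and its `average` flag are hypothetical additions and not something the notebook uses.

```python
def score_sentence(tokens, weights, average=False):
    """Sum the weights of the tokens in one sentence.

    tokens  : list of token strings for a single sentence
    weights : dict mapping lowercased word -> normalized frequency
    average : if True, divide by the number of scored tokens (assumed variant)
    """
    hits = [weights[t.lower()] for t in tokens if t.lower() in weights]
    if not hits:
        return 0.0
    total = sum(hits)
    return total / len(hits) if average else total

# Illustrative weights and sentences, not taken from the notebook.
weights = {"summarization": 1.0, "query": 0.5, "generic": 0.25}
short = ["Generic", "summarization", "."]
long_ = ["Query", "based", "summarization", "of", "generic", "query", "logs", "."]

print(score_sentence(short, weights))                 # 1.25
print(score_sentence(long_, weights))                 # 2.25
print(score_sentence(short, weights, average=True))   # 0.625
print(score_sentence(long_, weights, average=True))   # 0.5625
```

With the plain sum the longer sentence wins; with averaging the shorter, denser sentence ranks higher. Which behavior is preferable depends on the documents being summarized.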

In [18]:
sentence_scores

{There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on.: 2.818181818181818,
 The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).: 3.9999999999999987,
 The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.: 3.909090909090909,
 Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.: 3.2727272727272716,
 An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.: 3.9999999999999996,
 Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of artic

## 9. Select Top Sentences

- Decide how many sentences to keep for the summary (a percentage of the document or a fixed number; this notebook keeps 40%).  
- Use `heapq.nlargest` to pick sentences with highest scores.  
- `nlargest` returns the sentences in descending score order, so restore their original document order afterwards if the summary needs to read coherently.

In [19]:
# Keep roughly the top 40% of sentences by score; nlargest does the selection
from heapq import nlargest 

In [20]:
select_length = int(len(sentence_tokens)*0.4)
select_length

6
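
The length calculation truncates toward zero, so `int(15 * 0.4)` gives 6 here, while a very short document could end up with 0 sentences. A hedged sketch of a helper that guards against that case follows; `summary_length` and its `minimum` parameter are hypothetical.

```python
def summary_length(n_sentences, ratio=0.4, minimum=1):
    """Number of sentences to keep: a ratio of the total, at least `minimum`,
    and never more than the number of sentences available."""
    wanted = int(n_sentences * ratio)
    return min(n_sentences, max(minimum, wanted))

print(summary_length(15))        # 6
print(summary_length(2, 0.3))    # 1
print(summary_length(0))         # 0
```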

## 10. Final Summary Output

- Combine selected sentences to create the final extractive summary.  
- This method preserves original wording and tends to be factual but may not be concise or fully coherent for long documents.

In [21]:
# Select the select_length (6) highest-scoring sentences
summary = nlargest(select_length,sentence_scores, key = sentence_scores.get)

In [22]:
summary

[An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.,
 The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).,
 The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.,
 Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.,
 Image collection summarization is another application example of automatic summarization.,
 Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.]

In [23]:
# Convert each selected sentence (a spaCy Span) to its plain text

final_summary = [sent.text for sent in summary]

In [24]:
final_summary

['An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.',
 'The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).',
 'The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.',
 'Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.\n',
 'Image collection summarization is another application example of automatic summarization.',
 'Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.\n']
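
`nlargest` returns the sentences from highest to lowest score, not in document order, and the list above is not yet joined into one string. One way to do both is sketched below on plain strings; `build_summary` is a hypothetical helper. With spaCy spans, sorting the selected sentences by `Span.start` should give the same ordering.

```python
from heapq import nlargest

def build_summary(sentences, scores, n):
    """Pick the n highest-scoring sentences and return them in document order."""
    # Rank sentence indices by score, then sort the winners by position.
    top = nlargest(n, range(len(sentences)), key=lambda i: scores[i])
    return " ".join(sentences[i].strip() for i in sorted(top))

# Illustrative data, not taken from the notebook.
sentences = ["First point.", "Minor aside.", "Key finding.\n", "Closing remark."]
scores = [2.0, 0.5, 3.5, 1.0]
print(build_summary(sentences, scores, 2))
# First point. Key finding.
```

Stripping each sentence removes the trailing `\n` that some of the selected sentences carry.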

In [25]:
print(summary)  # prints the selected sentences (spaCy Spans), ordered by score

[An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document., The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.)., The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query., Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.
, Image collection summarization is another application example of automatic summarization., Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.
]


---
## Conclusion

In this notebook, I implemented a **text summarization model using spaCy**.
The goal was to generate a **short extractive summary** by identifying the most important sentences based on **word frequency**.
By filtering out stopwords and punctuation, and scoring sentences using normalized word frequencies, I created a lightweight summarizer capable of extracting key information efficiently.

This technique is simple yet effective for smaller documents, offering a quick way to summarize without using deep learning models.


## Key Learning

* Learned the concept of **extractive summarization**, which selects important sentences from the original text.
* Used **spaCy** for tokenization, sentence segmentation, and text processing.
* Understood how to calculate **word frequencies** and normalize them for scoring.
* Learned how to **rank sentences** based on cumulative word importance scores.
* Realized that removing **stopwords and punctuation** improves summary quality.
* Identified the **limitations** of frequency-based summarization: it lacks semantic understanding.
* Understood that advanced models like **BART, T5, or Pegasus** can be used for **abstractive summarization** in future improvements.

---