## Import Needed Modules

In [1]:
import spacy
from heapq import nlargest  # This module provides an implementation of the heap queue algorithm, also known as the priority queue algorithm.

## Data Acquisition
The sample text is taken from the Wikipedia article on automatic summarization.
- **Link**: https://en.wikipedia.org/wiki/Automatic_summarization#:~:text=There%20are%20broadly,redundant%20frames%20captured.

In [2]:
text = """
There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.
An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. A related application is summarizing news articles. Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.
Image collection summarization is another application example of automatic summarization. It consists in selecting a representative set of images from a larger set of images.[13] A summary in this context is useful to show the most representative images of results in an image collection exploration system. Video summarization is a related domain, where the system automatically creates a trailer of a long video. This also has applications in consumer or personal videos, where one might want to skip the boring or repetitive actions. Similarly, in surveillance videos, one would want to extract important and suspicious activity, while ignoring all the boring and redundant frames captured.
"""

#### Load the large English model from **spaCy**

In [3]:
nlp = spacy.load('en_core_web_lg')

In [4]:
# Tokenization
doc = nlp(text)

#### Compute word frequencies

In [5]:
word_frequencies = {}
for token in doc:
    # Skip stop words, punctuation, and newline tokens
    if token.is_stop or token.is_punct or str(token) == '\n':
        continue

    # First occurrence of a word: create its entry. Keys are case-sensitive,
    # so 'Summarization' and 'summarization' are counted separately.
    if token.text not in word_frequencies:
        word_frequencies[token.text] = 1
    else:
        word_frequencies[token.text] += 1
        

In [6]:
# The resulting word frequencies
print(word_frequencies)

{'broadly': 1, 'types': 1, 'extractive': 1, 'summarization': 11, 'tasks': 1, 'depending': 2, 'program': 1, 'focuses': 2, 'generic': 3, 'obtaining': 1, 'summary': 4, 'abstract': 2, 'collection': 3, 'documents': 2, 'sets': 1, 'images': 3, 'videos': 3, 'news': 4, 'stories': 1, 'etc': 1, 'second': 1, 'query': 4, 'relevant': 2, 'called': 2, 'based': 1, 'summarizes': 1, 'objects': 1, 'specific': 1, 'Summarization': 1, 'systems': 1, 'able': 1, 'create': 1, 'text': 1, 'summaries': 2, 'machine': 1, 'generated': 1, 'user': 1, 'needs': 1, 'example': 3, 'problem': 2, 'document': 4, 'attempts': 1, 'automatically': 3, 'produce': 1, 'given': 2, 'interested': 1, 'generating': 1, 'single': 1, 'source': 2, 'use': 1, 'multiple': 1, 'cluster': 1, 'articles': 3, 'topic': 2, 'multi': 1, 'related': 2, 'application': 2, 'summarizing': 1, 'Imagine': 1, 'system': 3, 'pulls': 1, 'web': 1, 'concisely': 1, 'represents': 1, 'latest': 1, 'Image': 1, 'automatic': 1, 'consists': 1, 'selecting': 1, 'representative': 2,
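
The counting step above can also be sketched without spaCy, using `collections.Counter`. This is a minimal illustration, not the notebook's method: the `STOP_WORDS` set and the `count_words` helper are hypothetical, and the words are lowercased before counting so that capitalized and lowercase forms share one entry.

```python
from collections import Counter
import string

# Hypothetical, deliberately tiny stop-word list; spaCy's own list is much larger.
STOP_WORDS = {"a", "an", "the", "is", "of", "on", "to", "and", "or", "which"}


def count_words(words):
    """Count lowercased words, skipping stop words and bare punctuation."""
    counts = Counter()
    for word in words:
        key = word.lower()
        # Skip stop words and tokens made up only of punctuation characters
        if key in STOP_WORDS or all(ch in string.punctuation for ch in key):
            continue
        counts[key] += 1
    return counts


sample = "Summarization of a document is summarization of the text .".split()
frequencies = count_words(sample)
print(frequencies)
```

Lowercasing at counting time keeps the keys consistent with any later lookup that also lowercases the word.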

In [7]:
# Get the count of the most frequent word
max_frequency = max(word_frequencies.values())
max_frequency 

11

#### Normalize the frequencies by the maximum frequency

In [8]:
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word] / max_frequency

In [9]:
# Word frequencies after normalization
print(word_frequencies)

{'broadly': 0.09090909090909091, 'types': 0.09090909090909091, 'extractive': 0.09090909090909091, 'summarization': 1.0, 'tasks': 0.09090909090909091, 'depending': 0.18181818181818182, 'program': 0.09090909090909091, 'focuses': 0.18181818181818182, 'generic': 0.2727272727272727, 'obtaining': 0.09090909090909091, 'summary': 0.36363636363636365, 'abstract': 0.18181818181818182, 'collection': 0.2727272727272727, 'documents': 0.18181818181818182, 'sets': 0.09090909090909091, 'images': 0.2727272727272727, 'videos': 0.2727272727272727, 'news': 0.36363636363636365, 'stories': 0.09090909090909091, 'etc': 0.09090909090909091, 'second': 0.09090909090909091, 'query': 0.36363636363636365, 'relevant': 0.18181818181818182, 'called': 0.18181818181818182, 'based': 0.09090909090909091, 'summarizes': 0.09090909090909091, 'objects': 0.09090909090909091, 'specific': 0.09090909090909091, 'Summarization': 0.09090909090909091, 'systems': 0.09090909090909091, 'able': 0.09090909090909091, 'create': 0.0909090909
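
As a small sketch of the normalization, the same division can be written as a dict comprehension that returns a new dict instead of overwriting the counts in place. The helper name `normalize_by_max` is hypothetical; the four counts are copied from the output above.

```python
def normalize_by_max(counts):
    """Return a new dict with every count divided by the largest count."""
    if not counts:
        return {}
    peak = max(counts.values())
    return {word: count / peak for word, count in counts.items()}


# A few raw counts taken from the word-frequency output above
raw_counts = {"summarization": 11, "summary": 4, "generic": 3, "broadly": 1}
weights = normalize_by_max(raw_counts)
print(weights)
```

After this step the most frequent word has weight 1.0 and every other word falls between 0 and 1.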

### Sentence Tokenization

In [10]:
# Store the sentences (spaCy Span objects) in a list
sentence_tokens = list(doc.sents)

In [11]:
for sent in sentence_tokens:
    print(sent)





There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on.

The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).

The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.

Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.



An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.

Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic).

This problem is called multi-document summarization.

A related applicati

#### Compute sentence scores

In [12]:
sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        # Check whether the lowercased word has a frequency entry
        if word.text.lower() in word_frequencies:

            # First scored word of this sentence: create its entry
            if sent not in sentence_scores:
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent] += word_frequencies[word.text.lower()]
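
The scoring rule above (a sentence's score is the sum of the normalized frequencies of its words) can be sketched in plain Python as follows. The `score_sentence` helper, the weights, and the sample sentences are hypothetical and chosen so the arithmetic is easy to check by hand. Unlike the loop above, this sketch gives a sentence with no known words a score of 0 instead of leaving it out of the result.

```python
def score_sentence(words, weights):
    """Sum the weights of the words in one sentence; unknown words add 0."""
    return sum(weights.get(word.lower(), 0.0) for word in words)


# Hypothetical weights, not the values computed in this notebook
weights = {"summarization": 1.0, "query": 0.5, "document": 0.5}

sentences = [
    ["Summarization", "of", "a", "document"],
    ["A", "query", "about", "a", "document"],
    ["Nothing", "relevant", "here"],
]
scores = [score_sentence(words, weights) for words in sentences]
print(scores)
```

Because the score is a plain sum, longer sentences tend to score higher; dividing by the sentence length would be one possible way to offset that.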

In [13]:
for k, v in sentence_scores.items():
    print(f'Sentence: {k} || Score: {v}')

Sentence: There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. || Score: 2.818181818181818

Sentence: The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). || Score: 3.9999999999999987

Sentence: The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. || Score: 3.909090909090909

Sentence: Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.

 || Score: 3.09090909090909

Sentence: An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. || Score: 3.9999999999999996

Sentence: Sometimes one might be interested in generating a

#### Reduce the text to a fraction (`factor`) of its sentences

In [14]:
factor = 0.3
select_length = int(len(sentence_tokens) * factor)
select_length 

4
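
For very short inputs, `int(len(sentence_tokens) * factor)` can truncate to 0, which would produce an empty summary. A minimal sketch of one possible guard is shown below; the `summary_length` helper is hypothetical and not part of the notebook.

```python
def summary_length(n_sentences, factor):
    """Number of sentences to keep: at least 1, never more than available."""
    return min(n_sentences, max(1, int(n_sentences * factor)))


print(summary_length(14, 0.3))  # int(4.2) truncates to 4
print(summary_length(2, 0.3))   # int(0.6) would be 0, so the guard returns 1
```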

In [15]:
# nlargest returns the n largest elements; iterating the dict yields its keys
# (the sentences), and key=sentence_scores.get ranks them by score
lg_sents = nlargest(select_length, sentence_scores, key=sentence_scores.get)
lg_sents

[An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.,
 The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).,
 The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.,
 Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.]
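
To make the selection step easier to follow, here is a small stand-alone sketch of `heapq.nlargest` applied to a dict. The labels and scores are hypothetical placeholders for the sentences and their scores.

```python
from heapq import nlargest

# Hypothetical scores keyed by a short label instead of a spaCy Span
scores = {"first": 4.0, "second": 3.9, "third": 2.8, "fourth": 3.1}

# Iterating a dict yields its keys; key=scores.get ranks each key by its value
top_two = nlargest(2, scores, key=scores.get)
print(top_two)
```

The result is ordered from highest to lowest score, not by position in the dict.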

In [16]:
# lg_sents is a list of spaCy Span objects
for sent in lg_sents:
    print(type(sent))

<class 'spacy.tokens.span.Span'>

<class 'spacy.tokens.span.Span'>

<class 'spacy.tokens.span.Span'>

<class 'spacy.tokens.span.Span'>


In [17]:
# Convert the Span objects to plain strings
summary_sents = [sent.text for sent in lg_sents]
summary_sents

['An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.',
 'The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).',
 'The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.',
 'Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.\n']

### The final summary

In [18]:
summary = ' '.join(summary_sents)

In [19]:
print(summary)

An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.
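
Note that the summary above lists sentences in descending score order, so the sentence that opens the summary is not the one that opens the source text. If the original reading order is preferred, one option is to select the top sentences by index and sort the indices before joining. The sketch below uses hypothetical sentences and scores; with spaCy `Span` objects, `sent.start` could serve as the sort key instead of a list index.

```python
from heapq import nlargest

# Hypothetical sentences and scores, listed in document order
sentences = [
    "Alpha opens the text.",
    "Beta adds detail.",
    "Gamma is the key point.",
    "Delta closes.",
]
scores = [2.0, 0.5, 3.0, 1.0]

# Pick the indices of the two highest-scoring sentences
top_indices = nlargest(2, range(len(sentences)), key=scores.__getitem__)

# Sort the indices so the summary follows the original order
ordered_summary = " ".join(sentences[i] for i in sorted(top_indices))
print(ordered_summary)
```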


