# Text Summarization Using spaCy

In [1]:
pip install spacy

Note: you may need to restart the kernel to use updated packages.


In [6]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [7]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("data science and ai has great career ahead")

In [8]:
doc

data science and ai has great career ahead

In [9]:
for token in doc:
    print(token.text)

data
science
and
ai
has
great
career
ahead


In [10]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Apple Apple PROPN NNP nsubj Xxxxx True False
is be AUX VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. U.K. PROPN NNP nsubj X.X. False False
startup startup VERB VBD ccomp xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False


In [11]:
for token in doc:
    print(token.pos_)

PROPN
AUX
VERB
ADP
VERB
PROPN
VERB
ADP
SYM
NUM
NUM


In [12]:
for token in doc:
    print(token.text, token.pos_)

Apple PROPN
is AUX
looking VERB
at ADP
buying VERB
U.K. PROPN
startup VERB
for ADP
$ SYM
1 NUM
billion NUM


In [13]:
for token in doc:
    print(token.text, token.pos_, token.lemma_)

Apple PROPN Apple
is AUX be
looking VERB look
at ADP at
buying VERB buy
U.K. PROPN U.K.
startup VERB startup
for ADP for
$ SYM $
1 NUM 1
billion NUM billion


In [14]:
text = """There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.
An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. A related application is summarizing news articles. Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.
Image collection summarization is another application example of automatic summarization. It consists in selecting a representative set of images from a larger set of images.[4] A summary in this context is useful to show the most representative images of results in an image collection exploration system. Video summarization is a related domain, where the system automatically creates a trailer of a long video. This also has applications in consumer or personal videos, where one might want to skip the boring or repetitive actions. Similarly, in surveillance videos, one would want to extract important and suspicious activity, while ignoring all the boring and redundant frames captured """

In [15]:
text

'There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.\nAn example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. A related application is summa

In [16]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [17]:
stopwords = list(STOP_WORDS) 
stopwords

['moreover',
 'everyone',
 'many',
 'off',
 'most',
 'am',
 'you',
 'after',
 'made',
 'else',
 'yet',
 'became',
 'such',
 'himself',
 'myself',
 'then',
 '’ll',
 'via',
 'next',
 'now',
 '’s',
 'get',
 'which',
 'noone',
 'one',
 'beyond',
 'why',
 '’ve',
 'either',
 'hereafter',
 'already',
 'full',
 'against',
 'be',
 'well',
 'never',
 'hereupon',
 'otherwise',
 'without',
 'three',
 'whereas',
 'too',
 'wherein',
 'others',
 'much',
 'more',
 'do',
 'serious',
 'over',
 'not',
 'ourselves',
 '’d',
 'doing',
 'empty',
 'through',
 'thus',
 'forty',
 'nor',
 'anyone',
 "'s",
 'anyhow',
 'toward',
 'the',
 'become',
 'ever',
 'across',
 'in',
 'a',
 'everywhere',
 'with',
 'he',
 'among',
 'yourselves',
 'still',
 'itself',
 'besides',
 'did',
 'fifteen',
 'move',
 'indeed',
 'at',
 'her',
 'seemed',
 'neither',
 "'re",
 "'m",
 'no',
 'everything',
 'below',
 'twenty',
 'beforehand',
 'to',
 'very',
 "'d",
 'n‘t',
 'none',
 'that',
 'however',
 'down',
 'least',
 'go',
 'eight',
 'q

In [18]:
len(stopwords)

326

In [19]:
nlp = spacy.load('en_core_web_sm') 

In [21]:
doc = nlp(text)
doc

There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.
An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. A related application is summari

In [22]:
# Get the tokens from the text
tokens = [token.text for token in doc]
print(tokens)
# At this point the text has only been tokenized: no stop words or punctuation have been removed and no other cleaning has been applied

['There', 'are', 'broadly', 'two', 'types', 'of', 'extractive', 'summarization', 'tasks', 'depending', 'on', 'what', 'the', 'summarization', 'program', 'focuses', 'on', '.', 'The', 'first', 'is', 'generic', 'summarization', ',', 'which', 'focuses', 'on', 'obtaining', 'a', 'generic', 'summary', 'or', 'abstract', 'of', 'the', 'collection', '(', 'whether', 'documents', ',', 'or', 'sets', 'of', 'images', ',', 'or', 'videos', ',', 'news', 'stories', 'etc', '.', ')', '.', 'The', 'second', 'is', 'query', 'relevant', 'summarization', ',', 'sometimes', 'called', 'query', '-', 'based', 'summarization', ',', 'which', 'summarizes', 'objects', 'specific', 'to', 'a', 'query', '.', 'Summarization', 'systems', 'are', 'able', 'to', 'create', 'both', 'query', 'relevant', 'text', 'summaries', 'and', 'generic', 'machine', '-', 'generated', 'summaries', 'depending', 'on', 'what', 'the', 'user', 'needs', '.', '\n', 'An', 'example', 'of', 'a', 'summarization', 'problem', 'is', 'document', 'summarization', ',

In [23]:
tokens

['There',
 'are',
 'broadly',
 'two',
 'types',
 'of',
 'extractive',
 'summarization',
 'tasks',
 'depending',
 'on',
 'what',
 'the',
 'summarization',
 'program',
 'focuses',
 'on',
 '.',
 'The',
 'first',
 'is',
 'generic',
 'summarization',
 ',',
 'which',
 'focuses',
 'on',
 'obtaining',
 'a',
 'generic',
 'summary',
 'or',
 'abstract',
 'of',
 'the',
 'collection',
 '(',
 'whether',
 'documents',
 ',',
 'or',
 'sets',
 'of',
 'images',
 ',',
 'or',
 'videos',
 ',',
 'news',
 'stories',
 'etc',
 '.',
 ')',
 '.',
 'The',
 'second',
 'is',
 'query',
 'relevant',
 'summarization',
 ',',
 'sometimes',
 'called',
 'query',
 '-',
 'based',
 'summarization',
 ',',
 'which',
 'summarizes',
 'objects',
 'specific',
 'to',
 'a',
 'query',
 '.',
 'Summarization',
 'systems',
 'are',
 'able',
 'to',
 'create',
 'both',
 'query',
 'relevant',
 'text',
 'summaries',
 'and',
 'generic',
 'machine',
 '-',
 'generated',
 'summaries',
 'depending',
 'on',
 'what',
 'the',
 'user',
 'needs',
 '.',


In [24]:
len(tokens)

322

In [25]:
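# Count how many times each word occurs in the text, skipping stop words and punctuation.
# Note: counts are keyed by the original-case token text, so 'Summarization' and
# 'summarization' are counted separately, and whitespace tokens such as '\n' are kept.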

word_frequencies = {}

for word in doc:
    if word.text.lower() not in stopwords:
        if word.text.lower() not in punctuation:
            if word.text not in word_frequencies.keys():
                word_frequencies[word.text] = 1
            else:
                word_frequencies[word.text] += 1
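The loop above keys its counts by the original token text, so `'summarization'` and `'Summarization'` end up as separate entries and the `'\n'` token is counted as a word. A minimal sketch of a case-normalized variant follows. To keep it runnable without a downloaded model, it assumes a plain regex tokenizer in place of spaCy and a tiny hypothetical stop-word set (`STOP_WORDS_DEMO`); neither is part of the notebook.

```python
import re
from collections import Counter
from string import punctuation

# Hypothetical, deliberately tiny stop-word set; the notebook uses spaCy's STOP_WORDS.
STOP_WORDS_DEMO = {"the", "is", "a", "of", "and", "on", "what"}


def count_words(text, stop_words=STOP_WORDS_DEMO):
    """Count lowercased words, skipping stop words, punctuation and whitespace."""
    # Assumption: a regex split stands in for spaCy's tokenizer.
    # Lowercasing first means 'Summarization' and 'summarization' share one key.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return Counter(
        tok for tok in tokens
        if tok not in stop_words and tok not in punctuation
    )


freqs = count_words("Summarization is useful. The summarization of news is a task.")
print(freqs.most_common(2))
```

With spaCy tokens, the same idea would be to use `word.text.lower()` as the dictionary key rather than `word.text`.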

In [26]:
word_frequencies

{'broadly': 1,
 'types': 1,
 'extractive': 1,
 'summarization': 11,
 'tasks': 1,
 'depending': 2,
 'program': 1,
 'focuses': 2,
 'generic': 3,
 'obtaining': 1,
 'summary': 4,
 'abstract': 2,
 'collection': 3,
 'documents': 2,
 'sets': 1,
 'images': 3,
 'videos': 3,
 'news': 4,
 'stories': 1,
 'etc': 1,
 'second': 1,
 'query': 4,
 'relevant': 2,
 'called': 2,
 'based': 1,
 'summarizes': 1,
 'objects': 1,
 'specific': 1,
 'Summarization': 1,
 'systems': 1,
 'able': 1,
 'create': 1,
 'text': 1,
 'summaries': 2,
 'machine': 1,
 'generated': 1,
 'user': 1,
 'needs': 1,
 '\n': 2,
 'example': 3,
 'problem': 2,
 'document': 4,
 'attempts': 1,
 'automatically': 3,
 'produce': 1,
 'given': 2,
 'interested': 1,
 'generating': 1,
 'single': 1,
 'source': 2,
 'use': 1,
 'multiple': 1,
 'cluster': 1,
 'articles': 3,
 'topic': 2,
 'multi': 1,
 'related': 2,
 'application': 2,
 'summarizing': 1,
 'Imagine': 1,
 'system': 3,
 'pulls': 1,
 'web': 1,
 'concisely': 1,
 'represents': 1,
 'latest': 1,
 'Ima

In [27]:
len(word_frequencies)

103

In [28]:
max_frequency = max(word_frequencies.values())
max_frequency 

11

In [29]:
# To get normalized (weighted) frequencies, divide every count by the maximum frequency (11 here)
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word]/max_frequency
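Because the cell above updates `word_frequencies` in place, running it a second time would divide the already normalized values by 11 again. One possible alternative, sketched below, returns a new dictionary and leaves the raw counts untouched. The helper name `normalize` is hypothetical; the sample counts are taken from the output shown earlier.

```python
def normalize(frequencies):
    """Scale raw counts into (0, 1] by dividing each by the largest count."""
    if not frequencies:
        return {}
    max_frequency = max(frequencies.values())
    # Build a new dict so the raw counts are not modified.
    return {word: count / max_frequency for word, count in frequencies.items()}


raw = {"summarization": 11, "summary": 4, "generic": 3, "tasks": 1}
weighted = normalize(raw)
print(weighted["summarization"], weighted["summary"])
```

The most frequent word always gets weight 1.0, and every other word gets its count as a fraction of that maximum.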

In [30]:
# print(word_frequencies)
word_frequencies
# these are the normalized frequencies of each word

{'broadly': 0.09090909090909091,
 'types': 0.09090909090909091,
 'extractive': 0.09090909090909091,
 'summarization': 1.0,
 'tasks': 0.09090909090909091,
 'depending': 0.18181818181818182,
 'program': 0.09090909090909091,
 'focuses': 0.18181818181818182,
 'generic': 0.2727272727272727,
 'obtaining': 0.09090909090909091,
 'summary': 0.36363636363636365,
 'abstract': 0.18181818181818182,
 'collection': 0.2727272727272727,
 'documents': 0.18181818181818182,
 'sets': 0.09090909090909091,
 'images': 0.2727272727272727,
 'videos': 0.2727272727272727,
 'news': 0.36363636363636365,
 'stories': 0.09090909090909091,
 'etc': 0.09090909090909091,
 'second': 0.09090909090909091,
 'query': 0.36363636363636365,
 'relevant': 0.18181818181818182,
 'called': 0.18181818181818182,
 'based': 0.09090909090909091,
 'summarizes': 0.09090909090909091,
 'objects': 0.09090909090909091,
 'specific': 0.09090909090909091,
 'Summarization': 0.09090909090909091,
 'systems': 0.09090909090909091,
 'able': 0.09090909090

In [31]:
sentence_tokens = list(doc.sents)
sentence_tokens

[There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on.,
 The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).,
 The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.,
 Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.,
 An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.,
 Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic).,
 This problem is called multi-document summarization.,
 A related applica

In [33]:
len(sentence_tokens)

15

In [34]:
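# Score each sentence by summing the normalized frequencies of the words it contains.
# Note: lookups use the lowercased token, so words stored only in capitalized form (e.g. 'Imagine') contribute nothing.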
sentence_scores = {}

for sent in sentence_tokens:
    for word in sent:
        if word.text.lower() in word_frequencies.keys():
            if sent not in sentence_scores.keys():
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent] += word_frequencies[word.text.lower()]
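The scoring step can also be written with `dict.get`, which avoids the nested membership checks. The sketch below is a simplified stand-in: it assumes sentences are plain strings split into words with a regex rather than spaCy `Span` objects, and the function name `score_sentences` is hypothetical.

```python
import re


def score_sentences(sentences, word_weights):
    """Return {sentence: sum of the weights of the words it contains}."""
    scores = {}
    for sentence in sentences:
        # Assumption: a regex word split stands in for spaCy tokens.
        words = re.findall(r"\w+", sentence.lower())
        score = sum(word_weights.get(word, 0.0) for word in words)
        # Like the loop above, only sentences with at least one weighted word are kept.
        if score > 0:
            scores[sentence] = score
    return scores


weights = {"summarization": 1.0, "query": 4 / 11, "generic": 3 / 11}
sentences = [
    "The first is generic summarization.",
    "The second is query relevant summarization.",
    "This has nothing relevant.",
]
scores = score_sentences(sentences, weights)
print(scores)
```

Summing weights tends to favor long sentences, since each extra weighted word adds to the score. Dividing the sum by the number of words is a common variant if shorter sentences should be able to compete.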

In [35]:
sentence_scores

{There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on.: 2.818181818181818,
 The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).: 3.9999999999999987,
 The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.: 3.909090909090909,
 Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.: 3.2727272727272716,
 An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.: 3.9999999999999996,
 Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of artic

In [36]:
# Suppose we want to keep the 40% of sentences with the highest scores
from heapq import nlargest 

In [37]:
select_length = int(len(sentence_tokens)*0.4)
select_length

6

In [40]:
# Select the select_length (here 6) highest-scoring sentences
summary = nlargest(select_length,sentence_scores, key = sentence_scores.get)
summary

[An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.,
 The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).,
 The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.,
 Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.,
 Image collection summarization is another application example of automatic summarization.,
 Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.]
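As a small illustration of how `nlargest` picks keys by their dictionary values, here is a sketch on toy data. The helper `top_sentences`, its `ratio` parameter, and the sample scores are hypothetical. Unlike the notebook, it derives the count from the score dictionary itself and keeps at least one sentence, which guards against very short inputs where `int(n * ratio)` would be 0.

```python
from heapq import nlargest


def top_sentences(sentence_scores, ratio=0.4):
    """Return the highest-scoring keys, best first."""
    # Keep at least one sentence even when int() rounds the count down to 0.
    count = max(1, int(len(sentence_scores) * ratio))
    # Iterating a dict yields its keys; key= ranks them by their score.
    return nlargest(count, sentence_scores, key=sentence_scores.get)


demo_scores = {"s1": 2.8, "s2": 4.0, "s3": 3.9, "s4": 3.3, "s5": 1.2}
best = top_sentences(demo_scores)
print(best)
```

The result is ordered by score, not by position in the text, which is why the selected sentences above do not appear in their original order.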

In [41]:
# Extract the text of each selected sentence (join these strings to get a single summary)

final_summary = [sent.text for sent in summary]
final_summary

['An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.',
 'The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).',
 'The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.',
 'Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.\n',
 'Image collection summarization is another application example of automatic summarization.',
 'Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.\n']
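The list above is still in score order, and some entries keep a trailing `'\n'`. One way to turn it into a readable paragraph is to restore the source order, strip whitespace, and join with spaces. The sketch below assumes sentences are plain strings; `build_summary` and the sample data are hypothetical.

```python
def build_summary(all_sentences, selected):
    """Join the selected sentences in the order they appear in the source."""
    chosen = set(selected)
    # Walk the source order so the summary reads like the original text.
    ordered = [s.strip() for s in all_sentences if s in chosen]
    return " ".join(ordered)


all_sentences = ["First point.", "Second point.\n", "Third point.", "Fourth point."]
selected = ["Third point.", "First point."]  # score order, as nlargest returns it
summary_text = build_summary(all_sentences, selected)
print(summary_text)
```

If score order is preferred instead, `" ".join(s.strip() for s in final_summary)` would combine the list above as it stands.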