###### text

In [37]:
text = """
Text summarization refers to the technique of shortening long pieces of text. The intention is to create a coherent and fluent summary having only the main points outlined in the document.
Automatic text summarization is a common problem in machine learning and natural language processing (NLP).
Skyhoshi, who is a U.S.-based machine learning expert with 13 years of experience and currently teaches people his skills, says that “the technique has proved to be critical in quickly and accurately summarizing voluminous texts, something which could be expensive and time consuming if done without machines.”
Machine learning models are usually trained to understand documents and distill the useful information before outputting the required summarized texts.
What’s the need for text summarization?
Propelled by the modern technological innovations, data is to this century what oil was to the previous one. Today, our world is parachuted by the gathering and dissemination of huge amounts of data.
In fact, the International Data Corporation (IDC) projects that the total amount of digital data circulating annually around the world would sprout from 4.4 zettabytes in 2013 to hit 180 zettabytes in 2025. That’s a lot of data!
With such a big amount of data circulating in the digital space, there is need to develop machine learning algorithms that can automatically shorten longer texts and deliver accurate summaries that can fluently pass the intended messages.
Furthermore, applying text summarization reduces reading time, accelerates the process of researching for information, and increases the amount of information that can fit in an area.
What are the main approaches to automatic summarization?
There are two main types of how to summarize text in NLP:
Extraction-based summarization
The extractive text summarization technique involves pulling keyphrases from the source document and combining them to make a summary. The extraction is made according to the defined metric without making any changes to the texts.
Here is an example:
Source text: Joseph and Mary rode on a donkey to attend the annual event in Jerusalem. In the city, Mary gave birth to a child named Jesus.
Extractive summary: Joseph and Mary attend event Jerusalem. Mary birth Jesus.
As you can see above, the words in bold have been extracted and joined to create a summary — although sometimes the summary can be grammatically strange.
Abstraction-based summarization
The abstraction technique entails paraphrasing and shortening parts of the source document. When abstraction is applied for text summarization in deep learning problems, it can overcome the grammar inconsistencies of the extractive method.
The abstractive text summarization algorithms create new phrases and sentences that relay the most useful information from the original text — just like humans do.
Therefore, abstraction performs better than extraction. However, the text summarization algorithms required to do abstraction are more difficult to develop; that’s why the use of extraction is still popular.
Here is an example:
Abstractive summary: Joseph and Mary came to Jerusalem where Jesus was born.
How does a text summarization algorithm work?
Usually, text summarization in NLP is treated as a supervised machine learning problem (where future outcomes are predicted based on provided data).
Typically, here is how using the extraction-based approach to summarize texts can work:
1. Introduce a method to extract the merited keyphrases from the source document. For example, you can use part-of-speech tagging, words sequences, or other linguistic patterns to identify the keyphrases.
2. Gather text documents with positively-labeled keyphrases. The keyphrases should be compatible to the stipulated extraction technique. To increase accuracy, you can also create negatively-labeled keyphrases.
3. Train a binary machine learning classifier to make the text summarization. Some of the features you can use include:
Length of the keyphrase
Frequency of the keyphrase
The most recurring word in the keyphrase
Number of characters in the keyphrase
4. Finally, in the test phrase, create all the keyphrase words and sentences and carry out classification for them.

"""

### loading the libraries

In [38]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [1]:
stopwords = list(STOP_WORDS) # creating the list of stopwords


In [40]:
nlp = spacy.load('en_core_web_sm')

In [41]:
doc = nlp(text)  # run the spaCy pipeline on the text to get a processed Doc


In [42]:
doc


Text summarization refers to the technique of shortening long pieces of text. The intention is to create a coherent and fluent summary having only the main points outlined in the document.
Automatic text summarization is a common problem in machine learning and natural language processing (NLP).
Skyhoshi, who is a U.S.-based machine learning expert with 13 years of experience and currently teaches people his skills, says that “the technique has proved to be critical in quickly and accurately summarizing voluminous texts, something which could be expensive and time consuming if done without machines.”
Machine learning models are usually trained to understand documents and distill the useful information before outputting the required summarized texts.
What’s the need for text summarization?
Propelled by the modern technological innovations, data is to this century what oil was to the previous one. Today, our world is parachuted by the gathering and dissemination of huge amounts of data.

### converting the text into tokens

In [43]:
tokens = [token.text for token in doc]
print(tokens)

['\n', 'Text', 'summarization', 'refers', 'to', 'the', 'technique', 'of', 'shortening', 'long', 'pieces', 'of', 'text', '.', 'The', 'intention', 'is', 'to', 'create', 'a', 'coherent', 'and', 'fluent', 'summary', 'having', 'only', 'the', 'main', 'points', 'outlined', 'in', 'the', 'document', '.', '\n', 'Automatic', 'text', 'summarization', 'is', 'a', 'common', 'problem', 'in', 'machine', 'learning', 'and', 'natural', 'language', 'processing', '(', 'NLP', ')', '.', '\n', 'Skyhoshi', ',', 'who', 'is', 'a', 'U.S.-based', 'machine', 'learning', 'expert', 'with', '13', 'years', 'of', 'experience', 'and', 'currently', 'teaches', 'people', 'his', 'skills', ',', 'says', 'that', '“', 'the', 'technique', 'has', 'proved', 'to', 'be', 'critical', 'in', 'quickly', 'and', 'accurately', 'summarizing', 'voluminous', 'texts', ',', 'something', 'which', 'could', 'be', 'expensive', 'and', 'time', 'consuming', 'if', 'done', 'without', 'machines', '.', '”', '\n', 'Machine', 'learning', 'models', 'are', 'usu

### adding the newline character to the punctuation string

In [44]:
punctuation = punctuation + '\n'


In [45]:
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'

###### counting word frequencies (to find the most frequent word)

In [69]:
word_freq = {}
for word in doc:
    token = word.text.lower()
    # skip stop words and anything found in the punctuation string (incl. '\n')
    if token in stopwords or token in punctuation:
        continue
    # count the lower-cased token
    word_freq[token] = word_freq.get(token, 0) + 1

In [70]:
print(word_freq)

{'text': 16, 'summarization': 14, 'refers': 1, 'technique': 5, 'shortening': 2, 'long': 1, 'pieces': 1, 'intention': 1, 'create': 5, 'coherent': 1, 'fluent': 1, 'summary': 6, 'having': 1, 'main': 3, 'points': 1, 'outlined': 1, 'document': 4, 'automatic': 2, 'common': 1, 'problem': 2, 'machine': 6, 'learning': 7, 'natural': 1, 'language': 1, 'processing': 1, 'nlp': 3, 'skyhoshi': 1, 'u.s.-based': 1, 'expert': 1, '13': 1, 'years': 1, 'experience': 1, 'currently': 1, 'teaches': 1, 'people': 1, 'skills': 1, 'says': 1, '“': 1, 'proved': 1, 'critical': 1, 'quickly': 1, 'accurately': 1, 'summarizing': 1, 'voluminous': 1, 'texts': 5, 'expensive': 1, 'time': 2, 'consuming': 1, 'machines': 1, '”': 1, 'models': 1, 'usually': 2, 'trained': 1, 'understand': 1, 'documents': 2, 'distill': 1, 'useful': 2, 'information': 4, 'outputting': 1, 'required': 2, 'summarized': 1, 'need': 2, 'propelled': 1, 'modern': 1, 'technological': 1, 'innovations': 1, 'data': 7, 'century': 1, 'oil': 1, 'previous': 1, 'tod
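The same counting step can be sketched without spaCy using `collections.Counter`. Everything named `demo_*` below is a hypothetical stand-in for the notebook's `doc` and `stopwords`, not the real objects. The sketch also checks membership against a set of punctuation characters rather than doing a substring test on the punctuation string. Note that `string.punctuation` only covers ASCII characters, which is why the curly quotes `“` and `”` still show up in the `word_freq` output above.

```python
from collections import Counter
from string import punctuation

# Hypothetical stand-ins: a tiny stop-word set and a pre-tokenized sentence
# instead of spaCy's STOP_WORDS and the notebook's `doc`.
demo_stopwords = {"the", "is", "a", "of", "to"}
demo_tokens = ["Text", "summarization", "is", "the", "shortening",
               "of", "text", ".", "\n"]

# Characters to skip: ASCII punctuation plus the newline token.
skip = set(punctuation) | {"\n"}

# Count every lower-cased token that is neither a stop word nor punctuation.
demo_freq = Counter(
    tok.lower()
    for tok in demo_tokens
    if tok.lower() not in demo_stopwords and tok.lower() not in skip
)
print(demo_freq)
```

To drop the curly quotes as well, one option (an assumption, not something the notebook does) would be to add them to `skip`.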

### finding the max frequency

In [71]:
max_freq = max(word_freq.values())
max_freq

16

In [72]:
key_list = list(word_freq.keys())
value_list = list(word_freq.values())

### extracting the max frequency word

In [73]:
pos = value_list.index(max_freq)
print(key_list[pos], ':', max_freq)

text : 16
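A shorter way to get the same result, as a minimal sketch: `max` accepts a `key` function, so the separate key and value lists are not strictly needed. The `demo_freq` dictionary below is a hypothetical subset of the counts printed earlier.

```python
# Hypothetical subset of the word counts shown in the notebook output.
demo_freq = {"text": 16, "summarization": 14, "learning": 7, "data": 7}

# max() iterates over the keys and compares them by their counts.
top_word = max(demo_freq, key=demo_freq.get)
print(top_word, ':', demo_freq[top_word])  # prints: text : 16
```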


### normalizing the word frequencies by the maximum frequency (similar in spirit to MinMaxScaler)

In [74]:
for word in word_freq.keys():
    word_freq[word] = word_freq[word]/max_freq

In [75]:
print(word_freq)

{'text': 1.0, 'summarization': 0.875, 'refers': 0.0625, 'technique': 0.3125, 'shortening': 0.125, 'long': 0.0625, 'pieces': 0.0625, 'intention': 0.0625, 'create': 0.3125, 'coherent': 0.0625, 'fluent': 0.0625, 'summary': 0.375, 'having': 0.0625, 'main': 0.1875, 'points': 0.0625, 'outlined': 0.0625, 'document': 0.25, 'automatic': 0.125, 'common': 0.0625, 'problem': 0.125, 'machine': 0.375, 'learning': 0.4375, 'natural': 0.0625, 'language': 0.0625, 'processing': 0.0625, 'nlp': 0.1875, 'skyhoshi': 0.0625, 'u.s.-based': 0.0625, 'expert': 0.0625, '13': 0.0625, 'years': 0.0625, 'experience': 0.0625, 'currently': 0.0625, 'teaches': 0.0625, 'people': 0.0625, 'skills': 0.0625, 'says': 0.0625, '“': 0.0625, 'proved': 0.0625, 'critical': 0.0625, 'quickly': 0.0625, 'accurately': 0.0625, 'summarizing': 0.0625, 'voluminous': 0.0625, 'texts': 0.3125, 'expensive': 0.0625, 'time': 0.125, 'consuming': 0.0625, 'machines': 0.0625, '”': 0.0625, 'models': 0.0625, 'usually': 0.125, 'trained': 0.0625, 'understa
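Dividing by the maximum is not quite the same as min-max scaling: the most frequent word maps to 1.0 in both cases, but only min-max scaling maps the least frequent word to 0. The sketch below compares the two on a hypothetical three-word subset of the counts above; the names `counts`, `by_max`, and `min_max` are illustrative only.

```python
# Hypothetical subset of the raw counts from the notebook.
counts = {"text": 16, "summarization": 14, "refers": 1}

max_c = max(counts.values())
min_c = min(counts.values())

# What the notebook does: divide every count by the maximum count.
by_max = {w: c / max_c for w, c in counts.items()}

# True min-max scaling, shown only for comparison.
min_max = {w: (c - min_c) / (max_c - min_c) for w, c in counts.items()}

print(by_max)
print(min_max)
```

With division by the maximum, rare words keep a small positive weight (here 0.0625), so they still contribute to sentence scores. With min-max scaling, every word that occurs exactly `min_c` times would contribute nothing.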

### converting the text into sentences

In [76]:
sents = list(doc.sents)
sents

[
 Text summarization refers to the technique of shortening long pieces of text.,
 The intention is to create a coherent and fluent summary having only the main points outlined in the document.,
 
 Automatic text summarization is a common problem in machine learning and natural language processing (NLP).,
 ,
 Skyhoshi, who is a U.S.-based machine learning expert with 13 years of experience and currently teaches people his skills, says that “the technique has proved to be critical in quickly and accurately summarizing voluminous texts, something which could be expensive and time consuming if done without machines.”,
 
 Machine learning models are usually trained to understand documents and distill the useful information before outputting the required summarized texts.,
 ,
 What’s the need for text summarization?,
 Propelled by the modern technological innovations, data is to this century what oil was to the previous one.,
 Today, our world is parachuted by the gathering and disseminatio

### scoring each sentence by summing the normalized frequencies of its words

In [77]:
sents_score = {}
for sent in sents:
    for word in sent:
        token = word.text.lower()
        if token in word_freq:
            # add this word's normalized frequency to the sentence's score
            sents_score[sent] = sents_score.get(sent, 0) + word_freq[token]


In [78]:
sents_score

{
 Text summarization refers to the technique of shortening long pieces of text.: 3.5,
 The intention is to create a coherent and fluent summary having only the main points outlined in the document.: 1.5,
 
 Automatic text summarization is a common problem in machine learning and natural language processing (NLP).: 3.375,
 Skyhoshi, who is a U.S.-based machine learning expert with 13 years of experience and currently teaches people his skills, says that “the technique has proved to be critical in quickly and accurately summarizing voluminous texts, something which could be expensive and time consuming if done without machines.”: 2.9375,
 
 Machine learning models are usually trained to understand documents and distill the useful information before outputting the required summarized texts.: 2.25,
 What’s the need for text summarization?: 2.0,
 Propelled by the modern technological innovations, data is to this century what oil was to the previous one.: 0.875,
 Today, our world is parachu
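The scoring rule can be illustrated on plain strings. This is a simplified sketch under stated assumptions: sentences are split on whitespace instead of being tokenized by spaCy, and `demo_word_freq`, `demo_sents`, and `score` are hypothetical names introduced for illustration.

```python
# Hypothetical normalized frequencies and sentences.
demo_word_freq = {"text": 1.0, "summarization": 0.875, "data": 0.4375}
demo_sents = [
    "Text summarization shortens text",
    "Data is everywhere",
]

def score(sentence, freqs):
    """Sum the normalized frequency of every known word in the sentence."""
    return sum(freqs.get(tok.lower(), 0.0) for tok in sentence.split())

demo_scores = {s: score(s, demo_word_freq) for s in demo_sents}
print(demo_scores)
```

Because the score is a plain sum, longer sentences tend to accumulate higher scores. Unlike the notebook's loop, this sketch also keeps sentences that contain no known word, with a score of 0.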

### selecting the highest-scoring sentences

In [79]:
from heapq import nlargest

In [80]:
select_length = int(len(sents)*0.5)
select_length

29

In [81]:
summary = nlargest(select_length, sents_score, key = sents_score.get)
summary

[The abstractive text summarization algorithms create new phrases and sentences that relay the most useful information from the original text — just like humans do.,
 Usually, text summarization in NLP is treated as a supervised machine learning problem (where future outcomes are predicted based on provided data).,
 The extractive text summarization technique involves pulling keyphrases from the source document and combining them to make a summary.,
 
 Text summarization refers to the technique of shortening long pieces of text.,
 
 Automatic text summarization is a common problem in machine learning and natural language processing (NLP).,
 When abstraction is applied for text summarization in deep learning problems, it can overcome the grammar inconsistencies of the extractive method.,
 However, the text summarization algorithms required to do abstraction are more difficult to develop; that’s why the use of extraction is still popular.,
 Furthermore, applying text summarization reduce
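`nlargest` returns sentences ordered by score, not by their position in the text: in the output above, the first sentence of the source ("Text summarization refers to...") appears only fourth. If the summary should read in document order, the selected sentences can be re-sorted afterwards. The sketch below shows the idea with hypothetical sentence ids (`s0` to `s3`) and made-up scores.

```python
from heapq import nlargest

# Hypothetical sentence ids, listed in document order, with made-up scores.
demo_scores = {"s0": 1.5, "s1": 3.375, "s2": 2.0, "s3": 3.5}
doc_order = list(demo_scores)  # dicts preserve insertion order

# Top two sentences, highest score first.
top = nlargest(2, demo_scores, key=demo_scores.get)

# Same sentences, restored to their original document order.
in_doc_order = sorted(top, key=doc_order.index)

print(top)           # prints: ['s3', 's1']
print(in_doc_order)  # prints: ['s1', 's3']
```

With spaCy `Span` objects, sorting the selected sentences by their `start` attribute should achieve the same effect, assuming all of them come from the same `Doc`.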

### making the summary

In [82]:
total_summary = [sent.text for sent in summary]  # each item in summary is a sentence Span
sumary = ' '.join(total_summary)


In [83]:
sumary

'The abstractive text summarization algorithms create new phrases and sentences that relay the most useful information from the original text — just like humans do.\n Usually, text summarization in NLP is treated as a supervised machine learning problem (where future outcomes are predicted based on provided data). The extractive text summarization technique involves pulling keyphrases from the source document and combining them to make a summary. \nText summarization refers to the technique of shortening long pieces of text. \nAutomatic text summarization is a common problem in machine learning and natural language processing (NLP). When abstraction is applied for text summarization in deep learning problems, it can overcome the grammar inconsistencies of the extractive method. However, the text summarization algorithms required to do abstraction are more difficult to develop; that’s why the use of extraction is still popular. Furthermore, applying text summarization reduces reading ti

In [84]:
len(sumary)

3457

### using displaCy to visualize the named entities in the summary

In [85]:
from spacy import displacy

In [86]:
sumary = nlp(sumary)

In [87]:
displacy.render(sumary, style='ent')