Recipe 6.3.  Summarizing Text Data
--
Articles and books are everywhere. Suppose you want to learn an NLP
concept, search for it on Google, and find a good article. You like the
content, but the article is too long to reread in full. You would rather
summarize it and save the summary so that you can come back to it later.

NLP has a solution for this: text summarization. With a summary at hand,
you don’t have to read the full article or book every time.

Problem
--
You want to summarize an article or document in Python using different algorithms.

Solution
--
Text summarization is the process of condensing a large document into a
shorter one without losing its context, which saves readers time. It can
be done using techniques such as the following:

• TextRank: A graph-based ranking algorithm

• Feature-based text summarization

• LexRank: TF-IDF with a graph-based algorithm

• Topic based

• Using sentence embeddings

• Encoder-Decoder Model: Deep learning techniques

How It Works
----
We will explore the first two approaches in this recipe and see how they work.

Method 3-1 : TextRank
---
TextRank is a graph-based ranking algorithm for NLP. It is inspired by
PageRank, the algorithm used in the Google search engine, but is designed
specifically for text. It extracts text units (sentences for a summary,
words for keywords), creates nodes out of them, and uses the relations
between the nodes to rank them and summarize the text.
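
To make the graph idea concrete, here is a minimal pure-Python sketch of TextRank-style sentence ranking. It is an illustration under stated assumptions, not Gensim's implementation: the helper names (tokenize, similarity, textrank_summary), the word-overlap similarity, and the toy sentences are all chosen for this example, and a library implementation will differ in details such as the similarity function and preprocessing.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of distinct words."""
    return set(re.findall(r"\w+", sentence.lower()))


def similarity(a, b):
    """Shared-word count, normalized by the log of each sentence's word count."""
    wa, wb = tokenize(a), tokenize(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids log(1) = 0 in the denominator
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank_summary(sentences, top_n=2, damping=0.85, iterations=50):
    """Rank sentences with PageRank-style updates; return top_n in original order."""
    n = len(sentences)
    # Edge weights: similarity between every pair of distinct sentences.
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence collects score from its neighbours, weighted by edge strength.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]


# Toy input (hypothetical): three related sentences and one off-topic sentence.
sentences = [
    "Cats chase mice in the barn.",
    "Mice hide from cats in the barn.",
    "The stock market fell sharply today.",
    "Cats and mice live in the barn.",
]
print(textrank_summary(sentences, top_n=2))
```

Sentences that share many words with the rest of the text reinforce one another and rise in the ranking, while the off-topic sentence has only weak edges and is left out of the summary.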

Let’s see how to do it using the Python package Gensim. The function
used is summarize().

Before that, let’s do all the imports and fetch the text. Let’s say your
article is the Wikipedia page on natural language processing.

Notes (19/4/2020): LDA

p(word) -> topic, threshold = 0.5
p(NLP) -> Text classification = 0.51 > 0.5

word embeddings -> Wikipedia

p(sentence) -> topic
p(According to the Govt Notifications) -> Govt. circular = 0.9

doc1
According to the Govt Notifications -> 90% Govt Circular
..                                  -> 34% Pvt Ltd companies
..                                  -> 32.5% IT Act
..

10 phrases
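
The thresholding rule in these notes can be sketched as follows. This is only an illustration: the function name assign_topic is hypothetical, and the scores are copied from the notes rather than computed by an actual LDA model.

```python
def assign_topic(topic_scores, threshold=0.5):
    """Return the best-scoring topic if its score exceeds the threshold, else None."""
    if not topic_scores:
        return None
    topic, score = max(topic_scores.items(), key=lambda item: item[1])
    return topic if score > threshold else None


# Scores for doc1, taken from the notes (not produced by a trained model).
doc1_scores = {"Govt Circular": 0.90, "Pvt Ltd companies": 0.34, "IT Act": 0.325}
print(assign_topic(doc1_scores))

# The word-level case from the notes: 0.51 > 0.5, so the topic is assigned.
print(assign_topic({"Text classification": 0.51}))
```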

In [2]:
# Import BeautifulSoup and urllib to fetch data from Wikipedia.
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Function to get the page title and paragraph text from a Wikipedia page
def get_only_text(url):
    page = urlopen(url)
    # Name the parser explicitly so BeautifulSoup does not have to guess one
    soup = BeautifulSoup(page, "html.parser")
    text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
    return soup.title.text, text

# Mention the Wikipedia url
url = "https://en.wikipedia.org/wiki/Natural_language_processing"

# Call the function created above; it returns a (title, text) tuple
text = get_only_text(url)

# Count the number of characters in the title and text combined
print(len(''.join(text)))

print(text)

8657
('Natural language processing - Wikipedia', 'Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.\n Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.\n The history of natural language processing (NLP) generally started in the 1950s, although work can be found from earlier periods.\nIn 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence[clarification needed].\n The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within thr

In [7]:
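# Import summarize and keywords from gensim
# (the gensim.summarization module was removed in Gensim 4.0, so this needs gensim < 4.0)
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords

# Convert the (title, text) tuple to a single string.
# str() keeps the tuple's printed form, so escaped "\n" sequences can show up in the output.
text = str(text)

# Summarize the text with ratio 0.1 (keep roughly 10% of the sentences)
summarize(text, ratio=0.1)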


'However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data.\nSuch models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.\\n Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed.'

In [8]:
# Get the list of keywords
print(keywords(text,ratio=0.1))

learning
learn
real
languages
systems
results
result
worlds
world
data
research
researched
tasks
task
statistical
base
based
natural language processing
rules
computers
computing
computational
process
translation
machine
word
words
answers
answering
hand
large
intelligence
human
produced
produce
produces
producing
input
generation
generally
generic
generated
including
include
included


Method 3-2 :  Feature-based text summarization
--
Feature-based text summarization methods extract features from each
sentence and use them to score its importance and rank it. Position,
length, term frequency, named entities, and many other features are used
to calculate the score.
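
As a rough illustration of feature-based scoring, the sketch below combines three of the features mentioned above (position, length, and term frequency) into one score per sentence. The function names, the equal weighting of the features, and the sample sentences are assumptions made for this example; real systems typically tune or learn the weights and use more features.

```python
import re
from collections import Counter


def feature_scores(sentences):
    """Score each sentence with three simple, equally weighted features."""
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    tf = Counter(w for words in tokenized for w in words)
    max_tf = max(tf.values(), default=1)
    max_len = max((len(words) for words in tokenized), default=0) or 1
    n = len(sentences)
    scores = []
    for idx, words in enumerate(tokenized):
        position = 1.0 - idx / n       # earlier sentences score higher
        length = len(words) / max_len  # longer sentences score higher
        # Average document-level frequency of the sentence's words, scaled to 0..1.
        term_freq = sum(tf[w] for w in words) / len(words) / max_tf if words else 0.0
        scores.append((position + length + term_freq) / 3)
    return scores


def summarize_by_features(sentences, top_n=1):
    """Return the top_n highest-scoring sentences in their original order."""
    scores = feature_scores(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]


# Hypothetical sample sentences.
sample = [
    "Text summarization shortens long text.",
    "It helps.",
    "Summarization of text saves reading time.",
]
print(summarize_by_features(sample, top_n=1))
```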

Luhn’s algorithm is one of the feature-based algorithms. The sumy library
provides it as LuhnSummarizer, alongside other summarizers such as
LsaSummarizer. Let’s see how to summarize the same article with sumy.
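
Before turning to the library, here is a simplified sketch of the idea behind Luhn’s scoring: words that occur often (and are not stop words) are treated as significant, and a sentence scores higher when it packs many significant words into a short span. The stop-word list, the min_count cutoff, and the function names are assumptions for illustration. The sketch also treats the whole stretch from the first to the last significant word as a single cluster, whereas Luhn’s original method additionally limits the gap allowed between significant words, and sumy’s implementation differs in its details.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on"}


def significant_words(sentences, min_count=2):
    """Words that are not stop words and occur at least min_count times."""
    counts = Counter(
        w
        for s in sentences
        for w in re.findall(r"\w+", s.lower())
        if w not in STOP_WORDS
    )
    return {w for w, c in counts.items() if c >= min_count}


def luhn_score(sentence, significant):
    """(number of significant words)^2 / length of the span that contains them."""
    words = re.findall(r"\w+", sentence.lower())
    hits = [i for i, w in enumerate(words) if w in significant]
    if not hits:
        return 0.0
    span = hits[-1] - hits[0] + 1
    return len(hits) ** 2 / span


# Hypothetical sample document.
doc = [
    "Summarization models rank sentences.",
    "The weather was pleasant yesterday.",
    "Good summarization models rank important sentences first.",
]
sig = significant_words(doc)
for sentence in doc:
    print(round(luhn_score(sentence, sig), 2), sentence)
```

The first sentence scores highest because all four significant words sit next to each other; the third contains the same words but spread over a longer span, and the second contains none.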

In [11]:
# Install sumy
# !pip install sumy
# Import the packages
from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.luhn import LuhnSummarizer

# Extracting and summarizing
LANGUAGE = "english"
SENTENCES_COUNT = 10

url = "https://en.wikipedia.org/wiki/Natural_language_processing"
parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
# This cell builds the LSA summarizer; for Luhn's algorithm, use
# summarizer = LuhnSummarizer(Stemmer(LANGUAGE)) instead.
summarizer = LsaSummarizer(Stemmer(LANGUAGE))
summarizer.stop_words = get_stop_words(LANGUAGE)

# Print the top-ranked sentences
for sentence in summarizer(parser.document, SENTENCES_COUNT):
    print(sentence)

[2] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced.
Generally, this task is much more difficult than supervised learning , and typically produces less accurate results for a given amount of input data.
However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web ), which can often make up for the inferior results if the algorithm used has a low enough time complexity to be practical.
In some areas, this shift has entailed substantial changes in how NLP systems are designed, such that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural language processing.
The machine-learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large 

In [18]:
print(parser.document,'\n\n')

print(summarize,'\n\n')

print(summarizer.stop_words,'\n\n')

<DOM with 18 paragraphs> 


<function summarize at 0x0000021263EC0BF8> 


frozenset({'wish', 'probably', 'zero', "we'd", 'their', 'except', 'far', 'sensible', 'forth', 'afterwards', 'himself', 'whereafter', 'own', "isn't", 'as', 'elsewhere', 'all', 'seems', 'during', 'obviously', 'yours', "it'd", 'ask', 'concerning', 'second', 'themselves', 'especially', 'thats', 'tries', 'get', 'might', 'also', 'am', 'anyone', 'may', "that's", 'thorough', 'either', 'somehow', 'reasonably', 'anyways', 'everybody', 'everything', 'corresponding', 'among', 'beforehand', 'gotten', 'nine', 'seven', 'thereafter', 'instead', 'why', 'really', 'wants', 'together', 'accordingly', 'quite', 'now', 'hither', "haven't", 'again', 'considering', 'can', 'another', 'welcome', 'wherein', 'ex', 'my', "you'd", 'seemed', 'novel', 'liked', 'against', 'who', 'y', 'thereupon', 'yes', 'consequently', 'only', 'ie', 'herein', 'mostly', 'given', 'over', 'serious', 'perhaps', 'exactly', 'regardless', 'always', 'yet', 'ones', 'above