## Problem
Text summarization of article/document using different algorithms in
Python.

## Solution
Text summarization is the process of making large documents into smaller
ones without losing the context, which eventually saves readers time. This
can be done using different techniques like the following:

• TextRank: A graph-based ranking algorithm

• Feature-based text summarization

• LexRank: TF-IDF with a graph-based algorithm

• Topic based

• Using sentence embeddings

• Encoder-Decoder Model: Deep learning techniques

## Method 4-1 TextRank
TextRank is the graph-based ranking algorithm for NLP. It is basically
inspired by PageRank, which is used in the Google search engine but
particularly designed for text. It will extract the topics, create nodes out of
them, and capture the relation between nodes to summarize the text.

Let’s see how to do it using the Python package Gensim. “Summarize”
is the function used.

Before that, let’s import the notes. Let’s say your article is Wikipedia for
the topic of Natural language processing

In [6]:
# Import BeautifulSoup and urllib libraries to fetch data from Wikipedia.
from bs4 import BeautifulSoup
from urllib.request import urlopen
# Function to get data from Wikipedia
def get_only_text(url):
    page = urlopen(url)
    soup = BeautifulSoup(page)
    text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
    print (text)
    return soup.title.text, text

# Mention the Wikipedia url
url="https://en.wikipedia.org/wiki/Natural_language_processing"

# Call the function created above
text = get_only_text(url)

Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
 Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.
 The history of natural language processing generally started in the 1950s, although work can be found from earlier periods.
In 1950, Alan Turing published an article titled "Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.
 The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.[2]  However, real progress was much slower, and after 

In [7]:
# Count the number of letters
len("".join(text))

9398

In [10]:
# Import summarize from gensim
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords

# Convert text to string format
text = str(text)

#Summarize the text with ratio 0.1 (10% of the total words.)
summarize(text, ratio=0.1)

'However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data.\nSuch models have the advantage that they can express the relative certainty of many different possible answers rather than only one, producing more reliable results when such a model is included as a component of a larger system.\\n Systems based on machine-learning algorithms have many advantages over hand-produced rules:\\n The following is a list of some of the most commonly researched tasks in natural language processing.'

In [11]:
#keywords
print(keywords(text, ratio=0.1))

learning
learn
languages
process
research
researched
real
systems
natural language processing
results
result
worlds
world
data
tasks
task
statistical
base
based
called
calling
calls
translation
word
words
answers
answering
years machine
rules
hand
large
year
human
input
produced
produce
produces
producing
intelligence
generation
generally
generic
generated
including
include
included
corpora


## Method 4-2 Feature-based text summarization
Your feature-based text summarization methods will extract a feature from
the sentence and check the importance to rank it. Position, length, term
frequency, named entity, and many other features are used to calculate the
score.

Luhn’s Algorithm is one of the feature-based algorithms, and we will
see how to implement it using the sumy library.

In [14]:
# Install sumy
#!pip install sumy
# Import the packages
from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.luhn import LuhnSummarizer

In [13]:
# Extracting and summarizing
LANGUAGE = "english"
SENTENCES_COUNT = 10
url="https://en.wikipedia.org/wiki/Natural_language_processing"
parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
summarizer = LsaSummarizer()
summarizer = LsaSummarizer(Stemmer(LANGUAGE))
summarizer.stop_words = get_stop_words(LANGUAGE)
for sentence in summarizer(parser.document, SENTENCES_COUNT):
    print(sentence)

[2] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced.
However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web ), which can often make up for the inferior results if the algorithm used has a low enough time complexity to be practical.
In some areas, this shift has entailed substantial changes in how NLP systems are designed, such that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural language processing.
Increasingly, however, research has focused on statistical models , which make soft, probabilistic decisions based on attaching real-valued weights to each input feature.
Natural language understanding Convert chunks of text into more formal representations such as first-order logic struct