# TEXT SUMMARIZATION - USING EXISTING LIBRARIES

**Preprocessing the texts**

In [1]:
import nltk
import re
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/mohammadazimi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
!pip install goose3



In [3]:
from goose3 import Goose
g = Goose()
url = 'https://en.wikipedia.org/wiki/Automatic_summarization'
article = g.extract(url=url)

In [4]:
article.cleaned_text

'Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Artificial intelligence (AI) algorithms are commonly developed and employed to achieve this, specialized for different types of data.\n\nText summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the other hand, visual content can be summarized using computer vision algorithms. Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.[2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/or the most importa

In [5]:
original_sentences = nltk.sent_tokenize(article.cleaned_text)
original_sentences

['Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.',
 'Artificial intelligence (AI) algorithms are commonly developed and employed to achieve this, specialized for different types of data.',
 'Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.',
 '[1] On the other hand, visual content can be summarized using computer vision algorithms.',
 'Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.',
 '[2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and
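
The tokenized output above shows Wikipedia citation markers such as `[1]` and `[2][3][4]` attached to the start of the following sentence. One possible cleanup, sketched here under the assumption that the markers are always bracketed integers, is to strip them from the text before calling `nltk.sent_tokenize`. The helper name `strip_citations` is hypothetical and not part of the original notebook.

```python
import re

# Matches bracketed numeric citation markers such as "[1]" or "[12]".
CITATION_PATTERN = re.compile(r"\[\d+\]")

def strip_citations(text):
    """Remove "[n]" markers and collapse runs of spaces or tabs (newlines are kept)."""
    without_markers = CITATION_PATTERN.sub("", text)
    return re.sub(r"[ \t]+", " ", without_markers).strip()

sample = "sentences in a given document.[1] On the other hand, visual content can be summarized."
print(strip_citations(sample))
```

If this cleanup is used, the same cleaned text would need to be passed to every summarizer as well, so that the summary sentences still match the tokenized originals exactly.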

In [6]:
from IPython.display import HTML, display  # for displaying HTML in Jupyter Notebook

def visualize(title, sentence_list, best_sentences):
    """
    Display the article title and highlight the best sentences in the summary.
    - title: str, the article title
    - sentence_list: list of str, all sentences in the article (in order)
    - best_sentences: list of str, the selected summary sentences
    """
    text = ""
    for sentence in sentence_list:
        if sentence in best_sentences:
            text += f"<mark>{sentence}</mark> "
        else:
            text += f"{sentence} "
    html = f"<h2>{title}</h2><p>{text}</p>"
    display(HTML(html))
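
The function above inserts sentence text into the HTML as-is, so characters such as `<` or `&` in an article could break the markup. A minimal variant that escapes the text and returns the HTML string instead of displaying it might look like the sketch below. The name `build_highlight_html` is hypothetical; it is an illustration, not a replacement used elsewhere in this notebook.

```python
from html import escape

def build_highlight_html(title, sentence_list, best_sentences):
    """Return the highlighted HTML as a string, escaping all article text."""
    best = set(best_sentences)  # set lookup avoids rescanning the list per sentence
    parts = []
    for sentence in sentence_list:
        safe = escape(sentence)
        parts.append(f"<mark>{safe}</mark>" if sentence in best else safe)
    return f"<h2>{escape(title)}</h2><p>{' '.join(parts)}</p>"

html_text = build_highlight_html("Demo", ["a < b.", "Second."], ["a < b."])
print(html_text)
```

Returning a string keeps the function testable outside Jupyter; the result can still be passed to `HTML(...)` for display.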

**Sumy**

- https://pypi.org/project/sumy/

In [7]:
!pip install sumy



In [8]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer

In [9]:
parser = PlaintextParser.from_string(article.cleaned_text, Tokenizer("english"))

In [10]:
summarizer = LuhnSummarizer()

In [11]:
summary = summarizer(parser.document, 120)  # second argument: number of sentences to keep
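
To give a rough feel for how an extractive summarizer can rank sentences, here is a simplified frequency-based scorer. It is a stand-in for illustration only: it is not sumy's `LuhnSummarizer` and does not reproduce Luhn's scoring of significant-word clusters. The stop-word list, function names, and demo sentences are all assumptions made for this sketch.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real tools use larger lists).
STOP_WORDS = {"the", "a", "an", "of", "is", "to", "and", "in", "on", "it"}

def score_sentences(sentences):
    """Score each sentence by the summed document frequency of its non-stop words."""
    tokenized = [
        [w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    frequencies = Counter(word for words in tokenized for word in words)
    return [sum(frequencies[w] for w in words) for words in tokenized]

def top_sentences(sentences, count):
    """Return the `count` highest-scoring sentences in their original order."""
    scores = score_sentences(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:count])  # restore document order
    return [sentences[i] for i in keep]

demo = [
    "Summarization shortens text.",
    "Cats sleep.",
    "Extractive summarization selects sentences from text.",
]
print(top_sentences(demo, 2))
```

Because scores are plain sums, this sketch favors longer sentences; a more careful scorer would normalize by sentence length.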

In [12]:
summary

(<Sentence: Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.>,
 <Sentence: Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.>,
 <Sentence: Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.>,
 <Sentence: [2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/or the most important video segments (key-shots), normally in a temporally ordered fashion.>,
 <Sentence: [5][6][7][8] Video summaries simply retain a carefully selected subset of the origina

In [13]:
best_sentences = [str(sentence) for sentence in summary]
print(best_sentences)

['Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.', 'Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.', 'Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.', '[2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/or the most important video segments (key-shots), normally in a temporally ordered fashion.', '[5][6][7][8] Video summaries simply retain a carefully selected subset of the original video frames and, therefore, are not identical to th

In [14]:
visualize(article.title, original_sentences, best_sentences)

**Pysummarization**

- https://pypi.org/project/pysummarization/

In [15]:
!pip install pysummarization



In [16]:
from pysummarization.nlpbase.auto_abstractor import AutoAbstractor
from pysummarization.tokenizabledoc.simple_tokenizer import SimpleTokenizer
from pysummarization.abstractabledoc.top_n_rank_abstractor import TopNRankAbstractor

In [17]:
auto_abstractor = AutoAbstractor()
auto_abstractor.tokenizable_doc = SimpleTokenizer()
auto_abstractor.delimiter_list = [".", "\n"]
abstractable_doc = TopNRankAbstractor() # Using Top-N-Rank algorithm to rank sentences

In [18]:
summary = auto_abstractor.summarize(article.cleaned_text, abstractable_doc)

In [19]:
summary

{'summarize_result': ['Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.\n',
  ' Artificial intelligence (AI) algorithms are commonly developed and employed to achieve this, specialized for different types of data.\n',
  'Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.\n',
  '[1] On the other hand, visual content can be summarized using computer vision algorithms.\n',
  ' Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.\n',
  '[2][3][4] Video summarization algorithms identify and extract from the original video content the m

In [20]:
# Clean up whitespace and newlines in each summary sentence
best_sentences = [re.sub(r'\s+', ' ', sentence).strip() for sentence in summary["summarize_result"]]

In [21]:
best_sentences

['Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.',
 'Artificial intelligence (AI) algorithms are commonly developed and employed to achieve this, specialized for different types of data.',
 'Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.',
 '[1] On the other hand, visual content can be summarized using computer vision algorithms.',
 'Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.',
 '[2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and

In [22]:
visualize(article.title, original_sentences, best_sentences)
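
`visualize` highlights a sentence only when it matches an entry in `original_sentences` exactly. Since the pysummarization setup above splits on `"."` and `"\n"` while `original_sentences` came from `nltk.sent_tokenize`, the two may disagree on sentence boundaries (for example around abbreviations), and such sentences would silently go unhighlighted. A small diagnostic, sketched here with a hypothetical helper name and made-up demo data, can list the summary sentences that have no exact match.

```python
def find_unmatched(summary_sentences, source_sentences):
    """Return summary sentences with no exact match among the source sentences."""
    source = set(source_sentences)
    return [s for s in summary_sentences if s not in source]

# Made-up example: a period-based splitter breaks "Dr. Smith ..." into two pieces.
source_demo = ["Dr. Smith wrote the paper.", "It was well received."]
summary_demo = ["Dr.", "Smith wrote the paper.", "It was well received."]
print(find_unmatched(summary_demo, source_demo))
```

In the notebook this could be called as `find_unmatched(best_sentences, original_sentences)` before visualizing; an empty list means every summary sentence can be highlighted.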

**BERT**

- https://pypi.org/project/bert-extractive-summarizer/

In [23]:
!pip install bert-extractive-summarizer



In [24]:
!pip install --upgrade transformers
!pip install torch



In [25]:
from summarizer import Summarizer



In [26]:
summarizer = Summarizer()
summary = summarizer(article.cleaned_text)

In [27]:
summary

'Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. There are two general approaches to automatic summarization: extraction and abstraction. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs. Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary. Video summarization is a related domain, where the system automatically creates a trailer of a long video. Depending on the different literature and the defi

In [28]:
summary_tokenized = nltk.sent_tokenize(summary)
summary_tokenized

['Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.',
 'There are two general approaches to automatic summarization: extraction and abstraction.',
 'Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.',
 'Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic).',
 'This problem is called multi-document summarization.',
 'Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.',
 'Video summarization is a related domain, where the system automatically creates a trailer of a long video.',
 'Depending on the diff

In [None]:
visualize(article.title, original_sentences, summary_tokenized)