<img src='data/images/section-notebook-header.png' />

# Text Summarization

## Overview

Text summarization refers to the task of condensing a given text into a shorter version while retaining its most important information. It is a process of distilling the key points, main ideas, and relevant details from a text document, whether it's an article, news story, research paper, or any other form of textual content. The goal of text summarization is to generate a concise and coherent summary that captures the essence of the original text, enabling users to get a quick overview or understanding of the content without having to read the entire document. Summarization can be extractive or abstractive in nature:

* **Extractive Summarization:** In this approach, the summary is created by selecting and combining the most important sentences or phrases from the original text. The selected sentences are usually representative of the main ideas and are directly taken from the source document. Extractive summarization methods rely on statistical techniques, machine learning algorithms, or graph-based algorithms to identify the salient sentences for inclusion in the summary.

* **Abstractive Summarization:** This approach involves generating a summary that may contain words, phrases, and even sentences that were not present in the original text. Abstractive summarization methods attempt to understand the meaning of the text and generate a summary using natural language generation techniques. They rely on deep learning models, such as recurrent neural networks (RNNs), transformers, or other sequence-to-sequence models, to generate coherent and contextually relevant summaries.

Text summarization has various applications, including:

* News summarization: Generating brief summaries of news articles or headlines.
* Document summarization: Condensing lengthy documents or research papers for efficient reading or reviewing.
* Email summarization: Providing concise overviews of long email threads.
* Social media summarization: Extracting important information from social media posts or conversations.
* Automatic summarization of legal documents: Summarizing legal contracts, court rulings, and other legal texts.
* Summarization for content recommendation: Generating summaries for recommendation systems to provide previews or snippets of articles, blogs, or videos.

Text summarization is a challenging task in NLP, requiring the model to understand the nuances of language, extract relevant information, and generate coherent and concise summaries. It is an active area of research with ongoing efforts to improve the quality and effectiveness of summarization algorithms.

## (Most) Basic Algorithm for Text Summarization

Since abstractive summarization requires the generation of new text, it has largely become the domain of deep learning methods. However, these models are difficult to train and require large datasets and substantial computing resources. Therefore, in practice, most users have to rely on pretrained models. Given the syllabus and focus of this course, this notebook focuses on very basic approaches. The goal is not to generate state-of-the-art summaries but to build intuition and experience with core NLP tasks. The figure below is taken from the lecture slides and shows a basic architecture for text summarization:

<img src='data/images/ts-basic-architecture.png' width='90%' />

As can be seen in the figure above, this architecture is arguably basic since it makes the following simplifying assumptions:

* **Extractive:** Summaries are generated as a subset of selected sentences; this implies that no new text is generated.

* **Generic:** The summary is solely based on the input document and not dependent on any user inputs such as a query or a prompt.

* **Single Document:** The input for the summarization is a single document about a single topic, in contrast to multiple documents about either the same or different topics.

With our focus on simplicity, the only real challenge we have to solve is to select the best sentences to form the final summary. There are, of course, many approaches to finding the best sentences, and they essentially differ in how they quantify the importance of a sentence. In this notebook, we build on the topic we already covered in the previous notebook about "Keyword Extraction". In a nutshell, the basic intuition is that a sentence is important if the keywords it contains have high importance scores.

## Setting up the Notebook

### Import Required Packages

Similar to the notebook about "Keyword Extraction", our context is news articles, so the `newspaper` package comes in handy. As a more or less arbitrary choice, we use Yake! to handle the keyword extraction and scoring step, which naturally requires the `yake` package again as well. We also use `spacy` for some basic data preprocessing.

In [None]:
import numpy as np
import yake
import re

from newspaper import Article

import spacy
nlp = spacy.load("en_core_web_sm")

---

## Prepare Input Document

### Fetch News Article

The code cell below performs the same steps to fetch an online news article using the [`newspaper`](https://newspaper.readthedocs.io/en/latest/) package as in the "Keyword Extraction" notebook, so we skip the details here. At the end, the variable `text` will contain the raw text of the news article.

In [None]:
url = "https://www.channelnewsasia.com/singapore/built-order-bto-waterway-sunrise-ii-project-delays-exceed-one-year-complete-housing-development-board-hdb-compensation-reimbursement-3324526"
#url = "https://www.channelnewsasia.com/singapore/focus-metaverse-another-fading-tech-fad-or-our-future-online-existence-3320981"
#url = "https://www.channelnewsasia.com/singapore/tiong-bahru-road-tree-collapse-bus-service-car-crushed-walkway-shelter-3322491"

article = Article(url)
article.download()
article.parse()

text = article.text

print(text)

### Preprocessing

As the very first step, we split our input news article into a list of sentences. This is important because we want to preprocess the sentences before the keyword extraction step, but add the original sentences, not the processed ones, to our final summary. Thus, we use `sentences` to hold on to all original sentences until the very end.

In [None]:
doc = nlp(text)

sentences = [ str(s).strip() for s in doc.sents ]

num_sentences = len(sentences)

print('Number of sentences: {}'.format(num_sentences))

As usual, we define a basic auxiliary function to preprocess all sentences. This makes it easy to go back and change the function to see how different preprocessing steps might affect the summarization results. Below, the function `preprocess()` basically just performs case-folding and some very simple cleaning steps. Since we use Yake!, we actually do not want to perform steps like stopword removal. However, feel free to modify this function by applying additional preprocessing steps.

In [None]:
def preprocess(sentences):
    sentences_processed = []
    # Iterate over each sentence and preprocess it
    for s in sentences:
        # Remove line breaks
        processed = re.sub('\n', ' ', s)
        # Use spaCy for tokenizing (seems to be convenient)
        processed = ' '.join([ t.text.lower() for t in nlp(processed) ])
        # Remove duplicate whitespaces (raw string avoids an invalid escape sequence warning)
        processed = re.sub(r'\s+', ' ', processed)
        sentences_processed.append(processed)
    return sentences_processed

In the code cell below, we call `preprocess()` to get `sentences_processed`, a list containing all the processed sentences. We then join these sentences into a new document that serves as the input for Yake!

In [None]:
sentences_processed = preprocess(sentences)

text_processed = ' '.join(sentences_processed)

print(text_processed)

---

## News Article Summarization

We can organize our simple approach for summarizing news articles into 4 basic steps.

* **Keyword Extraction:** Use Yake! for keyword extraction where each keyword is also assigned a score (the lower, the more important)

* **Sentence Scoring:** Use the scores of keywords to assign scores to each sentence

* **Sentence Selection:** Use sentence scores to identify the sentences to be included in the summary

* **Summary Generation:** Create the summary from selected sentences.

In the following, we go through each of the 4 steps in more detail.

### Keyword Extraction

We already saw in the "Keyword Extraction" notebook how to use Yake! to extract the most important keywords together with their scores from a document. In the code cell below, we perform the same steps, mostly using the default parameters. But again, you can also modify the parameters to see if and how they affect the final summary. An interesting parameter is `top`, which allows us to specify the maximum number of keywords extracted. It is easy to see that if we set this parameter very low, maybe even to 1, only sentences that include this single top keyword get a non-zero score; all other sentences will have a score of 0 and are unlikely to be included in the summary.

In [None]:
language = "en"
max_ngram_size = 3
deduplication_threshold = 0.9
deduplication_algo = 'seqm'
window_size = 1
num_of_keywords = 20 # <-- interesting parameter for us

yake_keyword_extractor = yake.KeywordExtractor(lan=language, 
                                               n=max_ngram_size, 
                                               dedupLim=deduplication_threshold, 
                                               dedupFunc=deduplication_algo, 
                                               windowsSize=window_size,
                                               top=num_of_keywords,
                                               features=None)

yake_keywords = yake_keyword_extractor.extract_keywords(text_processed)

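The extractor returns a list of `(keyword, score)` tuples. We sort this list by score in ascending order so that the most important keywords (i.e., those with the lowest scores) come first.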

In [None]:
yake_keywords_sorted = sorted(yake_keywords, key=lambda tup: tup[1], reverse=False)

Let's first have a quick look at the top keywords so we can later check if those keywords are indeed occurring in our summary.

In [None]:
for keyword, score in yake_keywords_sorted:
    print("{:.6f}:\t{}".format(score, keyword))

### Sentence Scoring

Now that we have the top keywords together with their scores, we can use this information to score all sentences. One thing we have to be careful about, though: Recall that with Yake!, the lower the score, the more important the keyword. However, simply summing all keyword scores for each sentence and then using the sentences with the lowest scores would be wrong. The problem is that sentences without any keyword would get a score of 0.0, i.e., the lowest possible score, and would therefore wrongly be considered the most important ones. There are various ways to solve this, but here we propose a very straightforward approach:

* Calculate the reciprocal of the original keyword scores so that the higher the value the more important the keyword

* Apply a logarithm to the reciprocal score so that a single keyword cannot dominate the calculation of the score of a sentence

* Normalize the score of all sentences based on the length; otherwise, longer sentences will always tend to be more important.
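
The three steps above can be condensed into a single formula. As a sketch, write $K$ for the set of extracted keywords, $y_k > 0$ for the Yake! score of keyword $k$, and $|s|$ for the number of tokens in sentence $s$ (this notation is introduced here for illustration only):

$$
\text{score}(s) = \frac{1}{|s|} \sum_{k \in K,\; k \text{ occurs in } s} \ln\left(\frac{1}{y_k}\right)
$$

A sentence without any keyword has an empty sum and therefore a score of 0. As long as all keyword scores $y_k$ are below 1, every term in the sum is positive, so such a sentence ranks below every sentence that contains at least one keyword.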

**Important:** In many ways, this approach is somewhat arbitrary and only chosen based on some intuition. Maybe having longer sentences in our summary is a good thing, in which case we shouldn't normalize the scores. Similarly, maybe applying the logarithm smooths the keyword scores too much, and instead we want to maximize the effects of the very top keywords.

In [None]:
# Create dictionary to keep track of sentence scores
sentence_scores = {}

for sid, sent in enumerate(sentences_processed):
    # Initialize sentence score
    sent_score = 0.0
    # Iterate over each top keyword
    for keyword, score in yake_keywords_sorted:
        # Just a failsafe in the case the keyword has score of 0 (should never happen, though)
        if score == 0:
            continue
        # Check if the keyword occurs in the sentence (simple substring match)
        if keyword in sent:
            # Update sentence score using "some" formula
            sent_score += np.log(1 / score)

    # Compute length of sentence (length = number of words/tokens)
    sent_len = len(sent.split())

    # Normalize score w.r.t. sentence length (guard against empty sentences)
    sentence_scores[sid] = sent_score / sent_len if sent_len > 0 else 0.0

Let's check out the scores we have just calculated for each sentence. In the code cell below, we first order all sentences (more specifically, their indices) with respect to their scores in descending order.

In [None]:
sentence_scores_sorted = sorted(sentence_scores.items(), key=lambda x: x[1], reverse=True)

for sid, score in sentence_scores_sorted:
    print('Sentence ID: {}\t (score: {})'.format(sid, score))

Now the core part of our summarization algorithm is already done: Each sentence has some kind of importance score that we can now utilize to find the sentences we want to put into the final summary.
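
To build some intuition for how the scoring behaves, the following minimal sketch applies the same idea (sum of `log(1 / score)` over matched keywords, normalized by sentence length) to a tiny toy example. The keyword scores and sentences are made up for illustration and do not come from Yake! or the news article; the helper `score_sentence()` is likewise hypothetical and uses only the standard library.

```python
import math

def score_sentence(sentence, keyword_scores):
    """Length-normalized sum of log(1 / score) over all keywords found in the sentence."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    total = 0.0
    for keyword, yake_score in keyword_scores.items():
        # Skip zero scores (reciprocal undefined); match keywords as substrings
        if yake_score > 0 and keyword in sentence:
            total += math.log(1.0 / yake_score)
    return total / len(tokens)

# Made-up keyword scores (lower = more important, as with Yake!)
toy_keywords = {"housing board": 0.01, "delay": 0.1}

toy_sentences = [
    "the housing board announced a delay",
    "the weather was nice that day",
    "a delay was expected",
]

toy_scores = [score_sentence(s, toy_keywords) for s in toy_sentences]

# Rank sentence indices from most to least important
ranking = sorted(range(len(toy_sentences)), key=lambda i: toy_scores[i], reverse=True)
print(ranking)
```

In this toy example, the sentence without any keyword gets a score of 0 and ends up last in the ranking, which is exactly the behavior that summing the raw Yake! scores would not give us.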

### Sentence Selection

Since we now know how important each sentence is, we mainly have to decide how long our summary should be. A simple but common way to specify the length of the summary is in terms of its length relative to the original input document. Here, we use `summary_size`, which represents the fraction of the sentences in the input document to keep. With `summary_size = 1/4`, we specify that the summary should be around 25% of the length of the input (in terms of the number of sentences). Note that we always round up (`np.ceil`).

In the code cell below, we not only pick the sentence indices with the largest scores, but also extract only those indices -- as we no longer need the score -- and finally sort the indices. We do this to preserve the original sentence order. This means that the most important sentence might not be the first sentence. However, it is only important that the most important sentences are there, and it is arguably preferable to preserve the original sentence order.
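
The selection logic just described can also be wrapped in a small reusable function. The following is only a sketch: the function name `select_top_sentences()` and the toy scores are hypothetical, and `math.ceil` is used in place of `np.ceil` to keep the example free of dependencies.

```python
import math

def select_top_sentences(sentence_scores, summary_size=0.25):
    """Return the indices of the top-scoring sentences in original document order."""
    # Always round up so that even short documents yield at least one sentence
    num_selected = math.ceil(summary_size * len(sentence_scores))
    # Rank (index, score) pairs by score, highest first
    ranked = sorted(sentence_scores.items(), key=lambda item: item[1], reverse=True)
    # Keep only the indices and restore the original sentence order
    return sorted(sid for sid, _ in ranked[:num_selected])

# Made-up scores for a document with 6 sentences
toy_sentence_scores = {0: 0.2, 1: 0.9, 2: 0.0, 3: 0.5, 4: 0.7, 5: 0.1}

selected = select_top_sentences(toy_sentence_scores, summary_size=1/4)
print(selected)
```

With 6 sentences and `summary_size = 1/4`, the target length is 1.5 sentences, which is rounded up to 2; the two highest-scoring sentences are then returned in document order.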

In [None]:
summary_size = 1/4

# Pick the top-scoring sentences (list is already sorted in descending order: the higher the score, the better)
top_sentences = sentence_scores_sorted[:int(np.ceil((summary_size*num_sentences)))]
print(top_sentences)

# Extract only the sentence indices; no need for the score any more
top_sentences = [ t[0] for t in top_sentences ]
print(top_sentences)

# Sort indices to preserve order 
top_sentences = sorted(top_sentences)
print(top_sentences)

print()
print('Number of sentences in the summary: {}'.format(len(top_sentences)))

### Summary Generation

The last step is now to actually generate the summary. The most intuitive approach is to include all the most important sentences (incl. preserving the order). However, we can also include any additional consideration we deem meaningful. For example, one can argue that the summary of a news article should always include the very first sentence as it (a) ensures a "smooth" start of the summary and (b) often contains some core information. While our list of important sentences might already include the first sentence, this is not guaranteed.

As such, in the code cell below, if the first sentence is not part of the most important sentences, we simply add it (i.e., sentence index 0) to the front.


In [None]:
if 0 not in top_sentences:
    top_sentences = [0] + top_sentences

Knowing all the indices of the sentences that should form our summary, we can finally generate it. Note that we take the sentences from `sentences`, i.e., the list containing the original and not the preprocessed sentences to ensure that the sentences in the summary look indeed like the ones in the original news article.

In [None]:
for idx in top_sentences:
    print(sentences[idx].strip())

There it is, our summary. For easy comparison, we can also print all original sentences:

In [None]:
for sent in sentences:
    print(sent)

---

## Discussion

As you have seen throughout the notebook, creating an extractive summary -- at least on a very basic level -- is actually not that difficult. Of course, the approach presented here is unlikely to yield state-of-the-art results since we made many simplifying assumptions and used very basic processing steps. Here are just a few considerations when aiming to improve the results:

* **Choice of Keyword Extraction Algorithm:** We used keyword extraction as a core building block for our summarization algorithm. Not only do many other keyword extraction algorithms exist, most (incl. Yake!) feature a series of input parameters that could potentially be tuned to obtain better summaries.

* **Beyond Importance:** We selected the most important sentences for the summary, where the importance of a sentence is a normalized sum of the scores of the keywords contained in the sentence. In practice, this might mean that a summary can contain sentences that can be very similar because they contain the same highly scored keywords. To address this, we could also consider other criteria, for example, "novelty" where we might ignore an important sentence when the summary already contains a very similar sentence. Or maybe we want to favor sentences that are particularly short/long, contain many person or place names, or contain numbers. Such choices will often depend on the application context (e.g., we might prefer summaries with many numbers in the case of financial documents).

* **Limitations of Extractive Summaries:** Picking non-consecutive sentences from a document -- even when preserving the order -- is unlikely to yield a summary that "flows" well. This is particularly true if the summary contains sentences with many pronouns and we might not know which phrases those pronouns refer to -- or even worse, the summary might in fact mislead us regarding the references of pronouns. In this case, we might want to lower the score of sentences containing (many) pronouns, or perform coreference resolution before creating our summary.
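
As an illustration of the "novelty" idea mentioned above, the following sketch greedily picks sentences by score but skips any sentence that is too similar to one already selected. Everything here is an assumption made for illustration: the helper names, the use of Jaccard similarity over word sets, the threshold `max_similarity`, and the toy sentences and scores are not part of the approach used in this notebook.

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two sentences."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def select_novel(sentences, scores, k, max_similarity=0.5):
    """Greedily pick up to k high-scoring sentences, skipping near-duplicates."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    selected = []
    for i in ranked:
        if len(selected) == k:
            break
        # Only keep the sentence if it differs enough from all selected ones
        if all(jaccard(sentences[i], sentences[j]) <= max_similarity for j in selected):
            selected.append(i)
    # Restore the original sentence order
    return sorted(selected)

# Made-up sentences and importance scores
toy_sentences = [
    "the project was delayed by one year",
    "the project was delayed by over one year",
    "buyers will receive compensation",
]
toy_scores = [0.9, 0.8, 0.5]

picked = select_novel(toy_sentences, toy_scores, k=2)
print(picked)
```

Here the second sentence is skipped although it has the second-highest score, because it shares almost all of its words with the first one. A similar hook could be used for other criteria, such as lowering the score of sentences with many pronouns.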

In short, it's very easy to point out problems and limitations with the simple summarization approach proposed in this notebook. However, there is also a wide range of meaningful and arguably intuitive extensions that could improve it further. Having such a simple baseline algorithm (a) shows that it can be very easy to get started, but also (b) helps to better understand the involved challenges and potential solutions towards more sophisticated results.

---

## Summary

Extractive summarization based on keyword extraction is a technique that focuses on identifying and selecting the most important sentences or phrases from a text using keywords. In this approach, keywords are extracted from the original text, representing the significant concepts or topics discussed in the document. These keywords serve as the basis for determining the salient sentences that will be included in the summary.

The process of keyword extraction involves several steps. Initially, the text is often preprocessed, for example, by removing stopwords, punctuation, and other irrelevant elements (although some methods such as Yake! work on largely unprocessed text). Then, various algorithms and techniques are applied to identify and rank the keywords. Common approaches include frequency-based methods like TF-IDF (Term Frequency-Inverse Document Frequency), graph-based methods such as TextRank, and statistical techniques such as RAKE (Rapid Automatic Keyword Extraction) or -- like in this notebook -- Yake! (Yet Another Keyword Extractor).

Once the keywords are determined, the next step is to identify the sentences in the document that contain these keywords or are semantically related to them. These sentences are considered to be the most informative and are selected for inclusion in the summary. The final summary is created by combining these selected sentences, preserving their original order or rearranging them to improve coherence.

Extractive summarization based on keyword extraction offers a relatively simple and effective way to generate summaries. It relies on the assumption that important information is often associated with the keywords present in the text. However, this approach may have limitations in capturing the overall context and generating summaries that are truly representative of the original content. It may struggle with understanding the relationships between sentences or capturing nuanced information that is not explicitly associated with the extracted keywords.