<img src='data/images/section-notebook-header.png' />

# Keyword Extraction

Keyword extraction in natural language processing (NLP) is the task of automatically identifying and extracting the most important words or phrases from a piece of text. These keywords are representative of the main topics or themes present in the text. The goal is to summarize the content and capture its essence by selecting the most relevant and informative terms.

Keyword extraction is widely used in various NLP applications, including document summarization, question answering, information retrieval, text classification, and topic modeling. By identifying and extracting keywords, we can gain insights into the main subjects discussed in a document or a collection of documents. This helps in organizing, categorizing, and searching textual data more efficiently.

There are different approaches to keyword extraction in NLP. Some common techniques include:

* **Frequency-based methods:** These methods rely on the assumption that important words occur frequently in a document compared to less important ones. They calculate statistical measures such as term frequency (TF) or term frequency-inverse document frequency (TF-IDF) to identify significant terms.

* **Graph-based methods:** These methods construct a graph representation of the text, where nodes represent words or phrases, and edges represent relationships between them. Algorithms like TextRank or PageRank can be applied on the graph to determine the importance of each node and extract keywords based on their centrality scores.

* **Machine learning approaches:** These methods utilize supervised or unsupervised machine learning algorithms to train models on labeled data or extract patterns from the data. Techniques such as support vector machines (SVM), Naive Bayes, or clustering algorithms can be employed for keyword extraction.

* **Hybrid methods:** These methods combine multiple techniques, leveraging both statistical measures and linguistic rules to extract keywords. They often yield better results by incorporating different aspects of keyword importance.

## Setting up the Notebook

### Import Required Packages

In [None]:
import re

Besides the commonly used packages, we also need the following packages in this notebook.

* [Newspaper3k](https://newspaper.readthedocs.io/en/latest/) downloads news articles and uses heuristics to extract the headline and content from the web page. It also performs some basic text analysis, including a basic approach to keyword extraction.

* [rake-nltk](https://pypi.org/project/rake-nltk/) is an implementation of the RAKE algorithm covered in this notebook. This package requires NLTK to be installed.

* [yake](https://pypi.org/project/yake/) is an implementation of the Yake! algorithm covered in this notebook.

* [PyTextRank](https://spacy.io/universe/project/spacy-pytextrank) is an implementation of the TextRank algorithm covered in this notebook. The package requires spaCy to be installed, as the algorithm is added to the spaCy pipeline.

In [None]:
from newspaper import Article
from rake_nltk import Rake
import yake

from nltk.corpus import stopwords as sw

In [None]:
import spacy
import pytextrank

# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")

# add PyTextRank to the spaCy pipeline
nlp.add_pipe("textrank")

---

## Fetch News Article

For easy testing of the keyword extraction algorithms, we use online news articles as input documents. The [Newspaper3k](https://newspaper.readthedocs.io/en/latest/) package makes this very easy.

**Side note:** Articles from online news sites such as Channel NewsAsia seem to be more convenient because their content consists mainly of ASCII characters. Sites such as The Straits Times also use Unicode characters for quotes, apostrophes, etc. As some extraction methods require us to specify such characters as boundaries between keywords or keyword candidates, it's much easier if we can specify them in ASCII only. In practice, this can all be handled, but we want to keep it simple here.
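
If an article does contain typographic Unicode characters, one way to handle them is to map them to ASCII before extraction. The following is a minimal sketch, not part of the notebook's pipeline: the helper `to_ascii_punctuation` and the mapping `ASCII_REPLACEMENTS` are hypothetical names, and the mapping covers only a few common characters. Note that the last step is lossy, since it drops any character that has no ASCII equivalent.

```python
import unicodedata

# Hypothetical mapping of common typographic characters to ASCII equivalents
ASCII_REPLACEMENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # non-breaking space
}

def to_ascii_punctuation(text):
    # Replace the known typographic characters first
    for source, target in ASCII_REPLACEMENTS.items():
        text = text.replace(source, target)
    # Decompose accented letters, then drop anything that is still non-ASCII
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii")

sample = "\u201cIt\u2019s a caf\u00e9,\u201d she said \u2013 twice\u2026"
print(to_ascii_punctuation(sample))
```

Such a function could be applied to `article.text` before any further preprocessing, so that the ASCII-only boundary characters used later also match quotes and dashes from sites that use typographic punctuation.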

### Create `Article` Object

We first create an `Article` object by giving our article URL of choice as input parameter to the constructor. Note that the code cell below does not actually download the article.

In [None]:
# Example articles; only the last assignment takes effect.
# Comment out or reorder the lines to analyze a different article.
url = "https://www.channelnewsasia.com/singapore/built-order-bto-waterway-sunrise-ii-project-delays-exceed-one-year-complete-housing-development-board-hdb-compensation-reimbursement-3324526"
url = "https://www.channelnewsasia.com/singapore/focus-metaverse-another-fading-tech-fad-or-our-future-online-existence-3320981"
url = "https://www.channelnewsasia.com/singapore/tiong-bahru-road-tree-collapse-bus-service-car-crushed-walkway-shelter-3322491"

article = Article(url)

### Download & Analyze Article

The code cell below initiates the download of the news article and performs a basic analysis.

In [None]:
article.download()
article.parse()
article.nlp()

### Inspect Keywords

The analysis of the news article also includes a basic approach for extracting keywords. The approach is basic in the sense that all keywords are individual words rather than longer phrases. Still, let's have a look at the results so we can compare them with the results of the more sophisticated keyword extraction methods.

In [None]:
for keyword in article.keywords:
    print(keyword)

Lastly, we store the content of the article in the variable `text` for use throughout the notebook.

In [None]:
text = article.text

### Preprocessing

For some of the keyword extraction methods used in this notebook, it's recommended to perform some preprocessing. As mentioned above, it's convenient if the content contains only ASCII characters. The utility method below removes all line break characters. It then tokenizes the text using spaCy and concatenates all tokens back into a string. The difference is that there is now a whitespace between each pair of tokens (incl. punctuation marks, quotes, etc.). Lastly, the method collapses any remaining runs of whitespace into a single space.

In [None]:
def preprocess(text):
    # Remove line breaks
    processed = re.sub('\n', ' ', text)
    # Use spaCy for tokenizing (seems to be convenient)
    processed = ' '.join([ t.text for t in nlp(processed) ])
    # Remove duplicate whitespaces (raw string avoids an invalid escape sequence warning)
    processed = re.sub(r'\s+', ' ', processed)
    return processed

We can now preprocess our news article and actually have a look at it. This gives a sense of what the article is about, which in turn lets us assess what kind of keywords to expect.

In [None]:
text = preprocess(text)

print(text)

---

## Keyword Extraction

There are several basic keyword extraction algorithms that are commonly used in natural language processing and text analysis. Here are some of the most widely used algorithms:

* **Frequency-based approach:** This method simply counts the number of occurrences of each word in a document and extracts the words with the highest frequency as keywords.

* **TF-IDF (Term Frequency-Inverse Document Frequency) approach:** This method assigns a weight to each word in a document based on its frequency and its rarity across all documents in a corpus. The words with the highest TF-IDF score are extracted as keywords.

* **RAKE (Rapid Automatic Keyword Extraction) algorithm:** This algorithm splits the text at stopwords and punctuation marks to obtain candidate keywords (sequences of contiguous content words). It scores each word based on its frequency and its degree, i.e., how often it co-occurs with other words within the candidates, and scores each candidate as the sum of its word scores.

* **YAKE (Yet Another Keyword Extractor) algorithm:** This algorithm combines several statistical features computed from the document itself (such as casing, position, frequency, and the context in which a term appears) to score candidate keywords or phrases. Apart from a stopword list, it needs no external corpus or training data.

* **TextRank algorithm:** This algorithm uses a graph-based ranking method to identify the most significant keywords or sentences in a text document. For keyword extraction, the graph links words that co-occur within a small window; for sentence extraction, it links sentences based on their similarity.

These basic keyword extraction algorithms can be applied to various types of text documents and can provide useful insights into the topics and concepts present in the text. However, they may have limitations in terms of accuracy and effectiveness, depending on the specific context and the quality of the text data.
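
To make the first two approaches in the list above more concrete, here is a minimal sketch of TF-IDF scoring using only the Python standard library. The function names, the crude regex tokenizer, and the three-sentence toy corpus are assumptions made for illustration; the sketch uses the plain `tf * log(N / df)` weighting, whereas library implementations typically add smoothing and normalization.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphabetic tokens only (a deliberately crude tokenizer)
    return re.findall(r"[a-z]+", text.lower())

def tf_idf_keywords(documents, doc_index, top_k=3):
    tokenized = [tokenize(doc) for doc in documents]
    num_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Term frequency within the document of interest
    tf = Counter(tokenized[doc_index])
    scores = {
        term: count * math.log(num_docs / df[term])
        for term, count in tf.items()
    }
    # Sort by descending score; break ties alphabetically for a stable order
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_k]

# Toy corpus (made up for this example)
corpus = [
    "the tree fell on the bus stop, the tree blocked the road",
    "the bus service was delayed on the road",
    "the metaverse is the future of the internet",
]

for term, score in tf_idf_keywords(corpus, doc_index=0):
    print("{:.3f}:\t{}".format(score, term))
```

In this toy corpus, "the" is the most frequent word in the first document, so a purely frequency-based approach would rank it first. Because it occurs in every document, its TF-IDF score is zero, and "tree" ranks highest instead.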

### RAKE -- Rapid Automatic Keyword Extraction

RAKE (Rapid Automatic Keyword Extraction) is a keyword extraction algorithm that was introduced by Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley in a research paper in 2010. The RAKE algorithm is a simple and efficient technique for automatically extracting keywords or phrases from a text document. It works by first splitting the text at stopwords and phrase delimiters (e.g., punctuation marks); each remaining sequence of contiguous words becomes a candidate keyword.

RAKE then scores each word based on two statistics computed over all candidates: its frequency (how often the word occurs) and its degree (how often it co-occurs with words in the candidates, including itself). A common choice for the word score is the ratio of degree to frequency, which favors words that predominantly occur in longer candidates. The score of a candidate keyword is the sum of the scores of its words. The algorithm then selects the candidates with the highest scores as the final set of extracted keywords.

One of the key advantages of the RAKE algorithm is its simplicity and speed. It does not rely on complex linguistic or statistical models, and it can process large amounts of text quickly and efficiently. However, the RAKE algorithm may not always produce the most accurate or representative set of keywords, as it does not take into account the semantic relationships between words or the context in which they appear.
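
To illustrate how little machinery RAKE needs, the following is a minimal from-scratch sketch of RAKE-style scoring with the degree-to-frequency ratio as word score. It is not the `rake-nltk` implementation used below: the function name `rake_sketch`, the whitespace tokenization, and the tiny stopword set are assumptions made for this example.

```python
import re
from collections import defaultdict

def rake_sketch(text, stopwords):
    # Split into fragments at punctuation marks (phrase delimiters)
    fragments = re.split(r"[.,;:?!()\[\]{}\"]", text.lower())
    # Split each fragment at stopwords to obtain candidate phrases
    candidates = []
    for fragment in fragments:
        phrase = []
        for word in fragment.split():
            if word in stopwords:
                if phrase:
                    candidates.append(tuple(phrase))
                phrase = []
            else:
                phrase.append(word)
        if phrase:
            candidates.append(tuple(phrase))
    # Word frequency and degree (co-occurrences within candidates, incl. the word itself)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in candidates:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    # Candidate score = sum of the degree/frequency ratios of its words
    scores = {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in set(candidates)
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

example = ("Keyword extraction is not that difficult. "
           "Rapid automatic keyword extraction is one of those.")
example_stopwords = {"is", "not", "that", "one", "of", "those"}

for keyword, score in rake_sketch(example, example_stopwords):
    print("{:.1f}:\t{}".format(score, keyword))
# 14.0:  rapid automatic keyword extraction
# 6.0:   keyword extraction
# 1.0:   difficult
```

The example shows the typical bias of the degree-to-frequency ratio: words that appear in long candidates get a high degree, so longer phrases tend to receive higher scores.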

#### Create `Rake` Object

As we saw in the lecture, RAKE uses stopwords, punctuation marks, and other user-defined characters to specify the boundaries of keywords. While `Rake` uses some meaningful default parameters, let's manually specify the set of `stopwords` and the set of `punctuations` here. Since we have an English news article, it is natural to go with English stopwords. In `punctuations` we put all common punctuation marks, brackets/parentheses, hyphens, and quotes. But feel free to play with these sets and see the results.

In [None]:
stopwords = set(sw.words('english'))
punctuations = set([ c for c in ".,;:?!(){}[]-'\""])

rake = Rake(stopwords=stopwords, punctuations=punctuations)

#### Extract Keywords

By calling the method `extract_keywords_from_text()` we initiate the keyword extraction process.

In [None]:
#text = "Keyword keyword extraction is not that difficult after all. There are awesome libraries that can help you with keyword extraction. Rapid automatic keyword extraction is one of those."

rake.extract_keywords_from_text(text)

A keyword, particularly one with a high score, is likely to occur multiple times in the document. RAKE will return each individual instance. Since all instances of the same keyword have the same score, we can remove duplicates to make the result set look cleaner. The code cell below accomplishes this by adding all extracted keywords, together with their scores, to a set.

In [None]:
rake_unique_keywords = set()

# Collect all ranked phrases; the set removes duplicate (keyword, score) pairs
for score, keyword in rake.get_ranked_phrases_with_scores():
    rake_unique_keywords.add((keyword.lower(), score))

Having all duplicates removed, we can now sort the keywords with respect to their score, as well as limit the result set to the top-10 keywords.

In [None]:
rake_unique_keywords_sorted = sorted(rake_unique_keywords, key=lambda tup: tup[1], reverse=True)[:10]

Lastly, we can print the top keywords.

In [None]:
for keyword, score in rake_unique_keywords_sorted:
    print("{:.3f}:\t{}".format(score, keyword))

### YAKE! -- Yet Another Keyword Extractor

YAKE (Yet Another Keyword Extractor) is a keyword extraction algorithm that was introduced in a research paper in 2018. It is an unsupervised algorithm that extracts keywords or phrases from a single document using statistical features of the text alone; apart from a stopword list, it does not require training data or an external corpus. The YAKE algorithm first splits the text into terms and computes a set of features for each term: its casing (e.g., capitalized words or acronyms), its position in the text, its frequency, how many different terms it co-occurs with (its relatedness to context), and in how many different sentences it appears.

These features are combined into a single score per term, where a lower score indicates a more important term. Candidate keywords are n-grams of up to a configurable maximum length, and the score of a candidate is derived from the scores of its terms and the frequency of the candidate. Finally, near-duplicate candidates are removed using a string similarity measure. One of the key advantages of the YAKE algorithm is its ability to handle multi-word expressions and phrases as keywords. Since it relies only on statistics of the document itself, it can be applied across languages and domains, provided a stopword list is available. Additionally, the implementation is customizable, allowing users to adjust parameters such as the maximum n-gram size, the deduplication threshold, and the number of returned keywords.

#### Create `yake.KeywordExtractor` Object

Similar to RAKE, the YAKE! implementation supports a series of input parameters; see below for a concrete example. However, let's first run it with the default parameters which reflect the values for the parameters that have been identified using parameter tuning in the paper.

In [None]:
yake_keyword_extractor = yake.KeywordExtractor()

#### Extract Keywords

By calling the method `extract_keywords()` we initiate the keyword extraction process.

In [None]:
yake_keywords = yake_keyword_extractor.extract_keywords(text)

As the YAKE! algorithm -- or at least this implementation of the algorithm -- does not return duplicates, we can directly print the top keywords. Since we don't remove duplicates using a `set()`, there is no need to sort the keywords as they are already sorted. By default, the algorithm returns the top-20 keywords. Note that in the case of YAKE!, the lower the score, the better.

In [None]:
for keyword, score in yake_keywords:
    print("{:.6f}:\t{}".format(score, keyword))

#### Using a Custom `yake.KeywordExtractor` Object

In the code cell below, we again create a `yake.KeywordExtractor` object, but this time we specify the input parameters explicitly. Note, however, that the values below are in fact the default values, apart from `top`, which is set to 10 instead of 20. You can change the values of these parameters to see how they affect the resulting keyword list.

In [None]:
language = "en"
max_ngram_size = 3
deduplication_threshold = 0.9
deduplication_algo = 'seqm'
window_size = 1
num_of_keywords = 10

yake_keyword_extractor = yake.KeywordExtractor(lan=language, 
                                               n=max_ngram_size, 
                                               dedupLim=deduplication_threshold, 
                                               dedupFunc=deduplication_algo, 
                                               windowsSize=window_size,
                                               top=num_of_keywords,
                                               features=None)

yake_keywords = yake_keyword_extractor.extract_keywords(text)

With the keywords extracted by the custom `yake.KeywordExtractor`, we can sort them by ascending score (lower is better) and display the top keywords.

In [None]:
yake_keywords_sorted = sorted(yake_keywords, key=lambda tup: tup[1], reverse=False)

In [None]:
for keyword, score in yake_keywords_sorted:
    print("{:.6f}:\t{}".format(score, keyword))
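
The role of `deduplication_threshold` can be illustrated with a small standalone sketch: a candidate is dropped if it is too similar to a better-ranked keyword that was already kept. This is only an illustration of threshold-based deduplication, not the code used inside the `yake` package; the function name `deduplicate`, the use of `difflib.SequenceMatcher` as similarity measure, and the example candidates are assumptions.

```python
from difflib import SequenceMatcher

def deduplicate(scored_keywords, threshold=0.9):
    # scored_keywords: list of (keyword, score) pairs, best candidates first
    kept = []
    for keyword, score in scored_keywords:
        # Compare against all keywords kept so far
        is_near_duplicate = any(
            SequenceMatcher(None, keyword, other).ratio() > threshold
            for other, _ in kept
        )
        if not is_near_duplicate:
            kept.append((keyword, score))
    return kept

# Made-up candidates, sorted by ascending score (lower is better, as in YAKE!)
candidates = [
    ("tree collapse", 0.01),
    ("tree collapsed", 0.02),
    ("bus service", 0.03),
]

for keyword, score in deduplicate(candidates, threshold=0.9):
    print("{:.6f}:\t{}".format(score, keyword))
```

With a threshold of 0.9, "tree collapsed" is dropped because it is almost identical to the better-ranked "tree collapse". A lower threshold removes more near-duplicates, while a threshold of 1.0 keeps every candidate in this sketch.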

### TextRank

TextRank is a keyword and sentence extraction algorithm that was introduced in a research paper by Mihalcea and Tarau in 2004. It is a graph-based ranking algorithm that uses a variation of the PageRank algorithm to identify the most significant keywords or sentences in a text document. The TextRank algorithm works by first creating a graph representation of the text. For keyword extraction, each node is a word (typically after filtering by part of speech, e.g., keeping only nouns and adjectives), and two words are connected by an edge if they co-occur within a small window. For sentence extraction, each node is a sentence, and the edges between the nodes are weighted by the degree of similarity between the sentences.

In the original paper, the similarity between two sentences is based on the number of words they share, normalized by the lengths of the sentences; other measures, such as cosine similarity between vector representations of the sentences, are also used in practice. The TextRank algorithm then calculates the importance score of each word or sentence by applying a variant of the PageRank algorithm to the graph representation of the text. The resulting scores represent the relative importance of each word or sentence in the text, with higher scores indicating greater significance. For keyword extraction, adjacent top-ranked words are merged into multi-word keywords in a postprocessing step. The TextRank algorithm can be used for both keyword extraction and summarization, with the top-ranked keywords or sentences representing the most important concepts and ideas in the text.

One of the key advantages of the TextRank algorithm is its ability to identify important concepts and relationships between concepts in a text document, rather than just individual keywords or sentences. This can result in a more comprehensive and accurate summary or representation of the text.
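
The keyword variant of TextRank can be sketched in a few lines of plain Python. The sketch below is a simplification under stated assumptions, not the PyTextRank implementation used in this notebook: it filters words with a stopword set instead of part-of-speech tags, builds the co-occurrence window over the filtered word sequence, uses an unweighted graph, and omits the merging of adjacent words into phrases. The function name `textrank_keywords` and the toy text are made up for this example.

```python
import re
from collections import defaultdict

def textrank_keywords(text, stopwords, window=2, damping=0.85, iterations=50):
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]
    # Undirected co-occurrence graph: link words at most `window` positions apart
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    # PageRank-style power iteration on the unweighted graph
    rank = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        rank = {
            word: (1 - damping) + damping * sum(
                rank[n] / len(neighbors[n]) for n in neighbors[word]
            )
            for word in neighbors
        }
    # Highest-ranked words first; break ties alphabetically
    return sorted(rank.items(), key=lambda item: (-item[1], item[0]))

toy_text = "Tree fell on bus stop. Tree blocked road. Bus service delayed."

for word, score in textrank_keywords(toy_text, stopwords={"on"})[:3]:
    print("{:.3f}:\t{}".format(score, word))
```

In the toy text, "bus" co-occurs with more distinct words than any other word, so it ends up with the highest rank. This reflects the intuition behind the algorithm: a word is important if it is connected to many other (important) words.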

#### Extract Keywords

When setting up the notebook, we already added the implementation of TextRank to the spaCy pipeline. This means that we can use spaCy to analyze our news article as usual, and spaCy will extract all keywords using the TextRank algorithm. Similar to RAKE, the TextRank result might contain duplicate occurrences of the same keyword (but with different capitalization, for example). So we again use a `set()` to easily remove those duplicates.

In [None]:
doc = nlp(text)

textrank_unique_keywords = set()

for phrase in doc._.phrases:
    textrank_unique_keywords.add((phrase.text.lower(), phrase.rank))

This intermediate step of removing the duplicates requires us to sort the keywords again; we also limit the result to the top 10 keywords to be consistent with the other outputs above.

In [None]:
textrank_unique_keywords_sorted = sorted(textrank_unique_keywords, key=lambda tup: tup[1], reverse=True)[:10]

As the last step, we print the top keywords and compare them to the results from the other algorithms.

In [None]:
for keyword, score in textrank_unique_keywords_sorted:
    print("{:.3f}:\t{}".format(score, keyword))
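
Beyond comparing the printed lists by eye, the overlap between two keyword lists can also be quantified. The following minimal sketch computes the Jaccard similarity between two lists of keyword strings; the helper `keyword_overlap` and the example lists are hypothetical and not part of any of the packages used above.

```python
def keyword_overlap(keywords_a, keywords_b):
    # Compare case-insensitively, as the algorithms differ in how they handle casing
    set_a = {k.lower() for k in keywords_a}
    set_b = {k.lower() for k in keywords_b}
    if not set_a and not set_b:
        return 0.0, set()
    shared = set_a & set_b
    # Jaccard similarity: size of the intersection divided by size of the union
    return len(shared) / len(set_a | set_b), shared

# Made-up keyword lists for illustration
list_a = ["tree collapse", "bus service", "Tiong Bahru Road"]
list_b = ["bus service", "tiong bahru road", "walkway shelter"]

similarity, shared = keyword_overlap(list_a, list_b)
print("{:.2f}:\t{}".format(similarity, sorted(shared)))
```

In the notebook, the same function could be called with the keyword strings from, e.g., `rake_unique_keywords_sorted` and `textrank_unique_keywords_sorted`. Note that exact string matching is strict: "tree collapse" and "tree collapsed" count as different keywords.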

---

## Summary

Rake, Yake, and TextRank are all keyword extraction algorithms that use different techniques to identify the most important keywords or phrases in a text document. Here are some of the key differences between these algorithms:

* **Approach:** Rake uses a statistical approach based on the co-occurrence of words within candidate phrases, while Yake combines several statistical features of the terms in a single document (casing, position, frequency, context, and spread across sentences). TextRank uses a graph-based approach that ranks words (or sentences) based on their relationships in the text.

* **Multi-word expressions:** All three algorithms can return multi-word expressions and phrases as keywords, but they obtain them differently. Rake scores individual words and treats sequences of contiguous content words as candidates. Yake scores n-grams up to a maximum length. TextRank, in its original form, ranks individual words and merges adjacent top-ranked words into multi-word keywords as a kind of postprocessing step.

* **Performance:** Yake reported strong results compared to other unsupervised methods in its original evaluation, while Rake and TextRank are older and simpler baselines; Rake in particular is very fast to compute. Which algorithm yields the best keywords depends on the text at hand.

* **Customization:** Yake allows for more customization of the algorithm parameters and weights than Rake and TextRank, making it more adaptable to specific contexts and data sets.

* **Application:** While all three algorithms can be used for keyword extraction, TextRank is also commonly used for text summarization, as it can identify the most important sentences in a text document.

Overall, the choice of which algorithm to use for keyword extraction will depend on the specific goals and requirements of the analysis, as well as the characteristics of the text data being analyzed.