<a href="https://colab.research.google.com/github/abdulrahman-hassanin/Keyword-Extraction/blob/main/Keyword_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Keyword Extraction 

Keyword extraction algorithms are methods that find relevant keywords or key phrases from text documents. They are useful for summarizing, indexing, and searching text documents. There are different types of keyword extraction algorithms, such as:

- **Statistical methods**: They compute statistics for keywords and use those statistics to score them. Some examples are word frequency, word collocation, co-occurrence, TF-IDF, and YAKE.

- **Graph-based methods**: They build a graph of words or phrases and use graph algorithms to rank them. Some examples are TextRank, Multi-word Keyword Scoring Strategy, ExpandRank, PositionRank, and Word Attraction Rank.

- **Deep-learning (embedding-based) methods**: They rely on semantic similarity rather than statistical properties, using word or phrase embeddings to measure the similarity and importance of candidate keywords. KeyBERT is one example.

In [20]:
text = """ The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring significantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including
ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task,
our model establishes a new single-model state-of-the-art BLEU score of 41.8 after
training for 3.5 days on eight GPUs, a small fraction of the training costs of the
best models from the literature. We show that the Transformer generalizes well to
other tasks by applying it successfully to English constituency parsing both with
large and limited training data.
"""

In [None]:
!pip install yake
!pip install git+https://github.com/boudinfl/pke.git
!pip install rake-nltk
!pip install sentence-transformers

# Statistical Methods

These are the simplest methods: they compute statistics for candidate keywords and use those statistics to score them.

- **TF-IDF**: estimates a word's importance in a document relative to the entire corpus.
- **YAKE**: uses statistical features from a single document to extract keywords.
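
The TF-IDF idea can be illustrated with a minimal sketch in plain Python. It assumes whitespace-like tokenization with a simple regular expression and the basic `tf * log(N / df)` weighting; the function name `tfidf_keywords` and the toy corpus are hypothetical, and library implementations (for example in scikit-learn) use smoothed variants of this formula.

```python
import math
import re
from collections import Counter

def tfidf_keywords(docs, doc_index, top_n=3):
    """Rank the words of docs[doc_index] by TF-IDF against the whole corpus."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Term frequency inside the target document
    tf = Counter(tokenized[doc_index])
    total = len(tokenized[doc_index])
    scores = {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]

# Toy corpus (illustrative only)
toy_corpus = [
    "the transformer relies on attention",
    "the encoder and the decoder use attention",
    "the model is trained on eight gpus",
]
print(tfidf_keywords(toy_corpus, 0, top_n=2))
```

Words that appear in every document, such as "the", get a score of zero because `log(N / df)` is zero, which is why TF-IDF needs a corpus rather than a single document.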


In [28]:
import yake

def yake_extractor(text):
    """
    Uses YAKE to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    # n=3: candidates of up to 3 words; top=5: keep the 5 best candidates
    extractor = yake.KeywordExtractor(lan="en", n=3, windowsSize=3, top=5)
    # extract_keywords returns (keyword, score) pairs;
    # a lower YAKE score means a more relevant keyword
    keywords = extractor.extract_keywords(text)
    return [keyword for keyword, score in keywords]

yake_extractor(text)

['dominant sequence transduction',
 'convolutional neural networks',
 'sequence transduction models',
 'dominant sequence',
 'sequence transduction']

# Graph-based methods
Graph-based methods build a graph of related terms from the document; for example, the graph can connect terms that co-occur in the text. A graph algorithm then ranks the terms.


*   **PositionRank**: selects candidates using part-of-speech tagging, then weights them by the positions at which their words occur in the document.

*   **RAKE**: based on the observation that keywords are frequently composed of multiple words and usually do not include stop words or punctuation.
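
To make the graph idea concrete, here is a simplified sketch of a position-biased random walk over a word co-occurrence graph, in the spirit of PositionRank. It is an assumption-laden illustration rather than the pke implementation: it skips part-of-speech filtering and candidate phrases, scores single words only, and the function name `position_biased_rank` and its defaults are hypothetical.

```python
import re
from collections import defaultdict

def position_biased_rank(text, window=3, damping=0.85, iterations=50):
    """Rank words with a PageRank-style walk biased toward early positions."""
    words = re.findall(r"[a-z]+", text.lower())

    # Undirected co-occurrence graph: the edge weight counts how often two
    # different words fall inside the same sliding window of `window` tokens.
    weights = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != w:
                weights[w][words[j]] += 1.0
                weights[words[j]][w] += 1.0

    # Position bias: an occurrence at 1-based position p contributes 1/p,
    # so words that appear early (and often) get a larger share.
    bias = defaultdict(float)
    for position, w in enumerate(words, start=1):
        bias[w] += 1.0 / position
    total_bias = sum(bias.values())
    bias = {w: b / total_bias for w, b in bias.items()}

    # Power iteration of the biased random walk
    scores = {w: 1.0 / len(bias) for w in bias}
    for _ in range(iterations):
        new_scores = {}
        for w in scores:
            incoming = sum(
                weights[v][w] / sum(weights[v].values()) * scores[v]
                for v in weights[w]
            )
            new_scores[w] = (1 - damping) * bias[w] + damping * incoming
        scores = new_scores
    return sorted(scores, key=scores.get, reverse=True)

ranked = position_biased_rank("attention model attention encoder attention decoder")
print(ranked)
```

In this toy input, "attention" is both the best-connected word and the earliest one, so it ends up at the top of the ranking.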

In [33]:
import pke

def position_rank_extractor(text):
    """
    Uses PositionRank to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    # define the valid parts of speech to occur in the graph
    pos = {'NOUN', 'PROPN', 'ADJ', 'ADV'}
    # 1. create a PositionRank extractor
    extractor = pke.unsupervised.PositionRank()
    # 2. load the content of the document
    extractor.load_document(text, language='en')
    # 3. select candidate phrases of at most 5 words
    extractor.candidate_selection(maximum_word_number=5)
    # 4. weight the candidates using the sum of their words' scores, which
    #    are computed using a random walk biased with the position of the
    #    words in the document. In the graph, nodes are words (restricted
    #    to the parts of speech in `pos`) that are connected if they occur
    #    in a window of 3 words.
    extractor.candidate_weighting(window=3, pos=pos)
    # 5. get the 5-highest scored candidates as keyphrases
    keyphrases = extractor.get_n_best(n=5)
    return [keyword for keyword, score in keyphrases]

position_rank_extractor(text)    

['dominant sequence transduction models',
 'best models',
 'models',
 'convolutional neural networks',
 'new simple network architecture']

In [39]:
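import nltk
from rake_nltk import Rake

nltk.download('punkt')
# Rake() also reads NLTK's stop-word list; fetch it if it is not already installed
nltk.download('stopwords')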

def rake_extractor(text):
    """
    Uses Rake to extract the top 5 keywords from a text
    Arguments: text (str)
    Returns: list of keywords (list)
    """
    r = Rake()
    r.extract_keywords_from_text(text)
    return r.get_ranked_phrases()[:5]

rake_extractor(text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['best performing models also connect',
 'two machine translation tasks show',
 'requiring significantly less time',
 'new simple network architecture',
 'dominant sequence transduction models']
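
The long phrases in this output follow from how RAKE scores candidates. A rough sketch of that scoring, assuming the degree-to-frequency ratio as the word metric and a tiny hand-written stop-word list (both the list and the function name `rake_sketch` are hypothetical; rake-nltk uses NLTK's stop words and tokenizer instead):

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not NLTK's list)
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "on", "to", "in", "is", "are", "with"}

def rake_sketch(text, top_n=3):
    """Score candidate phrases by the sum of degree/frequency of their words."""
    # Split on punctuation first, then on stop words, to get candidate phrases.
    phrases = []
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))

    # frequency(w): number of occurrences of w in the candidate phrases
    # degree(w): total length of the phrases in which w occurs
    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)

    # Phrase score = sum of degree/frequency over its words
    scored = {p: sum(degree[w] / frequency[w] for w in p) for p in set(phrases)}
    ranked = sorted(scored, key=lambda p: (-scored[p], p))
    return [" ".join(p) for p in ranked[:top_n]]

sample = "The encoder and the decoder are connected. Attention is the key mechanism of the Transformer."
print(rake_sketch(sample, top_n=1))
```

Because every word in a long phrase inherits a high degree, this scoring favors long runs of non-stop-words, which matches the five-word phrases returned above.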

# Deep-Learning-based methods

The previous methods rely on the statistical properties of a text rather than on semantic similarity.

Deep learning has enabled embedding-based methods: several keyword extraction methods use document embeddings so that the ranking is based on semantic similarity.

## Keyword Extraction using BERT

This is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings, the idea behind KeyBERT. It has three steps:


1.   **Candidate Keywords/Keyphrases**:
    * Create a list of candidate keywords or keyphrases from the document.
    * We use scikit-learn's `CountVectorizer` for this. It lets us specify the length of the candidates (the n-gram range), turning keywords into keyphrases, and it is also a quick way to remove stop words.

2.   **Embedding**:
    * Convert the document and the candidates to numerical vectors. We use BERT for this purpose, as it has shown great results for both semantic similarity and paraphrase tasks.
    * Specifically, we use **DistilBERT**, as it has shown great performance in similarity tasks, which is what we are aiming for with keyword/keyphrase extraction.

3.   **Similarity**:
    * Find the candidates that are most similar to the document.
    * We use the **cosine similarity** between the document vector and each candidate vector.
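
As a small illustration of step 3, cosine similarity can be computed by hand on toy vectors. The three-dimensional vectors below are made-up values, not real BERT embeddings, and the helper name `cosine` is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up "embeddings" for a document and three candidate phrases
doc_vec = [1.0, 2.0, 0.0]
toy_candidates = {
    "machine translation": [2.0, 4.0, 0.0],  # same direction as the document
    "neural networks": [1.0, 1.0, 0.0],      # similar direction
    "eight gpus": [0.0, 1.0, 3.0],           # mostly different direction
}

# Rank candidates from most to least similar to the document
ranked_candidates = sorted(
    toy_candidates,
    key=lambda name: cosine(doc_vec, toy_candidates[name]),
    reverse=True,
)
print(ranked_candidates)
```

Cosine similarity depends only on the direction of the vectors, not their length, so a short candidate phrase can be compared directly with a full document.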



In [51]:
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# 1. Extract candidate words/phrases (bigrams only, English stop words removed)
n_gram_range = (2, 2)
stop_words = "english"

count = CountVectorizer(ngram_range=n_gram_range, stop_words=stop_words).fit([text])
candidates = count.get_feature_names_out()

# 2. Embed the document and every candidate with the same model
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
doc_embedding = model.encode([text])
candidate_embeddings = model.encode(candidates)

# 3. Similarity: keep the top_n candidates closest to the document
top_n = 5
similarities = cosine_similarity(doc_embedding, candidate_embeddings)
# argsort is ascending, so the last top_n indices are the most similar
# candidates (listed from least to most similar)
keywords = [candidates[index] for index in similarities.argsort()[0][-top_n:]]

keywords

['machine translation',
 'neural networks',
 'convolutional neural',
 'models superior',
 'decoder best']

# References

- [Keyword Extraction — A Benchmark of 7 Algorithms in Python](https://towardsdatascience.com/keyword-extraction-a-benchmark-of-7-algorithms-in-python-8a905326d93f)
- [Keyword Extraction Methods — The Overview](https://towardsdatascience.com/keyword-extraction-methods-the-overview-35557350f8bb)
- [Keyword Extraction process in Python with Natural Language Processing(NLP)](https://towardsdatascience.com/keyword-extraction-process-in-python-with-natural-language-processing-nlp-d769a9069d5c)
- [Keyword Extraction with BERT](https://towardsdatascience.com/keyword-extraction-with-bert-724efca412ea)