# Keyword Extractor

## Problem Statement:
Extract the keywords that best describe a given text.

## Solution
We will use BERT (Bidirectional Encoder Representations from Transformers), a language model introduced in a paper published by Google AI.
Its key idea is to apply bidirectional training of the Transformer, a popular attention model, to language modelling.

### How BERT works
BERT makes use of Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, Transformer includes two separate mechanisms — an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary. The detailed workings of Transformer are described in a paper by Google.

As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).

### Our Approach

#### Step 1: Extract keywords using n-grams

For a given text, extract candidate keywords using n-grams. We will use _CountVectorizer_ from Scikit-Learn, which lets us specify the length (in words) of the keywords we want to extract.
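
To make the idea of n-gram candidates concrete, here is a minimal, dependency-free sketch. It is not Scikit-Learn's implementation: the function name `ngram_candidates` and its parameters are illustrative assumptions, and the tokenization (lowercasing, words of two or more characters, stop words dropped before n-grams are formed) only approximates what `CountVectorizer` does by default.

```python
import re


def ngram_candidates(text, ngram_range=(1, 2), stop_words=frozenset()):
    """Return the sorted, de-duplicated word n-grams found in `text`.

    Hypothetical helper for illustration only.
    """
    # Lowercase, keep tokens of two or more word characters, drop stop words.
    tokens = [
        token
        for token in re.findall(r"\b\w\w+\b", text.lower())
        if token not in stop_words
    ]
    shortest, longest = ngram_range
    candidates = set()
    for n in range(shortest, longest + 1):
        # Slide a window of n tokens across the token list.
        for start in range(len(tokens) - n + 1):
            candidates.add(" ".join(tokens[start:start + n]))
    return sorted(candidates)


sentence = "Supervised learning maps an input to an output."
print(ngram_candidates(sentence, ngram_range=(1, 2), stop_words={"an", "to"}))
```

Because stop words are removed before the window slides, a bigram such as `maps input` can appear even though the two words are not adjacent in the original sentence.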

#### Step 2: Create Embeddings for both document and keywords

Next, we generate embeddings for both the document and the candidate keywords using a pretrained BERT-based model such as `distilbert-base-nli-stsb-mean-tokens` or `xlm-r-distilroberta-base-paraphrase-v1`, which have shown strong performance in semantic similarity and paraphrase identification, respectively.

#### Step 3: Similarity between document and keywords

The next step is to find the candidate keywords that are most similar to the document. We assume that the candidates most similar to the document are good keywords/keyphrases for representing it.

We will use cosine similarity to measure the similarity between each keyword and the document.
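
As a rough illustration of this ranking step, the sketch below computes cosine similarity, cos(u, v) = (u · v) / (‖u‖ ‖v‖), in plain Python. The three-dimensional vectors are made-up stand-ins for real embeddings, and `rank_candidates` is a hypothetical helper, not part of any library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_u * norm_v)


def rank_candidates(doc_vector, candidate_vectors, top_n=5):
    """Return the top_n (candidate, score) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(doc_vector, vector))
        for name, vector in candidate_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy vectors, invented for this example (not real BERT embeddings).
doc_vector = [1.0, 1.0, 0.0]
candidate_vectors = {
    "supervised learning": [1.0, 0.9, 0.1],
    "inductive bias": [0.1, 0.2, 1.0],
    "training data": [0.5, 1.0, 0.4],
}

for name, score in rank_candidates(doc_vector, candidate_vectors, top_n=2):
    print(f"{name}: {score:.4f}")
```

With real embeddings the vectors would have hundreds of dimensions, but the ranking logic stays the same.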

#### Step 4: Diversification

Step 3 tends to return phrases that are very similar to one another. To address this, we apply a diversification technique.

There is a reason why similar results are returned… they best represent the document! If we were to diversify the keywords/keyphrases then they are less likely to represent the document well as a collective.

Thus, the diversification of our results requires a delicate balance between the accuracy of keywords/keyphrases and the diversity between them.
There are two algorithms that we will be using to diversify our results:

1. Max Sum Similarity
2. Maximal Marginal Relevance




In [1]:
! pip install keybert



In [2]:
from keybert import KeyBERT
doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal). 
         A supervised learning algorithm analyzes the training data and produces an inferred function, 
         which can be used for mapping new examples. An optimal scenario will allow for the 
         algorithm to correctly determine the class labels for unseen instances. This requires 
         the learning algorithm to generalize from the training data to unseen situations in a 
         'reasonable' way (see inductive bias).
      """
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)



In [3]:
keywords

[('supervised', 0.6676),
 ('labeled', 0.4896),
 ('learning', 0.4813),
 ('training', 0.4134),
 ('labels', 0.3947)]

In [4]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)

[('supervised', 0.6676),
 ('labeled', 0.4896),
 ('learning', 0.4813),
 ('training', 0.4134),
 ('labels', 0.3947)]

In [5]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)

[('supervised learning', 0.6779),
 ('supervised', 0.6676),
 ('signal supervised', 0.6152),
 ('in supervised', 0.6124),
 ('labeled training', 0.6013)]

In [6]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words=None)

[('supervised learning is', 0.72),
 ('supervised learning algorithm', 0.6992),
 ('in supervised learning', 0.6624),
 ('labeled training data', 0.6125),
 ('supervised learning each', 0.6098)]

In [7]:
# With Max Sum Similarity
kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                          use_maxsum=True, nr_candidates=20, top_n=5)

[('learning function maps', 0.5341),
 ('training data unseen', 0.1464),
 ('learning algorithm analyzes', 0.4862),
 ('machine learning task', 0.2497),
 ('supervisory signal supervised', 0.3511)]
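
The following sketch shows the general idea behind Max Sum Similarity: shortlist the `nr_candidates` phrases closest to the document, then pick the `top_n` of them that are least similar to each other. It is an illustration under stated assumptions, not KeyBERT's actual implementation. The function `max_sum_selection` is hypothetical, and all similarity values are invented.

```python
from itertools import combinations


def max_sum_selection(doc_sims, pair_sims, top_n=2, nr_candidates=3):
    """Pick top_n shortlisted candidates with the lowest total pairwise similarity.

    doc_sims:  {candidate: similarity to the document}
    pair_sims: {frozenset({a, b}): similarity between candidates a and b}
    """
    # Shortlist the candidates most similar to the document.
    shortlist = sorted(doc_sims, key=doc_sims.get, reverse=True)[:nr_candidates]
    top_n = min(top_n, len(shortlist))
    best_combo, best_score = (), float("inf")
    # Exhaustively score every combination of top_n shortlisted candidates.
    for combo in combinations(shortlist, top_n):
        score = sum(pair_sims[frozenset(pair)] for pair in combinations(combo, 2))
        if score < best_score:
            best_combo, best_score = combo, score
    return list(best_combo)


# Invented similarity values, for illustration only.
doc_sims = {
    "supervised learning": 0.70,
    "supervised learning algorithm": 0.68,
    "labeled training data": 0.61,
    "inductive bias": 0.30,
}
pair_sims = {
    frozenset(("supervised learning", "supervised learning algorithm")): 0.95,
    frozenset(("supervised learning", "labeled training data")): 0.55,
    frozenset(("supervised learning algorithm", "labeled training data")): 0.60,
    frozenset(("supervised learning", "inductive bias")): 0.20,
    frozenset(("supervised learning algorithm", "inductive bias")): 0.22,
    frozenset(("labeled training data", "inductive bias")): 0.25,
}

print(max_sum_selection(doc_sims, pair_sims, top_n=2, nr_candidates=3))
print(max_sum_selection(doc_sims, pair_sims, top_n=2, nr_candidates=4))
```

In this toy data, widening the shortlist from 3 to 4 lets the weakly related `inductive bias` into the result, which suggests why a larger `nr_candidates` trades relevance for diversity. The exhaustive search also grows quickly with the shortlist size, so small values are the practical choice in a sketch like this.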

In [8]:
# With Maximal Marginal Relevance
kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                          use_mmr=True, diversity=0.7)

[('supervised learning algorithm', 0.6992),
 ('pairs infers function', 0.1981),
 ('unseen situations reasonable', 0.2142),
 ('value called supervisory', 0.2895),
 ('class labels unseen', 0.3469)]
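
Maximal Marginal Relevance can be sketched as a greedy loop: start with the candidate closest to the document, then repeatedly add the candidate that maximizes `(1 - diversity) * relevance - diversity * redundancy`, where redundancy is its highest similarity to anything already selected. The code below is a minimal illustration of that idea, not KeyBERT's actual code. The function `mmr_selection` is hypothetical, and the similarity values are invented.

```python
def mmr_selection(doc_sims, pair_sims, top_n=2, diversity=0.7):
    """Greedy Maximal Marginal Relevance selection.

    doc_sims:  {candidate: similarity to the document}
    pair_sims: {frozenset({a, b}): similarity between candidates a and b}
    """
    remaining = list(doc_sims)
    if not remaining or top_n <= 0:
        return []
    # Seed the selection with the most relevant candidate.
    selected = [max(remaining, key=doc_sims.get)]
    remaining.remove(selected[0])

    def mmr_score(candidate):
        # Redundancy: closest match among the already selected keywords.
        redundancy = max(pair_sims[frozenset((candidate, s))] for s in selected)
        return (1 - diversity) * doc_sims[candidate] - diversity * redundancy

    while remaining and len(selected) < top_n:
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Invented similarity values, for illustration only.
doc_sims = {
    "supervised learning": 0.70,
    "supervised learning algorithm": 0.68,
    "labeled training data": 0.61,
    "inductive bias": 0.30,
}
pair_sims = {
    frozenset(("supervised learning", "supervised learning algorithm")): 0.95,
    frozenset(("supervised learning", "labeled training data")): 0.55,
    frozenset(("supervised learning algorithm", "labeled training data")): 0.60,
    frozenset(("supervised learning", "inductive bias")): 0.20,
    frozenset(("supervised learning algorithm", "inductive bias")): 0.22,
    frozenset(("labeled training data", "inductive bias")): 0.25,
}

print(mmr_selection(doc_sims, pair_sims, top_n=2, diversity=0.7))
print(mmr_selection(doc_sims, pair_sims, top_n=2, diversity=0.1))
```

With `diversity=0.7` the second pick in this toy data is the dissimilar `inductive bias`, while `diversity=0.1` picks the near-duplicate `supervised learning algorithm`. That mirrors the balance between accuracy and diversity described earlier.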

#### Candidate Keywords/Keyphrases
In some cases, you may want to use candidate keywords generated by another keyword-extraction algorithm or taken from a predefined list of possible keywords/keyphrases. KeyBERT lets you pass such candidates in directly and ranks them against the document.

In [None]:
! pip install yake

In [11]:
import yake
from keybert import KeyBERT

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs.[1] It infers a
         function from labeled training data consisting of a set of training examples.[2]
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal). 
         A supervised learning algorithm analyzes the training data and produces an inferred function, 
         which can be used for mapping new examples. An optimal scenario will allow for the 
         algorithm to correctly determine the class labels for unseen instances. This requires 
         the learning algorithm to generalize from the training data to unseen situations in a 
         'reasonable' way (see inductive bias).
      """

# Create candidates
kw_extractor = yake.KeywordExtractor(top=50)
candidates = kw_extractor.extract_keywords(doc)
candidates = [candidate[0] for candidate in candidates]

# KeyBERT init
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc, candidates=candidates)

In [12]:
keywords

[('supervised learning algorithm', 0.6834),
 ('Supervised learning', 0.6658),
 ('Supervised', 0.6523),
 ('labeled training data', 0.5959),
 ('labeled training', 0.5779)]

### Guided KeyBERT
By default, KeyBERT extracts the keywords most related to a given document. However, there are times when stakeholders and users are looking for specific types of keywords. For example, when publishing an article on your website through Contentful, you typically already know the global keywords related to the article. In such cases, you can pass those terms as `seed_keywords` to steer the extraction toward them.

In [13]:
kw_model = KeyBERT()
seed_keywords = ["information"]
keywords = kw_model.extract_keywords(doc, use_mmr=True, diversity=0.1, seed_keywords=seed_keywords)

In [14]:
keywords

[('supervised', 0.6921),
 ('learning', 0.5493),
 ('labeled', 0.5388),
 ('data', 0.4507),
 ('training', 0.4438)]