# KeyBERT
KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.

The corresponding Medium post can be found [here](https://towardsdatascience.com/keyword-extraction-with-bert-724efca412ea).

This notebook is based on the KeyBERT GitHub repository, which can be found [here](https://github.com/MaartenGr/KeyBERT).

## Getting Started

### Installation

In [13]:
!pip install keybert
!pip install flair

### Usage
The most minimal example of keyword extraction is shown below:



In [7]:
from keybert import KeyBERT

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal). 
         A supervised learning algorithm analyzes the training data and produces an inferred function, 
         which can be used for mapping new examples. An optimal scenario will allow for the 
         algorithm to correctly determine the class labels for unseen instances. This requires 
         the learning algorithm to generalize from the training data to unseen situations in a 
         'reasonable' way (see inductive bias).
      """
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)
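
The call above returns a list of `(keyword, score)` tuples, where the score is the cosine similarity between the keyword's embedding and the document's embedding. The ranking idea can be sketched without any model, as below. This is only an illustration: the 3-dimensional vectors and the helper names `cosine_similarity` and `rank_candidates` are hypothetical and are not part of the KeyBERT API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(doc_embedding, candidate_embeddings, top_n=5):
    """Return the top_n (candidate, score) pairs, most similar first."""
    scored = [
        (word, cosine_similarity(doc_embedding, vector))
        for word, vector in candidate_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy embeddings; a real model produces hundreds of dimensions.
toy_doc = [1.0, 1.0, 0.0]
toy_candidates = {
    "supervised": [1.0, 0.9, 0.1],
    "learning": [0.8, 1.0, 0.0],
    "pair": [0.0, 0.2, 1.0],
}

toy_keywords = rank_candidates(toy_doc, toy_candidates, top_n=2)
for word, score in toy_keywords:
    print(f"{word}: {score:.3f}")
```

In this toy setup "supervised" and "learning" point in nearly the same direction as the document vector, so they outrank "pair".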

Use <b>keyphrase_ngram_range</b> to control the length of the resulting keywords/keyphrases:



In [14]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)


To extract keyphrases, simply set <b>keyphrase_ngram_range</b> to (1, 2) or higher depending on the number of words you would like in the resulting keyphrases:

In [15]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
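
To make the effect of <b>keyphrase_ngram_range</b> concrete, the sketch below enumerates candidate phrases for a given range. It is a simplified stand-in: KeyBERT relies on scikit-learn's `CountVectorizer` for this step, whereas the helper `ngram_candidates` here is a hypothetical pure-Python approximation that ignores stop words and uses a naive tokenizer.

```python
import re


def ngram_candidates(text, ngram_range=(1, 1)):
    """List the unique word n-grams in text for every n in ngram_range (inclusive)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    min_n, max_n = ngram_range
    candidates = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            candidates.append(" ".join(tokens[start:start + n]))
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(candidates))


sentence = "Supervised learning is the machine learning task"
unigrams = ngram_candidates(sentence, (1, 1))
phrases = ngram_candidates(sentence, (1, 2))

print(unigrams)      # single words only
print(len(phrases))  # single words plus two-word phrases
```

With `(1, 1)` only single words are candidates; with `(1, 2)` two-word phrases such as "machine learning" are added to the pool.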

We can highlight the keywords in the document by simply setting <b>highlight</b>:

In [12]:
keywords = kw_model.extract_keywords(doc, highlight=True)

### Max Sum Similarity
To diversify the results, we first take the <b>nr_candidates</b> words/phrases that are most similar to the document (for example, 2 x top_n). Then, we consider every combination of top_n items from those candidates and keep the combination whose members are the least similar to each other by cosine similarity.

In [16]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english', 
                              use_maxsum=True, nr_candidates=20, top_n=5)
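
The selection described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not KeyBERT's implementation: the 2-dimensional embeddings are made up, and `max_sum_selection` is a hypothetical helper name.

```python
import itertools
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def max_sum_selection(doc_embedding, candidate_embeddings, top_n, nr_candidates):
    """Pick the top_n candidates that are least similar to each other."""
    # Step 1: keep the nr_candidates phrases most similar to the document.
    ranked = sorted(
        candidate_embeddings,
        key=lambda w: cosine_similarity(doc_embedding, candidate_embeddings[w]),
        reverse=True,
    )
    pool = ranked[:nr_candidates]

    # Step 2: among all top_n-sized combinations, minimise summed pairwise similarity.
    best_combo, best_score = None, math.inf
    for combo in itertools.combinations(pool, top_n):
        score = sum(
            cosine_similarity(candidate_embeddings[a], candidate_embeddings[b])
            for a, b in itertools.combinations(combo, 2)
        )
        if score < best_score:
            best_combo, best_score = combo, score
    return list(best_combo)


# Hypothetical toy embeddings.
toy_doc = [1.0, 1.0]
toy_candidates = {
    "supervised learning": [1.0, 0.9],
    "supervised learning algorithm": [1.0, 0.8],
    "training data": [0.6, 1.0],
    "inductive bias": [-1.0, 0.2],
}

diverse = max_sum_selection(toy_doc, toy_candidates, top_n=2, nr_candidates=3)
print(diverse)
```

The number of combinations grows quickly with <b>nr_candidates</b>, so the exhaustive search in step 2 is only practical for small candidate pools.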

### Maximal Marginal Relevance 
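To diversify the results, we can also use Maximal Marginal Relevance (MMR), which is likewise based on cosine similarity, to select keywords/keyphrases. The <b>diversity</b> parameter controls the trade-off between relevance to the document and dissimilarity among the selected keyphrases. The results with high diversity: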

In [18]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english', 
                              use_mmr=True, diversity=0.7)


The results with low diversity:

In [29]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english', 
                              use_mmr=True, diversity=0.2)
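
The effect of <b>diversity</b> can be seen in a small sketch of MMR. It assumes the common formulation in which each new keyphrase maximises `(1 - diversity) * similarity_to_document - diversity * max_similarity_to_selected`. The toy embeddings and the helper name `mmr_selection` are hypothetical, and the code is an illustration rather than KeyBERT's own implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_selection(doc_embedding, candidate_embeddings, top_n=5, diversity=0.5):
    """Greedily select top_n candidates, trading relevance against redundancy."""
    words = list(candidate_embeddings)
    doc_sim = {w: cosine_similarity(doc_embedding, candidate_embeddings[w]) for w in words}

    # Start with the candidate that is most similar to the document.
    selected = [max(words, key=doc_sim.get)]
    remaining = [w for w in words if w not in selected]

    while remaining and len(selected) < top_n:
        def mmr_score(word):
            # Redundancy: similarity to the closest already-selected keyphrase.
            redundancy = max(
                cosine_similarity(candidate_embeddings[word], candidate_embeddings[s])
                for s in selected
            )
            return (1 - diversity) * doc_sim[word] - diversity * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy embeddings.
toy_doc = [1.0, 1.0]
toy_candidates = {
    "supervised learning": [1.0, 0.9],
    "supervised learning algorithm": [1.0, 0.8],
    "training data": [0.6, 1.0],
}

low_diversity = mmr_selection(toy_doc, toy_candidates, top_n=2, diversity=0.2)
high_diversity = mmr_selection(toy_doc, toy_candidates, top_n=2, diversity=0.7)
print(low_diversity)
print(high_diversity)
```

With a low diversity value the second pick is the near-duplicate "supervised learning algorithm"; with a high value the redundancy penalty pushes the selection toward "training data".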

## Embedding Models
KeyBERT supports many embedding models that can be used to embed the documents and words:

* Sentence-Transformers
* Flair
* spaCy
* Gensim
* USE (Universal Sentence Encoder)

Click [here](https://maartengr.github.io/KeyBERT/guides/embeddings.html) for a full overview of all supported embedding models.



### Sentence-Transformers
You can select any model from sentence-transformers [here](https://www.sbert.net/docs/pretrained_models.html) and pass it to KeyBERT with the <b>model</b> argument:



In [27]:
from keybert import KeyBERT
kw_model = KeyBERT(model='all-MiniLM-L6-v2')

Or select a SentenceTransformer model with your own parameters:



In [28]:
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=sentence_model)


### Flair
[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that is publicly available. Flair can be used as follows:



In [26]:
from keybert import KeyBERT
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
kw_model = KeyBERT(model=roberta)

## References
@misc{grootendorst2020keybert,<br>
  author       = {Maarten Grootendorst},<br>
  title        = {KeyBERT: Minimal keyword extraction with BERT.},<br>
  year         = 2020,<br>
  publisher    = {Zenodo},<br>
  version      = {v0.3.0},<br>
  doi          = {10.5281/zenodo.4461265},<br>
  url          = {https://doi.org/10.5281/zenodo.4461265}<br>
}