# KeyBERT Tutorial

### Derrick Luyen
### ICS 691B Final Project

In [1]:
# Installation
# %pip install keybert

## What is KeyBERT?

KeyBERT is a keyword extraction technique: given one or more input documents, it extracts the words and phrases that are most similar in meaning to the document as a whole.

### Quick Explanation of BERT

BERT (Bidirectional Encoder Representations from Transformers) is a language model used for natural language processing (NLP). Its main strength is that it represents each word using the context of the words around it, which helps computers tell apart the different meanings a word can have.

KeyBERT takes advantage of BERT embeddings to get the key words / phrases. BERT embeddings are vectors that encode text; they typically have a length of 768 for the base BERT model, although the length depends on the model that is used. In simpler terms, BERT embeddings are vectors that represent words.


## How does KeyBERT work?

<img src="./images/keybert.png" alt="keybert flow" />

### Input

First of all, KeyBERT takes in a document or documents of text as input.

### Tokenize Words / Phrases

The first thing we need to do with the input document is determine which words and phrases could possibly be extracted from it. KeyBERT uses CountVectorizer from the scikit-learn package to do so. CountVectorizer splits the input text into tokens and builds candidates according to an n-gram range, which sets how many words a key word / phrase may contain. If we choose (1,3), for example, we are telling KeyBERT and CountVectorizer to look for possible key words / phrases that contain between 1 and 3 words.
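To make the candidate step concrete, here is a minimal pure-Python sketch of n-gram candidate generation. It is a simplified stand-in for CountVectorizer, not the scikit-learn implementation; the function name `candidate_phrases` and the sample sentence are assumptions made for illustration. The token pattern mirrors CountVectorizer's default of keeping tokens with two or more word characters.

```python
import re

def candidate_phrases(text, ngram_range=(1, 1), stop_words=()):
    """Return the unique word n-grams of `text`, in order of first appearance.

    Hypothetical helper: a simplified imitation of what CountVectorizer does.
    """
    stop = set(stop_words)
    # Lowercase, keep tokens of 2+ word characters, then drop stop words.
    tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in stop]
    low, high = ngram_range
    seen = {}  # dict preserves insertion order and removes duplicates
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            seen.setdefault(" ".join(tokens[i:i + n]), None)
    return list(seen)

phrases = candidate_phrases(
    "Supervised learning maps an input to an output",
    ngram_range=(1, 2),
    stop_words={"an", "to"},
)
print(phrases)
```

Because stop words are removed before the n-grams are formed, the sketch produces phrases such as "maps input" that never appear verbatim in the text.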

### Extract Embeddings

From there, we first use BERT to extract the document embedding: we take the entire document and create a single vector that represents it. We then compute the word embeddings, where each candidate word / phrase of the document gets its own vector. How many words a phrase may contain is controlled by the n-gram range described above, so we can extract not only key words but also key phrases (multiple words).

KeyBERT handles this step through a package called Sentence Transformers, which computes vectors that represent the meaning of text; these vectors are the embeddings. The main benefit of using Sentence Transformers is that it can compute a vector that represents the meaning of a complete document, whereas something like Word2Vec can only be used for single words.

### Comparisons

Finally, we take the cosine similarity between the embedding of each word / phrase and the document embedding. The words / phrases with the highest cosine similarity are the ones that are most similar to the input document(s).
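The comparison step can be sketched in a few lines of plain Python. The vectors below are made-up 3-dimensional toy embeddings, not real model output, and the words `alpha`, `beta`, and `gamma` are placeholders; the point is only to show how candidates are ranked against the document vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (assumed values for illustration only).
doc_vec = [1.0, 1.0, 0.0]
candidates = {
    "alpha": [1.0, 1.0, 0.0],  # points the same way as the document
    "beta": [1.0, 0.0, 0.0],   # 45 degrees away from the document
    "gamma": [0.0, 0.0, 1.0],  # orthogonal to the document
}

# Score every candidate against the document and sort from most to least similar.
ranked = sorted(
    ((word, round(cosine_similarity(vec, doc_vec), 4)) for word, vec in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Taking the first few entries of `ranked` corresponds to keeping the top keywords.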

### Model Note

By default, KeyBERT uses the all-MiniLM-L6-v2 model if no other model is passed to the initializer.


### Basic Example

In [1]:
from keybert import KeyBERT

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal).
         A supervised learning algorithm analyzes the training data and produces an inferred function,
         which can be used for mapping new examples. An optimal scenario will allow for the
         algorithm to correctly determine the class labels for unseen instances. This requires
         the learning algorithm to generalize from the training data to unseen situations in a
         'reasonable' way (see inductive bias).
      """

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)

keywords

[('supervised', 0.6676),
 ('labeled', 0.4896),
 ('learning', 0.4813),
 ('training', 0.4134),
 ('labels', 0.3947)]

### Explanation of Values

With the most basic call to extract_keywords, we are given 5 words, each with a value associated with it. That value is the cosine similarity between the word / phrase embedding vector and the input document vector, and it represents how similar the word is to the original input document. Cosine similarity ranges from -1 to 1, and in this case, a higher value means higher similarity between the word embedding and the document embedding.
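For reference, the standard definition of cosine similarity is shown here. The symbols are my own notation: $\mathbf{d}$ stands for the document embedding and $\mathbf{w}$ for a word / phrase embedding.

$$
\cos(\theta) = \frac{\mathbf{d} \cdot \mathbf{w}}{\lVert \mathbf{d} \rVert \, \lVert \mathbf{w} \rVert}, \qquad -1 \le \cos(\theta) \le 1
$$

A value of 1 means the two vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions.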

## KeyBERT Methods

The KeyBERT API consists of two main methods, extract_embeddings(...) and extract_keywords(...). The names are fairly self-explanatory: extract_embeddings extracts the document embedding for the original document as well as the word embeddings for all of the candidate key words / phrases, and extract_keywords returns the key words / phrases themselves.
    

## Extract Embeddings

The extract_embeddings method is used to extract document and word embeddings. It returns two items, doc_embeddings and word_embeddings, where doc_embeddings holds the document embedding for each input document, and word_embeddings holds the embeddings of the POTENTIAL key words / phrases from the given input document(s).

### Extract Embeddings Example

In [24]:
doc_embeddings, word_embeddings = kw_model.extract_embeddings(doc)
print(doc_embeddings)
print(doc_embeddings.shape)

[[-6.59579635e-02 -2.62582451e-02 -5.84359877e-02  2.30566636e-02
   8.50326121e-02  4.17129025e-02  3.69997546e-02 -6.58201650e-02
  -3.87015156e-02 -2.95363809e-03 -3.30499262e-02 -1.44510521e-02
   5.30893803e-02  4.53584604e-02 -3.71540338e-02  3.82542983e-02
   8.76147598e-02 -8.54388159e-03 -2.05052681e-02 -1.00440502e-01
   3.98336798e-02  2.59869397e-02 -4.42725830e-02  5.32478541e-02
  -4.35705557e-02  6.08860105e-02  3.51422019e-02  1.28424191e-03
  -8.47642217e-03 -3.32369134e-02  2.45928876e-02 -4.37021479e-02
   1.95506550e-02 -2.32752077e-02 -7.13654310e-02  2.95184217e-02
  -5.31128272e-02  8.29254314e-02  1.79729536e-02 -4.40264381e-02
   6.71629189e-03  3.34282182e-02 -2.02634130e-02  7.17922486e-03
   6.06902204e-02  7.60842115e-02  2.87512783e-02 -5.89286201e-02
  -9.60301980e-02  4.31644022e-02 -8.03859383e-02 -2.43355948e-02
  -5.58135509e-02  3.92430387e-02 -3.63611020e-02 -1.13825584e-02
   4.60491255e-02 -5.44759743e-02 -3.21346484e-02  6.92857802e-02
   4.05919

As we can see, doc_embeddings represents the entire document embedding for the input document, and we can also see that it has a shape of (1, 384), where 1 represents the one input document, and 384 represents the length of the document embedding vector.

In [25]:
print(word_embeddings.shape)

(50, 384)


The above shape of the word embeddings object tells us that there are 50 potential keywords from the document above that have been extracted and converted into vector embeddings. 

### Extract Embeddings -> Extract Keywords

The doc_embeddings and word_embeddings objects can also be passed to extract_keywords as parameters. The result is the same as if you only passed in the original document, but passing in the embeddings gives us the option to compute them once and reuse them, or to calculate them in a different way than the default.

In [12]:
keywords = kw_model.extract_keywords(doc, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)
print(keywords)

[('supervised', 0.6676), ('labeled', 0.4896), ('learning', 0.4813), ('training', 0.4134), ('labels', 0.3947)]


As we can see, the keywords here are the same as above. We can either compute the document and word embeddings ourselves and pass them to extract_keywords, or simply call extract_keywords and let it take care of the embeddings on its way to extracting the keywords.

## Extract Embeddings Parameter Notes

Here is the method definition of the extract embeddings method:

extract_embeddings(self, docs, candidates=None, keyphrase_ngram_range=(1, 1), stop_words='english', min_df=1, vectorizer=None)

The docs parameter is required, as it indicates what text to extract keywords from. All of the other parameters are optional and have different use cases:

- vectorizer: if a CountVectorizer (sklearn.feature_extraction.text.CountVectorizer) is passed here, then all other parameters besides docs are not used.
- candidates: a list of potential key words / phrases to use instead of extracting them from the document. This limits the keywords to whatever is passed in, and the entries of this list with the highest cosine similarity to the input document are extracted as its keywords.
- keyphrase_ngram_range: the minimum and maximum number of words for key words / phrases. For example, if you set the range to (1,2), KeyBERT will only look for key words and phrases that contain 1 or 2 words.
- stop_words: the stop words to filter out, given either as a list of words or as the name of a built-in list such as 'english' (the default).
- min_df: the minimum number of documents a word has to appear in for it to be considered as a keyword. This is a document frequency, so it matters when keywords are extracted for multiple documents at once.
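As a small illustration of min_df, which in scikit-learn's CountVectorizer is a document frequency (the number of documents that contain a word, not the total number of occurrences), here is a hedged pure-Python sketch. The helper `filter_by_min_df` and the tiny token lists are hypothetical and only imitate the filtering idea.

```python
from collections import Counter

def filter_by_min_df(tokenized_docs, min_df=1):
    """Keep words that appear in at least `min_df` documents (hypothetical helper)."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # set() makes sure a word is counted once per document,
        # no matter how often it repeats inside that document.
        doc_freq.update(set(tokens))
    return sorted(word for word, df in doc_freq.items() if df >= min_df)

# Three toy "documents", already tokenized (assumed data).
docs = [
    ["covid", "patients", "covid"],
    ["covid", "vaccine"],
    ["patients", "hospital"],
]

kept = filter_by_min_df(docs, min_df=2)
print(kept)
```

Here "covid" occurs three times in total but only in two documents, so it is its document count of 2 that lets it pass the min_df=2 filter.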


## Extract Keywords

The extract_keywords method is used to extract key words or key phrases, and you can pass in multiple documents at once as input. It compares the word embeddings against the document embedding and returns the top n words / phrases with the highest cosine similarity to the input document, where n defaults to 5.

A basic example of the extract_keywords method is shown above, so here are the parameters we can use to further customize and narrow down what we want from the method:

## Extract Keywords Parameter Notes

Here is the method definition of the extract keywords method:

extract_keywords(self, docs, candidates=None, keyphrase_ngram_range=(1, 1), stop_words='english', top_n=5, min_df=1, use_maxsum=False, use_mmr=False, diversity=0.5, nr_candidates=20, vectorizer=None, highlight=False, seed_keywords=None, doc_embeddings=None, word_embeddings=None)

The docs, candidates, keyphrase_ngram_range, stop_words, min_df, and vectorizer parameters work the same way as in the extract_embeddings method, so please refer to the notes on the extract_embeddings parameters above. The top_n parameter sets the number of keywords / phrases to return, so if you want the top 10 keywords, you would set top_n equal to 10. The highlight parameter is a more visual feature: it prints the document text with the extracted key words or phrases highlighted. The seed_keywords parameter is a list of words that you want the extracted keywords to be similar to, so if you want to guide the keywords in the direction of medicine, for example, you might pass in words such as medicine, tylenol, and fever.

As mentioned earlier, extract_keywords also accepts document embeddings and word embeddings, and both of these should be generated by the extract_embeddings method. When doing so, the candidate-related parameters (candidates, keyphrase_ngram_range, stop_words, min_df, and vectorizer) should be the same in both calls so that the word embeddings line up with the candidate words.

The remaining four parameters need a little more explanation. The use_maxsum parameter is a boolean that decides whether or not to use Max Sum Distance for keyword / keyphrase selection, and the nr_candidates parameter decides the number of candidates to consider when use_maxsum is set to True. The use_mmr parameter is a boolean that decides whether or not to use Maximal Marginal Relevance for keyword / keyphrase selection, and the diversity parameter, which is set between 0 and 1, controls how diverse the results are when use_mmr is set to True. Both techniques are used to diversify the results so that we get different key words / phrases instead of just variations of the same word.


## Max Sum Distance

Max Sum Distance tries to increase diversity in the pool of keywords that are extracted. We first pool together the candidate words / phrases that are most similar to the document, and let's call this list A. The KeyBERT documentation describes this pool as the 2 x top_n most similar candidates (top_n defaults to 5); in the method signature above, the pool size is set by the nr_candidates parameter, which defaults to 20. We then take every combination of top_n words from A and add up the pairwise cosine similarities between the words in each combination. In the end, we extract the combination of top_n words that are the least similar to each other.
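The selection procedure can be sketched in plain Python as follows. This is an illustrative re-implementation under stated assumptions, not KeyBERT's own code: the function name and signature are mine, and the 2-dimensional vectors are toy values chosen so that the effect is easy to see.

```python
import itertools
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def max_sum_distance(doc_vec, word_vecs, words, top_n=2, nr_candidates=3):
    """Pick the `top_n` least mutually similar words out of the
    `nr_candidates` words that are most similar to the document."""
    doc_sims = [cosine_similarity(v, doc_vec) for v in word_vecs]
    # Pool: indices of the nr_candidates words closest to the document.
    pool = sorted(range(len(words)), key=lambda i: doc_sims[i], reverse=True)[:nr_candidates]
    if top_n > len(pool):
        return []
    best_combo, best_score = None, float("inf")
    for combo in itertools.combinations(pool, top_n):
        # Sum of pairwise similarities inside the combination: lower = more diverse.
        score = sum(
            cosine_similarity(word_vecs[i], word_vecs[j])
            for i, j in itertools.combinations(combo, 2)
        )
        if score < best_score:
            best_combo, best_score = combo, score
    return [(words[i], round(doc_sims[i], 4)) for i in best_combo]

# Toy data (assumed values for illustration only).
words = ["supervised", "supervision", "labels", "noise"]
word_vecs = [[1.0, 0.9], [1.0, 0.8], [0.6, 1.0], [-1.0, 0.2]]
doc_vec = [1.0, 1.0]

result = max_sum_distance(doc_vec, word_vecs, words, top_n=2, nr_candidates=3)
print(result)
```

In this toy setup the two words closest to the document are "supervised" and "supervision", but the sketch returns "supervision" and "labels" because that pair is the least similar to each other within the pool. Note that the number of combinations grows quickly with the pool size, so large values of nr_candidates make this step slow.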

In [12]:
keywords_ms = kw_model.extract_keywords(doc, use_maxsum=True)

keywords_ms

[('inductive', 0.2577),
 ('bias', 0.2644),
 ('function', 0.2658),
 ('supervisory', 0.3297),
 ('labels', 0.3947)]

In this example, compared to the keywords extracted without Max Sum Distance, only labels is in common; the other 4 words are different, providing the diversity we wanted. These words were found by pooling the candidates most similar to the document (nr_candidates, which defaults to 20), forming combinations of 5 words each (as top_n defaults to 5), and extracting the combination whose words are the least similar to each other based on cosine similarity, which gives us the result we see above.

## Maximal Marginal Relevance

Maximal Marginal Relevance (MMR) also tries to maximize the diversity of the keywords chosen by making sure the selected keywords are not too similar to each other. The amount of diversity depends on the diversity value passed in, and the default value is 0.5. MMR takes into account how similar each potential key word / phrase is to the key words / phrases that have already been selected. By doing so, it avoids extracting keywords that are too similar to each other, resulting in more variety rather than words that are just variations of each other.
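A minimal sketch of the MMR idea in plain Python is shown below. It is an illustrative re-implementation, not KeyBERT's own code: the function name, signature, and the toy 2-dimensional vectors are assumptions. Each remaining candidate is scored as (1 - diversity) times its similarity to the document, minus diversity times its highest similarity to any keyword already selected.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(doc_vec, word_vecs, words, top_n=2, diversity=0.5):
    """Greedy Maximal Marginal Relevance selection (illustrative sketch)."""
    doc_sims = [cosine_similarity(v, doc_vec) for v in word_vecs]
    # Start with the single word that is most similar to the document.
    selected = [max(range(len(words)), key=lambda i: doc_sims[i])]
    remaining = [i for i in range(len(words)) if i != selected[0]]

    def score(i):
        # Redundancy: similarity to the closest keyword chosen so far.
        redundancy = max(cosine_similarity(word_vecs[i], word_vecs[j]) for j in selected)
        return (1 - diversity) * doc_sims[i] - diversity * redundancy

    while remaining and len(selected) < top_n:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [(words[i], round(doc_sims[i], 4)) for i in selected]

# Toy data (assumed values for illustration only).
words = ["supervised", "supervision", "labels"]
word_vecs = [[1.0, 0.9], [1.0, 0.8], [0.6, 1.0]]
doc_vec = [1.0, 1.0]

low_diversity = mmr(doc_vec, word_vecs, words, top_n=2, diversity=0.25)
high_diversity = mmr(doc_vec, word_vecs, words, top_n=2, diversity=0.5)
print(low_diversity)
print(high_diversity)
```

With the lower diversity value the sketch keeps the near-duplicate pair "supervised" and "supervision", while the higher value swaps the second keyword for "labels".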

In [13]:
keywords_mmr = kw_model.extract_keywords(doc, use_mmr=True)

keywords_mmr

[('supervised', 0.6676),
 ('training', 0.4134),
 ('function', 0.2658),
 ('bias', 0.2644),
 ('inductive', 0.2577)]

In the very first example above, labeled and labels were both keywords. In this example, since we use MMR, those words are excluded because they are too similar to one another and to the keywords already selected, resulting in a more diverse key word pool.

In [5]:
keywords_mmr_25 = kw_model.extract_keywords(doc, use_mmr=True, diversity=0.25)

keywords_mmr_25

[('supervised', 0.6676),
 ('labeled', 0.4896),
 ('learning', 0.4813),
 ('training', 0.4134),
 ('supervisory', 0.3297)]

If we lower the value of diversity to 0.25 as the example above shows, we get less diverse keywords. Compared to the previous example with the default diversity of 0.5, these words are closer to each other in meaning. For example, we get supervised and supervisory in this example, which are just variations of the same word, and this is a result of the decreased diversity value.

## Using KeyBERT to extract keywords from a real dataset

The URL to the dataset I will be using for this tutorial is here: https://www.kaggle.com/datasets/anandhuh/covid-abstracts?resource=download. This dataset contains 10,000 research papers centering around COVID-19, where each entry contains a title, abstract, and the url of the paper. Below I will be using KeyBERT to extract keywords from this COVID-19 Research Paper dataset.

In [2]:
import pandas

csvFile = pandas.read_csv('./data/covid_abstracts.csv')

print(len(csvFile))

10000


First we want to read in the data from the file, and we can use the read_csv method from the pandas package to easily read in all the entries. We then want to make sure that all the entries have been loaded. Since we know there are 10,000 entries in the dataset, we can check the length of the read-in data, which I called csvFile, and see that it also has a length of 10,000, meaning no rows were lost when reading the file.

In [3]:
# Collect the abstracts of all 10,000 research papers into a list
abstracts = csvFile['abstract'].tolist()

print(len(abstracts))

10000


As mentioned above, each record in our dataset contains a title, an abstract, and the url of the paper. The url is irrelevant in this case, and between the title and the abstract, I believe the abstract contains more overall information about the paper and provides more words to consider for keyword extraction, which is why I chose to focus only on the abstracts. To do this, I put the abstracts of all papers into a list and then check that its length is still 10,000. As we can see from above, that is indeed the case, so the abstract extraction was successful.

In [7]:
keywords_covid = kw_model.extract_keywords(abstracts[0])

keywords_covid

[('covid', 0.473),
 ('hospitalization', 0.3638),
 ('patients', 0.2883),
 ('diagnosed', 0.2421),
 ('dakota', 0.2278)]

Here is an initial example where we extract the keywords from the first abstract. This tells us that this paper likely focuses on COVID-19 hospitalization and dealing with patients who have been diagnosed with COVID-19.

In [4]:
first_10 = abstracts[:10]
keywords_10 = kw_model.extract_keywords(first_10)

keywords_10

[[('covid', 0.473),
  ('hospitalization', 0.3638),
  ('patients', 0.2883),
  ('diagnosed', 0.2421),
  ('dakota', 0.2278)],
 [('coronavirus', 0.4481),
  ('immunity', 0.3374),
  ('covid', 0.3309),
  ('immunosuppression', 0.3027),
  ('immune', 0.2837)],
 [('oncology', 0.3354),
  ('resilience', 0.2778),
  ('covid', 0.2625),
  ('pandemic', 0.2533),
  ('distress', 0.2459)],
 [('coronavirus', 0.3736),
  ('covid', 0.3381),
  ('clustering', 0.3139),
  ('classifiers', 0.2915),
  ('classification', 0.2868)],
 [('nasopharyngeal', 0.4071),
  ('cov', 0.3387),
  ('nasal', 0.3347),
  ('covid', 0.3325),
  ('infections', 0.3122)],
 [('hiv', 0.3844),
  ('opioid', 0.3425),
  ('outpatient', 0.31),
  ('buprenorphine', 0.2951),
  ('comorbid', 0.2854)],
 [('pandemic', 0.3228),
  ('internet', 0.3161),
  ('surveys', 0.3038),
  ('digital', 0.2913),
  ('digitalization', 0.2844)],
 [('receptor', 0.3791),
  ('ace2', 0.338),
  ('mutations', 0.3016),
  ('molecular', 0.299),
  ('virus', 0.2677)],
 [('coronavirus', 0.4

Here is what happens when we take the first 10 abstracts and extract the keywords of each paper. As expected, each paper has different keywords, although most of them are more or less centered around COVID. This is useful for extracting the keywords of individual documents, but what if we want a general idea of what the whole set of papers is about?

In [5]:
set_10 = ""
for abstract in first_10:
    set_10 += abstract + " "

keywords_set = kw_model.extract_keywords(set_10)

keywords_set

[('covid', 0.473),
 ('hospitalized', 0.411),
 ('hospitalization', 0.3638),
 ('coronavirus', 0.3614),
 ('pandemic', 0.3234)]

In that case, we can combine the abstracts into one mega abstract that contains all 10 abstracts from above and run extract_keywords on that combined string. The result shows us that the main focus of these papers is the COVID-19 pandemic and the hospitalizations that occurred as a result of it. This can be useful when going through a set of similar papers where you want an initial idea of what they have in common without having to read through all of them first.

In [19]:
keywords_set_mmr = kw_model.extract_keywords(set_10, use_mmr=True)

keywords_set_mmr

[('covid', 0.473),
 ('hospitalization', 0.3638),
 ('comorbidities', 0.2695),
 ('dakota', 0.2278),
 ('faulkton', 0.1662)]

These are the keywords we get when using the Maximal Marginal Relevance algorithm. As we can see, the words are not closely related to each other, whereas the previous example contained hospitalized and hospitalization, which are very closely related. This shows us that the diversity of the key words was indeed increased, as words that were very similar to each other were left out as a result of applying MMR.

In [24]:
keywords_set_ms = kw_model.extract_keywords(set_10, use_maxsum=True)

keywords_set_ms

[('dakota', 0.2278),
 ('outpatient', 0.2489),
 ('vaccination', 0.2608),
 ('comorbidities', 0.2695),
 ('coronavirus', 0.3614)]

By using Max Sum Distance, we get the 5 keywords (top_n defaults to 5) that are the least similar to each other in terms of cosine similarity. They are chosen by comparing combinations of 5 keywords drawn from an initial pool of the candidates most similar to the document (nr_candidates, which defaults to 20).

In [21]:
keywords_set_range = kw_model.extract_keywords(set_10, keyphrase_ngram_range=(1,2))

keywords_set_range

[('patients covid', 0.5816),
 ('covid 19', 0.5577),
 ('severity covid', 0.5415),
 ('2019 covid', 0.534),
 ('pandemic covid', 0.4889)]

Another use case for extract_keywords is extracting key phrases, where we set the range of words a phrase may contain. In the example above, we want key phrases that are either 1 or 2 words long. Since we do not use a diversification algorithm like MMR or Max Sum Distance, the key phrases we get are still similar to each other, but the main idea is that we can allow phrases instead of just key words by adjusting the keyphrase_ngram_range parameter.

In [23]:
keywords_set_12 = kw_model.extract_keywords(set_10, top_n=12)

keywords_set_12

[('covid', 0.473),
 ('hospitalized', 0.411),
 ('hospitalization', 0.3638),
 ('coronavirus', 0.3614),
 ('pandemic', 0.3234),
 ('clinic', 0.3019),
 ('patients', 0.2883),
 ('cov', 0.2788),
 ('comorbidities', 0.2695),
 ('vaccination', 0.2608),
 ('clinical', 0.2581),
 ('illnesses', 0.2517)]

In this example, we adjust the top_n parameter in the extract keywords method, thus allowing us to change the number of keywords that we want to return. The default value for this is 5, and in this case, I chose to set this value to 12, giving us 12 keywords. This adjustment can be useful when you want more keywords for a document, which can give you a better idea of what the document is about as opposed to if you had fewer keywords to go off of.

In [26]:
keywords_set_filter = kw_model.extract_keywords(set_10, stop_words=['covid', '19'], keyphrase_ngram_range=(1,2))

keywords_set_filter

[('facility coronavirus', 0.4556),
 ('coronavirus pandemic', 0.4503),
 ('hospitalized participants', 0.4479),
 ('hospitalization rates', 0.411),
 ('hospitalized', 0.411)]

We can also pass in our own stop words. In this example, I filtered out covid and 19 to see what other key phrases we may have, since the previous key phrase example was dominated by these two terms. Note that passing a list replaces the default 'english' stop word list rather than adding to it. As a result, we get a bit more diversity in the keywords, although they are still relatively similar, and the stop words we added are indeed filtered out. This can be useful when you have a general idea of what a document is about and you are looking for other keywords to refine your idea of the document.

## KeyBERT Limitations

While KeyBERT is a very nice tool for extracting keywords, like every other tool out there, it does have its limitations. One of them is execution time: running the KeyBERT methods can take a while, especially on large documents of text. This is due to the BERT models that KeyBERT uses to compute the embeddings. By default, KeyBERT uses the all-MiniLM-L6-v2 model, but another BERT model can be passed into the KeyBERT initializer instead. These models can be large and take up a lot of computing resources, even more so when the input documents are larger, so depending on the scale of the project, KeyBERT can be a good tool to use, but it can also limit you in some ways.

Another limitation of KeyBERT is the coherence of the key phrases. When producing key phrases, KeyBERT finds phrases that are similar to the contents of the document via cosine similarity, but sometimes the phrases themselves might not make the most sense. For example, take a look at this output from earlier on:

[('facility coronavirus', 0.4556),
 ('coronavirus pandemic', 0.4503),
 ('hospitalized participants', 0.4479),
 ('hospitalization rates', 0.411),
 ('hospitalized', 0.411)]
 
For the most part, these phrases make sense, but the first one, facility coronavirus, is not very coherent. It would make much more sense if the words were switched to make 'coronavirus facility'. Nothing in the KeyBERT process flow checks for coherence, as it mostly just focuses on embedding vectors, so that could be something added in the future.

## Conclusion

Overall, KeyBERT is a great library with useful methods for keyword and keyphrase extraction. It allows users to pass in various BERT models based on whatever model they need to use. With that model, KeyBERT can extract the document and word embeddings of the given input document(s), as well as the keywords of said document(s). Each method has parameters that let users control exactly what they want to extract; in particular, extract_keywords offers use_mmr and use_maxsum to increase keyword diversity. That being said, KeyBERT still has some limitations, such as execution time, which depends on the input documents and the model chosen, and the coherence of the key phrases selected. Even so, I would say KeyBERT is a great tool for key word and key phrase extraction, and it has many useful features for users to take advantage of as they push forward on their keyword extraction journey.

## Sources

1. [KeyBERT GitHub](https://maartengr.github.io/KeyBERT/index.html)
2. [How to Extract Relevant Keywords with KeyBERT](https://towardsdatascience.com/how-to-extract-relevant-keywords-with-keybert-6e7b3cf889ae)
3. [KeyBERT vs YAKE](https://github.com/MaartenGr/KeyBERT/issues/25)
4. [Keyword Extraction With KeyBERT](https://www.vennify.ai/keybert-keyword-extraction/)