# KeyBERT Review

### Derrick Luyen
### ICS 691B Final Project

In [1]:
# Installation
# %pip install keybert

Collecting keybert
  Using cached keybert-0.7.0.tar.gz (21 kB)
  Preparing metadata (setup.py) ... done
Collecting rich>=10.4.0
  Using cached rich-12.6.0-py3-none-any.whl (237 kB)
Collecting commonmark<0.10.0,>=0.9.0
  Using cached commonmark-0.9.1-py2.py3-none-any.whl (51 kB)
Building wheels for collected packages: keybert
  Building wheel for keybert (setup.py) ... done
  Created wheel for keybert: filename=keybert-0.7.0-py3-none-any.whl size=23776 sha256=2d78b06961a9d1e26f99d390a95a0e26a3448ba3636aa943cb8c056cc8488e4d
  Stored in directory: /Users/derrickluyen/Library/Caches/pip/wheels/66/8d/e6/b0e2f8d883b0fd51819226f67ad9843e04913ce4a97241ff4b
Successfully built keybert
Installing collected packages: commonmark, rich, keybert
Successfully installed commonmark-0.9.1 keybert-0.7.0 rich-12.6.0
Note: you may need to restart the kernel to use updated packages.


## What is KeyBERT?

- Keyword extraction
    - the task of automatically identifying the words / phrases that best describe a document
- KeyBERT takes advantage of BERT embeddings to get the key words / phrases
- it extracts the words / phrases that are most similar to the content / meaning of the document


### Basic Example

In [1]:
from keybert import KeyBERT

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal).
         A supervised learning algorithm analyzes the training data and produces an inferred function,
         which can be used for mapping new examples. An optimal scenario will allow for the
         algorithm to correctly determine the class labels for unseen instances. This requires
         the learning algorithm to generalize from the training data to unseen situations in a
         'reasonable' way (see inductive bias).
      """

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)

keywords

[('supervised', 0.6676),
 ('labeled', 0.4896),
 ('learning', 0.4813),
 ('training', 0.4134),
 ('labels', 0.3947)]

### Explanation of Values

We see that the most basic call to extract_keywords returns 5 words, each paired with a score. The score is the cosine similarity between the word's embedding and the embedding of the whole document, so it measures how close the word is in meaning to the document. Cosine similarity ranges from -1 to 1, and a higher value means the word is more similar to the document.
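
To make the score concrete, here is a minimal sketch of cosine similarity on hand-made vectors. The vectors and the `cosine_similarity` helper are made up for illustration and are not part of KeyBERT; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (assumed values, for illustration only)
doc_vec = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]   # points the same way as doc_vec
orthogonal = [0.0, 0.0, 3.0]       # shares no direction with doc_vec
opposite = [-1.0, -2.0, 0.0]       # points the opposite way

print(cosine_similarity(doc_vec, same_direction))  # close to 1: most similar
print(cosine_similarity(doc_vec, orthogonal))      # 0: unrelated
print(cosine_similarity(doc_vec, opposite))        # close to -1: least similar
```

Only the direction of the vectors matters, not their length, which is why `same_direction` scores the same as an exact copy of `doc_vec` would.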

## Underlying Workings of KeyBERT

- Input: document(s) to extract keywords from
- Step 1: embed each document with BERT to get a document-level representation (the document embedding)
- Step 2: extract candidate n-gram words / phrases from the document and embed each one with the same model (the word embeddings)
    - an n-gram here is a contiguous sequence of n words
- Step 3: compute the cosine similarity between each word embedding and the document embedding to find out which words / phrases are the most similar to the original document
    - allows us to extract key words / phrases
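
The three steps can be sketched in plain Python. This is only an illustration of the flow, not KeyBERT's implementation: `toy_embed` (character trigram counts) is a stand-in for the BERT encoder, and `STOP_WORDS`, `candidate_ngrams`, `cosine`, and `rank_keywords` are hypothetical names introduced here.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not KeyBERT's list)
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "from", "that", "it", "in", "on"}

def candidate_ngrams(text, ngram_range=(1, 1)):
    """Collect the unique n-grams of non-stop-word tokens as keyword candidates."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]
    low, high = ngram_range
    candidates = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram not in candidates:
                candidates.append(gram)
    return candidates

def toy_embed(text):
    """Stand-in for a BERT encoder: a sparse vector of character trigram counts."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def rank_keywords(doc, embed=toy_embed, ngram_range=(1, 1), top_n=5):
    doc_vec = embed(doc)                              # Step 1: document embedding
    candidates = candidate_ngrams(doc, ngram_range)   # Step 2: candidate words / phrases
    scored = [(c, round(cosine(embed(c), doc_vec), 4)) for c in candidates]  # Step 3
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

print(rank_keywords("Supervised learning is the machine learning task of learning a function", top_n=3))
```

Because the toy embedding only looks at spelling, its scores say nothing about meaning; swapping in a real sentence encoder for `embed` is what makes the ranking semantic.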
   

## KeyBERT Methods

- KeyBERT API consists of two main methods:
    - extract_embeddings(...)
    - extract_keywords(...)
    

## Extract Embeddings

extract_embeddings(self, docs, candidates=None, keyphrase_ngram_range=(1, 1), stop_words='english', min_df=1, vectorizer=None)

- used to extract document and word embeddings (Step 1 + Step 2 from earlier)
    - document embeddings: one embedding per input document
    - word embeddings: the embeddings of the candidate key words / phrases
    
- Returns:
    - doc_embeddings: document embeddings for each input document
    - word_embeddings: embeddings of POTENTIAL key words / phrases from the given input document(s)

### Extract Embeddings Example

In [10]:
doc_embeddings, word_embeddings = kw_model.extract_embeddings(doc)
print(doc_embeddings) # represents the entire document embedding for the input doc

[[-6.59579635e-02 -2.62582451e-02 -5.84359877e-02  2.30566636e-02
   8.50326121e-02  4.17129025e-02  3.69997546e-02 -6.58201650e-02
  -3.87015156e-02 -2.95363809e-03 -3.30499262e-02 -1.44510521e-02
   5.30893803e-02  4.53584604e-02 -3.71540338e-02  3.82542983e-02
   8.76147598e-02 -8.54388159e-03 -2.05052681e-02 -1.00440502e-01
   3.98336798e-02  2.59869397e-02 -4.42725830e-02  5.32478541e-02
  -4.35705557e-02  6.08860105e-02  3.51422019e-02  1.28424191e-03
  -8.47642217e-03 -3.32369134e-02  2.45928876e-02 -4.37021479e-02
   1.95506550e-02 -2.32752077e-02 -7.13654310e-02  2.95184217e-02
  -5.31128272e-02  8.29254314e-02  1.79729536e-02 -4.40264381e-02
   6.71629189e-03  3.34282182e-02 -2.02634130e-02  7.17922486e-03
   6.06902204e-02  7.60842115e-02  2.87512783e-02 -5.89286201e-02
  -9.60301980e-02  4.31644022e-02 -8.03859383e-02 -2.43355948e-02
  -5.58135509e-02  3.92430387e-02 -3.63611020e-02 -1.13825584e-02
   4.60491255e-02 -5.44759743e-02 -3.21346484e-02  6.92857802e-02
   4.05919

In [11]:
print(word_embeddings.shape) # (50, 384): 50 potential keywords from the document above, each a 384-dimensional embedding

(50, 384)


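doc_embeddings and word_embeddings can then be passed to extract_keywords as parameters. The parameters that control candidate selection (candidates, keyphrase_ngram_range, stop_words, min_df, vectorizer) should be the same in both calls so that the word embeddings line up with the candidates.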

In [12]:
keywords = kw_model.extract_keywords(doc, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)
print(keywords)

[('supervised', 0.6676), ('labeled', 0.4896), ('learning', 0.4813), ('training', 0.4134), ('labels', 0.3947)]



As we can see, the keywords here are the same as above. We can either compute the document and word embeddings first and pass them to extract_keywords, or call extract_keywords directly and let it compute the embeddings itself.

## Extract Keywords

extract_keywords(self, docs, candidates=None, keyphrase_ngram_range=(1, 1), stop_words='english', top_n=5, min_df=1, use_maxsum=False, use_mmr=False, diversity=0.5, nr_candidates=20, vectorizer=None, highlight=False, seed_keywords=None, doc_embeddings=None, word_embeddings=None)

- method used to extract key words or key phrases
    - can pass in multiple documents at once
- uses cosine similarity to find the words / phrases closest to the entire document(s)

- returns: the top n keywords with the highest cosine similarity to the input document (top_n defaults to 5)

Basic example is shown above, so here are some examples of the parameters we can use:
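
A minimal sketch of a few parameter combinations, using only parameters from the signature above. The helper `extract_with_variants` and the chosen values (for example `diversity=0.7`) are illustrative assumptions; `kw_model` is expected to be a KeyBERT instance.

```python
def extract_with_variants(kw_model, doc):
    """Call extract_keywords with a few parameter combinations and collect the results."""
    return {
        # top_n: how many keywords to return (default 5)
        "top_3": kw_model.extract_keywords(doc, top_n=3),
        # keyphrase_ngram_range=(1, 2): allow two-word phrases as well as single words
        "bigrams": kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words="english"),
        # stop_words=None: do not filter out stop words when building candidates
        "no_stop_word_filter": kw_model.extract_keywords(doc, stop_words=None),
        # use_mmr: Maximal Marginal Relevance; diversity controls how different the results are
        "mmr": kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), use_mmr=True, diversity=0.7),
        # use_maxsum: Max Sum Distance; picks top_n diverse keywords out of nr_candidates
        "maxsum": kw_model.extract_keywords(
            doc, keyphrase_ngram_range=(1, 2), use_maxsum=True, nr_candidates=20, top_n=5
        ),
    }

# Usage, assuming keybert is installed and doc is the string from the basic example:
# from keybert import KeyBERT
# results = extract_with_variants(KeyBERT(), doc)
# for name, keywords in results.items():
#     print(name, keywords)
```

When a list of documents is passed instead of a single string, extract_keywords returns one list of keywords per document.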


## Using KeyBERT to extract keywords from a real dataset

- Dataset URL: https://www.kaggle.com/datasets/anandhuh/covid-abstracts?resource=download
    - contains 10,000 research papers
        - each entry contains a title, abstract, and the url of the paper
- using KeyBERT to extract keywords from a COVID-19 Research Paper dataset

In [5]:
import pandas

csvFile = pandas.read_csv('./data/covid_abstracts.csv')

print(len(csvFile))

10000


In [16]:
abstracts = csvFile['abstract'].tolist() # one entry per row of the CSV

print(len(abstracts)) # contains all abstracts of the 10,000 research papers

10000
