In [1]:
# Sample Title & Text (Abstract)
title = "An Importance of Using Virtualization Technology in Cloud Computing"
text = "The past few decades have seen many changing trends in distributed computing systems in the form of grid, cloud, utility and cluster. With the advent in technology, there has been an increase in the demand for deployment of a robust distributed network for maximizing the performance of such systems and minimizing the infrastructural cost. In this paper we have discussed various levels at which virtualization can be implemented for distributed computing which can contribute to the increased efficiency and performance of distributed computing."

<b>KeyBERT</b>

KeyBERT uses BERT (Bidirectional Encoder Representations from Transformers) embeddings to find the keywords and keyphrases that are most similar to the document as a whole.

The BERT model is a pre-trained neural network that generates contextualized embeddings. The cell below loads the sentence-transformers model all-mpnet-base-v2 for this purpose.

Working:
1. Extracting candidate words and phrases (n-grams) from the input text, with stop words removed.
2. Embedding the whole document and each candidate with the BERT-based model.
3. Computing the cosine similarity between each candidate embedding and the document embedding.
4. Selecting the top N candidates with the highest similarity to the input text.
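
KeyBERT ranks candidate phrases by how close their embedding is to the embedding of the whole document. The sketch below illustrates that ranking idea only. It is not KeyBERT's code: `toy_embed` is a hypothetical stand-in that uses word counts instead of a BERT embedding, and `STOP_WORDS`, `candidates`, and `rank_keywords` are illustrative names.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not KeyBERT's list)
STOP_WORDS = {"the", "of", "in", "for", "and", "a", "is", "to", "can", "be"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def candidates(tokens, ngram_range=(1, 2)):
    """All n-grams of the requested lengths, built after stop-word removal."""
    lo, hi = ngram_range
    found = set()
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            found.add(" ".join(tokens[i:i + n]))
    return sorted(found)


def toy_embed(text):
    """Stand-in for a BERT embedding: a sparse word-count vector."""
    return Counter(tokenize(text))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def rank_keywords(doc, top_n=3):
    """Score every candidate against the document and keep the best top_n."""
    doc_vec = toy_embed(doc)
    scored = [(c, cosine(toy_embed(c), doc_vec)) for c in candidates(tokenize(doc))]
    # Highest similarity first; ties broken alphabetically for determinism
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top_n]


top = rank_keywords("virtualization in cloud computing improves distributed computing")
for phrase, score in top:
    print(phrase, round(score, 4))
```

With a real sentence embedding in place of `toy_embed`, candidates that are semantically close to the document score highly even when they share few exact words with it.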

In [2]:
# pip install keybert
from keybert import KeyBERT

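# Embed the document and candidate phrases with a sentence-transformers model
model_keybert = KeyBERT(model='all-mpnet-base-v2')

# Rank unigrams and bigrams (English stop words removed) and keep the top 5
keywords = model_keybert.extract_keywords(
    text, keyphrase_ngram_range=(1, 2), stop_words='english', highlight=False, top_n=5
)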

for kw, s in keywords:
  print("Keyword: ",kw, ", Score", s)


Keyword:  distributed computing , Score 0.6487
Keyword:  virtualization implemented , Score 0.6468
Keyword:  virtualization , Score 0.6334
Keyword:  levels virtualization , Score 0.5964
Keyword:  grid cloud , Score 0.5326


<b>YAKE</b>

YAKE (Yet Another Keyword Extractor) is an unsupervised, statistical approach that scores terms using features computed from the input document alone. It needs no training corpus and no pre-trained model.

Working:

1. The input text is split into sentences and tokens. Stop words are not removed outright: a candidate phrase may contain a stop word, but may not begin or end with one (hence results such as "utility and cluster").

2. YAKE computes a set of statistical features for each word: its casing, its position in the text, its frequency, how many different words co-occur with it on either side (relatedness to context), and how many different sentences it appears in.

3. These features are combined into a single score per word. Unlike a similarity score, a lower YAKE score indicates a more important word.

4. Candidate n-grams are scored from the scores of their words, and YAKE returns the top N keywords or keyphrases with the lowest scores.
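
YAKE's final step combines per-word scores into a single keyphrase score, and lower values indicate more relevant phrases. The sketch below assumes the combination rule described in the YAKE paper: the product of the word scores divided by the phrase frequency times one plus their sum. The function names and the word scores are made-up illustrative values, not output from the library.

```python
from math import prod


def keyphrase_score(word_scores, phrase_frequency):
    """Combine per-word scores into one phrase score (lower is better)."""
    return prod(word_scores) / (phrase_frequency * (1 + sum(word_scores)))


def rank_phrases(scored_phrases, top=5):
    """Sort (phrase, score) pairs ascending, since low scores are most relevant."""
    return sorted(scored_phrases, key=lambda pair: pair[1])[:top]


# Hypothetical per-word scores and phrase frequencies, for illustration only
examples = [
    ("distributed computing", keyphrase_score([0.10, 0.20], phrase_frequency=2)),
    ("changing trends", keyphrase_score([0.30, 0.40], phrase_frequency=1)),
]
ranked = rank_phrases(examples)
for phrase, score in ranked:
    print(phrase, round(score, 4))
```

Dividing by the phrase frequency rewards phrases that recur in the text, which is why the more frequent phrase ends up with the lower (better) score here.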

In [3]:
# pip install git+https://github.com/LIAAD/yake
import yake

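# stopwords=None falls back to YAKE's built-in stop-word list for the language
model_yake = yake.KeywordExtractor(top=5, stopwords=None)
# Returns (keyphrase, score) pairs; a lower score means a more relevant keyphrase
keywords = model_yake.extract_keywords(text)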
for kw, s in keywords:
  print("Keyword: ",kw, ", Score", s)

Keyword:  utility and cluster , Score 0.015583732582705545
Keyword:  form of grid , Score 0.0191621476358869
Keyword:  distributed computing systems , Score 0.020036360637023316
Keyword:  past few decades , Score 0.02358133565825394
Keyword:  changing trends , Score 0.02358133565825394


<b>PKE</b>

PKE (Python Keyphrase Extraction) is a toolkit that implements several keyphrase extraction models. The example below uses TopicRank, an unsupervised graph-based model, to extract the most relevant keyphrases from a given text.

Working:

1. The input text is loaded into the PKE model using the load_document method, which also performs the pre-processing (sentence splitting, tokenization, and Part-Of-Speech tagging).

2. The candidate_selection method is used to select the candidate keyphrases from the text. For TopicRank, this method uses the Part-Of-Speech (POS) tags to extract the longest sequences of nouns and adjectives as candidate keyphrases.

3. The candidate_weighting method is used to calculate the weights of the candidate keyphrases. TopicRank clusters similar candidates into topics, builds a graph whose nodes are the topics, and ranks the topics with a random-walk algorithm. Each topic is then represented by one of its candidates.

4. The get_n_best method is used to select the top N keyphrases from the candidate keyphrases based on their scores. The method returns a list of tuples, where each tuple contains a keyphrase and its score.
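
The ranking step in a graph-based model such as TopicRank is a random walk over a weighted graph, in the style of PageRank. The sketch below shows that idea with a plain power iteration. It is not pke's implementation: `weighted_pagerank` is an illustrative name, and the edge weights in `topic_graph` are invented for the example rather than derived from the abstract.

```python
def weighted_pagerank(graph, damping=0.85, iterations=100):
    """Rank nodes of an undirected weighted graph by power iteration.

    graph maps each node to a dict {neighbour: edge_weight}; the weights
    are assumed symmetric (graph[a][b] == graph[b][a]).
    """
    nodes = list(graph)
    scores = {n: 1.0 / len(nodes) for n in nodes}
    total_weight = {n: sum(graph[n].values()) for n in nodes}
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            # Each neighbour passes on its score in proportion to the edge weight
            incoming = sum(
                scores[m] * graph[m][n] / total_weight[m]
                for m in graph[n]
                if total_weight[m] > 0
            )
            updated[n] = (1 - damping) / len(nodes) + damping * incoming
        scores = updated
    return scores


# Hypothetical topic graph: heavier edges mean the two topics occur closer together
topic_graph = {
    "computing systems": {"performance": 0.5, "cloud": 0.9, "grid": 0.8},
    "performance": {"computing systems": 0.5, "cloud": 0.2, "grid": 0.1},
    "cloud": {"computing systems": 0.9, "performance": 0.2, "grid": 0.6},
    "grid": {"computing systems": 0.8, "performance": 0.1, "cloud": 0.6},
}
topic_scores = weighted_pagerank(topic_graph)
for topic, score in sorted(topic_scores.items(), key=lambda pair: -pair[1]):
    print(topic, round(score, 4))
```

Topics that are strongly connected to many other topics accumulate the most score, which is the intuition behind ranking keyphrases by their position in the graph rather than by raw frequency.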

In [4]:
# pip install git+https://github.com/boudinfl/pke.git
import pke

model_pke = pke.unsupervised.TopicRank()

model_pke.load_document(input=text, language='en')

model_pke.candidate_selection()

model_pke.candidate_weighting()

keywords = model_pke.get_n_best(n=5)
for kw, s in keywords:
  print("Keyword: ",kw, ", Score", s)

Keyword:  computing systems , Score 0.08890515296435779
Keyword:  performance , Score 0.07241923429281419
Keyword:  cloud , Score 0.061749581996090555
Keyword:  utility , Score 0.061228499575187195
Keyword:  grid , Score 0.060025590579581026


<b>Davinci</b>

Davinci is an LLM (Large Language Model) developed by OpenAI that uses deep learning techniques to perform various natural language processing tasks such as language translation, text summarization, and keyword extraction.

Davinci uses a transformer-based architecture that consists of a multi-layered neural network.

Davinci's advantage in keyword identification is that it has been pre-trained on massive amounts of textual data, which lets it learn and recognize patterns and structures in language that are difficult for traditional rule-based algorithms to identify. As a result, Davinci can take the context and meaning of words in a sentence into account and often identifies important keywords and phrases more accurately.

In [5]:
import os

import openai  # this cell uses the legacy Completion API (openai<1.0)

# Read the API key from the environment; never hard-code a secret key in a notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.Completion.create(
  model="text-davinci-003",
  prompt="You will be given the title and the abstract of a paper. Please find out the appropriate keywords by analysing the abstract and relating them to the title.\nTitle:"+title+"\nAbstract: "+text+"\n\nRestrict the search space to technical keywords and give only 2-3 most important ones"
)
# The completion text is returned as a single string
print(response.choices[0].text)



Keywords: Virtualization, Cloud Computing, Distributed Computing
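
The completion comes back as one free-form string, so comparing it with the other extractors requires turning it into a list. The helper below is a minimal sketch that assumes the model answers in the `Keywords: a, b, c` shape shown above. `parse_keywords` is a hypothetical name, and a differently formatted answer would need different parsing.

```python
def parse_keywords(completion_text):
    """Split a 'Keywords: a, b, c' style completion into a list of keywords."""
    text = completion_text.strip()
    prefix = "keywords:"
    # Drop the leading label if the model included one
    if text.lower().startswith(prefix):
        text = text[len(prefix):]
    return [kw.strip() for kw in text.split(",") if kw.strip()]


sample = "\n\nKeywords: Virtualization, Cloud Computing, Distributed Computing"
print(parse_keywords(sample))
```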
