In [6]:
# Sample text (a course syllabus) used as input to every extractor below
text = "Module 1 Introduction: What Is Machine Learning?, How Do We Define Learning?, How Do We Evaluate Our Networks?, How Do We Learn Our Network?, What are datasets and how to handle them?, Feature sets, Dataset division: test, train and validation sets, cross validation. Module 2 Basics of machine learning: Applications of Machine Learning, processes involved in Machine Learning, Introduction to Machine Learning Techniques: Supervised Learning, Unsupervised Learning and Reinforcement Learning, Real life examples of Machine Learning. Module 3 Supervised learning: Classification and Regression: K-Nearest Neighbor, Linear Regression, Logistic Regression, Support Vector Machine (SVM), Evaluation Measures: SSE, MME, R2, confusion matrix, precision, recall, F-Score, ROC-Curve. Module 4 Unsupervised learning: Introduction to clustering, Types of Clustering: Hierarchical, Agglomerative Clustering and Divisive clustering; Partitional Clustering - K-means clustering. Module 5 Miscellaneous: Dimensionality reduction techniques: PCA, LDA, ICA. Introduction to Deep Learning, Gaussian Mixture Models, Natural Language Processing, Computer Vision."

<b>KeyBERT</b>

KeyBERT uses embeddings from BERT-style models (BERT: Bidirectional Encoder Representations from Transformers) to find the keywords and keyphrases that are most similar to the document as a whole.

The BERT model is a pre-trained neural network model that is able to generate contextualized word embeddings.

Working:
1. Embedding the whole input text with a sentence-embedding model (here, all-mpnet-base-v2).
2. Extracting candidate words and phrases (n-grams in the range set by keyphrase_ngram_range) from the text, with stop words removed.
3. Embedding each candidate with the same model.
4. Ranking the candidates by cosine similarity to the document embedding and returning the top N.
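
The final ranking step can be sketched as a cosine-similarity comparison between a document vector and candidate-phrase vectors. This is a minimal sketch with made-up 3-dimensional toy vectors and hypothetical helper names (`cosine`, `rank_candidates`); it is not KeyBERT's own code, and a real run would take the vectors from the embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(doc_vec, cand_vecs, top_n=2):
    """Return the top_n (phrase, similarity) pairs, most similar first."""
    scored = [(phrase, cosine(doc_vec, vec)) for phrase, vec in cand_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy "embeddings" (assumed values, for illustration only).
doc_vec = [0.9, 0.1, 0.3]
cand_vecs = {
    "machine learning": [0.8, 0.2, 0.3],
    "module": [0.1, 0.9, 0.1],
    "clustering": [0.6, 0.1, 0.7],
}

ranked = rank_candidates(doc_vec, cand_vecs, top_n=2)
for phrase, score in ranked:
    print(f"{phrase}: {score:.3f}")
```

Candidates whose vectors point in nearly the same direction as the document vector come first, which is why generic words such as "module" tend to drop out.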

In [7]:
# pip install keybert
from keybert import KeyBERT

model_keybert = KeyBERT(model='all-mpnet-base-v2')

keywords = model_keybert.extract_keywords(text, keyphrase_ngram_range=(1, 2), stop_words='english', highlight=False, top_n=10)

for kw, s in keywords:
  print(kw)

introduction clustering
machine learning
unsupervised learning
learning classification
supervised learning
clustering
svm
support vector
means clustering
machine svm


<b>YAKE</b>

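YAKE (Yet Another Keyword Extractor) uses an unsupervised statistical approach: it scores words using features computed from the single input document (casing, position, frequency, and context), without any training corpus or pre-trained model.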

Working:

1. The input text is split into sentences and tokens. Stop words and punctuation are used to delimit candidate phrases.

2. For each word, YAKE computes a set of statistical features: how often it is capitalized, how early in the text it appears, how frequent it is, how many different words co-occur with it in its context, and how many different sentences it appears in.

3. These features are combined into a single score per word. In YAKE, a lower score means a more important word.

4. Candidate keyphrases (n-grams) are scored from the scores of their words, near-duplicate candidates are removed, and the N candidates with the lowest scores are returned.
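
How per-word scores might be combined into a keyphrase score can be sketched as follows. The sketch assumes the aggregation described in the YAKE paper (product of the word scores, divided by the phrase frequency times one plus the sum of the word scores). The function name `keyphrase_score` and all numbers are made up for illustration; this is not the library's code.

```python
from math import prod


def keyphrase_score(word_scores, phrase_frequency):
    """Combine per-word scores into one keyphrase score (lower = more relevant)."""
    return prod(word_scores) / (phrase_frequency * (1 + sum(word_scores)))


# Assumed per-word scores and phrase counts, for illustration only.
candidates = {
    "machine learning": ([0.10, 0.08], 5),
    "real life": ([0.60, 0.70], 1),
}

scores = {kw: keyphrase_score(ws, tf) for kw, (ws, tf) in candidates.items()}
ranked = sorted(scores, key=scores.get)  # ascending: best keyphrase first
print(ranked)
```

Because lower is better, the `s` value returned by `extract_keywords` should be sorted in ascending order when comparing keyphrases.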

In [8]:
# pip install git+https://github.com/LIAAD/yake
import yake

model_yake = yake.KeywordExtractor(top=10, stopwords=None)
keywords = model_yake.extract_keywords(text)
for kw, s in keywords:
  print(kw)

Evaluate Our Networks
Learn Our Network
Machine Learning
Machine Learning Techniques
Feature sets
Support Vector Machine
Define Learning
Learning
Machine
Supervised Learning


<b>PKE</b>

PKE (Python Keyphrase Extraction) is a toolkit that implements several keyphrase extraction models behind a common interface. The example here uses TopicRank, an unsupervised graph-based model.

Working:

1. The input text is loaded into the PKE model using the load_document method, which performs basic pre-processing such as sentence splitting, tokenization, and Part-Of-Speech (POS) tagging.

2. The candidate_selection method is used to select the candidate keyphrases from the text. This method uses Part-Of-Speech (POS) tagging to identify the potential noun phrases and extracts them as candidate keyphrases.

3. The candidate_weighting method is used to calculate the weights of the candidate keyphrases. For TopicRank, similar candidates are first clustered into topics, the topics become the nodes of a graph, and a random-walk ranking (as in TextRank or PageRank) assigns each topic a score based on its connections to the other topics.

4. The get_n_best method is used to select the top N keyphrases from the candidate keyphrases based on their scores. The method returns a list of tuples, where each tuple contains a keyphrase and its score.
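
For TopicRank, the weighting step amounts to a random-walk ranking over a graph of topics. The following is a minimal PageRank-style sketch in plain Python, not PKE's implementation: the function name `rank_topics`, the toy graph, and its edge weights are assumptions made for illustration.

```python
def rank_topics(weights, damping=0.85, iterations=50):
    """Weighted PageRank-style scores for an undirected graph.

    weights maps each node to a dict {neighbour: edge_weight};
    the graph is assumed symmetric and every node has a neighbour.
    """
    nodes = list(weights)
    scores = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_scores = {}
        for n in nodes:
            incoming = 0.0
            for m, w in weights[n].items():
                # m passes on its score in proportion to the edge weight
                incoming += scores[m] * w / sum(weights[m].values())
            new_scores[n] = (1 - damping) / len(nodes) + damping * incoming
        scores = new_scores
    return scores


# Toy topic graph with assumed edge weights.
graph = {
    "machine learning": {"clustering": 2.0, "regression": 2.0, "module": 0.5},
    "clustering": {"machine learning": 2.0, "regression": 1.0},
    "regression": {"machine learning": 2.0, "clustering": 1.0},
    "module": {"machine learning": 0.5},
}

scores = rank_topics(graph)
best = sorted(scores, key=scores.get, reverse=True)
print(best)
```

Topics that are strongly connected to many other topics accumulate the highest scores, and get_n_best then reads keyphrases off the best-ranked topics.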

In [9]:
# pip install git+https://github.com/boudinfl/pke.git
import pke

model_pke = pke.unsupervised.TopicRank()

model_pke.load_document(input=text, language='en')

model_pke.candidate_selection()

model_pke.candidate_weighting()

keywords = model_pke.get_n_best(n=10)
for kw, s in keywords:
  print(kw)

machine learning
clustering
module
introduction
regression
unsupervised learning
validation sets
processes
datasets
dimensionality reduction techniques


<b>GPT</b>

In [11]:
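# pip install "openai<1.0"  (openai.ChatCompletion is the pre-1.0 interface)
import openai
openai.api_key = ""  # set your API key here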
system_intel = "You will be given data containing the syllabus of a computer science related program. Your task is to extract the topics from the syllabus as keywords. The keywords must be non-repetitive and computer science related, DO NOT include general terms. YOUR RESPONSE THROUGH API WILL BE USED IN AN APPLICATION DIRECTLY. Hence, STRICTLY FOLLOW THE FORMAT. DO NOT GENERATE ANY EXTRA TEXT OTHER THAN THE KEYWORDS SEPARATED BY COMMAS. STRICTLY GENERATE ONLY 10 KEYWORDS."

result = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_intel},
        {"role": "user", "content": text},
    ],
)
response = result["choices"][0]["message"]["content"]

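# Split on commas and trim whitespace so "a,b" and "a, b" are handled alike
keywords = [kw.strip() for kw in response.split(",") if kw.strip()]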

for kw in keywords:
    print(kw)

Machine Learning
Definition of Learning
Evaluation of Networks
Handling Datasets
Feature Sets
Dataset Division
Cross Validation
Supervised Learning
Unsupervised Learning
Reinforcement Learning
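
Since the prompt asks for keywords separated by commas but the model's reply is not guaranteed to follow the format exactly, a more defensive parser may be useful. This is a minimal sketch; `parse_keywords` and the sample `reply` string are hypothetical and not part of the OpenAI library.

```python
def parse_keywords(response, limit=10):
    """Split a comma-separated reply into clean, de-duplicated keywords."""
    seen = set()
    keywords = []
    # Treat newlines as separators too, in case the model returns a list.
    for part in response.replace("\n", ",").split(","):
        kw = part.strip().strip(".")
        key = kw.lower()
        if kw and key not in seen:
            seen.add(key)
            keywords.append(kw)
    return keywords[:limit]


# Made-up reply with irregular spacing, a newline, and a repeated keyword.
reply = "Machine Learning,Feature Sets ,\nCross Validation, machine learning, "
parsed = parse_keywords(reply)
print(parsed)
```

The helper keeps the first spelling of each keyword, ignores case when checking for repeats, and caps the list at `limit` entries.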


<b>Davinci</b>

Davinci is an LLM (Large Language Model) developed by OpenAI that uses deep learning techniques to perform various natural language processing tasks such as language translation, text summarization, and keyword extraction.

Davinci uses a transformer-based architecture that consists of a multi-layered neural network.

Davinci is better suited to keyword identification than traditional rule-based algorithms because it has been pre-trained on massive amounts of textual data, enabling it to learn patterns and structures in language that are difficult for those algorithms to capture. As a result, Davinci can use the context and meaning of words in a sentence to identify important keywords and phrases, often more accurately.

In [None]:
# Note: `title` is not defined in this notebook; define it before uncommenting this cell.
# import openai

# openai.api_key = ""

# response = openai.Completion.create(
#   model="text-davinci-003",
#   prompt="You will be given the title and the abstract of a paper. Please find out the appropriate keywords by analysing the abstract and relating them to the title.\nTitle:"+title+"\nAbstract: "+text+"\n\nRestrict the search space to technical keywords and give only 2-3 most important ones"
# )
# print(response.choices[0].text)