MultPAX: Keyphrase Extraction using Language Models and Knowledge Graphs

This repository contains the source code of our paper: "MultPAX: Keyphrase Extraction using Language Models and Knowledge Graphs". The paper has been accepted at the ISWC 2022 conference.

Fig. 1: The architecture of the MultPAX framework

Summary:

Keyphrase extraction is the process of extracting a small set of phrases that best describe an input corpus.
The automatic generation of keyphrases has become essential for many natural language applications such as text categorization, indexing, and summarization.
In this paper, we propose MultPAX, a multitask framework for extracting present and absent keyphrases using pretrained language models and knowledge graphs. In particular, our framework contains three components:
1. MultPAX identifies present keyphrases from the input corpus.
2. MultPAX then links the input corpus with external knowledge graphs to get more relevant phrases.
3. MultPAX ranks the extracted phrases based on their semantic relatedness to the input corpus.
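
The three steps above can be sketched end to end as follows. This is a minimal, self-contained illustration and not the authors' implementation: the candidate extractor is a plain n-gram enumerator, `TOY_KG` is a hypothetical in-memory stand-in for an external knowledge graph, and similarity is a bag-of-words cosine instead of a pretrained language model embedding. All function names are made up for this example.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "of", "and", "to", "in", "for", "on", "with", "is", "are"}

# Hypothetical stand-in for an external knowledge graph: phrase -> related phrases.
TOY_KG = {
    "neural network": ["deep learning", "machine learning"],
    "knowledge graph": ["semantic web", "linked data"],
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def present_candidates(text, max_len=2):
    """Step 1: collect n-grams (up to max_len words) that contain no stopword."""
    tokens = tokenize(text)
    found = set()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if not any(tok in STOPWORDS for tok in gram):
                found.add(" ".join(gram))
    return found


def linked_candidates(present, kg):
    """Step 2: add phrases that the knowledge graph relates to phrases in the text."""
    linked = set()
    for phrase in present:
        linked.update(kg.get(phrase, []))
    return linked - present


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank(candidates, text, top_k=5):
    """Step 3: order candidates by similarity to the whole input text."""
    doc_vec = Counter(tokenize(text))
    scored = [(phrase, cosine(Counter(tokenize(phrase)), doc_vec)) for phrase in candidates]
    scored.sort(key=lambda item: (-item[1], item[0]))  # ties broken alphabetically
    return scored[:top_k]


if __name__ == "__main__":
    text = "A neural network is trained on a knowledge graph for learning."
    present = present_candidates(text)
    absent = linked_candidates(present, TOY_KG)
    for phrase, score in rank(present | absent, text):
        print(f"{score:.3f}  {phrase}")
```

Because the toy similarity only counts shared words, absent phrases with no word overlap score zero here; an embedding model would be needed to score them by meaning.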

Our Contributions:

1) We propose an *unsupervised* multitask framework that not only extracts present keyphrases, but also generates absent ones.
    
2) To the best of our knowledge, our approach is the first attempt that leverages existing knowledge graphs for keyphrase extraction without the need to create keyphrase vocabularies or phrase banks.
    
3) We introduce an embedding-based F1 score that considers the semantic similarity between generated and ground-truth keyphrases rather than relying on exact matching.
    
4) We carried out several experiments on four benchmark datasets. The evaluation results show that our approach is more accurate than state-of-the-art baselines.
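
To illustrate how an embedding-based F1 (contribution 3) differs from exact-match F1, the sketch below computes both. It is an illustration under stated assumptions, not the metric definition from the paper: a phrase counts as matched when its cosine similarity to some phrase on the other side reaches a hypothetical `threshold`, and `embed` is a character-trigram stand-in for a sentence embedding model.

```python
import math
from collections import Counter


def embed(phrase):
    """Toy embedding: character-trigram counts (stand-in for a language model)."""
    padded = f"  {phrase.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def f1(precision, recall):
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


def exact_f1(predicted, gold):
    """F1 where a prediction only counts if it equals a ground-truth phrase."""
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        return 0.0
    hits = len(pred & ref)
    return f1(hits / len(pred), hits / len(ref))


def embedding_f1(predicted, gold, threshold=0.6):
    """F1 where a phrase counts as matched if it is similar enough to any
    phrase on the other side (assumed matching rule, hypothetical threshold)."""
    if not predicted or not gold:
        return 0.0
    pred_vecs = [embed(p) for p in predicted]
    gold_vecs = [embed(g) for g in gold]
    matched_pred = sum(max(cosine(p, g) for g in gold_vecs) >= threshold for p in pred_vecs)
    matched_gold = sum(max(cosine(g, p) for p in pred_vecs) >= threshold for g in gold_vecs)
    return f1(matched_pred / len(pred_vecs), matched_gold / len(gold_vecs))


if __name__ == "__main__":
    predicted = ["neural networks", "graph embedding"]
    gold = ["neural network", "knowledge graphs"]
    print("exact F1:    ", exact_f1(predicted, gold))
    print("embedding F1:", embedding_f1(predicted, gold))
```

In this example the prediction "neural networks" never equals "neural network", so exact matching gives it no credit, while the similarity-based variant treats the two as a match.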

Repository Structure:

.
├── Baselines
│   ├── EmbedRank-Baseline.ipynb
│   ├── EmbedRank(Wordwise)- Baseline.ipynb
│   ├── TextRank-Baseline.ipynb
│   └── YAKE-Baseline.ipynb
├── Inspec experiment
│   └── MltPAX-Inspec.ipynb
├── Krapivin2009 experiment
│   └── MltPAX-Krapivin2009.ipynb
├── NUS experiment
│   └── MltPAX-NUS.ipynb
├── SemEval2010 experiment
│   └── MltPAX-SemEval2010.ipynb
└── .DS_Store

How to run:

We conduct several experiments on four benchmark datasets, namely: Inspec, SemEval2010, NUS and Krapivin2009. The datasets are available at the Dropbox Folder.

To set up the experiments, install the following libraries via pip install -r requirments.txt or install them manually:

Python 3.7
keybert
sentence-transformers 2.2.0
SPARQLWrapper 2.0.0
SciPy 1.8.0
NumPy 1.21.5
Pandas 1.4.2
NLTK 3.6.6 
requests 2.27.1
py-babelnet
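
Before running the notebooks, it can help to confirm that the installed packages match the versions listed above. The following sketch is one way to do that. The distribution names used as keys (for example `scipy` for SciPy) are assumed to be the usual PyPI names, `PINNED` simply mirrors the list above, and the helper names are made up for this example.

```python
try:
    from importlib import metadata  # Python 3.8+
except ImportError:  # Python 3.7: assumes the importlib-metadata backport is installed
    import importlib_metadata as metadata

# Versions copied from the list above; packages listed without a version are omitted.
PINNED = {
    "sentence-transformers": "2.2.0",
    "SPARQLWrapper": "2.0.0",
    "scipy": "1.8.0",
    "numpy": "1.21.5",
    "pandas": "1.4.2",
    "nltk": "3.6.6",
    "requests": "2.27.1",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_versions(pinned, lookup=installed_version):
    """Map each mismatching package to a (wanted, found) pair."""
    problems = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


if __name__ == "__main__":
    for name, (wanted, found) in check_versions(PINNED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

The `lookup` parameter exists only so the check can be exercised without touching the real environment.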

We provide our experiments as Jupyter notebooks (see the Experiments folder) and source files (see the src folder). We recommend the Jupyter notebooks for interactive execution of our experiments. Furthermore, we provide one Jupyter notebook per experiment, as listed under Repository Structure.

Baselines:

We obtain the implementations of the TextRank and YAKE baselines from the open-source library PKE. The source code for these baselines is available at:

Furthermore, we implemented EmbedRank using the BERT pretrained model from the spaCycake library. Our implementation can be found at:

EmbedRank

For the AutoGen baseline, we obtained the implementation from its official GitHub repository.

For the CopyRNN baseline, the implementation can be obtained from its GitHub repository.

Evaluation

The following notebooks contain the implementation of the evaluation metrics used in our experiments:

Citation

@INPROCEEDINGS{zahera2022multpax,
  author    = "Hamada M. Zahera and Daniel Vollmers and Mohamed Ahmed Sherif and Axel-Cyrille Ngonga Ngomo",
  title     = "MultPAX: Keyphrase Extraction using Language Models and Knowledge Graphs",
  booktitle = "The 21st International Semantic Web Conference (ISWC) 2022",
  year      = "2022",
  series    = "Springer"
}
