A Multitask Framework for Present and Absent Keyphrase Generation using Knowledge Graphs

dice-group/MultPAX

MultPAX: Keyphrase Extraction using Language Models and Knowledge Graphs

This repository contains the source code of our paper: "MultPAX: Keyphrase Extraction using Language Models and Knowledge Graphs". The paper has been accepted at the ISWC 2022 conference.

Fig. 1: The architecture of the MultPAX framework

Summary:

  • Keyphrase extraction is the process of extracting a small set of phrases that best describe an input corpus.
  • The automatic generation of keyphrases has become essential for many natural language applications such as text categorization, indexing, and summarization.
  • In this paper, we propose MultPAX, a multitask framework for extracting present and absent keyphrases using pretrained language models and knowledge graphs. In particular, our framework contains three components:
    1. MultPAX identifies present keyphrases from the input corpus.
    2. MultPAX then links the input corpus with external knowledge graphs to get more relevant phrases.
    3. MultPAX ranks the extracted phrases based on their semantic relatedness to the input corpus.
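
The three steps above can be pictured with a toy, standard-library-only sketch. This is not the code from the notebooks: the bag-of-words `embed` function, the `TOY_KG` dictionary, and all function names are hypothetical stand-ins for a pretrained sentence encoder and a real knowledge graph lookup, chosen only to make the data flow visible.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for a knowledge graph: present phrase -> related phrases.
TOY_KG = {
    "keyphrase extraction": ["information extraction", "text mining"],
    "knowledge graphs": ["linked data"],
}
STOPWORDS = {"a", "an", "the", "of", "and", "for", "with", "to", "in", "is", "are", "on"}


def embed(text):
    """Toy bag-of-words vector; a real setup would use a sentence encoder."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def present_candidates(doc, n=2):
    """Step 1: collect stopword-free n-grams that occur in the document."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    phrases = []
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        phrase = " ".join(gram)
        if not any(t in STOPWORDS for t in gram) and phrase not in phrases:
            phrases.append(phrase)
    return phrases


def absent_candidates(present, kg):
    """Step 2: add related phrases from the (toy) knowledge graph."""
    seen = set(present)
    related = []
    for phrase in present:
        for neighbour in kg.get(phrase, []):
            if neighbour not in seen:
                seen.add(neighbour)
                related.append(neighbour)
    return related


def rank(doc, phrases, top_k=5):
    """Step 3: order all candidates by similarity to the whole document."""
    doc_vec = embed(doc)
    scored = [(p, cosine(embed(p), doc_vec)) for p in phrases]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]
```

Calling `present_candidates(doc)`, then `absent_candidates(present, TOY_KG)`, then `rank(doc, present + absent)` mirrors the order of the components; in practice each toy function would be replaced by the corresponding model or knowledge graph query.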

Our Contributions:

1) We propose an *unsupervised* multitask framework that not only extracts present keyphrases, but also generates absent ones.
    
2) To the best of our knowledge, our approach is the first attempt that leverages existing knowledge graphs for keyphrase extraction without the need to create keyphrase vocabularies or phrase banks.
    
3) We introduce an embedding-based F1 score that considers the semantic similarity between generated and ground-truth keyphrases rather than relying on exact matching.
    
4) We carried out several experiments on four benchmark datasets. The evaluation results show that our approach is more accurate than state-of-the-art baselines.
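
To make the idea behind the embedding-based F1 score in contribution 3 concrete, here is a minimal sketch. It assumes that a predicted keyphrase counts as correct when its embedding has a cosine similarity of at least `threshold` with some ground-truth keyphrase; the threshold value, the matching rule, and the function names are assumptions for illustration, and the paper's exact definition may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def soft_f1(pred_vecs, gold_vecs, threshold=0.8):
    """F1 where a match means 'similar enough' instead of 'identical string'.

    pred_vecs / gold_vecs are lists of embedding vectors for the predicted
    and ground-truth keyphrases. The threshold is an assumed hyperparameter.
    """
    if not pred_vecs or not gold_vecs:
        return 0.0
    # Precision: share of predictions that are close to some gold phrase.
    matched_pred = sum(
        1 for p in pred_vecs if any(cosine(p, g) >= threshold for g in gold_vecs)
    )
    # Recall: share of gold phrases that are close to some prediction.
    matched_gold = sum(
        1 for g in gold_vecs if any(cosine(p, g) >= threshold for p in pred_vecs)
    )
    precision = matched_pred / len(pred_vecs)
    recall = matched_gold / len(gold_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With real data, the vectors would come from the same sentence encoder for both predicted and ground-truth phrases, so that near-synonyms are credited even when their surface forms differ.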

Repository Structure:

.
├── Baselines
│   ├── EmbedRank-Baseline.ipynb
│   ├── EmbedRank(Wordwise)- Baseline.ipynb
│   ├── TextRank-Baseline.ipynb
│   └── YAKE-Baseline.ipynb
├── Inspec experiment
│   └── MltPAX-Inspec.ipynb
├── Krapivin2009 experiment
│   └── MltPAX-Krapivin2009.ipynb
├── NUS experiment
│   └── MltPAX-NUS.ipynb
├── SemEval2010 experiment
│   └── MltPAX-SemEval2010.ipynb
└── .DS_Store

How to run:

We conduct several experiments on four benchmark datasets, namely: Inspec, SemEval2010, NUS and Krapivin2009. The datasets are available at the Dropbox Folder.

To set up the experiments, install the following libraries via pip install -r requirments.txt, or install them manually:

Python 3.7
keybert
sentence-transformers 2.2.0
SPARQLWrapper 2.0.0
SciPy 1.8.0
NumPy 1.21.5
Pandas 1.4.2
NLTK 3.6.6 
requests 2.27.1
py-babelnet

We provide our experiments as Jupyter notebooks (see Experiments Folder) and source files (see src Folder). We recommend using the Jupyter notebooks for interactive execution of our experiments. Furthermore, we provide a Jupyter notebook for each experiment:

Baselines:

We obtained the implementations of the TextRank and YAKE baselines from the open-source library PKE. The source code for these baselines is available at:

Furthermore, we implemented EmbedRank using the BERT pretrained model from the spaCycake library. Our implementation can be found at:

For the AutoGen baseline, we obtained the implementation from its official GitHub repository.

For the CopyRNN baseline, the implementation can be obtained from its GitHub repository.

Evaluation

The following notebooks contain the implementation of the evaluation metrics used in our experiments:


Citation

@INPROCEEDINGS{zahera2022multpax,
author = "Hamada M. Zahera and Daniel Vollmers and Mohamed Ahmed Sherif and Axel-Cyrille Ngonga Ngomo",
title = "MultPAX: Keyphrase Extraction using Language Models and Knowledge Graphs",
booktitle = "The 21st International Semantic Web Conference (ISWC) 2022",
year = "2022",
series = "Springer"}
