`pke` - python keyphrase extraction

This is a fork of the original pke library, which can be found here. The build and unit testing steps of the original library have been fixed.

pke is an open source python-based keyphrase extraction toolkit. It provides an end-to-end keyphrase extraction pipeline in which each component can be easily modified or extended to develop new models. pke also allows for easy benchmarking of state-of-the-art keyphrase extraction models, and ships with supervised models trained on the SemEval-2010 dataset.

Installation

Clone the repo and cd to the project directory containing the requirements.txt file. Then create a conda environment with Python >= 3.6, activate it, and install all dependencies via pip:

conda create -n keyphrase_extraction python==3.7
conda activate keyphrase_extraction
pip install -r requirements.txt

Then download the required NLTK resources:

python -m nltk.downloader stopwords
python -m nltk.downloader universal_tagset

Install the pytest module separately via

pip install -U pytest

spaCy models are usually downloaded via python -m spacy download, but here the model is installed manually. First download en_core_web_sm-2.3.1.tar.gz from here. Then untar the archive wherever you'd like to store your spaCy models:

tar -xvf en_core_web_sm-2.3.1.tar.gz

Next, with the conda environment created above activated, run the following command:

python -m spacy link directory_with_your_spacy_model_folder/en_core_web_sm-2.3.1/en_core_web_sm en --force

Next, make sure everything works by running the unit tests via

pytest tests
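If the tests fail at import time, a quick look at the environment can help narrow things down. The following is a minimal sketch, not part of pke: the helper name `missing_modules` and the list of packages to check are assumptions based on the tools mentioned in the steps above (spacy, nltk, pytest).

```python
from importlib.util import find_spec

# Packages referenced by the installation steps above.
# This list is an assumption; adjust it to match requirements.txt.
REQUIRED_MODULES = ("spacy", "nltk", "pytest")


def missing_modules(names=REQUIRED_MODULES):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All checked packages are importable.")
```

Run it with the keyphrase_extraction environment activated. An empty result only means the packages can be located, not that the spaCy model link or the NLTK resources are in place.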

Minimal example

pke provides a standardized API for extracting keyphrases from a document. Start by typing the lines below. To use another model, simply replace pke.unsupervised.TopicRank with another model (list of implemented models).

import pke

# initialize keyphrase extraction model, here TopicRank
extractor = pke.unsupervised.TopicRank()

# load the content of the document, here document is expected to be in raw
# format (i.e. a simple text file) and preprocessing is carried out using spacy
extractor.load_document(input='/path/to/input.txt', language='en')

# keyphrase candidate selection, in the case of TopicRank: sequences of nouns
# and adjectives (i.e. `(Noun|Adj)*`)
extractor.candidate_selection()

# candidate weighting, in the case of TopicRank: using a random walk algorithm
extractor.candidate_weighting()

# N-best selection, keyphrases contains the 10 highest scored candidates as
# (keyphrase, score) tuples
keyphrases = extractor.get_n_best(n=10)

A detailed example is provided in the examples/ directory.
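Because every model exposes the same four steps used in the minimal example above, the pipeline can be wrapped in a small helper and reused across models. This is a sketch under stated assumptions: `extract_keyphrases` and `compare_models` are hypothetical names, not part of pke, and the helper only works for models that run with their default parameters (models that need extra resources, such as document frequency counts or a trained model file, would need additional arguments).

```python
def extract_keyphrases(extractor, path, language="en", n=10):
    """Run the four pipeline steps on `extractor` and return its n best keyphrases."""
    extractor.load_document(input=path, language=language)
    extractor.candidate_selection()
    extractor.candidate_weighting()
    return extractor.get_n_best(n=n)


def compare_models(model_classes, path, n=10):
    """Map each model class name to the keyphrases it extracts from `path`."""
    return {
        cls.__name__: extract_keyphrases(cls(), path, n=n)
        for cls in model_classes
    }


# Example usage (requires pke and an input file):
#   import pke
#   results = compare_models(
#       [pke.unsupervised.TopicRank, pke.unsupervised.SingleRank],
#       '/path/to/input.txt',
#   )
```

Passing the extractor in as an argument keeps the helper independent of any one model, so swapping models is a one-line change at the call site.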

Getting started

Tutorials and code documentation are available at https://boudinfl.github.io/pke/.

Implemented models

pke currently implements the following keyphrase extraction models:

Unsupervised models
- Statistical models
  - TfIdf [documentation]
  - KPMiner [documentation, article by (El-Beltagy and Rafea, 2010)]
  - YAKE [documentation, article by (Campos et al., 2020)]
- Graph-based models
  - TextRank [documentation, article by (Mihalcea and Tarau, 2004)]
  - SingleRank [documentation, article by (Wan and Xiao, 2008)]
  - TopicRank [documentation, article by (Bougouin et al., 2013)]
  - TopicalPageRank [documentation, article by (Sterckx et al., 2015)]
  - PositionRank [documentation, article by (Florescu and Caragea, 2017)]
  - MultipartiteRank [documentation, article by (Boudin, 2018)]
Supervised models
- Feature-based models
  - Kea [documentation, article by (Witten et al., 2005)]
  - WINGNUS [documentation, article by (Nguyen and Luong, 2010)]
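For scripts that iterate over several models, the list above can be kept as a small lookup table. The sketch below copies the model names and groupings from the list; the `pke.supervised` module path is an assumption made by analogy with `pke.unsupervised.TopicRank` from the minimal example, and `MODELS` and `qualified_names` are hypothetical names.

```python
# Model names copied from the list above, grouped the same way.
MODELS = {
    "unsupervised": {
        "statistical": ["TfIdf", "KPMiner", "YAKE"],
        "graph-based": [
            "TextRank", "SingleRank", "TopicRank",
            "TopicalPageRank", "PositionRank", "MultipartiteRank",
        ],
    },
    "supervised": {
        "feature-based": ["Kea", "WINGNUS"],
    },
}


def qualified_names(kind=None):
    """Return dotted names such as 'pke.unsupervised.TopicRank'.

    `kind` may be 'unsupervised', 'supervised', or None for all models.
    """
    kinds = [kind] if kind else list(MODELS)
    return [
        f"pke.{k}.{name}"
        for k in kinds
        for family in MODELS[k].values()
        for name in family
    ]
```

The function only builds strings and does not import pke, so it can be used to drive configuration files or command-line options before any model is loaded.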

Citing pke

If you use pke, please cite the following paper:

@InProceedings{boudin:2016:COLINGDEMO,
  author    = {Boudin, Florian},
  title     = {pke: an open source python-based keyphrase extraction toolkit},
  booktitle = {Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations},
  month     = {December},
  year      = {2016},
  address   = {Osaka, Japan},
  pages     = {69--73},
  url       = {http://aclweb.org/anthology/C16-2015}
}
