The fast-rake package is an optimized implementation of the RAKE algorithm for unsupervised keyword extraction. It is built to efficiently process large collections of text without interruption. The performance gains come from optimized regular expressions built from stopword lists, along with a few Python-specific optimizations.
The Rapid Automatic Keyword Extraction (RAKE) algorithm is described in "Automatic Keyword Extraction from Individual Documents", Rose, S., et al. (2010).
- Use of optimized regular expressions for splitting sentences into candidate keywords. Included are optimized stopword lists from gensim, google, nltk, scikit-learn, and SMART.
- Allows custom stopword lists to augment the built-in stopword lists (see the sketch after this list).
- The RAKE implementation is easy to subclass. Two subclasses are available showing how different sentence and word tokenizers can be incorporated: RakePunkt uses the punkt sentence tokenizer from nltk for improved sentence splitting (all the required nltk_data are included and installed), and RakeNLTK subclasses RakePunkt and adds the word tokenizer TreebankWordTokenizer from nltk.
- Python-specific optimizations to speed up each step of the algorithm.
- Safe for multiprocessing (see examples/bbc_mp.py).
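As a quick sketch of the custom stopword option (assuming custom_stopwords accepts a list of strings; the added terms here are made up for illustration, not taken from the package):

>>> from fast_rake import Rake
>>>
>>> # augment the built-in nltk stopword list with domain-specific terms
>>> custom_rake = Rake(stopword_name="nltk", custom_stopwords=["figure", "table", "appendix"])
>>> kw = custom_rake("Results are shown in the figure and summarized in the table.")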
To install as a module:
pip install .
If pytest is installed, tests can be run via:
python -m pytest -v
The following example is from Rose, et al.:
Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types of systems and systems of mixed types.
The implementation uses __call__:
>>> from fast_rake import Rake
>>>
>>> # default arguments are shown
>>> smart_rake = Rake(stopword_name="smart", custom_stopwords=None, max_kw=None, ngram_range=None, top_percent=1.0, kw_only=False)
>>>
>>> text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types of systems and systems of mixed types."
>>> kw = smart_rake(text)
The resulting list, kw:
[
("minimal generating sets", 8.666666666666666),
("linear Diophantine equations", 8.5),
("minimal supporting set", 7.666666666666666),
("minimal set", 4.666666666666666),
("linear constraints", 4.5),
("natural numbers", 4.0),
("strict inequations", 4.0),
("nonstrict inequations", 4.0),
("Upper bounds", 4.0),
("mixed types", 3.666666666666667),
("considered types", 3.166666666666667),
("set", 2.0),
("types", 1.6666666666666667),
("considered", 1.5),
("Compatibility", 1.0),
("systems", 1.0),
("Criteria", 1.0),
("compatibility", 1.0),
("system", 1.0),
("components", 1.0),
("solutions", 1.0),
("algorithms", 1.0),
("construction", 1.0),
("criteria", 1.0),
("constructing", 1.0),
("solving", 1.0),
]
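The scores above follow the scoring described in Rose et al.: each content word is scored as degree(word) / frequency(word) over the candidate phrases, and a candidate phrase scores the sum of its word scores. A simplified, self-contained sketch of that scoring (not the fast-rake internals, and using a naive splitter in place of the optimized stopword regular expressions):

import re
from collections import defaultdict

def rake_scores(text, stopwords):
    """Simplified RAKE scoring: phrases are runs of non-stopwords within a sentence."""
    phrases = []
    # Split on sentence punctuation first, then on stopwords within each fragment.
    for fragment in re.split(r"[.!?,;:\t\n]+", text):
        current = []
        for word in re.findall(r"[A-Za-z0-9'-]+", fragment):
            if word.lower() in stopwords:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word.lower())
        if current:
            phrases.append(current)

    # degree(word) counts co-occurring words (including the word itself) in each phrase.
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}

    # A candidate phrase scores the sum of its word scores.
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

fast-rake replaces the naive splitting above with precompiled stopword regular expressions, which, together with the Python-specific optimizations noted earlier, is where the speedup comes from.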
The data are 2,225 BBC News articles from BBC-Dataset-News-Classification (not included). examples/bbc_news.py presents a typical use case: finding keywords for each document in a corpus.
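The core of that use case can be sketched as follows (a minimal, hypothetical loop, not the actual example script; the directory layout and glob pattern are assumptions):

>>> from pathlib import Path
>>> from fast_rake import Rake
>>>
>>> nltk_rake = Rake(stopword_name="nltk")
>>> corpus_dir = Path("BBC-Dataset-News-Classification/dataset/data_files")
>>> keywords = {}
>>> for txt_file in sorted(corpus_dir.rglob("*.txt")):
...     text = txt_file.read_text(encoding="utf-8", errors="ignore")
...     keywords[txt_file.name] = nltk_rake(text)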
bbc_news.py --input-dir BBC-Dataset-News-Classification/dataset/data_files --algorithm rake-og --stopwords nltk
Rake v1.2.0, stopwords: nltk
num docs: 2,225
time: 1.93149 secs
rate: 1151.96 docs/sec
UserWarnings: 0
fast-rake is safe for multiprocessing. The example bbc_mp.py uses joblib as the multiprocessing backend (you must install joblib to run this example).
bbc_mp.py --dataset bbc --top-dir BBC-Dataset-News-Classification/dataset/data_files --njobs -1 --algorithm rake-og --stopwords nltk
running dataset: bbc
Rake v1.2.0, stopwords: nltk
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 2 tasks | elapsed: 0.8s
[Parallel(n_jobs=-1)]: Done 16 tasks | elapsed: 0.8s
[Parallel(n_jobs=-1)]: Done 43 tasks | elapsed: 0.8s
[Parallel(n_jobs=-1)]: Done 108 tasks | elapsed: 0.8s
[Parallel(n_jobs=-1)]: Done 303 tasks | elapsed: 0.9s
[Parallel(n_jobs=-1)]: Done 1024 tasks | elapsed: 0.9s
[Parallel(n_jobs=-1)]: Done 2084 tasks | elapsed: 0.9s
[Parallel(n_jobs=-1)]: Done 2160 tasks | elapsed: 1.0s
[Parallel(n_jobs=-1)]: Done 2210 out of 2225 | elapsed: 1.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 2225 out of 2225 | elapsed: 1.0s finished
num docs: 2,225
time: 0.96945 secs
rate: 2295.11 docs/sec
UserWarnings: 0
Copyright © 2024, Lion Technologies, LLC.
MIT License
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.