# Keyphrase Extraction with `pke`

[`pke`](https://github.com/boudinfl/pke) (Python Keyphrase Extraction) by Boudin et al. is probably the most comprehensive collection of keyphrase extraction algorithms.
It implements a variety of both supervised and unsupervised models, although these focus on traditional approaches rather than neural networks.  
It also offers a unified interface to the various approaches, which is a big advantage over related libraries.

Advantages:
* Easy-to-use API
* Several different models available
* Somewhat regular updates

Disadvantages:
* No neural models implemented (yet)
* GPL license (restrictive for commercial use ;-) )

Available on GitHub: https://github.com/boudinfl/pke

## Existing Datasets and Other Libraries

Campos et al. have a neat collection of KPE datasets, mostly focusing on scientific articles (but in a variety of languages!):  
https://github.com/LIAAD/KeywordExtractor-Datasets

The KP20k dataset from (Meng et al., 2017) can also be found online, including all 570,000 documents used for training, validation *and* testing. See: https://github.com/memray/OpenNMT-kpg-release  
Note that this repository is associated with a follow-up paper not covered in the lecture. The original repository for the "Deep Keyphrase Generation" paper is [here](https://github.com/memray/seq2seq-keyphrase). However, I cannot really recommend that code for anything beyond studying the architecture, because it is unfortunately implemented in Theano.

For SotA KPE prediction, the team behind Span-BERT (Sun et al., 2020) has [released their code](https://github.com/thunlp/BERT-KPE) as well. However, I have not checked how easy it is to use (or to set up with custom data samples).




## Careful with Other Implementations
I generally advise you to be careful about choosing other libraries. Especially for popular methods (such as RAKE or TextRank), there are quite a few different implementations. At least some of these repositories contain implementation errors, which lead to worse or outright wrong results! If you are looking for a correct implementation of RAKE, the one by Alyona Medelyan (https://github.com/zelandiya/RAKE-tutorial) is probably your best shot.

`pke` at least has somewhat decent testing, which should provide enough resilience against completely faulty implementations.

In [None]:
!pip install git+https://github.com/boudinfl/pke.git

Collecting git+https://github.com/boudinfl/pke.git
  Cloning https://github.com/boudinfl/pke.git to /tmp/pip-req-build-99lyrwdw
  Running command git clone -q https://github.com/boudinfl/pke.git /tmp/pip-req-build-99lyrwdw
Collecting unidecode
  Downloading https://files.pythonhosted.org/packages/74/65/91eab655041e9e92f948cb7302e54962035762ce7b518272ed9d6b269e93/Unidecode-1.1.2-py2.py3-none-any.whl (239kB)
     |████████████████████████████████| 245kB 5.5MB/s 
Building wheels for collected packages: pke
  Building wheel for pke (setup.py) ... done
  Created wheel for pke: filename=pke-1.8.1-cp36-none-any.whl size=8763600 sha256=a9e9e1738d0343a3de835a01e8dbdc2c9a6a4701aed4a14f2ae47d4b6301958c
  Stored in directory: /tmp/pip-ephem-wheel-cache-sp03hdk4/wheels/8d/24/54/6582e854e9e32dd6c632af6762b3a5d2f6b181c2992e165462
Successfully built pke
Installing collected packages: unidecode, pke
Successfully installed pke-1.8.1 unidecode-1.1.2


In [None]:
!python -m nltk.downloader stopwords
!python -m nltk.downloader universal_tagset
!python -m spacy download en # download the english model

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


## Input Formats
For reference, see here: https://boudinfl.github.io/pke/build/html/tutorials/input.html

Mostly, you will want to load directly from a file or text that you already have in memory:

In [None]:
import pke
from urllib.request import urlopen

# Sherlock Holmes book
data = urlopen("https://sherlock-holm.es/stories/plain-text/advs.txt")

# Decode the response bytes into a regular text string.
sample_text = data.read().decode("utf-8", errors="replace")

print(sample_text[0:500])


                        THE ADVENTURES OF SHERLOCK HOLMES

                               Arthur Conan Doyle



                                Table of contents

               A Scandal in Bohemia
               The Red-Headed League
               A Case of Identity
               The Boscombe Valley Mystery
               The Five Orange Pips
               The Man with the Twisted Lip
               The Adventure 
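
Iterating over the response from `urlopen` yields `bytes` objects, so the content has to be decoded before it is handed to `pke` as text. The following minimal sketch uses a few made-up byte lines (an assumption for illustration, not the actual download) to show how `str()` and `.decode()` differ:

```python
# Made-up byte lines standing in for what iterating over a urlopen() response yields.
raw_lines = [b"THE ADVENTURES OF SHERLOCK HOLMES\n", b"\n", b"Arthur Conan Doyle\n"]

# str() on a bytes object returns its repr, including the b'...' wrapper
# and an escaped (two-character) newline.
as_repr = "".join(str(line) for line in raw_lines)

# decode() converts the bytes into actual text with real newlines.
as_text = "".join(line.decode("utf-8") for line in raw_lines)

print(repr(as_repr[:12]))
print(repr(as_text[:12]))
```

Only the decoded variant is suitable as input for keyphrase extraction, since the other one mixes byte-literal markers into the text.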


In [None]:
# Instantiate the extractor (note the parentheses).
extractor = pke.unsupervised.YAKE()

extractor.load_document(input=sample_text, language="en")

In [None]:
len(extractor.sentences)

5192

In [None]:
# Alternatively, load from a file.
with open("./sample_text.txt", "w") as f:
  f.write(sample_text)

extractor2 = pke.unsupervised.YAKE()
extractor2.load_document(input="./sample_text.txt", language="en")


In [None]:
len(extractor2.sentences)

5192

## Hints about Non-English Input Formats
`pke` works with both ASCII and Unicode input. This means you can, in principle, load documents in other languages; however, you may need to provide custom stopword lists and other language-specific resources. Since `pke` uses spaCy for preprocessing, a language works as long as the corresponding spaCy language model is installed.

## Selecting Algorithm Parameters


In [None]:
# See https://boudinfl.github.io/pke/build/html/unsupervised.html#yake for description of YAKE's parameters

# Select criteria for candidate phrase generation. Also allows for custom stopword lists.
# Here, n is the maximum n-gram length of the candidate phrases.
extractor.candidate_selection(n=3)

# Choose algorithm parameters
extractor.candidate_weighting(window=2, use_stems=True)
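
To make the role of `n` more concrete, here is a minimal pure-Python sketch of n-gram candidate generation. It is not `pke`'s implementation: the function name `ngram_candidates`, the whitespace tokenization, and the tiny stopword set are assumptions chosen purely for illustration.

```python
from typing import List, Set


def ngram_candidates(tokens: List[str], n: int, stopwords: Set[str]) -> List[str]:
    """Return all word sequences of length 1..n that neither start nor end with a stopword."""
    candidates = []
    for start in range(len(tokens)):
        for length in range(1, n + 1):
            end = start + length
            if end > len(tokens):
                break  # the n-gram would run past the end of the text
            gram = tokens[start:end]
            # Skip candidates that begin or end with a stopword.
            if gram[0].lower() in stopwords or gram[-1].lower() in stopwords:
                continue
            candidates.append(" ".join(gram).lower())
    return candidates


# Hypothetical toy input; a real pipeline would use a proper tokenizer and stoplist.
tokens = "Holmes sat upon the table".split()
stopwords = {"the", "upon"}
candidates = ngram_candidates(tokens, n=3, stopwords=stopwords)
print(candidates)
```

A larger `n` admits longer phrases but also increases the number of candidates that have to be scored.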

## Select Keyphrases

In [None]:
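# Select how many keyphrases should be returned. For YAKE, lower scores indicate more relevant keyphrases.
# In pke's YAKE implementation, `threshold` is the similarity threshold used to filter out
# near-duplicate candidates (redundancy removal); it is not a cutoff on the importance score.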
keyphrases = extractor.get_n_best(15, threshold=0.5)

print(keyphrases)

[('sherlock holm', 4.53937378536963e-05), ('said holm', 6.29030995107384e-05), ('holm', 8.156935573010463e-05), ('upon the tabl', 0.00021329219739364078), ('holm wa well', 0.0003197319806372044), ('said', 0.0003740219201981197), ('holm in baker', 0.0003844077980404051), ('upon', 0.0004117301383089954), ('littl of holm', 0.0004964470017736267), ('one', 0.0005117275249004911), ('well', 0.0006783442050531117), ('would', 0.000719145196883692), ('word to holm', 0.0008822938565316763), ('man', 0.0009060266248705548), ('holm the relentless', 0.0009793297745161844)]
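
The list above is sorted by ascending score, since lower YAKE scores indicate more relevant phrases. The sketch below illustrates that ordering together with a simple near-duplicate filter. It is not `pke`'s code: `n_best` is a hypothetical helper, the scores are made up, and `difflib.SequenceMatcher` from the standard library stands in for whatever string similarity a real implementation uses.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def n_best(weights: Dict[str, float], n: int, threshold: float) -> List[Tuple[str, float]]:
    """Pick the n lowest-scoring phrases, skipping near-duplicates of phrases already kept."""
    selected: List[Tuple[str, float]] = []
    # Ascending order: a lower score means a more relevant phrase.
    for phrase in sorted(weights, key=weights.get):
        is_redundant = any(
            SequenceMatcher(None, phrase, kept).ratio() > threshold
            for kept, _ in selected
        )
        if not is_redundant:
            selected.append((phrase, weights[phrase]))
        if len(selected) == n:
            break
    return selected


# Made-up scores for illustration only.
weights = {
    "sherlock holm": 0.00005,
    "sherlock holms": 0.00006,  # near-duplicate of the phrase above
    "said holm": 0.00007,
    "upon": 0.0004,
}
best = n_best(weights, n=3, threshold=0.8)
print(best)
```

With a lower threshold, more candidates are treated as duplicates and dropped; with a threshold close to 1, almost nothing is filtered.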


## Train a Supervised KEA Model
So far, we have looked at a simple statistical model that does not rely on any information besides the extraction parameters. For supervised models, however, we can either train our own models or use the provided pre-trained ones.

For further information on training, see https://boudinfl.github.io/pke/build/html/tutorials/training.html

In [None]:
# Compute document frequency statistics
# Training files can be found here: https://sherlock-holm.es/ascii/
pke.utils.compute_document_frequency(input_dir="./training/", 
                                     output_file="./df_counts.gz", 
                                     extension='txt', 
                                     language='en', 
                                     normalization='stemming', 
                                     stoplist=None, 
                                     delimiter='\t', 
                                     n=3)

In [None]:
# load the DF counts from file
df_counts = pke.load_document_frequency_file(input_file='./df_counts.gz')
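
Conceptually, document frequency counts in how many documents of a collection each n-gram occurs. The following sketch shows that idea in plain Python under simplifying assumptions: documents are already tokenized lists of strings, no stemming or stopword filtering is applied, and `document_frequency` is a hypothetical helper rather than the `pke` function used above.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def document_frequency(documents: Iterable[List[str]], n: int = 3) -> Tuple[Counter, int]:
    """Count, for each n-gram of length 1..n, the number of documents that contain it."""
    df: Counter = Counter()
    num_docs = 0
    for tokens in documents:
        num_docs += 1
        seen = set()
        for length in range(1, n + 1):
            for start in range(len(tokens) - length + 1):
                seen.add(" ".join(tokens[start:start + length]))
        # Using a set ensures each n-gram is counted at most once per document.
        df.update(seen)
    return df, num_docs


# Tiny made-up collection for illustration.
docs = [["sherlock", "holmes"], ["holmes", "said"], ["holmes", "holmes"]]
df, n_docs = document_frequency(docs, n=2)
print(df["holmes"], df["sherlock holmes"], n_docs)
```

Phrases that occur in nearly every document receive a high document frequency, which lets a downstream model discount them as uninformative.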

In [None]:
# train a new Kea model
pke.train_supervised_model(input_dir='/path/to/collection/of/documents/',
                           reference_file='/path/to/reference/file',
                           model_file='./trained_KEA.pke',
                           df=df_counts,
                           extension='txt',
                           language='en',
                           normalization="stemming",
                           model=pke.supervised.Kea())