In [1]:
%reload_ext autoreload
%autoreload 2

## Keyphrase Extraction in `ktrain`

Keyphrase extraction in **ktrain** leverages the [textblob](https://textblob.readthedocs.io/en/dev/) package, which can be installed with:
```
pip install textblob
python -m textblob.download_corpora
```

In [2]:
from ktrain.text.kw import KeywordExtractor
from ktrain.text.textextractor import TextExtractor

### Download a Paper from ArXiv and Extract Text
For our test document, let's download the ktrain ArXiv paper and use `TextExtractor` to extract its text.

In [3]:
!wget --user-agent="Mozilla" https://arxiv.org/pdf/2004.10703.pdf -O /tmp/downloaded_paper.pdf -q
text = TextExtractor().extract('/tmp/downloaded_paper.pdf')

In [4]:
print(f"# of words in downloaded paper: {len(text.split())}")

# of words in downloaded paper: 4551


### Using N-Grams as the Candidate Generator

Let's first use `ngrams` as the candidate generator, which is comparatively fast:

In [6]:
kwe = KeywordExtractor()

In [7]:
%%time
kwe.extract_keywords(text, candidate_generator='ngrams')

CPU times: user 355 ms, sys: 53.5 ms, total: 408 ms
Wall time: 407 ms


[('machine learning', 0.10548523206751055),
 ('step', 0.06751054852320675),
 ('learning rate', 0.046413502109704644),
 ('arxiv preprint', 0.046413502109704644),
 ('text classification', 0.03375527426160337),
 ('augmented machine', 0.02531645569620253),
 ('open-domain question-answering', 0.02531645569620253),
 ('augmented machine learning', 0.02531645569620253),
 ('bert', 0.02109704641350211),
 ('low-code library', 0.02109704641350211)]
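
The scores in the output above look like relative frequencies (for instance, 0.1055 equals 25/237), but this notebook does not state how ktrain computes them. As a rough illustration of the general idea only, here is a minimal sketch of n-gram candidate generation with relative-frequency scoring. The function `ngram_keyphrases`, the tiny `STOPWORDS` set, and the tokenizing regex are all assumptions made for this example and are not ktrain's implementation.

```python
import re
from collections import Counter

# Hypothetical stopword list, for illustration only
STOPWORDS = {"a", "an", "the", "of", "to", "is", "in", "and", "for", "that", "on"}

def ngram_keyphrases(text, ngram_range=(1, 3), top_n=5):
    """Score word n-grams by relative frequency (toy sketch, not ktrain's code)."""
    # Lowercase and split into simple word tokens (hyphens kept inside words)
    tokens = re.findall(r"[a-z0-9][a-z0-9\-]*", text.lower())
    counts = Counter()
    lo, hi = ngram_range
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Skip candidates that start or end with a stopword
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            counts[" ".join(gram)] += 1
    total = sum(counts.values())
    if total == 0:
        return []
    # Normalize each count by the total number of retained candidates
    return [(phrase, count / total) for phrase, count in counts.most_common(top_n)]

sample = "Machine learning is fun. Machine learning needs data. Data drives machine learning."
top = ngram_keyphrases(sample, ngram_range=(2, 2))
print(top[0])
```

Because this toy tokenizer ignores sentence boundaries, some candidates span two sentences; a real extractor would likely handle that more carefully.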

### Using Noun Phrases as the Candidate Generator


If we use `noun_phrases` as the candidate generator instead, quality improves slightly at the cost of a longer running time (roughly 1 s versus 0.4 s for this paper in the runs shown here).

In [9]:
%%time
kwe.extract_keywords(text, candidate_generator='noun_phrases')

CPU times: user 1.03 s, sys: 3.92 ms, total: 1.03 s
Wall time: 1.03 s


[('machine learning', 0.0784313725490196),
 ('text classification', 0.049019607843137254),
 ('image classification', 0.049019607843137254),
 ('exact answers', 0.0392156862745098),
 ('augmented machine learning', 0.0392156862745098),
 ('graph data', 0.029411764705882353),
 ('node classification', 0.029411764705882353),
 ('entity recognition', 0.029411764705882353),
 ('code example', 0.029411764705882353),
 ('index documents', 0.029411764705882353)]

### Other Parameters
The `extract_keywords` method accepts other parameters that control the output. For instance, the `ngram_range` parameter sets the range of words allowed in a keyphrase. Here, we extract only 3-word keyphrases:

In [10]:
kwe.extract_keywords(text, candidate_generator='noun_phrases', ngram_range=(3,3))

[('augmented machine learning', 0.07017543859649122),
 ('a. s. maiya', 0.05263157894736842),
 ('optimal learning rate', 0.03508771929824561),
 ('natural language questions', 0.03508771929824561),
 ('support text data', 0.017543859649122806),
 ('learning rate schedules', 0.017543859649122806),
 ('machine learning model', 0.017543859649122806),
 ('unsupervised topic modeling', 0.017543859649122806),
 ('large text corpus', 0.017543859649122806),
 ('social media accounts', 0.017543859649122806)]
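
If you already have a list of results and only want phrases of a certain length, you can also filter it after the fact. The sketch below uses a hypothetical helper, `filter_by_length`, that is not part of ktrain, and a few `(phrase, score)` pairs with rounded scores in the shape shown above.

```python
def filter_by_length(keyphrases, min_words=1, max_words=3):
    """Keep (phrase, score) pairs whose phrase has min_words..max_words words."""
    return [
        (phrase, score)
        for phrase, score in keyphrases
        if min_words <= len(phrase.split()) <= max_words
    ]

# Example pairs in the same (phrase, score) shape as the outputs above
results = [
    ("machine learning", 0.0784),
    ("augmented machine learning", 0.0392),
    ("graph data", 0.0294),
]
three_word = filter_by_length(results, min_words=3, max_words=3)
print(three_word)
```

Note that post-filtering is not equivalent to passing `ngram_range`: the filter only trims an existing top-10 list and leaves scores unchanged, whereas the outputs above show that `ngram_range=(3,3)` yields a different ranking with different scores.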

### Combining All the Steps: Low-Code Keyphrase Extraction

In [11]:
from ktrain.text.kw import KeywordExtractor
from ktrain.text.textextractor import TextExtractor
!wget --user-agent="Mozilla" https://arxiv.org/pdf/2004.10703.pdf -O /tmp/downloaded_paper.pdf -q
text = TextExtractor().extract('/tmp/downloaded_paper.pdf')
kwe = KeywordExtractor()
kwe.extract_keywords(text, candidate_generator='noun_phrases')

[('machine learning', 0.0784313725490196),
 ('text classification', 0.049019607843137254),
 ('image classification', 0.049019607843137254),
 ('exact answers', 0.0392156862745098),
 ('augmented machine learning', 0.0392156862745098),
 ('graph data', 0.029411764705882353),
 ('node classification', 0.029411764705882353),
 ('entity recognition', 0.029411764705882353),
 ('code example', 0.029411764705882353),
 ('index documents', 0.029411764705882353)]

### Non-English Keyphrase Extraction

Keyphrases can be extracted for non-English languages by supplying a 2-character language code as the `lang` argument. For simplified or traditional Chinese, use `zh`.

#### Chinese

In [14]:
text = """
监督学习是学习一个函数的机器学习任务
         根据样本输入-输出对将输入映射到输出。他推导出一个
         函数来自由一组训练示例组成的标记训练数据。
         在监督学习中，每个示例都是由一个输入对象组成的对
         （通常是一个向量）和一个期望的输出值（也称为监控信号）。
         监督学习算法分析训练数据并产生推断函数，
         可用于映射新示例。最佳方案将允许
         算法来正确确定不可见实例的类标签。这需要
         学习算法从训练数据泛化到新情况
         “合理”的方式（见归纳偏差）。
"""
kwe = KeywordExtractor(lang='zh')
kwe.extract_keywords(text)

[('监督 学习', 0.05357142857142857),
 ('训练 数据', 0.05357142857142857),
 ('一个 函数', 0.03571428571428571),
 ('学习 算法', 0.03571428571428571),
 ('学习 一个', 0.017857142857142856),
 ('机器 学习', 0.017857142857142856),
 ('学习 任务', 0.017857142857142856),
 ('任务 根据', 0.017857142857142856),
 ('根据 样本', 0.017857142857142856),
 ('样本 输入', 0.017857142857142856)]

#### French

In [15]:
text = """L'apprentissage supervisé est la tâche d'apprentissage automatique consistant à apprendre une fonction qui
         mappe une entrée à une sortie sur la base d'exemples de paires entrée-sortie. Il en déduit une
         fonction à partir de données d'entraînement étiquetées constituées d'un ensemble d'exemples d'entraînement.
         En apprentissage supervisé, chaque exemple est une paire composée d'un objet d'entrée
         (généralement un vecteur) et une valeur de sortie souhaitée (également appelée signal de supervision).
         Un algorithme d'apprentissage supervisé analyse les données d'apprentissage et produit une fonction inférée,
         qui peut être utilisé pour cartographier de nouveaux exemples. Un scénario optimal permettra
         algorithme pour déterminer correctement les étiquettes de classe pour les instances invisibles. Cela nécessite
         l'algorithme d'apprentissage pour généraliser à partir des données d'entraînement à des situations inédites dans un
         manière « raisonnable » (voir biais inductif)."""

kwe = KeywordExtractor(lang='fr')
kwe.extract_keywords(text)

[("données d'entraînement", 0.0392156862745098),
 ("l'apprentissage supervisé", 0.0196078431372549),
 ("tâche d'apprentissage", 0.0196078431372549),
 ("d'apprentissage automatique", 0.0196078431372549),
 ('automatique consistant', 0.0196078431372549),
 ("base d'exemples", 0.0196078431372549),
 ('paires entrée-sortie', 0.0196078431372549),
 ("d'entraînement étiquetées", 0.0196078431372549),
 ('étiquetées constituées', 0.0196078431372549),
 ("constituées d'un", 0.0196078431372549)]

The following languages are supported:

In [16]:
from ktrain.text.kw.core import SUPPORTED_LANGS
for k,v in SUPPORTED_LANGS.items():
    print(k,v)

en english
ar arabic
az azerbaijani
da danish
nl dutch
fi finnish
fr french
de german
el greek
hu hungarian
id indonesian
it italian
kk kazakh
ne nepali
no norwegian
pt portuguese
ro romanian
ru russian
sl slovene
es spanish
sv swedish
tg tajik
tr turkish
zh chinese
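
Since the mapping is keyed by two-character code, a simple membership test can catch unsupported codes before constructing an extractor. The sketch below hard-codes a small subset of the mapping printed above so that it runs without ktrain installed; `check_lang` is a hypothetical helper, not a ktrain function.

```python
# Subset of the code-to-language mapping printed above, hard-coded for illustration
SUPPORTED = {"en": "english", "fr": "french", "de": "german", "zh": "chinese"}

def check_lang(code):
    """Return the normalized code if supported, else raise ValueError (hypothetical helper)."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED:
        raise ValueError(f"Unsupported language code: {code!r}")
    return normalized

lang = check_lang("FR")
print(lang, SUPPORTED[lang])
```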


### Scalability
With parallelization, keyphrase extraction can easily scale to a large number of documents.

In [36]:
text = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal). 
         A supervised learning algorithm analyzes the training data and produces an inferred function, 
         which can be used for mapping new examples. An optimal scenario will allow for the 
         algorithm to correctly determine the class labels for unseen instances. This requires 
         the learning algorithm to generalize from the training data to unseen situations in a 
         'reasonable' way (see inductive bias).

"""
docs = [text] * 10000
kwe = KeywordExtractor()

We can process these 10,000 documents using 8 parallel jobs in under nine seconds of wall time:

In [38]:
%%time
from joblib import Parallel, delayed
results = Parallel(n_jobs=8)(delayed(kwe.extract_keywords)(doc) for doc in docs)

CPU times: user 3.73 s, sys: 129 ms, total: 3.86 s
Wall time: 8.69 s
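
The call above returns one list of `(phrase, score)` pairs per document. One possible next step is to merge those lists into corpus-level keyphrases. The sketch below is an assumption about how one might do that, not something ktrain provides: `aggregate_keyphrases` is a hypothetical helper that averages each phrase's score over all documents, counting a phrase as 0 in documents where it does not appear. It uses small made-up per-document results so that it runs on its own.

```python
from collections import defaultdict

def aggregate_keyphrases(results, top_n=10):
    """Average each phrase's score across documents (hypothetical helper)."""
    totals = defaultdict(float)
    for doc_keyphrases in results:
        for phrase, score in doc_keyphrases:
            totals[phrase] += score
    n_docs = max(len(results), 1)  # avoid dividing by zero on empty input
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(phrase, total / n_docs) for phrase, total in ranked[:top_n]]

# Made-up per-document results in the same shape as the extractor's output
sample_results = [
    [("supervised learning", 0.5), ("training data", 0.25)],
    [("supervised learning", 0.25), ("input object", 0.125)],
]
corpus_top = aggregate_keyphrases(sample_results)
print(corpus_top)
```

Averaging favors phrases that recur across many documents; summing or taking the maximum score would be equally plausible choices depending on the use case.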
