# Keyword Extraction: from TF-IDF to BERT


https://towardsdatascience.com/keyword-extraction-python-tf-idf-textrank-topicrank-yake-bert-7405d51cd839

Keyword extraction is one of the most common text mining tasks: given a document, the extraction algorithm should identify a set of terms that best describe its subject. In this tutorial, we perform keyword extraction with five different approaches: TF-IDF, TextRank, TopicRank, YAKE!, and KeyBERT.


In [21]:
!pip uninstall rasa -y



In [1]:
#Install it just before using it

# keyword extraction using BERT
!pip uninstall -y keyBERT

# !pip install keyBERT

# TopicRank
!pip uninstall -y topicrankpy

# !pip install topicrankpy

Found existing installation: keybert 0.1.3
Uninstalling keybert-0.1.3:
  Successfully uninstalled keybert-0.1.3
Found existing installation: topicrankpy 1.1.0
Uninstalling topicrankpy-1.1.0:
  Successfully uninstalled topicrankpy-1.1.0


In [3]:
# for extracting data from webpages/downloads
!pip install trafilatura

# for text summarization and keyword extraction using Textrank 
!pip install summa

# Yet Another Keyword Extractor (Yake)
!pip install git+https://github.com/LIAAD/yake
    

Collecting git+https://github.com/LIAAD/yake
  Cloning https://github.com/LIAAD/yake to /private/var/folders/vv/sgqty73x7097nx1sgx6l1_bw0000gn/T/pip-req-build-860f7f38
  Running command git clone -q https://github.com/LIAAD/yake /private/var/folders/vv/sgqty73x7097nx1sgx6l1_bw0000gn/T/pip-req-build-860f7f38
Building wheels for collected packages: yake
  Building wheel for yake (setup.py) ... done
  Created wheel for yake: filename=yake-0.4.3-py2.py3-none-any.whl size=66279 sha256=9ced1d327c51238168bbc210ba2b5251aabeee926df6c866e64ac50fbbea7ed3
  Stored in directory: /private/var/folders/vv/sgqty73x7097nx1sgx6l1_bw0000gn/T/pip-ephem-wheel-cache-1holnm1k/wheels/52/79/f4/dae9309f60266aa3767a4381405002b6f2955fbcf038d804da
Successfully built yake


In [22]:
# !pip uninstall -y numpy
# !pip install numpy

# !pip install nltk
import nltk

In [23]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/akshaykumarvaranasi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/akshaykumarvaranasi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/akshaykumarvaranasi/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [24]:
import trafilatura

array_links = [
    "https://en.wikipedia.org/wiki/Coronavirus_disease_2019", 
    "https://en.wikipedia.org/wiki/Recession", 
    "https://en.wikipedia.org/wiki/Vienna", 
    "https://en.wikipedia.org/wiki/Machine_learning", 
    "https://en.wikipedia.org/wiki/Graph_database"
]
array_text = []

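for link in array_links:
    html = trafilatura.fetch_url(link)  # returns None if the download fails
    text = trafilatura.extract(html) if html else None
    if not text:
        print("Could not extract text from", link)
        continue
    # Flatten newlines, drop apostrophes, and keep the first 5,000 characters
    text_clean = text.replace("\n", " ").replace("'", "")
    array_text.append(text_clean[:5000])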

After running the code above, we have a list of cleaned texts (documents), each truncated to its first 5,000 characters. Of course, you can substitute these texts with anything you want to test!

In [25]:
array_text[0]

'Coronavirus disease 2019 |Coronavirus disease 2019| |Other names| |Pronunciation| |Specialty||Infectious disease| |Symptoms||Fever, cough, fatigue, shortness of breath, loss of taste or smell; sometimes no symptoms at all[5][6]| |Complications||Pneumonia, Viral sepsis, Acute respiratory distress syndrome, Kidney failure, Cytokine release syndrome, Respiratory failure, Kawasaki disease, Pulmonary fibrosis, Paediatric multisystem inflammatory syndrome, Chronic Covid Syndrome| |Usual onset||2–14 days (typically 5) from infection| |Duration||5 days to 6+ months known| |Causes||Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)| |Diagnostic method||rRT-PCR testing, CT scan| |Prevention||Hand washing, face coverings, quarantine, physical/social distancing[7]| |Treatment||Symptomatic and supportive| |Frequency||74,724,989[8] confirmed cases| |Deaths||1,657,706[8]| (COVID-19) - The coronavirus - 2019-nCoV acute respiratory disease - Novel coronavirus pneumonia[1][2] - Severe pneumon

## Using TF-IDF

Given a document, it is easy to compute the TF score of every word, but the IDF score requires a large corpus of similar documents, which is usually not available. In this tutorial, we use IDF scores precomputed on Wikipedia.

In [26]:
import csv
from tqdm.notebook import tqdm

# Count the lines first so the progress bar knows its total.
with open("wiki_tfidf_terms.csv", encoding="utf-8") as file:
    num_lines = sum(1 for _ in file)

dict_idf = {}
with open("wiki_tfidf_terms.csv", encoding="utf-8", newline="") as file:
    # csv.reader keeps quoted terms that contain commas in a single cell.
    reader = csv.reader(file)
    next(reader, None)  # skip the header row
    with tqdm(total=num_lines) as pbar:
        pbar.update(1)  # account for the header
        for cells in reader:
            try:
                # Column 0 holds the term, column 3 its IDF score.
                dict_idf[cells[0]] = float(cells[3])
            except (IndexError, ValueError):
                print("Error on:", cells)
            finally:
                pbar.update(1)





Then, for each article inside our list, we compute the TF score of its words.

In [27]:
from sklearn.feature_extraction.text import CountVectorizer
from numpy import array, log
vectorizer = CountVectorizer()
tf = vectorizer.fit_transform([x.lower() for x in array_text])
tf = tf.toarray()
tf = log(tf + 1)

Now, we are ready to multiply TF with IDF.

In [28]:
tfidf = tf.copy()
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
words = array(vectorizer.get_feature_names_out())
# Loop over the vocabulary instead of the millions of IDF terms.
# Words without an IDF entry keep their raw TF score.
for i, word in enumerate(tqdm(words)):
    if word in dict_idf:
        tfidf[:, i] *= dict_idf[word]




In [29]:
for j in range(tfidf.shape[0]):
    print("Keywords of article", str(j+1), words[tfidf[j, :].argsort()[-5:][::-1]])

Keywords of article 1 ['covid' 'coronavirus' 'cov' 'symptoms' 'virus']
Keywords of article 2 ['recessions' 'recession' 'nber' 'gdp' 'shaped']
Keywords of article 3 ['vienna' 'km2' 'sq' 'vedunia' 'uuenia']
Keywords of article 4 ['learning' 'machine' 'algorithms' 'tasks' 'unsupervised']
Keywords of article 5 ['graph' 'databases' 'relational' 'database' 'nosql']
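
To make the scoring concrete, here is a minimal sketch of the same idea on a toy sentence: each word gets log(count + 1) multiplied by its IDF, as in the cells above. The IDF values in `toy_idf` are made up for illustration, and the helper name `tfidf_scores` is hypothetical; words missing from the table fall back to a factor of 1, which is also what happens above to words without a Wikipedia IDF.

```python
import math
from collections import Counter

def tfidf_scores(text, idf, default_idf=1.0):
    """Score each word as log(count + 1) * IDF."""
    counts = Counter(text.lower().split())
    return {w: math.log(c + 1) * idf.get(w, default_idf) for w, c in counts.items()}

# Hypothetical IDF values: very common words score low, rarer words score high.
toy_idf = {"the": 0.1, "graph": 6.0, "database": 5.0, "stores": 4.0}
toy_text = "the graph database stores the graph"

scores = tfidf_scores(toy_text, toy_idf)
top = sorted(scores, key=scores.get, reverse=True)[:2]
print(top)  # ['graph', 'database']
```

Note how "the" appears as often as "graph" but ends up at the bottom of the ranking because of its low IDF.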


## Keyword Extraction with TextRank

TextRank is an unsupervised method to perform keyword and sentence extraction. It is based on a graph where each node is a word and the edges are constructed by observing the co-occurrence of words inside a moving window of predefined size. Important nodes of the graph, computed with an algorithm similar to PageRank, represent keywords in the text.
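
The graph construction and ranking described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions (an unweighted, undirected graph, input that is already tokenized and free of stop words, a fixed number of iterations); the function name `textrank_keywords` and its parameters are hypothetical, and real implementations add steps such as part-of-speech filtering.

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top=3):
    """Rank words with a PageRank-style iteration over a co-occurrence graph."""
    # Link every pair of distinct words that fall inside the same
    # window of `window` consecutive tokens.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Power iteration: each word passes its score, split equally, to its neighbors.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Toy input, already lowercased and without stop words (an assumption of this sketch).
tokens = "graph database stores graph nodes graph edges database".split()
print(textrank_keywords(tokens))
```

Words that co-occur with many different neighbors, such as "graph" in the toy input, collect the highest scores.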

We are going to use the keyword extractor implemented in summa.

In [11]:
from summa import keywords
for j in range(len(array_text)):
    print("Keywords of article", str(j+1), "\n", (keywords.keywords(array_text[j], words=5)).split("\n"))

Keywords of article 1 
 ['symptoms', 'covid', 'disease', 'syndrome', 'severe']
Keywords of article 2 
 ['recession', 'recessions', 'economics', 'economic', 'shapes', 'shape', 'governments', 'government', 'policies', 'policy']
Keywords of article 3 
 ['vienna', 'viennas', 'city', 'cities', 'citys', 'mi', 'german', 'rank', 'ranked']
Keywords of article 4 
 ['learn', 'learns', 'machine learning', 'algorithms', 'algorithm', 'data', 'machines', 'tasks', 'task']
Keywords of article 5 
 ['graphs', 'graph database', 'databases', 'model', 'models', 'data', 'relates', 'relational', 'relation']


## Keyword Extraction with TopicRank

TopicRank is another unsupervised graph-based keyphrase extractor. Unlike TextRank, the nodes of its graph are topics, where each topic is a cluster of similar single-word and multiword expressions.
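
To illustrate what "a cluster of similar expressions" means, the sketch below groups candidate phrases that share enough word stems. It is a deliberately simplified stand-in: the greedy single pass, the crude stemmer, and the 0.25 threshold are assumptions of this example rather than the clustering procedure of any particular TopicRank implementation, and the graph ranking step that follows clustering is left out.

```python
def stems(phrase):
    """Very crude stemmer: lowercase and strip a trailing 's' (illustrative only)."""
    return {w[:-1] if w.endswith("s") and len(w) > 3 else w for w in phrase.lower().split()}

def overlap(a, b):
    """Share of stems two phrases have in common, relative to the larger phrase."""
    sa, sb = stems(a), stems(b)
    return len(sa & sb) / max(len(sa), len(sb))

def cluster_topics(candidates, threshold=0.25):
    """Greedily put each candidate into the first topic it overlaps with enough."""
    topics = []
    for cand in candidates:
        for topic in topics:
            if any(overlap(cand, member) >= threshold for member in topic):
                topic.append(cand)
                break
        else:
            # No existing topic is similar enough, so start a new one.
            topics.append([cand])
    return topics

candidates = ["graph database", "graph databases", "relational model", "relational models", "NoSQL"]
topics = cluster_topics(candidates)
print(topics)
# [['graph database', 'graph databases'], ['relational model', 'relational models'], ['NoSQL']]
```

Each resulting list would become one node of the topic graph, so near-duplicates such as "graph database" and "graph databases" no longer compete with each other.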

Let’s try a Python implementation of this keyword extractor.

In [13]:
!pip install -e git+https://github.com/smirnov-am/pytopicrank.git#egg=pytopicrank

Obtaining pytopicrank from git+https://github.com/smirnov-am/pytopicrank.git#egg=pytopicrank
  Updating ./src/pytopicrank clone
  Running command git fetch -q --tags
  Running command git reset --hard -q 926931b032f339d2f145ce1a4a9cdb50b54c37ba
Collecting decorator==4.2.1
  Using cached decorator-4.2.1-py2.py3-none-any.whl (9.3 kB)
Processing /Users/akshaykumarvaranasi/Library/Caches/pip/wheels/48/b1/c4/94ca0cdd84961331402e857d562d9822ccfcb0567c3d064bf2/networkx-2.1-py2.py3-none-any.whl
Processing /Users/akshaykumarvaranasi/Library/Caches/pip/wheels/60/de/57/6bced01d340818a36413222e6efcc7766d1f1e4575782b6223/nltk-3.2.5-py3-none-any.whl
Collecting numpy==1.14.1
  Using cached numpy-1.14.1.zip (4.9 MB)
Collecting scikit-learn==0.19.1
  Using cached scikit-learn-0.19.1.tar.gz (9.5 MB)
Collecting scipy==1.0.0
  Using cached scipy-1.0.0.tar.gz (15.2 MB)
Collecting six==1.11.0
  Using cached six-1.11.0-py2.py3-none-any.whl (10 kB)
Building wheels for collected packages: numpy, scikit-learn, 

ERROR: Command errored out with exit status 1:

The installation fails: pip cannot build the pinned numpy==1.14.1 and scipy==1.0.0 from source in this environment, so pytopicrank is not installed.


In [14]:
# from pytopicrank import TopicRank
# for j in range(len(array_text)):
#     tr = TopicRank(array_text[j])
#     print("Keywords of article", str(j+1), "\n", tr.get_top_n(n=5, extract_strategy='first'))

ModuleNotFoundError: No module named 'pytopicrank'

## Keyword Extraction with YAKE!

YAKE! is an unsupervised keyword extraction algorithm that relies on statistical features extracted from the document itself. It is multilingual and supports English, Italian, German, Dutch, Spanish, Finnish, French, Polish, Turkish, Portuguese, and Arabic.

In [17]:
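from yake import KeywordExtractor
# n=1 restricts keywords to single words; top=5 keeps the five best ones
kw_extractor = KeywordExtractor(lan="en", n=1, top=5)
for j in range(len(array_text)):
    # extract_keywords returns (keyword, score) pairs; keep only the keywords.
    # The name yake_keywords avoids shadowing the summa keywords module imported earlier.
    yake_keywords = [kw for kw, score in kw_extractor.extract_keywords(text=array_text[j])]
    print("Keywords of article", str(j+1), "\n", yake_keywords)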

Keywords of article 1 
 ['coronavirus', 'syndrome', 'respiratory', 'pneumonia', 'symptoms']
Keywords of article 2 
 ['economic', 'recession', 'recessions', 'gdp', 'united']
Keywords of article 3 
 ['vienna', 'city', 'state', 'capital', 'utc']
Keywords of article 4 
 ['learning', 'machine', 'computer', 'tasks', 'computers']
Keywords of article 5 
 ['graph', 'databases', 'database', 'data', 'relationships']


## Keyword Extraction with BERT

BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model for natural language processing. Pretrained models can transform sentences or words into a language representation consisting of an array of numbers (an embedding). Sentences or words with similar embeddings should have similar meanings. KeyBERT is an implementation that uses this approach to extract the keywords of a text.
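
The ranking idea can be sketched without any model: embed the document and each candidate word, then keep the candidates whose embedding is closest to the document's. The three-dimensional vectors below are made up for illustration (real BERT embeddings have hundreds of dimensions), and the helper names `cosine` and `rank_by_similarity` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(doc_vec, word_vecs, top=2):
    """Return the `top` words whose embeddings are closest to the document embedding."""
    sims = {word: cosine(doc_vec, vec) for word, vec in word_vecs.items()}
    return sorted(sims, key=sims.get, reverse=True)[:top]

# Made-up embeddings: in practice these would come from a pretrained model.
doc_vec = [0.9, 0.1, 0.2]
word_vecs = {
    "graph": [0.8, 0.2, 0.1],
    "database": [0.7, 0.0, 0.4],
    "weather": [0.0, 1.0, 0.1],
}
print(rank_by_similarity(doc_vec, word_vecs))  # ['graph', 'database']
```

"weather" points in a very different direction from the document vector, so it is ranked last and dropped.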

In [18]:
!pip install keyBERT

Processing /Users/akshaykumarvaranasi/Library/Caches/pip/wheels/87/3d/ac/084f1894e515b02e6257ca5bf699380b21af42d2852ecc0531/keybert-0.1.3-py3-none-any.whl
Installing collected packages: keyBERT
Successfully installed keyBERT-0.1.3


In [20]:
from keybert import KeyBERT
kw_extractor = KeyBERT('distilbert-base-nli-mean-tokens')
for j in range(len(array_text)):
    # The name bert_keywords avoids shadowing the summa keywords module imported earlier
    bert_keywords = kw_extractor.extract_keywords(array_text[j], stop_words='english') #, keyphrase_length=1
    print("Keywords of article", str(j+1), "\n", bert_keywords)

Keywords of article 1 
 ['coronavirus', 'pneumonia', 'vaccines', 'virus', 'pandemic']
Keywords of article 2 
 ['recessions', 'recession', 'unemployment', 'pandemic', 'worsening']
Keywords of article 3 
 ['austrias', 'vienna', 'austria', 'austrian', 'viennas']
Keywords of article 4 
 ['algorithms', 'algorithm', 'computational', 'computers', 'mathematical']
Keywords of article 5 
 ['graphs', 'graph', 'databases', 'web', 'database']
