<a href="https://colab.research.google.com/github/bhadreshpsavani/NLP-based-Article-Analysis/blob/main/Keyword_Extraction_and_Ranking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Keyword Extraction and Ranking:
We have a query and an article. **We want to find the keywords/tags in the article that best match the given query.**

## Approach:
I see this problem as two tasks:
1. Keyword Extraction
2. Similarity check
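
The two tasks compose into a simple two-stage pipeline. The sketch below only illustrates that structure: `rank_keywords_for_query`, `toy_extract` and `toy_score` are hypothetical names, and the toy functions are stand-ins for the real keyword extractor and similarity model used later in this notebook.

```python
def rank_keywords_for_query(article, query, extract, score):
    """Stage 1: extract candidate keywords. Stage 2: score each one against the query."""
    candidates = extract(article)
    scored = [(keyword, score(query, keyword)) for keyword in candidates]
    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-ins so the sketch runs without any model.
def toy_extract(article):
    # Treat every distinct lowercase word longer than 3 characters as a keyword.
    seen = []
    for word in article.lower().split():
        if len(word) > 3 and word not in seen:
            seen.append(word)
    return seen


def toy_score(query, keyword):
    # 1.0 if the keyword literally appears in the query, else 0.0.
    return 1.0 if keyword in query.lower().split() else 0.0


ranked = rank_keywords_for_query(
    "Google is acquiring Kaggle", "what does google want", toy_extract, toy_score
)
print(ranked)
```

Passing the extractor and the scorer as arguments keeps the two stages independent, so either one can be swapped out without touching the other.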

## 1. Keyword Extraction:
**We have to find keywords in the given article.**

There are state-of-the-art unsupervised approaches such as TF-IDF, KP-Miner, RAKE, TextRank, SingleRank, ExpandRank, TopicRank, TopicalPageRank, PositionRank and MultipartiteRank, which rely on statistics of the text, and one supervised method (KEA).
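
To make "statistics of the text" concrete, here is a minimal TF-IDF sketch in plain Python. It is only an illustration under simplifying assumptions (whitespace tokenization, a three-sentence toy corpus, unsmoothed IDF); `tf_idf` is a hypothetical helper, not part of any library used in this notebook.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document (toy whitespace tokenizer)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({
            # Term frequency times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores


docs = [
    "google is acquiring kaggle",
    "kaggle hosts machine learning competitions",
    "google is hosting a cloud conference",
]
scores = tf_idf(docs)
for term, value in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {value:.3f}")
```

Terms that appear in several documents are pushed down, which is why TF-IDF needs a corpus, unlike a single-document method.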

According to the YAKE authors, experimental results on twenty datasets (see the Benchmark section of the [YAKE](https://github.com/LIAAD/yake) README) show that YAKE significantly outperforms these state-of-the-art methods across collections of different sizes, languages and domains.

### Features:
* Unsupervised approach
* Corpus-Independent
* Domain and Language Independent
* Single-Document

In [1]:
# install packages
!pip install -q  git+https://github.com/LIAAD/yake
!pip install -q transformers


In [2]:
import yake
from transformers import pipeline

In [3]:
## Example
text = "Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning "\
"competitions. Details about the transaction remain somewhat vague, but given that Google is hosting its Cloud "\
"Next conference in San Francisco this week, the official announcement could come as early as tomorrow. "\
"Reached by phone, Kaggle co-founder CEO Anthony Goldbloom declined to deny that the acquisition is happening. "\
"Google itself declined 'to comment on rumors'. Kaggle, which has about half a million data scientists on its platform, "\
"was founded by Goldbloom  and Ben Hamner in 2010. "\
"The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, "\
"it has managed to stay well ahead of them by focusing on its specific niche. "\
"The service is basically the de facto home for running data science and machine learning competitions. "\
"With Kaggle, Google is buying one of the largest and most active communities for data scientists - and with that, "\
"it will get increased mindshare in this community, too (though it already has plenty of that thanks to Tensorflow "\
"and other projects). Kaggle has a bit of a history with Google, too, but that's pretty recent. Earlier this month, "\
"Google and Kaggle teamed up to host a $100,000 machine learning competition around classifying YouTube videos. "\
"That competition had some deep integrations with the Google Cloud Platform, too. Our understanding is that Google "\
"will keep the service running - likely under its current name. While the acquisition is probably more about "\
"Kaggle's community than technology, Kaggle did build some interesting tools for hosting its competition "\
"and 'kernels', too. On Kaggle, kernels are basically the source code for analyzing data sets and developers can "\
"share this code on the platform (the company previously called them 'scripts'). "\
"Like similar competition-centric sites, Kaggle also runs a job board, too. It's unclear what Google will do with "\
"that part of the service. According to Crunchbase, Kaggle raised $12.5 million (though PitchBook says it's $12.75) "\
"since its   launch in 2010. Investors in Kaggle include Index Ventures, SV Angel, Max Levchin, Naval Ravikant, "\
"Google chief economist Hal Varian, Khosla Ventures and Yuri Milner "

In [4]:
def keyword_extractor(language="en", max_ngram_size=3, deduplication_thresold=0.9, deduplication_algo='seqm', windowSize=1, numOfKeywords=20, features=None, stopwords=None):
  """
  Extract keywords from the module-level variable `text` using YAKE.

  The parameters are passed straight through to yake.KeywordExtractor()
  (see https://github.com/LIAAD/yake/blob/master/yake/yake.py):
    language               -> lan
    max_ngram_size         -> n
    deduplication_thresold -> dedupLim
    deduplication_algo     -> dedupFunc
    windowSize             -> windowsSize
    numOfKeywords          -> top
    features               -> features
    stopwords              -> stopwords

  Returns a list of (keyword, score) tuples sorted by decreasing score.
  In YAKE a lower score means a more relevant keyword, so the most
  relevant keywords come last in the returned list.
  """
  custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, 
                                              dedupLim=deduplication_thresold, 
                                              dedupFunc=deduplication_algo, 
                                              windowsSize=windowSize, 
                                              top=numOfKeywords, 
                                              features=features,
                                              stopwords=stopwords)

  # `text` is the article defined in the example cell above
  keywords = custom_kw_extractor.extract_keywords(text)
  keywords.sort(key=lambda items: -items[1]) # decreasing score (lower score = more relevant)
  return keywords
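
The deduplication threshold decides how similar two candidate keywords may be before one of them is dropped. Below is a minimal sketch of that idea, assuming `difflib.SequenceMatcher` from the standard library as the similarity measure. YAKE's own `seqm` option may compute similarity differently, and `deduplicate` is a hypothetical helper, not a YAKE function.

```python
from difflib import SequenceMatcher


def deduplicate(candidates, threshold=0.9):
    """Keep a candidate only if it is not too similar to any already kept one."""
    kept = []
    for cand in candidates:
        # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
        if all(SequenceMatcher(None, cand, k).ratio() <= threshold for k in kept):
            kept.append(cand)
    return kept


candidates = ["data science", "data sciences", "google cloud", "google cloud platform"]
print(deduplicate(candidates, threshold=0.9))
print(deduplicate(candidates, threshold=0.7))
```

With a threshold of 0.9 only near-identical strings are merged, so overlapping n-grams such as "google cloud" and "google cloud platform" both survive; lowering the threshold removes more of them.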

In [5]:
%%time
language = "en"
max_ngram_size = 3
deduplication_thresold = 0.9
deduplication_algo = 'seqm'
windowSize = 1
numOfKeywords = 20

keywords = keyword_extractor(language, 
                            max_ngram_size, 
                            deduplication_thresold, 
                            deduplication_algo, 
                            windowSize, 
                            numOfKeywords)

for kw in keywords:
    print(kw)

keywords_list = [keyword[0] for keyword in keywords]

('goldbloom', 0.14611408778815776)
('service', 0.12546743261462942)
('conference in san', 0.12392066376108138)
('platform', 0.1183512305596321)
('francisco this week', 0.11519915079240485)
('machine learning competitions', 0.10773000650607861)
('google cloud', 0.10260128641464673)
('data', 0.097574333771058)
('kaggle co-founder ceo', 0.093805063905847)
('machine learning', 0.09147989238151344)
('anthony goldbloom', 0.09123482372372106)
('ceo anthony', 0.08915156857226395)
('acquiring kaggle', 0.08723571551039863)
('co-founder ceo anthony', 0.07357749587020043)
('google cloud platform', 0.06261974476422487)
('anthony goldbloom declined', 0.06176910090701819)
('san francisco', 0.048810837074825336)
('ceo anthony goldbloom', 0.029946071606210194)
('kaggle', 0.0289005976239829)
('google', 0.026580863364597897)
CPU times: user 108 ms, sys: 1.85 ms, total: 110 ms
Wall time: 111 ms
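
YAKE assigns lower scores to more relevant keywords, so in the output above the strongest candidates ("google", "kaggle") appear at the bottom. A small sketch of re-ordering such a list with the most relevant keyword first, using a few rounded values from the output above; `most_relevant_first` is a hypothetical helper.

```python
def most_relevant_first(keywords):
    """Sort (keyword, score) tuples by ascending YAKE score (lower = more relevant)."""
    return sorted(keywords, key=lambda item: item[1])


# A few (keyword, score) pairs taken from the output above, rounded.
sample = [
    ("goldbloom", 0.1461),
    ("machine learning", 0.0915),
    ("kaggle", 0.0289),
    ("google", 0.0266),
]
for keyword, score in most_relevant_first(sample):
    print(f"{score:.4f}  {keyword}")
```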


## 2. Getting Suitable Keywords

**We have a list of keywords and a query; we need to find the keywords that best match the query.**

We will use the Hugging Face zero-shot classification pipeline, which uses a model trained on the NLI (Natural Language Inference) task. NLI is a classification task with three labels:
1. Contradiction
2. Neutral
3. Entailment

Basically, the model is trained to compare two sentences. Following the zero-shot classification formulation, we can use this model to check the similarity of keywords/labels/tags with the query.
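
The idea can be sketched without loading a model: each keyword is turned into a hypothesis sentence, the NLI model compares it with the query (the premise), and the entailment and contradiction logits are turned into a score. The code below is a simplified illustration of the approach described in the first reference. The template string and the logit values are assumptions made for this example, and `build_hypothesis` and `entailment_score` are hypothetical helpers.

```python
import math


def build_hypothesis(label, template="This example is {}."):
    """Turn a candidate keyword into a hypothesis sentence for the NLI model."""
    return template.format(label)


def entailment_score(contradiction_logit, entailment_logit):
    """Softmax over the contradiction and entailment logits; return P(entailment).

    The neutral logit is ignored, so each label is scored independently.
    """
    m = max(contradiction_logit, entailment_logit)  # subtract max for numerical stability
    e_contra = math.exp(contradiction_logit - m)
    e_entail = math.exp(entailment_logit - m)
    return e_entail / (e_contra + e_entail)


premise = "what is the intention of Google?"
hypothesis = build_hypothesis("google cloud")
# Made-up logits, only to show the arithmetic; a real model would produce these.
score = entailment_score(contradiction_logit=-2.0, entailment_logit=3.0)
print(premise, "|", hypothesis, "|", round(score, 4))
```

Because every keyword gets its own entailment-vs-contradiction comparison, the scores do not have to sum to 1 across keywords.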

### Advantage of the Zero-Shot Classification Pipeline:
1. Domain Independent

In [8]:
# The pipeline uses facebook/bart-large-mnli by default; a custom model
# trained on an NLI task could be passed via the `model` argument instead.
# classifier = pipeline("zero-shot-classification") # cpu
classifier = pipeline("zero-shot-classification", device=0) # to utilize GPU

Some weights of the model checkpoint at facebook/bart-large-mnli were not used when initializing BartModel: ['model.encoder.version', 'model.decoder.version']
- This IS expected if you are initializing BartModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BartModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at facebook/bart-large-mnli were not used when initializing BartForSequenceClassification: ['model.encoder.version', 'model.decoder.version']
- This IS expected if you are initializing BartForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification m

In [9]:
%%time
query = "what is the intention of Google?"
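# multi_label=True scores every keyword independently
# (older transformers releases call this argument `multi_class`)
result = classifier(query, keywords_list, multi_label=True)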

CPU times: user 117 ms, sys: 51.6 ms, total: 169 ms
Wall time: 634 ms


In [10]:
# map each score to its label, then sort by score (the dict key) in ascending order
results_dict = {score: label for label, score in zip(result['labels'], result['scores'])}
dict(sorted(results_dict.items(), key=lambda item: item[0]))

{0.0013416995061561465: 'ceo anthony goldbloom',
 0.0015500524314120412: 'anthony goldbloom declined',
 0.0018288666615262628: 'anthony goldbloom',
 0.002348936628550291: 'goldbloom',
 0.003008637111634016: 'francisco this week',
 0.0035422123037278652: 'ceo anthony',
 0.00473477877676487: 'kaggle',
 0.005838700570166111: 'acquiring kaggle',
 0.005853407084941864: 'kaggle co-founder ceo',
 0.008472663350403309: 'san francisco',
 0.012140297330915928: 'conference in san',
 0.0408928208053112: 'machine learning competitions',
 0.1127646192908287: 'machine learning',
 0.14107631146907806: 'data',
 0.16517511010169983: 'co-founder ceo anthony',
 0.36291754245758057: 'google cloud',
 0.38278794288635254: 'google cloud platform',
 0.42713454365730286: 'service',
 0.6460312008857727: 'platform',
 0.9148185849189758: 'google'}
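
To turn these scores into a final answer, one option is to keep only the best-scoring keywords. A minimal sketch, using a few rounded label/score pairs from the output above; `top_keywords` and its `k` and `min_score` parameters are hypothetical and not part of the pipeline API.

```python
def top_keywords(labels, scores, k=3, min_score=0.0):
    """Return up to k (label, score) pairs with score >= min_score, best first."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return [(label, score) for label, score in ranked if score >= min_score][:k]


# A few label/score pairs from the output above, rounded.
labels = ["service", "google", "platform", "data"]
scores = [0.4271, 0.9148, 0.6460, 0.1411]

print(top_keywords(labels, scores, k=2))
print(top_keywords(labels, scores, k=10, min_score=0.4))
```

Keying the result by label rather than by score also avoids losing entries when two keywords happen to receive the same score.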

Inference Time on CPU:
```
CPU times: user 3.49 s, sys: 48.4 ms, total: 3.54 s
Wall time: 3.55 s
```
Inference Time on GPU:
```
CPU times: user 117 ms, sys: 51.6 ms, total: 169 ms
Wall time: 634 ms
```

### References:
* https://joeddav.github.io/blog/2020/05/29/ZSL.html
* https://colab.research.google.com/drive/1jocViLorbwWIkTXKwxCOV9HLTaDDgCaw?usp=sharing