#### Introduction
In this kernel, I focus on keyphrase extraction from NLP research papers using the graph-based algorithms implemented in the pke package. Later, I also run some experiments using these keyphrases.

In [None]:
#Installing pke

!pip install git+https://github.com/boudinfl/pke.git

In [None]:
import numpy as np 
import pandas as pd 
import os
import seaborn as sns
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
from sklearn.cluster import AffinityPropagation
import re
import pke

In [None]:
papers = pd.read_csv('../input/201812_CL_Github.csv')

In [None]:
papers.head()

In [None]:
papers.shape
#Total 106 papers given

#### Keyphrase Extraction

In [24]:
# Keyphrase extraction (top 10) from abstracts using the TextRank algorithm

def extract_keyphrases(caption, n):
    extractor = pke.unsupervised.TextRank()
    extractor.load_document(caption)
    extractor.candidate_selection()   # select candidate phrases
    extractor.candidate_weighting()   # score candidates via the word graph
    keyphrases = extractor.get_n_best(n=n, stemming=False)  # keep unstemmed surface forms
    print(keyphrases, "\n")
    return keyphrases

papers['Abstract_Keyphrases'] = papers['Abstract'].apply(lambda abstract: extract_keyphrases(abstract, 10))

[('discourse dependency structures', 0.12811645856795542), ('multi - party dialogues', 0.10025102656641602), ('various nlp tasks such', 0.10025068656641602), ('dependency relations', 0.08458330261772845), ('deep sequential model', 0.08274360188426316), ('discourse structures', 0.08199330792172), ('elementary discourse units', 0.0742503515007769), ('edu sequence', 0.06523762720211033), ('concerned edus', 0.06523751720211032), ('previous edu', 0.06523695720211033)] 

[('useful speaker representations', 0.06972180037381355), ('learning good representations', 0.06086258764306304), ('different objective functions', 0.05430091253393665), ('- supervised settings', 0.054300812533936656), ('raw speech waveform', 0.05430000253393665), ('generative adversarial networks', 0.054299482533936655), ('high dimensional spaces', 0.05429918253393665), ('speaker identification', 0.05007968313152715), ('speaker identities', 0.05007866313152715), ('architecture similar', 0.04699610630452334)] 

[('current fa

[('directional recurrent networks such', 0.17094058196266732), ('single backward pass', 0.14095169201206734), ('art methods', 0.11096154206146733), ('new state', 0.09566736303231382), ('current state', 0.09566655303231383), ('art results', 0.08547125598133366), ('single feed', 0.08547105598133366), ('entire sequence', 0.08547097598133367), ('video summarization', 0.08547092598133367), ('attention mechanism', 0.08547037598133367)] 

[('sophisticated natural language processing', 0.08634600528969595), ('several language information sources', 0.07929096290687919), ('user activity logs', 0.06537111610187978), ('time capable named entity', 0.06334288098961187), ('knowledge work support systems', 0.06334152098961186), ('explicit user input', 0.05606995042305281), ('german language', 0.05428396971510245), ('high speed methods', 0.052231381983551366), ('information extraction task', 0.0512422307006888), ('much runtime performance', 0.0475087382422089)] 

[('mean reciprocal rank', 0.12588833010

[('pre - trained word embeddings', 0.1143873901809849), ('softmax layer', 0.06417792048361605), ('novel probabilistic loss', 0.06389857357827479), ('large memory footprint', 0.06389830357827478), ('reference translations', 0.055407344854719376), ('translation quality', 0.05540685485471938), ('machine translation', 0.05540649485471938), ('slowest layer', 0.051972096707184244), ('final layer', 0.051971856707184245), ('training time', 0.0494610521389667)] 

[('neural machine translation tasks', 0.11722045008676071), ('non - smooth prediction', 0.11005002857973448), ('multi - class classification', 0.11004995857973447), ('english translation task', 0.10212226878506919), ('text summarization task show', 0.09472568991947442), ('non - smoothness', 0.09467991454713402), ('wise regularization method', 0.07919062041404971), ('promising bleu scores', 0.07203063152460983), ('conventional mle loss', 0.07203053152460982), ('target token', 0.062341035461953014)] 

[('concordance correlation coefficie

[('conditional bert contextual augmentation', 0.18745129641720287), ('conditional masked language model', 0.17670879612467263), ('deep bidirectional language model', 0.14909651505476265), ('various different text classification tasks', 0.14662942597845632), ('masked language model', 0.12109322931212022), ('contextual augmentation augments', 0.12086360481673163), ('novel data augmentation method', 0.11444277054281826), ('conditional bert', 0.11236815396973727), ('unidirectional language model', 0.10925064113662006), ('data augmentation methods', 0.09596676904458763)] 

[('deep latent variable', 0.13660748368590403), ('latent variable models', 0.12025122338342067), ('posterior inference intractable', 0.1035787632701033), ('latent variable objectives', 0.10329498934632711), ('non - differentiability', 0.09389781361502346), ('powerful function approximators', 0.09389725361502345), ('variational inference', 0.08196130838684205), ('deep parameterizations', 0.07171002612961813), ('deep learni

[('associated training data consists', 0.11811418295102571), ('caption data', 0.08334931745847626), ('training data', 0.08334733745847626), ('truth object detections', 0.07825179396185536), ('alternative data sources', 0.07600846554664911), ('object classes', 0.07560246552349273), ('open images image', 0.0749615876380453), ('limited visual concepts', 0.06818193818177588), ('many more classes', 0.06690351639276874), ('novel object', 0.06431094522566906)] 

[('robust structural representations', 0.15704979097131203), ('rnn autoencoder representations', 0.1377812346117322), ('tensor product representations', 0.13506021773408644), ('continuous vector representations', 0.12629768842240993), ('sentence representation learning', 0.11291875126796211), ('sequence representations', 0.11210879329969066), ('vector representations', 0.11210762329969066), ('sensitive representations', 0.10010730750443916), ('tensor product decomposition networks', 0.09746600884591206), ('interpretable compositional 

[('global anchor method recovers', 0.12222159627487433), ('global anchor method', 0.10379303645975081), ('disparate text corpora', 0.0843901956539987), ('graph laplacian technique', 0.08439002565399871), ('level language shifts', 0.0843888556539987), ('distributional tools such', 0.0843886356539987), ('active research area', 0.0843884356539987), ('alignment method', 0.0688399949647354), ('domain clustering', 0.06280454524494689), ('domain adaptation', 0.06280432524494689)] 

[('step random walks', 0.11811124621031645), ('most current word', 0.11811059621031646), ('natural language processing', 0.11811040621031645), ('various queries', 0.07874157747354428), ('experiment results', 0.07874136747354428), ('input neighborhoods', 0.07874105747354428), ('minimal discrepancy', 0.07874101747354428), ('novel neighbor', 0.07874077747354429), ('many text', 0.07874039747354428), ('important starting', 0.07874027747354428)] 

[('language speech emotion database', 0.20925917398703983), ('cross - ling

[('peters et al', 0.14679619063095162), ('pre - training', 0.13333352333333215), ('devlin et', 0.11581431348412706), ('art results', 0.08888986888888806), ('new state', 0.08888979888888807), ('model capacity', 0.08888972888888806), ('large part', 0.08888964888888806), ('outperforms elmo', 0.08888961888888806), ('additional languages', 0.08888912888888806), ('previous work', 0.08888892888888807)] 
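
For intuition about what a graph-based ranker is doing, here is a minimal sketch of the TextRank idea in plain Python: words that appear next to each other are linked in an undirected graph, and each word is then scored with a PageRank-style power iteration. This is not pke's implementation. The regex tokenizer, the `window`, `damping` and `iterations` defaults, and the function name `textrank_words` are all assumptions made for illustration, and the sketch scores single words rather than assembling phrases.

```python
import re
from collections import defaultdict


def textrank_words(text, window=2, damping=0.85, iterations=50):
    """Score words with a PageRank-style walk over a co-occurrence graph.

    Illustrative sketch only: the tokenizer and all default values are
    assumptions, not pke's settings.
    """
    tokens = re.findall(r"[a-z]+", text.lower())

    # Link distinct words that occur within `window` tokens of each other
    # (window=2 links adjacent words only).
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Power iteration: every word passes its score equally to its neighbors.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[word]
            )
            for word in neighbors
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = textrank_words("speech recognition systems use speech models")
```

In this toy sentence, `speech` has the most neighbors and ends up with the highest score, which is the behavior the graph formulation is meant to capture: well-connected words rank first.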



In [26]:
#titles & keyphrases

papers.loc[:,['Title','Abstract_Keyphrases']]

Unnamed: 0,Title,Abstract_Keyphrases
0,A Deep Sequential Model for Discourse Parsing ...,"[(discourse dependency structures, 0.128116458..."
1,Learning Speaker Representations with Mutual I...,"[(useful speaker representations, 0.0697218003..."
2,"Fake News: A Survey of Research, Detection Met...","[(current fake news research, 0.15535480924421..."
3,A Study on Dialogue Reward Prediction for Open...,"[(training dialogue reward predictors, 0.17872..."
4,Improved and Robust Controversy Detection in G...,"[(cross - domain performance, 0.09234450374810..."
5,Learning Representations of Social Media Users,"[(distant user information, 0.0709560415423658..."
6,Clinical Document Classification Using Labeled...,"[(supervised learning pipeline, 0.094264009320..."
7,Building Sequential Inference Models for End-t...,"[(general pre - trained word embeddings, 0.156..."
8,Toward Scalable Neural Dialogue State Tracking...,[(accurate neural dialogue state tracking mode...
9,A Survey on Semantic Parsing,"[(semi - structured knowledge bases, 0.1503760..."


#### Titles Clustering

In [27]:
titles = papers['Title']

In [28]:
titles[1]

'Learning Speaker Representations with Mutual Information'

In [29]:
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(titles)
tfidf_vectorizer = TfidfTransformer().fit(counts)
tfidf_titles = tfidf_vectorizer.transform(counts)


In [30]:
tfidf_titles

<106x479 sparse matrix of type '<class 'numpy.float64'>'
	with 967 stored elements in Compressed Sparse Row format>
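
To make the weighting concrete, the sketch below recomputes TF-IDF in plain Python using the formula scikit-learn documents for the `TfidfTransformer` defaults: a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The whitespace tokenizer and the name `tfidf_rows` are assumptions for illustration; `CountVectorizer` tokenizes differently, so values on the real titles may not match exactly.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Return one {term: weight} dict per document (illustrative sketch)."""
    # Assumption: naive lowercase whitespace tokenization.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed inverse document frequency.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # L2-normalize so each non-empty row has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


rows = tfidf_rows(["speech recognition", "speech translation"])
```

Because `speech` occurs in both toy titles, it receives a lower weight than `recognition` or `translation`, which is why TF-IDF helps the clustering step focus on the distinguishing words of each title.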

In [31]:
# Affinity Propagation clustering on the TF-IDF title vectors
X = tfidf_titles
clustering = AffinityPropagation().fit(X)
clustering 

content_affinity_clusters = list(clustering.labels_)
content_affinity_clusters

[14,
 1,
 0,
 10,
 16,
 1,
 17,
 18,
 5,
 14,
 2,
 17,
 19,
 2,
 3,
 0,
 13,
 19,
 14,
 7,
 3,
 16,
 4,
 11,
 4,
 17,
 18,
 1,
 5,
 10,
 12,
 6,
 6,
 18,
 10,
 3,
 3,
 7,
 7,
 3,
 4,
 10,
 4,
 19,
 18,
 15,
 11,
 14,
 14,
 19,
 5,
 6,
 8,
 9,
 2,
 11,
 2,
 4,
 2,
 7,
 17,
 10,
 11,
 15,
 20,
 6,
 2,
 6,
 12,
 15,
 16,
 1,
 5,
 0,
 1,
 5,
 13,
 13,
 14,
 19,
 12,
 2,
 17,
 4,
 15,
 8,
 15,
 16,
 6,
 6,
 11,
 20,
 20,
 19,
 10,
 20,
 20,
 5,
 17,
 18,
 19,
 6,
 20,
 17,
 4,
 20]

In [32]:
papers['title_cluster'] = content_affinity_clusters
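
With one label per title, a quick way to review all clusters at once (rather than filtering one label at a time) is to group the titles by label and look at the cluster sizes. The following is a minimal sketch in plain Python; `group_by_label` and `largest_clusters` are hypothetical helpers, and the items and labels in the usage lines are made-up toy data.

```python
from collections import defaultdict


def group_by_label(items, labels):
    """Map each cluster label to the list of items assigned to it."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    groups = defaultdict(list)
    for item, label in zip(items, labels):
        groups[int(label)].append(item)  # int() also accepts NumPy integers
    return dict(groups)


def largest_clusters(groups, k=3):
    """Return the k (label, size) pairs with the most members."""
    sizes = [(label, len(members)) for label, members in groups.items()]
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)[:k]


# Toy data for illustration only.
groups = group_by_label(["paper a", "paper b", "paper c"], [1, 0, 1])
biggest = largest_clusters(groups, k=1)
```

In this notebook the same helper could be called as `group_by_label(papers['Title'], clustering.labels_)`, assuming the variables defined in the cells above.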

In [50]:
#Let's check all papers in cluster 11

papers_cluster11 = papers.loc[papers['title_cluster']==11,['Title','Abstract_Keyphrases']]

In [51]:
papers_cluster11

Unnamed: 0,Title,Abstract_Keyphrases
23,The USTC-NEL Speech Translation system at IWSL...,"[(speech recognition output style text, 0.2035..."
46,Speech and Speaker Recognition from Raw Wavefo...,"[(novel convolutional neural network, 0.096922..."
55,Fully Convolutional Speech Recognition,"[(external convolutional language model, 0.139..."
62,wav2letter++: The Fastest Open-source Speech R...,"[(source speech recognition systems, 0.1420607..."
90,Cross Lingual Speech Emotion Recognition: Urdu...,"[(language speech emotion database, 0.20925917..."


In [52]:
dict(sorted(papers_cluster11.values.tolist())) 

{'Cross Lingual Speech Emotion Recognition: Urdu vs. Western Languages': [('language speech emotion database',
   0.20925917398703983),
  ('cross - lingual emotion recognition', 0.17286046093582683),
  ('lingual speech emotion recognition', 0.1662986326454596),
  ('automatic speech emotion recognition systems', 0.1614584469281196),
  ('different western languages', 0.1410688654169332),
  ('such limited languages', 0.13859719235951926),
  ('speech emotion recognition', 0.13197124764196375),
  ('urdu language', 0.13075541803609486),
  ('unseen language such', 0.12844079180356072),
  ('adaptive emotion recognition system', 0.11676755469865711)],
 'Fully Convolutional Speech Recognition': [('external convolutional language model',
   0.13910843890799893),
  ('art speech recognition systems', 0.13138509205291768),
  ('times more acoustic data', 0.12063729554379947),
  ('convolutional neural networks', 0.11552568542748626),
  ('convolutional approach', 0.0849454040106537),
  ('wall street jo

The cluster above seems to contain papers on speech processing. We can also see that the top 3 keyphrases extracted by the TextRank algorithm correspond well to the paper titles.
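
One way to put a number on this impression is to measure how many title words reappear in the top keyphrases. The sketch below is an assumption about how such a check might look and is not part of the original analysis: `title_coverage` is a hypothetical helper, the tokenizer is a simple regex, and no stemming or stop-word removal is applied.

```python
import re


def _words(text):
    """Set of lowercase alphanumeric tokens in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def title_coverage(title, keyphrases, top_k=3):
    """Fraction of title words that occur in the top_k keyphrases.

    `keyphrases` is a list of (phrase, score) pairs sorted by score,
    the same shape that extract_keyphrases returns.
    """
    title_words = _words(title)
    if not title_words:
        return 0.0
    phrase_words = set()
    for phrase, _score in keyphrases[:top_k]:
        phrase_words |= _words(phrase)
    return len(title_words & phrase_words) / len(title_words)


# Title and top 3 keyphrases copied from the cluster 11 output above.
coverage = title_coverage(
    "Cross Lingual Speech Emotion Recognition: Urdu vs. Western Languages",
    [
        ("language speech emotion database", 0.20925917398703983),
        ("cross - lingual emotion recognition", 0.17286046093582683),
        ("lingual speech emotion recognition", 0.1662986326454596),
    ],
)
```

For this paper, five of the nine title words (`cross`, `lingual`, `speech`, `emotion`, `recognition`) appear in the top three keyphrases. Note that `languages` does not match `language` because the sketch does no stemming.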

_Thanks for reading this notebook. Please share your feedback and upvote if you learned something new from this analysis. So far the focus has been on the TextRank algorithm only; I will also add a comparison with other keyphrase extraction algorithms, along with a short description of how these graph-based keyphrase extraction algorithms work._