# TextRank by Gensim
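- Install: https://pypi.org/project/gensim/#files (the `gensim.summarization` module used below was removed in gensim 4.0, so a 3.x release is needed, e.g. `pip install "gensim<4.0"`)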
- Tutorial: https://rare-technologies.com/text-summarization-with-gensim/
- Documentation: https://radimrehurek.com/gensim/summarization/keywords.html
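
Before running the cells below, it can help to confirm that the installed gensim provides `gensim.summarization` (releases from 4.0 on no longer include it). The following is a minimal sketch using only the standard library; the helper name `has_gensim_summarization` is hypothetical and not part of gensim.

```python
import importlib.util


def has_gensim_summarization():
    """Return True if the installed gensim provides gensim.summarization."""
    try:
        # find_spec imports the parent package (gensim) and then looks up
        # the submodule; it returns None when the submodule does not exist.
        return importlib.util.find_spec("gensim.summarization") is not None
    except Exception:
        # gensim itself is missing or fails to import.
        return False


if not has_gensim_summarization():
    print("gensim.summarization not found; a gensim 3.x release is required")
```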

In [1]:
import pandas as pd
# load the dataset
df = pd.read_csv('papers.csv')
df.head()

| Unnamed: 0 | id | year | title | event_type | pdf_name | abstract | paper_text |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1987 | Self-Organization of Associative Database and ... | | 1-self-organization-of-associative-database-an... | Abstract Missing | 767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA... |
| 1 | 10 | 1987 | A Mean Field Theory of Layer IV of Visual Cort... | | 10-a-mean-field-theory-of-layer-iv-of-visual-c... | Abstract Missing | 683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU... |
| 2 | 100 | 1988 | Storing Covariance by the Associative Long-Ter... | | 100-storing-covariance-by-the-associative-long... | Abstract Missing | 394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n... |
| 3 | 1000 | 1994 | Bayesian Query Construction for Neural Network... | | 1000-bayesian-query-construction-for-neural-ne... | Abstract Missing | Bayesian Query Construction for Neural\nNetwor... |
| 4 | 1001 | 1994 | Neural Network Ensembles, Cross Validation, an... | | 1001-neural-network-ensembles-cross-validation... | Abstract Missing | Neural Network Ensembles, Cross\nValidation, a... |


In [3]:
import gensim

### Try a short text to see how it works

In [5]:
text = (
    "Non-negative matrix factorization (NMF) has previously been shown to "
    "be a useful decomposition for multivariate data. Two different multiplicative "
    "algorithms for NMF are analyzed. They differ only slightly in the "
    "multiplicative factor used in the update rules. One algorithm can be shown to "
    "minimize the conventional least squares error while the other minimizes the "
    "generalized Kullback-Leibler divergence. The monotonic convergence of both "
    "algorithms can be proven using an auxiliary function analogous to that used "
    "for proving convergence of the Expectation-Maximization algorithm. The algorithms "
    "can also be interpreted as diagonally rescaled gradient descent, where the "
    "rescaling factor is optimally chosen to ensure convergence."
)

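# Extract keywords with TextRank
gensim.summarization.keywords(text,
                             ratio=0.5,                # share of candidate words to keep (used when words is None)
                             words=None,               # fixed number of keywords; None means use ratio
                             split=True,               # return a list instead of one string
                             scores=True,              # return (keyword, score) pairs
                             pos_filter=('NN', 'JJ'),  # keep nouns and adjectives
                             lemmatize=True,           # lemmatize the extracted words
                             deacc=True)               # strip accents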

[('factor', 0.3066938262406012),
 ('convergence', 0.30655322603233187),
 ('rescaling', 0.24358369124679402),
 ('multiplicative', 0.23877639442201237),
 ('function', 0.23315315782740745),
 ('kullback', 0.2073987479309427),
 ('gradient', 0.17745105527488156),
 ('algorithm', 0.16886390349571123),
 ('matrix', 0.1654048954497839),
 ('useful', 0.1597530896224831),
 ('squares', 0.15975308962248302),
 ('optimally chosen', 0.159753089622483),
 ('rules', 0.15975308962248294)]
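
To make the scores above less of a black box, here is a minimal sketch of the general TextRank idea: link words that occur close together, then rank the words with PageRank. It is a simplified illustration, not gensim's implementation. The function name `textrank_keywords`, the tiny stopword list, the window size and the iteration count are all assumptions, and there is no POS filtering or lemmatization, so the scores will not match the output above.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {
    "a", "an", "the", "of", "to", "in", "for", "and", "is", "are", "be",
    "been", "has", "can", "that", "as", "by", "on", "with", "they", "both",
}


def textrank_keywords(text, window=2, damping=0.85, iterations=50, top_n=5):
    """Rank words by PageRank on a word co-occurrence graph (simplified)."""
    # 1. Tokenize, lowercase, and drop stopwords and very short tokens.
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS and len(w) > 2]

    # 2. Build an undirected graph: two words are neighbors if they appear
    #    within `window` positions of each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # 3. Run a fixed number of PageRank iterations on the graph.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }

    # 4. Return the highest-scoring words with their scores.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

Calling `textrank_keywords(text)` on the abstract above returns `(word, score)` pairs in the same shape as gensim's `scores=True` output, which makes it easy to compare the two rankings side by side.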

In [6]:
# Summarize the text: ratio=0.5 keeps about half of the sentences,
# split=True returns them as a list
print('SUMMARY: ', gensim.summarization.summarize(text,
                                                 ratio=0.5,
                                                 split=True))

SUMMARY:  ['Non-negative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data.', 'Two different multiplicative algorithms for NMF are analyzed.', 'They differ only slightly in the multiplicative factor used in the update rules.']
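
The same graph-ranking idea can be applied to whole sentences. The sketch below scores sentences with PageRank over a sentence-similarity graph, using the word-overlap similarity from the original TextRank paper (shared words divided by the sum of the log sentence lengths). It is an illustration only: the function name `rank_sentences` is hypothetical, the regex sentence splitter is naive (it breaks on abbreviations), sentence length is measured in unique words for simplicity, and gensim's own summarizer differs in its details.

```python
import math
import re


def rank_sentences(text, damping=0.85, iterations=50):
    """Return (sentence, score) pairs sorted from most to least central."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokens = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    n = len(sentences)

    def similarity(a, b):
        # Shared words, normalized by the log of each sentence's length.
        if len(a) < 2 or len(b) < 2:
            return 0.0
        return len(a & b) / (math.log(len(a)) + math.log(len(b)))

    # Symmetric weight matrix with a zero diagonal.
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]

    # Weighted PageRank with a fixed number of iterations.
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return sorted(zip(sentences, scores), key=lambda pair: pair[1], reverse=True)
```

A summary can then be formed by taking the top-ranked sentences and putting them back in their original order.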


### Try a longer text

In [7]:
def get_keywords_gensim(idx, docs):
    """Return the top 10 TextRank keywords for the document docs[idx]."""
    keywords = gensim.summarization.keywords(docs[idx],
                                             ratio=None,
                                             words=10,
                                             split=True,
                                             scores=False,
                                             pos_filter=None,
                                             lemmatize=True,
                                             deacc=True)
    return keywords

def print_results(idx, keywords, df):
    """Print the title, abstract and extracted keywords for row idx of df."""
    print("\n=====Title=====")
    print(df['title'][idx])
    print("\n=====Abstract=====")
    print(df['abstract'][idx])
    print("\n===Keywords===")
    for k in keywords:
        print(k)

In [8]:
idx = 941
keywords = get_keywords_gensim(idx, df['paper_text'])
print_results(idx, keywords, df)


=====Title=====
Algorithms for Non-negative Matrix Factorization

=====Abstract=====
Non-negative matrix factorization (NMF) has previously been shown to 
be a useful decomposition for multivariate data. Two different multi- 
plicative algorithms for NMF are analyzed. They differ only slightly in 
the multiplicative factor used in the update rules. One algorithm can be 
shown to minimize the conventional least squares error while the other 
minimizes the generalized Kullback-Leibler divergence. The monotonic 
convergence of both algorithms can be proven using an auxiliary func- 
tion analogous to that used for proving convergence of the Expectation- 
Maximization algorithm. The algorithms can also be interpreted as diag- 
onally rescaled gradient descent, where the rescaling factor is optimally 
chosen to ensure convergence. 

===Keywords===
factorizations
constraint
algorithm
matrix
functions
use
optimization
computing
theorems
data
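
The abstract printed above contains words hyphenated across line breaks ("multi- plicative", "func- tion"), which suggests the text in this dataset carries PDF line-wrap artifacts that can split a single term into two tokens. A minimal cleaning sketch, assuming a plain regex is good enough; the helper name `join_hyphenated_line_breaks` is hypothetical:

```python
import re


def join_hyphenated_line_breaks(text):
    """Rejoin words that were split by a hyphen at the end of a line."""
    # "multi- \nplicative" -> "multiplicative".
    # Limitation: a genuine hyphen at a line end is also removed, so
    # "Expectation- \nMaximization" becomes "ExpectationMaximization".
    return re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)
```

One way to use it would be to clean a document before passing it to the keyword extractor, for example `join_hyphenated_line_breaks(df['paper_text'][idx])`, and compare the resulting keywords with the list above.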
