danielpreotiuc/textrank

TextRank

Description

This is an implementation of the TextRank algorithm for keyword extraction from documents. It adapts the PageRank algorithm to documents and was originally published in this article.

Intuitively, it builds a graph of words that are linked by the number of times they appear in the same context (here, the same sentence). Then, it finds the words that are most central in this graph, i.e. those that appear in context with as many other words as possible from separate parts of the graph. To further refine the results, it performs part-of-speech tagging on all the documents and takes into account only nouns, as these are known to be the most distinctive for summarization purposes. Then, a chunker identifies names like ‘Wall Street’ or ‘New York’ and collocations such as ‘ballistic missile’ or ‘coal miner’. Finally, it outputs lemmatized words in order to merge words with the same lemma, such as ‘republican’ and ‘republicans’.
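The graph-and-ranking idea described above can be illustrated with a minimal, dependency-free sketch. It is not the code in `textrank.py`: the function names `build_graph` and `rank_words`, the damping factor of 0.85 and the fixed iteration count are assumptions chosen for illustration, and the part-of-speech filtering, chunking and lemmatization steps are left out. It assumes each sentence has already been tokenized into a list of words.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(sentences):
    """Link every pair of words by the number of sentences they share.

    `sentences` is an iterable of token lists (hypothetical input format).
    """
    graph = defaultdict(lambda: defaultdict(float))
    for sentence in sentences:
        # Count each word once per sentence, then connect all pairs.
        for a, b in combinations(sorted(set(sentence)), 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def rank_words(graph, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration on the undirected word graph."""
    nodes = list(graph)
    if not nodes:
        return {}
    scores = {node: 1.0 / len(nodes) for node in nodes}
    # Total edge weight leaving each node, used to normalize its votes.
    out_weight = {node: sum(graph[node].values()) for node in nodes}
    for _ in range(iterations):
        new_scores = {}
        for node in nodes:
            incoming = sum(
                scores[other] * weight / out_weight[other]
                for other, weight in graph[node].items()
            )
            new_scores[node] = (1 - damping) / len(nodes) + damping * incoming
        scores = new_scores
    return scores


# Toy input, invented for illustration only.
sentences = [["tax", "wall", "street"], ["tax", "health"], ["tax", "street"]]
scores = rank_words(build_graph(sentences))
top_word = max(scores, key=scores.get)
```

In this toy input, `tax` co-occurs with every other word, so it ends up as the most central node. A word that appears only in single-word sentences never gets an edge and is therefore not ranked in this sketch.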

For the script to run, you need to install NLTK.

Usage

textrank.py

    python textrank.py folder

folder - folder with the documents to extract keywords from

Output: a folder 'keywords-folder-textrank' with the keywords and their scores, one per line, separated by a colon. This format can be used to generate word clouds using Wordle.
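For further processing, the `keyword:score` lines can be read back with a few lines of Python. The following is a sketch under assumptions: the helper name `parse_keywords` is hypothetical, the scores are assumed to parse as floats, and the sample lines are invented rather than taken from a real run.

```python
def parse_keywords(lines):
    """Parse 'keyword:score' lines into (keyword, score) pairs, best first."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last colon so a keyword containing ':' stays intact.
        keyword, _, score = line.rpartition(":")
        pairs.append((keyword, float(score)))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Invented sample lines in the assumed output format.
sample = ["wall street:0.12\n", "\n", "tax:0.30\n"]
keywords = parse_keywords(sample)
```

Because an open file is an iterable of lines, it can be passed to the function directly, e.g. `parse_keywords(open(path))` for a file inside the output folder.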

Examples

Find the most central words from the US primary debate speeches.

    python textrank.py candidates

Bernie Sanders' primary debate speeches keywords, generated using Wordle:

![Sanders' keywords](http://www.sas.upenn.edu/~danielpr/sanders-trsentw.png)

Find the most central words from the NLP conferences accepted papers.

    python textrank.py conferences

ACL 2015 titles

![ACL 2015](http://www.sas.upenn.edu/~danielpr/acl15.png)

EMNLP 2015 titles

![EMNLP 2015](http://www.sas.upenn.edu/~danielpr/emnlp15.png)

NAACL 2016 titles

![NAACL 2016](http://www.sas.upenn.edu/~danielpr/naacl16.png)

ACL 2016 Short Paper titles

![ACL 2016 Short Papers](http://www.sas.upenn.edu/~danielpr/acl16short.png)
