Daniel Preotiuc
Latest commit 19f43df Apr 26, 2016



This is an implementation of the TextRank algorithm for keyword extraction from documents. It adapts the PageRank algorithm to documents and was originally published in this article.

Intuitively, it builds a graph of words which are linked by the number of times they appear in the same context (here, the same sentence). Then, it finds the words that are most central in this graph, i.e. those that appear in context with as many other words from separate parts of the graph as possible. To refine this further, it performs part-of-speech tagging on all the documents and takes into account only nouns, as these are known to be the most distinctive for summarization purposes. Then, a chunker identifies names like ‘Wall Street’ or ‘New York’ and collocations such as ‘ballistic missile’ or ‘coal miner’. Finally, it outputs lemmatized words in order to merge words with the same lemma, such as ‘republican’ and ‘republicans’.
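The core idea can be illustrated with a minimal sketch. This is not the script from this repository: it uses only the standard library, skips part-of-speech tagging, chunking and lemmatization, and the function names `build_cooccurrence_graph` and `textrank`, the damping factor of 0.85 and the fixed iteration count are assumptions chosen for illustration.

```python
import re
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(sentences):
    """Return {word: {neighbor: weight}}; weight counts shared sentences."""
    graph = defaultdict(lambda: defaultdict(float))
    for sentence in sentences:
        # Naive tokenization; the real pipeline would keep only nouns.
        words = sorted(set(re.findall(r"[a-z]+", sentence.lower())))
        for a, b in combinations(words, 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def textrank(graph, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration on an undirected graph."""
    nodes = list(graph)
    if not nodes:
        return {}
    scores = {n: 1.0 / len(nodes) for n in nodes}
    # Total edge weight leaving each node, used to normalize its votes.
    out_weight = {n: sum(graph[n].values()) for n in nodes}
    for _ in range(iterations):
        new_scores = {}
        for n in nodes:
            rank = sum(scores[m] * w / out_weight[m]
                       for m, w in graph[n].items())
            new_scores[n] = (1 - damping) / len(nodes) + damping * rank
        scores = new_scores
    return scores


# Toy input: each string is treated as one sentence (one context).
sentences = ["cats chase mice", "dogs chase cats", "birds watch cats"]
scores = textrank(build_cooccurrence_graph(sentences))
for word in sorted(scores, key=scores.get, reverse=True):
    print(word, round(scores[word], 4))
```

Words that co-occur with many different neighbours accumulate the highest score, which is the sense of "central" used above.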

For the script to run, you need to install NLTK.


    python folder

folder - the folder containing the documents from which to extract keywords

Output: a folder 'keywords-folder-textrank' containing the keywords and their scores, one keyword per line, with the keyword and its score separated by a colon. This format can be used to generate word clouds using Wordle.
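As a rough illustration of this output format, the sketch below turns a dictionary of keyword scores into colon-separated lines. The helper name `format_keywords`, the descending sort order and the four-decimal rounding are assumptions made for the example; the actual script may order or format its scores differently.

```python
def format_keywords(scores, top_n=None):
    """Render {keyword: score} as 'keyword:score' lines, best first."""
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    if top_n is not None:
        ranked = ranked[:top_n]
    return "\n".join(f"{word}:{score:.4f}" for word, score in ranked)


# Hypothetical scores, for illustration only.
example_scores = {"street": 0.25, "wall street": 0.5, "bank": 0.125}
print(format_keywords(example_scores, top_n=2))
```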


Find the most central words from the US primary debate speeches.

    python candidates

Bernie Sanders' primary debate speeches keywords generated using Wordle: ![Sanders' keywords](

Find the most central words from the papers accepted at NLP conferences.

    python conferences

ACL 2015 titles ![ACL 2015](

EMNLP 2015 titles ![EMNLP 2015](

NAACL 2016 titles ![NAACL 2016](

ACL 2016 Short Paper titles ![ACL 2016 Short Papers](