takahe

takahe is a multi-sentence compression module. Given a set of redundant sentences, it builds a word graph by adding the sentences to it one at a time. The best compression is then obtained by finding the shortest path in the word graph. The original algorithm is described in:

  • Katja Filippova, Multi-Sentence Compression: Finding Shortest Paths in Word Graphs, Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 322-330, 2010.
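To make the idea concrete, here is a minimal sketch of a word graph and its shortest path. It is not takahe's implementation: the node-merging rule (identical lowercased word and POS tag), the `1 / count` edge weight, and every name in the block are simplifying assumptions. The minimum-length constraint and the weighting scheme from the paper are left out.

```python
import heapq
from collections import defaultdict

# Hypothetical sentinel nodes marking sentence boundaries.
START = ("-start-", "-start-", 0)
END = ("-end-", "-end-", 0)


def build_word_graph(tagged_sentences):
    """Merge (word, tag) sequences into one graph: {node: {successor: weight}}."""
    edge_counts = defaultdict(lambda: defaultdict(int))
    for sentence in tagged_sentences:
        seen = defaultdict(int)  # occurrences of each (word, tag) in this sentence
        previous = START
        for word, tag in sentence:
            key = (word.lower(), tag)
            # A repeated word within one sentence gets its own node.
            node = key + (seen[key],)
            seen[key] += 1
            edge_counts[previous][node] += 1
            previous = node
        edge_counts[previous][END] += 1
    # Assumed weighting: transitions seen in more sentences are cheaper.
    return {u: {v: 1.0 / c for v, c in succ.items()}
            for u, succ in edge_counts.items()}


def shortest_path(graph, source=START, target=END):
    """Dijkstra's algorithm; returns (cost, path) or None if unreachable."""
    queue = [(0.0, [source])]
    visited = set()
    while queue:
        cost, path = heapq.heappop(queue)
        node = path[-1]
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for successor, weight in graph.get(node, {}).items():
            if successor not in visited:
                heapq.heappush(queue, (cost + weight, path + [successor]))
    return None


# Made-up input: three redundant, POS-tagged sentences.
example_sentences = [
    [("The", "DT"), ("cat", "NN"), ("sat", "VBD"), ("on", "IN"),
     ("the", "DT"), ("mat", "NN")],
    [("The", "DT"), ("cat", "NN"), ("slept", "VBD")],
    [("The", "DT"), ("cat", "NN"), ("sat", "VBD")],
]

graph = build_word_graph(example_sentences)
cost, path = shortest_path(graph)
compression = " ".join(node[0] for node in path[1:-1])
print(compression)  # prints: the cat sat
```

Without a minimum-length constraint the shortest path tends to favour very short outputs, which is one reason a parameter such as `nb_words` is useful in practice.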

A keyphrase-based reranking method can be applied to generate more informative compressions. The reranking method is described in:

  • Florian Boudin and Emmanuel Morin, Keyphrase Extraction for N-best Reranking in Multi-Sentence Compression, Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), 2013.
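The reranking idea can be sketched as follows: a candidate's cumulative path score is divided by its length multiplied by the summed scores of the keyphrases it contains, so that compressions covering more salient keyphrases end up with a lower (better) score. This is a simplified illustration, not takahe's code. The function name, the `1.0` offset that avoids division by zero, and the keyphrase scores are all assumptions, and keyphrase extraction itself is not reproduced here.

```python
def rerank_by_keyphrases(candidates, keyphrase_scores):
    """Rerank (cumulative_score, path) pairs; lower reranked score is better.

    candidates       -- list of (cumulative_score, path), path = [(word, tag), ...]
    keyphrase_scores -- hypothetical dict mapping a keyphrase to its relevance
    """
    reranked = []
    for cumulative_score, path in candidates:
        # Pad with spaces so keyphrases only match on whole-word boundaries.
        padded = " %s " % " ".join(word for word, _ in path)
        # Sum the scores of the keyphrases found in this compression.
        # The 1.0 offset (an assumption) keeps the denominator non-zero.
        total = 1.0 + sum(score for phrase, score in keyphrase_scores.items()
                          if " %s " % phrase in padded)
        reranked.append((cumulative_score / (max(len(path), 1) * total), path))
    return sorted(reranked, key=lambda item: item[0])


# Made-up candidates and keyphrase scores for illustration.
example_candidates = [
    (2.0, [("the", "DT"), ("cat", "NN"), ("sat", "VBD"), ("down", "RB")]),
    (2.4, [("the", "DT"), ("black", "JJ"), ("cat", "NN"), ("sat", "VBD")]),
]
example_keyphrases = {"black cat": 1.0}

reranked_example = rerank_by_keyphrases(example_candidates, example_keyphrases)
```

In this example the first candidate has the better length-normalized path score (0.5 against 0.6), but the second moves to the top after reranking because it contains the keyphrase.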

Dependencies

As of today, takahe is built for Python 2.

You may need to install the following libraries:

Example

A typical usage of this module is:

import takahe
    
# Create a word graph from the set of sentences with parameters:
# - minimal number of words in the compression: 6
# - language of the input sentences: en (English)
# - POS tag for punctuation marks: PUNCT
# `sentences` is assumed to be defined beforehand.
compresser = takahe.word_graph(sentences,
                               nb_words=6,
                               lang='en',
                               punct_tag="PUNCT")

# Get the 50 best paths
candidates = compresser.get_compression(50)

# 1. Rerank compressions by path length (Filippova's method)
for cumulative_score, path in candidates:

    # Normalize path score by path length
    normalized_score = cumulative_score / len(path)

    # Print normalized score and compression
    print round(normalized_score, 3), ' '.join([u[0] for u in path])

# Write the word graph in the dot format
compresser.write_dot('test.dot')

# 2. Rerank compressions by keyphrases (Boudin and Morin's method)
reranker = takahe.keyphrase_reranker(sentences,
                                     candidates,
                                     lang='en')

reranked_candidates = reranker.rerank_nbest_compressions()

# Loop over the best reranked candidates
for score, path in reranked_candidates:

    # Print reranked score and compression
    print round(score, 3), ' '.join([u[0] for u in path])
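The first loop in the example prints every candidate with its length-normalized score but does not pick a winner. A small helper along these lines could select the best one. This is a sketch under the assumption that candidates are `(cumulative_score, path)` pairs and that a lower score is better, as with shortest paths. The helper name and the sample data are hypothetical.

```python
def best_by_normalized_score(candidates):
    """Return (normalized_score, path) with the lowest score, or None."""
    scored = [(score / float(len(path)), path)
              for score, path in candidates
              if path]  # skip empty paths to avoid dividing by zero
    if not scored:
        return None
    return min(scored, key=lambda item: item[0])


# Made-up candidates: a short path and a longer one with a lower average cost.
sample_candidates = [
    (1.8, [("storms", "NNS"), ("hit", "VBD"), ("coast", "NN")]),
    (2.8, [("heavy", "JJ"), ("storms", "NNS"), ("hit", "VBD"), ("the", "DT"),
           ("east", "JJ"), ("coast", "NN"), ("yesterday", "NN")]),
]

best_score, best_path = best_by_normalized_score(sample_candidates)
best_sentence = " ".join(word for word, _ in best_path)
```

Here the longer candidate wins because its cost per word is lower, which is the effect the length normalization is meant to have.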
