Abstract

Information retrieval in scientific digital libraries is a time-consuming and tedious task, because the indexing of scientific articles is often incomplete. Although accelerated by the uptake of open access to scientific publications, full-text indexing has not yet led to the expected improvement. Rather, exploiting full-text articles requires complex and error-prone Natural Language Processing techniques, which may degrade indexing. In previous work, these techniques are often left unstated and their impact on retrieval effectiveness remains unclear. The purpose of the TALIAS project is to re-assess and compare state-of-the-art keyphrase extraction models at increasingly sophisticated levels of document preprocessing. In doing so, we determine to what extent performance variation across keyphrase extraction systems is a function of the effectiveness of document preprocessing, and study the robustness of these systems to noisy text.

Results

  • We showed that performance variation across keyphrase extraction systems is, at least in part, a function of the (often unstated) effectiveness of document preprocessing.

  • We empirically showed that supervised models are more resilient to noise, and pointed out that the performance gap between baselines and top-performing systems narrows as preprocessing effort increases.

  • We compared the previously reported results of several keyphrase extraction models with those of our re-implementation, and observed that baseline performance is underestimated because of inconsistencies in document preprocessing.

  • We released both a new version of the SemEval-2010 dataset with preprocessed documents and our implementation of the state-of-the-art keyphrase extraction models using the pke toolkit for use by the community.
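
To make the role of preprocessing concrete, here is a toy sketch of how one kind of extraction noise (words hyphenated across line breaks) can distort the term statistics that a frequency-based baseline relies on. It is an illustration under simplifying assumptions, not the project's pipeline and not the pke API: the function names, the stopword list, and the sample text are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for the illustration.
STOPWORDS = {"the", "of", "a", "an", "and", "in", "is", "to", "for", "on", "with"}


def repair_hyphenation(text):
    """Rejoin words split across line breaks, e.g. 'extrac-\\ntion' -> 'extraction'."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def term_counts(text, repair=False):
    """Count candidate terms, optionally after repairing line-break hyphenation."""
    if repair:
        text = repair_hyphenation(text)
    tokens = re.findall(r"[a-z]+", text.lower())
    # Keep only non-stopword tokens longer than two characters.
    return Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)


# Hypothetical noisy input, as might come from naive PDF text extraction.
NOISY_DOC = (
    "Keyphrase extrac-\ntion is sensitive to noise. "
    "Keyphrase extraction models rank candidates. "
    "Extraction quality depends on preprocessing."
)

if __name__ == "__main__":
    print("raw:     ", term_counts(NOISY_DOC).most_common(3))
    print("repaired:", term_counts(NOISY_DOC, repair=True).most_common(3))
```

Without the repair step, the broken word is counted as two spurious fragments and the true term is undercounted, which is the kind of effect that can shift a baseline's ranking independently of the extraction model itself.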

Participants

Publications