EMNLP 2015 Tutorial -- Semantic Similarity Frontiers: From Concepts to Documents
David Jurgens and Mohammad Taher Pilehvar
Semantic similarity forms a central component in many NLP systems, from lexical semantics, to part of speech tagging, to social media analysis. Recent years have seen a renewed interest in developing new similarity techniques, buoyed in part by work on embeddings and by SemEval tasks in Semantic Textual Similarity and Cross-Level Semantic Similarity. The increased interest has led to hundreds of techniques for measuring semantic similarity, which makes it difficult for practitioners to identify which state-of-the-art techniques are applicable and easily integrated into projects and for researchers to identify which aspects of the problem require future research.
This tutorial synthesizes the current state of the art for measuring semantic similarity for all types of conceptual or textual pairs and presents a broad overview of current techniques, what resources they use, and the particular inputs or domains to which the methods are most applicable. We survey methods ranging from corpus-based approaches operating on massive or domains-specific corpora to those leveraging structural information from expert-based or collaboratively-constructed lexical resources. Furthermore, we review work on multiple similarity tasks from sense-based comparisons to word, sentence, and document-sized comparisons and highlight general-purpose methods capable of comparing multiple types of inputs. Where possible, we also identify techniques that have been demonstrated to successfully operate in multilingual or cross-lingual settings.
Our tutorial provides a clear overview of currently-available tools and their strengths for practitioners who need out of the box solutions and provides researchers with an understanding of the limitations of current state of the art and what open problems remain in the field. Given the breadth of available approaches, participants will also receive a detailed bibliography of approaches (including those not directly covered in the tutorial), annotated according to the approaches abilities, and pointers to when open-source implementations of the algorithms may be obtained.
Along with the tutorial slides, we have provided an extended bibliography with all the papers mentioned in the slides and many papers that we wished we had time to discuss. Our hope this that this bibliography serves as an living document capturing much of the current work in the space and we welcome contributions and suggestions for new works to add to it. We especially welcome git pull requests that directly update the .bib and .tex files, which we can use to regenerate the .pdf