Knowledge extraction framework built with extensibility and multilinguality in mind.
Java
src
.gitignore
LICENSE
README.md
bean.xml
dependency-reduced-pom.xml
licenseheader.txt
nb-configuration.xml
pom.xml

Distiller-CORE library

Distiller is a framework for extracting and inferring knowledge from text. It has its roots in DIKpE [1] and its later evolutions [4], and improves on them with multilingual support [5], entity linking [2], and concept inference. Currently, Distiller supports keyphrase extraction in Italian and English only; we plan to add support for keyphrase extraction in other languages.

The default Distiller pipeline performs keyphrase extraction; however, since the framework is built with extensibility in mind, it can be extended with pipelines for any high-level NLP task. As an example, we include a simple Sentiment Analysis module based on M. L. Jockers' Syuzhet library [3].
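
To make the idea of an extensible pipeline concrete, here is a minimal Java sketch of the general pattern: a shared document object passed through an ordered list of stages, where a new task is added by implementing one more stage. All names in it (`PipelineSketch`, `Document`, `Stage`, `Pipeline`, `WhitespaceTokenizer`, `ToySentiment`) are hypothetical and chosen for illustration; they are not Distiller's actual classes, and the toy word-list scoring is not how the Syuzhet-based module works. Please refer to the Wiki and the source code for the real API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types belong to Distiller's real API.
final class PipelineSketch {

    /** Shared state that every stage can read from and write to. */
    static final class Document {
        final String text;
        final List<String> tokens = new ArrayList<>();
        final Map<String, Integer> scores = new LinkedHashMap<>();

        Document(String text) {
            this.text = text;
        }
    }

    /** One processing step; a new task is added by implementing this. */
    interface Stage {
        void process(Document doc);
    }

    /** Runs the stages in the order in which they were added. */
    static final class Pipeline {
        private final List<Stage> stages = new ArrayList<>();

        Pipeline add(Stage stage) {
            stages.add(stage);
            return this;
        }

        Document run(String text) {
            Document doc = new Document(text);
            for (Stage stage : stages) {
                stage.process(doc);
            }
            return doc;
        }
    }

    /** Splits the text on whitespace; a stand-in for a real tokenizer. */
    static final class WhitespaceTokenizer implements Stage {
        @Override
        public void process(Document doc) {
            String trimmed = doc.text.trim();
            if (!trimmed.isEmpty()) {
                doc.tokens.addAll(Arrays.asList(trimmed.split("\\s+")));
            }
        }
    }

    /** Toy score: count of positive words minus count of negative words. */
    static final class ToySentiment implements Stage {
        private static final List<String> POSITIVE = Arrays.asList("good", "great");
        private static final List<String> NEGATIVE = Arrays.asList("bad", "poor");

        @Override
        public void process(Document doc) {
            // Relies on a tokenizer stage having filled doc.tokens earlier.
            int score = 0;
            for (String token : doc.tokens) {
                String word = token.toLowerCase();
                if (POSITIVE.contains(word)) {
                    score++;
                }
                if (NEGATIVE.contains(word)) {
                    score--;
                }
            }
            doc.scores.put("sentiment", score);
        }
    }

    public static void main(String[] args) {
        Pipeline pipeline = new Pipeline()
                .add(new WhitespaceTokenizer())
                .add(new ToySentiment());
        Document doc = pipeline.run("A good library with great docs");
        System.out.println("tokens = " + doc.tokens);
        System.out.println("scores = " + doc.scores);
    }
}
```

In this sketch, swapping `ToySentiment` for a different `Stage` changes the task without touching the `Pipeline` class, which is the kind of extensibility described above.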

Architecture

The architecture of the framework and its usage are described in "Introducing Distiller: a unifying framework for Knowledge Extraction", 1st AI*IA Workshop on Intelligent Techniques At Libraries and Archives, 2015 (download link).

How to build and use the Distiller

The Wiki contains guides on downloading, building, and using Distiller.

Distiller is currently distributed in source code form only. You can open it in your favourite IDE or compile it yourself, since it's just a simple Maven project. Once the codebase is stable enough, we will also publish Distiller to Maven Central or another Maven repository, to make it easier to use in your projects.

Please note that some features of Distiller also require R to be installed.

Acknowledgements

The "dirty work" in the library is handled mainly by three libraries:

The Italian language implementation of the Distiller is made possible by:

Citing

If you use Distiller, please cite this paper:

@inproceedings{distillerintroducing,
  title={Introducing Distiller: a unifying framework for Knowledge Extraction},
  author={Basaldella, Marco and De Nart, Dario and Tasso, Carlo},
  booktitle={Proceedings of 1st AI*IA Workshop on Intelligent Techniques At Libraries and Archives co-located with XIV Conference of the Italian Association for Artificial Intelligence (AI*IA 2015)},
  organization={Associazione Italiana per l'Intelligenza Artificiale},
  year={2015}
}

License

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

References

[1] Pudota, Nirmala, et al. "Automatic keyphrase extraction and ontology mining for content‐based tag recommendation." International Journal of Intelligent Systems 25.12 (2010): 1158-1186.

[2] Paolo Ferragina, Ugo Scaiella. "Fast and Accurate Annotation of Short Texts with Wikipedia Pages". IEEE Software 29(1): 70-75 (2012).

[3] https://github.com/mjockers/syuzhet

[4] De Nart, Dario, and Carlo Tasso. "A domain independent double layered approach to keyphrase generation." WEBIST 2014-Proceedings of the 10th International Conference on Web Information Systems and Technologies. 2014.

[5] Dante Degl'Innocenti, Dario De Nart, Carlo Tasso. "A New Multi-lingual Knowledge-base Approach to Keyphrase Extraction for the Italian Language". KDIR 2014: 78-85.