R package to perform Parts of Speech tagging and morphological tagging based on the Ripple Down Rules-based Part-Of-Speech Tagger (RDRPOS) available at https://github.com/datquocnguyen/RDRPOSTagger. RDRPOSTagger supports pre-trained POS tagging models for 45 languages.
The R package allows you to perform 3 types of tagging.
- UniversalPOS annotation, where a reduced, globally used Part of Speech tagset that is consistent across languages is used to label words. This type of tagging is available for the following languages: Ancient_Greek, Ancient_Greek-PROIEL, Arabic, Basque, Belarusian, Bulgarian, Catalan, Chinese, Coptic, Croatian, Czech, Czech-CAC, Czech-CLTT, Danish, Dutch, Dutch-LassySmall, English, English-LinES, English-ParTUT, Estonian, Finnish, Finnish-FTB, French, French-ParTUT, French-Sequoia, Galician, Galician-TreeGal, German, Gothic, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Italian-ParTUT, Japanese, Korean, Latin, Latin-ITTB, Latin-PROIEL, Latvian, Lithuanian, Norwegian-Bokmaal, Norwegian-Nynorsk, Old_Church_Slavonic, Persian, Polish, Portuguese, Portuguese-BR, Romanian, Russian, Russian-SynTagRus, Slovak, Slovenian, Slovenian-SST, Spanish, Spanish-AnCora, Swedish, Swedish-LinES, Tamil, Turkish, Urdu, Vietnamese.
- POS for doing Parts of Speech annotation based on an extended language/treebank-specific POS tagset. This type of tagging is available for the following languages: English, French, German, Hindi, Italian, Thai, Vietnamese.
- MORPH with very detailed morphological annotation. This type of tagging is available for the following languages: Bulgarian, Czech, Dutch, French, German, Portuguese, Spanish, Swedish
This is based on corpora collected and made available at http://universaldependencies.org.
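To check which of the three annotation types exist for a given language, a small helper along these lines could be used. This is a minimal sketch, not part of the package: `annotations_for` is a hypothetical helper, and the `models` list is a hand-built stand-in that only assumes each annotation type exposes a `language` element, as the `models$MORPH$language` calls in the examples further down suggest. With the package installed, the result of `rdr_available_models()` would presumably take its place. The UniversalPOS entry is deliberately cut down to a few languages.

```r
# Hand-built stand-in for the result of rdr_available_models().
# Assumption: each annotation type exposes a `language` element.
models <- list(
  MORPH = list(language = c("Bulgarian", "Czech", "Dutch", "French",
                            "German", "Portuguese", "Spanish", "Swedish")),
  POS = list(language = c("English", "French", "German", "Hindi",
                          "Italian", "Thai", "Vietnamese")),
  # Only a few of the UniversalPOS languages, to keep the sketch short
  UniversalPOS = list(language = c("Dutch", "English", "French", "German"))
)

# Hypothetical helper: names of the annotation types that list `language`
annotations_for <- function(models, language) {
  has_language <- vapply(models, function(m) language %in% m$language, logical(1))
  names(models)[has_language]
}

annotations_for(models, "Dutch")
annotations_for(models, "Thai")
```

A language that appears in none of the lists yields an empty character vector, which can be tested before calling `rdr_model`.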
Examples on Parts of Speech tagging
The following shows how to use the package:
library(RDRPOSTagger)
models <- rdr_available_models()
models$MORPH$language
models$POS$language
models$UniversalPOS$language

x <- c("Oleg Borisovich Kulik is a Ukrainian-born Russian performance artist")
tagger <- rdr_model(language = "English", annotation = "POS")
rdr_pos(tagger, x = x)

x <- c("Dus godvermehoeren met pus in alle puisten, zei die schele van Van Bukburg.",
       "Er was toen dat liedje van tietenkonttieten kont tieten kontkontkont",
       " ", "", NA)
tagger <- rdr_model(language = "Dutch", annotation = "MORPH")
rdr_pos(tagger, x = x)
tagger <- rdr_model(language = "Dutch", annotation = "UniversalPOS")
rdr_pos(tagger, x = x)
The output of the POS tagging shows the following elements:
doc_id  token_id  token             pos
d1      1         Dus               ADV
d1      2         godvermehoeren    VERB
d1      3         met               ADP
d1      4         pus               NOUN
d1      5         in                ADP
d1      6         alle              PRON
d1      7         puisten           NOUN
d1      8         ,                 PUNCT
d1      9         zei               VERB
d1      10        die               PRON
d1      11        schele            ADJ
d1      12        van               ADP
d1      13        Van               PROPN
d1      14        Bukburg           PROPN
d1      15        .                 PUNCT
d2      1         Er                ADV
d2      2         was               AUX
d2      3         toen              SCONJ
d2      4         dat               SCONJ
d2      5         liedje            NOUN
d2      6         van               ADP
d2      7         tietenkonttieten  VERB
d2      8         kont              PROPN
d2      9         tieten            VERB
d2      10        kontkontkont      PROPN
d3      0         <NA>              <NA>
d4      0         <NA>              <NA>
d5      0         <NA>              <NA>
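The output can be post-processed with base R. The sketch below rebuilds a few rows of the output shown above by hand, assuming the result behaves like a data.frame with the columns `doc_id`, `token_id`, `token` and `pos`. It then drops the placeholder rows that stand for empty or missing documents and counts tags per document. The names `tagged`, `tokens` and `tag_counts` are illustrative only.

```r
# A few rows copied by hand from the output above (not a real tagger call)
tagged <- data.frame(
  doc_id   = c("d1", "d1", "d1", "d2", "d2", "d3"),
  token_id = c(1L, 2L, 3L, 1L, 2L, 0L),
  token    = c("Dus", "godvermehoeren", "met", "Er", "was", NA),
  pos      = c("ADV", "VERB", "ADP", "ADV", "AUX", NA),
  stringsAsFactors = FALSE
)

# Empty or missing input documents appear as a single row with an NA token
tokens <- tagged[!is.na(tagged$token), ]

# Cross-tabulate documents against POS tags
tag_counts <- table(tokens$doc_id, tokens$pos)
tag_counts
```

Filtering on `is.na(token)` rather than on `token_id == 0` is a design choice made for this sketch; either condition matches the placeholder rows in the output above.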
More information about the model and the tagging can be found at https://github.com/datquocnguyen/RDRPOSTagger
The general architecture and experimental results of RDRPOSTagger can be found in the following papers:
Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham and Son Bao Pham. RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2014, pp. 17-20, 2014.
Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham and Son Bao Pham. A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-Of-Speech Tagging. AI Communications (AICom), vol. 29, no. 3, pp. 409-422, 2016.
Installation can easily be done as follows.
install.packages("rJava")
install.packages("data.table")
install.packages("RDRPOSTagger", repos = "http://www.datatailor.be/rcube", type = "source")
Or with devtools
devtools::install_github("bnosac/RDRPOSTagger", build_vignettes = TRUE)
More details can be found in the package documentation and the package vignette:
vignette("rdrpostagger-overview", package = "RDRPOSTagger")
The package is licensed under the GPL-3 license as described at http://www.gnu.org/licenses/gpl-3.0.html.
Support in text mining
Need support in text mining? Contact BNOSAC: http://www.bnosac.be