The R function treetag.fertilizer calls a local installation of TreeTagger (Schmid 1994) and identifies sentences in the tagged corpus. The output is a data frame with one token per row and the annotations (part-of-speech tag, lemma, sentence number) in columns.
treetag.fertilizer(pathToTreeTagger, pathToCorpus, language, sentence_delim = c("!", "?", ".", ":", ";"))
pathToTreeTagger | Character string specifying the path to the local installation of TreeTagger, including the trailing slash, e.g., "/home/user/TreeTagger/" |
pathToCorpus | Character string specifying the path to the corpus files, e.g., "./data/" |
language | Character string naming the language of the corpus, e.g., "english" or "german" |
sentence_delim | Character vector of tokens that mark the end of a sentence; defaults to c("!", "?", ".", ":", ";") |
Note that the function calls TreeTagger with its default configuration and offers no options to change it. The language argument selects the language-specific tagging script that TreeTagger uses to tag the corpus. The language label must match the last part of the respective file name (e.g., "german" for the script tree-tagger-german, which tags German-language data). By default, the function treats the tokens ! ? . : ; as sentence delimiters. To change this, pass a different character vector as sentence_delim.
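The function builds the script path by pasting the installation path, "cmd/", and "tree-tagger-" plus the language label together. The following minimal sketch repeats that construction outside the function so that the path can be checked before tagging; the installation path and language label are hypothetical values chosen for illustration.

```r
# Hypothetical installation path and language label (assumptions for illustration)
pathToTreeTagger <- "/home/user/TreeTagger/"
language <- "german"

# Same construction the function uses: <path>cmd/tree-tagger-<language>
script <- paste(pathToTreeTagger, "cmd/", "tree-tagger-", language, sep = "")
print(script)

# Report early if no such script exists on this machine
if (!file.exists(script)) {
  message("No tagging script found at: ", script)
}
```

Because the pieces are joined without a separator, leaving out the trailing slash of the installation path would yield a path such as "/home/user/TreeTaggercmd/tree-tagger-german".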
The function is similar to the function treetag in the koRpus package (Michalke 2018) but much faster. TreeTagger itself is very quick, but sentence identification with treetag is slow and gets slower as the corpus grows. treetag.fertilizer speeds up sentence identification and is much less affected by corpus size (see Figure below), while using about the simplest approach imaginable (see R code below).
treetag.fertilizer <- function(pathToTreeTagger, pathToCorpus, language,
                               sentence_delim = c("!", "?", ".", ":", ";")) {
  # path to the language-specific TreeTagger script, e.g. <path>cmd/tree-tagger-german
  treetagger <- paste(pathToTreeTagger, "cmd/", "tree-tagger-", language, sep = "")
  # generate system command
  systemCmd <- paste(treetagger, pathToCorpus)
  # launch TreeTagger and split its tab-separated output into columns
  tagged <- strsplit(system(systemCmd, intern = TRUE), "\t")
  corpus <- as.data.frame(do.call(rbind, tagged), stringsAsFactors = FALSE)
  # set column names
  colnames(corpus) <- c("TOKEN", "POS", "LEMMA")
  # if the corpus does not end on a sentence delimiter...
  if (!(corpus$TOKEN[nrow(corpus)] %in% sentence_delim)) {
    # ...append a full stop so that the last sentence is closed
    # ("$." is the STTS tag for sentence-final punctuation)
    corpus <- rbind(corpus, c(".", "$.", "."))
  }
  # row positions of the sentence boundaries
  x <- which(corpus$TOKEN %in% sentence_delim)
  # position of the preceding boundary for each sentence (0 for the first)
  y <- c(0, x[-length(x)])
  # length of each sentence in tokens
  z <- x - y
  # repeat each sentence number once per token in that sentence
  corpus$SENTENCE <- rep(seq_along(z), z)
  return(corpus)
}
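The sentence-numbering step of the function above can be traced without TreeTagger. The sketch below applies the same boundary arithmetic to a small hand-made token column; the tokens are invented for illustration and stand in for real TreeTagger output, and diff() is used here as a compact equivalent of subtracting the shifted boundary vector.

```r
# Toy token column standing in for TreeTagger output (invented for illustration)
corpus <- data.frame(
  TOKEN = c("This", "works", ".", "Does", "it", "?", "Yes", "!"),
  stringsAsFactors = FALSE
)
sentence_delim <- c("!", "?", ".", ":", ";")

# Row positions of the sentence-final tokens
ends <- which(corpus$TOKEN %in% sentence_delim)          # 3 6 8
# Sentence lengths: distance between consecutive boundaries
sent_len <- diff(c(0, ends))                             # 3 3 2
# Repeat each sentence number once per token in that sentence
corpus$SENTENCE <- rep(seq_along(sent_len), sent_len)    # 1 1 1 2 2 2 3 3
print(corpus)
```

The whole column is produced by a few vectorised calls with no loop over tokens, which is presumably why the approach scales well with corpus size.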
Daniel Jach <danieljach@protonmail.com>
© Daniel Jach, University of Zhengzhou, China
Licensed under the MIT License.
Michalke, Meik. 2018. koRpus: An R Package for Text Analysis. https://reaktanz.de/?c=hacking&s=koRpus.
Schmid, Helmut. 1994. “Probabilistic Part-of-Speech Tagging Using Decision Trees.” In Proceedings of International Conference on New Methods in Language Processing. Manchester, England. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/.