daniel-jach/treetag-fertilizer

R function treetag.fertilizer

Description

The R function treetag.fertilizer calls a local installation of TreeTagger (Schmid 1994) and identifies sentences in the parsed corpus. The output is a dataframe with tokens in rows and annotations in columns.

Usage

treetag.fertilizer(pathToTreeTagger, pathToCorpus, language, sentence_delim = c("!", "?", ".", ":", ";"))

Arguments

pathToTreeTagger Character string specifying the path to the local installation of TreeTagger, including the trailing slash, e.g., "/home/user/TreeTagger/"
pathToCorpus Character string specifying the path to the corpus files, e.g., "./data/"
language Character string naming the language of the corpus, e.g., "english" or "german"
sentence_delim Character vector with tokens specifying sentence delimiters

Details

Note that the function calls TreeTagger with its default configuration and offers no options to change it. The language argument selects the language-specific tagging script that TreeTagger uses to tag the corpus. Its value must match the last part of the script's file name (e.g., "german" for tree-tagger-german, used for German-language data). By default, the function treats !, ?, ., :, and ; as sentence delimiters. To change them, pass a character vector with different elements as sentence_delim.
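As a rough sketch of how the language value maps to a script, the snippet below builds the script path the same way the function body does and only calls the function if that script is actually present. The install location and the corpus file name "./data/corpus.txt" are hypothetical placeholders, not paths shipped with this repository.

```r
# Hypothetical install location; the trailing slash matters because the
# path pieces are concatenated without a separator.
pathToTreeTagger <- "/home/user/TreeTagger/"
language <- "german"

# Same construction as in the function body
script <- paste(pathToTreeTagger, "cmd/", "tree-tagger-", language, sep = "")
print(script)

# Only call the function if the script exists and the function is loaded.
# "./data/corpus.txt" is a made-up corpus file used for illustration.
if (file.exists(script) && exists("treetag.fertilizer")) {
  tagged <- treetag.fertilizer(pathToTreeTagger, "./data/corpus.txt", language,
                               sentence_delim = c("!", "?", "."))
}
```

Here sentence_delim drops ":" and ";" so that only !, ?, and . close a sentence.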

Comparison to treetag in the koRpus package

The function is similar to treetag in the koRpus package (Michalke 2018) but much faster. TreeTagger itself is very quick, but sentence identification with treetag is slow and gets slower as the corpus grows. treetag.fertilizer speeds up sentence identification and is much less affected by corpus size (see Figure below), while using about the simplest approach imaginable (see R code below).
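To make the approach concrete, here is a small worked example on made-up tokens (not real TreeTagger output). It finds the rows holding sentence delimiters, derives sentence lengths from the gaps between them, and repeats each sentence number once per token. It uses diff() as an equivalent shortcut for the subtraction step, so treat it as an illustration rather than the function itself.

```r
# Made-up tokens standing in for the TOKEN column of TreeTagger output
corpus <- data.frame(
  TOKEN = c("This", "works", ".", "Does", "it", "?", "Yes", "!"),
  stringsAsFactors = FALSE
)
sentence_delim <- c("!", "?", ".", ":", ";")

# Row positions of sentence-final tokens
ends <- which(corpus$TOKEN %in% sentence_delim)

# Sentence lengths: distance from each boundary to the previous one
lens <- diff(c(0, ends))

# One sentence number per token, with no loop over tokens
corpus$SENTENCE <- rep(seq_along(lens), lens)
print(corpus)
```

Every step is a single vectorized operation over the whole token column.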

R code

treetag.fertilizer<-function(pathToTreeTagger, pathToCorpus, language, sentence_delim = c("!", "?", ".", ":", ";")){
  
  # path to language-specific treetagger executable
  treetagger<-paste(pathToTreeTagger, "cmd/", "tree-tagger-", language, sep = "") 
  
  # generate system command
  systemCmd<-paste(treetagger, pathToCorpus) 
  
  # launch treetagger 
  corpus<-as.data.frame(do.call(rbind, strsplit(system(systemCmd, intern = TRUE), "\t")), stringsAsFactors = FALSE) 
  
  # set column names
  colnames(corpus)<-c("TOKEN", "POS", "LEMMA") 
  
  # if the corpus does not end with a sentence delimiter...
  if(!(corpus$TOKEN[nrow(corpus)] %in% sentence_delim)){ 
    corpus<-rbind(corpus, c(".", "$.", ".")) # ...add full stop at end of corpus file
  }
  # find position of sentence boundaries
  x<-which(corpus$TOKEN %in% sentence_delim) 
  
  # positions of the preceding sentence boundaries (0 before the first sentence)
  y<-x[-length(x)] 
  y<-append(0,y)
  
  # length of each sentence
  z<-x-y
  
  # repeat each sentence number once per token in that sentence
  vc<-rep(seq_along(z), z) 
  
  # assign vector to tagged corpus
  corpus$SENTENCE<-vc 
  
  return(corpus)
}
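Once the SENTENCE column is in place, per-sentence summaries are a matter of grouping on it. The sketch below uses a hand-made data frame with the same four columns as the function's output. The tokens and POS tags are illustrative only and were not produced by TreeTagger.

```r
# Hand-made stand-in for the data frame returned by treetag.fertilizer
tagged <- data.frame(
  TOKEN    = c("It", "rains", ".", "Why", "?"),
  POS      = c("PP", "VVZ", "SENT", "WRB", "SENT"),
  LEMMA    = c("it", "rain", ".", "why", "?"),
  SENTENCE = c(1, 1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Number of tokens per sentence (delimiters included)
tokens_per_sentence <- table(tagged$SENTENCE)
print(tokens_per_sentence)

# Rebuild each sentence as one space-separated string
sentences <- tapply(tagged$TOKEN, tagged$SENTENCE, paste, collapse = " ")
print(sentences)
```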

Author

Daniel Jach <danieljach@protonmail.com>

License and Copyright

© Daniel Jach, University of Zhengzhou, China

Licensed under the MIT License.

References

Michalke, Meik. 2018. KoRpus: An R Package for Text Analysis. https://reaktanz.de/?c=hacking&s=koRpus.

Schmid, Helmut. 1994. “Probabilistic Part-of-Speech Tagging Using Decision Trees.” In Proceedings of International Conference on New Methods in Language Processing. Manchester, England. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/.
